Monet/RDF: Self-organizing Structured RDF store
- What? Efficient RDF stores.
The Resource Description Framework (RDF) is becoming an unrivaled standard behind the vision of global data standardization (LOD cloud, http://linkeddata.org), and is widely accepted in many domains such as the life sciences. It has been used as the main data model for the Semantic Web and Linked Open Data, giving users great flexibility to represent and evolve data without the need for a prior schema. This flexibility, however, poses challenges for implementing efficient RDF stores: i) it leads to query plans with many self-joins on triple tables, ii) it blocks the use of advanced relational physical storage optimizations such as clustered indexes and data partitioning, and iii) the lack of a schema sometimes makes it hard for users to comprehend the data and formulate queries.
<Motivation example: Use a graph with webcrawl data to show that 85\% of the dataset can be covered by fewer than 1000 tables>
Goals & contributions:
Our goal is to automatically recover a large part of the implicit class structure underlying data stored in RDF triples, and to represent it as a compact and understandable UML class diagram. Here "compact" means that the number of classes or tables is not large, i.e., fewer than 1000. The relationships between tables need to be semantically and structurally understandable to users. In addition, the name assigned to each table (class) and to each column (property) must reasonably convey what the data instances of that table or column contain. Based on the discovered UML class diagram, the structured data will then be stored efficiently in a physical relational storage.
1) We present methods for detecting basic class structure and the relationships between them in RDF data, and propose several approaches to semantically and structurally optimize the detected structure to make it compact.
2) We propose heuristic approaches to assign human-readable and human-understandable names to tables/columns and their FK relationships.
3) We design a hybrid physical storage combining a triple store table and relational tables that can efficiently store RDF data.
4) We conduct experiments on DBpedia, the largest real RDF dataset, and on the webcrawl dataset, which is considered to be the dirtiest RDF data, in order to evaluate the compactness and effectiveness of our self-organizing structured RDF store.
II. Basic CS's & Relationships exploration
- Describe the algorithms for detecting frequent CSs, highly referenced CSs, and their property data types, and for detecting the relationships between them.
Use the example data to illustrate the algorithms.
- Discuss dimension tables, and show the algorithm for detecting a dimension table by computing the indirect references to its CS.
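The detection steps listed above can be sketched as follows. This is a minimal illustration, not the algorithm of the paper: it assumes that "CS" stands for characteristic set, i.e. the set of properties that occur with a subject, and that a relationship exists whenever the object of a triple is itself a subject. The toy triples, the function names, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter, defaultdict

# Toy (subject, property, object) triples; every identifier is made up.
TRIPLES = [
    ("s1", "name", "Alice"), ("s1", "worksFor", "o1"),
    ("s2", "name", "Bob"), ("s2", "worksFor", "o1"),
    ("s3", "name", "Carol"), ("s3", "worksFor", "o2"),
    ("o1", "label", "ACME"), ("o2", "label", "Initech"),
    ("x1", "comment", "stray"),
]

def characteristic_sets(triples):
    """Map each subject to the frozenset of properties it occurs with."""
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    return {s: frozenset(ps) for s, ps in props.items()}

def frequent_css(subject_cs, min_support):
    """Keep the CSs shared by at least min_support subjects (assumed threshold)."""
    counts = Counter(subject_cs.values())
    return {cs: n for cs, n in counts.items() if n >= min_support}

def cs_relationships(triples, subject_cs):
    """Count (source CS, property, target CS) references.

    A reference is counted when the object of a triple is itself a subject.
    """
    refs = Counter()
    for s, p, o in triples:
        if o in subject_cs:
            refs[(subject_cs[s], p, subject_cs[o])] += 1
    return refs

subject_cs = characteristic_sets(TRIPLES)
frequent = frequent_css(subject_cs, min_support=2)
refs = cs_relationships(TRIPLES, subject_cs)
```

In this toy data the three "person-like" subjects share one CS and the two "organization-like" subjects share another, while the single subject with only a `comment` property falls below the support threshold. Summing the incoming counts in `refs` per target CS would give a simple way to spot infrequent but highly referenced CSs.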
Figures & Statistics to show:
<The performance metrics (e.g., compactness = # of tables / # of triples or # of columns / # of triples, compression_ratio = database size / dataset size) may need to be defined here before we show any statistics>
- Show figures of the number of base CS's and their cumulative coverage for the webcrawl dataset and the DBpedia dataset.
(It may surprise readers that, in the basic structure exploration, the DBpedia dataset contains many more "classes" than the webcrawl dataset.)
- Show the number of frequent CSs and the number of infrequent but highly referenced CSs.
To show how large the schema will be with basic structure:
+ Number of CS's (==\> number of tables) needed to cover 80\% of each data set.
+ Total number of properties ==\> Number of columns
+ Max/Min number of properties per CS (or draw a graph showing the number of CSs per number of properties)
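The metrics and the coverage count mentioned above could be computed along these lines. This is a sketch under assumptions: the notes do not fix whether coverage is measured in triples or in subjects, so triples are assumed here, and the function names are hypothetical.

```python
def compactness(num_tables, num_triples):
    """Tables (or columns) per triple; smaller means a more compact schema."""
    return num_tables / num_triples

def compression_ratio(database_size, dataset_size):
    """Stored database size relative to the raw dataset size."""
    return database_size / dataset_size

def tables_needed(cs_triple_counts, total_triples, target=0.8):
    """Smallest number of CSs, taken largest first, that cover the target share.

    cs_triple_counts holds the number of triples covered by each CS.
    Returns None when the target share cannot be reached.
    """
    covered = 0
    for k, n in enumerate(sorted(cs_triple_counts, reverse=True), start=1):
        covered += n
        if covered >= target * total_triples:
            return k
    return None
```

For example, with per-CS counts `[50, 35, 10, 5]` out of 100 triples, two CSs are enough to pass the 80\% mark, so the basic schema would need two tables.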
To show the diversity of the literal types for each property as well as how messy the dataset is.
+ Avg number of data types per property
+ Avg number of subCSs (type-specific CSs with a specific data type for each of their properties) per CS.
III. Labeling base CS's
- Describe 3 heuristics for labeling CS's
+ Using a real ontology
+ Using the available rdf:type property
+ Using the foreign key relationships between CS's
Use the example dataset to illustrate each heuristic.
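As an illustration of the rdf:type heuristic only, the sketch below picks the type value carried by the largest share of a CS's subjects. The notes do not specify how a label is actually derived from rdf:type, so the majority rule, the `min_share` threshold, and the function name are assumptions.

```python
from collections import Counter

def label_from_types(cs_subjects, subject_types, min_share=0.5):
    """Return the rdf:type value found on the largest share of a CS's subjects.

    cs_subjects: subjects belonging to one CS.
    subject_types: dict mapping a subject to its rdf:type values.
    Returns None when no type reaches min_share (an assumed threshold).
    """
    counts = Counter()
    for s in cs_subjects:
        # Count each type once per subject, even if it is repeated.
        for t in set(subject_types.get(s, ())):
            counts[t] += 1
    if not counts:
        return None
    best, n = counts.most_common(1)[0]
    return best if n / len(cs_subjects) >= min_share else None
```

A CS whose subjects carry no rdf:type value at all gets no label from this heuristic, which is where the ontology-based and foreign-key-based heuristics would have to take over.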
Statistics to show:
- Show statistics on the number of the labels assigned in each experimental dataset: Number of CSs that have labels, number of CSs that do not have any label.
- Show how many labels are assigned by using each heuristic approach [NOT AVAILABLE]
IV. Filtering/Optimizing the basic discovered structure
IVa. Describe merging rules and use example data to illustrate each rule
Rule1. Merge two CS's being assigned the same label
Rule2. Merge CS's with subset-superset relationship
Rule3. Merge CS's referenced by the same CS via a particular property
Rule4. Merge CS's if their first common ancestor (in the ontology hierarchy information) is not too general. (Define the "generality" score for each element in the ontology hierarchy to identify whether that element is general or not)
Rule5. Merge CS's that have tf-idf similarity score higher than a threshold.
Describe how each label is chosen for the merged CS after applying each rule (Linnea?)
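Rule2 and Rule5 lend themselves to a short sketch. The notes do not define the tf-idf weighting, so the version below is an assumption: each property has term frequency 1 when present, a smoothed idf of log(1 + N/df), and similarity is the cosine between the weighted property vectors. All function names are hypothetical.

```python
import math
from collections import Counter

def is_subset_merge(cs_a, cs_b):
    """Rule2 check: one property set is contained in the other."""
    return cs_a <= cs_b or cs_b <= cs_a

def property_idf(css):
    """Smoothed idf per property over a list of CSs (assumed weighting)."""
    n = len(css)
    df = Counter(p for cs in css for p in cs)
    return {p: math.log(1 + n / df[p]) for p in df}

def tfidf_similarity(cs_a, cs_b, idf):
    """Cosine similarity of two CSs under tf = 1 and the given idf weights."""
    dot = sum(idf[p] ** 2 for p in cs_a & cs_b)
    norm_a = math.sqrt(sum(idf[p] ** 2 for p in cs_a))
    norm_b = math.sqrt(sum(idf[p] ** 2 for p in cs_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy CSs; property names are made up.
CSS = [
    frozenset({"name", "email", "phone"}),
    frozenset({"name", "email", "fax"}),
    frozenset({"title", "isbn"}),
]
IDF = property_idf(CSS)
```

Under Rule5, two CSs would be merged when `tfidf_similarity` exceeds the chosen threshold. Rare properties get a higher weight, so sharing a distinctive property counts for more than sharing a ubiquitous one.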
IVb. Merging Performance Evaluation
- Show number of CSs, number of columns remaining after each rule.
(Unsolved PROBLEM: Max number of columns in a CS is too large)
- Remove columns with many NULLs (infrequent properties).
- Remove rows with many NULLs to obtain non-nullable columns and keys.
- Remove small tables.
- Remove infrequent relationships between CS's
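The first filtering step above, dropping columns that are mostly NULL, might look like the sketch below. The `max_null_share` threshold and the row representation (one dict per subject) are assumptions; the notes do not state the cut-off actually used.

```python
def split_sparse_columns(rows, max_null_share=0.8):
    """Split the columns of one table into (kept, dropped) lists.

    rows: list of dicts mapping column name to value; a missing key or
    None counts as NULL. A column is dropped when its share of NULLs
    exceeds max_null_share (an assumed threshold).
    """
    if not rows:
        return [], []
    columns = sorted({c for r in rows for c in r})
    kept, dropped = [], []
    for c in columns:
        nulls = sum(1 for r in rows if r.get(c) is None)
        (dropped if nulls / len(rows) > max_null_share else kept).append(c)
    return kept, dropped
```

Given the hybrid design in contribution 3, the values of dropped columns could stay in the triple table rather than being lost, although that routing is an assumption and is not spelled out in these notes.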
V. UML class diagram
- Use multiplicity (e.g., 1:0/1, 1:n, n:m) to describe the relationships between CS's, and the possibility of missing or multi-valued properties.
- Describe and show a figure of class diagram [NOT COMPLETED].
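One way the multiplicity of a relationship could be derived from the data is sketched below. The classification rule is an assumption, the "1:1" case is added for completeness and is not in the list above, and "1:n" is reported regardless of direction.

```python
from collections import Counter

def relationship_multiplicity(source_subjects, links):
    """Classify one property linking a source CS to a target CS.

    source_subjects: all subjects of the source CS.
    links: (source, target) pairs observed for the property.
    """
    links = list(links)
    out_deg = Counter(s for s, _ in links)  # targets per source
    in_deg = Counter(t for _, t in links)   # sources per target
    many_targets = any(n > 1 for n in out_deg.values())
    many_sources = any(n > 1 for n in in_deg.values())
    if many_targets and many_sources:
        return "n:m"
    if many_targets or many_sources:
        return "1:n"
    # At most one link per subject: optional if some subject has none.
    optional = any(s not in out_deg for s in source_subjects)
    return "1:0/1" if optional else "1:1"
```

A property classified as "1:0/1" corresponds to a nullable column, while a source with several targets marks the property as multi-valued.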
VI. Physical Relational Storage
- How to handle multi-data-type properties
- How to handle multi-valued (MV) columns and the type diversity of each MV property?
- How to handle FK relationships
Illustrate the relational schema with example data.
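A possible shape for such a hybrid schema is sketched below using Python's standard `sqlite3` module and an in-memory database. The table and column names are hypothetical, and the layout (one table per class, a side table per multi-valued property, a generic triple table for leftovers) is only one reading of the design described in these notes, not the actual storage layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organization (id TEXT PRIMARY KEY, label TEXT);
CREATE TABLE person (
    id TEXT PRIMARY KEY,
    name TEXT,
    works_for TEXT REFERENCES organization(id)  -- FK relationship
);
-- Multi-valued property: one row per (subject, value) pair.
CREATE TABLE person_email (
    person_id TEXT REFERENCES person(id),
    email TEXT
);
-- Leftover triples that fit none of the relational tables.
CREATE TABLE triples (s TEXT, p TEXT, o TEXT);
""")
conn.execute("INSERT INTO organization VALUES ('o1', 'ACME')")
conn.execute("INSERT INTO person VALUES ('s1', 'Alice', 'o1')")
conn.executemany(
    "INSERT INTO person_email VALUES (?, ?)",
    [("s1", "a@example.org"), ("s1", "alice@example.org")],
)
conn.execute("INSERT INTO triples VALUES ('x1', 'comment', 'stray')")

# One join per relationship replaces the self-joins on a triple table.
rows = conn.execute("""
SELECT p.name, o.label, COUNT(e.email)
FROM person p
JOIN organization o ON o.id = p.works_for
LEFT JOIN person_email e ON e.person_id = p.id
GROUP BY p.id
""").fetchall()
```

Keeping multi-valued properties in a side table leaves the main table with one row per subject, at the cost of an extra join when those values are needed.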
- Show user evaluation results on the labels assigned to tables [The current quality is not good!!]
- Show statistics about the final relational schema:
+ Number of tables (and number of tables for multi-valued properties, number of tables for non-default type properties).
+ Number of columns
+ Number of Multi-valued columns.
+ Min/Max number of columns
+ Average number of types per property
+ Number of FK relationships
+ Percentage of data stored in relational tables
- Show the size of the tables and the database
+ Average table size
+ Min/Max table size
+ Database size
(and more ...???)