MonetRDFSPARQL RelatedWork

From MonetDB

RDF storage schemes

RDF storage schemes can be classified into three categories: 1) triple store, 2) vertical partitioning, and 3) property tables. In the triple store scheme, all RDF triples are stored in one large three-column table. The vertical partitioning approach (also called the decomposed storage schema) partitions the RDF triples by predicate (each triple consists of a subject, a predicate, and an object) and stores all triples that share a predicate in one binary table. Each tuple in this binary table consists of the subject and object values of an RDF triple. Both approaches lead to many joins in the query plan.
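The difference between the first two schemes can be illustrated with a minimal Python sketch. The sample triples and the helper names `vertical_partition` and `join_on_subject` are hypothetical and chosen only for illustration; they do not come from any of the cited systems.

```python
from collections import defaultdict

# Hypothetical example triples (subject, predicate, object).
triples = [
    ("s1", "name", "Alice"),
    ("s1", "age", "30"),
    ("s2", "name", "Bob"),
    ("s2", "knows", "s1"),
]

# Triple store: every triple lives in one three-column table.
triple_table = list(triples)


def vertical_partition(triples):
    """Build one binary (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def join_on_subject(left, right):
    """Join two binary tables on their subject column."""
    right_by_subject = defaultdict(list)
    for s, o in right:
        right_by_subject[s].append(o)
    return [(s, lo, ro) for s, lo in left for ro in right_by_subject.get(s, [])]


vp_tables = vertical_partition(triples)

# Asking for the name and age of each subject already needs one join.
name_age = join_on_subject(vp_tables["name"], vp_tables["age"])
```

Every additional predicate in a query adds another such join, which is the cost that the property table approach tries to avoid.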

To reduce the number of join operations, the property table approach has been used in many RDF stores \cite{}. A property table is a multi-column table in which each attribute corresponds to a predicate of the RDF triples. In this approach, each row of a property table gathers the triples that have the same subject.
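A property table can be sketched in the same way. The function `build_property_table` is a hypothetical helper, and the sketch assumes single-valued predicates (a later value for the same subject and predicate overwrites an earlier one); missing values are stored as `None`.

```python
# Hypothetical example triples (subject, predicate, object).
triples = [
    ("s1", "name", "Alice"),
    ("s1", "age", "30"),
    ("s2", "name", "Bob"),
    ("s2", "knows", "s1"),
]


def build_property_table(triples, predicates):
    """Return {subject: {predicate: object or None}} for the chosen predicates.

    Assumption: each predicate is single-valued per subject.
    """
    rows = {}
    for s, p, o in triples:
        if p in predicates:
            row = rows.setdefault(s, {col: None for col in predicates})
            row[p] = o
    return rows


table = build_property_table(triples, ["name", "age"])
```

Reading the name and age of a subject is now a single row lookup instead of a join, at the price of NULL cells for subjects that lack a predicate (here, the `age` of `s2`).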

Even though the triple store and vertical partitioning schemes suffer from a large number of joins, Abadi et al.\cite{abadi2007scalable} showed that the vertical partitioning approach in a column-store database system outperforms systems that use property tables. However, this conclusion was re-examined by Sidirourgos et al.\cite{sidirourgos2008column}, who showed with different data sets that it does not always hold.

Thus, using a column-store can be a better solution. What if we combine the column-store and property tables?

Schema Detection/Summarization

There have been several efforts to discover the schema for property tables.

A decade ago, Alexaki et al.\cite{alexaki2001ics} exploited the classes and properties defined in an RDF/S schema (e.g., rdf:Property, rdfs:Class, rdfs:subClassOf, ...) to map data represented in the RDF/S model into an ORDBMS with four tables, namely Class, Property, Subclass, and SubProperty. This approach, however, depends heavily on the availability of an RDF/S schema in the input dataset and cannot be applied to recognize the schema of unstructured or web-crawled RDF data sets.

One of the early open-source systems, Jena\cite{wilkinson2006jena}, loads RDF triples into Jena property tables that look like conventional application database tables in the relational model. However, this approach is not flexible, since the physical design requires tuning by an application expert. It is therefore not feasible for storing heterogeneous RDF data sets whose schema is too difficult for a human to discover, such as web-crawl data or a scientific dataset with thousands of different predicates.

In addition to a classical triple store, Oracle\cite{chong2005efficient} uses materialized join views (called subject-property matrices) to create property tables and uses them as an auxiliary storage structure to speed up query processing. However, the performance of the system depends heavily on the right choice of materialized join views, which cannot be determined before running the database services but relies on users' demand and the query workload.

Recently, Levandoski et al.\cite{levandoski2009rdf} provided an automated method for building a tailored schema (i.e., the schema that balances the ...) for RDF triples in two steps.

Clustering: Find groups (clusters) of properties that appear together with the same subject and satisfy a predefined support threshold and null threshold. The output of this step contains: 1) properties that are not found in any generated cluster, 2) property clusters that do not share any property with other clusters, and 3) the remaining clusters, which belong to neither 1) nor 2).

Partitioning: Take the set of clusters in 3) and partition it into a set of non-overlapping clusters. To do this, they use a greedy algorithm that tries to keep the cluster with the highest support while pruning lower-support clusters that contain overlapping properties.
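The two steps can be sketched roughly as follows. This is a simplified illustration, not the algorithm of the cited paper: it assumes that the support of a cluster is the fraction of subjects that carry all of its properties, it ignores the null threshold, and it reads "pruning" as removing already-assigned properties from lower-support clusters. The names `support` and `greedy_partition` and the sample data are hypothetical.

```python
def support(cluster, subject_props):
    """Fraction of subjects whose property set contains every property in cluster."""
    hits = sum(1 for props in subject_props.values() if cluster <= props)
    return hits / len(subject_props)


def greedy_partition(clusters_with_support):
    """Turn overlapping clusters into non-overlapping ones.

    Clusters are visited from highest to lowest support; properties already
    claimed by a higher-support cluster are pruned from later clusters.
    """
    kept, used = [], set()
    ordered = sorted(clusters_with_support, key=lambda c: c[1], reverse=True)
    for props, _support in ordered:
        remaining = frozenset(props) - used
        if remaining:
            kept.append(remaining)
            used |= remaining
    return kept


# Hypothetical subjects and the properties observed for each of them.
subject_props = {
    "s1": {"name", "age", "email"},
    "s2": {"name", "age"},
    "s3": {"name", "age"},
    "s4": {"age", "email"},
}
candidates = [frozenset({"name", "age"}), frozenset({"age", "email"})]
scored = [(c, support(c, subject_props)) for c in candidates]
partition = greedy_partition(scored)
```

In this toy input, `{"name", "age"}` has the higher support and is kept whole, while `age` is pruned from the second cluster, leaving `{"email"}`.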

Wang et al.\cite{wang2010flextable} proposed FlexTable, which focuses on changing the schema as new triples are inserted and proposes a new record-wise storage scheme to facilitate schema evolution. In this approach, the two schemas with the maximum similarity value are merged when a new RDF tuple is inserted.
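The merge step can be illustrated with a small sketch. The similarity measure is not specified here, so the sketch assumes Jaccard similarity over property sets purely as a stand-in; `jaccard` and `merge_most_similar` are hypothetical names, and schemas are assumed to be non-empty sets of property names.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty property sets (an assumed measure)."""
    return len(a & b) / len(a | b)


def merge_most_similar(schemas):
    """Merge the pair of schemas with the highest similarity into their union."""
    if len(schemas) < 2:
        return list(schemas)
    i, j = max(
        combinations(range(len(schemas)), 2),
        key=lambda pair: jaccard(schemas[pair[0]], schemas[pair[1]]),
    )
    merged = schemas[i] | schemas[j]
    rest = [s for k, s in enumerate(schemas) if k not in (i, j)]
    return rest + [merged]


# Hypothetical schemas, each given as a set of property names.
schemas = [{"name", "age"}, {"name", "age", "email"}, {"title", "year"}]
result = merge_most_similar(schemas)
```

Here the first two schemas share most of their properties and are merged, while the unrelated third schema is left untouched.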

Recently, Matono et al.\cite{matono2012paragraph} proposed a variant of the property table approach, called the paragraph table, that gathers adjacent triples in a structured RDF document into a so-called RDF paragraph and then stores them in their corresponding relational tables without decomposition or connection. This approach depends heavily on the order of the RDF triples in an RDF document. Thus, for RDF web-crawl data or for many DBpedia datasets (e.g., infoboxes), where the collected RDF triples are stored in an unstructured document without any specific order, this approach cannot perform well.
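The order dependence can be made concrete with a short sketch. It assumes, for illustration only, that a paragraph is a maximal run of adjacent triples with the same subject; the cited work may define paragraphs differently. The function `paragraphs` and the sample triples are hypothetical.

```python
from itertools import groupby


def paragraphs(triples):
    """Group adjacent triples that share a subject (assumed paragraph rule)."""
    return [list(run) for _, run in groupby(triples, key=lambda t: t[0])]


# The same four hypothetical triples in two different document orders.
ordered = [
    ("s1", "name", "Alice"),
    ("s1", "age", "30"),
    ("s2", "name", "Bob"),
    ("s2", "age", "25"),
]
interleaved = [ordered[0], ordered[2], ordered[1], ordered[3]]

ordered_paragraphs = paragraphs(ordered)
interleaved_paragraphs = paragraphs(interleaved)
```

With the ordered document, each subject yields one paragraph of two triples. With the interleaved document, every triple ends up in its own paragraph, so the grouping no longer reflects any useful table structure.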

Recent works\cite{levandoski2009rdf,wang2010flextable,matono2012paragraph} provide automated methods for exploring the schema; however, none of them tries to discover the foreign key relationships or correlations between schema tables.

RDF Graph Visualization tools