
At the end of 2010 there was some brainstorming for MonetDB/SPARQL and an initial plan. In Q1/Q2 2012 the LOD2 project will provide the context to work on this in full. Meanwhile, additional thinking has led to a partial redesign.

Motivational Ramblings: Locality in RDF Storage[edit]

One of the biggest problems in SPARQL query processing is low locality. RDF is just a data format: data represented in it has just as much structure as it would have in any other format. This structure typically surfaces as correlations:

  • certain kinds of subjects have the same set of properties (belong to the same class)
  • certain classes are connected over the same kinds of property paths (foreign key relationships)

The important paper, which we will extend, is:

Of course, it is possible to store less structured data in RDF, but even in web crawled RDF there is quite a bit of regularity. The irregularity typically surfaces as:

  • missing properties that would have been expected (incomplete data),
  • multi-valued properties (more than one triple with same subject and property),
  • unexpected properties that often occur at low frequency (properties that seem out of place for a subject, or that occur very infrequently in the dataset, often spelling errors).

Current RDF systems store data as triples sorted in various permutations to get some locality:

  • SPO: for each subject we get all its attributes near each other (NSM). However, the records (subjects) of all the different "tables" (in the NSM analogy) are stored interspersed in seemingly random order.
  • PSO: an order that resembles DSM: for each property we get subjects and values. Given that certain properties occur only for a single class of subjects, it is quite close to DSM, but not fully: certain properties do occur for multiple subject classes, so their subjects are still somewhat interspersed.
  • OPS: similar to value indexes in relational systems

So while it seems that RDF systems with this exhaustive indexing would approach the efficiency of relational database engines, this is not really the case: at best, this indexing gets RDF systems back to the un-indexed locality of relational systems. Queries combine different patterns (SPO, PSO, OPS), and the joins between those have no locality, similar to relational query processing without index locality. Even if the lookups can use index structures (B-trees), we get large numbers of random fetches, which do not scale on current hardware.

One of the things that could be done to reduce joins is to at least cover the SPO multi-attribute star pattern. Rather than mapping it onto a sequence of mergejoins, it seems natural to have a special RDFscan that delivers a single stream of records according to a star pattern. This RDFscan could work either from SPO (NSM) or from PSO (DSM-style) storage. A slight variant is RDFjoin, which does the same but receives a stream of candidate subjects. The latter operator was recently proposed and termed "Pivot Index Scan".
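The star-pattern merge behind such an RDFscan can be sketched as follows. This is a toy model over PSO-style inverted lists, assuming 1..1 properties; the function name and data layout are illustrative, not the actual MonetDB operator:

```python
def rdf_scan(prop_lists):
    """prop_lists: property -> list of (subject, object) pairs sorted on
    subject (PSO-style inverted lists). Yields one record per subject that
    has *every* requested property, i.e. a required star pattern."""
    cursors = {p: 0 for p in prop_lists}
    while all(cursors[p] < len(prop_lists[p]) for p in prop_lists):
        # Candidate subject: the largest subject currently under any cursor.
        s = max(prop_lists[p][cursors[p]][0] for p in prop_lists)
        aligned = True
        for p, lst in prop_lists.items():
            i = cursors[p]
            while i < len(lst) and lst[i][0] < s:
                i += 1                      # advance this cursor to the candidate
            cursors[p] = i
            if i == len(lst) or lst[i][0] != s:
                aligned = False             # this property misses subject s
        if aligned:
            yield s, {p: lst[cursors[p]][1] for p, lst in prop_lists.items()}
            for p in cursors:
                cursors[p] += 1
```

The point of the real operator is that, unlike a plan of n-1 separate mergejoins, this produces the whole record stream in one pass over the property lists.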

To get real locality we would like to order the S identifiers in a meaningful way:

  • group them by characteristic sets
  • within a characteristic set, order them on some index keys (i.e. property values)
  • or even more extreme: adopt a multi-table clustering strategy for this (BDC like).

Similarly, we should order the O identifiers in a meaningful way. Preferably, such that the O identifiers can be used for value range-predicates. See the weird string mapping scheme in:

Regarding P, there are typically few values that are very frequent. But there may be a long tail of infrequent P's (often spelling errors or low-value noisy properties). Such P's could be treated differently than the frequent P's.

Critique of the Original Design[edit]

The original MonetDB RDF design lacked on four fronts: novelty, compactness, updates, and advanced SPARQL features.

Novelty[edit]
The chosen approach of full replication in all permutations has been demonstrated many times. As described above, all this indexing still does not lead to real locality that can match properly designed relational data warehousing solutions.

On the other hand, the work on characteristic sets shows a perspective for getting all the indexing and locality benefits that are within reach of relational practitioners. Bringing these to the world of RDF would truly advance the sorely lacking state of the art in RDF stores.

Compactness[edit]
Data storage size in RDF systems is of crucial importance due to this lower locality. RDF stores typically only perform well if the hot set of the database fits in main memory, because random access to disk-resident data is very slow. Random access to RAM is tolerable (although not efficient either, as there will be many TLB and CPU cache misses).

The database hot set usually consists of the SPO/POS/.../OPS integer tables (or B-trees). Apart from these integers, there are mappings from integers to URIs and values. These mapping tables are typically less hot: OLAP workloads that count things often do not query them at all, and for OLTP queries the access is so sparse that a few I/Os are tolerable.

Thus, replicating S,P,O information many times has the drawback that it increases data volume. Another critical feature in RDF stores is therefore compression: an ordered table like PSO is highly compressible.

Both replication and lack of compression were drawbacks of the original MonetDB RDF plan: the shredder creates all 6 orders of PSO. Further, it does not apply any compression, storing the identifiers as (8-byte) oid numbers; thus the original design needs 144 bytes per triple (6 orders × 3 columns × 8 bytes), excluding strings and excluding the URI mapping table. We note that the URI mapping does apply compression and is relatively efficient.

Updates[edit]
Updates in SPARQL consist of triple deletes and inserts. The original MonetDB SPARQL design, with its replicated ordered tables, made updates quite expensive due to this replication. Moreover, in the MonetDB data structures it is not really possible to update ordered tables without fully copying them. All in all, that design was quite update-unfriendly.

One might argue whether SPARQL workloads really need updates. There seem to be quite a few that do not, so this is not a must-have feature. Still, a database system without updates is a bit like a car without a steering wheel, and many applications do need updates.

Advanced Features[edit]

Finally, the original MonetDB RDF design did not have support for:

  • graphs: the notion that triples belong to a set (a graph) is by now assumed standard by SPARQL users
  • inference: support for rdfs:subClassOf, owl:equivalentClass, owl:equivalentProperty, rdfs:subPropertyOf and owl:sameAs in SPARQL evaluation (i.e. the ability to use property synonyms, rdfs:type synonyms and subject synonyms).

Extending Characteristic Sets[edit]

A Characteristic Set (CS) is a set of properties that typically occur with the same subject. In fact, we can also call them object "classes", with the properties as object "attributes". Just like in UML, these attributes can have multiplicities such as 0..1, 1..1 and 0..n.

In RDF terms, one can also see a CS as an RDFS class definition (with property definitions) that is derived from the data. RDFS has no support for specifying property multiplicities (0..1, 1..1, 0..n).

The initial paper describes a classification algorithm that groups subjects by the properties they tend to have. Each frequent co-occurring property combination is called a characteristic set (CS).

We extend the initial algorithm in two ways:

  • support for typed properties. We do not only consider the properties themselves, but also the types of their values, to split characteristic sets.
  • support for foreign keys between CS's.

In contrast to the original CS algorithm, we further analyze the types of the properties that have been grouped: we look at the literal type of each property value, and in the case of URI objects we type them by CS membership. We then split characteristic sets such that their property types become homogeneous, and keep only those CS's that are still frequent.

The above process also leads to a foreign key graph between CS's. That is, if a property of one CS always refers, in its object field, to subjects of another CS, this constitutes a foreign key between the two CS's.
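A minimal sketch of the typed-CS grouping step. The representation is hypothetical: each triple is assumed to carry the literal type of its object (or, for URI objects, a label for the target CS), and the frequency threshold is illustrative:

```python
from collections import Counter, defaultdict

def characteristic_sets(triples, min_freq=2):
    """triples: iterable of (s, p, o, otype), where otype is the literal type
    of o, or the target-CS label for URI objects. Groups subjects by their
    (property, type) signature -- i.e. characteristic sets split by value
    type -- and keeps only signatures that are frequent enough."""
    sig = defaultdict(set)
    for s, p, o, otype in triples:
        sig[s].add((p, otype))      # typed property: same p with another type splits the CS
    counts = Counter(frozenset(props) for props in sig.values())
    return {cs: n for cs, n in counts.items() if n >= min_freq}
```

Subjects whose signature falls below the threshold would remain in basic storage rather than forming a CS of their own.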

New Design: basic + clustered PSOG[edit]

The storage scheme uses two representations, namely a basic representation and a clustered representation. Triples are stored in one of these two representations. Typically, a dataset is first loaded in basic representation, and then (partially) moved to clustered representation following a reorganization.

We propose, in the spirit of MonetDB, to target analytical RDF applications. For such purposes, the PSOG order is best as it resembles the column-store approach.

Basic storage: pSOg[edit]

Basic triple storage is a PSOG table, where S is of type oid (64 bits), O is of type lng (also 64 bits), and P and G are ints (32 bits). Hence MonetDB SPARQL restricts the number of properties and graphs to at most 2G each.

pSOg order is in fact just PS order: we will not enforce a sub-order on O and G.

Every O is represented as a lng (64-bit integer), so that whatever the expression, subquery or intermediate result, we can represent values in a single, efficiently processable column.

The highest 2 bits are exploited as follows:

  • 00 URI (an oid that points into Lefteris' adaptive URI mapping).
  • 01 dateTime (lowest 62 bits numeric)
  • 10 numeric (bits 4..61 hold the number, lowest 4 bits the lexical type)
  • 1000...000 (tag 10 with all remaining bits zero) is a special value that indicates a deleted triple (it should be ignored)
  • 11 string (an oid that must be looked up in another table); bits 2..16 (14 bits) are an index into a string mapping table.

During string shredding, we will not store all strings in a single mapping table; rather, we value-partition the strings into multiple buckets (at most 8192 buckets in the shredder). During query processing, intermediate expressions may create additional new string values, which are added in new temporary buckets. The even buckets are the shredded buckets, and the odd buckets are the temporary buckets used during query processing. This is a way to (i) provide isolation yet benefit from global string mapping tables, and (ii) speed up string lookups using the bucket approach (the matching bucket allows a range restriction on strings to be prefiltered as a range restriction on the lng's).

XML is as yet unknown and will just be treated as xsd:string. Strings with a language tag are supported by prefixing them with <lang>@ (normal strings start with @).

For the numeric types, the lower four bits are exploited as follows:

  • 0 xsd:double
  • 1 xsd:decimal
  • 2 xsd:integer
  • 3 xsd:boolean
  • 4 xsd:nonPositiveInteger
  • 5 xsd:negativeInteger
  • 6 xsd:long
  • 7 xsd:int
  • 8 xsd:short
  • 9 xsd:byte
  • 10 xsd:nonNegativeInteger
  • 11 xsd:unsignedLong
  • 12 xsd:unsignedInt
  • 13 xsd:unsignedShort
  • 14 xsd:unsignedByte
  • 15 xsd:positiveInteger

Note that this mapping respects integer range order: both equi- and range-selections map to integer (lng) ranges. The integers are stored scaled by 100000 (5 decimal digits), so that decimals and integers are comparable. xsd:double is stored identically to xsd:decimal (rounded to nearest on shred).

Given that we use two bits to tag a lng as numeric, and then 4 bits for its lexical type, 58 bits are left for the actual value. All in all, this gives 58-bit signed integers representing decimals with 5 places after the decimal point. The largest decimal is thus 1441151880758.55871 and the smallest is -1441151880758.55872.

xsd:float will be parsed and can be cast to, but all such values are stored and treated as xsd:double.
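The tagged numeric representation can be sketched as follows. This assumes the layout described above (top 2 bits the tag, lowest 4 bits the lexical-type nibble, bits 4..61 the signed scaled value); the exact bit positions are an assumption, not confirmed implementation detail:

```python
TAG_NUMERIC = 0b10
XSD_DECIMAL = 1            # lexical-type nibble, per the table above
SCALE = 100_000            # decimals are stored scaled by 10^5

def encode_numeric(value, xsd_type):
    """Pack a decimal into the tagged 64-bit object word:
    [2-bit tag | 58-bit signed scaled value | 4-bit lexical type]."""
    scaled = round(value * SCALE)
    assert -(1 << 57) <= scaled < (1 << 57), "outside 58-bit signed range"
    return (TAG_NUMERIC << 62) | ((scaled & ((1 << 58) - 1)) << 4) | xsd_type

def decode_numeric(word):
    assert word >> 62 == TAG_NUMERIC
    xsd_type = word & 0xF
    scaled = (word >> 4) & ((1 << 58) - 1)
    if scaled >= (1 << 57):            # restore the sign of the 58-bit field
        scaled -= 1 << 58
    return scaled / SCALE, xsd_type
```

Because the scaled value is stored in order-preserving two's complement within the word, equi- and range-selections on decimals indeed reduce to lng comparisons, as claimed above.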

In all, the typical storage cost of the basic scheme is 24 bytes per triple (8 + 8 + 4 + 4), plus the string sizes, not counting the URI mapping tables.

Updates: Log Structured Merge Tables[edit]

Updating sorted tables in MonetDB is not really possible, as data is stored in arrays and not in B-trees. Therefore we propose to implement a module that stores data in multiple tables, each differing from the next by a large factor (X >> 2) in size. As such, there will be logX(N) tables Ti, each with intended size c·X^i tuples. Updates go to the smallest table. When a table reaches twice its intended size, it is merged with the next bigger table, all smaller tables move up one place, and a new smallest empty table is created.

This approach is called Log Structured Merge Tables, a minor variant of Log Structured Merge Trees.
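The merge policy can be sketched as follows. This is a toy variant with a simplified policy (a full level is merged into the next one and emptied, rather than all levels shifting up); the factor and base capacity are illustrative:

```python
import bisect

class LSMTable:
    """Toy Log Structured Merge Table: sorted runs whose intended sizes grow
    by a factor X; level 0 receives inserts, and a level that reaches twice
    its intended size is merged into the next level."""
    def __init__(self, factor=4, base=4):
        self.factor, self.base = factor, base
        self.levels = [[]]              # levels[i]: sorted run, ~ base * factor**i tuples

    def insert(self, key):
        bisect.insort(self.levels[0], key)
        i = 0
        while len(self.levels[i]) >= 2 * self.base * self.factor ** i:
            if i + 1 == len(self.levels):
                self.levels.append([])
            # Merge the overfull level into the next one; may cascade upward.
            self.levels[i + 1] = sorted(self.levels[i] + self.levels[i + 1])
            self.levels[i] = []
            i += 1

    def __contains__(self, key):
        # A lookup probes every run; the design relies on there being few.
        for run in self.levels:
            i = bisect.bisect_left(run, key)
            if i < len(run) and run[i] == key:
                return True
        return False
```

Each run stays sorted, so within a run lookups are binary searches and full scans remain sequential, which is exactly what the RDFscan mergejoin needs.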

As the SPOG table will not be read by normal MAL commands, but by our RDFscan command instead, the fact that there are now multiple tables to worry about can be hidden.

A final idea for handling updates is based on the fact that in our scheme the basic SPOG storage will have to deal with the long tail of irregular, rarely used properties, while the more common regular cases migrate to the structured storage scheme.

Clustered Storage: |p|*sOg[edit]

We implement a reorganization operator that runs the characteristic set algorithm on the full dataset. Subjects qualify as members of a CS if they possess all "required" (1..*) properties. A subject may qualify for multiple CS's, in which case it is classified into just one (the one with the most properties); its other triples then go into basic storage.

The metadata that we store keeps track of all detected characteristic sets, their attributes and their attribute multiplicities (0..1, 1..1, 1..n etc). We also store statistics such as the CS cardinalities that can be useful for query optimization.

The clustered storage conceptually most resembles PSOG, but split per property into separate tables, such that we only store SOg (P is constant). Each such table may itself be split into multiple partitions, i.e. a Log Structured Merge Table, though typically there are few of them (this is done to ease updates, because we must respect S order). One can also see our PSOG as an S-table with one column per P ("property tables" in SPARQL jargon), though of course fully automatic.

We will not enforce OG sub-order in our PSOG. To accommodate updates, each table (one per property) is again stored as a Log Structured Merge Table (multiple BATs).

In detail:

  • s: we divide the S-es into runs of two types: (i) dense or (ii) ascending. We only store an oid with the base of each run; the highest bit indicates whether the run is dense or ascending, and the next 7 bits contain the run length. Dense sequences are (base, base+1, ..., base+size-1). Ascending sequences use a second column of bytes containing deltas; thus we can handle gaps of up to 255 (a larger gap must start a new base, hence ends the run).
  • O: just as in basic storage, we store everything as a lng, but compressed: an int column holds the lower 32 bits, and the high bits are RLE-represented as (int:value, byte:count) pairs. This exploits the fact that the upper 32 bits are often equal (or zero) and leads to more compact storage.
  • g: similarly, the graph ID is RLE-stored as (int, byte) pairs, exploiting the fact that graphs often repeat
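The S-run compression can be sketched as follows. For readability the sketch keeps runs as tuples rather than packing the 1-bit kind and 7-bit length into the base oid as described above; the 127-element run limit stands in for that 7-bit length field:

```python
def encode_s_runs(subjects, max_run=127):
    """Compress a sorted subject column into (base, kind, length, deltas)
    runs. kind 'dense' means base, base+1, ...; kind 'asc' keeps one byte
    delta per following element (gap <= 255). A larger gap starts a new run."""
    runs, i = [], 0
    while i < len(subjects):
        base, j = subjects[i], i + 1
        deltas, dense = [], True
        while j < len(subjects) and len(deltas) + 1 < max_run:
            gap = subjects[j] - subjects[j - 1]
            if gap < 1 or gap > 255:
                break                      # large gap forces a new run base
            deltas.append(gap)
            dense = dense and gap == 1
            j += 1
        runs.append((base, "dense" if dense else "asc", j - i,
                     [] if dense else deltas))
        i = j
    return runs

def decode_s_runs(runs):
    out = []
    for base, kind, length, deltas in runs:
        if kind == "dense":
            out.extend(range(base, base + length))
        else:
            s = base
            out.append(s)
            for d in deltas:               # reconstruct from byte deltas
                s += d
                out.append(s)
    return out
```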

In all, the typical storage cost of the clustered scheme is just over 4 bytes per triple (almost all spent on O), plus the string sizes, not counting the URI mapping tables.

CS reorganization[edit]

When the primary basic SPOG table gets too big, we run the characteristic set algorithm and move triples into clustered storage. This leads to URI renumbering, which is a headache. One simple strategy to limit the impact is to never renumber a subject that is also used as a property; then we never need to modify property columns (or metadata), and the work is limited to objects.

Object renumbering does have to occur, both in the basic storage as well as in the clustered storage.

Simple Inference Support[edit]

For inferencing, I propose to add the classical trick of expand-on-load.

  • rdfs:subClassOf/rdfs:subPropertyOf: generate extra triples implied by the subclass/subproperty relationship. Store these in a special graph that contains the derived triples for the original graph.
  • owl:sameAs/owl:equivalentClass: choose one of the URIs as the primary one, and add the sameAs to a translation table that is used to map incoming URI literals in queries to the primary one.
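The owl:sameAs translation table can be sketched as a union-find over URIs; which URI becomes the primary is a policy choice (here, the lexicographically smallest, purely as an illustration):

```python
class SameAsMap:
    """Union-find over URIs: owl:sameAs links pick one URI as primary;
    incoming URI literals in queries are translated via find() before lookup."""
    def __init__(self):
        self.parent = {}                 # non-primary URI -> URI closer to primary

    def find(self, uri):
        root = uri
        while self.parent.get(root, root) != root:
            root = self.parent[root]
        return root

    def same_as(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Illustrative policy: the smaller URI becomes the primary one.
            self.parent[max(ra, rb)] = min(ra, rb)
```

Because only the primary URI appears in the stored triples, query evaluation stays oblivious to synonyms: translation happens once, on the query constants.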

Nothing really spectacular, but enough in the short term to be at least not clearly deficient with respect to the RDF store competition.

Multi-dimensional Indexing[edit]

The advanced part of characteristic set clustering is the ordering of the S identifiers. S identifiers are large integers (64-bit oids), and we are free to play with their values: most other systems just map URI strings to oids in insertion order, whereas we exploit them to create locality.

Two questions arise:

  • what criteria determine the S-oid order?
  • how to exploit this order in query processing?

Dimensions and Their Uses (over Foreign Keys)[edit]

The first question is generalized by rough multi-dimensional clustering: the idea is to define a number of dimensions using range-binning.

For each characteristic set, a number of such dimensions can be defined over path expressions. A path expression is a fixed-length sequence of properties, with the restriction that each must be a [0..1] or [1..1] attribute in the relevant characteristic set. Note that with paths of length greater than one, we traverse foreign keys between characteristic sets.

Each characteristic set can thus be divided into cells according to the multi-dimensional clustering. The subjects inside a cell have an arbitrary identifier order, but the bin numbers of the dimensions form the major bits of the S-oids (so the dimensions determine a rough ordering on S). Taking TPC-H (RDF-H) as an example, we can create the dimensions:

  • nation
  • date

And cluster Supplier and Customer (two CS's) on nation. We can cluster Order (a CS) on date, but also on customer-nation (i.e. following a CS foreign key). We can cluster Lineitem on order-date, customer-nation and supplier-nation. This co-clustering on dimensions can be exploited in queries that restrict suppliers to certain regions: not only the scan for suppliers is limited to the relevant cells, but also the scan for lineitems.
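Composing S-oids from dimension bins can be sketched as follows. The bit widths and the 40-bit sequence field are illustrative choices, not fixed by the design:

```python
def make_soid(bins, widths, seq, seq_bits=40):
    """Compose an S-oid whose major bits are the dimension bin numbers and
    whose minor bits are a per-cell sequence number. widths[i] is the bit
    width of dimension i."""
    oid = 0
    for b, w in zip(bins, widths):
        assert b < (1 << w), "bin number exceeds dimension width"
        oid = (oid << w) | b
    return (oid << seq_bits) | seq

def soid_range(bins, widths, seq_bits=40):
    """Half-open oid interval [lo, hi) covering one cell: a selection on the
    dimensions becomes a range restriction on S, limiting the scan."""
    lo = make_soid(bins, widths, 0, seq_bits)
    return lo, lo + (1 << seq_bits)
```

A query restricting, say, supplier-nation then translates its predicate into one or more such S-oid ranges, which apply equally to the co-clustered Lineitem cells.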

There are some SPARQL specific caveats to this kind of multi-dimensional indexing:

  • we might also want to treat the graph to which a triple belongs as a special kind of dimension. We essentially have one order, PSOG; the multi-dimensional indexing also gives us the benefits of PO* order, but sometimes we also want PG* order.
  • it is natural to use rdfs:subClassOf to represent the membership of a fact in a dimension hierarchy. Expanding all inferred triples (if productX has rdf:type mp3player, it also has rdf:type audio and rdf:type electronics) turns rdf:type into a 0..n attribute that cannot be used for multi-dimensional indexing. We could possibly create a stratification of the elements of a certain rdf:type that form an rdfs:subClassOf hierarchy with each other. Membership of one layer of this hierarchy could be made explicit in generated RDF triples: S? productLevel1 classX (productLevel1 is the invented name for the inferred level of the hierarchy, and classX is the URI of the class). This would be a 0..1 attribute that could be used for multi-dimensional indexing.

Ideally, we would want to monitor a workload for some time (e.g. working on the basic scheme), infer from that workload the most important access paths for all the characteristic sets, and use these in the CS reorganization phase to steer the generation of S-oids and thus query locality.

Query Processing[edit]

We propose the addition of the following operations:

  • generate all (S,O1,..,On) bindings for the given properties P1..Pn
  • each Pi is a string that holds a property URI.
  • if the Pi string starts with ?, the property is optional; otherwise it is required.
  • if the Pi string starts with -, the property is required, but its binding does not have to be returned.
  • each Pi may be suffixed by two constants Li,Hi: Pi=[?|-]URI Li Hi
  • Li/Hi are literal values given as strings (e.g. "42^xsd:integer"). G and the Pi are URI strings.
  • we push the selections Li<=Oi<Hi down using zone-maps; hence the implementation may omit out-of-range bindings, but may still produce false positives
  • the result is a set of BATs. There are many command signatures, one for each number of properties n (it determines the number of result BATs)
  • G is a restriction on the graphs to be queried (or *)
  • S is a restriction on the subjects of interest, typically used for joins, where the O identifiers of one kind of data item are used to match the S identifiers of another (a "foreign-key join" in relational terms).
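Parsing one such Pi argument can be sketched as follows; the surface syntax `[?|-]URI [Lo Hi]` is an assumption based on the description above, not a fixed grammar:

```python
def parse_prop(spec):
    """Parse one Pi argument of the proposed RDFscan command:
    '[?|-]URI [Lo Hi]' -> (uri, optional, returned, lo, hi).
    '?' marks an optional property; '-' marks a required property whose
    binding is not returned; Lo/Hi are optional literal range bounds."""
    parts = spec.split()
    uri, lo, hi = parts[0], None, None
    if len(parts) == 3:
        lo, hi = parts[1], parts[2]
    optional = uri.startswith("?")
    returned = not uri.startswith("-")
    if uri[0] in "?-":
        uri = uri[1:]
    return uri, optional, returned, lo, hi
```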

We will also need some functions to convert our lng representation of literals to real MonetDB literals and back. For string mappings, we will use a system with multiple string lookup tables (selected by the major 2..X bits) such that we can add query-local mapping tables for newly created strings. The string-fetch retrieves the strings from the right BATs depending on the mapping-table bits.

RDFscan Algorithm[edit]

Despite the many command signatures, and the syntactic difference between RDFscan and RDFjoin (internally only a rather minor difference), there will be one operator implementing them all. This operator is rather complicated, but it singles out the optimized case where a segment of bindings can be produced exclusively from data within one Characteristic Set (CS). That segment then runs an optimized result-producing routine that just appends the segment to the result columns without any mergejoin overhead.

Differences with the D2.5 LOD2 deliverable document:

  • TRIPLES (basic storage) in PSog order, because this is really handy for the mergejoin
  • our CS's now consist of 1..1 properties only. This causes more CS diversity, but makes result generation easier and faster, as within a segment everything is regular

A sketch of the algorithm:

input: RPx[uri,lo,hi] = array of property URIs, with lo-high bounds (maybe open-ended)
       OPx[uri] = array of optional property URIs
       JOINSEL = sorted S-OIDs (non-NULL in case of RDFjoin) that restrict the output
output: a table of shape SUBJ | RVx | OVy bindings
  lowbound = 0
  we get for each property a INVLIST_p ordered on subject
  for all CS
    if this CS does not cover all required properties
      PIVOT = merge with the required properties from the INVLIST_p inside this CS area
              note: order required properties by selectivity
    else
      PIVOT = NULL
    if the properties of this CS area intersect with the scan
       for all segments of this CS area
          if (PIVOT == NULL || PIVOT intersects segment) && (JOINSEL == NULL || JOINSEL intersects segment)
             rangelist = <segment>
             for all required properties p
                 rangelist = intersect_range_from_zonemap(rangelist,p)
             for all range in rangelist
                if (PIVOT == NULL || PIVOT intersects range) && (JOINSEL == NULL || JOINSEL intersects range)
                   CSS = decompress subjects
                   if (no required properties in CS)
                      SUBJS = PIVOT
                    else if (PIVOT)
                       SUBJS = intersect(PIVOT,CSS)
                    else
                       SUBJS = CSS
   while S in LIST
      if (lowbound && CSS==PIVOT)
            advance PIVOT+LIST to lowbound
      for all properties
         advance/get matchptrs[p] of S from INVLIST_p (update lowbound)
         advance/get pos of S from SUBJLIST and add to matchptrs[p]
      if all required properties have at least one matchptr
   if (propnr == totprops)
      for(i=0;i<propnr;i++) if (returnprop(i))
         EMIT(rescol[i], (pos[i] == -1)?NILPTR:matchptr[i][pos[i]])
   else if (matches[propnr] > 0) {
      for(pos[propnr]=0; pos[propnr]<matches[propnr]; pos[propnr]++) {
   else if (optional prop)
      pos[propnr]=-1 # emit a NIL
   maxpos = lookup last subject <= lowbound
   if (JOINSEL)
      sel = compute matching positions
      for (i in properties inside CS)  if (returnprop(i))
         # decompress with quickskip over nonselected items
         for v in decompress(sel,maxpos) EMIT(rescol[i],&v)
      for (i in properties not in CS)  if (returnprop(i))
         for(j in sel) EMIT(rescol[i],NILPTR)
      for (i in CS properties) if (returnprop(i))
         for v in decompress(maxpos) EMIT(rescol[i],&v)
      for (i in properties not in CS) if (returnprop(i))
         for(j <= maxpos) EMIT(rescol[i],NILPTR)