MonetDB SPARQL/rdfss


Algorithm for RDF schema exploration:

We assume that a basic triples table, e.g., SPO, has been created. Using this table, the RDF schema can be recognized by a multi-pass algorithm in which each pass sequentially scans all the triples in the SPO table.

- Pass 1:

+ Build a hashmap, namely "csMap", in which each element corresponds to a set of properties. (Note that the properties have been grouped and sorted per subject, e.g., subject S1 has properties {P1, P2, ..., Pk}.)


key = hashfcn({P1, P2, ..., Pk})
value = <csOid, frequency>

The frequency is updated whenever the same set of properties is found again.
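The following is a minimal, standalone C++ sketch of this step, not MonetDB code: it assumes the sorted property set of each subject is already available from the SPO scan, and the names CSEntry, hashPropSet and nextCsOid are illustrative stand-ins for hashfcn() and the csMap entries.

 /* csmap_sketch.cpp: count how often each property set (candidate CS) occurs. */
 #include <cstdint>
 #include <cstdio>
 #include <unordered_map>
 #include <vector>
 
 struct CSEntry { uint64_t csOid; uint64_t frequency; };
 
 /* Illustrative stand-in for hashfcn() over a sorted property-id list. */
 static uint64_t hashPropSet(const std::vector<uint64_t> &props) {
     uint64_t h = 1469598103934665603ULL;              /* FNV-1a style combine */
     for (uint64_t p : props) { h ^= p; h *= 1099511628211ULL; }
     return h;
 }
 
 int main() {
     /* Property sets as they would be produced per subject by the SPO scan. */
     std::vector<std::vector<uint64_t>> subjects = {
         {1, 2, 3}, {1, 2, 3}, {1, 2}, {1, 2, 3}
     };
     std::unordered_map<uint64_t, CSEntry> csMap;
     uint64_t nextCsOid = 0;
     for (const auto &props : subjects) {
         uint64_t key = hashPropSet(props);
         auto it = csMap.find(key);
         if (it == csMap.end())
             csMap.emplace(key, CSEntry{nextCsOid++, 1});  /* new CS */
         else
             it->second.frequency++;        /* same property set found again */
     }
     for (const auto &kv : csMap)
         std::printf("CS %llu: frequency %llu\n",
                     (unsigned long long)kv.second.csOid,
                     (unsigned long long)kv.second.frequency);
     return 0;
 }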

+ Build a hashmap storing each property and its frequency, namely "propMap". This hashmap helps in detecting "dirty" properties with a very low frequency (< dirtyThreshold). The idea is to check whether a set of properties (stored in csMap) has a low frequency only because of an infrequent/dirty property.


key = hashfcn(P1)
value = <propId, frequency>

(Note that we can build the propMap easily from the PSO table. As P is sorted in the PSO table, the frequency value in the hashmap does not need to be updated for every single P scanned, but only when a new P appears.)
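A sketch of that run-based counting, assuming the P column of the PSO table is available as a sorted array; PropEntry and the literal property ids are illustrative, not MonetDB internals.

 /* propmap_sketch.cpp: build propMap from the sorted P column of a PSO table. */
 #include <cstdint>
 #include <cstdio>
 #include <unordered_map>
 #include <vector>
 
 struct PropEntry { uint64_t propId; uint64_t frequency; };
 
 int main() {
     /* P column of a (hypothetical) PSO table, already sorted on P. */
     std::vector<uint64_t> pColumn = {1, 1, 1, 2, 2, 5, 5, 5, 5};
     std::unordered_map<uint64_t, PropEntry> propMap;
     uint64_t nextPropId = 0;
     size_t i = 0;
     while (i < pColumn.size()) {
         size_t runStart = i;
         while (i < pColumn.size() && pColumn[i] == pColumn[runStart])
             i++;
         /* One update per distinct P instead of one per scanned triple. */
         propMap.emplace(pColumn[runStart],
                         PropEntry{nextPropId++, (uint64_t)(i - runStart)});
     }
     for (const auto &kv : propMap)
         std::printf("P%llu: frequency %llu\n",
                     (unsigned long long)kv.first,
                     (unsigned long long)kv.second.frequency);
     return 0;
 }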

While scanning the SPO triple table to build the hashmap of CSs, the support of a CS is updated whenever another CS with the same hash value is encountered. During this pass, a set of frequent CSs (coined freqCSset) can also be extracted from all the CSs: each CS that reaches the frequency threshold is added to freqCSset. Since the number of items in freqCSset is small (< 10,000), it does not cost much time to run several statistical analysis algorithms on it, e.g., finding maximum CSs.

/* After the first pass, an initial set of CSs can be recognized by iterating over all the elements in csMap and keeping the ones having a frequency value > freqThreshold. */
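A possible shape of that filtering step, reusing the CSEntry/csMap layout from the sketch above; the sample frequencies and the threshold are made up for illustration.

 /* freqcs_sketch.cpp: keep only the CSs whose frequency exceeds freqThreshold. */
 #include <cstdint>
 #include <cstdio>
 #include <unordered_map>
 #include <vector>
 
 struct CSEntry { uint64_t csOid; uint64_t frequency; };
 
 std::vector<CSEntry> extractFreqCSset(const std::unordered_map<uint64_t, CSEntry> &csMap,
                                       uint64_t freqThreshold) {
     std::vector<CSEntry> freqCSset;
     for (const auto &kv : csMap)
         if (kv.second.frequency > freqThreshold)   /* frequent CS */
             freqCSset.push_back(kv.second);
     return freqCSset;
 }
 
 int main() {
     std::unordered_map<uint64_t, CSEntry> csMap = {
         {101, {0, 1200}}, {202, {1, 3}}, {303, {2, 87}}
     };
     for (const CSEntry &cs : extractFreqCSset(csMap, 50))
         std::printf("frequent CS %llu (frequency %llu)\n",
                     (unsigned long long)cs.csOid,
                     (unsigned long long)cs.frequency);
     return 0;
 }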

- Pass 2:

For each set of properties (corresponding to a subject) with low frequency (obtained by looking at csMap), we check which properties in that set are dirty properties. Then we remove the dirty properties one by one and check the frequency of the set of remaining properties.
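A sketch of this refinement step under the same illustrative names as above (hashPropSet, CSEntry); propFreq plays the role of propMap, and the sample values and threshold are made up.

 /* pass2_sketch.cpp: drop dirty properties from a low-frequency property set
    and re-check the frequency of the remaining set in csMap. */
 #include <algorithm>
 #include <cstdint>
 #include <cstdio>
 #include <unordered_map>
 #include <vector>
 
 struct CSEntry { uint64_t csOid; uint64_t frequency; };
 
 static uint64_t hashPropSet(const std::vector<uint64_t> &props) {
     uint64_t h = 1469598103934665603ULL;
     for (uint64_t p : props) { h ^= p; h *= 1099511628211ULL; }
     return h;
 }
 
 static void refineLowFreqCS(std::vector<uint64_t> props,
                             const std::unordered_map<uint64_t, uint64_t> &propFreq,
                             const std::unordered_map<uint64_t, CSEntry> &csMap,
                             uint64_t dirtyThreshold) {
     std::vector<uint64_t> dirty;
     for (uint64_t p : props)
         if (propFreq.count(p) && propFreq.at(p) < dirtyThreshold)
             dirty.push_back(p);                    /* infrequent ("dirty") property */
     for (uint64_t d : dirty) {
         props.erase(std::remove(props.begin(), props.end(), d), props.end());
         auto it = csMap.find(hashPropSet(props));  /* frequency of the remaining set */
         std::printf("after removing P%llu: frequency %llu\n",
                     (unsigned long long)d,
                     (unsigned long long)(it != csMap.end() ? it->second.frequency : 0));
     }
 }
 
 int main() {
     std::unordered_map<uint64_t, uint64_t> propFreq = {{1, 5000}, {2, 4800}, {9, 2}};
     std::unordered_map<uint64_t, CSEntry> csMap = {
         {hashPropSet({1, 2}), {0, 4700}}, {hashPropSet({1, 2, 9}), {1, 2}}
     };
     refineLowFreqCS({1, 2, 9}, propFreq, csMap, 10);   /* dirtyThreshold = 10 */
     return 0;
 }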

Set of <must-have> or <always-co-occur> properties:

We observe that, for a certain type of subject, a set of fundamental properties always appears (which may form the primary key of the SQL table for that subject type).

The heuristic here is: check the support of each property in a CS. If one property has exceptionally small support compared to the others (which have high support values), then that property is a dirty one.
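One possible, illustrative formulation of this heuristic; the factor of 10 is an assumed cut-off, not a value taken from the MonetDB code.

 /* dirtyprop_sketch.cpp: flag properties whose support within one CS is far
    below the support of the other properties. */
 #include <cstdint>
 #include <cstdio>
 #include <vector>
 
 /* support[i] = support of property i within one CS (assumed precomputed). */
 static std::vector<size_t> findDirtyProps(const std::vector<uint64_t> &support) {
     uint64_t maxSupport = 0;
     for (uint64_t s : support)
         if (s > maxSupport) maxSupport = s;
     std::vector<size_t> dirty;
     for (size_t i = 0; i < support.size(); i++)
         if (support[i] * 10 < maxSupport)          /* far below the dominant support */
             dirty.push_back(i);
     return dirty;
 }
 
 int main() {
     /* Properties 0..3 of one CS; property 3 has exceptionally small support. */
     std::vector<uint64_t> support = {980, 975, 990, 4};
     for (size_t i : findDirtyProps(support))
         std::printf("property %zu looks dirty (support %llu)\n",
                     i, (unsigned long long)support[i]);
     return 0;
 }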

- Pass 3: Detecting Foreign Keys

A foreign key appears when a CS always refers to another CS, i.e., an object oid in the first CS is the subject oid of the second CS.

For each frequent CS, store one list of object oids corresponding to that CS.

 CS1:  P11, P12, ..., P1h
     O11, O12, ..., O1h
 CS2:  P21, P22, ..., P2m
     O21, O22, ..., O2m
 CSk:  Pk1, Pk2, ..., Pkn
     Ok1, Ok2, ..., Okn

We believe that only the object oid of a URI can possibly be the subject oid in another CS. Thus, for each CS, we only check the object oids that are URIs when looking for foreign key relationships. Note that the number of objects that are URIs is also very small, so this check can be done quite quickly.

 For each CS cs
     For each obj in cs.objOids
         if (obj < RDF_MIN_LITERAL){            /* obj is a URI, not a literal */
             if (BATfind(obj, s, SPO) != 0){    /* obj also appears as a subject */
                 /* Refers to one subject */
                 /* Then process further (candidate foreign key) */
             }
         }

PROBLEM: When we find a Subject that an object refers to, how can we know the CS that the Subject belongs to? Note that we do not store an array of <subject, list of predicates> since it would consume too much storage space.

Assign different oids to different types of Object -------

The current Map bat is used for assigning an oid to an Object value. However, we now want to modify the oid of each Object depending on its type.
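One conceivable encoding, consistent with the obj < RDF_MIN_LITERAL check in the pseudocode above, is to reserve disjoint oid ranges per object type so the type can be read off the oid itself. The boundary constant and the encode/decode helpers below are illustrative assumptions, not the actual Map bat implementation.

 /* oid_by_type_sketch.cpp: derive an object's type from its oid range. */
 #include <cstdint>
 #include <cstdio>
 
 enum ObjType { OBJ_URI = 0, OBJ_LITERAL = 1 };
 
 /* Assumed boundary: URIs get oids below it, literals at or above it. */
 static const uint64_t RDF_MIN_LITERAL = 1ULL << 62;
 
 static uint64_t encodeOid(uint64_t rawOid, ObjType t) {
     return (t == OBJ_URI) ? rawOid : (rawOid | RDF_MIN_LITERAL);
 }
 
 static ObjType decodeType(uint64_t oid) {
     return (oid < RDF_MIN_LITERAL) ? OBJ_URI : OBJ_LITERAL;
 }
 
 int main() {
     uint64_t uriOid = encodeOid(42, OBJ_URI);
     uint64_t litOid = encodeOid(42, OBJ_LITERAL);
     std::printf("oid %llu -> %s\n", (unsigned long long)uriOid,
                 decodeType(uriOid) == OBJ_URI ? "URI" : "LITERAL");
     std::printf("oid %llu -> %s\n", (unsigned long long)litOid,
                 decodeType(litOid) == OBJ_URI ? "URI" : "LITERAL");
     return 0;
 }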