User:Linnea Passing/Notes

From MonetDB

Schedule

Original Schedule

  • April 1st -- May 10th: Literature Research and Familiarization with MonetDB
  • May 11th -- May 31st: Labelling Algorithms
  • June 1st -- June 21st: Search Result Ranking
  • June 22nd -- August 2nd: Prototype and First Evaluation
  • August 3rd -- August 16th: Code Integration
  • August 17th -- August 30th: Evaluation and User Testing
  • August 31st -- September 30th: Buffer

Updates:

  • Phase "Labelling Algorithms" extended, but most of the "Code Integration" and "Prototype" work is already done for the labelling algorithms.

Literature

Mining subtopics from text fragments for a web query Wang2013

Information Retrieval, Automated labeling, Table names

Web search queries are too short to extract the user's intent / expected subtopic / expected facet. Wang et al. analyze the results of the original query and cluster them into subtopics using the context (the words around the search terms). For each cluster, they generate a topic by finding a distinguishing core term and expanding it into a topic. This increases the readability of topics compared to selecting salient words as topics (search result clustering).

Conclusion: Interesting algorithm to generate topics. Takes both the frequency of words and IDF (inverse document frequency) into account. Expanding core terms is probably not adaptable to RDF table naming, as RDF tables do not provide sentence-like context but single keywords. Maybe the clustering algorithm described can also be adapted to find similar RDF tables (e.g. products with integer price and products with float price) and to give them equal table names.

WordNet

Information Retrieval, Lexical database, Hierarchy, Synonyms

Database for relations between words; finding hyponyms and hypernyms; word frequency. The license allows commercial use if the license text is shipped with all copies.

A Study on Automated Relation Labelling in Ontology Learning Kavalec2009

Information Retrieval, Relation labeling, Verbs

A relation is formed by two concepts. A verb within n words of both concepts is a potential label for this relation. Lifting the verb to a more abstract level (its hypernym) improves support and leads to better results. For evaluation, comparison with a human-tagged ontology is suggested, judged either by humans or via WordNet.

Conclusion: Deals with sentences only, application in RDF table naming probably not possible.

Open Information Extraction from the Web Banko2007

Open Information Extraction, TextRunner, Relations

Web-crawled data is heterogeneous and domain-independent, which leads to poor performance and accuracy of prior tools. The newly introduced tool TextRunner needs no hand-tagged training data, uses noise-tolerant algorithms, and makes only a few linguistic assumptions. TextRunner creates relations in less time and with fewer errors than previous tools. The paper suggests the following criteria for measuring the correctness of relations: well-formedness of the relation, well-formedness of the two arguments, distinction between abstract and concrete tuples, and truth of the tuple. Synonyms are hard to detect; furthermore, verbs might overlap in only some of their senses.

Conclusion: In my thesis' setting, relations are already defined. The correctness criteria might be useful to judge the labels of (foreign key) attributes.

Visual Browser, RDF-Gravity, tFacet, gFacet

Visualization Tools, Hierarchical Visualization, Graphical Visualization, Filtering

Interesting Features: Hiding edges based on type. Hiding instances of nodes/edges. Full text search. Showing all outgoing/incoming nodes ("neighborhood" of a node). Hide/Show all instances (schema view vs. data view). Some relations are highlighted: subClassOf, hypernym, partOf. Applying multiple filters at a time.

Conclusion: Which features should be added to my work, and which can be outsourced to SQL or to SQL visualization tools?

  • Provided by SQL: Hiding edges based on type, hiding instances, filtering
  • To be implemented: Full text search, showing "neighborhood", data view vs. schema view
  • tbd: highlighting

Characteristic Sets: Accurate Cardinality Estimation for RDF Queries with Multiple Joins Neumann2011

Characteristic Sets, Cardinality Estimation, Correlation

Cardinality estimation on RDF data differs from cardinality estimation in RDBMSs because RDF data contains many correlations between triples, or between P and O within a triple, whereas RDBMS cardinality estimation assumes independence. Triples that are combined into a CS are frequently joined (because they tend to describe semantic types). Furthermore, similar correlation characteristics can be assumed within a CS. Using CSs for cardinality estimation, and hence for plan building, leads to better performance of SPARQL queries. CSs consisting of few entities can be merged into superset CSs.
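
Grouping subjects into characteristic sets can be sketched as follows; this is a minimal illustration, and the triples below are made-up sample data:

```python
from collections import defaultdict

def characteristic_sets(triples):
    """Group subjects by their characteristic set: the set of
    predicates a subject occurs with."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add(p)
    cs = defaultdict(list)                  # predicate set -> subjects
    for s, ps in props.items():
        cs[frozenset(ps)].append(s)
    return cs

# Made-up sample triples (subject, predicate, object):
triples = [
    ("s1", "title", "A"), ("s1", "url", "u1"), ("s1", "type", "article"),
    ("s2", "title", "B"), ("s2", "url", "u2"), ("s2", "type", "blog"),
    ("s3", "name", "C"), ("s3", "birthPlace", "X"),
]
cs = characteristic_sets(triples)
# s1 and s2 share the CS {title, url, type}; s3 forms a CS of its own.
```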

Conclusion: The paper introduces CSs but uses them for cardinality estimation only. We plan to use CSs to improve storage and querying as well as the readability and searchability of RDF data. The handling of infrequent CSs is taken up by Duc's idea to merge similar CSs by introducing NULL values.

When Simple is (more than) Good Enough: Effective Semantic Search with (almost) no Semantics Neumayer2012

Information Extraction, URI content extraction, Semantic search, Heuristic

Very similar to our research. The quality of RDF data search increases if textual content is extracted from URIs. Furthermore, quality increases if values of predicates like "title", "label", or "name" are assigned a higher weight. Assigning higher weight to data from trustworthy sources (e.g. dbpedia) increases quality even more. The quality of results is measured using the Semantic Search Challenge.

Conclusion: This paper can be seen as a rationale for our ideas to create labels from URIs. If content from URIs leads to better search results, it should also lead to better comprehensibility of the schema. The Semantic Search Challenge could be used as a benchmark for our search result ranking, to prove that successful keyword search is still possible even though the data has been transferred to relational tables.

Simple BM25 Extension to Multiple Weighted Fields Robertson2004

BM25F, BM25, Weighting function, Search result ranking

An extension to BM25 to support structured documents (consisting of e.g. title, abstract, body) by assigning a weight to each field. To transform the structured document into an unstructured one, the text within each field is repeated according to its weight. For example, a title with weight 3 is repeated three times in the unstructured version of the document. The tf-idf-like BM25 ranking formula is then applied to the unstructured document.
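
The field-repetition construction can be sketched as follows; the document and weights are made-up examples:

```python
def flatten(doc, weights):
    """Turn a structured document into an unstructured token list by
    repeating each field's tokens according to its integer weight
    (the simple BM25F construction)."""
    tokens = []
    for field, text in doc.items():
        tokens.extend(text.split() * weights.get(field, 1))
    return tokens

doc = {"title": "rdf tables", "body": "labeling rdf tables with type values"}
flat = flatten(doc, {"title": 3, "body": 1})
# "rdf" now occurs 3 (title) + 1 (body) = 4 times in the flat document,
# so BM25 over `flat` weights title matches three times as heavily.
```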

Conclusion: Table titles should have a higher weight than column names. Names of relations ("incoming edges") should be taken into account, too.

Recovering Semantics of Tables on the Web Venetis2011

Automated labelling, Table names, External ontologies, isA database

HTML tables lack attribute names and table names. By adding such names, tables can be used to answer queries of the format (<property> of <class>), like "locations of hurricanes". An isA database and a relationship database are used to create labels by looking at the instances. The databases were crawled from the web and lead to better results than using existing ontologies like the Wikipedia category network, YAGO, or WordNet. More than one label can be assigned to each column. On the other hand, a large number of columns get no label at all because the databases cannot provide a suitable name. The quality of the labels is rated by humans; the more specific a label is, the better its rating.

Conclusion: Uses external sources to generate labels. The relationship database provides names for relations between two columns and will probably lead to the same results as the isA database when applied to relations between the subject and column X. Crawling our own isA database is probably not a good approach; therefore, existing databases have to be used, which will lead to poor recall according to this paper. In contrast to the paper, we will not have vertical tables, tables with different formats, tables used for layout, or empty tables. This should increase the quality of labelling and searching.

ObjectRank: Authority-Based Keyword Search in Databases Balmin2004

ObjectRank, PageRank, Authority Transfer

ObjectRank is an incremental ranking algorithm, similar to Google PageRank.

Conclusion: Not relevant for our research: incremental algorithm, domain-specific authority transfer rates necessary.

Automatically Labeling Hierarchical Clusters Treeratpituk2006

Automated Labelling, Ranking of label candidates, Hierarchy

Hierarchical clusters are labelled using the documents in the cluster, the parent cluster, and a general English corpus. Label candidates have to be frequent in the current cluster but less frequent in its parent cluster and in the corpus. To rank label candidates, a linear model over TFIDF and phrase length is computed ("DScore").
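
A DScore-style score can be sketched as a linear combination; the weights and counts below are purely illustrative, not the fitted parameters from the paper:

```python
import math

def dscore(candidate, cluster_tf, parent_tf, corpus_df, n_docs,
           w_tfidf=1.0, w_len=0.2):
    """Toy DScore-style ranking: frequent in the cluster, penalized
    for frequency in the parent cluster, rare in the corpus, plus a
    small bonus for longer phrases. All weights are illustrative."""
    tfidf = cluster_tf.get(candidate, 0) * math.log(n_docs / corpus_df.get(candidate, 1))
    tfidf -= parent_tf.get(candidate, 0)   # frequent in parent -> bad label
    return w_tfidf * tfidf + w_len * len(candidate.split())

# Made-up counts for a weather-news cluster:
cluster_tf = {"hurricane": 10, "storm": 8}
parent_tf = {"storm": 6}
corpus_df = {"hurricane": 5, "storm": 50}
s_h = dscore("hurricane", cluster_tf, parent_tf, corpus_df, n_docs=1000)
s_s = dscore("storm", cluster_tf, parent_tf, corpus_df, n_docs=1000)
# "hurricane" ranks higher: rare in the corpus and absent from the parent.
```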

Conclusion: The parameters of the linear model are fitted using training data, which is not possible in our scenario. Good idea: using synonyms from WordNet to measure the correctness of suggested labels.

Enhancing Cluster Labeling Using Wikipedia Carmel2009

Automated Labelling, Wikipedia, Ranking of label candidates

They introduce a general framework for cluster labeling. Besides the steps needed for clustering documents, it contains the following steps: important term extraction, candidate label extraction, and candidate evaluation. All important terms are candidate labels; in addition, some Wikipedia categories are added to the candidate pool. Several judges are used, and their scores are combined using a linear model. The evaluation framework introduced by Treeratpituk2006 is used.

Conclusion: General framework might be adopted for our research. Judges cannot be adopted as they use document ranking and centroid documents.

Data

Analysis of 600MB Webcrawl Data (4 million triples)

  • 13 CS with at least 10,000 subjects
  • Very similar lists of attributes, for example (title, description, url, type) and (title, url, type) and (title, url, type, facebook_app_id)
  • 9 out of 13 CS have a "type" attribute
  • The type attribute contains useful data (e.g. "article", "blog", "sports_team"), but there is more than one type per CS because the attribute lists of the CSs are similar (no specific attributes for specific values of "type")
  • Conclusion: "type" is a good source for naming tables, but (at least for web-crawled data and a very high frequency threshold) there is not only one type per table, and some types appear in more than one table ("article", the most common one, appears in 7 out of 9 CSs)
  • Foreign keys: no fks to subjects within the webcrawl, but some links to ontologies

Analysis of dbpedia "persondata EN" dataset (6 million triples)[edit]

  • Very similar lists of attributes, for example (name, birthPlace, deathPlace, givenName, surname) and (name, birthPlace, givenName, surname)
  • Foreign keys: some FKs, but they seem to be errors in the data (e.g. a person born in the city "Brecht, Belgium" has birthPlace="Bertolt Brecht")

Analysis of dbpedia "ontology infobox properties EN" dataset (20 million triples)[edit]

  • Foreign keys: >700,000 objects appear as subjects, too
  • freqThreshold 10,000
    • CS cover three topics: sportspersons, animals, geographical locations
    • Only the CSs on locations contain type attributes, but with values that cannot be used for labelling ("Cities in China", "Neighborhoods in San Francisco", ...)
  • freqThreshold 1,000
    • sportspersons, sportevent, animals, flowers, cities, music albums, movies, battles, persons, geographical locations, chemical compounds, ships, roads/routes, military units, video games, artists, military personnel, elections, fictional characters, languages, national parks, historic places, books, rivers, lakes, bands, songs, asteroids, buildings, mountains
    • The CS on cities, music albums and ships contain type attributes
  • Conclusion: Many similar CSs that should be joined (e.g. persons with and without a deathDate); the type attribute is missing for most of the CSs

Labeling

  • Data source for attribute names: URIs
  • Data sources for table names:
    • values of rdf:type and other "type" attributes
    • names of incoming foreign key links
    • names of ontology classes
  • Statistics for "type" attribute values and incoming foreign keys:
    • What percentage of the subjects in this table use this "type" attribute value? --> assign this percentage to the "type" attribute value
    • What percentage of the subjects in this table have a foreign key link to a specific table? --> assign this percentage to the foreign key relationship
    • Discriminatory power of a "type" attribute value: sum up the percentages of the same value in other tables. If the sum is lower than 10, the "type" attribute value is a candidate for a table name
    • Discriminatory power of a foreign key link name: sum up the usage (in percent) of the foreign key name as a link to other tables. If the sum is lower than 10, the foreign key name is a candidate for a table name (for the table where it is an incoming link)
    • Look up the ontology classes that have the properties of a table as class members. If the result contains multiple classes in a hierarchy, use the topmost class name as a candidate for a table name.

Per category ("type", FKs, ontologies), multiple label candidates may exist.
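
The discriminatory-power rule for "type" values can be sketched as follows; the table names and coverage figures are made up:

```python
def type_candidates(coverage, threshold=10.0):
    """coverage[table][value] = percentage of the table's subjects
    carrying this "type" value. A value is a name candidate for a
    table if the summed coverage of the same value across all *other*
    tables stays below the threshold (the rule above)."""
    candidates = {}
    for table, values in coverage.items():
        cands = []
        for value, pct in values.items():
            elsewhere = sum(v.get(value, 0.0)
                            for t, v in coverage.items() if t != table)
            if elsewhere < threshold:
                cands.append(value)
        candidates[table] = cands
    return candidates

coverage = {
    "t1": {"article": 80.0, "blog": 15.0},
    "t2": {"article": 70.0, "sports_team": 90.0},
}
cands = type_candidates(coverage)
# "article" is prominent in both tables, so it discriminates neither;
# "blog" and "sports_team" remain candidates.
```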

Rules for assigning a table name

  1. if ontology lookup leads to exactly one ontology class that fits to the table, use the name of that class
  2. if ontology lookup returns multiple candidates, intersect with the list of type values
    1. if only one ontology class is left, use it
    2. if more than one ontology class remains, use the one with the highest similarity score
  3. if there are type values:
    1. if there is only one type attribute: use its most frequent value
    2. if there are multiple type attributes per table: take the most frequent value per attribute, then take the one with the fewest occurrences throughout the schema
  4. if there is no type information: choose an incoming FK name, namely the one used by the highest number of incoming FKs. If multiple FK names remain, choose the one with the highest frequency
  5. if there is absolutely no information: "DUMMY"
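
The rule cascade can be sketched as follows; rules 2.2 (similarity score) and 3.2 (per-attribute handling) are simplified here, and all inputs are illustrative:

```python
def table_name(ontology_classes, type_values, fk_names):
    """Decision cascade following rules 1-5 above.
    ontology_classes: class names from the ontology lookup;
    type_values:      {"type" value: in-table frequency};
    fk_names:         {FK name: (#tables using it, frequency)}."""
    if len(ontology_classes) == 1:
        return ontology_classes[0]                      # rule 1
    if len(ontology_classes) > 1:
        both = [c for c in ontology_classes if c in type_values]
        if len(both) == 1:
            return both[0]                              # rule 2.1
        # rule 2.2 (highest similarity score) omitted in this sketch
    if type_values:                                     # rule 3 (simplified)
        return max(type_values, key=type_values.get)
    if fk_names:                                        # rule 4: most widely
        return max(fk_names, key=lambda n: fk_names[n]) # used, then frequency
    return "DUMMY"                                      # rule 5

name1 = table_name(["Person"], {}, {})                     # rule 1
name2 = table_name(["Person", "Place"], {"Place": 3}, {})  # rule 2.1
name3 = table_name([], {"article": 5, "blog": 2}, {})      # rule 3
name4 = table_name([], {}, {})                             # rule 5
```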

Notes:

  • Ontologies are considered only if more than 20% of the attributes of the table belong to that ontology (TODO)
  • FKs are considered only if they cover more than 10% of the outgoing links of a table/attribute
  • It is not important how often a type is used throughout the whole data set
  • May be an issue: the most frequent type (used as table name in rule 3) might cover only 10% or 20% of the data if there are many different type values
  • Ontology lookup uses a fuzzy tfidf-approach to compare tables and ontology classes
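
The fuzzy tf-idf ontology lookup mentioned above could look roughly like this; a toy cosine-similarity sketch, not the actual implementation, with made-up classes and attributes:

```python
import math
from collections import Counter

def tfidf_similarity(table_attrs, class_props, all_classes):
    """Toy tf-idf cosine match between a table's attribute names and
    an ontology class's properties. Properties shared by many classes
    (e.g. "name") get low idf weight and contribute little."""
    n = len(all_classes)
    df = Counter(p for props in all_classes.values() for p in set(props))
    def vec(names):
        return {p: math.log((n + 1) / (df.get(p, 0) + 1)) for p in set(names)}
    a, b = vec(table_attrs), vec(class_props)
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

classes = {
    "Person": ["name", "birthPlace", "surname"],
    "Place": ["name", "lat", "long"],
}
table = ["name", "birthPlace", "givenName"]
sim_person = tfidf_similarity(table, classes["Person"], classes)
sim_place = tfidf_similarity(table, classes["Place"], classes)
# The table matches "Person" better; the ubiquitous "name" attribute
# carries zero idf weight, so it does not help "Place".
```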

"Errors" in DBPedia ontology[edit]

  • properties that do not belong to any class (e.g. "homeStadium", "league")
    • --> do not take properties without classes into account when matching CS with ontology classes
  • to describe animal/plant taxonomy, the dbpedia ontology uses the latin word "classis", but the dbpedia data uses the english word "class"
  • stations do not have an "address" in the ontology, but they do have one in the data
  • bands do not have a "hometown" in the ontology, but they do have one in the data
  • historicPlaces do not have an "architecturalStyle" in the ontology, but they do have one in the data

Log

  • June 3rd 2013: using "type" attributes, incoming foreign key links, and the GoodRelations ontology, 28 out of 44 tables have one or more label candidates. The result will improve once more ontologies are included and the merged CSs (implemented by Duc) are used.
  • June 10th 2013: using merged CSs, the number of tables shrinks from 44 to 35

Searching, Ranking and Retrieval

Subgraph selection

  • Task: given a "root" table, select the root table plus n-1 tables that are reachable from the root table (using foreign keys) and cover as many subjects (rows, tuples) as possible.
  • Analysis: "(Rooted) Node-Weight Connected Subgraph" problem with a fixed number of vertices; NP-hard (see [1], B-RMWCS with budget n and edge weight 1)
  • Heuristic 1:
    • per node: store frequency, predecessor, number of steps on the path to this node, and the summed frequency of all nodes on the path
    • BFS: if a node can be reached via several paths, use the path with the highest (summed frequency / steps)
    • take the "top" node, i.e. the one with the highest (summed frequency / steps), and include it in the list of nodes that will be presented to the user; repeat for every node on the path to the "top" node
    • reassign the steps and summed frequencies
    • repeat until the maximum number of chosen nodes is reached
  • Analysis of heuristic 1: a situation where the heuristic misses the optimal solution (image below):
    • left: from node X (root), there is a path to a very frequent node (freq 1000) via two infrequent nodes (freq 10 each). (summed frequency / steps) is 1020/3 = 340 for the frequent node
    • right: two frequent nodes with freq 650 each, behind one infrequent node with freq 10. (summed frequency / steps) = 660/2 = 330 for both frequent nodes
    • the algorithm includes the three nodes on the left because 340 > 330. The right side contains more tuples than the left side and would therefore have been the optimal choice.
    X
   / \
  /   \
 o     o
 |    / \
 o   O   O
 |
( )
  • Heuristic 2:
    • compute a weight for every node that is reachable in one step (= children of the nodes already selected for the sub-schema); select the node with the highest weight
    • weight of a node: frequency of itself plus frequencies of its children (="capability" of the node to contribute to the goal)
  • Heuristic 3:
    • heuristic 2, but weight contains also grandchildren
  • Comparison: compute a sub-schema (size = 10) for every node in the dbpedia schema (size = 189)
    • heuristic 1 covers on average 59% of the data in the schema, heuristic 2 38%, heuristic 3 29%
    • all computations take 610 -- 680 ms; no large differences between the heuristics
  • Heuristic 4: (uses edges instead of table size)
    • Search for edges that begin at a table inside the overview schema and end at a table outside it. Choose the 'best' edge among these.
    • 'Best edge' = either (frequency of links) or (frequency of links times number of columns of table)
  • Overview algorithm:
    • use BFS to split graph into disconnected subgraphs using the "weakly connected" concept (ignore direction of edges), O(n)
    • per subgraph: compute transitive closure (Floyd-Warshall-Algorithm), O(n^3)
    • per subgraph, per node: count how many tables can be reached, O(n^2). Sort descending, O(n log n). If the count equals the size of the subgraph, one node covers the whole subgraph; use this node in the overview and continue with the next subgraph. If no single node covers the whole subgraph: order the nodes by the number of other nodes they can reach, and use the node that can reach the most other nodes in the subgraph. Continue taking nodes until the whole subgraph is covered, O(n).
    • Analysis: All nodes that have no incoming FK (this includes all nodes in subgraphs of size 1) are added to the overview schema. Nodes not covered this way are organized in cycles; per cycle, one node has to be added to the overview, too. This algorithm leads to a schema that makes every table visible (after some zooming steps), but the schema is large (for example dbpedia with 189 tables: overview schema size 94 tables)
  • Interface:
    • Input: root node, number of nodes n, list of nodes and their frequencies, list of edges
    • Output: list of n chosen nodes
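
Heuristic 2 above is the simplest of the four to sketch. A minimal version, assuming frequencies and adjacency are given as plain dicts (the graph below is illustrative):

```python
def subschema_heuristic2(root, adj, freq, n):
    """Greedy sub-schema selection (heuristic 2): repeatedly add the
    frontier node whose weight -- its own frequency plus the
    frequencies of its children -- is highest."""
    chosen = {root}
    while len(chosen) < n:
        # Frontier: nodes reachable in one step from the chosen set.
        frontier = {c for v in chosen for c in adj.get(v, ()) if c not in chosen}
        if not frontier:
            break
        def weight(v):
            return freq[v] + sum(freq[c] for c in adj.get(v, ()))
        chosen.add(max(frontier, key=weight))
    return chosen

# Illustrative schema graph: root X links to a (freq 10) and b (freq 10);
# b links on to the frequent tables c and d (freq 650 each).
adj = {"X": ["a", "b"], "b": ["c", "d"]}
freq = {"X": 1, "a": 10, "b": 10, "c": 650, "d": 650}
chosen = subschema_heuristic2("X", adj, freq, 3)
# b is picked first (weight 10 + 650 + 650), then one of its children;
# the low-weight dead end a is never chosen.
```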

Call via SQL interface

  • SQL procedure
    • Input: number of tables in subschema
    • Output: new schema is generated containing the view of the chosen tables (using the heuristics above)
  • todo
    • choice of root table (search engine, table with most tuples, ...)
    • choose a different schema name every time the procedure is called
    • inform the user about the schema title
    • schema deletion: when?
    • set number of tables automatically instead of asking the user?
    • how to include in ODBC/JDBC interface?

Interface

(Old) interface description

  • Interface between Creation of CS and Labeling (my input): defined
  • Interface between Labeling and SQL generation (my output): to be defined