Obi4wan Meeting Notes 2016-09-16
Latest revision as of 12:53, 16 September 2016
== Attendees ==
- Peter Boncz
- Eric de Kruijf
- Dean De Leo
- Frank Smit
== Minutes ==
- overview of the SQL extension to perform shortest path queries, currently being implemented for MonetDB.
- the run-time operator for shortest path queries, based on Dijkstra's algorithm, has been implemented for MonetDB, together with a tentative parallel algorithm. At present it can only be accessed through the MonetDB Assembly Language (MAL). The SQL front-end to handle the operator is not present yet.
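For reference, a minimal sketch of Dijkstra's algorithm over an edge table. This is plain Python for illustration only, not the MonetDB operator; the function name `dijkstra` and the `(src, dst, weight)` row layout are assumptions, and weights are assumed non-negative.

```python
import heapq


def dijkstra(edges, source, target):
    """Return (cost, path) of a cheapest path from source to target, or None."""
    # Build an adjacency list from (src, dst, weight) rows.
    adjacency = {}
    for src, dst, weight in edges:
        adjacency.setdefault(src, []).append((dst, weight))

    best = {source: 0}      # cheapest known cost per vertex
    parent = {}             # predecessor on the cheapest known path
    queue = [(0, source)]   # min-heap ordered by cost
    while queue:
        cost, vertex = heapq.heappop(queue)
        if cost > best.get(vertex, float("inf")):
            continue        # stale heap entry, a cheaper one was already handled
        if vertex == target:
            # Walk the predecessor chain back to the source.
            path = [target]
            while path[-1] != source:
                path.append(parent[path[-1]])
            return cost, path[::-1]
        for neighbour, weight in adjacency.get(vertex, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                parent[neighbour] = vertex
                heapq.heappush(queue, (new_cost, neighbour))
    return None             # target not reachable from source
```

Edges are treated as directed; an undirected graph would need each edge listed in both directions.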
- now using FiloDB on top of Cassandra. This resolved the earlier issue in which the workload would hang because of the lazy evaluation of updates.
- the validation set from the LDBC benchmark now runs successfully.
- developed a mini benchmark to experiment with large sequences of updates in FiloDB: 10,000 updates on a single node with 1 GB of data took roughly 1 hour overall, i.e. about 0.36 s per update on average.
- expecting a potential issue related to the garbage collection mechanism in Cassandra, which takes significant time to execute. While this is still acceptable for the current experiments, it might become a blocking problem with larger workloads. Restarting the JVM does not seem to alleviate the issue.
- looked into several optimisations:
1) executing updates in batches over FiloDB. Updates are cached by ad-hoc Java code and flushed when a certain threshold is reached or when a SELECT query needs to be executed.
2) also caching the edge weights for certain queries. The caching implementation relies on Java code written by Eric rather than on a mechanism provided by Spark SQL.
3) in the benchmark, posts are organised into hierarchies, each started by a single root post, and certain queries require that root post. As an optimisation, Eric attempted to precompute and store the root post. However, his solution still involves a join operation, which apparently eliminates the benefit of this approach. Peter suggested attaching the root post directly to each post as an additional property.
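The batching in optimisation 1) can be sketched roughly as follows. This is an illustrative Python sketch, not Eric's Java code: `BufferedStore`, `InMemoryBackend`, `apply_batch` and `select` are hypothetical names, and the in-memory backend merely stands in for FiloDB.

```python
class BufferedStore:
    """Buffer updates and flush them in batches (hypothetical sketch)."""

    def __init__(self, backend, threshold):
        self.backend = backend      # assumed to offer apply_batch(rows) and select(query)
        self.threshold = threshold  # number of pending updates that triggers a flush
        self.pending = []

    def update(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        if self.pending:
            self.backend.apply_batch(self.pending)
            self.pending = []

    def select(self, query):
        # Flush first so that the query sees every buffered update.
        self.flush()
        return self.backend.select(query)


class InMemoryBackend:
    """Stand-in for the real store, used only to exercise the buffer."""

    def __init__(self):
        self.rows = []
        self.batches = 0

    def apply_batch(self, rows):
        self.rows.extend(rows)
        self.batches += 1

    def select(self, predicate):
        return [row for row in self.rows if predicate(row)]
```

Flushing before every SELECT keeps reads consistent with buffered writes, at the cost of smaller batches when reads are frequent.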
- in general it has been observed that queries involving joins are not efficient.
- a likely reason is that Spark cannot make assumptions about how data is partitioned by Cassandra. Cassandra also relies on a dynamic partitioning mechanism (DHT), which might not properly match the static mechanism employed by Spark. Overall this forces Spark to broadcast and repartition data more than is actually necessary, dramatically decreasing the overall performance.
- an idea might be to split the data into two groups: one with read-only/static data handled entirely by Spark, and a second with updates/dynamic data handled by Cassandra.
- rearranging queries to cope with the different groups (static, dynamic) is very complex, as the number of combinations to consider in join operations grows substantially.
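To illustrate the growth: a join distributes over union, so if each of k joined relations is split into a static and a dynamic part, the original join expands into 2^k sub-joins. A small Python sketch that enumerates them; the function name and the relation names in the usage note are hypothetical.

```python
from itertools import product


def join_variants(relations):
    """List every static/dynamic assignment that a join over `relations` must cover."""
    return [
        dict(zip(relations, parts))
        for parts in product(("static", "dynamic"), repeat=len(relations))
    ]
```

For example, `join_variants(["post", "person", "forum"])` yields 8 assignments. Only the all-static one touches no dynamic data, so the remaining 7 each involve at least one relation held by the dynamic store.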
- Spark Streaming has been considered but not tried. The problem is that queries cannot span multiple windows.
- Another idea might be to investigate a platform such as Impala with Kudu. Kudu, like Cassandra, is a key-value store, but in this case the engine (Impala) has a full view of where data is stored and may perform additional optimisations.
- the real workload from Obi4Wan will involve significantly fewer real-time updates than the LDBC benchmark. In the Obi4Wan workload only post updates occur frequently, while the other relations, such as the followers, are usually updated once a month or so. In this sense the LDBC benchmark is not very representative of their workload.
- Adopted FAIR scheduling in Spark, so that long-running jobs do not block smaller jobs from executing. As a consequence, the implementation has been made thread-safe. Using synchronized Java blocks did not affect performance much.
- Decreased the stop condition (max depth) of the shortest path execution from 10 to 5 or 6. The issue was that, when no path existed between two nodes, a query would take considerable time to complete.
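The effect of the depth limit can be sketched with a breadth-first search that stops expanding at `max_depth` hops. This is an illustrative Python sketch assuming an unweighted graph, not the Spark implementation; the function name and adjacency-dict layout are assumptions.

```python
from collections import deque


def bounded_path_length(adjacency, source, target, max_depth):
    """Return the hop count of a shortest path within max_depth hops, else None."""
    if source == target:
        return 0
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(vertex, ()):
            if neighbour == target:
                return depth + 1
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return None  # no path within max_depth hops
```

With a lower limit, a query between unconnected nodes only explores vertices within that many hops of the source before giving up. The trade-off is that existing paths longer than the limit are reported as missing.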
- Implementation of the SQL front-end for the shortest path operator in MonetDB.
- Initial experiments.
- Executing the LDBC benchmarks
- Set-up of the hardware at their premises; the firewall has been configured.
- Working on the report for the COMMIT commission
- Friday 14/10/2016, Obi4Wan Zaandam