Sinterklaas programming challenge

From MonetDB
Jump to: navigation, search

“Fashions fade, style is eternal.” ― Yves Saint-Laurent

Assignment[edit]

The dataset on which the assignment is based on a csv file, structured with 4 fields:

  1. a trip identifier
  2. longitude
  3. latitude
  4. timestamp

Each line describes a point belonging to a trajectory.

The assignment is to retrieve the points contained in a rectangle identified by its bottom left and top right corner coordinates (i.e. range query on latitude and longitude). Points laying on an edge of the rectangle are included.

The file can be accessed here:

/net/warsaw/export/scratch2/sellam/sinterklaas-challenge/data.gz

The precision of the coordinates is 7, with 5 decimals (XX.XXX XX).

Some basic information here:

Nb of points Nb of trajectories Width Height Tmin Tmax
240,540,781 443,912 42.27402 43.04272 1,223,244,000,000 1,223,848,799,000


Longitude Latitude
Bottom left corner -12.62427 27.09371
Top right corner 29.64975 70.13643

Implementation[edit]

Any solution based on an publicly available code will be accepted. However, the proposition should be integrable in the MonetDB stack. The interface should be as follows:

  1. one script load.sh called once at the beginning of an evaluation session. It assumes the data set was copied and unzipped without modification to the candidate's desktop machine. The data file will be erased after this phase.
  2. one script query.sh reading query rectangles from the standard input:
query.sh < filename

filename refers to a CSV list of rectangles, organized as follows:

lon1, lat1, lon2, lat2
lon1, lat1, lon2, lat2
...

(lon1, lat1) identifies the bottom left corner of the query, (lon2, lat2) the top right.
The output is the set of points in CSV format, each point being separated by a line break. The output of each query ends with a line break. For instance, if 2 queries return 4 points each, the output will have the following format:

tripid, lon, lat, time
tripid, lon, lat, time
tripid, lon, lat, time
tripid, lon, lat, time 

tripid, lon, lat, time
tripid, lon, lat, time
tripid, lon, lat, time
tripid, lon, lat, time

Example[edit]

Here is an example of solution:

  • load.sh:
#!/bin/bash
db_path=/net/gunnarr/export/scratch0/sellam/documents/data-sets/tomtom-example/data

creation_script="
   create table trips (tripid bigint, lon decimal(8,5), lat decimal(7,5), time bigint);
   copy 240540781 records into trips from '$db_path' using delimiters ',','\n';"

db_name=sinterklaas

monetdb create $db_name
monetdb release $db_name
mclient $db_name -lsql -s"$creation_script"
  • query.sh:
#!/bin/bash

OLD_IFS=$IFS
IFS=','
db_name=sinterklaas

while read x1 y1 x2 y2; do

	query="
	   select * from trips 
	   where lon between $x1 and $x2	
	   and lat between $y1 and $y2;
	"
	mclient $db_name -fcsv -lsql -s"$query"
	echo ""
done
IFS=$OLD_IFS


Calling this script returns:

bash-4.1$ ./query.sh < myqueries
1006020485718345396,4.69634,47.59831,1223253874000
1006020485718345396,4.69729,47.59888,1223253869000
1006020485718345396,4.69869,47.59921,1223253864000
1006020485718345396,4.70005,47.59956,1223253859000
1006020485718345396,4.70120,47.60010,1223253854000
...

Reward[edit]

Something French.

Evaluation[edit]

Technical considerations[edit]

  • There will be 2 rounds. During the first round, any technology may be used (including C, MonetDB, Vectorwise, Smalltalk...). A demo will be made, along with a proposal of implementation into the MonetDB stack (short oral presentation). During the second round, the selected solutions will be implemented in MonetDB.
  • The new desktops machines will be used. Contestants are trusted to use fair hardware.
  • Use of vintage gear will be acknowledged
  • Use of multithreading is welcome, CPU and/or GPU. One machine only should be used
  • Demos will take place on the competitor's machine
  • One measurement will be made. It is the end-to-end time measurement of the following:
  1. One execution of load.sh
  2. Total time needed to process a 300 query rectangles file
  3. Total time needed to process a 300 query rectangles file
  4. Total time needed to process a 300 query rectangles file

A memory flush will be performed between each of these steps.

  • For the first run, 300 random queries will be generated and 300 will be picked among those, thus generating duplicates. The operation will be repeated for the 2 remaining runs. The length and width of the query rectangles will not exceed 0.04 degree. Coordinates will be positive.

The query generator may be found here:

/net/warsaw/export/scratch2/sellam/sinterklaas-challenge/query-generator.sh random-seed [data-file-path]

The first argument is a seed for the random number generator, the second is the path to the decompressed data file (my copy by default). The output is 300 queries.
For instance, if the generator and query.sh are in the same file :

./query-generator.sh 131313 /export/scratch0/donaldduck/data-set.csv | ./query.sh

(this is not recommended for performance though...)

  • Memory flush can be performed with Meat. You may find the source code here:
/net/warsaw/export/scratch2/sellam/sinterklaas-challenge/Meat.c

The tool should be used as follows:

./Meat <number_of_times> <block_size_KB>

For example 21 times 1GB will make all memory on your desktop being wiped

./Meat 21 $[1024*1024]

Organization[edit]

(dates may be subject to changes)

  • During round 1, the performance and ability to integrate in MonetDB of each solution is assessed. During round 2, elegance is considered.
  • After round 1, the top 50% candidates wrt to timing are selected. Candidates with no convincing integration plan in MonetDB are not considered. Deadline is 15 January included, announcements will take place on 23 January (during MADAM).
  • During the second round, the solutions will be implemented in MonetDB, possibly after a team reshuffling (this should allow combining different level of experience with MonetDB). The final winner will be chosen by a majority vote during the Madam of 20 Februray.
  • Participants may upload their scores anonymously on the following page: Sinterklaas score board