MAL profiler

MAL profiler mk Tue, 03/30/2010 - 12:18

A key step on the road to a high-performance application is understanding where resources are being spent. This information can be obtained with different tools and at different levels of abstraction. A coarse-grained view of a particular application can be obtained by injecting performance-capturing statements into the MAL instruction sequence. Fine-grained, platform-specific information can be obtained with existing profilers, such as valgrind (http://www.valgrind.org), or with hardware performance counters.

MonetDB comes with a set of profiling tools, including stethoscope, tomograph and tachograph, geared toward understanding the internal workings, the scheduling, and the potential resource bottlenecks of running queries. They are all based on a profiler event stream that the system starts upon request. Together, they support both online and off-line inspection of database queries.

The execution profiler is supported by hooks in the MAL interpreter. The default strategy is to ship an event record immediately over a stream to a separate performance monitor, formatted as a tuple.

SECURITY WARNING

Profiling a system is a potential security risk. Therefore, at any time only one user can access the profile information.

Trace Format

Trace Format zhang Sun, 07/19/2015 - 18:05

Upon request, a MonetDB server will produce an event stream with performance profiling information.

The event stream consists of individual text lines with comma-separated values, wrapped in decorative tuple-format square brackets (‘[’ and ‘]’). The field types are fixed. The bulk of the event records are recordings made at the “start” and “done” states of a MAL instruction call. The last column, “stmt”, contains the actual call and its arguments; it helps in understanding why some instructions take a lot of time. An event record contains the following fields:

 1. event: an event counter. Example: 38
 2. time: the wall-clock time, in microseconds. Example: "13:11:16.710180"
 3. pc: the name of the query execution plan, followed by the program counter in square brackets (‘[’ and ‘]’) indicating the position of the MAL instruction in its defining block, and finally a unique call identifier to distinguish recursion. Example: "user.s5_1[14]12"
 4. thread: the id of the worker thread processing this MAL instruction. Example: 3
 5. state: the instruction status ("start" or "done") *. Example: "done"
 6. usec: the estimated (at "start") or actual (at "done") execution time of the MAL instruction, in microseconds. Example: 207
 7. rssMB: the memory Resident Set Size (RSS), i.e., the portion of the process's memory that is held in main memory, in MB. Example: 54
 8. tmpspace: the estimated cumulative query plan footprint, in MB. The maximal value of "tmpspace" during a query gives a fairly good estimate of the memory consumption of that query. Example: 0
 9. inblock: the number of disk blocks read **. Example: 0
10. oublock: the number of disk blocks written **. Example: 0
11. majflt: the number of major page faults. Example: 0
12. nswap: the number of swaps. Example: 0
13. switch: the number of context switches. Example: 0
14. stmt: the MAL statement being executed, with argument statistics. Example: "sql.exportResult(X_21=\"104d2\":streams,X_16=4:int);"


* In addition, the “state” field can denote two special events. The “wait” event is emitted when a worker thread stumbles upon an empty instruction queue, e.g., because all remaining MAL instructions are blocked waiting for another instruction to produce an intermediate result. The “ping” event provides a synopsis of the CPU processing load.

** Please be aware that these values reflect system-wide activity, so they include more than just MonetDB's activities. Additionally, these I/O counts do not reflect the actual amount of data read from or written to the hardware drives, as that information is generally not available to ordinary users of a Linux system.
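As an illustration of how such an event line can be consumed programmatically, the sketch below splits a record into the fields listed above. This is an illustrative parser written for this page, not part of MonetDB; it relies only on the documented property that the fields before “stmt” never contain commas.

```python
# Field order as listed in the trace-format table above.
FIELDS = ["event", "time", "pc", "thread", "state", "usec", "rssMB",
          "tmpspace", "inblock", "oublock", "majflt", "nswap", "switch", "stmt"]

def parse_event(line):
    """Parse one profiler event line into a dict keyed by field name.
    Returns None for non-event lines (comments, the header line, ...)."""
    body = line.strip()
    if not (body.startswith("[") and body.endswith("]")):
        return None
    body = body[1:-1].strip()
    # The first 13 fields never contain commas, so splitting on the first
    # 13 commas leaves the MAL statement (which may contain commas) intact.
    parts = body.split(",", 13)
    if len(parts) < len(FIELDS):
        return None
    rec = dict(zip(FIELDS, (p.strip() for p in parts)))
    # Drop the trailing tuple comma and the surrounding quotes of the statement.
    rec["stmt"] = rec["stmt"].rstrip(",").strip().strip('"')
    return rec
```

Applied to event 38 from the example below, this yields `usec` 207 and the `sql.exportResult(...)` call as the statement.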

Stethoscope

Stethoscope mk Tue, 03/30/2010 - 12:20

The program stethoscope is a simple Linux application that attaches itself to a running MonetDB server and extracts the profiler events of concurrently running queries. Stethoscope is an online-only inspection tool, i.e., it monitors the execution state of the current queries and writes the information to stdout for immediate inspection. For example, the following command tracks the microsecond ticks of all MAL instructions on a database called “voc”:

    $ stethoscope -u monetdb -P monetdb -d voc

A synopsis of the calling conventions:

    stethoscope [options]
      -d | --dbname=<database_name>
      -u | --user=<user>
      -P | --password=<password>
      -p | --port=<portnr>
      -h | --host=<hostname>
      -o | --output=<local file name>
      -b | --beat=<delay> in milliseconds (default 50)

Before stethoscope starts parsing command-line options, it first tries to read configuration information from a “.monetdb” file. If the environment variable DOTMONETDBFILE is set, it reads the file pointed to by that variable instead. When unset, stethoscope searches for a “.monetdb” file in the current working directory and, if that does not exist, in the current user's home directory. This file can contain defaults for command-line keys, e.g., user, password, language, save_history, format and width.
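The lookup order described above can be sketched as follows. This is a minimal illustration of the resolution logic, not stethoscope's actual implementation:

```python
import os

def monetdb_config_path():
    """Resolve the configuration file in the documented order:
    $DOTMONETDBFILE first, then ./.monetdb, then ~/.monetdb."""
    env = os.environ.get("DOTMONETDBFILE")
    if env:
        return env
    for candidate in (os.path.join(os.getcwd(), ".monetdb"),
                      os.path.join(os.path.expanduser("~"), ".monetdb")):
        if os.path.exists(candidate):
            return candidate
    return None  # no configuration file found; rely on command-line options
```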

The output produced by stethoscope consists of heartbeat events and MAL instruction traces. The heartbeat is issued at regular intervals; its main piece of information is the recent CPU load of all cores, as made available by the (Linux) operating system. The text below shows an excerpt of the stethoscope output for the query “SELECT id FROM _tables;”:

$ stethoscope -d voc -b 50000
-- opened UDP profile stream numa-db:50010 for localhost
# event,        time,   pc,     thread, state,  usec,   rssMB,  tmpspace,       inblock,        oublock,        majflt, nswap,  switch, stmt,   # name
...
[ 6,    "13:11:15.060773",      "user.ping[0]0",        4,      "ping", 0,      58,     0,      0,      0,      0,      0,      1,      "[  0.00  0.00  0.00  0.00  0.01  0.01  0.01  0.00  1.00  1.00  0.01  0.01  0.01  1.00  1.00  1.00  1.00  0.01  1.00  0.01  0.00  0.07  0.00  0.00  0.01  0.01  0.00  0.01  0.00  0.00  0.05  0.02  0.00  0.00  0.00  0.04  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.00  0.00  0.00  0.01 ]",  ]
...
[ 8,    "13:11:16.703220",      "user.s5_1[0]12",       3,      "start",        0,      54,     0,      0,      0,      0,      0,      484,     "function user.s5_1{autoCommit=true}():void;",       ]
...
[ 10,   "13:11:16.704600",      "user.s5_1[1]12",       3,      "start",        0,      54,     0,      0,      0,      0,      0,      0,       "X_23=0@0:void := querylog.define(\"select id from _tables;\":str,\"default_pipe\":str,16:int);",    ]
[ 11,   "13:11:16.704742",      "user.s5_1[1]12",       3,      "done ",        140,    54,     0,      0,      0,      0,      0,      0,       "X_23=0@0:void := querylog.define(\"select id from _tables;\":str,\"default_pipe\":str,16:int);",    ]
[ 12,   "13:11:16.704881",      "user.s5_1[2]12",       3,      "start",        0,      54,     0,      0,      0,      0,      0,      0,       "barrier X_37=false:bit{transparent} := language.dataflow();",       ]
[ 13,   "13:11:16.705111",      "user.s5_1[3]12",       28,     "start",        0,      54,     0,      0,      0,      0,      0,      2,       "X_2=0:int := sql.mvc();",     ]
[ 14,   "13:11:16.705305",      "user.s5_1[3]12",       28,     "done ",        193,    54,     0,      0,      0,      0,      0,      0,       "X_2=0:int := sql.mvc();",     ]
[ 15,   "13:11:16.705419",      "user.s5_1[4]12",       28,     "start",        0,      54,     0,      0,      0,      0,      0,      0,       "X_3=nil:bat[:oid,:oid] := sql.tid(X_2=0:int,\"sys\":str,\"_tables\":str);", ]
[ 16,   "13:11:16.705468",      "user.s5_1[5]12",       6,      "start",        0,      54,     0,      0,      0,      0,      0,      0,       "X_6=nil:bat[:oid,:int]{file=\"/home/demouser27/tst/./bat/07/736\", rows=45, hlb>=0@0:oid, hub<45@0:oid} := sql.bind(X_2=0:int,\"sys\":str,\"_tables\":str,\"id\":str,0:int);",      ]
[ 17,   "13:11:16.705459",      "user.s5_1[7]12",       22,     "start",        0,      54,     0,      0,      0,      0,      0,      0,       "X_12=nil:bat[:oid,:int]{file=\"/home/demouser27/tst/./bat/07/736\", hlb>=45@0:oid, hub<1048621@0:oid} := sql.bind(X_2=0:int,\"sys\":str,\"_tables\":str,\"id\":str,1:int);",        ]
[ 18,   "13:11:16.705719",      "user.s5_1[5]12",       6,      "done ",        250,    54,     0,      0,      0,      0,      0,      1,       "X_6=[45]:bat[:oid,:int]{file=\"/home/demouser27/tst/./bat/07/736\", rows=45, hlb>=0@0:oid, hub<45@0:oid} := sql.bind(X_2=0:int,\"sys\":str,\"_tables\":str,\"id\":str,0:int);",    ]
[ 19,   "13:11:16.705799",      "user.s5_1[7]12",       22,     "done ",        339,    54,     0,      0,      0,      0,      0,      2,       "X_12=[0]:bat[:oid,:int]{file=\"/home/demouser27/tst/./bat/07/736\", hlb>=45@0:oid, hub<1048621@0:oid} := sql.bind(X_2=0:int,\"sys\":str,\"_tables\":str,\"id\":str,1:int);",      ]
[ 20,   "13:11:16.706113",      "user.s5_1[4]12",       28,     "done ",        693,    54,     0,      0,      0,      0,      0,      5,       "X_3=[43]:bat[:oid,:oid] := sql.tid(X_2=0:int,\"sys\":str,\"_tables\":str);",      ]
[ 21,   "13:11:16.705459",      "user.s5_1[6]12",       12,     "start",        0,      54,     0,      0,      0,      0,      0,      1,       "(X_9=nil:bat[:oid,:oid]{file=\"/home/demouser27/tst/./bat/07/736\", rows=45, hlb>=0@0:oid, hub<45@0:oid},r1_9=nil:bat[:oid,:int]) := sql.bind(X_2=0:int,\"sys\":str,\"_tables\":str,\"id\":str,2:int);",    ]
[ 22,   "13:11:16.706800",      "user.s5_1[6]12",       12,     "done ",        1339,   54,     0,      0,      0,      0,      0,      2,       "(X_9=[0]:bat[:oid,:oid]{file=\"/home/demouser27/tst/./bat/07/736\", rows=45, hlb>=0@0:oid, hub<45@0:oid},r1_9=[0]:bat[:oid,:int]) := sql.bind(X_2=0:int,\"sys\":str,\"_tables\":str,\"id\":str,2:int);",      ]
[ 23,   "13:11:16.706948",      "user.wait[0]0",        6,      "wait", 0,      54,     0,      0,      0,      0,      0,      3,      "[  0.01  0.00  0.00  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.00  0.01  0.00  0.00  0.01  0.01  0.01  0.00  0.01  0.00  0.00  0.01  0.01  0.00  0.00  0.00  0.00  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01 ]",      ]
...
[ 25,   "13:11:16.708063",      "user.s5_1[8]12",       45,     "start",        0,      54,     0,      0,      0,      0,      0,      0,       "X_14=nil:bat[:oid,:int] := sql.delta(X_6=[45]:bat[:oid,:int],X_9=[0]:bat[:oid,:oid],r1_9=[0]:bat[:oid,:int],X_12=[0]:bat[:oid,:int]);",      ]
[ 26,   "13:11:16.708329",      "user.s5_1[8]12",       45,     "done ",        265,    54,     0,      0,      0,      0,      0,      1,       "X_14=[45]:bat[:oid,:int] := sql.delta(X_6=[45]:bat[:oid,:int],X_9=[0]:bat[:oid,:oid],r1_9=[0]:bat[:oid,:int],X_12=[0]:bat[:oid,:int]);",    ]
[ 27,   "13:11:16.708445",      "user.s5_1[9]12",       45,     "start",        0,      54,     0,      0,      0,      0,      0,      0,       "X_15=nil:bat[:oid,:int] := algebra.leftfetchjoin(X_3=[43]:bat[:oid,:oid],X_14=[45]:bat[:oid,:int]);",    ]
[ 28,   "13:11:16.708561",      "user.s5_1[9]12",       45,     "done ",        116,    54,     0,      0,      0,      0,      0,      1,       "X_15=[43]:bat[:oid,:int] := algebra.leftfetchjoin(X_3=[43]:bat[:oid,:oid],X_14=[45]:bat[:oid,:int]);",  ]
...
[ 30,   "13:11:16.708994",      "user.s5_1[2]12",       3,      "done ",        4112,   54,     0,      0,      0,      0,      0,      3,       "barrier X_37=false:bit{transparent} := language.dataflow();",       ]
[ 31,   "13:11:16.709419",      "user.s5_1[11]12",      3,      "start",        0,      54,     0,      0,      0,      0,      0,      2,       "X_16=0:int := sql.resultSet(1:int,1:int,X_15=[43]:bat[:oid,:int]);",       ]
[ 32,   "13:11:16.709529",      "user.s5_1[11]12",      3,      "done ",        109,    54,     0,      0,      0,      0,      0,      0,       "X_16=4:int := sql.resultSet(1:int,1:int,X_15=[43]:bat[:oid,:int]);",       ]
[ 33,   "13:11:16.709617",      "user.s5_1[12]12",      3,      "start",        0,      54,     0,      0,      0,      0,      0,      0,       "sql.rsColumn(X_16=4:int,\"sys._tables\":str,\"id\":str,\"int\":str,32:int,0:int,X_15=[43]:bat[:oid,:int]);",       ]
[ 34,   "13:11:16.709717",      "user.s5_1[12]12",      3,      "done ",        100,    54,     0,      0,      0,      0,      0,      0,       "sql.rsColumn(X_16=4:int,\"sys._tables\":str,\"id\":str,\"int\":str,32:int,0:int,X_15=[43]:bat[:oid,:int]);",       ]
[ 35,   "13:11:16.709816",      "user.s5_1[13]12",      3,      "start",        0,      54,     0,      0,      0,      0,      0,      0,       "X_21=nil:streams := io.stdout();", ]
[ 36,   "13:11:16.709896",      "user.s5_1[13]12",      3,      "done ",        79,     54,     0,      0,      0,      0,      0,      0,       "X_21=\"104d2\":streams := io.stdout();",    ]
[ 37,   "13:11:16.709973",      "user.s5_1[14]12",      3,      "start",        0,      54,     0,      0,      0,      0,      0,      0,       "sql.exportResult(X_21=\"104d2\":streams,X_16=4:int);",      ]
[ 38,   "13:11:16.710180",      "user.s5_1[14]12",      3,      "done ",        207,    54,     0,      0,      0,      0,      0,      0,       "sql.exportResult(X_21=\"104d2\":streams,X_16=4:int);",      ]
[ 39,   "13:11:16.710264",      "user.s5_1[15]12",      3,      "start",        0,      54,     0,      0,      0,      0,      0,      0,       "end user.s5_1;",      ]
[ 40,   "13:11:16.710325",      "user.s5_1[15]12",      3,      "done ",        61,     54,     0,      0,      0,      0,      0,      0,       "end user.s5_1;",      ]
...
[ 43,   "13:11:16.711494",      "user.s5_1[0]12",       3,      "done ",        8273,   54,     0,      0,      0,      0,      0,      0,       "function user.s5_1{autoCommit=true}():void;",       ]
...
^Csignal 2 received


Tomograph

Tomograph mk Fri, 08/31/2012 - 13:03

The program tomograph uses the information in the event stream to illustrate the parallel processing behavior of individual queries. This information can be used to identify potentially expensive and blocking portions of a query. It helps developers understand the actual parallel behavior and may reveal areas for improvement. Tomograph is an off-line-only inspection tool and is only available on Linux platforms.

Assume a MonetDB server is running to serve the database “voc”. First, in terminal 1, start tomograph on the database as follows:

    $ tomograph -d voc
    -- Output directed towards cache/voc_*
    -- Stop capturing with or after 32 pages

Then, in terminal 2, start an mclient to execute a query:

    $ mclient -d voc -s "select count(*) from tables;"

This triggers tomograph to respond with something like the following in terminal 1:

    -- page 0 :set time zone interval '+02:00' hour to minute
    -- page 1 :select count(*) from tables

This indicates that tomograph has captured two queries. The first query, “set time zone...”, is executed automatically at the start of each mclient session to set the time zone for that client. For each captured query, tomograph generates one page with its execution information. The tomograph output of the set time zone query can be safely ignored.

After our real SQL query has finished, go to terminal 1 and press Ctrl-c to terminate tomograph. Tomograph will then try to plot the execution information of each captured query into a separate PDF file and glue the individual PDF files into one file. This requires the tools gnuplot and gs. If this succeeds, tomograph terminates with a message like the following:

    ^Csignal 2 received
    -- exec:gnuplot cache/voc_00.gpl;
    -- exec:gs -q -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=cache/voc.pdf -dBATCH cache/voc_00.pdf
    -- done: cache/voc.pdf

Otherwise, tomograph terminates with an error message and instructions on how to manually generate the plots and glue the PDF files together, like the following:

    ^Csignal 2 received
    -- exec:gnuplot cache/voc_00.gpl;
    <error message>

    To finish the atlas make sure gnuplot is available and run:
    gnuplot cache/voc_00.gpl
    gnuplot cache/voc_01.gpl
    gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=cache/voc.pdf -dBATCH cache/voc_00.pdf cache/voc_01.pdf

For each captured query, tomograph produces three files:

    $ ls cache
    voc_00.dat  voc_00.gpl  voc_00.trace  voc_01.dat  voc_01.gpl  voc_01.trace

The “.trace” files contain the event stream with the raw query execution information. The “.dat” and “.gpl” files are generated by tomograph from the information in the “.trace” files. These files can be regenerated by passing the corresponding “.trace” file to tomograph's “--input” option. The figure below shows an example tomograph graph for one query:

In a tomograph graph, the topmost part (“memory in GB” and “in/oublock”) shows the memory RSS (blue dots) and the I/O activity (reads as grey dots, writes as red dots) as reported by the OS during each heartbeat event. Note that all values displayed here reflect system-wide activity, so they include more than just MonetDB's activities. Also note that the I/O counts shown here do not reflect the actual amount of data read from or written to the hardware drives, as that information is generally not available to ordinary users of a Linux system.

The second part (“cores”) shows a heat map of the available CPU cores in the system, one line per CPU core. This information is gathered from a heartbeat monitor in the server; the arrival rate can be controlled using the “--beat” option. The colours indicate the level of core utilization (white: <2%, yellow: >2%, gold: >25%, orange: >50%, orangered: >75%, red: >90%). Please be aware that this information covers the activity of the whole system, not only that of MonetDB.

The third part (“worker threads”) shows the activities of all MonetDB worker threads, and therefore contains the most interesting information for finding bottlenecks in query performance. Note that the activities of the software threads here are not directly linked to the CPU activities shown in the second part, as tomograph has no information about which core is executing which thread (although some deductions can be made). Along the time axis, we show for each worker thread the MAL instructions in progress. A hollow box of smaller height indicates that the thread was waiting for the next instruction to become available for execution. The small red-filled boxes, often placed at the end of a thread, indicate that no instructions were left at that moment.

Finally, at the bottom of a tomograph, we provide detailed statistics on the most frequently executed MAL instructions. Among others, the “parallelism” figure is calculated as: (sum of the maximum worker running time) × (number of cores) ÷ (reported number of ticks spent on each MAL instruction).
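A first step toward such per-instruction statistics can be computed directly from the trace: sum the reported ticks of all “done” events per MAL module.function. The sketch below assumes the events have already been parsed into (state, usec, stmt) tuples; it is an illustration, not tomograph's own aggregation code:

```python
import re
from collections import defaultdict

def instruction_totals(events):
    """Sum the reported ticks (usec) per MAL module.function over all
    'done' events. `events` is an iterable of (state, usec, stmt) tuples."""
    totals = defaultdict(int)
    for state, usec, stmt in events:
        if state.strip().strip('"').strip() != "done":
            continue
        # Pick the module.function name out of the call, e.g. "sql.bind(".
        m = re.search(r"(\w+\.\w+)\(", stmt)
        if m:
            totals[m.group(1)] += int(usec)
    return dict(totals)
```

For the excerpt above, this would report, e.g., that the three sql.bind calls together account for 250 + 339 + 1339 microseconds.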

A synopsis of the calling conventions:

    tomograph [options]
      -d | --dbname=<database_name>
      -u | --user=<user>
      -P | --password=<password>
      -p | --port=<portnr>
      -h | --host=<hostname>
      -T | --title=<plot title>
      -r | --range=<starttime>-<endtime>[ms,s]
      -i | --input=<profiler event file>
      -o | --output=<dir/file prefix > (default 'cache/<dbname>')
      -b | --beat=<delay> in milliseconds (default 5000)
      -A | --atlas=<number> maximum number of queries (default 1)
      -D | --debug
      -? | --help

The tomograph trace files can also be used for off-line inspection of specific time ranges, without having to contact the MonetDB server and re-run a query. Most notably, one can zoom into a range of the execution trace using the “--range” option. The ranges may be expressed in seconds (s), milliseconds (ms), or microseconds (the default). The trace file can also be used to locate the details of each instruction.
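A “--range”-style specification can be normalized to microseconds as sketched below. The suffix handling follows the description above (s, ms, or a bare number in microseconds); this is an illustrative helper, not tomograph's actual parser:

```python
# Multipliers to convert each documented unit suffix to microseconds.
UNITS = {"s": 1_000_000, "ms": 1_000, "": 1}

def parse_range(spec):
    """Parse a '<start>-<end>[ms|s]' range spec into a (start, end)
    pair of microsecond values. A missing suffix means microseconds."""
    unit = ""
    for suffix in ("ms", "s"):  # test "ms" first so it is not mistaken for "s"
        if spec.endswith(suffix):
            unit, spec = suffix, spec[:-len(suffix)]
            break
    start, end = (int(v) * UNITS[unit] for v in spec.split("-"))
    return start, end
```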

Tachograph

Tachograph zhang Sun, 07/19/2015 - 21:30

The program tachograph monitors the event stream to determine the progress of a query execution. This tool is available as of the Jul2015 release.

A synopsis of the calling conventions:

    tachograph [options]
      -d | --dbname=<database_name>
      -u | --user=<user>
      -p | --port=<portnr>
      -h | --host=<hostname>
      -b | --beat=<delay> in milliseconds (default 5000)
      -i | --interactive=<0 | 1> show trace on stdout (default 1)
      -o | --output=<webfile>
      -c | --cache=<query pool location> (default ‘cache’)
      -q | --queries=<query pool capacity>
      -w | --wait=<delay time> in milliseconds (default 5000)
      -? | --help

Assume a MonetDB server is running to serve the database “voc”. First, in terminal 1, start tachograph on the database as follows:

    $ tachograph -d voc
    -- opened UDP profile stream numa-db:50010 for localhost

Then, in terminal 2, start an mclient to execute a (long-running) query:

    $ mclient -d voc -s "select * from tables a, tables b, tables c, tables d, tables e;"

This triggers tachograph to respond with something like the following in terminal 1:

    CACHE ID:0
    set time zone interval '+02:00' hour to minute

    [##################################################] 100%  00:00:00
    CACHE ID:1
    select * from tables a, tables b, tables c, tables d, tables e
    [##################################................>  68%  00:00:20      language.dataflow()

This indicates that tachograph has captured two queries. The first query, “set time zone...”, is executed automatically at the start of each mclient session to set the time zone for that client. For each captured query, tachograph shows on the command line:

  1. a progress bar,
  2. the percentage of completion,
  3. the elapsed time, and
  4. the MAL instruction currently being executed (if any).

The progress bar and the percentage of completion are computed using the ETC (Expected Time to Completion) of the MAL instructions. To terminate tachograph at any time, press Ctrl-c. For each query, tachograph saves the execution information in four files (by default in the directory “cache”):

    $ ls cache
    tachograph_0.json      tachograph_0.trace    tachograph_1_stmt.csv
    tachograph_0_mal.csv   tachograph_1.json     tachograph_1.trace
    tachograph_0_stmt.csv  tachograph_1_mal.csv  tachograph.trace
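The ETC-based completion percentage mentioned above can be approximated as the ticks of finished instructions over the total expected ticks, i.e., finished plus estimated remaining work. This is a deliberate simplification of whatever tachograph computes internally, shown only to convey the idea:

```python
def progress(done_usec, remaining_estimates):
    """Rough progress fraction in the spirit of an ETC computation:
    ticks already spent on finished instructions, divided by those ticks
    plus the estimated ticks of the instructions still to run."""
    total = done_usec + sum(remaining_estimates)
    return done_usec / total if total else 1.0
```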

The event streams are saved in the “.trace” files. The “.csv” files contain more readable information about the MAL statements, extracted from the “.trace” files and stored in CSV format. The “.csv” files can easily be loaded into a database for in-database analysis of the query execution, although the tools themselves do not use them yet.
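For example, such a “.csv” file could be loaded into an in-memory SQLite table for ad-hoc analysis. The column names used here (pc, usec, stmt) are illustrative; check the header of the actual file before adapting this sketch:

```python
import csv
import io
import sqlite3

def load_stmt_csv(csv_text):
    """Load a CSV export (header row: pc,usec,stmt) into an in-memory
    SQLite table named 'events' and return the connection."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (pc INTEGER, usec INTEGER, stmt TEXT)")
    con.executemany("INSERT INTO events VALUES (:pc, :usec, :stmt)", rows)
    return con

# Typical follow-up query: the most expensive statements, e.g.
#   SELECT stmt, SUM(usec) FROM events GROUP BY stmt ORDER BY 2 DESC
```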

The most interesting output files here are the “.json” files, which contain the same information as the “.trace” files, but converted into JSON format. In addition, each “stmt” is accompanied by a “beautystmt”, in which the variables in the MAL statement are mapped back to tables and columns. This makes the event stream much more human-readable. The listing below shows an excerpt of the “.json” file generated for the query “select count(*) from tables;”:

    { "tachograph":0.1,
      "system":{ "MonetDBversion":"11.20.0", "release":"unreleased", "host":"x86_64-unknown-linux-gnu", "threads":"48", "memory":"93.793 GB", "oid":"64", "packages":["huge"] },

      "qid":"user.s1_2{autoCommit=true}",
      "tag":"337",
      "query":"select count(*) from tables",
      "started": 85885322413,
      "duration":0,

      "instructions":34
    },
    {
    "qid":"user.s1_2{autoCommit=true}",
    "tag":337,
    "pc":1,
    "time": 1191,
    "status": "start",
    "estimate": 0,
    "stmt": "X_49=0@0:void := querylog.define(\"select count(*) from tables;\":str,\"default_pipe\":str,34:int)",
    "beautystmt": "querylog.define(\"select count(*) from tables;\",\"default_pipe\",34)",
    "prereq":[]
    },
    ...
    {
    "qid":"user.s1_2{autoCommit=true}",
    "tag":337,
    "pc":27,
    "time": 5139,
    "status": "start",
    "estimate": 0,
    "stmt": "X_38=0:wrd := aggr.count(X_37=<tmp_710>[0]:bat[:oid,:int])",
    "beautystmt": "count(tmp._tables.id[0])",
    "prereq":[26]
    },
    ...

Each “.json” file starts with a special block containing meta-information, e.g., hardware resources and software versions. It then contains one block for each event line in the “.trace” file.