Server Programming


Extending SQL


GeoSpatial


Spatial support

MonetDB/SQL comes with an interface to the Simple Features Specification of the Open Geospatial Consortium (formerly known as the Open GIS Consortium), which opens the way to developing GIS applications.

The MonetDB/SQL/GIS module supports all objects and functions specified in the OGC "Simple Features for SQL" specification. For the time being, however, spatial objects can only be expressed in the Well-Known Text (WKT) format. WKT includes information about the type of the object and the object's coordinates.

Installation

The GIS functionality is packaged as a separate MonetDB module called geom. To benefit from the geometry functionality you first have to download and install geos, a well-known and sound library to build upon. The next step is to (re-)build MonetDB with the --enable-geom configure argument. This builds the necessary extension modules and activates them upon the first start of the server. Note that databases created before you configured with geom support will lack the geom functions in SQL. We recommend starting with a new database.

Get Going

To use the spatial extension of MonetDB, simply use the geom data types from SQL.

Example: the script below creates and populates a 'forests' table and a 'buildings' table, followed by a spatial query on this fictitious landscape.

CREATE TABLE forests(id INT,name TEXT,shape MULTIPOLYGON);
CREATE TABLE buildings(id INT,name TEXT,location POINT,outline POLYGON);

INSERT INTO forests VALUES(109, 'Green Forest',
'MULTIPOLYGON( ((28 26,28 0,84 0,84 42,28 26), (52 18,66 23,73 9,48 6,52 18)), ((59 18,67 18,67 13,59 13,59 18)))');

INSERT INTO buildings VALUES(113, '123 Main Street',
	'POINT( 52 30 )',
	'POLYGON( ( 50 31, 54 31, 54 29, 50 29, 50 31) )');
INSERT INTO buildings VALUES(114, '215 Main Street',
	'POINT( 64 33 )',
	'POLYGON( ( 66 34, 62 34, 62 32, 66 32, 66 34) )');

SELECT forests.name,buildings.name
FROM forests,buildings
WHERE forests.name = 'Green Forest' and
    Overlaps(forests.shape, buildings.outline) = true;

Accelerating Spatial Operations

There are no special accelerators to speed up access to spatial objects yet. However, the Minimum Bounding Rectangle (mbr) data type can be used to accelerate operations considerably. This requires a small query rewrite. In the example above, the performance of the query can be improved in the following manner:

ALTER TABLE forests ADD bbox mbr;
UPDATE forests SET bbox = mbr(shape);
ALTER TABLE buildings ADD bbox mbr;
UPDATE buildings SET bbox = mbr(outline);

SELECT forests.name,buildings.name
FROM forests,buildings
WHERE forests.name = 'Green Forest' AND
    mbroverlaps(forests.bbox,buildings.bbox) = TRUE AND
    Overlaps(forests.shape, buildings.outline) = TRUE;

In this way the mbr operation acts as a filter. Upon request, and availability of resources, we will develop MAL optimizers to automate this process.

 

Limitations

This is the first implementation of OGC functionality in MonetDB. It is based on libgeos 3.3.0. Further development will be based on concrete external requests and availability of manpower. The shortlist of open issues is:

  • development of a JDBC extension to map the geometry datatypes to their Java counterparts. (UDFs/type mapping)
  • support for 3D types.
  • spatial optimizers in the MAL optimizer toolkit to exploit spatial search predicates

 

Spatial data types


Spatial Types

MonetDB supports the Open Geospatial Consortium (OGC, formerly OpenGIS) types: Point, Curve, LineString, Surface, Polygon, MultiPoint, MultiCurve, MultiLineString, MultiSurface, MultiPolygon, Geometry and GeomCollection. In addition, one non-OGC type, 'mbr', is provided for fast access using pre-filtering; it stores a 2D bounding box. Functions to create these boxes are specified in the following sections.

Conversion from and to Well-known Text

Convert a Well-Known Text (WKT) string to a spatial object. The SRID parameter is a reference to the Spatial Reference System in which the coordinates are expressed.

CREATE FUNCTION GeomFromText(wkt string, srid SMALLINT) RETURNS Geometry
CREATE FUNCTION PointFromText(wkt string, srid SMALLINT) RETURNS Point
CREATE FUNCTION LineFromText(wkt string, srid SMALLINT) RETURNS LineString
CREATE FUNCTION PolyFromText(wkt string, srid SMALLINT) RETURNS Polygon
CREATE FUNCTION MPointFromText(wkt string, srid SMALLINT) RETURNS MultiPoint
CREATE FUNCTION MLineFromText(wkt string, srid SMALLINT) RETURNS MultiLineString
CREATE FUNCTION MPolyFromText(wkt string, srid SMALLINT) RETURNS MultiPolygon
CREATE FUNCTION GeomCollectionFromText(wkt string, srid SMALLINT) RETURNS GeomCollection
-- alias
CREATE FUNCTION PolygonFromText(wkt string, srid SMALLINT) RETURNS Polygon
 
CREATE FUNCTION AsText(p Point) RETURNS STRING
CREATE FUNCTION AsText(c Curve) RETURNS STRING
CREATE FUNCTION AsText(l LineString) RETURNS STRING
CREATE FUNCTION AsText(s Surface) RETURNS STRING
CREATE FUNCTION AsText(p Polygon) RETURNS STRING
CREATE FUNCTION AsText(p MultiPoint) RETURNS STRING
CREATE FUNCTION AsText(c MultiCurve) RETURNS STRING
CREATE FUNCTION AsText(l MultiLineString) RETURNS STRING
CREATE FUNCTION AsText(s MultiSurface) RETURNS STRING
CREATE FUNCTION AsText(p MultiPolygon) RETURNS STRING
CREATE FUNCTION AsText(g Geometry) RETURNS STRING
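
For example, a WKT literal can be converted into a geometry and rendered back as text. The following is a minimal sketch using only the functions listed above; the SRID values 0 and 4326 are merely illustrative spatial reference system identifiers:

SELECT AsText(PointFromText('POINT(52 30)', 0));
SELECT SRID(PolyFromText('POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))', 4326));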

Analysis functions on Geometry

The following functions perform analysis operations on geometries:

CREATE FUNCTION Area(g Geometry) RETURNS FLOAT
CREATE FUNCTION Length(g Geometry) RETURNS FLOAT
CREATE FUNCTION Distance(a Geometry, b Geometry) RETURNS FLOAT
CREATE FUNCTION Buffer(a Geometry, distance FLOAT) RETURNS Geometry
CREATE FUNCTION ConvexHull(a Geometry) RETURNS Geometry
CREATE FUNCTION Intersection(a Geometry, b Geometry) RETURNS Geometry
CREATE FUNCTION "Union"(a Geometry, b Geometry) RETURNS Geometry
CREATE FUNCTION Difference(a Geometry, b Geometry) RETURNS Geometry
CREATE FUNCTION SymDifference(a Geometry, b Geometry) RETURNS Geometry
 
CREATE FUNCTION Dimension(g Geometry) RETURNS integer
CREATE FUNCTION GeometryTypeId(g Geometry) RETURNS integer
CREATE FUNCTION SRID(g Geometry) RETURNS integer
CREATE FUNCTION Envelope(g Geometry) RETURNS Geometry
CREATE FUNCTION IsEmpty(g Geometry) RETURNS BOOLEAN
CREATE FUNCTION IsSimple(g Geometry) RETURNS BOOLEAN
CREATE FUNCTION Boundary(g Geometry) RETURNS Geometry
 
CREATE FUNCTION Equals(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION Disjoint(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION "Intersect"(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION Touches(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION Crosses(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION Within(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION Contains(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION Overlaps(a Geometry, b Geometry) RETURNS BOOLEAN
CREATE FUNCTION Relate(a Geometry, b Geometry, pattern STRING) RETURNS BOOLEAN

SQL functions on Point

CREATE FUNCTION X(g Geometry) RETURNS double
CREATE FUNCTION Y(g Geometry) RETURNS double
CREATE FUNCTION Point(x double, y double) RETURNS Point

SQL functions on Curve

CREATE FUNCTION IsRing(l LineString) RETURNS BOOLEAN
CREATE FUNCTION StartPoint(l LineString) RETURNS Point -- not yet implemented
CREATE FUNCTION EndPoint(l LineString) RETURNS Point -- not yet implemented

SQL functions on LineString

CREATE FUNCTION NumPoints(l LineString) RETURNS integer -- not yet implemented
CREATE FUNCTION PointN(l LineString, i integer) RETURNS Point -- not yet implemented

SQL functions on Surface

CREATE FUNCTION PointOnSurface(s Surface) RETURNS Point -- not yet implemented
CREATE FUNCTION Centroid(s Surface) RETURNS Point -- not yet implemented

SQL functions on Polygon

CREATE FUNCTION ExteriorRing(s Surface) RETURNS LineString -- not yet implemented
CREATE FUNCTION NumInteriorRing(s Surface) RETURNS integer -- not yet implemented
CREATE FUNCTION InteriorRingN(s Surface, n integer) RETURNS LineString -- not yet implemented

SQL functions on GeomCollection

CREATE FUNCTION NumGeometries(c GeomCollection) RETURNS integer -- not yet implemented
CREATE FUNCTION GeometryN(c GeomCollection, n integer) RETURNS Geometry -- not yet implemented
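
These functions can be combined in ordinary SQL expressions. The following is a minimal sketch using only functions listed above; the literal geometries and the SRID value 0 are purely illustrative:

SELECT Distance(PointFromText('POINT(0 0)', 0), PointFromText('POINT(3 4)', 0));
SELECT Contains(PolyFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))', 0),
                PointFromText('POINT(5 5)', 0));
SELECT X(Point(3.0, 4.0)), Y(Point(3.0, 4.0));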

SQL functions on spatial objects

The following functions return the minimum bounding rectangle (mbr) of a given geometry, or a boolean in the case of mbroverlaps:

CREATE FUNCTION mbr (p Point) RETURNS mbr
CREATE FUNCTION mbr (c Curve) RETURNS mbr
CREATE FUNCTION mbr (l LineString) RETURNS mbr
CREATE FUNCTION mbr (s Surface) RETURNS mbr
CREATE FUNCTION mbr (p Polygon) RETURNS mbr
CREATE FUNCTION mbr (m MultiPoint) RETURNS mbr
CREATE FUNCTION mbr (m MultiCurve) RETURNS mbr
CREATE FUNCTION mbr (m MultiLineString) RETURNS mbr
CREATE FUNCTION mbr (m MultiSurface) RETURNS mbr
CREATE FUNCTION mbr (m MultiPolygon) RETURNS mbr
CREATE FUNCTION mbr (g Geometry) RETURNS mbr
CREATE FUNCTION mbr (g GeomCollection) RETURNS mbr
CREATE FUNCTION mbroverlaps(a mbr, b mbr) RETURNS BOOLEAN

Life Science


MonetDB/SQL comes with a SAM/BAM module for managing DNA sequence alignment data in the de-facto standard SAM/BAM formats. The module contains the following features to assist SAM/BAM data analysis:

  • Features to load a single SAM/BAM file, a user specified selection of SAM/BAM files, or a complete repository of SAM/BAM files into the database.
  • A database cleanup feature to remove individual SAM/BAM files from the database.
  • A  collection of  SQL functions commonly used in SAM/BAM data analysis, such as computing the reverse complement of a DNA string, and filtering alignment records by their flags.
  • A SAM formatter that renders a database result set into the SAM format.

The SAM/BAM module forms a cornerstone for building life-science applications around DNA sequences. A quick introduction to the features is given in the accompanying screencast, which also demonstrates how MonetDB/BAM works together with popular genomic tools, such as IGV, the Integrative Genomics Viewer.

A specification of the SAM/BAM formats can be found here. Some auxiliary tools for manipulating alignments in the SAM/BAM formats can be found here.

Installation


Installation of SAMtools

If you want to use the SAM/BAM module of MonetDB, you must have the SAMtools (development) library pre-installed. On Ubuntu, you need to install the libbam-dev package. On Fedora, you need to install the samtools-devel package. On any other OS, you need to install the SAMtools library yourself and tell MonetDB where to find it. The steps are detailed below:

  1. Create a folder for your SAMtools installation. We refer to the absolute path of this folder as $SAMTOOLS_BASEDIR.
  2. Download and save the latest version of SAMtools from here in $SAMTOOLS_BASEDIR. We used version 0.1.19 for this manual, so we have a file $SAMTOOLS_BASEDIR/samtools-0.1.19.tar.bz2.
  3. Unzip the bzip2 file and rename the resulting directory:
    $ cd $SAMTOOLS_BASEDIR
    $ tar xf samtools-0.1.19.tar.bz2
    $ mv samtools-0.1.19 samtools
  4. Create two symbolic links $SAMTOOLS_BASEDIR/include and $SAMTOOLS_BASEDIR/lib:
    $ ln -s $SAMTOOLS_BASEDIR $SAMTOOLS_BASEDIR/include
    $ ln -s $SAMTOOLS_BASEDIR/samtools $SAMTOOLS_BASEDIR/lib
  5. Open the file $SAMTOOLS_BASEDIR/samtools/Makefile with your favourite editor and add the flag -fPIC to the CFLAGS variable. This edit is necessary for MonetDB to use the SAMtools libraries during its installation process. NB: we do not know what effects this change might have on other software that also uses the SAMtools library directly.
  6. Now the SAMtools library can be compiled without any further arguments.
    $ cd $SAMTOOLS_BASEDIR/samtools && make
  7. For more detailed information on the compilation and installation, see the file $SAMTOOLS_BASEDIR/samtools/INSTALL.
  8. Add the SAMtools library we have just installed to the library path:
    $ export LD_LIBRARY_PATH=$SAMTOOLS_BASEDIR/samtools:$LD_LIBRARY_PATH
  9. When installing MonetDB, it needs to know where to find the SAMtools library. This is achieved by passing the option --with-samtools=$SAMTOOLS_BASEDIR to the configure script of MonetDB.

Contact MonetDB Solutions for additional support options or a reference to a prepared Cloud instance.

Loading SAM/BAM data


To demonstrate the MonetDB SAM/BAM library, we prepared four sample input data files: file1.bam and file2.sam are unsorted, while file1-sorted.bam and file2-sorted.sam are sorted by the QNAME field of the alignment records. They can also be found in the subdirectory sql/backends/monet5/bam/Tests/files of your MonetDB source directory. Assume these files are saved on your computer in the directory "/tmp". There are several ways to load the SAM/BAM files into MonetDB.

Load individual files

The SQL function bam_loader_file(bam_file STRING, dbschema SMALLINT) allows loading one SAM/BAM file at a time:

sql> CALL bam.bam_loader_file('/tmp/file1.bam', 0);
sql> CALL bam.bam_loader_file('/tmp/file2.sam', 0);
sql> CALL bam.bam_loader_file('/tmp/file1-sorted.bam', 1);
sql> CALL bam.bam_loader_file('/tmp/file2-sorted.sam', 1);

The first argument is the absolute path to the SAM/BAM file. The second argument is either 0 for the sequential schema, or 1 for the pairwise schema (see the definition of both storage schemas in SAM/BAM storage). Note that to use the pairwise schema, the SAM/BAM files must be sorted by the QNAME field.

To check which SAM/BAM files have been loaded into the database, you can consult the database catalogue table "bam.files":

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)

Load SAM/BAM repositories

The SQL function bam_loader_repos(bam_repos STRING, dbschema SMALLINT, nr_threads SMALLINT) allows loading a repository of SAM/BAM files in the given directory:

sql> CALL bam.bam_loader_repos('/tmp', 0, 4);

The first argument is the absolute path to the SAM/BAM repository (note that the loading function does not search subdirectories for SAM/BAM files). The second argument is either 0 for the sequential schema, or 1 for the pairwise schema (see SAM/BAM storage). The third argument is the number of threads to use during the loading process. Note: the last argument has been removed in the bamloader branch of MonetDB, since a user at the SQL level should not have to deal with such a low-level detail. This change will be merged into the next feature release of MonetDB.

Load list of SAM/BAM files

The SQL function bam_loader_files(bam_files STRING, dbschema SMALLINT, nr_threads SMALLINT) loads all SAM/BAM files listed in the given file:

sql> CALL bam.bam_loader_files('/tmp/files.txt', 0, 4);

The first argument is the absolute path of the file which contains a list of absolute paths to the SAM/BAM files to load:

$ cat /tmp/files.txt
/tmp/file1.bam
/tmp/file2.sam

The second argument is either 0 for the sequential schema, or 1 for the pairwise schema (see SAM/BAM storage). The third argument is the number of threads to use during the loading process. Note: the last argument has been removed in the bamloader branch of MonetDB, since a user at the SQL level should not have to deal with such a low-level detail. This change will be merged into the next feature release of MonetDB.

Remove files

The SQL function bam_drop_file(file_id BIGINT, dbschema SMALLINT) allows removing all data contained in a SAM/BAM file from the database:

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)
sql>CALL bam.bam_drop_file(2, 0);

The first argument is the file_id of the file that will be removed. The second argument is the dbschema of this file, which is either 0 for the sequential schema or 1 for the pairwise schema.

sql>SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
3 tuples (5.334ms)

 

Known issues

There is a loading issue in the Oct2014 release of MonetDB. When loading a list or a repository of files, using the SQL commands bam_loader_files and bam_loader_repos respectively, too many file descriptors are opened. This might cause the loading process to fail with the error message that the bam_wrapper code could not open a certain file. This issue has been fixed in the bamloader branch and will be merged into the next feature release of MonetDB.

 

The storage schemas


The MonetDB SAM/BAM module supports two types of schemas to store SAM/BAM data. With the sequential schema, all alignment records are simply read, parsed and stored one after another, comparable to how they are stored in the SAM/BAM files. With the pairwise schema, pairs of alignment records (as determined by the information in the alignment records) are stored together in single database records.

Meta-data tables

Regardless of the database schema in which the alignment records are stored, headers of the SAM/BAM files are stored in a fixed number of meta-data tables. Their specifications and relationships are shown in the figure below:

The table "bam.files" stores which BAM files are loaded into the database.

  • file_id: every file tuple in this table contains a unique file ID, handed out by the BAM loader during the loading process;
  • file_location: location of the SAM/BAM file on disk, from where it was loaded;
  • dbschema: the database storage schema used, 0 for the sequential schema, 1 for the pairwise schema;
  • format_version: the format version, as stored in the SAM/BAM header;
  • sorting_order: the field by which the alignment records in this file are sorted, or "unsorted" if they are not sorted;
  • comments: the comments that were found inside CO tags in the file header.

The database design defines additional tables to store the remainder of the SAM/BAM file header, using a straightforward approach (e.g. table "bam.sq" for SQ header records and table "bam.rg" for RG header records). For more information on the exact meaning of these fields, please consult the SAM specification.
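
The meta-data tables can be queried like ordinary tables. For instance, a minimal sketch that lists the files loaded into the pairwise schema, using only the bam.files columns described above:

sql> SELECT file_id, file_location, sorting_order
     FROM bam.files
     WHERE dbschema = 1;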

Alignment data tables

For the alignment data, the database design defines a separate set of tables for every SAM/BAM file that has been loaded into the database. This design choice was made to speed up analyses that involve only a small number of SAM/BAM files, since this often occurs in practice. The speed-up is realized by not having to filter out alignment data of specific BAM files as a first analysis step.

When loading SAM/BAM alignment records into the database, you can choose to store these data in one of two sets of predefined database tables, referred to as the sequential table set and the pairwise table set. These table sets are presented in the next figures, for a BAM file with file_id i.

The database design uses the virtual offset of every alignment record as a primary key. By storing this virtual offset, specific alignments can easily be located in the original SAM/BAM file using e.g. SAMtools.

Sequential schema

The sequential table set is aimed towards a straightforward mapping from the alignment data as it occurs in a SAM/BAM file. Every alignment field gets stored in a database column. Furthermore, the extra information that is contained in alignment data is parsed and stored in a separate table alignments_extra_i.

Pairwise schema

The pairwise table set aims at reducing the performance overhead when many operations are done on paired alignment data. During the loading process, both primary and secondary alignment pairs are automatically recognized and stored accordingly in the appropriate database tables. The columns in the paired tables have either an 'l' or an 'r' prefix (except for 'qname', since two alignment records in the same pair always have the same QNAME). Columns with an 'l' prefix store data from alignment records that have their 'first segment' flag set to 1, and columns with an 'r' prefix do this for alignments with their 'last segment' flag set to 1. All alignments that cannot be paired are stored in the unpaired alignments table, which has the same structure as the regular alignments table in the sequential table set.

In addition to physical tables, the pairwise table set also defines some views over the data, which are also automatically created. These views offer ways to access the data as if it was stored in a sequential fashion, e.g., any query that is aimed at the alignments table from the sequential table set can be fired at the unpaired_all_alignments_i view.

In the above figure, arrows are used to indicate over which physical tables the different views are defined. 

Note: currently, if you want to load a SAM/BAM file into the pairwise schema, the file has to be sorted by QNAME. If this is not yet the case, please sort it first, using e.g. SAMtools, before trying to load it into the database. Another way would be to load the file into the sequential schema, write the ordered alignments back to a SAM file (see SAM export for more information on the exporting functionality) and then load that SAM file. However, unless you already have your file loaded into a sequential schema this is not advised, since this sequence of actions will in general take longer than simply sorting the file with e.g. SAMtools before inserting it.

Query DNA sequence data


The use of the MonetDB SAM/BAM module to query DNA alignment data is shown using a dump of an actual session. More examples can be found in the MonetDB test directory sql/backends/monet5/bam/Tests for further inspiration.

We start off by loading some SAM and BAM files into MonetDB (see Loading SAM/BAM data):

sql> CALL bam.bam_loader_file('/tmp/file1.bam', 0);
sql> CALL bam.bam_loader_file('/tmp/file2.sam', 0);
sql> CALL bam.bam_loader_file('/tmp/file1-sorted.bam', 1);
sql> CALL bam.bam_loader_file('/tmp/file2-sorted.sam', 1);

All loaded SAM and BAM files are stored in the database table bam.files:

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)

A summary of the database now looks as follows:

sql>set schema bam;
auto commit mode: on
sql>\d
TABLE  bam.alignments_1
TABLE  bam.alignments_2
TABLE  bam.alignments_extra_1
TABLE  bam.alignments_extra_2
TABLE  bam.alignments_extra_3
TABLE  bam.alignments_extra_4
TABLE  bam.export
TABLE  bam.files
TABLE  bam.paired_primary_alignments_3
TABLE  bam.paired_primary_alignments_4
TABLE  bam.paired_secondary_alignments_3
TABLE  bam.paired_secondary_alignments_4
TABLE  bam.pg
TABLE  bam.rg
TABLE  bam.sq
TABLE  bam.unpaired_alignments_3
TABLE  bam.unpaired_alignments_4
VIEW   bam.unpaired_all_alignments_3
VIEW   bam.unpaired_all_alignments_4
VIEW   bam.unpaired_primary_alignments_3
VIEW   bam.unpaired_primary_alignments_4
VIEW   bam.unpaired_secondary_alignments_3
VIEW   bam.unpaired_secondary_alignments_4
sql>set schema sys;
auto commit mode: on

Now we can use SQL queries to conduct various analysis steps on the alignment data. The module has several distinguishing features to facilitate this:

  • Auxiliary functions for common SAM/BAM analysis operations (see also SAM/BAM SQL functions).
  • The possibility of doing analyses on data from multiple files, which is impossible with many other tools.
  • Both unpaired (i.e., sequential) and pairwise storage schemas to ease and accelerate data analysis, depending on the types of queries.

Queries using sequential schema

Q1. Count the number of alignment records and chromosomes contained in "file1.bam":

sql> SELECT COUNT(*), COUNT(DISTINCT rname) FROM bam.alignments_1;
+------+------+
| L1   | L2   |
+======+======+
|   71 |    7 |
+------+------+
1 tuple (9.272ms)

Q2. Count the number of alignment records for every chromosome:

sql> SELECT rname, COUNT(*) FROM bam.alignments_1 GROUP BY rname;
+-------+------+
| rname | L1   |
+=======+======+
| chr13 |    1 |
| chr14 |    2 |
| chr18 |    1 |
| chr22 |   63 |
| chr2  |    1 |
| chr3  |    1 |
| chr9  |    2 |
+-------+------+
7 tuples (9.179ms)

Q3. Find all alignments with a mapping quality above 200 and marked as reversed, sort the results by their rname and pos:

sql> SELECT * FROM bam.alignments_1 WHERE mapq > 200 AND bam.bam_flag(flag, 'segm_reve') = TRUE ORDER BY rname, pos;
+----------+------------+-----+-------+----------+-----+----------+-----+----------+----+----------------------------+-----------------------------+
| virtual_ | qname      | fla | rname | pos      | map | cigar    | rne | pnext    | tl | seq                        | qual                        |
: offset   :            : g   :       :          : q   :          : xt  :          : en :                            :                             :
+==========+============+======+=======+==========+======+========+=====+==========+====+============================+=============================+
| 34538632 | sim_22_1_a | 147 | chr22 | 15047749 | 255 | 100=     | =   | 15047509 |  0 | ATGGTTTTTGCCTGGTACTGTTGAAG | ##A#A####D5A>F#E@:G2#3ADBD: |
:          :            :     :       :          :     :          :     :          :    : TTAGGCTTAATTTTGAACCAGTAGCT : ?##@CE-?D>BEG?D5DEABG3B5EC? :
:          :            :     :       :          :     :          :     :          :    : TTGTTGTTTACCTTATGTGGTTTTGG : CGDEDE=B@AE5@BDDGD>/DGBAE5@ :
:          :            :     :       :          :     :          :     :          :    : GTTCATTTGTTCTATAAGTATA     : EGFEGEEEFEAD;BB?FGC         :
| 34539304 | sim_22_1_9 |  83 | chr22 | 15600266 | 255 | 100=     | =   | 15600021 |  0 | TTTAAAATATTAAAAGATGAATTACT | 5<###E?##AB#E=DB#<==;#E#?=A |
:          :            :     :       :          :     :          :     :          :    : ATCAATTGTTTTGAATTTTAAACTAA : FEE@C@AEA6GDGFFG?DE??::?=@B :
:          :            :     :       :          :     :          :     :          :    : AAATCAGTAGTTACTATAAAATTATT : C#G>GB>GEEDA=G>ADEFDEGABEDD :
:          :            :     :       :          :     :          :     :          :    : ATTAAATGTTCTAATAATTGTA     : EBCGGEFFF7;=GEGGFGC         :
...
| 34555209 | sim_22_1_7 |  83 | chr22 | 45060375 | 255 | 41=1X56= | =   | 45060162 |  0 | ACCAGCCTGGCCAACATGGTGAAACG | 5#>?CG=E,A=#AG#?#>GAEA@=DB< |
:          :            :     :       :          :     : 1x1=     :     :          :    : CTGTCTCTACTAAAAGTACAAAAAAA : >ED@?CFC@A?=EG65GGD:C@EE(BF :
:          :            :     :       :          :     :          :     :          :    : TTAGCTAGGCGTGGTGGTGGGCACCT : ABFG5B?G;EFE?@5EG5GAGD5@;DF :
:          :            :     :       :          :     :          :     :          :    : GTAATCCCAGCTACTCGGGGAG     : GG?F?7GAGGAFEFA>E-E         :
| 34555657 | sim_22_1_4 | 147 | chr22 | 46558385 | 255 | 1X49=1X9 | =   | 46558150 |  0 | TGACTCCAGGGAAGTTAAATAAATGC | #E###BF#?;;#>@=1#@=E=B#E-ED |
:          :            :     :       :          :     : =1x39=   :     :          :    : CAAGAGGATTCACGTGGCAGGGCCAT : F?DBC#EDEF?;6FE>+GE@GDE.AED :
:          :            :     :       :          :     :          :     :          :    : GGAGGTGGGGAGGAAGGAGCACAGGA : E+DAF:4E>FF=EDA;DAEGEGAEFD? :
:          :            :     :       :          :     :          :     :          :    : GACACCAGGTGCTCACTGTCCA     : GD=DG=G=C5FEGDGGFAG         :
+----------+------------+-----+-------+----------+-----+----------+-----+----------+----+----------------------------+-----------------------------+
21 tuples (4.390ms)

Q4. Compute the reverse complement of the sequence strings and the reverse of the quality strings.

sql> SELECT seq, bam.reverse_seq(seq) AS rev_seq, qual, bam.reverse_qual(qual) AS rev_qual FROM bam.alignments_1;
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------------------------+
| seq                                  | rev_seq                              | qual                                 | rev_qual                              |
+======================================+======================================+======================================+=======================================+
| TATACTTATAGAACAAATGAACCCAAAACCACATAA | ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5E | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D> |
: GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : A@B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@## : BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD>/ :
: AACTTCAACAGTACCAGGCAAAAACCAT         : TTTTGGGTTCATTTGTTCTATAAGTATA         : ?:DBDA3#2G:@E#F>A5D####A#A##         : DGBAE5@EGFEGEEEFEAD;BB?FGC            :
| TATACTTATAGAACAAATGAACCCAAAACCACATAA | ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5E | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D> |
: GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : A@B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@## : BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD>/ :
: AACTTCAACAGTACCAGGCAAAAACCAT         : TTTTGGGTTCATTTGTTCTATAAGTATA         : ?:DBDA3#2G:@E#F>A5D####A#A##         : DGBAE5@EGFEGEEEFEAD;BB?FGC            :
...
| ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | TATACTTATAGAACAAATGAACCCAAAACCACATAA | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5EA |
: TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : >BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD : @B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@##?: :
: TTTTGGGTTCATTTGTTCTATAAGTATA         : AACTTCAACAGTACCAGGCAAAAACCAT         : >/DGBAE5@EGFEGEEEFEAD;BB?FGC         : DBDA3#2G:@E#F>A5D####A#A##            :
| ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | TATACTTATAGAACAAATGAACCCAAAACCACATAA | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5EA |
: TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : >BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD : @B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@##?: :
: TTTTGGGTTCATTTGTTCTATAAGTATA         : AACTTCAACAGTACCAGGCAAAAACCAT         : >/DGBAE5@EGFEGEEEFEAD;BB?FGC         : DBDA3#2G:@E#F>A5D####A#A##            :
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------------------------+
71 tuples (5.768ms)

Q5. Select alignment records that are flagged as being both primary and a first segment:

sql> SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = FALSE AND bam.bam_flag(flag, 'firs_segm') = TRUE;
+----------+-------------+------+-------+----------+------+-------------------+------+----------+------+--------------------------+--------------------------+
| virtual_ | qname       | flag | rname | pos      | mapq | cigar             | rnex | pnext    | tlen | seq                      | qual                     |
: offset   :             :      :       :          :      :                   : t    :          :      :                          :                          :
+==========+=============+======+=======+==========+======+===================+======+==========+======+==========================+==========================+
| 34538416 | sim_22_1_a  |   99 | chr22 | 15047509 |  255 | 100=              | =    | 15047749 |    0 | TAAAAACTTGCTGGTTTTGCGGCT | F?CBF7BGFGGGDD@GFF@B;:GG |
:          :             :      :       :          :      :                   :      :          :      : TGGGGGGCATCACGGAACCTACTG : GEGCFF>EFGGDFFEDEFFG1EGC :
:          :             :      :       :          :      :                   :      :          :      : ACACGTGATGTCTCCCCTGGATGC : C?BDF#F=ED=E?@GBF@=GDED; :
:          :             :      :       :          :      :                   :      :          :      : CCAGCTTTAAAATTTCCCACTTTT : 856FD?@=F#FE?EB5CEGCFE#F :
:          :             :      :       :          :      :                   :      :          :      : GTAC                     : ####                     :
| 34539304 | sim_22_1_9  |   83 | chr22 | 15600266 |  255 | 100=              | =    | 15600021 |    0 | TTTAAAATATTAAAAGATGAATTA | 5<###E?##AB#E=DB#<==;#E# |
:          :             :      :       :          :      :                   :      :          :      : CTATCAATTGTTTTGAATTTTAAA : ?=AFEE@C@AEA6GDGFFG?DE?? :
:          :             :      :       :          :      :                   :      :          :      : CTAAAAATCAGTAGTTACTATAAA : ::?=@BC#G>GB>GEEDA=G>ADE :
:          :             :      :       :          :      :                   :      :          :      : ATTATTATTAAATGTTCTAATAAT : FDEGABEDDEBCGGEFFF7;=GEG :
:          :             :      :       :          :      :                   :      :          :      : TGTA                     : GFGC                     :
...

| 34555209 | sim_22_1_7  |   83 | chr22 | 45060375 |  255 | 41=1X56=1X1=      | =    | 45060162 |    0 | ACCAGCCTGGCCAACATGGTGAAA | 5#>?CG=E,A=#AG#?#>GAEA@= |
:          :             :      :       :          :      :                   :      :          :      : CGCTGTCTCTACTAAAAGTACAAA : DB<>ED@?CFC@A?=EG65GGD:C :
:          :             :      :       :          :      :                   :      :          :      : AAAATTAGCTAGGCGTGGTGGTGG : @EE(BFABFG5B?G;EFE?@5EG5 :
:          :             :      :       :          :      :                   :      :          :      : GCACCTGTAATCCCAGCTACTCGG : GAGD5@;DFGG?F?7GAGGAFEFA :
:          :             :      :       :          :      :                   :      :          :      : GGAG                     : >E-E                     :
| 34555441 | sim_22_1_4  |   99 | chr22 | 46558150 |  255 | 100=              | =    | 46558385 |    0 | CCCTGGCTGGAGTACAAGTTACTG | G6EG5GGGGDGGADFGEB?GFDGG |
:          :             :      :       :          :      :                   :      :          :      : GGGCTCAGTTCTTAGGATTCCTAC : AEGGCFFBECGGB*BEA9BFG6E= :
:          :             :      :       :          :      :                   :      :          :      : AAAGCCCAATTTGGGTCACCTGAC : ?=)C5FGAFE6G2FB=CFECF#DA :
:          :             :      :       :          :      :                   :      :          :      : TTGTTAAGTCAGGTGAAGGTGACT : EAG<AGD7F*44D:#B>+?F##;E :
:          :             :      :       :          :      :                   :      :          :      : TACC                     : ;#DD                     :
+----------+-------------+------+-------+----------+------+-------------------+------+----------+------+--------------------------+--------------------------+
21 tuples (7.336ms)

Q6. Count the number of alignment records for every chromosome from all data files that have been loaded into the sequential schema:

sql> SELECT rname, COUNT(*) FROM (SELECT * FROM bam.alignments_1 UNION SELECT * FROM bam.alignments_2) AS tmp GROUP BY rname;
+-------+------+
| rname | L1   |
+=======+======+
| chr13 |    2 |
| chr14 |    4 |
| chr18 |    2 |
| chr22 |  126 |
| chr2  |    2 |
| chr3  |    2 |
| chr9  |    4 |
+-------+------+
7 tuples (32.554ms)

Queries using pairwise schema

Q7. Count the number of paired primary alignments

sql> SELECT COUNT(*) FROM bam.paired_primary_alignments_4;
+------+
| L1   |
+======+
|   21 |
+------+
1 tuple (2.408ms)

Q8. Extract paired sequences

sql> SELECT l_seq, r_seq FROM bam.paired_primary_alignments_3;
+---------------------------------------------------------------------+----------------------------------------------------------------------+
| l_seq                                                               | r_seq                                                                |
+=====================================================================+======================================================================+
| ACTCTGTCGCCCAGGCCGGGGCGTAGCCATGCAAACACGGCTCACTGCAGCCTCGACCTCCCCGGCT | AGTAGATGGGATTACAGGCACCCGCCACCGCGCCCAGCTAATTTTTATATATTTTTTAAGTAGAGATG |
: CAAGCCATCCTCCCACCTCAACCTCCCAAGTAG                                   : GGGTTTCACCATGTTGGCCAGGCTGGTCTTGA                                     :
| TGATGCTGAGCTTGACGGCGGTCCCCCGGGTACCACTCCATCACCCAGTGAAGGAGGCAGGTGGGAC | TGGCCAAGTAATCATGCTAACTAGAGCTCTAGCTTCCTCTCCCTGGTTTCGCTTTTCTAAGTATAAAT |
: CATCTCAGACATGGGCTGTGTTGGGACCCCCAG                                   : AAATAAATAAATGATAAATTGCCTTATTTTTT                                     :
...

| AATATAAATTAAAATCTCAGGCCTCCATCCCCTATCTACTGAATCAGAATCTGCATTTTAACAGATC | CGTAGTGAGACCCCATCTCCAAAAAATAAATAAATTAAGGACAGGATGGGCATGGACCACTCTTCAAG |
: TTTTGTGGTTCATATACAGTTGAGTTCTTTTTT                                   : ATTACTGACCAGCTTCTTGATATTGGACAACT                                     :
| CTTTTCTATATAAAACAAATAATTTATAATACTGTATTTGTTTTCTATTACTATTGAAATTACTACA | TGCCTTTTCCAGCTGATAGAGACAGCCATGTTTCATGGCTTGTAGCCCCTTCCTCCATCTTCAAAGCC |
: AATGTAGTAGCTAAAACAACACATATTTTTCAT                                   : AACAACAACTGGTCAAGTTGTCACATTGCATC                                     :
+---------------------------------------------------------------------+----------------------------------------------------------------------+
21 tuples (2.831ms)

Q9. Find the primary alignment pairs whose left and right reads have different mapping qualities:

sql> SELECT qname FROM bam.paired_primary_alignments_3 WHERE l_mapq <> r_mapq;
+-------+
| qname |
+=======+
+-------+
0 tuples (2.736ms)

Q10. Compute the distances of the paired primary alignments

sql> SELECT qname, r_pos - (l_pos + bam.seq_length(l_cigar)) AS dist FROM bam.paired_primary_alignments_4 WHERE l_pos < r_pos;
+-------------+------+
| qname       | dist |
+=============+======+
| sim_22_1_2  |   91 |
| sim_22_1_3  |   93 |
| sim_22_1_4  |  135 |
| sim_22_1_5  |  108 |
| sim_22_1_6  |  108 |
| sim_22_1_10 |  108 |
| sim_22_1_14 |  153 |
| sim_22_1_16 |  108 |
| sim_22_1_a  |  140 |
| sim_22_1_b  |  112 |
| sim_22_1_d  |   93 |
| sim_22_1_f  |  108 |
+-------------+------+
12 tuples (8.376ms)

Q11. Create a histogram of the distances of the paired primary alignments

sql> SELECT r_pos - (l_pos + bam.seq_length(l_cigar)) AS dist, COUNT(*) AS cnt 
FROM bam.paired_primary_alignments_4 WHERE l_pos < r_pos GROUP BY dist;
+------+------+
| dist | cnt  |
+======+======+
|   91 |    1 |
|   93 |    2 |
|  135 |    1 |
|  108 |    5 |
|  153 |    1 |
|  140 |    1 |
|  112 |    1 |
+------+------+
7 tuples (6.406ms)

 

SAM/BAM SQL functions


The MonetDB SAM/BAM module contains a small library of SQL functions which implement the most common operations in DNA sequence alignment data analysis.

bam.bam_flag(flag SMALLINT, name STRING) RETURNS BOOLEAN

Returns a BOOLEAN value indicating whether the bit with the given name is set to 1 in the given flag. The flag bits are as defined by the SAM specification.

flag: the integer flag value of an alignment record

name: the name of the bit that should be checked. The following names are used for the different flags:
 

Bit    Name       Description (from the SAM specification)
0x1    mult_seqm  template having multiple segments in sequencing
0x2    prop_alig  each segment properly aligned according to the aligner
0x4    segm_unma  segment unmapped
0x8    next_unma  next segment in the template unmapped
0x10   segm_reve  SEQ being reverse complemented
0x20   next_reve  SEQ of the next segment in the template being reversed
0x40   firs_segm  the first segment in the template
0x80   last_segm  the last segment in the template
0x100  seco_alig  secondary alignment
0x200  qual_cont  not passing quality controls
0x400  opti_dupl  PCR or optical duplicate
0x800  supp_alig  supplementary alignment

Example query: selecting all primary alignments:

SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = false;

bam.reverse_seq(seq STRING) RETURNS STRING

Computes the reverse complement of the given DNA sequence. The function uses the following complement mapping:

A ↔ T
C ↔ G
R ↔ Y
S ↔ S
W ↔ W
K ↔ M
H ↔ D
V ↔ B
N ↔ N
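
For instance, a hypothetical call on a short sequence; the expected result, CAGGT, follows directly from the mapping above:

sql>SELECT bam.reverse_seq('ACCTG');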

bam.reverse_qual(qual STRING) RETURNS STRING

Computes the reverse of the given quality string. Example:

sql>select bam.reverse_qual('h@!j');
+---------------------------+
| reverse_qual_single_value |
+===========================+
| j!@h                      |
+---------------------------+
1 tuple (0.410ms)


bam.seq_length(cigar STRING) RETURNS INT

Use the CIGAR string of an alignment to compute the length of the region of the reference sequence to which this alignment is mapped. Example (taken from http://genome.sph.umich.edu/wiki/SAM):

Assume we have a read that is aligned like this to the reference string:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T

This would yield the CIGAR string 3M1I3M1D5M. From this example, it is easy to see that the alignment has length 12 (from position 5 up to and including position 16). However, we can use just the CIGAR string to figure out this length:

sql>SELECT bam.seq_length('3M1I3M1D5M');
+-------------------------+
| seq_length_single_value |
+=========================+
|                      12 |
+-------------------------+
1 tuple (0.388ms)

Hence, it is easy to compute this length for all of the alignments in a particular file, by issuing the following query:

SELECT bam.seq_length(cigar) FROM bam.alignments_1;

 

bam.seq_char(ref_pos INT, alg_seq STRING, alg_pos STRING, alg_cigar STRING) RETURNS CHAR

Use alignment information to figure out which character is aligned with a certain position in the reference genome.

Returns the character that was found at the reference position of interest, or null if no alignment was found at this position.

ref_pos: The position in the reference

alg_seq: The sequence string of the alignment of interest

alg_pos: The value in the pos field of the alignment of interest (which stores the starting position of the alignment)

alg_cigar: The CIGAR string of the alignment of interest

Let's take a look at the example we used earlier:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T

We can now use the seq_char function to confirm what we see above:

sql>SELECT bam.seq_char(5, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| A                     |
+-----------------------+
1 tuple (0.458ms)

In words, this query selects the character that is present on reference position 5, which is indeed an A. Some other examples figure out the characters on positions 11, 16 and 17:

sql>SELECT bam.seq_char(11, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(16, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| T                     |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(17, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.110ms)

 

 

Given this function, it becomes really simple to e.g. select all read alignments of a SAM/BAM file that overlap with a certain position:

sql>SELECT seq, pos, cigar FROM bam.alignments_1 WHERE  bam.seq_char(15600025, seq, pos, cigar) IS NOT NULL;
+------------------------------------------------------------------------------------------------------+----------+-------------+
| seq                                                                                                  | pos      | cigar       |
+======================================================================================================+==========+=============+
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600021 | 3=1X96=     |
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600023 | 1=1I1=1I96= |
+------------------------------------------------------------------------------------------------------+----------+-------------+
2 tuples (1.551ms)

Note: The bam.seq_char function is currently only available in the bamloader branch, which will be merged into the next feature release of MonetDB.

Export SAM/BAM data


There are two ways to export query results into SAM files: using the server-side sam_export() function, or using the client-side SAM formatter. The server-side export function works independently of which client interface is used; however, it can only write data to files to which the MonetDB server has access. With the client-side export function, one can export data to any file the client interface has access to, independent of where the MonetDB server resides. However, each client interface needs to be extended with its own SAM formatter. Currently, only mclient is supported.

Server-side SAM export

  • bam.sam_export(output_path STRING)

    This procedure exports all records in the table "bam.export" to the file "output_path". If "output_path" is a relative path, the file is created in the database farm currently served by this MonetDB server (see the man-pages monetdbd and mserver5 for the definition of "dbfarm" and "dbpath"). If the file "output_path" already exists, it is overwritten. This procedure will also delete all records from the table "bam.export" after the export.

The queries below show how to i) populate the "bam.export" table, ii) export data, iii) load exported data into the database, and finally iv) conduct some checks. Note that when inserting data into the "bam.export" table, one needs to select the individual columns. The "bam.export" table does not contain a column for the virtual offsets, since these can only be computed afterwards, when constructing the BAM files.

Populate the export table with a sequentially stored file

sql> INSERT INTO bam.export ( 
SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_1);
sql> CALL bam.sam_export('OUTPUT_1'); -- Export data to SAM file
sql> CALL bam.bam_loader_file('OUTPUT_1', 0); -- Load exported data back into a sequential table
sql> -- Data inside original table should be exactly the same as the newly imported file
sql> SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_1
EXCEPT
SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_13; -- assume OUTPUT_1 is loaded into bam.alignments_13
sql> SELECT * FROM bam.export; -- Verify that the export table is now empty

Populate the export table with a pairwise stored file

sql> INSERT INTO bam.export (
    SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
    FROM bam.unpaired_all_alignments_3
);
sql> CALL bam.sam_export('OUTPUT_2'); -- Export data to SAM file
sql> CALL bam.bam_loader_file('OUTPUT_2', 0); -- Load exported data back into a sequential table
sql> -- Data inside original table should be exactly the same as the newly imported file
sql> SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.unpaired_all_alignments_3
EXCEPT
SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_14; -- assume OUTPUT_2 is loaded into bam.alignments_14
sql> SELECT * FROM bam.export; -- Verify that the export table is now empty

Client-side SAM export

If you are using mclient to communicate with MonetDB, you can use the built-in SAM formatter to export query results to a SAM formatted file. Note that this requires the output columns to have the appropriate names, i.e., the renderer looks for the column names of the alignments_N table as defined in the sequential schema (see SAM/BAM storage schemas). If there are columns that cannot be mapped, you will receive a notification and the renderer will simply discard the data in the unrecognised columns.

To use the SAM formatter, type in mclient:

sql> \f sam

Now, if you execute a query using mclient, the result will be displayed in SAM format.

To write the query result to a SAM file, e.g., "/tmp/out.sam", type the following command in mclient:

sql> \> /tmp/out.sam

Now, results of all subsequent queries will be appended to this SAM file.

You can easily generate multiple SAM files by giving the "\>" command different output file names:

sql> \> /tmp/out1.sam
sql> SELECT * FROM bam.alignments_1;
sql> \> /tmp/out2.sam
sql> SELECT * FROM bam.alignments_2;

Streaming


The DataCell stream processing facilities of MonetDB are best illustrated using a minimalist example, where a sensor sends events to the database, which are picked up by a continuous query and sent out over a stream towards an actuator. To run the example, you should have a MonetDB binary with DataCell functionality enabled. This ensures that the required libraries are loaded and the SQL catalog is informed about the stream-specific functions/operators. It also creates the DataCell schema, which is used to collect compiled continuous queries. The final step in the startup is to enable the DataCell optimizer pipeline.

sql> set optimizer = 'datacell_pipe';
sql> create table datacell.bsktin (id integer, tag timestamp, payload integer);
sql> create table datacell.bsktout (like datacell.bsktin);

sql> call datacell.receptor('datacell.bsktin', 'localhost', 50500);
sql> call datacell.emitter('datacell.bsktout', 'localhost', 50600);
sql> call datacell.query('datacell.pass', 'insert into datacell.bsktout select * from datacell.bsktin;');

sql> call datacell.resume();

After these simple steps, it suffices to hook up a sensor to send events to the DataCell and to hook up an actuator to listen for response events. The result of this experiment will be a large number of randomly generated events passing through the stream engine in a bulk fashion.

$ nc -l -u localhost 50600 &
$ sensor --host=localhost --port=50500 --events=1000 --columns=3 &

The Linux netcat (nc) can be used as a strawman actuator to monitor the output of the DataCell. The distribution comes with a sensor and an actuator simulator. The DataCell source code contains a fire-detection scenario to exercise the DataCell, which forms a basis for cloning your own application.

The example reconsidered
The DataCell operates on relational tables. The first action is to identify all such tables and redefine them as baskets by attaching them to receptors, emitters, or intermediate baskets.

sql> call datacell.receptor('datacell.bsktin', 'localhost', 50500);
sql> call datacell.emitter('datacell.bsktout', 'localhost', 50600);

A receptor thread is attached to the 'bsktin' basket on a TCP stream, on port 50500 by default, over which we receive tuples in CSV format. The number of fields and their lexical convention should comply with the corresponding table definition. The same format semantics apply as for a COPY INTO command over a CSV file. The receptor mode is either active or passive. In passive mode, the default setting, it is the sensor that takes the initiative in contacting the streaming engine to deposit events. In active mode, it is the streaming engine that contacts the sensor for more events. Note that the receptor becomes active only after you issue the datacell.resume('bsktin') or datacell.resume() operation. The calls shown are actually a shorthand for the more verbose version, where protocol and mode are made explicit.

sql> call datacell.receptor('datacell.bsktin', 'localhost', 50500, 'tcp', 'active');
sql> call datacell.emitter('datacell.bsktout', 'localhost', 50600, 'udp', 'passive');

The sensor simulator is geared towards testing the infrastructure and takes shortcuts in the event formats sent. Currently, it primarily generates event records starting with an optional event identifier, followed by an optional timestamp, and a payload of random integer values. To generate a test file with 100 events for the example, you can rely mostly on the default settings. Hooking up the sensor to the stream engine merely requires a hostname and port instead of a file argument. A glimpse of the sensor interaction can be obtained using --trace, which writes the events to standard output or a specific file. The sensor simulator asks for user input before it exits. This way, the receiving side can pick up the events and is not confronted with a possibly broken UDP channel.

$ sensor --events=100 --protocol=debug --columns=3
1,306478,1804289383
... 98 more ...

100,137483,1956297539
$ sensor --host=localhost --port=50500 --events=100 --columns=3

An alternative scheme is to replay an event log using the --file and --replay options. It reads an event, possibly with a fixed delay (--delay=<milliseconds>), and sends it to the receptor. An exact (time) replay calls for identifying the column with the temporal information, i.e. using the option --time=<field index>.

After this step, the events have been picked up by the receptor and added to the basket datacell.bsktin. This table can be queried like any other table, but be aware that it may be emptied concurrently. The next step is to define a continuous query, which in this case passes the input received on to the output channel. Reception and emission can be temporarily interrupted using the datacell.pause(objectname) operation.

After registration of the query, the datacell module contains the necessary optimized code for the continuous query processing. The scheduler is subsequently restarted using datacell.resume(), which moves the data from bsktin into bsktout when it arrives. You can check the result using ordinary SQL queries over the table producing functions: datacell.receptors(), datacell.emitters(), datacell.baskets() and datacell.queries().

sql> call datacell.query('datacell.pass', 'insert into datacell.bsktout select * from datacell.bsktin;');

 

Architecture Overview

Architecture Overview mk Wed, 08/15/2012 - 21:51

The DataCell Architecture
The DataCell approach is easily understood using the previously mentioned example. The sensor program simulates a real-world sensor that emits events at a regular interval, e.g. temperature, humidity, or noise readings. The actuator is a device simulator that is controlled by the events it receives, e.g. a fire alarm. Sensors and actuators work independently. They are typically proprietary devices that communicate with a controlling station over a wired network. The only requirement imposed by the DataCell is that devices can deliver events over the UDP protocol (the default), using CSV as the most efficient event message format. Alternative message format handlers can readily be included, either by extending the formats recognized by the adapters or as a simple filter between the device and the DataCell.

Baskets
The basket is the key data structure of the streaming engine. Its role is to hold a portion of an event stream, also denoted as an event window. It is represented as a temporary main-memory table. Unlike other stream systems, there is no a priori order or fixed window size. The basket is simply a (multi-)set of event records received from an adapter or ready to be shipped to an actuator. There is no persistency and no transaction management over the baskets. If a basket should survive session boundaries, its content should be inserted into a normal table. Baskets can be queried with SQL like any other table, but concurrent actions may leave you with a mostly empty table to look at.
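For instance, a snapshot of a basket can be made persistent by copying its content into an ordinary table; the table name below is purely illustrative:

sql> create table bsktin_archive as select * from datacell.bsktin with data;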

Adapters
The receptor and emitter adapters are the units in the DataCell that interface with sensors and actuators. Both communicate with their environment through a channel. The default channel is a UDP connection, chosen for speed. By default the receptor is a passive thread that opens a channel and awaits incoming events. In contrast, the emitter is an active thread, which immediately pushes the events onto the channel identified. Hooks have been created to reverse these roles, e.g. a receptor that polls a device and an emitter that waits for polling actuators.

Events that cannot be parsed are recorded as errors for the corresponding basket. All errors collected can be inspected using the table-producing function datacell.errors().
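For example:

sql> select * from datacell.errors();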

Continuous queries
Continuous queries are expressed as ordinary SQL queries, in which previously declared basket tables are recognized by the DataCell optimizer. Access to these tables is rewritten, and interaction with the adapters is regulated by a locking scheme. Mixing basket tables and persistent tables is allowed. For convenience, multiple SQL statements can be packed into an SQL procedure, so that the events derived from one basket can be delivered to multiple destinations.

Continuous queries often rely on control over the minimum/maximum number of events to consider when the query is executed. This information is expressed as an ordinary predicate in the where clause. The following pre-defined predicates are supported. They inform the DataCell scheduler when the next action should be taken. They do not affect the current query, which allows for dynamic behavior: the window slide size can even be calculated by the query itself. It also means that a startup run of the query is needed to inform the scheduler the first time, unless the properties are set explicitly using datacell.basket() and datacell.beat() calls.

datacell.threshold(B,N)   the query is only executed when basket B holds at least N events
datacell.window(B,M,S)    extract a window of at most M events and slide by S events afterwards
datacell.window(B,T,Ts)   extract a window based on a temporal interval of size T, followed by a stride of Ts
datacell.beat(B,T)        the next query run starts after a delay of T milliseconds (excluding query execution time)

The sliding-window constraints are mutually exclusive: a window either slides based on the number of events consumed or based on a time interval. For time slicing, the first timestamp column in the basket is used as the frame of reference; all other temporal columns are treated as ordinary attributes.
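As a sketch of how such a predicate can be attached to a continuous query, the registration below extends the earlier pass-through query with a threshold; the query name is illustrative and the exact way the basket is referenced inside the predicate may differ per version:

sql> call datacell.query('datacell.batched', 'insert into datacell.bsktout select * from datacell.bsktin where datacell.threshold(''datacell.bsktin'', 10);');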

Stream catalog

Stream catalog mk Sun, 01/29/2012 - 16:41

The status of the DataCell is mapped onto a series of table-producing SQL functions: datacell.baskets(), datacell.receptors(), datacell.emitters() and datacell.queries().

sql>select * from datacell.receptors();
+-----------------+-----------+-------+----------+---------+---------+----------------------------+--------+----------+---------+
| nme             | host      | port  | protocol | mode    | status  | lastseen                   | cycles | received | pending |
+=================+===========+=======+==========+=========+=========+============================+========+==========+=========+
| datacell.bsktin | localhost | 50500 | TCP      | passive | running | 2012-08-15 19:31:28.000000 |      2 |       20 |       0 |
+-----------------+-----------+-------+----------+---------+---------+----------------------------+--------+----------+---------+
1 tuple (1.800ms)
sql>select * from datacell.emitters();
+------------------+-----------+-------+----------+--------+---------+----------------------------+--------+------+---------+
| nme              | host      | port  | protocol | mode   | status  | lastsent                   | cycles | sent | pending |
+==================+===========+=======+==========+========+=========+============================+========+======+=========+
| datacell.bsktout | localhost | 50600 | UDP      | active | running | 2012-08-15 19:31:28.000000 |      2 |   10 |       0 |
+------------------+-----------+-------+----------+--------+---------+----------------------------+--------+------+---------+
1 tuple (1.725ms)

The receptors and emitters are qualified by their communication protocol and mode. The last time they received/sent events is shown. Events not yet handled by a continuous query are reported as pending.


sql>select * from datacell.baskets();
+---------------------+-----------+---------+-----------+-----------+------------+------+----------------------------+--------+
| nme                 | threshold | winsize | winstride | timeslice | timestride | beat | seen                       | events |
+=====================+===========+=========+===========+===========+============+======+============================+========+
| datacell.bsktmiddle |         0 |       0 |         0 |         0 |          0 |    0 | 2012-08-15 19:31:28.000000 |      0 |
| datacell.bsktin     |         0 |       0 |         0 |         0 |          0 |    0 | 2012-08-15 19:31:28.000000 |      0 |
| datacell.bsktout    |         0 |       0 |         0 |         0 |          0 |    0 | 2012-08-15 19:31:28.000000 |      0 |
+---------------------+-----------+---------+-----------+-----------+------------+------+----------------------------+--------+
3 tuples (1.639ms)
sql>select * from datacell.queries();
+-----------------+---------+----------------------------+--------+--------+------+-------+---------------------------------------------------------------------------------+
| nme             | status  | lastrun                    | cycles | events | time | error | def                                                                             |
+=================+=========+============================+========+========+======+=======+=================================================================================+
| datacell.pass   | running | 2012-08-15 19:31:28.000000 |      6 |     20 |  613 |       | insert into datacell.bsktmiddle select * from datacell.bsktin;                  |
| datacell.filter | running | 2012-08-15 19:31:28.000000 |      4 |      7 |  653 |       | insert into datacell.bsktout select * from datacell.bsktmiddle where id %2 = 0; |
+-----------------+---------+----------------------------+--------+--------+------+-------+---------------------------------------------------------------------------------+

The baskets have properties that the scheduler uses to decide when to empty them, and the number of pending events is shown. For the continuous queries, the catalog shows how often they have been selected for execution, the total number of events taken from all input baskets, the total execution time, and their definition.

Sensor simulator

Sensor simulator mk Sat, 01/28/2012 - 17:47

The sensor simulator is geared towards testing the complete infrastructure and takes shortcuts on the event formats sent. Currently, it primarily generates event records starting with an optional event identifier, followed by an optional timestamp, and a payload of random integer values.

sensor [options]
--host=<host name>, default=localhost
--port=<portnr>, default=50500
--sensor=<name>
--protocol=<name> udp or tcp(default)
--increment=<number>, default=1
--timestamp, default=on
--columns=<number>, default=1
--events=<events length>, (-1=forever,>0), default=1
--file=<data file>
--replay use file or standard input
--time=<column> where to find the exact time

--batch=<batchsize> , default=1
--delay=<ticks> interbatch delay in ms, default=1
--trace=<trace> interaction
--server run as a server
--client run as a client


To generate a test file with 100 events for the example, you can rely mostly on the default settings. Hooking up the sensor to the stream engine merely requires a hostname and port instead of a file argument. A glimpse of the sensor interaction can be obtained using the 'debug' protocol, which writes the events to standard output or a specific file. The status of the DataCell can be checked with datacell.dump(), which now shows a hundred events gathered in the basket bsktin. The sensor simulator asks for user input before it exits, so that the receiving side can pick up the events and is not confronted with a broken UDP channel.

$ sensor --events=100 --protocol=debug --columns=3
1,306478,1804289383
... 98 more ...
100,137483,1956297539
$ sensor --host=localhost --port=50500 --events=100 --columns=3

Actuator simulator

Actuator simulator mk Sat, 01/28/2012 - 17:40

To test the DataCell, the distribution contains a simple event simulator. It generates a stream of MonetDB tuple values containing only random integers. Each tuple starts with a time stamp indicating when it was created.

The actuator simulator provides the following options:

actuator [options]
--host=<host name>, default localhost
--port=<portnr>, default 50600
--protocol=<name>  either tcp/udp, default tcp
--actuator=<actuator name> to identify the event received
--server run as a server (default)
--client run as a client
--events=<number>  number of events to receive
--statistics=<number>  show statistics after a series of events
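For instance, to receive the events emitted on the bsktout channel of the running example, an actuator could be started along these lines; the parameter values are illustrative and match the emitter configured earlier:

$ actuator --host=localhost --port=50600 --protocol=udp --events=100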

Data Vaults

Data Vaults mk Sun, 04/18/2010 - 07:14

Modern data warehouses are not confined to strictly tabular data loaded into the database using the COPY INTO statement. In many areas, most notably the scientific domains, proprietary file formats have been designed, augmented with libraries to create and inspect them. MonetDB recognizes the value of such choices and provides an easy way to open up these treasure chests for SQL querying. Therefore, it introduces a series of data vaults, each geared towards supporting a well-defined foreign file format.

MonetDB Data Vaults (DV) are database-attached external files or file repositories. Based on a concept similar to SQL/MED, they can be used for managing external data. One of the primary purposes of a data vault is to allow faster and easier processing of data stored in large file repositories. DVs are currently available in MonetDB for formats such as MiniSEED [2] and NetCDF [3].

This topic is also the focus of ongoing research [1].

[1] Data Vaults: a Symbiosis between Database Technology and Scientific File Repositories

[2] MiniSEED Data Formats

[3] NetCDF

Embedded Python/NumPy

Embedded Python/NumPy giulia Mon, 03/23/2020 - 15:51

Python is one of the most popular languages in Data Science. It is a flexible scripting language that is easy to use and has a large number of available statistical libraries. When we're doing statistical analysis in Python, we naturally need the data accessible to us in Python in some way. And what better place to keep data than in a database?

Previously, when you wanted to combine MonetDB and Python, you had to use the Python MAPI client. This client suffers from very low transfer speeds because it uses sockets. If you want to execute Python functions on larger data sets, this low transfer speed quickly becomes a bottleneck.

This is why we are now introducing MonetDB/Python, which allows users to create their own Python functions just as they would create SQL functions. The embedded Python functions can then be used within SQL statements. We show that our new method of making data available to Python analyses is faster than competing solutions.

By using NumPy arrays, which are essentially Python wrappers around C arrays, our Embedded Python implementation can transfer data from MonetDB to Python without unnecessarily copying the data, which leads to extremely fast transfer speeds. In addition, Embedded Python supports mapped operations, which allows you to run Python functions in parallel within SQL queries. Combined with the NumPy/SciPy libraries, which contain very efficient C implementations of numerous statistical/analytical functions, Embedded Python matches the speed of native SQL functions, while offering all the flexibility and ease of use of the Python scripting language. In addition, you can load and use any of the numerous available Python modules.

Embedded Python works by creating a SQL Function that contains the Python code to be run. A simple function that multiplies the input by two is as follows.

CREATE FUNCTION python_times_two(i INTEGER) RETURNS INTEGER LANGUAGE PYTHON {
    return i * 2
};

After creating this function, we can use it in a SQL query as follows.

CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (2), (3), (4), (5);
SELECT python_times_two(i) AS result FROM integers;
result
2
4
6
8
10

Data Input

Now you might be wondering what exactly "i" is in this function. As we have mentioned previously, we are using Numpy for converting between MonetDB and Python. The exact type of "i" depends on the input; if the input contains a NULL value, "i" will be a MaskedArray, otherwise "i" will be a regular one-dimensional Numpy array. The dtype of the array depends on the type specified in the SQL function. In this example the specified type is "INTEGER", which corresponds to a dtype of numpy.int32. Below is a table that contains the exact types for every possible input value.

INPUT          DTYPE
BOOLEAN        numpy.int8
TINYINT        numpy.int8
SMALLINT       numpy.int16
INTEGER        numpy.int32
BIGINT         numpy.int64
REAL           numpy.float32
FLOAT          numpy.float64
HUGEINT        numpy.float64
STRING         numpy.object [1]
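As a hedged illustration of this NULL handling, a function along the following lines should report which input values are NULL; the function name is illustrative and the MaskedArray check is an assumption based on the input convention described above:

CREATE FUNCTION python_is_null(i INTEGER) RETURNS BOOLEAN LANGUAGE PYTHON {
    # with NULLs present, "i" is a numpy MaskedArray and its mask marks the NULLs
    if isinstance(i, numpy.ma.MaskedArray):
        return i.mask
    # otherwise "i" is a plain ndarray and no value is NULL
    return numpy.zeros(len(i), dtype=numpy.bool_)
};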

Creating Tables using Python Data

We can also use Embedded Python to create tables or insert data into existing tables. Consider the following function.

CREATE FUNCTION python_table() 
RETURNS TABLE(name STRING, country STRING, age INTEGER) 
LANGUAGE PYTHON {
    result = dict()
    result['name'] = ['Henk', 'John', 'Elizabeth']
    result['country'] = ['NL', 'USA', 'UK']
    result['age'] = [25, 30, 33]
    return result
};

This returns a table with three columns. We can then create an actual table in our database as follows.

CREATE TABLE people AS 
  SELECT * FROM python_table() 
  WITH data;
SELECT * FROM people;
name country age
Henk NL 25
John USA 30
Elizabeth UK 33

Naturally this is only a toy example that takes no input and simply returns a constant table. Perhaps a more useful example is the following function.

CREATE FUNCTION random_integers(low INTEGER, high INTEGER, amount INTEGER) 
RETURNS TABLE(value INTEGER) 
LANGUAGE PYTHON { 
    return numpy.random.randint(low, high, size=(amount,))
};

This function generates a table with "amount" entries, where each entry is a random integer between low and high. We can then use the function to generate a table of 5 integers with a value between 0 and 10 as follows. 

SELECT * FROM random_integers(0, 10, 5);
value
4
5
8
4
2

Filtering Data

We can use an embedded python function in the WHERE clause to pick which rows to include in the result set. Consider the following function that checks, for every string in a column, if a given string (needle) is a part of that string (haystack).

CREATE FUNCTION python_strstr(strings STRING, needle STRING) 
RETURNS BOOLEAN 
LANGUAGE PYTHON { 
    return [needle in haystack for haystack in strings]
};

We can now use this function to select all people with the letter 'n' in their name.

SELECT * FROM people WHERE python_strstr(name, 'n');
name country age
Henk NL 25
John USA 30

Aggregating Data

Finally, we can use an embedded python function for computing aggregates. The syntax for creating an aggregate is as follows.

CREATE AGGREGATE python_aggregate(val INTEGER) 
RETURNS INTEGER 
LANGUAGE PYTHON {
    try:
        unique = numpy.unique(aggr_group)
        x = numpy.zeros(shape=(unique.size))
        for i in range(0, unique.size):
            x[i] = numpy.sum(val[aggr_group==unique[i]])
    except NameError:
        # aggr_group doesn't exist. no groups, aggregate on all data
        x = numpy.sum(val)
    return(x)
};

This is a simple aggregate that sums the integers per group, or over all values when no groups are defined. We can then use the aggregate just as we would use a SQL aggregate.

CREATE TABLE grouped_ints(value INTEGER, groupnr INTEGER);
INSERT INTO grouped_ints VALUES (1, 0), (2, 1), (3, 0), (4, 1), (5, 0);
SELECT groupnr, python_aggregate(value) FROM grouped_ints GROUP BY groupnr;
groupnr L1
0 9
1 6

As you can see, this produces output equivalent to the SQL aggregate SUM(). If you look at the source code you will see the usage of a hidden parameter "aggr_group". Note that the parameter "aggr_group" is only created when a GROUP BY is used in the SQL query. This parameter is passed to aggregates and contains a numpy array of the group numbers for each tuple. In the above example, aggr_group contains the numbers [0, 1, 0, 1, 0]. We then use the group numbers of each tuple to compute the aggregate value for each group.

Note also that aggr_group will always be a one dimensional array containing the group numbers, even if we do a GROUP BY over multiple columns, as in the below example.

CREATE TABLE grouped_ints(value INTEGER, groupnr INTEGER, groupnr2 INTEGER);
INSERT INTO grouped_ints VALUES (1, 0, 0), (2, 0, 0), (3, 0, 1), (4, 0, 1), (5, 1, 0), (6, 1, 0), (7, 1, 1), (8, 1, 1);
SELECT groupnr, groupnr2, python_aggregate(value) FROM grouped_ints GROUP BY groupnr, groupnr2;
groupnr groupnr2 L1
0 0 3
0 1 7
1 0 11
1 1 15

MonetDB transparently handles multiple groups for aggregates. In this example aggr_group will be a single array containing the values [0, 0, 1, 1, 2, 2, 3, 3].

Parallelized Execution of Python Functions

When performing a SQL query, MonetDB can speed up the query by splitting up the columns and running the query in parallel on multiple threads when possible. This is a process called mitosis. For example, when we want to take the square root of every element of a column, we can split up the column and have separate threads work on separate parts of the column. This is shown in the image below for a column of size 4 being split into two parts, and having two threads execute the operation.

However, certain operations cannot be parallelized because they require access to the entire column. These are called blocking operations. An example of such a blocking operation is the quantile function, because we cannot compute the quantile of a column when we only have access to part of a column. By default, embedded python is a blocking operation. This is because it is very easy to write a user defined function that is not mappable. We could, for example, use our python function to compute a quantile.

However, blocking operations are inefficient. When we use a blocking operation, we need to wait for all the previous threads to finish their operations. Then, we need to take all the split-up columns and recombine them into one big column, and then we can finally call the blocking operation. If our python function is mappable, we would prefer it to be executed in parallel.

Well, good news! We support this. If you know that your function is mappable, you can specify this by setting the LANGUAGE to PYTHON_MAP instead of PYTHON when creating the SQL function. Your function is mappable if the output does not depend on the entire column, but only on the individual rows.

For a simple example, let's go back to the first function we created: multiplying a column of integers by 2. This is a mappable function, as the output works with the individual rows. The new function looks like this.

CREATE FUNCTION python_times_two_map(i INTEGER) RETURNS INTEGER LANGUAGE PYTHON_MAP {
    return i * 2
};

We can then use it just as we used the regular python_times_two function, and it will return the same result.

SELECT python_times_two_map(i) AS result FROM integers;
result
2
4
6
8
10

Even for such a simple function we can see a big performance increase when we increase the input size. Below is a graph running both functions with 1GB of input data.

Note that it is possible to run non-mappable functions using LANGUAGE PYTHON_MAP, however, this will then naturally produce different output than running the same function with LANGUAGE PYTHON. It is possible to abuse this to run part of a Python function in parallel. As an example, let's make the MIN() function in our embedded Python.

Suppose we have the following mapped function, and the following non-mapped function.

CREATE FUNCTION python_min(i INTEGER) RETURNS integer LANGUAGE PYTHON {
    return numpy.min(i)
};
CREATE FUNCTION python_min_map(i INTEGER) RETURNS integer LANGUAGE PYTHON_MAP {
    return numpy.min(i)
};

At a glance these functions look identical. If we run the following query we will get the expected result.

SELECT python_min(i) FROM integers;
min
1

But if we run the mapped query we will get a different result [2]. 

SELECT python_min_map(i) FROM integers;
min
1
3

Since we are running with two threads, the function python_min_map is being executed twice. Once for one part of the column, and once for another part of the column. This means that the function returns two separate values. We can obtain the actual minimum value by using our sequential function python_min on the result of python_min_map.

SELECT python_min(python_min_map(i)) FROM integers;
min
1

Now part of our query is being run in parallel, and we are still obtaining the desired result. Running these functions for a data set of 1000MB we can see that our parallel function has better performance than simply calling python_min, while still computing the correct value.

Performance

In keeping with tradition, we will use Embedded Python to compute quantiles. We can do this efficiently using the numpy.percentile function. We will compare our Embedded Python implementation against various other ways of loading data into Python and then executing the numpy.percentile function on the data, as well as some other non-Python ways of computing the quantile. In our example, we are using a single table containing 1 GB of integer values (250M values).

  • MonetDB/Python: This is our implementation, using numpy.percentile on the data within a MonetDB table. MonetDB uses memory mapping to load the data into memory very quickly, and because of our zero-copy transfer into Python there is no additional overhead cost for transferring this data into Python. 
  • MonetDB C UDF: Compute the quantile in MonetDB using a C UDF. We computed the quantile using Quickselect.
  • NumPy Binary: Use numpy.load() to load the data into a numpy array from a numpy binary file. This is a very fast way of loading data into Python, because we are directly mapping a binary file into memory we do not have to do any decoding.
  • NumPy MemoryMap: Use numpy.memmap() to load the data into a numpy array from a numpy binary file. This has similar performance to NumPy binary files.
  • PyTables: Load the data using tables.open_file(). This is fast because it loads a binary file directly into a Numpy array.
  • PyFITS: Use pyfits.open() to load the data from a .fits file into a numpy array.
  • castra: A column-store database created entirely in Python. This uses Python pickling to load the database file as a Python object from an encoded string. This has some additional overhead.
  • MonetDB/R: Using the MonetDB/R plugin, using the native R quantile function instead of the numpy.percentile function. The MonetDB/R extension does not use zero-copy transferring, which means there is extra overhead of copying the data once. In addition, it seems the numpy.percentile() function is faster than the quantile() function in R.
  • Monary: Monary is a connector for MongoDB that is optimized for use with NumPy.
  • MonetDB Built-In: Use the built-in SQL quantile() function in MonetDB. This function does not use a very efficient algorithm for computing the quantile, which is why it takes a while.
  • PandasCSV: Use the read_csv() function from the pandas library to load the data. This is an efficient C implementation of a CSV loader. However, we still need to decode the CSV and convert it into integer values, which means the loading takes a while.
  • Psycopg2: Load the data into Python using psycopg2, the default Python connector for Postgres, and then use numpy.percentile() to compute the quantile. This is slow because it constructs individual Python objects for every integer.
  • PL/Python: Load the data from a Postgres table using plpy.cursor() in a PL/Python function. This is slow because it loads the data as Python objects. There is also additional overhead from Postgres having to first gather all the relevant integers into a single array, as it is a row-store database.
  • SQLite: Call 'SELECT * FROM table' to load the data from SQLite into Python, using the built-in sqlite3 module. This implementation uses Python objects for integers instead of Numpy arrays, which carry a lot of additional overhead with regards to loading as we have to construct 250 million Python integer objects (which each have to be malloc'd individually).
  • Postgres Built-In: Perform the quantile computation using the built-in percentile_cont aggregate, with Postgres tuned for Data Warehousing using pgtune. 
  • CSV: Use the built-in Python csv library to load the data into a Numpy array. This is much less efficient than the pandas read_csv() function as it constructs Python objects. In addition, this CSV loader is written in pure Python, which carries additional overhead.
  • MonetDB MAPI: Use the current MonetDB Python client to load the data into Python. This is extremely slow because data has to be serialized over sockets. 
  • MySQL Cursor: Load the data into Python using the MySQL Python Connector. This is extremely slow because it sends the data to Python one row at a time.

Discussion

We have shown that our embedded Python implementation is fast, however, the true strength of using Python is the fact that it is so flexible. All the functions we have shown here can easily be implemented in SQL as well.

In the next blog post we demonstrate that MonetDB/Python can be used for much more than just emulating simple SQL functions by performing classification, including preprocessing, training and predicting, of a big dataset using MonetDB/Python combined with SQL, without the data ever leaving MonetDB. This blog post can be found here.

We would like to make this integration generally available with the next feature release of MonetDB. Until then, we would really appreciate feedback at M.Raasveldt@cwi.nl.

Installation

Currently MonetDB/Python is available in the in-development version of MonetDB in the default branch and must be compiled from source. Note that you will need NumPy installed for embedded python to work. You can look here for information on how to install NumPy. To compile the pythonudf branch from scratch, you can download a tar file of the source here. You can then compile the source by running the following commands in the root directory of the source tree.

./bootstrap
./configure --prefix=<install_directory>
make
make install

You can then run MonetDB by starting the monetdbd daemon or mserver5 in the given installation directory. Note that you will need to explicitly enable Python integration using the following commands if you are using the monetdbd daemon.

monetdb stop pytest
monetdb set embedpy=true pytest
monetdb start pytest

Or the following command if you are running mserver5.

mserver5 --set embedded_py=true

You can then connect to MonetDB and use embedded python.

About the Author

Mark Raasveldt is studying Computer Science at Utrecht University. He is currently working on his Master's thesis at the CWI Database Architectures Group. Surprisingly, he is working on embedding Python in MonetDB.

Footnotes

[1] The Numpy array is filled with Python objects of either type 'str' (if there are no unicode characters in the column) or 'unicode' (if there are unicode characters in the column).

[2] Note that we are running with 2 threads and forcing mitosis (mserver5 flags --set gdk_nr_threads=2 --forcemito). --forcemito forces mitosis on small tables, otherwise MonetDB would not split up a table of 5 entries.

Embedded R

Embedded R hannes Mon, 03/23/2020 - 15:56

For some time now, we have had the MonetDB.R package on CRAN, which allows you to connect to a MonetDB server from an R session. This package uses the MAPI protocol to talk to MonetDB over a standard socket. In addition to running queries, the package also supports the more high-level dplyr API. 

While we worked hard on making this integration as painless as possible, there was one issue that we could not solve: processing data from the database with R required transferring the relevant data over the socket first. This is fine and "fast enough" for up to, say, several thousand rows, but not for much more. We have had a lot of demand from users for transferring larger amounts of data. Hence, we chose to go in a different direction.

Starting with the Oct2014 release, MonetDB ships with a feature we call R-Integration. R has some support for running embedded in another application, a fact we leverage heavily. It makes R scripts first-class citizens of the SQL world in MonetDB: users can create ad-hoc functions much in the style of SQL functions. In the remainder of this blog post, we describe the installation, the usage and some initial performance observations of R-Integration.

Usage

R-Integration works by wrapping R code in an SQL function definition, so that the SQL compiler knows the input and output schema of the function. R functions can be used in various parts of an SQL query, for instance, as table-producing functions (used in the FROM clause), as projection functions (used in the SELECT clause), as filtering functions (used in the WHERE clause) and as aggregation functions (also in the SELECT clause, but together with a GROUP BY clause). 

We begin with a table-producing function

CREATE FUNCTION rapi00() RETURNS TABLE (d INTEGER) LANGUAGE R {
    seq(1,10);
};

In this example, there are no parameters going into the function, but it returns a table with a single integer-typed column. Inside the curly braces we see the function body, which is plain R code. This function generates the sequence of numbers from 1 to 10. If we want to call it, we can do so from the SQL prompt:

SELECT d FROM rapi00() AS r WHERE d > 5;

Here, the R function takes the FROM position in the query. Note how we filter the output by only selecting values greater than 5. The output of this function is

sql>SELECT d FROM rapi00() AS r WHERE d > 5;
+------+
| d    |
+======+
|    6 |
|    7 |
|    8 |
|    9 |
|   10 |
+------+

Here is a slightly more complicated example with a constant input parameter and two output columns:

CREATE FUNCTION rapi01(i integer) RETURNS TABLE (i INTEGER, d DOUBLE) LANGUAGE R {
    data.frame(i=seq(1, i), d=42.0);
};

Note how we can use the SQL input parameter inside the R code. It is automatically translated into the R world. Also note how we construct a data.frame to return the two-column result SQL expects. An example invocation and result would be

sql>SELECT i, d FROM rapi01(42) AS R WHERE i > 40;
+------+--------------------------+
| i    | d                        |
+======+==========================+
|   41 |                       42 |
|   42 |                       42 |
+------+--------------------------+

For the remaining examples, let's create some sample data:

CREATE TABLE rval(i INTEGER);
INSERT INTO rval VALUES (42),(43),(44),(45);
CREATE TABLE rvalg(groupcol INTEGER, datacol INTEGER);
INSERT INTO rvalg VALUES (1, 42), (1, 84), (2, 42), (2, 21);

The following R function will multiply the passed values. We can use it to do some computations on the projected columns in an SQL query:

CREATE FUNCTION rapi02(i INTEGER, j INTEGER) RETURNS INTEGER LANGUAGE R { i * j };

In this example, we will call this function with a constant as the second parameter:

sql>SELECT rapi02(i, 2) FROM rval;
+----------+
| rapi02_i |
+==========+
|       84 |
|       86 |
|       88 |
|       90 |
+----------+

We can also use the R functions in the WHERE clause of the query, where they can decide whether a row is included in the result or not:

CREATE FUNCTION rapi03(i INTEGER, j INTEGER) RETURNS BOOLEAN LANGUAGE R { i > j };
sql>SELECT * FROM rval WHERE rapi03(i, 44);
+------+
| i    |
+======+
|   45 |
+------+

Finally, we can use an R function to calculate an aggregation within the projection of a GROUP BY query:

CREATE AGGREGATE rapi04(val INTEGER) RETURNS DOUBLE LANGUAGE R {
    aggregate(val, by=list(aggr_group), FUN=median)$x
};

Note the keyword AGGREGATE and the magical R variable aggr_group, which contains an integer number denoting which group a row belongs to. If we run this function, we get the following result

sql>SELECT groupcol, rapi04(datacol) FROM rvalg GROUP BY groupcol;
+----------+--------------------------+
| groupcol | L1                       |
+==========+==========================+
|        1 |                       63 |
|        2 |                     31.5 |
+----------+--------------------------+

As a final note on usage, you can use any R package from CRAN within your function without explicitly installing it: we overloaded the library function in R to automatically install missing packages.
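As a hedged illustration of this behavior, a table-producing function along the following lines should work; the choice of package (MASS) and the function name are merely an example, not part of the shipped templates:

CREATE FUNCTION rapi05(n INTEGER) RETURNS TABLE (d DOUBLE) LANGUAGE R {
    library(MASS)  # automatically installed on first use if missing
    as.numeric(mvrnorm(n, mu=0, Sigma=matrix(1)))
};

It can then be called like the earlier table-producing examples, e.g. SELECT d FROM rapi05(5) AS r;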

Performance

Since we no longer need to serialize data onto a socket with our new solution, we expect great performance benefits. Also, since both MonetDB and R use a columnar data layout, little effort has to be spent on converting the values between these two systems. In fact, we have shown how we can gain a zero-copy integration between MonetDB and R in prototypical work. In the released version however, we invest one in-memory copy of the data for increased peace of mind. For this experiment, we have used the lineitem table of the well-known TPC-H benchmark at scale factor 100. This table contains 100 million rows. The test task was to calculate the .05 and .95 quantiles of the l_extendedprice column. The contenders in the benchmark were 

  • MonetDB itself, we have added quantile support some time ago (MonetDB)
  • R using the MonetDB.R socket connector, retrieving all columns, then calculate quantiles (R-full)
  • R using the MonetDB.R socket connector, retrieving a single column, then calculate quantiles (R-col)
  • PL/R, the R integration into PostgreSQL, initial approach (PL/R-naive)
  • PL/R, the R integration into PostgreSQL, tuned after feedback from Joe Conway (PL/R-tuned)
  • MonetDB with embedded R (RInt)

All experiments were run on a desktop-class machine with 16 GB of main memory and an 8-core Intel i7 processor. We tuned PostgreSQL for analytical workloads using the pgtune utility. We ran the aggregation over an increasing number of rows, from 1,000 to 100,000,000. All queries were repeated five times for each system and then averaged. The results are given below:

Performance Comparison

In this plot, we can see that R-Integration delivers superior performance to all other solutions, curiously even outperforming MonetDB's own quantile implementation. The reason for this is that R uses partial sorting for its quantile implementation, whereas MonetDB does a full sorting pass over the data to calculate quantiles. Missing data points are due to timeouts (60-second limit) or crashes.

For reference, we used the following R UDF for the winning system in this experiment:

CREATE AGGREGATE rquantile(arg1 double, arg2 double) RETURNS double LANGUAGE R { quantile(arg1, arg2) };

This was then run using the SQL query

select rquantile(cast(l_extendedprice as double), 0.05) as q5, rquantile(cast(l_extendedprice as double), 0.95) as q95 from lineitem;

Installation

For now, R-Integration is not shipped with the binary distributions of MonetDB due to compatibility and security concerns. Hence, you need to compile MonetDB from source with (just a few) custom parameters. In addition, you need to install an R version (or compile it from source, too) that includes the R shared library. On Linux systems, you can check whether you have a file named libR.so; if you installed R using your OS's package manager, this will most likely be the case. Compiling MonetDB from source is not hard. To enable R-Integration, you need to run the ./configure script as follows:

./configure --enable-rintegration=yes

If all goes well, the output will contain the following line in the summary at the end

rintegration is enabled 

If not, check that the R binary is in your $PATH, and that R was configured with the --enable-R-shlib=yes flag. You may also be required to set the $LD_LIBRARY_PATH variable to include the path to libR.so before starting MonetDB.

Startup

After R-Integration has been compiled into the MonetDB installation, you need to explicitly enable it during server startup. If you are using the monetdbd daemon, you can enable R-Integration as follows

monetdb stop rtest
monetdb set embedr=true rtest
monetdb start rtest

Where rtest is the name of your database. If you are running mserver5 directly, the parameter name is different:

mserver5 --set embedded_r=true

In the latter case, you will see MonetDB's startup message. If you see no error messages and

# MonetDB/R   module loaded

all is well.

MonetDBLite for Java

MonetDBLite for Java giulia Mon, 03/23/2020 - 16:00

Following in the footsteps of MonetDBLite for R and MonetDBLite for Python, we now have MonetDBJavaLite, which deploys MonetDBLite in a JVM with JDBC support. It has been tested on Linux, Mac and Windows. In the "lite" versions of MonetDB, both client and server run within the same process, avoiding inter-process communication such as a socket connection. Although the code is still somewhat experimental, it's worth trying out; please report your stories and complaints.

MonetDBJavaLite connections

We provide two APIs for MonetDBJavaLite: an Embedded API (non-standard) and the standard JDBC API. The former has been introduced for scenarios where performance is critical, such as when dealing with large result sets. The better performance comes at the cost of reduced portability. The Javadoc of this API can be found on our website.

A MonetDBLite JDBC connection is very similar to a regular connection with a MonetDB server. Differences between the two connections are:

  • Only one database is allowed per JVM process. However, multiple connections to the same database are allowed within the same process.
  • MonetDBLite is a compacted version of MonetDB with some features removed in order to reduce the size of the library, e.g. the GEOM module, MERGE TABLE and REMOTE TABLE, the JSON module, and the Data Vaults extension.
  • The authentication scheme is absent from MonetDBLite.
  • MonetDBJavaLite native code uses a separate heap from the JVM, which means that connections and result set descriptions must be explicitly closed to avoid memory leaks.
  • No two concurrent MonetDBJavaLite processes can use the same database farm simultaneously.

Embedded API

Start a database

A database must be started before any connection can be made. When starting the database, one can specify a path to the database farm. If no such path is supplied, an in-memory database is automatically used instead. With an in-memory database, data is not persisted on disk; on the other hand, transactions are held in memory, which yields better performance.

Path directoryPath = Files.createTempDirectory("testdb");
//start the database with the silent flag set (no debugging output) and disable the sequential pipeline.
MonetDBEmbeddedDatabase.startDatabase(directoryPath.toString(), true, false);
MonetDBEmbeddedConnection connection = MonetDBEmbeddedDatabase.createConnection();
connection.executeUpdate("CREATE TABLE example (words text)");
//...
connection.close();
MonetDBEmbeddedDatabase.stopDatabase(); //Don’t forget to shutdown at the end :)

Transactions

After a connection has been established, one can send queries to the embedded database and retrieve the results. The connection starts in auto-commit mode by default. The methods void startTransaction(), void commit() and void rollback() can be used for transaction management. The methods Savepoint setSavepoint(), Savepoint setSavepoint(String name), void releaseSavepoint(Savepoint savepoint) and void rollback(Savepoint savepoint) handle savepoints in transactions.

Update queries

For update queries (e.g. INSERT, UPDATE and DELETE), the method int executeUpdate(String query) is used to send the query to the server and get the number of rows affected.

connection.startTransaction();
int numberOfInsertions = connection.executeUpdate("INSERT INTO example VALUES ('monetdb'), ('java'), (null)");
connection.commit();
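A minimal sketch of savepoint usage, assuming the transaction and savepoint methods listed above; the savepoint name and the inserted value are illustrative:

connection.startTransaction();
Savepoint sp = connection.setSavepoint("beforeInsert");
connection.executeUpdate("INSERT INTO example VALUES ('temporary')");
connection.rollback(sp); // undo the insert, but keep the transaction open
connection.commit();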

Queries with result sets

For queries with result sets, one can use the method QueryResultSet executeQuery(String query) to send a query to the server, and retrieve the results using a QueryResultSet instance.

The result set metadata can be retrieved with the methods int getNumberOfRows(), int getNumberOfColumns(), void getColumnNames(String[] input) and void getColumnTypes(String[] input).

There are several ways to retrieve the results of a query. The family of methods T get#TYPE#ByColumnIndexAndRow(int column, int row) and
T get#TYPE#ByColumnNameAndRow(String columnName, int row) retrieve a single value from the result set. The column and row indexes for these methods (and the other methods in this family) start from 1, same as in JDBC.

A column of values can be retrieved using the family of methods void get#TYPE#ColumnByIndex(int column, T[] input, int offset, int length) and void get#TYPE#ColumnByName(String name, T[] input, int offset, int length). Note that the input array must be initialized beforehand. If there is no desire to provide the offset and length parameters, the methods void get#Type#ColumnByIndex(int column, T[] input) and get#Type#ColumnByName(String columnName, T[] input) can be used instead.

QueryResultSet qrs = connection.executeQuery("SELECT words FROM example");
int numberOfRows = qrs.getNumberOfRows();
int numberOfColumns = qrs.getNumberOfColumns();
String[] columnNames = new String[numberOfColumns];
qrs.getColumnNames(columnNames); //returns ['words']

String singleWord = qrs.getStringByColumnIndexAndRow(1, 1); //gets 'monetdb'
String[] wordsValues = new String[numberOfRows];
qrs.getStringColumnByIndex(1, wordsValues); //returns ['monetdb', 'java', null]
qrs.close(); //Don’t forget :)

To check if a boolean value is NULL, one can use the method boolean checkBooleanIsNull(int column, int row) of the class QueryResultSet. For all other data types, one can use the methods boolean Check#Type#IsNull(T value) of the class NullMappings.
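A minimal sketch of such a check on an integer column, assuming a result set qrs whose second column is an INTEGER; the method names follow the #Type# patterns listed above, so the exact casing may differ:

int[] numbers = new int[qrs.getNumberOfRows()];
qrs.getIntColumnByIndex(2, numbers);
for (int n : numbers) {
    // NULL integers arrive as the system's minimum value; the helper detects them
    System.out.println(NullMappings.checkIntIsNull(n) ? "NULL" : Integer.toString(n));
}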

Append data to a table

To append new data to a table, one can use the method int appendColumns(Object[] data) from the class MonetDBTable. The data should come as an array of columns, where each column has the same number of rows, and each array class corresponds to the mapping described in the data type mapping section below. To insert null values, use the constant T get#Type#NullConstant() from the class NullMappings.

connection.executeUpdate("CREATE TABLE interactWithMe (dutchGoodies text, justNumbers int)");
MonetDBTable interactWithMe = connection.getMonetDBTable("interactWithMe");
String[] goodies = new String[]{"eerlijk", "lekker", "smullen", "smaak", NullMappings.<String>getObjectNullConstant()};
int[] numbers = new int[]{2, 3, NullMappings.getIntNullConstant(), -1122100, -23123};
Object[] appends = new Object[]{goodies, numbers};
interactWithMe.appendColumns(appends);

Data type mapping

The Java programming language is strongly typed, thus the mapping between MonetDB SQL types and Java classes/primitives must be explicit. The use of Java primitives is favored for the most common MonetDB SQL types, resulting in fewer object allocations. For the more complex SQL types, such as Strings and Dates, the mapping is made to Java classes, matching the JDBC specification.

One important feature of MonetDB is that SQL NULL values are mapped onto the system's minimum values. In MonetDBJavaLite, this behavior persists for primitive types. NB: for the Java class mappings, SQL NULL values are translated into null objects! Rarer data types such as geometry, json, inet, url, uuid and hugeint are missing; these were removed from MonetDBLite to reduce the size of the library. Please check the GitHub documentation for details on the data type mapping.

Other methods provided in this API include Prepared Statements which are detailed in the GitHub documentation and the Javadocs.

JDBC API

To start a JDBC embedded connection, one must provide a JDBC URL in the format: jdbc:monetdb:embedded:[<directory>], where directory is the location of the database. To connect to an in-memory database the directory must be :memory: or not present.

When a JDBC embedded connection is started, the driver checks whether there is already a database instance running in the provided directory; otherwise an exception is thrown. When the last connection is closed, the database shuts down automatically.

//Connection con = DriverManager.getConnection("jdbc:monetdb:embedded:/home/user/myfarm"); //POSIX
//Connection con = DriverManager.getConnection("jdbc:monetdb:embedded:C:\\user\\myfarm"); //Windows
//Connection con = DriverManager.getConnection("jdbc:monetdb:embedded::memory:"); //in-memory mode
Statement st = con.createStatement();
st.executeUpdate("CREATE TABLE jdbcTest (justAnInteger int, justAString varchar(32))");
st.executeUpdate("INSERT INTO jdbcTest VALUES (1, 'testing'), (2, 'jdbc')");
ResultSet rs = st.executeQuery("SELECT justAnInteger, justAString FROM jdbcTest;");
while (rs.next()) {
    int justAnInteger = rs.getInt(1);
    String justAString = rs.getString(2);
    System.out.println(justAnInteger + " " + justAString);
}
rs.close(); //Don't forget! :)
st.close();
con.close();

Please check the GitHub documentation for differences between MonetDB’s JDBC socket and embedded connections.

Distribution

The project is hosted and maintained on Github: https://github.com/hannesmuehleisen/MonetDBLite-Java

Two JAR files are distributed: monetdb-java-lite (~6.4 MB) and monetdb-jdbc-new (~150 KB). The former depends on the latter and contains the MonetDBLite library adapted for the JVM; it is compatible with JVM 8 onwards only and also provides native libraries for 64-bit Linux, MacOS X and Windows. The latter is a fork of MonetDB's JDBC driver and is used for JDBC connections.

Both JARs are hosted on Maven Central repository.

  • Apache Maven:
<dependency>
    <groupId>monetdb</groupId>
    <artifactId>monetdb-java-lite</artifactId>
    <version>2.33</version>
</dependency>
  • Gradle: compile 'monetdb:monetdb-java-lite:2.33'

Otherwise, it is possible to download the JARs from our website and add them to the CLASSPATH.

Developer and support

MonetDBJavaLite is being supported by Pedro Ferreira, a software developer at MonetDB Solutions. Feel free to sign up and send your questions to the MonetDB users-list.

User Defined Functions

User Defined Functions mk Thu, 04/01/2010 - 22:16

An open-source solution provides a stepping stone for others to extend its kernel functionality with specific types and functions. Experience shows that the need for this is fairly limited. Often the built-in data types, the MAL algebra and functional abstraction provide the necessary toolkit to achieve your goal.

In the few cases where the MonetDB kernel and SQL runtime system need extensions, this calls for access to the MonetDB source code and proficiency in C programming, compilation and debugging. The openness of MonetDB means that extensions are not sandboxed; they run within the system address space. Moreover, the multi-layered architecture means you have to make the functions written in C known to the MAL interpreter before they can be made known to the SQL compiler. The current setup makes this a little cumbersome, but the added benefit is that both simple scalar functions and columnar operations can be introduced.

In this section we show how to extend SQL with a simple scalar function to reverse a string, i.e.

sql> select 'hello',reverse('hello');
+-------+-------+
| hello | olleh |
+-------+-------+

step 1. You should access and be able to compile and install MonetDB in a private directory.

step 2. Go to the sql/backends/monet5/UDF directory from the sources top directory. It provides the reverse example as a template. A group of user-defined functions is assembled in a directory like UDF. It contains files that describe the SQL signature, the MAL signature, and the C-code implementation.

step 3. The extension starts with a definition of the MAL signatures. See the example given, or browse through the files in monetdb5/modules/mal/*.mal to get a glimpse of how to write them. The MonetDB kernel documentation provides more details. The file contains the MAL snippet:
command reverse(ra1:str):str
address UDFreverse
comment "Reverse a string";

step 4. The signature says that it expects a command body implementation under the name UDFreverse, shown below. The C signature is a direct mapping, where arguments are passed by reference and the references to the return value(s) come first in the argument list. The body should return a (malloced) string to denote an exception being raised, or MAL_SUCCEED upon success.
#include "udf.h"

static str
reverse(const char *src)
{
    size_t len;
    str ret, new;

    /* The scalar function returns the new space */
    len = strlen(src);
    ret = new = GDKmalloc(len + 1);
    if (new == NULL)
        return NULL;
    new[len] = 0;
    while (len > 0)
        *new++ = src[--len];
    return ret;
}

str
UDFreverse(str *ret, str *src)
{
    if (*src == 0 || strcmp(*src, str_nil) == 0)
        *ret = GDKstrdup(str_nil);
    else
        *ret = reverse(*src);
    return MAL_SUCCEED;
}

step 5. The next step is to register the routine in the SQL catalog. This calls for an SQL statement to be executed once for each database. The autoload mechanism relieves you from loading the modules manually in the server after each restart. The UDF template contains the files 80_udf.sql and 80_udf.mal. The former contains the definition needed for SQL:
create function reverse(src string)
returns string external name udf.reverse;

step 6. The MAL interpreter should be informed about the linked-in functionality. This is facilitated using an autoload feature too. The MAL script simply contains the module signature.

include udf;

step 7. After all pieces are prepared, you have to call the bootstrap program in the root of your checked-out source tree once. Thereafter, a configure/make/make install cycle compiles the code and places the interface files and libraries in the proper locations.

Creation of bulk and polymorphic operations requires much more care. In general, it is best to find an instruction that is already close to what you need: clone it, expand it, compile it, and test it. A bulk variation of the reverse operation is included in the sample UDF template (see the sketch below). As a last resort, you can contact us on the mailing lists for further advice.
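For orientation only, the bulk counterpart might be declared roughly as follows; the operation name, implementation name and the exact BAT type notation are assumptions that depend on the MonetDB version and the UDF template at hand:

command batreverse(b:bat[:oid,:str]):bat[:oid,:str]
address UDFBATreverse
comment "Reverse a BAT of strings";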
 
