Data manipulation

Data manipulation mk Sat, 03/27/2010 - 22:42

Table updates

Table updates mk Thu, 06/30/2011 - 16:01

Data insertions

A table can be populated using an insert statement. It takes a table name and a list of value expressions. The expression result types should align with the columns in the table definition; otherwise the column-value association must be made explicit using a column name list. Multiple rows can be inserted in a single statement.

The result of a query can be bulk inserted into a table, provided both source and destination are type compatible. Inserting a table into itself effectively doubles its content, provided none of the table constraints is violated.

[ WITH cte_alias AS SELECT_query [, cte_alias AS SELECT_query , ...] ]
INSERT INTO [ schema_name '.' ] table_name [ '(' column_name [ ',' ... ] ')' ]
{ VALUES '(' value_expression [ ',' ... ] ')' [ ',' ... ] | select_query }

value_expression : { scalar_exp | DEFAULT | NULL }

You must have INSERT privilege for the table. The creator/owner of the table will have INSERT privilege automatically.
The "WITH cte_alias AS SELECT_query" option is supported from release Apr2019 (11.33.3) onwards.

MonetDB/SQL does not support data insertions on views.
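
For example, assuming a hypothetical table t1 (id INT, name VARCHAR(20)), the following statements are all valid insertions:

    INSERT INTO t1 VALUES (1, 'one');
    -- explicit column list, multiple rows
    INSERT INTO t1 (name, id) VALUES ('two', 2), ('three', 3);
    -- bulk insert from a query; inserting a table into itself doubles its content
    INSERT INTO t1 SELECT id + 10, name FROM t1;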

Data updates

The update statement syntax follows the SQL standard, but its semantics for bulk updates on keys may differ slightly from what is expected in other systems. In particular, the update implementation ensures that you can freely update any column without the danger of run-away values. MonetDB/SQL does not support updates through views.

[ WITH cte_alias AS SELECT_query [, cte_alias AS SELECT_query , ...] ]
UPDATE [ schema_name '.' ] table_name  [ [AS] table_alias ]
SET assignment [ ',' assignment ... ]
[ WHERE search_condition ]

assignment : column_name '=' { scalar_exp  | search_condition | NULL | DEFAULT }

You must have UPDATE privilege for the table or column(s). The creator/owner of the table will have UPDATE privilege automatically.
The "WITH cte_alias AS SELECT_query" option as well as the table alias are supported from release Apr2019 (11.33.3) onwards.

MonetDB/SQL does not support data updates on views.
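
For example, using the same hypothetical table t1 (id INT, name VARCHAR(20)):

    UPDATE t1 SET name = 'uno' WHERE id = 1;
    -- using a table alias and resetting a column to its default
    UPDATE t1 AS t SET name = DEFAULT WHERE t.id > 100;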

Data deletions

[ WITH cte_alias AS SELECT_query [, cte_alias AS SELECT_query , ...] ]
DELETE FROM [ schema_name '.' ] table_name  [ [AS] table_alias ]
[ WHERE search_condition ]

You must have DELETE privilege for the table. The creator/owner of the table will have DELETE privilege automatically.
The "WITH cte_alias AS SELECT_query" option as well as the table alias are supported from release Apr2019 (11.33.3) onwards.

MonetDB/SQL does not support data deletions on views.
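
For example, again using the hypothetical table t1:

    DELETE FROM t1 WHERE id > 100;
    -- without a WHERE clause all rows are deleted, one by one;
    -- for that purpose TRUNCATE TABLE (below) is faster
    DELETE FROM t1;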

 

To quickly delete all rows in a table use TRUNCATE TABLE.

TRUNCATE [ TABLE ] [ schema_name '.' ] table_name
[ CONTINUE IDENTITY | RESTART IDENTITY ]
[ RESTRICT | CASCADE ]

You must have TRUNCATE privilege for the table. The creator/owner of the table will have TRUNCATE privilege automatically.

A 'CONTINUE IDENTITY' or 'RESTART IDENTITY' clause determines whether an identity sequence, if present in the table, is restarted. The default is 'CONTINUE IDENTITY', i.e. sequence numbering continues.
The 'CASCADE' option instructs to also truncate referencing table(s) when they have foreign key references to this table. The default behavior is 'RESTRICT'.

Note: it is possible to use TRUNCATE statements in a transaction and thus to roll back the effects of a truncate.

MonetDB/SQL does not support truncations of data on views.
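
Since TRUNCATE is transaction-safe, its effects can be undone. For example, on a hypothetical table t1:

    START TRANSACTION;
    TRUNCATE TABLE t1 RESTART IDENTITY;
    -- the rows of t1 (and its identity sequence) are restored
    ROLLBACK;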

Table expressions

Table expressions mk Sat, 03/27/2010 - 22:45

When a query is formulated, a table can be referenced in different ways: by its name, by a select query, or by a join result. The syntax to refer to a table is:

table_reference:
      simple_table
    | joined_table [ [AS] alias [ '(' column_name [',' ...] ')' ] ]
    | '(' select_query ')' [ [AS] alias [ '(' column_name [',' ...] ')' ] ]
    | '(' VALUES '(' tuple ')' [, '(' tuple ')' [, '('...')' ] ] ')' [ [AS] alias [ '(' column_name [',' ...] ')' ] ]

joined_table:
      '(' joined_table ')'
    | table_reference CROSS JOIN table_reference
    | table_reference NATURAL [ join_type ] JOIN table_reference
    | table_reference [ join_type ] JOIN table_reference join_spec
    | table_reference UNIONJOIN table_reference join_spec

join_type:
      INNER
    | { LEFT | RIGHT | FULL } [ OUTER ]

join_spec:
      ON search_condition
    | USING '(' column_name [',' ...] ')'

When no join_type is specified, INNER is assumed.
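
For example, the following queries are equivalent ways to join two hypothetical tables emp (dept_id, name) and dept (dept_id, dept_name):

    SELECT e.name, d.dept_name FROM emp AS e JOIN dept AS d ON e.dept_id = d.dept_id;
    SELECT e.name, d.dept_name FROM emp AS e INNER JOIN dept AS d USING (dept_id);
    -- NATURAL JOIN joins on all columns with matching names, here only dept_id
    SELECT name, dept_name FROM emp NATURAL JOIN dept;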

The SQL framework for table expressions is based on the select-from-where construct.

select_query:
    [ with_clause ]
    SELECT [ ALL | DISTINCT ]   ( '*' | column_expr [',' ...] )
    [ INTO column_expr [',' ...] ]
    [ FROM table_reference [',' ... ] ]
    [ WINDOW window_definition_list ]
    [ WHERE search_condition ]
    [ GROUP BY column_expr [',' ...] ]
    [ HAVING search_condition ]
    [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] [ CORRESPONDING [ BY '(' column_ref_commalist ')' ] ] select_query ]
    [ ORDER BY ordering_spec [',' ...] ]
    [ LIMIT posint ]
    [ OFFSET posint ]
    [ SAMPLE posint [ SEED integer ] ]

Since the Apr2019 release (11.33.3), expressions are allowed in the GROUP BY clause. The same expressions can be used in the projection clause, if and only if they are literally equal, e.g.: SELECT count(*)*(col1+col2) AS total FROM t1 GROUP BY col1 + col2.

The WITH clause prefix (aka Common Table Expressions (CTE)) provides the mechanism to introduce temporary in-line view definitions:

with_clause:
    WITH cte_element [',' cte_element [',' ...] ]

cte_element:
    query_alias [ '(' column_name [',' ...] ')' ] AS '(' select_query ')'
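
A small example, using a hypothetical sales (region, amount) table:

    WITH totals (region, total) AS (
        SELECT region, SUM(amount) FROM sales GROUP BY region
    )
    SELECT region, total FROM totals WHERE total > 1000;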

 

Predicates are used in search conditions to compare and test values. Predicates can be nested, i.e. a predicate can be a sub-predicate of another. The following syntax description covers all the predicate forms supported by MonetDB.

predicate:
       comparison_predicate
     | between_predicate
     | like_predicate
     | test_for_null
     | in_predicate
     | all_or_any_predicate
     | existence_test
     | '(' predicate ')'

comparison_predicate:
       pred_expr COMPARISON pred_expr
     | pred_expr = pred_expr

between_predicate:
     pred_expr [NOT] BETWEEN [SYMMETRIC | ASYMMETRIC] pred_expr AND pred_expr

like_predicate:
     pred_expr [NOT] { LIKE | ILIKE } atom_expr

test_for_null:
     column_name IS [NOT] NULL

in_predicate:
     pred_expr [NOT] IN '(' element_name ',' ... ')'

all_or_any_predicate :
     pred_expr COMPARISON { ANY | ALL | SOME } subquery

existence_test:
     [NOT] EXISTS subquery

pred_expr:
       scalar_expr
     | predicate

atom_expr:
       atom [ ESCAPE string ]
     | '(' atom ')' [ ESCAPE string ]
     | '(' atom_expr ')'
     | '?'
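
For instance, matching a literal '%' in a LIKE pattern requires an ESCAPE character (table and column names hypothetical):

    -- matches descriptions containing the substring '50%'
    SELECT * FROM products WHERE descr LIKE '%50!%%' ESCAPE '!';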

string_funcs:
      scalar_expr '||' scalar_expr
    | SUBSTRING '(' scalar_expr FROM scalar_expr [ FOR scalar_expr ] ')'
    | SUBSTRING '(' scalar_expr ',' scalar_expr [ ',' scalar_expr ] ')'
    | SUBSTRING '(' scalar_expr int_value scalar_expr ')'
   

 

Window functions

Window functions mk Sun, 10/13/2013 - 13:52

SQL provides a method to aggregate over a series of related tuples: window functions. They always come with an OVER() clause, which determines how the tuples are distributed over the window function. The PARTITION BY clause within OVER divides the rows into groups that share the same values of the PARTITION BY expression(s). For each row, the window function is computed over all rows participating in its group. The order within a partition can be used as well.

window_function_spec:
    window_function OVER   { window_name | '(' window_specification ')' }

window_function:
      RANK '(' ')'
    | DENSE_RANK '(' ')'
    | PERCENT_RANK '(' ')'
    | ROW_NUMBER '(' ')'
    | CUME_DIST '(' ')'
    | FIRST_VALUE '(' query_expression ')'
    | LAST_VALUE '(' query_expression ')'
    | NTH_VALUE '(' query_expression ',' query_expression ')'
    | NTILE '(' query_expression ')'
    | LEAD '(' query_expression [ ',' query_expression [ ',' query_expression ] ] ')'
    | LAG '(' query_expression [ ',' query_expression [ ',' query_expression ] ] ')'
    | aggregate_function

aggregate_function:
      COUNT '(' '*' ')'
    | COUNT '(' query_expression ')'
    | MAX '(' query_expression ')'
    | MIN '(' query_expression ')'
    | SUM '(' query_expression ')'
    | PROD '(' query_expression ')'
    | AVG '(' query_expression ')'

window_specification:
    [ window_name ]   [ PARTITION BY column_ref   [ ',' ... ] ]   [ ORDER BY sort_spec ]   [ window_frame ]

window_frame:
    { ROWS | RANGE | GROUPS }   { window_frame_start | BETWEEN window_bound AND window_bound }
        [ EXCLUDING { CURRENT ROW | GROUP | TIES | NO OTHERS }   ]

window_frame_start:
     UNBOUNDED PRECEDING
   | value PRECEDING
   | CURRENT ROW

window_bound:
    window_frame_start
   | UNBOUNDED FOLLOWING
   | value FOLLOWING

Supported Window Functions:
    RANK() : BIGINT - Returns the rank number within a partition, starting at 1.
    DENSE_RANK() : BIGINT - Returns the rank of the current row without gaps, it counts peer groups.
    PERCENT_RANK() : DOUBLE - Calculates the relative rank of the current row: (rank() - 1) / (rows in partition - 1).
    ROW_NUMBER() : BIGINT - Returns the position of the tuple currently in the result set, starting at 1.
    CUME_DIST() : DOUBLE - Calculates the cumulative distribution: number of rows preceding or peer with current row / rows in partition.
    FIRST_VALUE(input A) : A - Returns input value at first row of the window frame.
    LAST_VALUE(input A) : A - Returns input value at last row of the window frame.
    NTH_VALUE(input A, nth BIGINT) : A - Returns input value at “nth” row of the window frame. If there is no “nth” row in the window frame, then NULL is returned.
    NTILE(nbuckets BIGINT) : BIGINT - Enumerates rows from 1 in each partition, dividing it in the most equal way possible.
    LAG(input A [, offset BIGINT [, default_value A ] ]) : A - Returns input value at row “offset” before the current row in the partition. If the offset row does not exist, then the “default_value” is output. If omitted, “offset” defaults to 1 and “default_value” to NULL.
    LEAD(input A [, offset BIGINT [, default_value A ] ]) : A - Returns input value at row “offset” after the current row in the partition. If the offset row does not exist, then the “default_value” is output. If omitted, “offset” defaults to 1 and “default_value” to NULL.
    MIN(input A) : A
    MAX(input A) : A
    COUNT(*) : BIGINT
    COUNT(input A) : BIGINT
    SUM(input A) : A
    PROD(input A) : A
    AVG(input A) : DOUBLE.

The supported frames are:
    ROWS - Frames are calculated on physical offsets of input rows.
    RANGE - Result frames are calculated on value differences from input rows (used with a custom PRECEDING or FOLLOWING bound requires an ORDER BY clause).
    GROUPS - Groups of equal row values are used to calculate result frames (requires an ORDER BY clause).
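
For example, a three-row moving average over a hypothetical measurements (ts, val) table uses a ROWS frame:

    SELECT ts, val,
           AVG(val) OVER (ORDER BY ts ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg
      FROM measurements;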

See also en.wikibooks.org/wiki/Structured_Query_Language/Window_functions

 

Examples:

create table ranktest (id int, k varchar(3));
insert into ranktest values (1, 'a'),(2, 'b'),(3, 'c'),(4, 'd');
insert into ranktest values (1, 'a'),(2, 'b'),(3, 'c'),(4, 'd');

select ROW_NUMBER() over () as foo from ranktest;
select ROW_NUMBER() over (PARTITION BY id) as foo, id from ranktest;
select ROW_NUMBER() over (PARTITION BY id ORDER BY id) as foo, id from ranktest;
select ROW_NUMBER() over (ORDER BY id) as foo, id from ranktest;

select RANK() over () as foo from ranktest;
select RANK() over (PARTITION BY id) as foo, id from ranktest;
select RANK() over (PARTITION BY id ORDER BY id) as foo, id from ranktest;
select RANK() over (ORDER BY id) as foo, id from ranktest;

select RANK() over () as foo, id, k from ranktest;
select RANK() over (PARTITION BY id) as foo, id, k from ranktest;
select RANK() over (PARTITION BY id ORDER BY id, k) as foo, id, k from ranktest;
select RANK() over (ORDER BY id, k) as foo, id, k from ranktest;

select DENSE_RANK() over () as foo, id, k from ranktest order by k;
select DENSE_RANK() over (PARTITION BY id) as foo, id, k from ranktest order by k;
select DENSE_RANK() over (PARTITION BY id ORDER BY id, k) as foo, id, k from ranktest order by k;
select DENSE_RANK() over (ORDER BY id, k) as foo, id, k from ranktest order by k;
drop table ranktest;

For more examples see extended_sql_window_functions

Statistics

Statistics mk Tue, 11/19/2013 - 08:52

The SQL implementation provides the well-known standard SQL aggregate functions COUNT(*|...), COUNT(DISTINCT ...), SUM(...), AVG(...), MIN(...) and MAX(...) over scalar types/expressions and groupings. In addition, a few important statistical aggregate functions: MEDIAN, QUANTILE, STDDEV, VAR and correlation CORR are available.

extra_statistical_aggregate_functions:
      sys.median '(' scalar_expr ')'
    | sys.median_avg '(' scalar_expr ')'
    | sys.quantile '(' scalar_expr ',' scalar_expr ')'
    | sys.quantile_avg '(' scalar_expr ',' scalar_expr ')'
    | sys.stddev_samp '(' scalar_expr ')'
    | sys.stddev_pop '(' scalar_expr ')'
    | sys.var_samp '(' scalar_expr ')'
    | sys.var_pop '(' scalar_expr ')'
    | sys.corr '(' scalar_expr ',' scalar_expr ')'

For sys.quantile the percentile argument is a float value between 0.0 and 1.0. sys.median(<expr>) is equivalent to sys.quantile(<expr>, 0.5).
sys.stddev_samp computes the cumulative sample standard deviation and returns the square root of the sample variance. sys.stddev_pop computes the population standard deviation and returns the square root of the population variance. Both functions take as an argument any numeric datatype.
Likewise, sys.var_samp and sys.var_pop functions return the sample variance (/n-1) of a set of numbers and the biased variance (/n) of a set of numbers, respectively.

Note: The aggregate functions sys.median_avg and sys.quantile_avg are added in Nov2019 (11.35.3) release. They return the interpolated value if the median/quantile doesn't fall exactly on a particular row. These functions always return a value of type DOUBLE and only work for numeric types (various width integers, decimal and floating point).

Usage example:

   create schema aggr_tst;
   set schema aggr_tst;
   create table tc (c real);
   insert into tc values (1), (2), (3), (4), (5), (9);
   select * from tc;
   select count(*) countstar, COUNT(c) count, COUNT(DISTINCT c) countdistinct
        , SUM(c) sum, AVG(c) average, PROD(c) product, MIN(c) minimum, MAX(c) maximum
        , sys.MEDIAN(c) median
        , sys.MEDIAN_AVG(c) median_avg
        , sys.QUANTILE(c, 0.5) quantile
        , sys.QUANTILE_AVG(c, 0.5) quantile_avg
        , sys.STDDEV_SAMP(c) stddev_samp
        , sys.STDDEV_POP(c) stddev_pop
        , sys.VAR_SAMP(c) var_samp
        , sys.VAR_POP(c) var_pop
        , sys.CORR(c, c+1) corr
        , sys.group_concat(c) group_concat
        , sys.group_concat(c, '|') group_concat_mysep
    from tc;
   drop table tc;
   set schema sys;
   drop schema aggr_tst;

 

Tip: To view all the available aggregate functions in your MonetDB server use query:

SELECT * FROM sys.functions where type = 3;

 

Bulk input/output

Bulk input/output mk Sun, 03/28/2010 - 00:48

The COPY INTO command enables fast insertion of multiple tuples from a text file or from standard input. Each tuple in the input file is terminated by a record separator (default '\n') and fields are separated by the field separator (default '|'). The field values should adhere to the syntax for value literals. Alternative separators should adhere to the SQL lexical syntax for string values. A different NULL value representation can be passed using the NULL AS null_string option. Furthermore, the fields are optionally enclosed with a user-defined quote character. The text file should use UTF-8 encoding if specified as file_name, and the same encoding as mclient is using if read using FROM STDIN.

It is strongly advised to announce the maximum number of records to be inserted. This avoids guessing by the server and subsequent re-allocation of table space, which may involve potentially expensive copying. A portion of the input file can be skipped using the offset feature. The offset value specifies the record at which loading should commence, the first record having offset 1.

As of the Aug2018-SP2 release, if using FROM STDIN and the number of records is not specified, reading the input stops at an empty line. This means that if a one column table is being read, there may be confusion as to whether an empty line is an empty (string) value or a NULL (if NULL AS '' is specified). The end of input takes precedence here.

As of the Apr2019 release (11.33.3), the options ON CLIENT and ON SERVER can be used. ON CLIENT allows the client to read/write files on the client side instead of on the server. This has the advantage that COPY INTO is no longer restricted to the "super user" monetdb, nor to absolute file names. The syntax to have the server communicate with the client for file content is COPY INTO table FROM file ON CLIENT ...; and COPY query INTO file ON CLIENT ...;. This also works for COPY BINARY INTO. Alternatively, ON SERVER specifies that files are to be read/written by the server; this is also the default when neither ON CLIENT nor ON SERVER is specified. In that case the file must be accessible by the server, so it must reside on or be accessible to the database server machine, and it must be identified with an absolute path name.

The STDIN file designator reads data streaming from the client application. An empty record determines the end of sequence.

COPY   {  [ int_val [ OFFSET int_val ] RECORDS ]  |  [ int_val RECORDS ] OFFSET int_val  }
      INTO table_name
      FROM ['('] file_name ',' ... [')']
      [ ON { CLIENT | SERVER } ]
      [ [ USING ] DELIMITERS field_separator [',' record_separator [',' string_quote ] ] ]
      [ NULL AS null_string ] [ LOCKED ] [ BEST EFFORT ] [ NO CONSTRAINT ]
      [ FWF '(' pos [ ',' pos [ ... ] ] ')' ]

COPY   {  [ int_val [ OFFSET int_val ] RECORDS ]  |  [ int_val RECORDS ] OFFSET int_val  }
      INTO table_name
      FROM STDIN
      [ [ USING ] DELIMITERS field_separator [',' record_separator [',' string_quote ] ] ]
      [ NULL AS null_string ] [ LOCKED ] [ BEST EFFORT ] [ NO CONSTRAINT ]

The input syntax should comply with the following grammar: [ [ [quote] [[escape]char]* [quote] ] field_separator ]* record_separator. Quote characters in quoted fields may be escaped with a backslash. Field and record separators can be embedded in quoted fields.
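
For example, loading a client-side comma-separated file into a hypothetical table t1 (the file path is illustrative):

    COPY 1000 RECORDS INTO t1 FROM '/tmp/data.csv' ON CLIENT
    USING DELIMITERS ',', '\n', '"'
    NULL AS '';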

LOCKED mode

In many bulk loading situations, the original file can be saved as a backup or recreated for disaster handling. This relieves the database system from having to prepare for recovery as well, and saves significant storage space. The LOCKED qualifier can be used in this situation (and in single-user mode!) to skip the logging operation normally performed.

WARNING: It is advised to add integrity constraints to the table after the file has been loaded. The ALTER statements perform bulk integrity checking, which is often more efficient.

For more see the CSV bulk load recipe.

Copy into File

The COPY INTO command with a file name argument allows for fast dumping of a result set into an ASCII file. The file must be accessible by the server and a full path name may be required. The file STDOUT can be used to direct the result to the primary output channel.

The delimiters and NULL AS arguments provide control over the layout required.  

COPY data_query
      INTO file_name
      [ ON { CLIENT | SERVER } ]
      [ [ USING ] DELIMITERS field_separator [',' record_separator [',' string_quote ] ] ]
      [ NULL AS null_string ]

COPY data_query
      INTO STDOUT
      [ [ USING ] DELIMITERS field_separator [',' record_separator [',' string_quote ] ] ]
      [ NULL AS null_string ]

For both the input and output versions of the COPY INTO commands one can specify a file name ending with '.gz' or '.bz2' or '.xz' or '.lz4' to use the appropriate compression library (if available).
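
For example, exporting a query result to a compressed client-side file (path hypothetical; the '.gz' suffix triggers gzip compression if the library is available):

    COPY SELECT * FROM t1 INTO '/tmp/t1.csv.gz' ON CLIENT
    USING DELIMITERS ',', '\n', '"'
    NULL AS 'NULL';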

For more see the Exporting bulk data recipe.

Copying binary files

Migration of tables between MonetDB/SQL instances can be sped up using the COPY BINARY INTO format.

For more see the Binary bulk data load recipe.

Aggregate into a single string

Aggregate into a single string pedro Fri, 10/05/2018 - 11:41

In the Aug2018 release of MonetDB, we added a new aggregate function "sys.group_concat", which aggregates an input column into a single output string. We provide two versions of this aggregate: sys.group_concat(string) and sys.group_concat(string, string). In both versions, the first parameter is the input string column to be concatenated. In the former, the default delimiter is the ',' character; in the latter, the second parameter indicates the separator to be used. If a group contains a NULL value, or the delimiter is NULL, the output for that group is NULL.

sql>create table demo (a int, b clob);
operation successful
sql>insert into demo values (1, 'chair'), (1, 'desk'), (2, 'room'), (1, 'decoration'), (2, 'window'), (2, 'sofa');
6 affected rows
sql>select '[' || sys.group_concat(a) || ']' from demo;
+---------------+
| L3            |
+===============+
| [1,1,2,1,2,2] |
+---------------+
1 tuple
sql>select a, sys.group_concat(b) from demo group by a;
+------+-----------------------+
| a    | L4                    |
+======+=======================+
| 1    | chair,desk,decoration |
| 2    | room,window,sofa      |
+------+-----------------------+
2 tuples
sql>select a, sys.group_concat(b, '|') from demo group by a;
+------+-----------------------+
| a    | L5                    |
+======+=======================+
|    1 | chair|desk|decoration |
|    2 | room|window|sofa      |
+------+-----------------------+
2 tuples
sql>insert into demo values (3, 'car'), (3, NULL);
2 affected rows
sql>select '[' || sys.group_concat(b, '-') || ']' from demo group by a;
+-------------------------+
| L6                      |
+=========================+
| [chair-desk-decoration] |
| [room-window-sofa]      |
| null                    |
+-------------------------+
3 tuples