Data types

Data types mk Sat, 03/27/2010 - 22:30

SQL prescribes a large collection of built-in types, most of which are directly supported. The MonetDB database kernel can also be extended with user defined types to better support application domains.

Some domain specific data types (e.g. GeoSpatial) are not part of the MonetDB core, but are provided by extensions. For more information check the Spatial data types article.

Builtin SQL types

Builtin SQL types mk Thu, 04/01/2010 - 22:20
CHAR [ '(' length ')' ]  | CHARACTER [ '(' length ')' ] UTF-8 character string with length upperbound limit. CHAR or CHARACTER without the "( length )" specification are treated as CHAR(1).
Note: currently no spaces are padded at the end
   
VARCHAR '(' length ')' | CHARACTER VARYING '(' length ')' UTF-8 character string with length upperbound limit
   
TEXT | CLOB | CHARACTER LARGE OBJECT | STRING UTF-8 character string with unbounded length
   
BLOB | BINARY LARGE OBJECT bytes string with unbounded length
   
BOOLEAN | BOOL logic values: true or false
   
TINYINT 8 bit signed integer between -127 and 127
SMALLINT 16 bit signed integer between -32767 and 32767
INT | INTEGER 32 bit signed integer between -2147483647 and 2147483647
BIGINT 64 bit signed integer between -9223372036854775807 and 9223372036854775807
   
HUGEINT 128 bit signed integer between -2127 +1 and +2127 -1 (±170141183460469231731687303715884105727)
Note: HUGEINT is only available on platforms with a C-compiler that supports the __int128 or __int128_t data type (e.g., recent gcc, clang, & icc on Linux or MacOS X) and from Jul2015 release onwards
   
DECIMAL '(' Prec ',' Scale ')' |
DEC '(' Prec ',' Scale ')' |
NUMERIC '(' Prec ',' Scale ')'
Exact decimal number with precision Prec and scale Scale. Prec must be between 1 and 18 (or 38 when HUGEINT is also supported). Scale must be between 0 and Prec
   
DECIMAL '(' Prec ')' |
DEC '(' Prec ')' |
NUMERIC '(' Prec ')'
Exact decimal number with precision Prec and scale 0. Prec must be between 1 and 18 (or 38 when HUGEINT is also supported).
   
DECIMAL | DEC | NUMERIC Exact decimal number with precision 18 and scale 3.
   
REAL 32 bit floating point approximate number
FLOAT | DOUBLE | DOUBLE PRECISION 64 bit floating point approximate number
FLOAT '(' Prec ')' floating point approximate number with binary precision Prec. Prec must be between 1 and 53. FLOAT(24) is same as REAL, FLOAT(53) is same as DOUBLE

 

All scalar types include a NULL value, which is internally represented as a valid domain value. For numerical types, this is the smallest value in the type's domain (i.e., the one omitted in the ranges given above). Arithmetic expressions that overflow may lead to returning the NULL instead.

The integer types align with the storage of 1, 2, 4, 8 and 16 bytes.

The numeric and decimal types are represented as fixed length integers, whose decimal point is produced during result rendering.

The types REAL, FLOAT and DOUBLE map to the underlying implementation system. No special attention is given to the value NaN.

Temporal types

Temporal types mk Thu, 06/30/2011 - 09:46

MonetDB/SQL supports a rich set of time-related operations within the kernel. The starting point are SQL 92 and the ODBC time-related data types.  There are six basic types and operations on them:

DATE a date in the Gregorian calendar YYYY-MM-DD, e.g. 1999-12-31
TIME  [ '(' posint ')' ] time of day (HH.MI.SS.sss) with precision (default 0)
TIME  [ '(' posint ')' ] WITH TIME ZONE time of day (HH.MI.SS.sss) with precision (default 0) and time zone information
TIMESTAMP [ '(' posint ')' ] date concatenated with unique time, precision
TIMESTAMP [ '(' posint ')' ] WITH TIME ZONE date concatenated with unique time, precision and time zone information
INTERVAL interval_qualifier a temporal interval such as: INTERVAL SECOND or INTERVAL MONTH

interval_qualifier :
    { YEAR | MONTH | DAY | HOUR | MINUTE | SECOND [time_precision] }
   |  start_field  TO  end_field

start_field :
   
{ YEAR | MONTH | DAY | HOUR | MINUTE }

end_field:
   
{ YEAR | MONTH | DAY | HOUR | MINUTE | SECOND [time_precision] }

time_precision :
     '(' posint ')'

A timestamp is a combination of date and time, indicating an exact point in time (GMT). GMT is the time at the Greenwich meridian without a daylight savings time (DST) regime. Absence of DST means that hours are consecutive (no jumps) which makes it easy to perform time difference calculations.

The local time is often different from GMT (even at Greenwich in summer, as the UK also has DST). Therefore, whenever a timestamp is composed from a local date and time a timezone should be specified in order to translate the local time to GMT (and vice versa if a timestamp is to be decomposed in a local date and time). To adjust the local time can issue a command such as SET TIME ZONE INTERVAL '1' HOUR TO MINUTE.

We provide predefined timezone objects for a number of timezones (see below). Also, there is one timezone called the local timezone, which can be set to one global value in a running MonetDB server, that is used if the timezone parameter is omitted from a command that needs it (if not set, the default value of the local timezone is plain GMT).

The value ranges and lexical denotations are defined as follows:

Min and max years. The maximum and minimum dates and timestamps that can be stored are in the years 5,867,411 and -5,867,411, respectively. Interestingly, the year 0 is not a valid year. The year before 1 is called -1.

Dates. Fall in a valid year, and have a month and day that is valid in that year. The first day in the year is January 1, the last December 31. Months with 31 days are January, March, May, July, August, October, and December, while April, June, September and November have 30 days. February has 28 days, expect in a leap year, when it has 29. A leap year is a year that is an exact multiple of 4. Years that are a multiple of 100 but not of 400 are an exception; they are no leap years.

Time. The smallest time is 00:00:00.000 and the largest 23:59:59.999 (the hours in a time range between [0,23], minutes and seconds between [0,59] and milliseconds between [0:999] ). Time identifies a valid time-of-day, not an amount of time (for denoting amounts of time, or time differences, we use here concepts like "number of days" or "number of seconds" denoted by some value of a standard integer type).

Timestamp. A valid timestamp is formed by a combination of a valid date and valid daytime. When creating a timestamp from a date and daytime, a timezone should be specified (if timezone is omitted, the local timezone is assumed). If a timezone is specified, it is used to convert the date and time in that timezone to GMT, which is the internal timestamp representation. One problem here is that the first hour after DST has ended (some Sunday night in autumn, generally), the time is set back one hour, so the same times occur twice. Hence two translations to a timestamp are possible for such date and time combinations. In those case, we act as if it was the first occurrence (still within DST).

For difference calculations between dates (in numbers of days) we use signed integer. Hence, the valid range for difference calculations is between -2147483647 and 2147483647 days (which corresponds to roughly -5,867,411 and 5,867,411 years).

For difference between timestamps (in numbers of milliseconds) we use 64-bit bigint. These are large integers of maximally 19 digits, which therefore impose a limit of about 106,000,000,000 years on the maximum time difference used in computations.

datetime_func:
     EXTRACT '(' datetime_field FROM scalar_exp ')'
   | CURRENT_DATE [ '(' ')' ]
   | CURRENT_TIME [ '(' ')' ]
   | CURRENT_TIMESTAMP [ '(' ')' ]
   | LOCALTIME [ '(' ')' ]
   | LOCALTIMESTAMP [ '(' ')' ]

datetime_field: YEAR | MONTH | DAY | HOUR | MINUTE | SECOND

Gregorian dates.
The basics of the Gregorian calendar stem from the time of Julius Caesar, when the concept of a solar year as consisting of 365.25 days (365 days plus once in 4 years one extra day) was introduced. However, this Julian Calendar, made a year 11 minutes long, which subsequently accumulated over the ages, causing a shift in seasons. In medieval times this was noticed, and in 1582 Pope Gregory XIII issued a decree, skipped 11 days. This measure was not adopted in the whole of Europe immediately, however.  For this reason, there were many regions in Europe that upheld different dates.

It was only on September 14, 1752 that some consensus was reached and more countries joined the Gregorian Calendar, which also was last modified at that time. The modifications were twofold: first, 12 more days were skipped. Second, it was determined that the year starts on January 1 (in England, for instance, it had been starting on March 25). Other parts of the world have adopted the Gregorian Calendar even later.

MonetDB implements the Gregorian Calendar in all its regularity. This means that values before the year 1752 probably do not correspond with the dates that people really used in times before that (what they did use, however, was very vague anyway, as explained above). In solar terms, however, this calendar is reasonably accurate (see the "correction seconds" note below).

Timezones
The basic timezone regime was established on November 1, 1884 in the International Meridian Conference held in Greenwich (UK). Before that, a different time held in almost any city. The conference established 24 different time zones defined by regular longitude intervals that all differed by one hour.  Not for long it was that national and political interest started to erode this nicely regular system.  Timezones now often follow country borders, and some regions (like the Guinea areas in Latin America) have times that differ with a 15 minute grain from GMT rather than an hour or even half-an-hour grain.

An extra complication became the introduction of daylight saving time (DST), which causes a time jump in spring, when the clock is skips one hour and in autumn, when the
clock is set back one hour (so in a one hour span, the same times occur twice). The DST regime is a purely political decision made on a country-by-country basis. Countries in the same timezone can have different DST regimes. Even worse, some countries have DST in some years, and not in other years.

To avoid confusion, this temporal type module stores absolute points of time in GMT only (GMT does not have a DST regime). When storing local times in the database, or retrieving local times from absolute timestamps, a correct timezone object should be used for the conversion.

Applications that do not make correct use of timezones, will produce irregular results on e.g. time difference calculations.

Correction seconds
Once every such hundred years, a correction second is added on new year's night. This rule would seriously complicate the temporal type module (as then the duration of a day, which is now the fixed number of 24*60*60*1000 milliseconds, becomes parametrized by the date), it is not implemented. Hence these seconds are lost, so time difference calculations in milliseconds (rather than in days) have a small error if the time difference spans many hundreds of years.

We cannot handle well changes in the timezone rules (e.g., DST only exists since 40 years, and some countries make frequent changes to the DST policy). To accommodate this we should make timezone_local a function with a year parameter. The tool should maintain and access a timezone database. Lookup of the correct timezone would be dynamic in this structure. The timezone_setlocal would just set the string name of the timezone.

Serial datatypes

Serial datatypes mk Thu, 04/01/2010 - 22:21

As of 2003 the SQL standard supports serial types (sequences). They are of particular use in auto-generating key values.A serial type is defined as a primary database object over any of the built-in data types. The NEXT VALUE operation generates the next value and can be used anywhere a value expression is allowed. Its name should be unique within the current schema. It can only be dropped when the tables mentioning it have previously been dropped.

generated_column:
    GENERATED ALWAYS AS IDENTITY [ '(' seq_params ')' ]
  | AUTO_INCREMENT

sequence_commands:
    CREATE SEQUENCE [schema_name '.'] seq_name [ seq_params ]
  | DROP SEQUENCE [schema_name '.'] seq_name
  | ALTER SEQUENCE [schema_name '.'] seq_name [alter_seq_params]

seq_params:
    [ AS datatype ]
    [ START WITH nonzero-bigint ]
    [ INCREMENT BY nonzerol-bigint ]
    [ MINVALUE nonzero-bigint | NO MINVALUE ]
    [ MAXVALUE nonzero-bigint | NO MAXVALUE ]
    [ CACHE nonzero-bigint ]
    [ [ NO ] CYCLE ]

alter_seq_params:
    [ AS datatype ]
    [ RESTART [WITH nonzero-bigint ] ]
    [ INCREMENT BY nonzerol-bigint ]
    [ MINVALUE nonzero-bigint | NO MINVALUE ]
    [ MAXVALUE nonzero-bigint | NO MAXVALUE ]
    [ CACHE nonzero-bigint ]
    [ [ NO ] CYCLE ]

Example. The example shown below introduces the column count, which is incremented with each row being added. It is conceptually identical to the value expression max(count)+1in each insert. The column info is a limited range with wrap around.The serial type as found in PostgreSQL and the auto_increment  flag as found in MySQL are both mapped onto a sequence type in MonetDB/SQL.

CREATE TABLE test_serial (
   d DATE,
   id SERIAL,  -- this will implicitly create a PKey. Use BIGSERIAL if you want the id to be of type bigint instead of int.
   count INT AUTO_INCREMENT,
   info INT GENERATED ALWAYS AS

        IDENTITY (
           START WITH 100 INCREMENT BY 2
           NO MINVALUE MAXVALUE 1000
           CACHE 2 CYCLE
)      );

Much like other primary database objects, the sequence type can be altered at any time as illustrated below.

sql>CREATE SEQUENCE "my_test_seq" as integer START WITH 2;
sql>CREATE TABLE test (t int DEFAULT NEXT VALUE FOR "my_test_seq", v char);
sql>INSERT INTO test(v) VALUES ('a');
Rows affected 1
sql>INSERT INTO test VALUES (10, 'b');
Rows affected 1
sql>ALTER SEQUENCE "my_test_seq" RESTART WITH (SELECT MAX(t) + 1 FROM test);
sql>INSERT INTO test(v) VALUES ('c');
Rows affected 1
sql>SELECT * FROM test;
+----+---+
| t  | v |
+====+===+
| 2  | a |
| 10 | b |
| 11 | c |
+----+---+

The functions sys.get_value_for('<schema name>', '<sequence name>') and sys.next_value_for('<schema name>', '<sequence name>') can be used to query the current value of a sequence. The difference is that next_value_for() also advances the current value of a sequence to the next value.  The SQL statement SELECT NEXT VALUE FOR <schemal name>.<sequence name> is an equivalent of sys.next_value_for('<schema name>', '<sequence name>'). The following queries demonstrate how these functions and statement work:

sql>CREATE SEQUENCE "myseq" AS INTEGER;
operation successful
sql>SELECT get_value_for('sys', 'myseq');
+------+
| L2   |
+======+
|    1 |
+------+
1 tuple
sql>SELECT next_value_for('sys', 'myseq');
+------+
| L2   |
+======+
|    1 |
+------+
1 tuple
sql>SELECT NEXT VALUE FOR myseq;
+------+
| L2   |
+======+
|    2 |
+------+
1 tuple

JSON datatype

JSON datatype mk Thu, 01/23/2014 - 08:44

JSON has become the de facto light weight data interchange format for many web applications. It has a simple hierarchical structure and supports a limited set of value types. JSON is a natural representation of data for the C family of programming languages.

JSON is supported in MonetDB as a subtype over type VARCHAR, which ensures that only valid JSON strings are added to the database.

Example: CREATE TABLE json_example (c1 JSON, c2 JSON(512) NOT NULL);

MonetDB supports most of the JSON path expressions defined in [ref]. It can be used to decompose the values into regular tables, which then act as an index. A limited set of operators are predefined in the SQL catalogue.

json.filter(J, Pathexpr) Extracts the component from J that satisfied the Pathexpr
json.filter(J, Number) Extracts a indexed component from J
json.text(J, [Sep]) Glue together the values separated by Sep character (default space)
json.number(J) Turn a number, singleton array value, or singleton object tag into a double
json."integer"(J) Turn a number, singleton array value, or singleton object element into an integer
json.isvalid(StringExpr) Checks the string for JSON compliance. Returns boolean.
json.isobject(StringExpr) Checks the string for JSON object compliance. Returns boolean.
json.isarray(StringExpr) Checks the string for JSON array compliance. Returns boolean.
json.length(J) Returns the number of top-level components of J.
json.keyarray(J) Returns a list of key tags for the top-level components of J.
json.valuearray(J) Returns a list of values for the top-level components of J.

JSON path expressions always refer to a single JSON structure. The root of this structure is identified by the identifier '$', which is implicitly assumed in most expressions. Components of the JSON structure are addressed through the dot notation, i.e. representing child steps and array element access. The wild card can be used for child names and undetermined array indices.

JSON path Description Example
"$" The root object json.filter(v, '$') = { "store":...}"
"." childname The child step operator json.filter(v, '$.store.bicycle') = {"color": "red", "price": 19.95}
".." childname Recursive child step json.filter(v, '$..price') = [8.95,12.99,8.99,22.99,19.95]
"*" Child name wildcard json.filter(v, '$.store.bicycle.*') = {"color": "red", "price": 19.95}
"[" nr "]" Array element access json.filter(v, '$.store.book.[1]') = the second book
"[" * "]" Any array element access  
E1 "," E2 Union path expressions json.filter(v, '$.store.bicycle,$..price')

An example JSON object used for these expressions [ref], more examples in the testsuite.

{ "store": {
    "book": [
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

URL datatype

URL datatype mk Wed, 02/26/2014 - 15:14

The related URL data type is a subdomain over strings and represent valid Uniform Resource Locators.

Example: CREATE TABLE URL_example (c1 URL, c2 URL(512) NOT NULL);

A collection of functions are provided to create, validate and extract portions.

  Description
  sys.getAnchor(theUrl url) returns string
  sys.getBasename(theUrl url) returns string
  sys.getContent(theUrl url) returns string
  sys.getContext(theUrl url) returns string
  sys.getDomain(theUrl url) returns string
  sys.getExtension(theUrl url) returns string
  sys.getFile(theUrl url) returns string
  sys.getHost(theUrl url) returns string
  sys.getPort(theUrl url) returns string
  sys.getProtocol(theUrl url) returns string
  sys.getQuery(theUrl url) returns string
  sys.getUser(theUrl url) returns string
  sys.getRobotURL(theUrl url) returns string
  sys.isaURL(theUrl string) returns boolean
  sys.newurl(protocol string, hostname string, "port" int, file string) returns url
  sys.newurl(protocol string, hostname string, file string) returns url

UUID datatype

UUID datatype mk Sat, 01/14/2012 - 14:40

The data type  UUID  stores a Universally Unique Identifiers (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards. It can be used as a global unique 128-bit identifier. UUIDs are generated by an algorithm that ensures an extreme low probability that two calls to the same create function will ever produce the same value. They are often used in distributed (web-based) systems for this reason.

A UUID is written as a sequence of lower-case hexadecimal digits, in several groups separated by hyphens, specifically a group of 8 digits followed by three groups of 4 digits followed by a group of 12 digits, for a total of 32 digits representing the 128 bits. An example of a UUID in this standard form is:

select uuid();
+--------------------------------------+ 
| uuid                                 | 
+======================================+
| 65950c76-a2f6-4543-660a-b849cf5f2453 |
+--------------------------------------+

The system function sys.uuid() generates a new uuid and returns the uuid.

The system function sys.isauuid(string) checks whether the string value satisfies the grammatical UUID structure and returns a boolean value.

You can use cast() or convert() to convert a valid uuid string to a uuid type.

select cast('26d7a80b-7538-4682-a49a-9d0f9676b765' as uuid);
select convert('83886744-d558-4e41-a361-a40b2765455b', uuid);

Network Address Type

Network Address Type mk Sat, 01/14/2012 - 14:49

The SQL type inet describes IPv4 network addresses. The inet module contains a collection of functions that operate on IPv4 addresses.  The most relevant functions are the `containment' functions that deal with subnet masks.

  Description
  sys.broadcast(p inet) returns inet
  sys.host(p inet) returns clob
  sys.masklen(p inet) returns int
  sys.setmasklen(p inet, mask int) returns inet
  sys.netmask(p inet) returns inet
  sys.hostmask(p inet) returns inet
  sys.network(p inet) returns inet
  sys.text(p inet) returns clob
  sys.abbrev(p inet) returns clob
  sys."left_shift"(i1 inet, i2 inet) returns boolean
  sys."right_shift"(i1 inet, i2 inet) returns boolean
  sys."left_shift_assign"(i1 inet, i2 inet) returns boolean
  sys."right_shift_assign"(i1 inet, i2 inet) returns boolean

The functionality of this module is greatly inspired by the PostgreSQL inet atom. It should be extended to also support IPv6.

User defined types

User defined types mk Thu, 04/01/2010 - 22:22

The MonetDB kernel supports creation of user defined types, e.g. geometric types. The relationship between SQL and MAL world is expressed using an external name.

CREATE TYPE [schema_name '.'] type_name EXTERNAL NAME type_identification

DROP TYPE [schema_name '.'] type_name [RESTRICT | CASCADE]

The implementation of new atomary types is best postponed until there is no other performance-wise acceptable solution. Addition of an atomary type in the kernel would be beneficial if it is also complemented with bulk-operations and when type specific optimizers could exploit their semantics.