Python User Defined Functions

From MonetDB
Jump to: navigation, search

MonetDB supports user-defined functions in the Python language. These user-defined functions require NumPy version >= 1.7 and Python to be installed.

Creating Python Functions[edit]

Python functions can be created through the CREATE FUNCTION statement. There are three different types of Python UDFs that can be created.


User-Defined Scalar Functions[edit]

User-defined scalar functions are functions that can operate on any number of columns, but return only a single scalar column. The syntax to create a user-defined scalar function is as follows.

Syntax[edit]

CREATE FUNCTION functionname([paramlist] | *)
RETURNS <return_type>
LANGUAGE (PYTHON || PYTHON_MAP)
{ function_code } || 'external_file.py'
- [paramlist]: A list of the input parameter names and types. An example of a valid parameter list is (i INTEGER, j INTEGER). The asterisk symbol can also be used instead of a parameter list, which indicates that any set of input parameters is valid.
- <return_type>: The return type of the scalar function. This must be a single SQL type. A valid return type is INTEGER.
- (PYTHON || PYTHON_MAP): The language of the function specifies if parallel execution is allowed. For more information, see the Parallel Execution section.
- { function_code } || 'external_file.py': You must either specify the source code of the function directly when creating the user-defined function, or specify a filename. If a filename is specified, the source code for the function will be loaded from the specified file.

Usage[edit]

User-defined scalar functions can be used anywhere regular scalar functions can be used. They can be used in the SELECT and WHERE clauses, and can also be used in JOIN predicates.

Note that there are no cardinality restrictions placed on user-defined scalar functions, meaning they can return any number of values regardless of the amount of input values. This means you can also compute aggregates using user-defined scalar functions (albeit not when a GROUP BY statement is involved).

Examples[edit]

Below is an example of a simple Python UDF that multiples two columns together and how it can be used in various queries.

CREATE FUNCTION python_multiplication(i INTEGER, j INTEGER) 
RETURNS INTEGER 
LANGUAGE PYTHON 
{
    return i * j
};
/* Regular usage */
SELECT python_multiplication(col1, col2) FROM table;
/* Nested usage */
SELECT python_multiplication(python_multiplication(col1, col2), col3) FROM table;
/* Use in Where Clause */
SELECT * FROM table WHERE python_multiplication(col1, col2) > 100;
/* Use in Join predicate */
SELECT * FROM table INNER JOIN table2 ON python_multiplication(table.col1, table2.col1) = 100;


User-Defined Table Functions[edit]

User-defined table functions allow you to generate tables using Python functions. Unlike scalar functions, they can return any number of output columns. The columns can either be returned in order, or in a {name: value} dictionary. Once created, the user-defined table functions can be used anywhere a table can be used.

Syntax[edit]

CREATE FUNCTION functionname([paramlist] | *)
RETURNS TABLE([paramlist])
LANGUAGE (PYTHON || PYTHON_MAP)
{ function_code } || 'external_file.py'

Examples[edit]

Below is an example of a table that takes a constant value as input and returns a table with two columns. Note that user-defined table functions can also take columns as input, they are not limited to only taking constant values as input.

CREATE FUNCTION python_random(count INTEGER) 
RETURNS TABLE(id INTEGER, value INTEGER) 
LANGUAGE PYTHON 
{
    return { 'id': numpy.arange(count), 'value': numpy.random.randint(0, 100, count) }
};
/* Use as normal table */
SELECT * FROM python_random(100);
/* Use value from other table as input to this function */
SELECT * FROM python_random( ( SELECT COUNT(*) FROM table ) );

User-Defined Aggregate Functions[edit]

User-defined aggregate functions can be used to compute aggregates on data. They can be used in the SELECT statement of a query, and can also be used to compute aggregates over different groups (when the query contains a GROUP BY statement).

Syntax[edit]

CREATE AGGREGATE functionname([paramlist] | *)
RETURNS <return_type>
LANGUAGE (PYTHON || PYTHON_MAP)
{ function_code } || 'external_file.py'

Regular Aggregations[edit]

When the language PYTHON is used, the aggregation function is called only once with all the columns as input. The groups for each tuple are passed using the hidden aggr_group parameter. This parameter is a numpy.int32 array that contains a label for each tuple that indicates to which group it belongs. It is the responsibility of the function to perform the aggregation over each of the groups, and return one aggregate value for each group.

Example[edit]

Below is an example of a regular aggregation using the language PYTHON.

CREATE AGGREGATE python_sum(val INTEGER) 
RETURNS INTEGER 
LANGUAGE PYTHON 
{
   unique = numpy.unique(aggr_group)
   x = numpy.zeros(shape=(unique.size))
   for i in range(0, unique.size):
       x[i] = numpy.sum(val[aggr_group==unique[i]])
   return(x)
};
/* Use in group-by aggregation */
SELECT col2, python_sum(col1) FROM table GROUP BY col2;

Parallel Aggregations[edit]

Regular aggregations are cumbersome to write because they require you to manually perform the grouping with the aggr_group parameter. When the language PYTHON_MAP is used, the aggregation function is instead called once per group, with only the tuples that belong to that group as input. The aggregation is computed in parallel over the groups.

As the function is called once per group, the per-group overhead is larger than with regular aggregations. As such, parallel aggregations should be used when the amount of groups is low. When the amount of groups is very large, regular aggregations are preferable.

Example[edit]

Below is an example of a parallel aggregation using the language PYTHON_MAP.

CREATE AGGREGATE python_sum_map(val INTEGER) 
RETURNS INTEGER 
LANGUAGE PYTHON_MAP
{
   return numpy.sum(val)
};
/* Use in group-by aggregation */
SELECT col2, python_sum_map(col1) FROM table GROUP BY col2;

Advanced Usage[edit]

Parallel Execution[edit]

MonetDB supports parallel execution through the Mergetable and Mitosis optimizers. These optimizers work by partitioning the input columns into separate chunks, then executing non-blocking functions once for each chunk. Finally, it merges the resulting data together again when access to the entire column is necessary. This happens when a blocking function is called or when the query results have to be gathered. Examples of non-blocking (parallelizable) functions are multiplication and SQRT, an example of a blocking function is MEDIAN.

By default, MonetDB/Python functions are blocking, meaning the input columns will be merged prior to calling the MonetDB/Python function. However, MonetDB/Python functions can be executed in parallel if they use the PYTHON_MAP language. In this case, the function will be called multiple times, each time with only a part of the input columns as input.

CREATE FUNCTION parallel_multiplication(i INTEGER, j INTEGER) 
RETURNS INTEGER 
LANGUAGE PYTHON_MAP
{
    return i * j
};
/* Multiply two columns in parallel */
SELECT parallel_multiplication(col1,col2) FROM table;

Note that if you use a function that is not mappable and execute it in parallel, you may see unexpected results. This is because the non-mappable function is not executed once on the entire columns, but once for every chunk created. The amount of chunks created is usually equal to the amount of cores on your machine.

Take for example the following non-mappable operation, which calls numpy.min.

CREATE FUNCTION parallel_min(i INTEGER) 
RETURNS INTEGER 
LANGUAGE PYTHON_MAP
{
    return numpy.min(i)
};
/* This will result in one 'minimal' value per chunk, rather than one value in total  */
SELECT parallel_min(col1) FROM table;

We can use this to partially execute a function in parallel by first executing a part of the function in a non-blocking function (map), and then gathering the results computed in parallel and uniting them with a blocking operation (reduce).

To continue with the previous example, after computing the minimal value of each chunk, we can compute the minimal value of these values to get the actual minimal value in the table. Now most of the work has been done in parallel, and we only need to compute the minimal of a small amount of values sequentially, leading to a significant speed-up.

SELECT MIN(parallel_min(col1)) FROM table;

Iterating Over Columns/Column Types[edit]

MonetDB/Python functions are passed two hidden parameters that allow you to iterate over the columns. The _columns type, which is a dictionary that contains a [name, value] map for each of the columns, and the _column_types type, which contains the SQL types of each of the input columns.

CREATE FUNCTION python_print(i INTEGER, j STRING) RETURNS INTEGER LANGUAGE PYTHON {
    for name,val in _columns.iteritems():
        print(name, _column_types[name])
    return(1)
};
/* Prints i INTEGER, j STRING */
SELECT python_print(0, );

Querying the Database[edit]

MonetDB/Python supports loopback queries, meaning that you can query the database from within the Python UDF. The results of the query are converted to NumPy arrays using the same conversion mechanism as the input columns. Loopback queries can be used by using the hidden _conn parameter that is passed to every MonetDB/Python UDF, and can be used in both mappable and non-mappable functions.


CREATE FUNCTION python_loopback() 
RETURNS TABLE(i INTEGER) 
LANGUAGE PYTHON
{
    res = _conn.execute('SELECT i FROM table;')
    return res['i']
};
/* Identical result to SELECT i FROM table; */ 
SELECT * FROM python_loopback();