Hi Stefan,
Thanks for your reply!
We have run the query a few times with different data sizes. We used 16 GB of
RAM (actually 13.5 GB was used) and found that about 10 GB of data is the
critical point up to which the query can still run. The sizes of all the data
files are listed below; each file name is a table name (only a few tables are
referred to: store_sales, date_dim, item, customer, catalog_sales, web_sales):
7.4K call_center.dat
1.6M catalog_page.dat
212M catalog_returns.dat
2.9G catalog_sales.dat
27M customer_address.dat
64M customer.dat
77M customer_demographics.dat
9.9M date_dim.dat
77B dbgen_version.dat
149K household_demographics.dat
328B income_band.dat
2.6G inventory.dat
28M item.dat
61K promotion.dat
1.7K reason.dat
1.1K ship_mode.dat
27K store.dat
323M store_returns.dat
3.8G store_sales.dat
4.9M time_dim.dat
1.2K warehouse.dat
19K web_page.dat
98M web_returns.dat
1.5G web_sales.dat
12K web_site.dat
So we wonder: does MonetDB do any memory management at all?
The output of `mserver5 --version` is:
MonetDB 5 server v11.27.13 "Jul2017-SP4" (64-bit, 128-bit integers)
Copyright (c) 1993 - July 2008 CWI
Copyright (c) August 2008 - 2018 MonetDB B.V., all rights reserved
Visit https://www.monetdb.org/ for further information
Found 17.0GiB available memory, 40 available cpu cores
Libraries:
libpcre: 8.38 2015-11-23 (compiled with 8.38)
openssl: OpenSSL 1.0.2g 1 Mar 2016 (compiled with OpenSSL 1.0.2g 1
Mar 2016)
libxml2: 2.9.3 (compiled with 2.9.3)
Compiled by: monetdb(a)MonetDB-0.0 (x86_64-pc-linux-gnu)
Compilation: gcc -g -O2
Linking : /usr/bin/ld -m elf_x86_64
And the process size is not limited.
To let you reproduce the problem conveniently, I'll provide more details
here: you can get TPC-DS from its website (we use version 2.6.0). After
installing TPC-DS, go to the directory v2.6.0/tools and run `./dsdgen
-scale 10 -dir /home/monetdb/tpc-ds_test_data10G` to generate the data.
Once the data has been generated, use the script expe.sh to create the
tables and load the data. The query script is 123.tpcds.23.sql. (The
syntax of some of the other queries that TPC-DS generates is not suitable
for MonetDB; we had not modified all of them when the problem occurred.)
One more question: I can't receive your reply emails, so I don't know how to
reply to you; for now, I can only send a new mail each time.
Thanks!
Regards,
Rancho
Hi all,
8 years after this interesting thread about using ulimit to limit the data
segment size (
https://www.monetdb.org/pipermail/users-list/2010-March/004007.html), a few
things have changed and I find myself a bit lost again about memory
management.
At some point (I can't remember when) mserver5 stopped respecting ulimit. No
matter how I change the limits, it always sees all the available physical RAM.
Does any other mechanism exist?
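For reference, the classic per-process mechanism is setrlimit(2), the system call behind ulimit. Whether a given mserver5 version honours it is exactly the open question above, but this sketch shows how an address-space cap (RLIMIT_AS, i.e. `ulimit -v`) behaves for an arbitrary process; the 1 GiB cap and 2 GiB allocation are arbitrary illustrative numbers:

```python
import subprocess, sys, textwrap

# Run a child process that caps its own address space with RLIMIT_AS
# (the mechanism behind `ulimit -v`) and then tries to exceed the cap.
child = textwrap.dedent("""
    import resource
    one_gib = 1024 ** 3
    resource.setrlimit(resource.RLIMIT_AS, (one_gib, one_gib))  # 1 GiB cap
    try:
        buf = bytearray(2 * one_gib)   # deliberately exceeds the cap
        print("allocation succeeded")
    except MemoryError:
        print("allocation refused by RLIMIT_AS")
""")
out = subprocess.run([sys.executable, "-c", child],
                     capture_output=True, text=True).stdout.strip()
print(out)
```

The allocation is refused inside the child only; the parent process is unaffected, which is what makes the mechanism per-process.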
The issue I'm facing all the time, and for which I find very little
documentation, is:
- multiple dbfarms, possibly tens of mserver5 instances, running on the
same (virtual) machine, alongside other applications
- not necessarily, but most of the time each dbfarm is served by a Docker
container
- CPU and memory resources need to be shared fairly (ideally with
controllable priorities)
- no mserver5 should ever be killed because of memory issues
Now, especially when using Docker, there are means to control the maximum
memory allowed to each process. However, the remedy in case of overuse is
to kill the process. That is of course not a solution.
What I'm looking for is: can I set a cap on resources *at the process level*,
in such a way that mserver5 will just *see* those limits, make the best of
them, and never try to use more? (Yes, one way would be a VM, but that is far
from ideal for me.)
Does anyone have experience with / tips about multi-dbfarm environments?
Thanks!
Roberto
Hi all:
I'm trying to figure out how to calculate the time delta between two events.
My events have a timestamp, an event source, an event type, an event state, and an event id.
So an event looks like -
event_date_time event_source event_type event_state event_id
2018-06-22 09:05:07.000000 MODULE123 TYPEABC ACK 24567890
2018-06-22 11:32:14.000000 MODULE123 TYPEABC UNACK 24812266
This makes up a unique "event", start to finish.
However, I have hundreds or thousands of entries in between, and I want to be able to group them together into their unique pairings as above.
I'm not sure how to wrap my head around this one.
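A plain-Python sketch of one way to think about it (with hypothetical data mirroring the schema above): treat each ACK as opening an interval keyed by (source, type), and close it with the next matching UNACK. The same idea can then be translated into SQL; the pairing logic itself is just:

```python
from datetime import datetime

# Hypothetical rows: (event_date_time, event_source, event_type,
# event_state, event_id), mirroring the schema in the question.
rows = [
    ("2018-06-22 09:05:07", "MODULE123", "TYPEABC", "ACK",   24567890),
    ("2018-06-22 10:00:00", "MODULE456", "TYPEABC", "ACK",   24600001),
    ("2018-06-22 11:32:14", "MODULE123", "TYPEABC", "UNACK", 24812266),
    ("2018-06-22 12:15:00", "MODULE456", "TYPEABC", "UNACK", 24700002),
]

def pair_events(rows):
    """Pair each ACK with the next UNACK from the same (source, type)
    and return (source, type, ack_id, unack_id, delta_seconds) tuples."""
    open_acks = {}   # (source, type) -> (timestamp, ack event_id)
    pairs = []
    for ts_str, src, typ, state, eid in sorted(rows):
        ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
        if state == "ACK":
            open_acks[(src, typ)] = (ts, eid)
        elif state == "UNACK" and (src, typ) in open_acks:
            start, ack_id = open_acks.pop((src, typ))
            pairs.append((src, typ, ack_id, eid,
                          (ts - start).total_seconds()))
    return pairs

for p in pair_events(rows):
    print(p)
```

For the MODULE123 pair above this yields a delta of 8827 seconds (09:05:07 to 11:32:14).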
Adam Doherty
We are interested in adding support for importing records from MonetDB tables
into the ETL tool we are developing. We see performance issues when executing
queries like:
select col0, col1, ... coln from <table>
on tables with a large number of records. What I am concerned about is the
time spent upfront, before the first record is retrieved: once records
become available, we can start processing them.
For example, on my machine, executing a select * from <table> query on a
table with 68 million records takes around 24 seconds before it is actually
possible to start retrieving records from the server response.
Queries are submitted via the JDBC driver.
Is there a more efficient way of streaming records from a table?
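If I recall correctly, the MonetDB JDBC driver's Statement.setFetchSize influences how many rows are fetched per round trip (the server's reply size), so the client can start working on the first batch while the rest streams in. A minimal sketch of that batched-fetch pattern, with sqlite3 standing in for the real server purely to keep the example self-contained:

```python
import sqlite3

# Batched cursor fetching: process rows as they arrive instead of
# materialising the entire result first.  sqlite3 merely stands in for
# the real server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10000)])

cur = conn.execute("SELECT v FROM t")
total = 0
while True:
    batch = cur.fetchmany(1000)        # one batch of rows at a time
    if not batch:
        break
    total += sum(v for (v,) in batch)  # work starts before the scan ends
print(total)  # 49995000
```

With JDBC the analogous loop is ResultSet.next() after setFetchSize(n); whether that reduces the upfront 24 seconds depends on where the server spends that time, which is an open question here.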
Thanks,
Best regards,
*Manuel Pagliai*
Visokio Ltd
www.visokio.com
+44 20 7351 4646
The MonetDB team at CWI/MonetDB BV is pleased to announce the
Mar2018-SP1 bugfix release of the MonetDB suite of programs.
More information about MonetDB can be found on our website at
<https://www.monetdb.org/>.
For details on this release, please see the release notes at
<https://www.monetdb.org/Downloads/ReleaseNotes>.
As usual, the download location is <https://dev.monetdb.org/downloads/>.
Mar2018-SP1 bugfix release (11.29.7)
Build Environment
* On Linux and Unix, try not to link to libraries multiple times:
that only causes the code (and worse, the variables in those
libraries) to be included multiple times. On Windows, we do need to
link to libraries multiple times (once for each DLL we create if
the DLL references anything from the library) but that doesn't
cause the linked library to be loaded multiple times.
SQL Frontend
* Corrected the definition of view: sys.ids.
Bug Fixes
* 6562: Sqlitelogictest crash on group by query with not in operator
* 6565: Sqlitelogictest crash on complex select query with coalesce
call
* 6566: Sqlitelogictest unavailable calc.- MAL operations
* 6568: Sqlitelogictest crash on complex case query
* 6569: Sqlitelogictest select query with not between cause with
wrong results
* 6570: Sqlitelogictest select coalesce undefined calc
* 6572: ordered index Error in optimizer garbageCollector
* 6573: Sqlitelogictest crash on complex select query with case
statement
* 6574: server crashed could not find ordered index while creating a
table
* 6576: Sqlitelogictest arithmetic expressions with negative numbers
handling
* 6577: creating temp table kills performance of the original query
* 6578: One two-tuple insert gives different results than two single
inserts
* 6581: Join condition errors.
* 6583: Fixed size string concatenation with integer results in fixed
size string of size 0
* 6584: SELECT FROM REMOTE TABLE WHERE IS NOT NULL produces wrong
result
* 6585: Nested Merge tables cause an infinite loop in rel_optimizer
* 6587: Sqlitelogictest crash on complex case statement
* 6589: Sqlitelogictest crash on complex on complex expression
* 6594: Sqlitelogictest crash on complex case statement
* 6595: Remote decimal division triggers assertion / returns wrong
answer
* 6598: Python 3.4 not supported (due to usage of Py_DecodeLocale)
* 6600: Sqlitelogictest queries fail to execute
* 6601: "where is null" clause on remote table causes problem with
next query
* 6602: Sqlitelogictest wrong results in IN query
* 6603: Sqlitelogictest: Aggregation query with distinct clause
produces duplicated rows
Hello all:
This is more out of curiosity, but I have data in my events table currently trapped in an nvarchar field.
I am wondering if I can build a table and execute a SELECT INTO with a Python UDF pulling data from the existing fields plus the values of my nvarchar field.
My string data has the following format -
"key1=value, key2=val, key3=value"...
I want to break out the data by first splitting on the ',' and then on the '='.
The keys would become the columns in the table.
I can do this with a Python script and pymonetdb, but a UDF would be faster.
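For what it's worth, the string-splitting part of such a UDF is straightforward; a minimal sketch in plain Python, using the sample format above (inside a MonetDB Python UDF the same function would be applied per value of the nvarchar column):

```python
def parse_kv(field):
    """Split 'key1=value, key2=val' into a dict: first on ',', then on '='."""
    pairs = (part.split("=", 1) for part in field.split(",") if "=" in part)
    return {k.strip(): v.strip() for k, v in pairs}

sample = "key1=value, key2=val, key3=value"
print(parse_kv(sample))  # {'key1': 'value', 'key2': 'val', 'key3': 'value'}
```

The dict-of-lists shape that a MonetDB table-returning UDF expects would then be built by collecting one such dict per row; the column set must be fixed up front, since the table's schema cannot vary per row.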
Adam Doherty
Process Control Systems Administrator
Kearl Technical, Imperial
505 Quarry Park Blvd SE
W3B.224
Calgary, AB T2C 5N1
adam.doherty(a)esso.ca<mailto:adam.doherty@esso.ca> P 587.476.4640<tel:+15874764640> M 403.869.1636<tel:+14038691636>
imperialoil.ca<http://www.imperialoil.ca/Canada-English/default.aspx> | Twitter<http://www.twitter.com/imperialoil> | YouTube<http://www.youtube.com/imperialoil>
Good morning,
At the moment there is no functional compatibility between PostgreSQL
and MonetDB with respect to the generate_series function. The range in
PostgreSQL is inclusive; in MonetDB it is exclusive. More importantly,
generate_series can't be used with an argument joined in from a different
table; MonetDB just seems to support a subquery, hence joining the result
back without a primary key is impossible.
PostgreSQL;
select generate_series(a, b), a from (select 1 as a, 2 as b) as a;
generate_series | a
-----------------+---
1 | 1
2 | 1
(2 rows)
MonetDB;
select * from generate_series((select 1 as a, 2 as b));
+-------+
| value |
+=======+
| 1 |
+-------+
If I would like to functionally achieve the following: I have a list that
states the speeds over partial segments, identified by linear referencing.
From 10 meters to 12 meters, the distance was 2 meters and the speed was 2 s.
10 2 2
I would like to expand these values to segments of one meter, and hence end
up with a table that would state:
10 2 2
11 2 2
12 2 2
Now, I am aware of the Python bindings, and I could achieve what I want with
the function below, but I would still find generate_series very practical.
DROP FUNCTION expand_per_meter;
CREATE FUNCTION expand_per_meter(trip_hash BIGINT, userstopcode STRING,
distancesincelastuserstop INTEGER, s INTEGER, t INTEGER)
RETURNS TABLE(trip_hash BIGINT, userstopcode STRING,
distancesincelastuserstop INTEGER, s INTEGER, t INTEGER)
LANGUAGE PYTHON {
    result = dict()
    result['trip_hash'] = []
    result['userstopcode'] = []
    result['distancesincelastuserstop'] = []
    result['s'] = []
    result['t'] = []
    for i in range(0, len(trip_hash)):
        expanded = range(distancesincelastuserstop[i],
                         distancesincelastuserstop[i] + s[i] + 1)
        expand = len(expanded)
        result['trip_hash'] += [trip_hash[i]] * expand
        result['userstopcode'] += [userstopcode[i]] * expand
        result['distancesincelastuserstop'] += expanded
        result['s'] += [s[i]] * expand
        result['t'] += [t[i]] * expand
    return result
};
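For reference, the core of the expansion can be exercised outside MonetDB; a minimal plain-Python version for a single row such as the (10, 2, 2) example above:

```python
# Expand one segment into per-meter rows, as in the UDF above:
# (start, distance, time) -> one row per meter, boundaries inclusive.
def expand_row(start, distance, t):
    return [(m, distance, t) for m in range(start, start + distance + 1)]

for row in expand_row(10, 2, 2):
    print(row)  # (10, 2, 2) then (11, 2, 2) then (12, 2, 2)
```

Note that the inclusive upper boundary here matches PostgreSQL's generate_series semantics, not MonetDB's exclusive one.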
--
Stefan
Stefan,
This is a good question. For the specific example, you are quite correct.
However, the motivation for the question came from using the MonetDB-rmath extension and embedded R: often it is more efficient to calculate a number of values (e.g. the confidence intervals and p-values) together and then return all of the values as a table. This probably also reflects that I am a statistician and an old R user.
The recent article by Raasveldt suggested that analysis objects from Python could be stored natively rather than being serialised BLOBs. This could be further extended to storing R objects natively. Members of the objects could then be extracted as they are needed. Admittedly, this would lose many relational advantages. My suggestion is halfway, where R and Python can be used for analytics, with simple relational storage of the results.
Actually, is there any way to optimise a lateral call using BATs?
-- Mark
On 13 Jun 2018 17:43, Stefan Manegold <Stefan.Manegold(a)cwi.nl> wrote:
Hi Mark,
just out of curiosity,
(in the given case/example) what's wrong with a simple
select *, mean-1.96*sd as lci, mean+1.96*sd as uci from test;
or
create or replace function ci(se double)
returns double
begin
return 1.96*se;
end;
select *, mean-ci(sd) as lci, mean+ci(sd) as uci from test;
in case you really want/need to avoid the "redundant" occurrence of "1.96"
...
Thanks!
Best,
Stefan
----- On Jun 13, 2018, at 5:30 PM, Mark Clements mark.clements(a)ki.se wrote:
> Jennie,
>
> Thank you for replying to this inquiry - it's appreciated.
>
> Your solution gives the right result - but it requires changing the
> function call and will not generalise easily to other column types or
> other number of covariates. If we had another table such as:
>
> drop table test;
> create table test as select 1 as id, 2 as sex, cast(3 as double) as
> mean, cast(4.0 as double) as sd;
> -- and repeating the function
> create or replace function cis(mu double, se double) returns table(lci
> double, uci double)
> begin return select mu-1.96*se, mu+1.96*se;
> end;
>
> we can get all of the test table combined with the confidence interval by:
>
> -- row_number()
> select * from
> (select *, row_number() over() as rid from test) as t1
> natural join
> (select *, row_number() over() as rid from cis((select mean,sd from
> test))) as t2;
> -- or using lateral
> select * from test, lateral cis(test.mean,test.sd);
>
> The row_number solution is fast, and the lateral solution is simple. I
> was hoping for:
>
> select *, cis(mean,sd).* from test;
>
> Again, thank you for your help.
>
> -- Mark
>
> On 06/08/2018 01:53 PM, Ying Zhang wrote:
>> Hai Mark,
>>
>> I’m just wondering if this gives you what you want:
>>
>> create function cis(mu double, se double)
>> returns table(mu_out double, lci double, uci double)
>> begin
>> return select mu, mu-1.96*se, mu+1.96*se;
>> end;
>>
>> Regards,
>> Jennie
>
> _______________________________________________
> users-list mailing list
> users-list(a)monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
--
| Stefan.Manegold(a)CWI.nl | DB Architectures (DA) |
| www.CWI.nl/~manegold/<http://www.CWI.nl/~manegold/> | Science Park 123 (L321) |
| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
_______________________________________________
users-list mailing list
users-list(a)monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list
For using the MonetDB-rmath extension, I was interested in using
table-returning functions to return multiple values. Is there a fast and
elegant way to combine the results with the source table in SQL? To make
this concrete, consider some test data and a function to calculate a 95%
confidence interval from a mean and standard error:
drop table test;
create table test as select cast(value as double) as value from
generate_series(0,5);
drop function cis(double,double);
create function cis(mu double, se double) returns table(lci double, uci
double)
begin return select mu-1.96*se, mu+1.96*se;
end;
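For reference, the arithmetic in cis() is just the normal-approximation 95% interval; a plain-Python equivalent (1.96 being the usual z value):

```python
# Same computation as the SQL function cis(mu, se) above.
def cis(mu, se):
    """95% CI using the normal approximation: mu ± 1.96 * se."""
    return (mu - 1.96 * se, mu + 1.96 * se)

lci, uci = cis(3.0, 4.0)           # e.g. mean = 3, sd = 4
print(round(lci, 2), round(uci, 2))  # -4.84 10.84
```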
We can then call the table-returning function:
select * from cis((select value, 1.0 from test));
If we want to re-join the results with the source table, we could use
row_number() or use lateral:
-- Using row_number:
select * from
(select *, row_number() over() as rid from test) as t1
natural join
(select *, row_number() over() as rid from cis((select value,1.0 from
test))) as t2;
-- Using lateral:
select * from test, lateral cis(test.value, cast(1.0 as double)) as t2;
I believe that the latter will lead to a slow loop, even when BAT
functions are available. The table-returning function can be used in the
select statement only for a scalar expression:
select cis(0,1).*; -- ok
select cis(0,1).* from test; -- Error: no such operator 'cis'
select cis(value,1).* from test; -- Error: identifier 'value' unknown
Again, is there a fast and elegant way to combine the source table with
a table-returning function in SQL?
Kindly, Mark.
Hello all:
I am trying to find the most efficient way to filter results to those matching today only.
My events table has 550 million rows and I want a view as follows-
CREATE VIEW events_today AS
SELECT * FROM events
WHERE str_to_date(timestamp_to_str(event_timestamp, '%Y-%m-%d'), '%Y-%m-%d') = CURRENT_DATE;
which should give me all events with today's date.
However, the query just hangs without ever returning; running the SELECT by itself also never returns.
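For what it's worth, the view's predicate formats every one of the 550 million timestamps to a string and parses it back per row. A direct range comparison against the day's boundaries expresses the same condition; a plain-Python sketch of the equivalence, with hypothetical values:

```python
from datetime import datetime, date, time, timedelta

# The view round-trips each timestamp through a string; an equivalent
# test compares the timestamp directly against the day's boundaries.
ts = datetime(2018, 6, 22, 9, 5, 7)   # hypothetical event_timestamp
today = date(2018, 6, 22)             # stands in for CURRENT_DATE

day_start = datetime.combine(today, time.min)
string_roundtrip = (datetime.strptime(ts.strftime("%Y-%m-%d"),
                                      "%Y-%m-%d").date() == today)
range_check = day_start <= ts < day_start + timedelta(days=1)
print(string_roundtrip, range_check)  # True True
```

The SQL analogue would be a predicate like `event_timestamp >= CURRENT_DATE AND event_timestamp < CURRENT_DATE + INTERVAL '1' DAY`, though the exact MonetDB interval syntax is an assumption here.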
I'm running 11.29.3 (Mar2018) on CentOS 7 with 128GB and 16 Cores.
Adam