On Thu, May 28, 2015 at 12:36 PM, martin van dinther <martin.van.dinther@monetdbsolutions.com> wrote:

On 27-05-15 11:52, Anthony Damico wrote:

hi, i've been running some pretty big queries on a few tables (total of 900 million records) on a 64GB windows machine and i'm curious if anyone might provide advice about whether any of these strategies are computationally expensive? i am maxing out my available RAM in my queries a lot, and at the point it hits that ceiling, the computation starts dragging. here's my pseudo-sql that displays the kinds of commands i've been running. thanks!

============

(a) i am conducting a left join where not all records match. the non-matching records need to be zeroes

in general, is this the least computationally demanding way to get an average amount per person, including the zeroes?

SELECT AVG( sum_amount ) AS avg_amt_per_person

FROM

( SELECT
person_id , SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
FROM
( ( SELECT person_id FROM person_table ) AS a LEFT JOIN ( SELECT person_id , amount FROM amount_table ) AS b ON a.person_id = b.person_id ) AS temp_table
GROUP BY person_id ) ;

Assuming person_id in person_table has unique values (so is a key) and all three columns (person_table.person_id, amount_table.person_id, amount_table.amount) are defined as not null you may use following variations:

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM (
SELECT person_id, (CASE WHEN (SELECT SUM(amount) FROM amount_table WHERE person_table.person_id = amount_table.person_id) IS NULL THEN 0 ELSE (SELECT SUM(amount) FROM amount_table WHERE person_table.person_id = amount_table.person_id) END) as sum_amount
FROM person_table
) T;

or

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM (
SELECT person_table.person_id, (CASE WHEN SUM(amount_table.amount) IS NULL THEN 0 ELSE SUM(amount_table.amount) END)as sum_amount
FROM person_table LEFT JOIN amount_table on (person_table.person_id = amount_table.person_id)
GROUP BY person_table.person_id
) T;

or

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM (
SELECT person_id, SUM(amount) as sum_amount
FROM amount_table
GROUP BY person_id
UNION ALL
SELECT person_id, 0 as sum_amount
FROM person_table
WHERE person_id NOT IN (SELECT person_id FROM amount_table)
) T;

or

CREATE VIEW person_view AS
SELECT person_id, (SELECT SUM(amount) FROM amount_table WHERE amount IS NOT NULL AND person_table.person_id = amount_table.person_id) as sum_amount
FROM person_table;

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM (
SELECT person_id, (CASE WHEN sum_amount IS NULL THEN 0 ELSE sum_amount END) as sum_amount
FROM person_view
) T;

or

ALTER TABLE person_table ADD COLUMN sum_amount double not null default 0;
UPDATE person_table
SET sum_amount = (SELECT SUM(amount) FROM amount_table WHERE amount IS NOT NULL AND person_table.person_id = amount_table.person_id)
WHERE person_id IN (SELECT person_id FROM amount_table);

SELECT AVG(sum_amount) AS avg_amt_per_person FROM person_table;

This last one (adding a column to person_table and populating it) only works correct when the source data (in this case the amount_table data) is static during running of the queries or is re-updated before a set of queries is run. It is an approach often used in datawarehouses where the source data is refreshed at known times (e.g. in the night) and the derivable redundant data can be re-derived after the nightly refresh.

============

(b) i am on windows (so no strptime) and i have dates in either a mm/dd/yy format but some january thru september are m/dd/yy (i.e. the month can either be one or two digits). i am just calculating the month as part of my query on-the-fly. is it foolish of me not to add a new, permanent column to the data set? i don't have a good sense of whether this segment slows down my query by a lot. it is one of my join/merge variables.

CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month

As month number is expected to be an integer I would cast it to SMALLINT (or TINYINT) instead of DOUBLE.
Instead of replicating this expression part in all queries where you need the some_month value, create a view which extends the table with this some_month column using the extraction expression in one place and next use the view in the queries where you need some_month.
When performance of the view is not good enough and the data is reasonably static, consider to add additional column(s) to the table, which are populated using a one-time UPDATE statement as presented above with the ALTER TABLE and UPDATE.
With Triggers (on insert and on update) you could automate the update of these derivable column data when new records are added or specific columns are updated.

============

(c) i have a three tables that have the same structure that get LEFT JOINed onto my main table. so right now i am doing something like--

main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table

is it smarter fo me to do this instead? --

CREATE TABLE stacked_table FROM second_table UNION ALL third_table UNION ALL fourth_table

main_table LEFT JOIN stacked_table

_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list

_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list