here is the CASE / CAST / SUBSTRING performance cost on a data set with about 200 million records and a windows system with 64gb of ram.  this might help others as well..


> # no conversion commands
> system.time( dbGetQuery( db , "SELECT COUNT(*) , some_date AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.03    0.00    7.94
>
> # SUBSTRING only
> system.time( dbGetQuery( db , "SELECT COUNT(*) , SUBSTRING( some_date , 1 , 1 ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   16.39
>
> # CAST only
> system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( some_date AS VARCHAR(255) ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00   35.58
>
> # CASE only
> system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN ( some_date IS NULL ) THEN 'hello' ELSE some_date END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00    8.73
>
> # CAST + SUBSTRING
> system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( SUBSTRING( some_date , 1 , 1 ) AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
    0.0     0.0    34.5
>
> # CASE + SUBSTRING
> system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   51.49
>
> # CASE + CAST + SUBSTRING
> system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   73.03




On Wed, May 27, 2015 at 5:52 AM, Anthony Damico <ajdamico@gmail.com> wrote:
hi, i've been running some pretty big queries on a few tables (total of 900 million records) on a 64GB windows machine and i'm curious if anyone might provide advice about whether any of these strategies are computationally expensive?  i am maxing out my available RAM in my queries a lot, and at the point it hits that ceiling, the computation starts dragging.  here's my pseudo-sql that displays the kinds of commands i've been running.  thanks!

============

(a) i am conducting a left join where not all records match. the non-matching records need to be zeroes

in general, is this the least computationally demanding way to get an average amount per person, including the zeroes?


SELECT AVG( sum_amount ) AS avg_amt_per_person

FROM

( SELECT
  person_id , SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
FROM
   ( ( SELECT person_id FROM person_table ) AS a LEFT JOIN ( SELECT person_id , amount FROM amount_table ) AS b ON a.person_id = b.person_id ) AS temp_table
GROUP BY person_id ) ;


============

(b) i am on windows (so no strptime) and i have dates in either a mm/dd/yy format but some january thru september are m/dd/yy (i.e. the month can either be one or two digits).  i am just calculating the month as part of my query on-the-fly.  is it foolish of me not to add a new, permanent column to the data set?  i don't have a good sense of whether this segment slows down my query by a lot.  it is one of my join/merge variables.

    CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month


============

(c) i have a three tables that have the same structure that get LEFT JOINed onto my main table.  so right now i am doing something like--

main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table

is it smarter fo me to do this instead? --

CREATE TABLE stacked_table FROM second_table UNION ALL third_table UNION ALL fourth_table

main_table LEFT JOIN stacked_table