Hi Aris, 

“Using available CPU cycles between memory-fetching tasks for algebraic operations” sounds to me like exploring alternative, block-based, out-of-core algorithms (new optimizations?). The truth is, a number of online (aggregation) algorithms can be applied to “running” buffers, so to speak, while still getting parallelism, vectorization, proper CPU cache utilization, and low RAM usage (or, at least, I’d like to think that’s possible, ☺). Roughly what I mean, as a throwaway Python/NumPy sketch below (the function name and the way the blocks arrive are just placeholders on my part): a single-pass mean/variance that touches each buffer once, while the next fetch is in flight.
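
    import numpy as np

    def running_mean_var(blocks):
        """Single-pass mean/variance over an iterable of 1-D NumPy blocks."""
        count, total, total_sq = 0, 0.0, 0.0
        for block in blocks:      # each block is reduced while the next fetch is pending
            count += block.size
            total += block.sum()                                  # vectorized per block
            total_sq += np.square(block, dtype=np.float64).sum()  # cache-friendly, no big temporaries
        mean = total / count
        var = total_sq / count - mean * mean
        return mean, var

Other reductions (min/max, higher moments, histograms) compose the same way, block by block, so nothing ever has to be materialized whole in RAM.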

While writing this reply, it occurred to me that a slightly different, dual-storage solution might also come in handy at times: one that attaches to the original storage a set of fixed-block-size statistical summaries that can be used to massively accelerate some aggregations. I think some folks call this a pre-aggregated layer. Something like the sketch below (made-up names and an arbitrary block size, of course): keep (count, sum, min, max) per fixed-size block next to the raw data, and answer range aggregates mostly from the summaries, reading raw data only for the two partial edge blocks.
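
    import numpy as np

    BLOCK = 4096  # fixed block size (elements); purely illustrative

    def build_summaries(data):
        """Precompute per-block (count, sum, min, max) for a 1-D array."""
        return [
            (b.size, b.sum(), b.min(), b.max())
            for b in (data[i:i + BLOCK] for i in range(0, data.size, BLOCK))
        ]

    def range_sum(data, summaries, lo, hi):
        """Sum of data[lo:hi], touching raw data only in the two edge blocks."""
        first, last = lo // BLOCK, (hi - 1) // BLOCK
        if first == last:                                  # query fits inside one block
            return data[lo:hi].sum()
        total = data[lo:(first + 1) * BLOCK].sum()         # partial head block: raw data
        total += sum(s[1] for s in summaries[first + 1:last])  # whole blocks: summaries only
        total += data[last * BLOCK:hi].sum()               # partial tail block: raw data
        return total

The summary layer is tiny compared to the data itself, and for block-aligned queries you never have to read the original storage at all.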

(thinking out loud, of course)

Dan