ALTER TABLE ALTER COLUMN SET STORAGE

aris aris.koning at monetdbsolutions.com
Fri Sep 13 09:13:31 CEST 2019


Hi Dan,


On 12-09-19 11:01, developers-list-request at monetdb.org wrote:
> Send developers-list mailing list submissions to
> 	developers-list at monetdb.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://www.monetdb.org/mailman/listinfo/developers-list
> or, via email, send a message with subject or body 'help' to
> 	developers-list-request at monetdb.org
>
> You can reach the person managing the list at
> 	developers-list-owner at monetdb.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of developers-list digest..."
>
>
> Today's Topics:
>
>     1. Re: ALTER TABLE ALTER COLUMN SET STORAGE (Daniel Zvinca)
>     2. Re: ALTER TABLE ALTER COLUMN SET STORAGE (Daniel Zvinca)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 11 Sep 2019 13:53:27 +0300
> From: Daniel Zvinca <daniel.zvinca at logbis.com>
> To: "Communication channel for developers of the MonetDB suite."
> 	<developers-list at monetdb.org>
> Subject: Re: ALTER TABLE ALTER COLUMN SET STORAGE
> Message-ID:
> 	<CALxHsPEdkfOUKjy-xV5GDYHFry=SAst5qzZM5rk8JvFHqk1FNQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Thank you so much for your answer, Aris.
>
> The good news is that compression is being considered and is going to be part of
> MonetDB in two releases or so (about one year).
The first release is probably much closer to half a year away.
>
> I will try to see if I can use the current code in a custom build; I am
> quite curious how that will affect performance.
> To be honest, I don't expect performance issues. Nowadays, even in the SSD era,
> it is still faster to read compressed data and decompress it in memory, if the
> right ratio is there, of course. And I do expect a decent compression ratio
> on most of the data.
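
To make the trade-off concrete, here is a minimal back-of-envelope model of when reading compressed data and decompressing in memory beats a raw scan. All bandwidth and ratio figures are hypothetical, not measurements:

```python
# Toy cost model: time to materialize a column in memory, either raw or
# compressed-then-decompressed. All numbers below are hypothetical.

GB = 1 << 30

def scan_time_s(col_bytes, disk_bw, ratio=1.0, decomp_bw=None):
    """Seconds to get `col_bytes` of usable data into memory.

    ratio     -- compression ratio (uncompressed/compressed); 1.0 means raw
    decomp_bw -- in-memory decompression throughput in bytes/s (None = raw)
    """
    t = (col_bytes / ratio) / disk_bw   # read the (smaller) file from disk
    if decomp_bw is not None:
        t += col_bytes / decomp_bw      # then decompress in memory
    return t

raw = scan_time_s(8 * GB, disk_bw=2 * GB)
lz  = scan_time_s(8 * GB, disk_bw=2 * GB, ratio=3.0, decomp_bw=4 * GB)
assert lz < raw  # 3x ratio plus fast decompression wins, even on an SSD
```

The break-even point shifts with the ratio and the decompression speed, which is exactly the "right ratio" caveat above.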

The idea is that between memory fetches you have quite a few CPU 
cycles at your disposal, which can be used either to perform 
(potentially) more expensive algebraic operations directly on compressed 
data or simply to decompress the data. This is the general principle 
implying that at least lightweight compression can be used to accelerate 
queries.
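
A tiny illustration of that principle (a Python sketch, not MonetDB code): a thetaselect evaluated directly on run-length-encoded data does one comparison per run instead of one per row, so less data is touched per qualifying result.

```python
# Sketch: evaluate "value > threshold" directly on RLE-compressed data,
# returning row positions without ever materializing the raw column.

def rle_encode(values):
    """Compress a list into [value, run_count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_thetaselect_gt(runs, threshold):
    """Positions of elements > threshold, computed on the compressed form."""
    out, pos = [], 0
    for value, count in runs:
        if value > threshold:                 # one comparison per run
            out.extend(range(pos, pos + count))
        pos += count
    return out

col = [3, 3, 3, 7, 7, 1, 9, 9, 9, 9]
runs = rle_encode(col)
assert rle_thetaselect_gt(runs, 5) == [3, 4, 6, 7, 8, 9]
```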

>
> However, at this stage MOSAIC's dual compressed/uncompressed
> storage will obviously not give me the gain I need. Yet, it is interesting
> to understand at least how query performance might look in the future.
Again, this feature is still under discussion, but I don't consider it 
unlikely that it will eventually make it into Mosaic.
>
> The bad news is, of course, that the compression feature is going to happen in ... one
> year. But if it also comes with support for compressing in-memory results
> (I know, it wasn't promised), it might be worth the wait.

We're working on it ☺. More remarks below on your other email...

>
> Best regards,
>
>
>
>
>
>
> On Wed, Sep 11, 2019 at 12:18 PM aris <aris.koning at monetdbsolutions.com>
> wrote:
>
>>
>>
>> -------- Forwarded Message --------
>> Subject: Re: ALTER TABLE ALTER COLUMN SET STORAGE
>> Date: Tue, 10 Sep 2019 15:15:09 +0200
>> From: aris <aris.koning at monetdbsolutions.com>
>> To: developers-list-request at monetdb.org
>>
>> Hi Daniel,
>>
>> On 10-09-19 12:00, developers-list-request at monetdb.org wrote:
>>
>> Today's Topics:
>>
>> 1. ALTER TABLE ALTER COLUMN SET STORAGE (Daniel Zvinca)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 10 Sep 2019 12:40:22 +0300
>> From: Daniel Zvinca <daniel.zvinca at logbis.com>
>> To: developers-list at monetdb.org
>> Subject: ALTER TABLE ALTER COLUMN SET STORAGE
>> Message-ID:
>> <CALxHsPHEEExozCgdp_srxQm2tFtFOqGOGNqBTw2UhTBwWZu93g at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello,
>>
>> I am interested to find out more about ALTER TABLE ALTER COLUMN SET STORAGE
>> feature and how is that related to compression.
>>
>> As far as I understood this is related to an active development branch,
>> called MOSAIC which was never merged with any of the previous MonetDB
>> versions. Obviously, compression is an important feature columnar databases
>> are providing for data storage and manipulation. A module like MOSAIC that
>> seems to allow several compression techniques, would be an interesting
>> option.
>>
>> Yes, compression, a.k.a. Mosaic, is going to be a new feature in MonetDB,
>> although the feature won't be included in the upcoming November release.
>> Most likely, you can expect the feature in the first release after the
>> November release. But Mosaic is a somewhat big undertaking: our current
>> road map probably covers multiple future MonetDB releases before all
>> envisioned compression features are available in MonetDB. The first
>> milestone in the current road map is to apply a single compression
>> technique on an entire column. But one of the next milestones is to
>> partition a column into variable-sized compression blocks. Within each
>> block a particular compression technique is applied.
>>
>>
>> First question I have: can the MOSAIC extension be used (sources added and
>> custom compiled) with success, for any of its proposed codecs, with any of
>> the newest versions (Apr2019+)? I mean without affecting any of the embedded,
>> capi, rapi and pyapi modules, which all exchange data with external
>> libraries.
>>
>> If by this you mean you want to import the mosaic module as an external
>> library into an existing release out of the box, then the answer is no.
>> There are some slight modifications in the GDK layer to accommodate the
>> Mosaic module, and to interact with it from SQL there are also some code
>> changes in the SQL layer. But besides those dependencies, I don't expect
>> any issue with the particular (x)api frameworks. Nothing is guaranteed,
>> obviously. It sounds like you want to hack/back-port it into custom builds
>> of earlier releases. I wouldn't give it zero chance of success, and I do
>> wish you much luck :)
>>
>>
>> A quick read of the MOSAIC code made me understand that this compression can be
>> applied only on read-only PERSISTENT columns. That means that I would lose
>> the major benefit of compression, which I mostly need during the importing stage.
>> Sure, I can imagine a controlled batching import scheme that appends
>> data to tables until a certain threshold is reached, after which the table is made
>> read-only, then compressed, then added to a merge table, but this looks
>> like quite a contrived scenario. Am I wrong; can MOSAIC be used in a different
>> scenario?
>>
>> Your observation about the joint life cycle of a Mosaic structure and its
>> original column file is correct: currently Mosaic adds a compressed
>> representation next to the existing uncompressed column. For the first
>> milestone on the Mosaic road map we want to successfully apply compression
>> on READ-ONLY pre-existing columns where the purpose of compression is to
>> potentially accelerate analytical queries on these columns. However, we are
>> still looking into potentially freeing the uncompressed column once a
>> compressed Mosaic heap is available. This would accommodate compression for
>> the more traditional sake of limiting the memory and/or disk footprint.
>>
>>
>> I can understand the reasons behind compressing only PERSISTENT bats, yet I am
>> wondering if TRANSIENT bats can also benefit from it, especially 1. during the
>> result-building stage (server-client or embedded version) or 2. for remote
>> connections when data is transferred for merging operations.
>>
>> Regarding the above question, are there any chances that you would consider
>> keeping compressed results in memory? Sure, I can instead use on-disk
>> temporary tables for subsequent manipulation, but for performance reasons
>> in-memory compressed results would be way faster. Actually, when the embedded
>> version provides a result set, it stays valid till the user releases it, so
>> why not also be able to use that for possible subsequent SQL operations
>> that do not fit into a CTE scenario? That would provide flexibility and
>> memory management superior to the CTE mechanism. Temporary results can
>> be developed in steps and accessed directly at any time, as
>> conveniently as temporary views in a CTE, but without the burden of possible
>> temporary bats that are not released till the CTE ends.
>>
>> I think it is an interesting idea. But I think it is part of a more
>> general goal/problem of how to handle updates on compressed data. There are
>> internal discussions on this topic. But whatever the outcome, this will
>> only be relevant for a much later milestone on the road map.
>>
>>
>>
>> Thank you,
>> Dan
>>
>>
>> Hope it helps.
>>
>> Kind regards,
>>
>> Aris
>>
>>
>>
>> ------------------------------
>>
>> Subject: Digest Footer
>>
>> _______________________________________________
>> developers-list mailing list
>> developers-list at monetdb.org
>> https://www.monetdb.org/mailman/listinfo/developers-list
>>
>>
>> ------------------------------
>>
>> End of developers-list Digest, Vol 78, Issue 3
>> **********************************************
>>
>>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 12 Sep 2019 12:00:31 +0300
> From: Daniel Zvinca <daniel.zvinca at logbis.com>
> To: "Communication channel for developers of the MonetDB suite."
> 	<developers-list at monetdb.org>
> Subject: Re: ALTER TABLE ALTER COLUMN SET STORAGE
> Message-ID:
> 	<CALxHsPGq_RbS-E4cudmm=PcHikQ=TMdwEYJyy88sHuN4JeFnMg at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello Aris,
>
>
>
> With the latest code adjustments (thank you for that), the library does
> what it is supposed to do. For whoever may want to try it: I confirm that it can
> be integrated as-is into a recent version with a decent effort (it requires a
> bit of understanding of the MonetDB code logic).
>
> Here are some remarks/questions I have after checking out the code a bit
> (from an extensibility perspective) and its functionality so far.
>
> I am curious if the reason behind a constant MOSAICMAXCNT used for all
> types is related to the thetaselect or projection implementations (do the
> blocks need to have their counters aligned?). The truth is that for most of the
> compression techniques I've looked into (not from the MOSAIC project, but lz4,
> zstd, blosc2 as general compressors, zfp, sz for floating-point numbers, Daniel
> Lemire's compression for integers, Judy and ART dictionaries for strings), the
> size of the to-be-compressed block is usually the one that matters the
> most. Block sizes that fit in the L1, L2 or L3 cache usually provide better
> results for certain compressors. Sure, a constant block counter per BAT
> matters, but would it be a problem to have, for instance, something like
> MOSAICMAXCNT/(sizeof(type) or twidth)?
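
Making the cache-fitting argument concrete (the cache sizes and the `block_count` helper below are illustrative, not MonetDB's): a fixed byte budget per block translates into a type-dependent element count, roughly MOSAICMAXCNT/sizeof(type) as suggested above.

```python
# Hypothetical per-core cache sizes; real values vary per CPU.
L1, L2 = 32 * 1024, 256 * 1024

def block_count(byte_budget, type_width):
    """Max number of elements of `type_width` bytes fitting in the budget."""
    return byte_budget // type_width

assert block_count(L1, 8) == 4096     # doubles / bigints fitting in L1
assert block_count(L1, 4) == 8192     # 4-byte ints fitting in L1
assert block_count(L2, 8) == 32768    # doubles fitting in L2
```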

That is a good question. Mosaic can apply different compression 
techniques to different variable-sized parts or blocks of the original 
column. The aim of Mosaic is to apply the compression mix with the best 
compression ratio. This is a hard problem, of course, so we currently 
use some heuristics to find a reasonable mix of compression techniques. 
One of those heuristics is that, given a stride of data that seems to 
compress very well with a certain technique A, we cut compression 
technique A off at a fixed limit of MOSAICMAXCNT. This is just to give 
other compression techniques a chance to retry their compression 
strategy at a different starting point. If technique A is still a good 
choice, we just restart compressing from this point onward with 
technique A. This is mostly relevant in the estimate phase. Perhaps you 
can take a look at the function MOSoptimizerCost.
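
A toy version of that heuristic may help (this is a sketch with a made-up cost model, not the actual MOSoptimizerCost): greedily pick the technique that pays off for the next stride, but cut every run off at MOSAICMAXCNT so other techniques get to compete again at a later starting point.

```python
MOSAICMAXCNT = 4   # illustrative limit; the real constant is much larger

def run_length(values, start):
    """Length of the constant run starting at `start` (RLE candidate)."""
    n = 1
    while start + n < len(values) and values[start + n] == values[start]:
        n += 1
    return n

def plan(values):
    """Return a list of (technique, count) blocks covering `values`."""
    blocks, i = [], 0
    while i < len(values):
        n = run_length(values, i)
        if n >= 2:                       # RLE estimated to pay off here...
            n = min(n, MOSAICMAXCNT)     # ...but cut the run off regardless
            blocks.append(("rle", n))
        else:
            blocks.append(("raw", 1))    # no technique pays off: raw
        i += n
    return blocks

assert plan([5, 5, 5, 5, 5, 5, 2, 9]) == \
    [("rle", 4), ("rle", 2), ("raw", 1), ("raw", 1)]
```

Note how the second block restarts with RLE anyway: cutting off at MOSAICMAXCNT only forces a re-evaluation, it does not forbid the same technique.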

>
> The MOSAIC pipe optimizer, being derived from the sequential one, will inherit
> the performance penalties of the latter. Yet, after I checked the results of
> mosaic.analysis, I am not even sure that those with a factor close to 1
> were actually considered for compression (I guess the RAW alternative is used
> if the gain is under a certain threshold), therefore I have no clue how reliable
> the testing benchmarks are (for those cases, do I mostly test the RAW codec?).
> Anyway, for what I have tested, times seem to roughly double when the mosaic
> pipe is used, it being only slightly slower than sequential_pipe.

That is also a good point that we have recently discussed internally. 
Raw is kind of our compression exception handling mechanism: when a 
certain compression technique doesn't seem to be efficient, Mosaic 
transparently falls back to Raw. This is by design. But I noticed 
earlier that this is probably not always warranted from the perspective 
of a user who expects that a single compression technique is applied, 
period. So we are currently revisiting this part of the design and 
implementation to see how we can improve here.
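
The design tension can be reduced to a one-function sketch (the `min_gain` threshold and the strict mode are hypothetical, not current Mosaic behaviour): transparent fallback silently substitutes Raw, while a strict mode would surface the mismatch instead.

```python
def choose(requested_ratio, strict=False, min_gain=1.1):
    """Pick the codec actually stored for a block.

    requested_ratio -- estimated ratio of the user-requested technique
    min_gain        -- minimum ratio for the technique to be worthwhile
    """
    if requested_ratio >= min_gain:
        return "requested"
    if strict:
        # A strict design would refuse rather than silently substitute.
        raise ValueError("requested technique does not pay off on this block")
    return "raw"   # transparent fallback: the current, by-design behaviour

assert choose(3.0) == "requested"
assert choose(1.0) == "raw"
```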


Your remarks do give me an idea: when we are applying only a single 
compression technique, perhaps we do not need to be so strict with 
MOSAICMAXCNT.

>
> I assume that a complex, constant-size header for all compressions
> (MOSAICHEADER) is convenient to maintain, but I am wondering how that will
> look if more compressors are about to be added. Is it mandatory to have a
> common header size for all mosaic heaps?
The MOSAICHEADER is indeed a global header for a single Mosaic heap. 
Within a Mosaic heap there are multiple blocks, and each block also has 
a block-specific header. Certain compression techniques need a shared 
resource, like dictionary compression for instance. These shared 
resources are maintained in the MOSAICHEADER.
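
An illustrative layout of that arrangement (a Python sketch of the idea, not the real on-disk format): one global header per heap owning the shared dictionary, with small per-block headers naming the technique.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    technique: str     # per-block header: which codec this block uses
    payload: list      # codec-specific data (here: dictionary codes)

@dataclass
class MosaicHeap:
    dictionary: list = field(default_factory=list)   # shared resource
    blocks: list = field(default_factory=list)

    def append_dict_block(self, values):
        """Dictionary-compress `values`; all blocks share one dictionary."""
        codes = []
        for v in values:
            if v not in self.dictionary:
                self.dictionary.append(v)   # grow the shared dictionary
            codes.append(self.dictionary.index(v))
        self.blocks.append(Block("dict", codes))

heap = MosaicHeap()
heap.append_dict_block(["nl", "de", "nl"])
heap.append_dict_block(["fr", "de"])
assert heap.dictionary == ["nl", "de", "fr"]
assert [b.payload for b in heap.blocks] == [[0, 1, 0], [2, 1]]
```

The sketch shows why the dictionary lives in the global header: later blocks reuse codes assigned by earlier ones.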
>
>
>
> Thank you,
Thanks for your input as well. It is very useful.

Aris
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> developers-list mailing list
> developers-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/developers-list
>
>
> ------------------------------
>
> End of developers-list Digest, Vol 78, Issue 5
> **********************************************


More information about the developers-list mailing list