Column Imprints: A Secondary Index Structure
From MonetDBJump to navigationJump to search
Authors: Lefteris Sidirourgos, Martin Kersten
Cache-conscious secondary index structure: column imprints.
- Each imprint indexes data fitting in one cache line.
- Structure is run-length encoded, so both CPU and memory friendly.
- A number of bins is chosen. Typically 64 in running on a 64 bit system.
- For each cache line of data, a column imprint vector of 64 bits is made.
- A bit is set if one of the values in the cache line falls into the corresponding bin.
- Run-length encoding used to compress final column imprint.
- A dictionary is constructed, which contains a counter and a repeat flag. If repeat flag 0, then the next x lines are different, otherwise they are the same. Each entry points to (an) imprint vector(s).
- Histogram of unique values made to assign bin ranges that will hold equal number of tuples.
- Partial ordering can be exploited: less susceptible to order of individual values within a single cache line.
- Shown also to be robust with highly unordered data.
- Borders of bins may need to be readjusted, though unlikely, because first and last bins hold overflow values, and the original bins were based on a uniform sample of the domain data.
- Simply append column imprints of new data to existing imprint structure.
- Storage results for column imprints better than bit binning and zone maps.
- Selection query results show column imprints to be best over others
- Compared with scan, imprints are a lot better for low selectivity, but not worse than scan with high selectivity.
- WAH was found to be worse than scan when the selectivity was high. WAH better when DATA is not already in memory, but should be fetched from disk.