Column Imprints: A Secondary Index Structure

From MonetDB
Jump to navigationJump to search

Authors: Lefteris Sidirourgos, Martin Kersten

Overview[edit]

Cache-conscious secondary index structure: column imprints.

  • Each imprint indexes data fitting in one cache line.
  • Structure is run-length encoded, so both CPU and memory friendly.

Method[edit]

  • A number of bins is chosen. Typically 64 in running on a 64 bit system.
  • For each cache line of data, a column imprint vector of 64 bits is made.
  • A bit is set if one of the values in the cache line falls into the corresponding bin.

Compression Method[edit]

  • Run-length encoding used to compress final column imprint.
  • A dictionary is constructed, which contains a counter and a repeat flag. If repeat flag 0, then the next x lines are different, otherwise they are the same. Each entry points to (an) imprint vector(s).

Binning Method[edit]

  • Histogram of unique values made to assign bin ranges that will hold equal number of tuples.

Benefits[edit]

  • Partial ordering can be exploited: less susceptible to order of individual values within a single cache line.
  • Shown also to be robust with highly unordered data.

Updating[edit]

Data Appending[edit]

  • Borders of bins may need to be readjusted, though unlikely, because first and last bins hold overflow values, and the original bins were based on a uniform sample of the domain data.
  • Simply append column imprints of new data to existing imprint structure.

Results[edit]

  • Storage results for column imprints better than bit binning and zone maps.
  • Selection query results show column imprints to be best over others
  • Compared with scan, imprints are a lot better for low selectivity, but not worse than scan with high selectivity.
  • WAH was found to be worse than scan when the selectivity was high. WAH better when DATA is not already in memory, but should be fetched from disk.