The storage schemas

The storage schemas zhang Tue, 07/08/2014 - 18:19

The MonetDB SAM/BAM module supports two types of schemas to store SAM/BAM data. With the sequential schema, all alignment records are simply read, parsed and stored subsequently, comparable to how they are stored in the SAM/BAM files. With the pairwise schema, pairs of alignment records (as determined by the information in the alignment records) are stored together in single database records.

Meta-data tables

Regardless of the database schema in which the alignment records are stored, headers of the SAM/BAM files are stored in a fixed number of meta-data tables. Their specifications and relationships are show in the figure below:

The table "bam.files" stores which BAM files are loaded into the database.

  • file_id: every file tuple in this table contains a unique file ID, handed out by the BAM loader during the loading process;
  • file_location: location of the SAM/BAM file on disk, from where it was loaded;
  • dbschema: the database storage schema used, 0 for sequential schema, 1 for pairwise schema;
  • format_version: the format version, as stored in the SAM/BAM header
  • sorting_order: by which field are the alignment records in this file sorted; "unsorted" otherwise.
  • comments: the comments that were found inside CO tags in the file header.

The database design defines additional tables to store the remainder of the SAM/BAM file header, where a really straightforward approach was used (e.g. table "bam.sq" for SQ and table "bam.rg" for RG header records). For more information on the exact meaning of these fields, please consult the SAM specification.

Alignment data tables

For the alignment data, the database design defines a separate set of tables for every SAM/BAM file that has been loaded into the database. This design choice was made to speed up analyses on only a small number of SAM/BAM files, since this often occurs in practice. This speed up is realized by not having to filter out alignment data of specific BAM files as a first analysis step.

When loading SAM/BAM alignment records into the database, you can choose to store these data in one of two sets of predefined database tables, referenced to as the sequential table set and the pairwise table set. These table sets are presented in the next figures, for a BAM file with file_id i.

The database design uses the virtual offset of every alignment record as a primary key. By storing this virtual offset, specific alignments can easily be found back in the original SAM/BAM file using e.g. SAMtools.

Sequential schema

The sequential table set is aimed towards a straightforward mapping from the alignment data as it occurs in a SAM/BAM file. Every alignment field gets stored in a database column. Furthermore, the extra information that is contained in alignment data is parsed and stored in a separate table alignments_extra_i.

Pairwise schema

The pairwise table set aims at reducing the performance overhead when many operations are done on paired alignment data. During the loading process, both primary and secondary alignment pairs are automatically recognized and stored accordingly in the appropriate database tables. The columns in the paired tables have either an 'l' or an 'r' prefix (except for the 'qname', since two alignment records in the same pair always have the same QNAME). Columns with an 'l' prefix store data from alignment records that have their 'first segment' flag set to 1 and columns with an 'r' prefix do this for alignments with their 'last segment' flag set to 1. All alignments that can not be paired are stored in the unpaired alignments table, which has the same structure as the regular alignments table in the straightforward table set.

In addition to physical tables, the pairwise table set also defines some views over the data, which are also automatically created. These views offer ways to access the data as if it was stored in a sequential fashion, e.g., any query that is aimed towards the alignments table from the straightforward table set can be fired at the unpaired_all_alignments_i view.

In the above figure, arrows are used to indicate over which physical tables the different views are defined. 

Note: currently, if you want to load a SAM/BAM file into the pairwise schema, the file has to be sorted by QNAME. If this is not the case yet, please sort it first using e.g. SAMtools before trying to load it into the database. Another way would be to load the file into the sequential schema, write the ordered alignments back to a SAM file (see SAM export for more information on exporting functionality) and then load that SAM file. However, unless you already have your file loaded into a sequential schema this is not adviced, since this sequence of actions will in general take longer than simply sorting the file with e.g. SAMtools before inserting it.