Life Science

Life Science zhang Thu, 03/27/2014 - 14:15

MonetDB/SQL comes with a SAM/BAM module, the de-facto standard for managing DNA sequence alignment data. The module contains the following features to assist SAM/BAM data analysis:

  • Features to load a single SAM/BAM file, a user specified selection of SAM/BAM files, or a complete repository of SAM/BAM files into the database.
  • A database cleanup feature to remove individual SAM/BAM files from the database.
  • A  collection of  SQL functions commonly used in SAM/BAM data analysis, such as computing the reverse complement of a DNA string, and filtering alignment records by their flags.
  • A SAM formatter that renders a database result set into the SAM format

The SAM/BAM modules forms a cornerstone for building lifescience applications around DNA sequences. A quick introduction of the features can be seen in the screencast below, which also demonstrates how MonetDB/BAM works together with popular genomic tools, such as IGV, the Integrative Genomics Viewer.

A specification of the SAM/BAM formats can be found here. Some auxiliary tools for manipulating alignments in the SAM/BAM formats can be found here.

Installation

Installation zhang Tue, 07/08/2014 - 18:01

Installation of SAMtools

If you want to use the SAM/BAM module of MonetDB, you must have the SAMtools (development) library pre-installed. On Ubuntu, you need to install the libbam-dev package. On Fedora, you need to install the samtools-devel package. On any other OSs, you need to install the SAMtools library yourself and tell MonetDB where to find it. The details of the steps are shown below:

  1. Create a folder for your SAMtools installation. We refer to the absolute path of this folder as $SAMTOOLS_BASEDIR.
  2. Download and save the latest version of SAMtools from here in $SAMTOOLS_BASEDIR. We used version 0.1.19 for this manual, so we have a file $SAMTOOLS_BASEDIR/samtools-0.1.19.tar.bz2.
  3. Unzip the bzip2 file and rename the resulting directory:
    $ cd $SAMTOOLS_BASEDIR
    $ tar xf samtools-0.1.19.tar.bz2
    $ mv samtools-0.1.19 samtools
  4. Create two symbolic links $SAMTOOLS_BASEDIR/include and $SAMTOOLS_BASEDIR/lib:
    $ ln -s $SAMTOOLS_BASEDIR $SAMTOOLS_BASEDIR/include
    $ ln -s $SAMTOOLS_BASEDIR/samtools $SAMTOOLS_BASEDIR/lib
  5. Open the file $SAMTOOLS_BASEDIR/samtools/Makefile with your favourite editor and add the flag -fPIC to the CFLAGS variable. This edit is necessary for MonetDB to use the SAMtools libraries during its installation process. NB: we do not know what effects this change might have on other software that also use the SAMtools library directly.
  6. Now the SAMtools library can be compiled without any other argument.
    $ cd $SAMTOOLS_BASEDIR/samtools && make
  7. For more detailed information on the compilation and installation, see the file $SAMTOOLS_BASEDIR/samtools/INSTALL.
  8. Add the SAMtools library we have just installed to the library path:
    $ export LD_LIBRARY_PATH=$SAMTOOLS_BASEDIR/samtools:$LD_LIBRARY_PATH
  9. When installing MonetDB, it needs to know where to find the SAMtools library. This is achieved by passing the option --with-samtools=$SAMTOOLS_BASEDIR to the configure script of MonetDB.

Contact MonetDB Solutions for additional support options or a reference to a prepared Cloud instance.

Loading SAM/BAM data

Loading SAM/BAM data zhang Tue, 07/08/2014 - 18:06

To demonstrate the MonetDB SAM/BAM library, we prepared four sample input data files: file1.bam and file2.sam are unsorted; while file1-sorted.bam and file2-sorted.sam are sorted by the QNAME fields of the alignment records. They can also be found in the subdirectory sql/backends/monet5/bam/Tests/files of your MonetDB source directory. Assume these files are saved on your computer in directory "/tmp". There are several ways to load the SAM/BAM files into MonetDB.

Load individual files

The SQL function bam_loader_file(bam_file STRING, dbschema SMALLINT) allows loading one SAM/BAM file at a time:

sql> CALL bam.bam_loader_file('/tmp/file1.bam', 0);
sql> CALL bam.bam_loader_file('/tmp/file2.sam', 0);
sql> CALL bam.bam_loader_file('/tmp/file1-sorted.bam', 1);
sql> CALL bam.bam_loader_file('/tmp/file2-sorted.sam', 1);

The first argument is the absolute path to the SAM/BAM file. The second argument is either 0 for the sequential schema, or 1 for the pairwise schema (see the definition of both storage schemas in SAM/BAM storage). Note that to use the pairwise schema, the SAM/BAM files must be sorted by the QNAME field.

To check which SAM/BAM files have been loaded into the database, you can consult the database catalogue table "bam.files":

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)

Load SAM/BAM repositories

The SQL function bam_loader_repos(bam_repos STRING, dbschema SMALLINT, nr_threads SMALLINT) allows loading a repository of SAM/BAM files in the given directory:

sql> CALL bam.bam_loader_repos('/tmp', 0, 4);

The first argument is the absolute path to the SAM/BAM repository (note that the loading function does not search in the subdirectory for SAM/BAM files). The second argument is either 0 for the sequential schema, or 1 for the pairwise schema (see SAM/BAM storage). The third argument is the number of threads to use during the loading process. Note: the last argument is removed in the bamloader branch of MonetDB, since a user on SQL level should not have to deal with such a low-level detail. This change will be merged into the next feature release of MonetDB. 

Load list of SAM/BAM files

The SQL function bam_loader_files(bam_files STRING, dbschema SMALLINT, nr_threads SMALLINT) loads all SAM/BAM files listed in the given file:

sql> CALL bam.bam_loader_files('/tmp/files.txt', 0, 4);

The first argument is the absolute path of the file which contains a list of absolute paths to the SAM/BAM files to load:

$ cat /tmp/files.txt
/tmp/file1.bam
/tmp/file2.sam

The second argument is either 0 for the sequential schema, or 1 for pairwise schema (see SAM/BAM storage). The third argument is the number of threads to use during the loading process. Note: the last argument is removed in the bamloader branch of MonetDB, since a user on SQL level should not have to deal with such a low-level detail. This change will be merged into the next feature release of MonetDB. 

Remove files

The SQL function bam_drop_file(file_id BIGINT, dbschema SMALLINT) allows removing all data contained in a SAM/BAM file from the database:

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)
sql>CALL bam.bam_drop_file(2, 0);

The first argument is the file_id of the file that will be removed. The second argument is the dbschema of this file, which can be either zero or one for the sequential or the pairwise schema respectively.

sql>SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
3 tuples (5.334ms)

 

Known issues

There exists a loading issue in the Oct2014 release of MonetDB. When loading a list or a repository of files, using respectively the SQL commands bam_loader_files and bam_loader_repos, too many file descriptors will be opened. This might cause the loading process to fail with the error message that the bam_wrapper code could not open a certain file. This issue has been fixed in the bamloader branch and will be merged into the next feature release of MonetDB.

 

The storage schemas

The storage schemas zhang Tue, 07/08/2014 - 18:19

The MonetDB SAM/BAM module supports two types of schemas to store SAM/BAM data. With the sequential schema, all alignment records are simply read, parsed and stored subsequently, comparable to how they are stored in the SAM/BAM files. With the pairwise schema, pairs of alignment records (as determined by the information in the alignment records) are stored together in single database records.

Meta-data tables

Regardless of the database schema in which the alignment records are stored, headers of the SAM/BAM files are stored in a fixed number of meta-data tables. Their specifications and relationships are show in the figure below:

The table "bam.files" stores which BAM files are loaded into the database.

  • file_id: every file tuple in this table contains a unique file ID, handed out by the BAM loader during the loading process;
  • file_location: location of the SAM/BAM file on disk, from where it was loaded;
  • dbschema: the database storage schema used, 0 for sequential schema, 1 for pairwise schema;
  • format_version: the format version, as stored in the SAM/BAM header
  • sorting_order: by which field are the alignment records in this file sorted; "unsorted" otherwise.
  • comments: the comments that were found inside CO tags in the file header.

The database design defines additional tables to store the remainder of the SAM/BAM file header, where a really straightforward approach was used (e.g. table "bam.sq" for SQ and table "bam.rg" for RG header records). For more information on the exact meaning of these fields, please consult the SAM specification.

Alignment data tables

For the alignment data, the database design defines a separate set of tables for every SAM/BAM file that has been loaded into the database. This design choice was made to speed up analyses on only a small number of SAM/BAM files, since this often occurs in practice. This speed up is realized by not having to filter out alignment data of specific BAM files as a first analysis step.

When loading SAM/BAM alignment records into the database, you can choose to store these data in one of two sets of predefined database tables, referenced to as the sequential table set and the pairwise table set. These table sets are presented in the next figures, for a BAM file with file_id i.

The database design uses the virtual offset of every alignment record as a primary key. By storing this virtual offset, specific alignments can easily be found back in the original SAM/BAM file using e.g. SAMtools.

Sequential schema

The sequential table set is aimed towards a straightforward mapping from the alignment data as it occurs in a SAM/BAM file. Every alignment field gets stored in a database column. Furthermore, the extra information that is contained in alignment data is parsed and stored in a separate table alignments_extra_i.

Pairwise schema

The pairwise table set aims at reducing the performance overhead when many operations are done on paired alignment data. During the loading process, both primary and secondary alignment pairs are automatically recognized and stored accordingly in the appropriate database tables. The columns in the paired tables have either an 'l' or an 'r' prefix (except for the 'qname', since two alignment records in the same pair always have the same QNAME). Columns with an 'l' prefix store data from alignment records that have their 'first segment' flag set to 1 and columns with an 'r' prefix do this for alignments with their 'last segment' flag set to 1. All alignments that can not be paired are stored in the unpaired alignments table, which has the same structure as the regular alignments table in the straightforward table set.

In addition to physical tables, the pairwise table set also defines some views over the data, which are also automatically created. These views offer ways to access the data as if it was stored in a sequential fashion, e.g., any query that is aimed towards the alignments table from the straightforward table set can be fired at the unpaired_all_alignments_i view.

In the above figure, arrows are used to indicate over which physical tables the different views are defined. 

Note: currently, if you want to load a SAM/BAM file into the pairwise schema, the file has to be sorted by QNAME. If this is not the case yet, please sort it first using e.g. SAMtools before trying to load it into the database. Another way would be to load the file into the sequential schema, write the ordered alignments back to a SAM file (see SAM export for more information on exporting functionality) and then load that SAM file. However, unless you already have your file loaded into a sequential schema this is not adviced, since this sequence of actions will in general take longer than simply sorting the file with e.g. SAMtools before inserting it.

Query DNA sequence data

Query DNA sequence data zhang Tue, 07/08/2014 - 18:11

The use of the MonetDB SAM/BAM module to query DNA alignment data is shown using a dump of an actual session. More examples can be found in the MonetDB test directory code sql/backends/monet5/bam/Tests for further inspiration.

We start off by loading some SAM and BAM files into MonetDB (see Loading SAM/BAM data):

sql> CALL bam.bam_loader_file('/tmp/file1.bam', 0);
sql> CALL bam.bam_loader_file('/tmp/file2.sam', 0);
sql> CALL bam.bam_loader_file('/tmp/file1-sorted.bam', 1);
sql> CALL bam.bam_loader_file('/tmp/file2-sorted.sam', 1);

All loaded SAM and BAM files are stored in the database table bam.files:

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)

A summary of the database now looks as follows:

sql>set schema bam;
auto commit mode: on
sql>\d
TABLE  bam.alignments_1
TABLE  bam.alignments_2
TABLE  bam.alignments_extra_1
TABLE  bam.alignments_extra_2
TABLE  bam.alignments_extra_3
TABLE  bam.alignments_extra_4
TABLE  bam.export
TABLE  bam.files
TABLE  bam.paired_primary_alignments_3
TABLE  bam.paired_primary_alignments_4
TABLE  bam.paired_secondary_alignments_3
TABLE  bam.paired_secondary_alignments_4
TABLE  bam.pg
TABLE  bam.rg
TABLE  bam.sq
TABLE  bam.unpaired_alignments_3
TABLE  bam.unpaired_alignments_4
VIEW   bam.unpaired_all_alignments_3
VIEW   bam.unpaired_all_alignments_4
VIEW   bam.unpaired_primary_alignments_3
VIEW   bam.unpaired_primary_alignments_4
VIEW   bam.unpaired_secondary_alignments_3
VIEW   bam.unpaired_secondary_alignments_4
sql>set schema sys;
auto commit mode: on

Now we can use SQL queries to conduct various analyse steps on the alignment data. The module has several distinguished features to facilitate this:

  • Auxilliary functions for common SAM/BAM analysis operations (see also SAM/BAM SQL functions).
  • Possibility of doing analyses on data from multiple files, which is impossible with many other tools.
  • Both unpaired (i.e., sequential) and pairwise storage schemas to ease and accelerating data analysis depending on the types of queries.

Queries using sequential schema

Q1. Count the number of alignment records and chromosomes contained in "file1.bam":

sql> SELECT COUNT(*), COUNT(DISTINCT rname) FROM bam.alignments_1;
+------+------+
| L1   | L2   |
+======+======+
|   71 |    7 |
+------+------+
1 tuple (9.272ms)

Q2. Count the number of alignment records for every chromosome:

sql> SELECT rname, COUNT(*) FROM bam.alignments_1 GROUP BY rname;
+-------+------+
| rname | L1   |
+=======+======+
| chr13 |    1 |
| chr14 |    2 |
| chr18 |    1 |
| chr22 |   63 |
| chr2  |    1 |
| chr3  |    1 |
| chr9  |    2 |
+-------+------+
7 tuples (9.179ms)

Q3. Find all alignments with a mapping quality above 200 and marked as reversed, sort the results by their rname and pos:

sql> SELECT * FROM bam.alignments_1 WHERE mapq > 200 AND bam.bam_flag(flag, 'segm_reve') = TRUE ORDER BY rname, pos;
+----------+------------+-----+-------+----------+-----+----------+-----+----------+----+----------------------------+-----------------------------+
| virtual_ | qname      | fla | rname | pos      | map | cigar    | rne | pnext    | tl | seq                        | qual                        |
: offset   :            : g   :       :          : q   :          : xt  :          : en :                            :                             :
+==========+============+======+=======+==========+======+========+=====+==========+====+============================+=============================+
| 34538632 | sim_22_1_a | 147 | chr22 | 15047749 | 255 | 100=     | =   | 15047509 |  0 | ATGGTTTTTGCCTGGTACTGTTGAAG | ##A#A####D5A>F#E@:G2#3ADBD: |
:          :            :     :       :          :     :          :     :          :    : TTAGGCTTAATTTTGAACCAGTAGCT : ?##@CE-?D>BEG?D5DEABG3B5EC? :
:          :            :     :       :          :     :          :     :          :    : TTGTTGTTTACCTTATGTGGTTTTGG : CGDEDE=B@AE5@BDDGD>/DGBAE5@ :
:          :            :     :       :          :     :          :     :          :    : GTTCATTTGTTCTATAAGTATA     : EGFEGEEEFEAD;BB?FGC         :
| 34539304 | sim_22_1_9 |  83 | chr22 | 15600266 | 255 | 100=     | =   | 15600021 |  0 | TTTAAAATATTAAAAGATGAATTACT | 5<###E?##AB#E=DB#<==;#E#?=A |
:          :            :     :       :          :     :          :     :          :    : ATCAATTGTTTTGAATTTTAAACTAA : FEE@C@AEA6GDGFFG?DE??::?=@B :
:          :            :     :       :          :     :          :     :          :    : AAATCAGTAGTTACTATAAAATTATT : C#G>GB>GEEDA=G>ADEFDEGABEDD :
:          :            :     :       :          :     :          :     :          :    : ATTAAATGTTCTAATAATTGTA     : EBCGGEFFF7;=GEGGFGC         :
...
| 34555209 | sim_22_1_7 |  83 | chr22 | 45060375 | 255 | 41=1X56= | =   | 45060162 |  0 | ACCAGCCTGGCCAACATGGTGAAACG | 5#>?CG=E,A=#AG#?#>GAEA@=DB< |
:          :            :     :       :          :     : 1x1=     :     :          :    : CTGTCTCTACTAAAAGTACAAAAAAA : >ED@?CFC@A?=EG65GGD:C@EE(BF :
:          :            :     :       :          :     :          :     :          :    : TTAGCTAGGCGTGGTGGTGGGCACCT : ABFG5B?G;EFE?@5EG5GAGD5@;DF :
:          :            :     :       :          :     :          :     :          :    : GTAATCCCAGCTACTCGGGGAG     : GG?F?7GAGGAFEFA>E-E         :
| 34555657 | sim_22_1_4 | 147 | chr22 | 46558385 | 255 | 1X49=1X9 | =   | 46558150 |  0 | TGACTCCAGGGAAGTTAAATAAATGC | #E###BF#?;;#>@=1#@=E=B#E-ED |
:          :            :     :       :          :     : =1x39=   :     :          :    : CAAGAGGATTCACGTGGCAGGGCCAT : F?DBC#EDEF?;6FE>+GE@GDE.AED :
:          :            :     :       :          :     :          :     :          :    : GGAGGTGGGGAGGAAGGAGCACAGGA : E+DAF:4E>FF=EDA;DAEGEGAEFD? :
:          :            :     :       :          :     :          :     :          :    : GACACCAGGTGCTCACTGTCCA     : GD=DG=G=C5FEGDGGFAG         :
+----------+------------+-----+-------+----------+-----+----------+-----+----------+----+----------------------------+-----------------------------+
21 tuples (4.390ms)

Q4. Compute the reverse complement of the sequence and the quality strings.

sql> SELECT seq, bam.reverse_seq(seq) AS rev_seq, qual, bam.reverse_qual(qual) AS rev_qual FROM bam.alignments_1;
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------------------------+
| seq                                  | rev_seq                              | qual                                 | rev_qual                              |
+======================================+======================================+======================================+=======================================+
| TATACTTATAGAACAAATGAACCCAAAACCACATAA | ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5E | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D> |
: GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : A@B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@## : BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD>/ :
: AACTTCAACAGTACCAGGCAAAAACCAT         : TTTTGGGTTCATTTGTTCTATAAGTATA         : ?:DBDA3#2G:@E#F>A5D####A#A##         : DGBAE5@EGFEGEEEFEAD;BB?FGC            :
| TATACTTATAGAACAAATGAACCCAAAACCACATAA | ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5E | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D> |
: GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : A@B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@## : BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD>/ :
: AACTTCAACAGTACCAGGCAAAAACCAT         : TTTTGGGTTCATTTGTTCTATAAGTATA         : ?:DBDA3#2G:@E#F>A5D####A#A##         : DGBAE5@EGFEGEEEFEAD;BB?FGC            :
...
| ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | TATACTTATAGAACAAATGAACCCAAAACCACATAA | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5EA |
: TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : >BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD : @B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@##?: :
: TTTTGGGTTCATTTGTTCTATAAGTATA         : AACTTCAACAGTACCAGGCAAAAACCAT         : >/DGBAE5@EGFEGEEEFEAD;BB?FGC         : DBDA3#2G:@E#F>A5D####A#A##            :
| ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | TATACTTATAGAACAAATGAACCCAAAACCACATAA | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5EA |
: TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : >BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD : @B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@##?: :
: TTTTGGGTTCATTTGTTCTATAAGTATA         : AACTTCAACAGTACCAGGCAAAAACCAT         : >/DGBAE5@EGFEGEEEFEAD;BB?FGC         : DBDA3#2G:@E#F>A5D####A#A##            :
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------------------------+
71 tuples (5.768ms)

Q5. Select alignment records that are flagged as being both primary and a first segment:

sql> SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = FALSE AND bam.bam_flag(flag, 'firs_segm') = TRUE;
+----------+-------------+------+-------+----------+------+-------------------+------+----------+------+--------------------------+--------------------------+
| virtual_ | qname       | flag | rname | pos      | mapq | cigar             | rnex | pnext    | tlen | seq                      | qual                     |
: offset   :             :      :       :          :      :                   : t    :          :      :                          :                          :
+==========+=============+======+=======+==========+======+===================+======+==========+======+==========================+==========================+
| 34538416 | sim_22_1_a  |   99 | chr22 | 15047509 |  255 | 100=              | =    | 15047749 |    0 | TAAAAACTTGCTGGTTTTGCGGCT | F?CBF7BGFGGGDD@GFF@B;:GG |
:          :             :      :       :          :      :                   :      :          :      : TGGGGGGCATCACGGAACCTACTG : GEGCFF>EFGGDFFEDEFFG1EGC :
:          :             :      :       :          :      :                   :      :          :      : ACACGTGATGTCTCCCCTGGATGC : C?BDF#F=ED=E?@GBF@=GDED; :
:          :             :      :       :          :      :                   :      :          :      : CCAGCTTTAAAATTTCCCACTTTT : 856FD?@=F#FE?EB5CEGCFE#F :
:          :             :      :       :          :      :                   :      :          :      : GTAC                     : ####                     :
| 34539304 | sim_22_1_9  |   83 | chr22 | 15600266 |  255 | 100=              | =    | 15600021 |    0 | TTTAAAATATTAAAAGATGAATTA | 5<###E?##AB#E=DB#<==;#E# |
:          :             :      :       :          :      :                   :      :          :      : CTATCAATTGTTTTGAATTTTAAA : ?=AFEE@C@AEA6GDGFFG?DE?? :
:          :             :      :       :          :      :                   :      :          :      : CTAAAAATCAGTAGTTACTATAAA : ::?=@BC#G>GB>GEEDA=G>ADE :
:          :             :      :       :          :      :                   :      :          :      : ATTATTATTAAATGTTCTAATAAT : FDEGABEDDEBCGGEFFF7;=GEG :
:          :             :      :       :          :      :                   :      :          :      : TGTA                     : GFGC                     :
...

| 34555209 | sim_22_1_7  |   83 | chr22 | 45060375 |  255 | 41=1X56=1X1=      | =    | 45060162 |    0 | ACCAGCCTGGCCAACATGGTGAAA | 5#>?CG=E,A=#AG#?#>GAEA@= |
:          :             :      :       :          :      :                   :      :          :      : CGCTGTCTCTACTAAAAGTACAAA : DB<>ED@?CFC@A?=EG65GGD:C :
:          :             :      :       :          :      :                   :      :          :      : AAAATTAGCTAGGCGTGGTGGTGG : @EE(BFABFG5B?G;EFE?@5EG5 :
:          :             :      :       :          :      :                   :      :          :      : GCACCTGTAATCCCAGCTACTCGG : GAGD5@;DFGG?F?7GAGGAFEFA :
:          :             :      :       :          :      :                   :      :          :      : GGAG                     : >E-E                     :
| 34555441 | sim_22_1_4  |   99 | chr22 | 46558150 |  255 | 100=              | =    | 46558385 |    0 | CCCTGGCTGGAGTACAAGTTACTG | G6EG5GGGGDGGADFGEB?GFDGG |
:          :             :      :       :          :      :                   :      :          :      : GGGCTCAGTTCTTAGGATTCCTAC : AEGGCFFBECGGB*BEA9BFG6E= :
:          :             :      :       :          :      :                   :      :          :      : AAAGCCCAATTTGGGTCACCTGAC : ?=)C5FGAFE6G2FB=CFECF#DA :
:          :             :      :       :          :      :                   :      :          :      : TTGTTAAGTCAGGTGAAGGTGACT : EAG<AGD7F*44D:#B>+?F##;E :
:          :             :      :       :          :      :                   :      :          :      : TACC                     : ;#DD                     :
+----------+-------------+------+-------+----------+------+-------------------+------+----------+------+--------------------------+--------------------------+
21 tuples (7.336ms)

Q6. Count the number of alignment records for every chromosome from all data files that have been loaded into the sequential schema:

sql> SELECT rname, COUNT(*) FROM (SELECT * FROM bam.alignments_1 UNION SELECT * FROM bam.alignments_2) AS tmp GROUP BY rname;
+-------+------+
| rname | L1   |
+=======+======+
| chr13 |    2 |
| chr14 |    4 |
| chr18 |    2 |
| chr22 |  126 |
| chr2  |    2 |
| chr3  |    2 |
| chr9  |    4 |
+-------+------+
7 tuples (32.554ms)

Queries using pairwise schema

Q7. Count the number of paired primary alignments

sql> SELECT COUNT(*) FROM bam.paired_primary_alignments_4;
+------+
| L1   |
+======+
|   21 |
+------+
1 tuple (2.408ms)

Q8. Extract paired sequences

sql> SELECT l_seq, r_seq FROM bam.paired_primary_alignments_3;
+---------------------------------------------------------------------+----------------------------------------------------------------------+
| l_seq                                                               | r_seq                                                                |
+=====================================================================+======================================================================+
| ACTCTGTCGCCCAGGCCGGGGCGTAGCCATGCAAACACGGCTCACTGCAGCCTCGACCTCCCCGGCT | AGTAGATGGGATTACAGGCACCCGCCACCGCGCCCAGCTAATTTTTATATATTTTTTAAGTAGAGATG |
: CAAGCCATCCTCCCACCTCAACCTCCCAAGTAG                                   : GGGTTTCACCATGTTGGCCAGGCTGGTCTTGA                                     :
| TGATGCTGAGCTTGACGGCGGTCCCCCGGGTACCACTCCATCACCCAGTGAAGGAGGCAGGTGGGAC | TGGCCAAGTAATCATGCTAACTAGAGCTCTAGCTTCCTCTCCCTGGTTTCGCTTTTCTAAGTATAAAT |
: CATCTCAGACATGGGCTGTGTTGGGACCCCCAG                                   : AAATAAATAAATGATAAATTGCCTTATTTTTT                                     :
...

| AATATAAATTAAAATCTCAGGCCTCCATCCCCTATCTACTGAATCAGAATCTGCATTTTAACAGATC | CGTAGTGAGACCCCATCTCCAAAAAATAAATAAATTAAGGACAGGATGGGCATGGACCACTCTTCAAG |
: TTTTGTGGTTCATATACAGTTGAGTTCTTTTTT                                   : ATTACTGACCAGCTTCTTGATATTGGACAACT                                     :
| CTTTTCTATATAAAACAAATAATTTATAATACTGTATTTGTTTTCTATTACTATTGAAATTACTACA | TGCCTTTTCCAGCTGATAGAGACAGCCATGTTTCATGGCTTGTAGCCCCTTCCTCCATCTTCAAAGCC |
: AATGTAGTAGCTAAAACAACACATATTTTTCAT                                   : AACAACAACTGGTCAAGTTGTCACATTGCATC                                     :
+---------------------------------------------------------------------+----------------------------------------------------------------------+
21 tuples (2.831ms)

Q9. Find the primary alignment pairs whose left and right reads have different mapping qualities:

sql> SELECT qname FROM bam.paired_primary_alignments_3 WHERE l_mapq <> r_mapq;
+-------+
| qname |
+=======+
+-------+
0 tuples (2.736ms)

Q10. Compute the distances of the paired primary alignments

sql> SELECT qname, r_pos - (l_pos + bam.seq_length(l_cigar)) AS dist FROM bam.paired_primary_alignments_4 WHERE l_pos < r_pos;
+-------------+------+
| qname       | dist |
+=============+======+
| sim_22_1_2  |   91 |
| sim_22_1_3  |   93 |
| sim_22_1_4  |  135 |
| sim_22_1_5  |  108 |
| sim_22_1_6  |  108 |
| sim_22_1_10 |  108 |
| sim_22_1_14 |  153 |
| sim_22_1_16 |  108 |
| sim_22_1_a  |  140 |
| sim_22_1_b  |  112 |
| sim_22_1_d  |   93 |
| sim_22_1_f  |  108 |
+-------------+------+
12 tuples (8.376ms)

Q11. Create a histogram of the distances of the paired primary alignments

sql> SELECT r_pos - (l_pos + bam.seq_length(l_cigar)) AS dist, COUNT(*) AS cnt 
FROM bam.paired_primary_alignments_4 WHERE l_pos < r_pos GROUP BY dist;
+------+------+
| dist | cnt  |
+======+======+
|   91 |    1 |
|   93 |    2 |
|  135 |    1 |
|  108 |    5 |
|  153 |    1 |
|  140 |    1 |
|  112 |    1 |
+------+------+
7 tuples (6.406ms)

 

SAM/BAM SQL functions

SAM/BAM SQL functions zhang Tue, 07/08/2014 - 18:16

The MonetDB SAM/BAM module contains a small library of SQL functions which implement the most common operations in DNA sequence alignment data analysis.

bam.bam_flag(flag SMALLINT, name STRING) RETURNS BOOLEAN

Returns a BOOLEAN value to indicate whether the bit with the given name was set to 1 in the given flag or not. The flag fields as defined by the SAM specification were used.

flag: the integer flag value of an assignment record

name: the name of the bit that should be checked. The following names are used for the different flags:
 

Bit Name Description (from the SAM specification)
0x1 mult_seqm template having multiple segments in sequencing
0x2 prop_alig each segment properly aligned according to the aligner
0x4 segm_unma segment unmapped
0x8 next_unma next segment in the template unmapped
0x10 segm_reve SEQ being reverse complemented
0x20 next_reve SEQ of the next segment in the template being reversed
0x40 firs_segm the first segment in the template
0x80 last_segm the last segment in the template
0x100 seco_alig secondary alignment
0x200 qual_cont not passing quality controls
0x400 opti_dupl PCR or optical duplicate
0x800 supp_alig supplementary alignment

Example query: selecting all primary alignments:

SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = false;
bam.reverse_seq(seq STRING) RETURNS STRING

Computes the reverse complement of the given DNA sequence. The function uses the following complement mapping:

A ↔ T
C ↔ G
R ↔ Y
S ↔ S
W ↔ W
K ↔ M
H ↔ D
V ↔ B
N ↔ N
bam.reverse_qual(qual STRING) RETURNS STRING

Computes the reverse of the given quality string. Example:

sql>select bam.reverse_qual('h@!j');
+---------------------------+
| reverse_qual_single_value |
+===========================+
| j!@h                      |
+---------------------------+
1 tuple (0.410ms)


bam.seq_length(cigar STRING) RETURNS INT

Use the CIGAR string of an alignment to compute the length of the area of the reference string, to which this alignment is mapped. Example (taken from http://genome.sph.umich.edu/wiki/SAM):

Assume we have a read that is aligned like this to the reference string:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T

This would yield the CIGAR string 3M1I3M1D5M. From this example, it is easy to see that the alignment has length 12 (from position 5 up to and including position 16). However, we can use just the CIGAR string to figure out this length:

sql>SELECT bam.seq_length('3M1I3M1D5M');
+-------------------------+
| seq_length_single_value |
+=========================+
|                      12 |
+-------------------------+
1 tuple (0.388ms)

Hence, it is easy to compute this length for all of the alignments in a particular file, by issuing the following query:

SELECT bam.seq_length(cigar) FROM bam.alignments_1;

 

bam.seq_char(ref_pos INT, alg_seq STRING, alg_pos STRING, alg_cigar STRING) RETURNS CHAR

Use alignment information to figure out which character is aligned with a certain position in the reference genome.

Returns the character that was found at the reference position of interest, or null if no alignment was found at this position.

ref_pos: The position in the reference

alg_seq: The sequence string of the alignment of interest

alg_pos: The value in the pos field of the alignment of interest (which stores the starting position of the alignment)

alg_cigar: The CIGAR string of the alignment of interest

Let's take a look at the example we used earlier:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T

We can now use the seq_char function to confirm what we see above:

sql>SELECT bam.seq_char(5, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| A                     |
+-----------------------+
1 tuple (0.458ms)

In words, this query selects the character that is present on reference position 5, which is indeed an A. Some other examples figure out the characters on positions 11, 16 and 17:

sql>SELECT bam.seq_char(11, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(16, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| T                     |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(17, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.110ms)

 

 

Given this function, it becomes really simple to e.g. select all read alignments of a SAM/BAM file that overlap with a certain position:

sql>SELECT seq, pos, cigar FROM bam.alignments_1 WHERE  bam.seq_char(15600025, seq, pos, cigar) IS NOT NULL;
+------------------------------------------------------------------------------------------------------+----------+-------------+
| seq                                                                                                  | pos      | cigar       |
+======================================================================================================+==========+=============+
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600021 | 3=1X96=     |
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600023 | 1=1I1=1I96= |
+------------------------------------------------------------------------------------------------------+----------+-------------+
2 tuples (1.551ms)

Note: The bam.seq_char function is currently only available in the bamloader branch, which will be merged into the next feature release of MonetDB.

Export SAM/BAM data

Export SAM/BAM data zhang Mon, 07/14/2014 - 17:55

There are two ways to export query results into SAM files: using the server-side sam_export() function, or using the client-side SAM formatter. The server side export function works independent of which client interface is used, however, it can only write data to files to which the MonetDB server has access. With the client side export function, one can export data to any files the client interface has access to, independent of where the MonetDB server resides. However, each client interface needs to be extended with its own SAM formatter. Currently, only mclient is supported.

Server-side SAM export

  • bam.sam_export(output_path STRING)

    This procedure exports all records in the table "bam.export" to the file "output_path". If "output_path" is a relative path, the file is created in the database farm currently served by this MonetDB server (see the man-pages monetdbd and mserver5 for the definition of "dbfarm" and "dbpath"). If the file "output_path" already exists, it is overwritten. This procedure will also delete all records from the table "bam.export" after the export.

The queries below show how to i) populate the "bam.export" table, ii) export data, iii) load exported data into the database, and finally iv) conduct some checks. Note that when insert data into the "bam.export" table, one needs to select the individual columns. This is because the "bam.export" table does not contain a column for the virtual offsets, because the virtual offsets can only be computed afterwards when constructing BAM files.

Populate the export table with a sequentially stored file

sql> INSERT INTO bam.export ( 
SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_1);
sql> CALL bam.sam_export('OUTPUT_1'); -- Export data to SAM file
sql> CALL bam.bam_loader_file('OUTPUT_1', 0); -- Load exported data back into a sequential table
sql> -- Data inside original table should be exactly the same as the newly imported file
sql> SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_1
EXCEPT
SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_13; -- assume OUTPUT_1 is loaded into bam.alignments_13
sql> SELECT * FROM bam.export; -- Verify that the export table is now empty

Populate the export table with a pairwise stored file

sql> INSERT INTO bam.export (
    SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
    FROM bam.unpaired_all_alignments_3
);
sql> CALL bam.sam_export('OUTPUT_2'); -- Export data to SAM file
sql> CALL bam.bam_loader_file('OUTPUT_2', 0); -- Load exported data back into a sequential table
sql> -- Data inside original table should be exactly the same as the newly imported file
sql> SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.unpaired_all_alignments_3
EXCEPT
SELECT qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
FROM bam.alignments_14; -- assume OUTPUT_2 is loaded into bam.alignments_14
sql> SELECT * FROM bam.export; -- Verify that the export table is now empty

Client-side SAM export

If you are using mclient to communicate with MonetDB, you can use the built-in SAM formatter to export query results to a SAM formatted file. Note that this requires the output columns to have the appropriate names, i.e., the renderer looks for the column names as defined by the alignments_N table as defined in the sequential schema (see SAM/BAM storage schemas). If there are columns that cannot be mapped, you will receive a notification about this and the renderer will simply discard the data in the not recognised columns.

To use the SAM formatter, type in mclient:

sql> \f sam

Now, if you execute a query using mclient, the result will be displayed in SAM format.

To write the query result to a SAM file, e.g., "/tmp/out.sam", type the following command in mclient:

sql> \> /tmp/out.sam

Now, results of all subsequent queries will be appended to this SAM file.

You can easily generate multiple SAM files by passing the "\>" multiple output file names:

sql> \> /tmp/out1.sam
sql> SELECT * FROM bam.alignments_1;
sql> \> /tmp/out2.sam
sql> SELECT * FROM bam.alignments_2;