Sam Bam Functions

The MonetDB SAM/BAM module contains a small library of SQL functions which implement the most common operations in DNA sequence alignment data analysis.

bam.bam_flag(flag SMALLINT, name STRING) RETURNS BOOLEAN

Returns a 'BOOLEAN' value to indicate whether the bit with the given 'name' was set to 1 in the given 'flag' or not. The flag fields as defined by the SAM specification were used.

flag: the integer flag value of an assignment record

name: the name of the bit that should be checked. The following names are used for the different flags:

Example query: selecting all primary alignments:

SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = false;

bam.reverse_seq(seq STRING) RETURNS STRING

Computes the reverse complement of the given DNA sequence. The function uses the following complement mapping:

BitNameDescription (from the SAM specification)
0x1mult_seqmtemplate having multiple segments in sequencing
0x2prop_aligeach segment properly aligned according to the aligner
0x4segm_unmasegment unmapped
0x8next_unmanext segment in the template unmapped
0x10segm_reveSEQ being reverse complemented
0x20next_reveSEQ of the next segment in the template being reversed
0x40firs_segmthe first segment in the template
0x80last_segmthe last segment in the template
0x100seco_aligsecondary alignment
0x200qual_contnot passing quality controls
0x400opti_duplPCR or optical duplicate
0x800supp_aligsupplementary alignment
A ↔ T
C ↔ G
R ↔ Y
S ↔ S
W ↔ W
K ↔ M
H ↔ D
V ↔ B
N ↔ N

bam.reverse_qual(qual STRING) RETURNS STRING

Computes the reverse of the given quality string. Example:

sql>select bam.reverse_qual('h@!j');
+---------------------------+
| reverse_qual_single_value |
+===========================+
| j!@h                      |
+---------------------------+
1 tuple (0.410ms)

bam.seq_length(cigar STRING) RETURNS INT

Use the CIGAR string of an alignment to compute the length of the area of the reference string, to which this alignment is mapped. Example (taken from http://genome.sph.umich.edu/wiki/SAM):

Assume we have a read that is aligned like this to the reference string:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T ```

This would yield the CIGAR string 3M1I3M1D5M. From this example, it is easy to see that the alignment has length 12 (from position 5 up to and including position 16). However, we can use just the CIGAR string to figure out this length:

sql>SELECT bam.seq_length('3M1I3M1D5M');
+-------------------------+
| seq_length_single_value |
+=========================+
|                      12 |
+-------------------------+
1 tuple (0.388ms)

Hence, it is easy to compute this length for all of the alignments in a particular file, by issuing the following query:

SELECT bam.seq_length(cigar) FROM bam.alignments_1;

bam.seq_char(ref_pos INT, alg_seq STRING, alg_pos STRING, alg_cigar STRING) RETURNS CHAR

Use alignment information to figure out which character is aligned with a certain position in the reference genome.

Returns the character that was found at the reference position of interest, or null if no alignment was found at this position.

ref_pos: The position in the reference

alg_seq: The sequence string of the alignment of interest

alg_pos: The value in the pos field of the alignment of interest (which stores the starting position of the alignment)

alg_cigar: The CIGAR string of the alignment of interest

Let's take a look at the example we used earlier:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T

We can now use the seq_char function to confirm what we see above:

sql>SELECT bam.seq_char(5, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| A                     |
+-----------------------+
1 tuple (0.458ms)

In words, this query selects the character that is present on reference position 5, which is indeed an A. Some other examples figure out the characters on positions 11, 16 and 17:

sql>SELECT bam.seq_char(11, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(16, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| T                     |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(17, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.110ms)

Given this function, it becomes really simple to e.g. select all read alignments of a SAM/BAM file that overlap with a certain position:

sql>SELECT seq, pos, cigar FROM bam.alignments_1 WHERE  bam.seq_char(15600025, seq, pos, cigar) IS NOT NULL;
+------------------------------------------------------------------------------------------------------+----------+-------------+
| seq                                                                                                  | pos      | cigar       |
+======================================================================================================+==========+=============+
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600021 | 3=1X96=     |
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600023 | 1=1I1=1I96= |
+------------------------------------------------------------------------------------------------------+----------+-------------+
2 tuples (1.551ms)

Note: The bam.seq_char function is currently only available in the bamloader branch, which will be merged into a future release of MonetDB.