Documentation > DEVELOPER GUIDE > SQL Extensions > Life Sciences > Sam Bam Functions

Sam Bam Functions

The MonetDB SAM/BAM module contains a small library of SQL functions which implement the most common operations in DNA sequence alignment data analysis.

bam.bam_flag(flag SMALLINT, name STRING) RETURNS BOOLEAN

Returns a 'BOOLEAN' value to indicate whether the bit with the given 'name' was set to 1 in the given 'flag' or not. The flag fields as defined by the SAM specification were used.

flag: the integer flag value of an assignment record

name: the name of the bit that should be checked. The following names are used for the different flags:

Example query: selecting all primary alignments:

SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = false;

bam.reverse_seq(seq STRING) RETURNS STRING

Computes the reverse complement of the given DNA sequence. The function uses the following complement mapping:

Bit	Name	Description (from the SAM specification)
0x1	mult_seqm	template having multiple segments in sequencing
0x2	prop_alig	each segment properly aligned according to the aligner
0x4	segm_unma	segment unmapped
0x8	next_unma	next segment in the template unmapped
0x10	segm_reve	SEQ being reverse complemented
0x20	next_reve	SEQ of the next segment in the template being reversed
0x40	firs_segm	the first segment in the template
0x80	last_segm	the last segment in the template
0x100	seco_alig	secondary alignment
0x200	qual_cont	not passing quality controls
0x400	opti_dupl	PCR or optical duplicate
0x800	supp_alig	supplementary alignment

A ↔ T
C ↔ G
R ↔ Y
S ↔ S
W ↔ W
K ↔ M
H ↔ D
V ↔ B
N ↔ N

bam.reverse_qual(qual STRING) RETURNS STRING

Computes the reverse of the given quality string. Example:

sql>select bam.reverse_qual('h@!j');
+---------------------------+
| reverse_qual_single_value |
+===========================+
| j!@h                      |
+---------------------------+
1 tuple (0.410ms)

bam.seq_length(cigar STRING) RETURNS INT

Use the CIGAR string of an alignment to compute the length of the area of the reference string, to which this alignment is mapped. Example (taken from http://genome.sph.umich.edu/wiki/SAM):

Assume we have a read that is aligned like this to the reference string:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T ```

This would yield the CIGAR string 3M1I3M1D5M. From this example, it is easy to see that the alignment has length 12 (from position 5 up to and including position 16). However, we can use just the CIGAR string to figure out this length:

sql>SELECT bam.seq_length('3M1I3M1D5M');
+-------------------------+
| seq_length_single_value |
+=========================+
|                      12 |
+-------------------------+
1 tuple (0.388ms)

Hence, it is easy to compute this length for all of the alignments in a particular file, by issuing the following query:

SELECT bam.seq_length(cigar) FROM bam.alignments_1;

bam.seq_char(ref_pos INT, alg_seq STRING, alg_pos STRING, alg_cigar STRING) RETURNS CHAR

Use alignment information to figure out which character is aligned with a certain position in the reference genome.

Returns the character that was found at the reference position of interest, or null if no alignment was found at this position.

ref_pos: The position in the reference

alg_seq: The sequence string of the alignment of interest

alg_pos: The value in the pos field of the alignment of interest (which stores the starting position of the alignment)

alg_cigar: The CIGAR string of the alignment of interest

Let's take a look at the example we used earlier:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  A  C  T

We can now use the seq_char function to confirm what we see above:

sql>SELECT bam.seq_char(5, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| A                     |
+-----------------------+
1 tuple (0.458ms)

In words, this query selects the character that is present on reference position 5, which is indeed an A. Some other examples figure out the characters on positions 11, 16 and 17:

sql>SELECT bam.seq_char(11, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(16, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| T                     |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(17, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null                  |
+-----------------------+
1 tuple (0.110ms)

Given this function, it becomes really simple to e.g. select all read alignments of a SAM/BAM file that overlap with a certain position:

sql>SELECT seq, pos, cigar FROM bam.alignments_1 WHERE  bam.seq_char(15600025, seq, pos, cigar) IS NOT NULL;
+------------------------------------------------------------------------------------------------------+----------+-------------+
| seq                                                                                                  | pos      | cigar       |
+======================================================================================================+==========+=============+
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600021 | 3=1X96=     |
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600023 | 1=1I1=1I96= |
+------------------------------------------------------------------------------------------------------+----------+-------------+
2 tuples (1.551ms)

Note: The bam.seq_char function is currently only available in the bamloader branch, which will be merged into a future release of MonetDB.