The MonetDB SAM/BAM module contains a small library of SQL functions which implement the most common operations in DNA sequence alignment data analysis.
bam.bam_flag(flag SMALLINT, name STRING) RETURNS BOOLEAN
Returns a 'BOOLEAN' value to indicate whether the bit with the given 'name' was set to 1 in the given 'flag' or not. The flag fields as defined by the SAM specification were used.
flag
: the integer flag value of an assignment record
name
: the name of the bit that should be checked. The following names are used for the different flags:
Example query: selecting all primary alignments:
SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = false;
bam.reverse_seq(seq STRING) RETURNS STRING
Computes the reverse complement of the given DNA sequence. The function uses the following complement mapping:
Bit | Name | Description (from the SAM specification) |
---|---|---|
0x1 | mult_seqm | template having multiple segments in sequencing |
0x2 | prop_alig | each segment properly aligned according to the aligner |
0x4 | segm_unma | segment unmapped |
0x8 | next_unma | next segment in the template unmapped |
0x10 | segm_reve | SEQ being reverse complemented |
0x20 | next_reve | SEQ of the next segment in the template being reversed |
0x40 | firs_segm | the first segment in the template |
0x80 | last_segm | the last segment in the template |
0x100 | seco_alig | secondary alignment |
0x200 | qual_cont | not passing quality controls |
0x400 | opti_dupl | PCR or optical duplicate |
0x800 | supp_alig | supplementary alignment |
A ↔ T
C ↔ G
R ↔ Y
S ↔ S
W ↔ W
K ↔ M
H ↔ D
V ↔ B
N ↔ N
bam.reverse_qual(qual STRING) RETURNS STRING
Computes the reverse of the given quality string. Example:
sql>select bam.reverse_qual('h@!j');
+---------------------------+
| reverse_qual_single_value |
+===========================+
| j!@h |
+---------------------------+
1 tuple (0.410ms)
bam.seq_length(cigar STRING) RETURNS INT
Use the CIGAR string of an alignment to compute the length of the area of the reference string, to which this alignment is mapped. Example (taken from http://genome.sph.umich.edu/wiki/SAM):
Assume we have a read that is aligned like this to the reference string:
RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: A C T A G A A T G A C T ```
This would yield the CIGAR string 3M1I3M1D5M
. From this example,
it is easy to see that the alignment has length 12 (from position 5 up to and including position 16).
However, we can use just the CIGAR string to figure out this length:
sql>SELECT bam.seq_length('3M1I3M1D5M');
+-------------------------+
| seq_length_single_value |
+=========================+
| 12 |
+-------------------------+
1 tuple (0.388ms)
Hence, it is easy to compute this length for all of the alignments in a particular file, by issuing the following query:
SELECT bam.seq_length(cigar) FROM bam.alignments_1;
bam.seq_char(ref_pos INT, alg_seq STRING, alg_pos STRING, alg_cigar STRING) RETURNS CHAR
Use alignment information to figure out which character is aligned with a certain position in the reference genome.
Returns the character that was found at the reference position of interest, or null if no alignment was found at this position.
ref_pos
: The position in the reference
alg_seq
: The sequence string of the alignment of interest
alg_pos
: The value in the pos field of the alignment of interest (which stores the starting position of the alignment)
alg_cigar
: The CIGAR string of the alignment of interest
Let's take a look at the example we used earlier:
RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: A C T A G A A T G A C T
We can now use the seq_char function to confirm what we see above:
sql>SELECT bam.seq_char(5, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| A |
+-----------------------+
1 tuple (0.458ms)
In words, this query selects the character that is present on reference position 5, which is indeed an A. Some other examples figure out the characters on positions 11, 16 and 17:
sql>SELECT bam.seq_char(11, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(16, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| T |
+-----------------------+
1 tuple (0.118ms)
sql>SELECT bam.seq_char(17, 'ACTAGAATGGCT', 5, '3M1I3M1D5M');
+-----------------------+
| seq_char_single_value |
+=======================+
| null |
+-----------------------+
1 tuple (0.110ms)
Given this function, it becomes really simple to e.g. select all read alignments of a SAM/BAM file that overlap with a certain position:
sql>SELECT seq, pos, cigar FROM bam.alignments_1 WHERE bam.seq_char(15600025, seq, pos, cigar) IS NOT NULL;
+------------------------------------------------------------------------------------------------------+----------+-------------+
| seq | pos | cigar |
+======================================================================================================+==========+=============+
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600021 | 3=1X96= |
| CACATTTTCAAAAAACAAAAAAAAGTCTGAGCTCCTACTGTTGATTTAAATTCTTTTATAAATCTCTATCAAACTTTTCATGTTTACAGTTCTTATGCAA | 15600023 | 1=1I1=1I96= |
+------------------------------------------------------------------------------------------------------+----------+-------------+
2 tuples (1.551ms)
Note: The bam.seq_char function is currently only available in the bamloader branch, which will be merged into a future release of MonetDB.