Skip to main content

Query DNA sequence data

The use of the MonetDB SAM/BAM module to query DNA alignment data is shown using a dump of an actual session. More examples can be found in the MonetDB test directory code sql/backends/monet5/bam/Tests for further inspiration.

We start off by loading some SAM and BAM files into MonetDB (see Loading SAM/BAM data):

sql> CALL bam.bam_loader_file('/tmp/file1.bam', 0);
sql> CALL bam.bam_loader_file('/tmp/file2.sam', 0);
sql> CALL bam.bam_loader_file('/tmp/file1-sorted.bam', 1);
sql> CALL bam.bam_loader_file('/tmp/file2-sorted.sam', 1);

All loaded SAM and BAM files are stored in the database table bam.files:

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)

A summary of the database now looks as follows:

sql>set schema bam;
auto commit mode: on
sql>\d
TABLE  bam.alignments_1
TABLE  bam.alignments_2
TABLE  bam.alignments_extra_1
TABLE  bam.alignments_extra_2
TABLE  bam.alignments_extra_3
TABLE  bam.alignments_extra_4
TABLE  bam.export
TABLE  bam.files
TABLE  bam.paired_primary_alignments_3
TABLE  bam.paired_primary_alignments_4
TABLE  bam.paired_secondary_alignments_3
TABLE  bam.paired_secondary_alignments_4
TABLE  bam.pg
TABLE  bam.rg
TABLE  bam.sq
TABLE  bam.unpaired_alignments_3
TABLE  bam.unpaired_alignments_4
VIEW   bam.unpaired_all_alignments_3
VIEW   bam.unpaired_all_alignments_4
VIEW   bam.unpaired_primary_alignments_3
VIEW   bam.unpaired_primary_alignments_4
VIEW   bam.unpaired_secondary_alignments_3
VIEW   bam.unpaired_secondary_alignments_4
sql>set schema sys;
auto commit mode: on

Now we can use SQL queries to conduct various analyse steps on the alignment data. The module has several distinguished features to facilitate this:

  • Auxilliary functions for common SAM/BAM analysis operations (see also SAM/BAM SQL functions).
  • Possibility of doing analyses on data from multiple files, which is impossible with many other tools.
  • Both unpaired (i.e., sequential) and pairwise storage schemas to ease and accelerating data analysis depending on the types of queries.

Queries using sequential schema

Q1. Count the number of alignment records and chromosomes contained in "file1.bam":

sql> SELECT COUNT(*), COUNT(DISTINCT rname) FROM bam.alignments_1;
+------+------+
| L1   | L2   |
+======+======+
|   71 |    7 |
+------+------+
1 tuple (9.272ms)

Q2. Count the number of alignment records for every chromosome:

sql> SELECT rname, COUNT(*) FROM bam.alignments_1 GROUP BY rname;
+-------+------+
| rname | L1   |
+=======+======+
| chr13 |    1 |
| chr14 |    2 |
| chr18 |    1 |
| chr22 |   63 |
| chr2  |    1 |
| chr3  |    1 |
| chr9  |    2 |
+-------+------+
7 tuples (9.179ms)

Q3. Find all alignments with a mapping quality above 200 and marked as reversed, sort the results by their rname and pos:

sql> SELECT * FROM bam.alignments_1 WHERE mapq > 200 AND bam.bam_flag(flag, 'segm_reve') = TRUE ORDER BY rname, pos;
+----------+------------+-----+-------+----------+-----+----------+-----+----------+----+----------------------------+-----------------------------+
| virtual_ | qname      | fla | rname | pos      | map | cigar    | rne | pnext    | tl | seq                        | qual                        |
: offset   :            : g   :       :          : q   :          : xt  :          : en :                            :                             :
+==========+============+======+=======+==========+======+========+=====+==========+====+============================+=============================+
| 34538632 | sim_22_1_a | 147 | chr22 | 15047749 | 255 | 100=     | =   | 15047509 |  0 | ATGGTTTTTGCCTGGTACTGTTGAAG | ##A#A####D5A>F#E@:G2#3ADBD: |
:          :            :     :       :          :     :          :     :          :    : TTAGGCTTAATTTTGAACCAGTAGCT : ?##@CE-?D>BEG?D5DEABG3B5EC? :
:          :            :     :       :          :     :          :     :          :    : TTGTTGTTTACCTTATGTGGTTTTGG : CGDEDE=B@AE5@BDDGD>/DGBAE5@ :
:          :            :     :       :          :     :          :     :          :    : GTTCATTTGTTCTATAAGTATA     : EGFEGEEEFEAD;BB?FGC         :
| 34539304 | sim_22_1_9 |  83 | chr22 | 15600266 | 255 | 100=     | =   | 15600021 |  0 | TTTAAAATATTAAAAGATGAATTACT | 5<###E?##AB#E=DB#<==;#E#?=A |
:          :            :     :       :          :     :          :     :          :    : ATCAATTGTTTTGAATTTTAAACTAA : FEE@C@AEA6GDGFFG?DE??::?=@B :
:          :            :     :       :          :     :          :     :          :    : AAATCAGTAGTTACTATAAAATTATT : C#G>GB>GEEDA=G>ADEFDEGABEDD :
:          :            :     :       :          :     :          :     :          :    : ATTAAATGTTCTAATAATTGTA     : EBCGGEFFF7;=GEGGFGC         :
...
| 34555209 | sim_22_1_7 |  83 | chr22 | 45060375 | 255 | 41=1X56= | =   | 45060162 |  0 | ACCAGCCTGGCCAACATGGTGAAACG | 5#>?CG=E,A=#AG#?#>GAEA@=DB< |
:          :            :     :       :          :     : 1x1=     :     :          :    : CTGTCTCTACTAAAAGTACAAAAAAA : >ED@?CFC@A?=EG65GGD:C@EE(BF :
:          :            :     :       :          :     :          :     :          :    : TTAGCTAGGCGTGGTGGTGGGCACCT : ABFG5B?G;EFE?@5EG5GAGD5@;DF :
:          :            :     :       :          :     :          :     :          :    : GTAATCCCAGCTACTCGGGGAG     : GG?F?7GAGGAFEFA>E-E         :
| 34555657 | sim_22_1_4 | 147 | chr22 | 46558385 | 255 | 1X49=1X9 | =   | 46558150 |  0 | TGACTCCAGGGAAGTTAAATAAATGC | #E###BF#?;;#>@=1#@=E=B#E-ED |
:          :            :     :       :          :     : =1x39=   :     :          :    : CAAGAGGATTCACGTGGCAGGGCCAT : F?DBC#EDEF?;6FE>+GE@GDE.AED :
:          :            :     :       :          :     :          :     :          :    : GGAGGTGGGGAGGAAGGAGCACAGGA : E+DAF:4E>FF=EDA;DAEGEGAEFD? :
:          :            :     :       :          :     :          :     :          :    : GACACCAGGTGCTCACTGTCCA     : GD=DG=G=C5FEGDGGFAG         :
+----------+------------+-----+-------+----------+-----+----------+-----+----------+----+----------------------------+-----------------------------+
21 tuples (4.390ms)

Q4. Compute the reverse complement of the sequence and the quality strings.

sql> SELECT seq, bam.reverse_seq(seq) AS rev_seq, qual, bam.reverse_qual(qual) AS rev_qual FROM bam.alignments_1;
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------------------------+
| seq                                  | rev_seq                              | qual                                 | rev_qual                              |
+======================================+======================================+======================================+=======================================+
| TATACTTATAGAACAAATGAACCCAAAACCACATAA | ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5E | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D> |
: GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : A@B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@## : BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD>/ :
: AACTTCAACAGTACCAGGCAAAAACCAT         : TTTTGGGTTCATTTGTTCTATAAGTATA         : ?:DBDA3#2G:@E#F>A5D####A#A##         : DGBAE5@EGFEGEEEFEAD;BB?FGC            :
| TATACTTATAGAACAAATGAACCCAAAACCACATAA | ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5E | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D> |
: GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : A@B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@## : BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD>/ :
: AACTTCAACAGTACCAGGCAAAAACCAT         : TTTTGGGTTCATTTGTTCTATAAGTATA         : ?:DBDA3#2G:@E#F>A5D####A#A##         : DGBAE5@EGFEGEEEFEAD;BB?FGC            :
...
| ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | TATACTTATAGAACAAATGAACCCAAAACCACATAA | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5EA |
: TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : >BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD : @B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@##?: :
: TTTTGGGTTCATTTGTTCTATAAGTATA         : AACTTCAACAGTACCAGGCAAAAACCAT         : >/DGBAE5@EGFEGEEEFEAD;BB?FGC         : DBDA3#2G:@E#F>A5D####A#A##            :
| ATGGTTTTTGCCTGGTACTGTTGAAGTTAGGCTTAA | TATACTTATAGAACAAATGAACCCAAAACCACATAA | ##A#A####D5A>F#E@:G2#3ADBD:?##@CE-?D | CGF?BB;DAEFEEEGEFGE@5EABGD/>DGDDB@5EA |
: TTTTGAACCAGTAGCTTTGTTGTTTACCTTATGTGG : GGTAAACAACAAAGCTACTGGTTCAAAATTAAGCCT : >BEG?D5DEABG3B5EC?CGDEDE=B@AE5@BDDGD : @B=EDEDGC?CE5B3GBAED5D?GEB>D?-EC@##?: :
: TTTTGGGTTCATTTGTTCTATAAGTATA         : AACTTCAACAGTACCAGGCAAAAACCAT         : >/DGBAE5@EGFEGEEEFEAD;BB?FGC         : DBDA3#2G:@E#F>A5D####A#A##            :
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------------------------+
71 tuples (5.768ms)

Q5. Select alignment records that are flagged as being both primary and a first segment:

sql> SELECT * FROM bam.alignments_1 WHERE bam.bam_flag(flag, 'seco_alig') = FALSE AND bam.bam_flag(flag, 'firs_segm') = TRUE;
+----------+-------------+------+-------+----------+------+-------------------+------+----------+------+--------------------------+--------------------------+
| virtual_ | qname       | flag | rname | pos      | mapq | cigar             | rnex | pnext    | tlen | seq                      | qual                     |
: offset   :             :      :       :          :      :                   : t    :          :      :                          :                          :
+==========+=============+======+=======+==========+======+===================+======+==========+======+==========================+==========================+
| 34538416 | sim_22_1_a  |   99 | chr22 | 15047509 |  255 | 100=              | =    | 15047749 |    0 | TAAAAACTTGCTGGTTTTGCGGCT | F?CBF7BGFGGGDD@GFF@B;:GG |
:          :             :      :       :          :      :                   :      :          :      : TGGGGGGCATCACGGAACCTACTG : GEGCFF>EFGGDFFEDEFFG1EGC :
:          :             :      :       :          :      :                   :      :          :      : ACACGTGATGTCTCCCCTGGATGC : C?BDF#F=ED=E?@GBF@=GDED; :
:          :             :      :       :          :      :                   :      :          :      : CCAGCTTTAAAATTTCCCACTTTT : 856FD?@=F#FE?EB5CEGCFE#F :
:          :             :      :       :          :      :                   :      :          :      : GTAC                     : ####                     :
| 34539304 | sim_22_1_9  |   83 | chr22 | 15600266 |  255 | 100=              | =    | 15600021 |    0 | TTTAAAATATTAAAAGATGAATTA | 5<###E?##AB#E=DB#<==;#E# |
:          :             :      :       :          :      :                   :      :          :      : CTATCAATTGTTTTGAATTTTAAA : ?=AFEE@C@AEA6GDGFFG?DE?? :
:          :             :      :       :          :      :                   :      :          :      : CTAAAAATCAGTAGTTACTATAAA : ::?=@BC#G>GB>GEEDA=G>ADE :
:          :             :      :       :          :      :                   :      :          :      : ATTATTATTAAATGTTCTAATAAT : FDEGABEDDEBCGGEFFF7;=GEG :
:          :             :      :       :          :      :                   :      :          :      : TGTA                     : GFGC                     :
...

| 34555209 | sim_22_1_7  |   83 | chr22 | 45060375 |  255 | 41=1X56=1X1=      | =    | 45060162 |    0 | ACCAGCCTGGCCAACATGGTGAAA | 5#>?CG=E,A=#AG#?#>GAEA@= |
:          :             :      :       :          :      :                   :      :          :      : CGCTGTCTCTACTAAAAGTACAAA : DB<>ED@?CFC@A?=EG65GGD:C :
:          :             :      :       :          :      :                   :      :          :      : AAAATTAGCTAGGCGTGGTGGTGG : @EE(BFABFG5B?G;EFE?@5EG5 :
:          :             :      :       :          :      :                   :      :          :      : GCACCTGTAATCCCAGCTACTCGG : GAGD5@;DFGG?F?7GAGGAFEFA :
:          :             :      :       :          :      :                   :      :          :      : GGAG                     : >E-E                     :
| 34555441 | sim_22_1_4  |   99 | chr22 | 46558150 |  255 | 100=              | =    | 46558385 |    0 | CCCTGGCTGGAGTACAAGTTACTG | G6EG5GGGGDGGADFGEB?GFDGG |
:          :             :      :       :          :      :                   :      :          :      : GGGCTCAGTTCTTAGGATTCCTAC : AEGGCFFBECGGB*BEA9BFG6E= :
:          :             :      :       :          :      :                   :      :          :      : AAAGCCCAATTTGGGTCACCTGAC : ?=)C5FGAFE6G2FB=CFECF#DA :
:          :             :      :       :          :      :                   :      :          :      : TTGTTAAGTCAGGTGAAGGTGACT : EAG<AGD7F*44D:#B>+?F##;E :
:          :             :      :       :          :      :                   :      :          :      : TACC                     : ;#DD                     :
+----------+-------------+------+-------+----------+------+-------------------+------+----------+------+--------------------------+--------------------------+
21 tuples (7.336ms)

Q6. Count the number of alignment records for every chromosome from all data files that have been loaded into the sequential schema:

sql> SELECT rname, COUNT(*) FROM (SELECT * FROM bam.alignments_1 UNION SELECT * FROM bam.alignments_2) AS tmp GROUP BY rname;
+-------+------+
| rname | L1   |
+=======+======+
| chr13 |    2 |
| chr14 |    4 |
| chr18 |    2 |
| chr22 |  126 |
| chr2  |    2 |
| chr3  |    2 |
| chr9  |    4 |
+-------+------+
7 tuples (32.554ms)

Queries using pairwise schema

Q7. Count the number of paired primary alignments

sql> SELECT COUNT(*) FROM bam.paired_primary_alignments_4;
+------+
| L1   |
+======+
|   21 |
+------+
1 tuple (2.408ms)

Q8. Extract paired sequences

sql> SELECT l_seq, r_seq FROM bam.paired_primary_alignments_3;
+---------------------------------------------------------------------+----------------------------------------------------------------------+
| l_seq                                                               | r_seq                                                                |
+=====================================================================+======================================================================+
| ACTCTGTCGCCCAGGCCGGGGCGTAGCCATGCAAACACGGCTCACTGCAGCCTCGACCTCCCCGGCT | AGTAGATGGGATTACAGGCACCCGCCACCGCGCCCAGCTAATTTTTATATATTTTTTAAGTAGAGATG |
: CAAGCCATCCTCCCACCTCAACCTCCCAAGTAG                                   : GGGTTTCACCATGTTGGCCAGGCTGGTCTTGA                                     :
| TGATGCTGAGCTTGACGGCGGTCCCCCGGGTACCACTCCATCACCCAGTGAAGGAGGCAGGTGGGAC | TGGCCAAGTAATCATGCTAACTAGAGCTCTAGCTTCCTCTCCCTGGTTTCGCTTTTCTAAGTATAAAT |
: CATCTCAGACATGGGCTGTGTTGGGACCCCCAG                                   : AAATAAATAAATGATAAATTGCCTTATTTTTT                                     :
...

| AATATAAATTAAAATCTCAGGCCTCCATCCCCTATCTACTGAATCAGAATCTGCATTTTAACAGATC | CGTAGTGAGACCCCATCTCCAAAAAATAAATAAATTAAGGACAGGATGGGCATGGACCACTCTTCAAG |
: TTTTGTGGTTCATATACAGTTGAGTTCTTTTTT                                   : ATTACTGACCAGCTTCTTGATATTGGACAACT                                     :
| CTTTTCTATATAAAACAAATAATTTATAATACTGTATTTGTTTTCTATTACTATTGAAATTACTACA | TGCCTTTTCCAGCTGATAGAGACAGCCATGTTTCATGGCTTGTAGCCCCTTCCTCCATCTTCAAAGCC |
: AATGTAGTAGCTAAAACAACACATATTTTTCAT                                   : AACAACAACTGGTCAAGTTGTCACATTGCATC                                     :
+---------------------------------------------------------------------+----------------------------------------------------------------------+
21 tuples (2.831ms)

Q9. Find the primary alignment pairs whose left and right reads have different mapping qualities:

sql> SELECT qname FROM bam.paired_primary_alignments_3 WHERE l_mapq <> r_mapq;
+-------+
| qname |
+=======+
+-------+
0 tuples (2.736ms)

Q10. Compute the distances of the paired primary alignments

sql> SELECT qname, r_pos - (l_pos + bam.seq_length(l_cigar)) AS dist FROM bam.paired_primary_alignments_4 WHERE l_pos < r_pos;
+-------------+------+
| qname       | dist |
+=============+======+
| sim_22_1_2  |   91 |
| sim_22_1_3  |   93 |
| sim_22_1_4  |  135 |
| sim_22_1_5  |  108 |
| sim_22_1_6  |  108 |
| sim_22_1_10 |  108 |
| sim_22_1_14 |  153 |
| sim_22_1_16 |  108 |
| sim_22_1_a  |  140 |
| sim_22_1_b  |  112 |
| sim_22_1_d  |   93 |
| sim_22_1_f  |  108 |
+-------------+------+
12 tuples (8.376ms)

Q11. Create a histogram of the distances of the paired primary alignments

sql> SELECT r_pos - (l_pos + bam.seq_length(l_cigar)) AS dist, COUNT(*) AS cnt 
FROM bam.paired_primary_alignments_4 WHERE l_pos < r_pos GROUP BY dist;
+------+------+
| dist | cnt  |
+======+======+
|   91 |    1 |
|   93 |    2 |
|  135 |    1 |
|  108 |    5 |
|  153 |    1 |
|  140 |    1 |
|  112 |    1 |
+------+------+
7 tuples (6.406ms)