Loading SAM/BAM data

Loading SAM/BAM data zhang Tue, 07/08/2014 - 18:06

To demonstrate the MonetDB SAM/BAM library, we prepared four sample input data files: file1.bam and file2.sam are unsorted; while file1-sorted.bam and file2-sorted.sam are sorted by the QNAME fields of the alignment records. They can also be found in the subdirectory sql/backends/monet5/bam/Tests/files of your MonetDB source directory. Assume these files are saved on your computer in directory "/tmp". There are several ways to load the SAM/BAM files into MonetDB.

Load individual files

The SQL function bam_loader_file(bam_file STRING, dbschema SMALLINT) allows loading one SAM/BAM file at a time:

sql> CALL bam.bam_loader_file('/tmp/file1.bam', 0);
sql> CALL bam.bam_loader_file('/tmp/file2.sam', 0);
sql> CALL bam.bam_loader_file('/tmp/file1-sorted.bam', 1);
sql> CALL bam.bam_loader_file('/tmp/file2-sorted.sam', 1);

The first argument is the absolute path to the SAM/BAM file. The second argument is either 0 for the sequential schema, or 1 for the pairwise schema (see the definition of both storage schemas in SAM/BAM storage). Note that to use the pairwise schema, the SAM/BAM files must be sorted by the QNAME field.

To check which SAM/BAM files have been loaded into the database, you can consult the database catalogue table "bam.files":

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)

Load SAM/BAM repositories

The SQL function bam_loader_repos(bam_repos STRING, dbschema SMALLINT, nr_threads SMALLINT) allows loading a repository of SAM/BAM files in the given directory:

sql> CALL bam.bam_loader_repos('/tmp', 0, 4);

The first argument is the absolute path to the SAM/BAM repository (note that the loading function does not search in the subdirectory for SAM/BAM files). The second argument is either 0 for the sequential schema, or 1 for the pairwise schema (see SAM/BAM storage). The third argument is the number of threads to use during the loading process. Note: the last argument is removed in the bamloader branch of MonetDB, since a user on SQL level should not have to deal with such a low-level detail. This change will be merged into the next feature release of MonetDB. 

Load list of SAM/BAM files

The SQL function bam_loader_files(bam_files STRING, dbschema SMALLINT, nr_threads SMALLINT) loads all SAM/BAM files listed in the given file:

sql> CALL bam.bam_loader_files('/tmp/files.txt', 0, 4);

The first argument is the absolute path of the file which contains a list of absolute paths to the SAM/BAM files to load:

$ cat /tmp/files.txt
/tmp/file1.bam
/tmp/file2.sam

The second argument is either 0 for the sequential schema, or 1 for pairwise schema (see SAM/BAM storage). The third argument is the number of threads to use during the loading process. Note: the last argument is removed in the bamloader branch of MonetDB, since a user on SQL level should not have to deal with such a low-level detail. This change will be merged into the next feature release of MonetDB. 

Remove files

The SQL function bam_drop_file(file_id BIGINT, dbschema SMALLINT) allows removing all data contained in a SAM/BAM file from the database:

sql> SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       2 | /tmp/file2.sam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
4 tuples (3.425ms)
sql>CALL bam.bam_drop_file(2, 0);

The first argument is the file_id of the file that will be removed. The second argument is the dbschema of this file, which can be either zero or one for the sequential or the pairwise schema respectively.

sql>SELECT * FROM bam.files;
+---------+-----------------------+----------+----------------+---------------+----------+
| file_id | file_location         | dbschema | format_version | sorting_order | comments |
+=========+=======================+==========+================+===============+==========+
|       1 | /tmp/file1.bam        |        0 | 1.0            | unsorted      | null     |
|       3 | /tmp/file1-sorted.bam |        1 | 1.0            | queryname     | null     |
|       4 | /tmp/file2-sorted.sam |        1 | 1.0            | queryname     | null     |
+---------+-----------------------+----------+----------------+---------------+----------+
3 tuples (5.334ms)

 

Known issues

There exists a loading issue in the Oct2014 release of MonetDB. When loading a list or a repository of files, using respectively the SQL commands bam_loader_files and bam_loader_repos, too many file descriptors will be opened. This might cause the loading process to fail with the error message that the bam_wrapper code could not open a certain file. This issue has been fixed in the bamloader branch and will be merged into the next feature release of MonetDB.