Genome sequence alignment data processing

To deal with the deluge of data produced by modern sequencers, life science researchers are turning to analytical databases. To help with this (really) big data problem, the MonetDB team has been working together with the CWI Life Sciences group [2]. The result of this collaboration is the MonetDB/BAM module, a bioinformatics extension designed to easily load and quickly process genomic data. The authors have also tested the new module, publishing a case study on Ebola virus diversity [1]. The study shows that the improved performance and reduced complexity of this new tool can be put to good use when scientists are looking for quick answers to pressing questions.

Sequence alignment data

One of the research fields that deals with very large volumes of data is life science, as scientists now rely on so-called next-generation sequencing (NGS). This more advanced technology generates vast volumes of data, produced from sequencing DNA, RNA and protein molecules. Without efficient tools, researchers face a very time-consuming and/or error-prone process. And while there are tools that provide basic functionality for data processing, demands for more complex data exploration and analysis mean that researchers must write their own custom programs or scripts. Even then, such tools do not solve the problem of efficiently storing and querying the large amount of data produced by NGS sequencers.

Sequence alignment data itself often comes in two formats: SAM or BAM. SAM stands for Sequence Alignment/Map and is a TAB-delimited text format consisting of an optional header and an alignment section holding the actual sequenced data [3]. BAM is the compressed binary equivalent of SAM. While the alignment data for some organisms can be small (megabytes in size), the data of a single human genome runs to hundreds of gigabytes, even when stored compressed. When processing a large collection of genomes, the data volume reaches terabyte scale.
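To make the SAM layout described above concrete, here is a minimal Python sketch that separates the optional header (lines starting with "@") from the alignment section and splits each alignment line into the eleven mandatory TAB-delimited fields named in the SAM specification [3]. The Alignment class, the parse_sam function and the sample records are illustrative assumptions, not part of MonetDB or SAMtools, and optional tag fields after the eleventh column are simply ignored.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Alignment:
    """The eleven mandatory SAM fields (hypothetical helper class)."""
    qname: str   # query (read) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # segment sequence
    qual: str    # base qualities (ASCII-encoded)


def parse_sam(lines: Iterable[str]) -> Tuple[List[str], List[Alignment]]:
    """Split SAM text into header lines and parsed alignment records."""
    header: List[str] = []
    records: List[Alignment] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header lines always begin with '@'; alignment lines never do.
            header.append(line)
            continue
        f = line.split("\t")
        records.append(Alignment(
            f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
            f[6], int(f[7]), int(f[8]), f[9], f[10],
        ))
    return header, records


# Made-up sample data, for illustration only.
SAMPLE = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ref1\tLN:45",
    "read1\t0\tref1\t7\t30\t8M\t*\t0\t0\tTTAGATAA\t*",
    "read2\t16\tref1\t9\t30\t6M\t*\t0\t0\tAGCTAA\t*",
]

header, records = parse_sam(SAMPLE)
for rec in records:
    print(rec.qname, rec.rname, rec.pos, rec.cigar)
```

A line-by-line parser like this is fine for megabyte-sized files, but it illustrates why ad hoc scripts struggle at the hundreds-of-gigabytes scale mentioned above.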

Terabyte-scale data processing

Although Hadoop-based solutions have gained considerable interest in big data processing, they are not the best candidates for sequence alignment data analysis. Hadoop systems can be extremely fast in executing simple queries on large volumes of data, but this is somewhat comparable to a distributed version of the Linux grep (a plain-text search utility). The performance of such systems quickly degrades when processing complex analytical queries, especially those involving aggregations and joins. A modern DBMS, optimized for storing data and quickly processing complex queries, can lend a hand to life science researchers by storing the sequence alignment data in a database and analyzing it with either generic (SQL) or domain-specific functions. Such an approach exploits the advances in data storage and processing in analytical/scientific database systems, allowing improved processing time and reduced complexity, and helping life scientists cope with the deluge of data.

The MonetDB team has been collaborating with life science researchers in developing a MonetDB module for bioinformatics. This work has already been put to good use, analyzing the diversity of Ebola virus genomes [1]. The MonetDB/BAM module currently adds support for:

  • Importing and exporting data to/from SAM/BAM format
  • Dedicated SQL functions for DNA sequence alignment processing
  • A query output SAM formatter
  • Automatic query results forwarding to IGV, the Integrative Genomics Viewer [5]

The SAM/BAM data can be imported with a single command. This allows querying the stored sequence alignment data with either regular SQL or dedicated functions, ported from SAMtools. While the current built-in functions are mostly DNA-specific, there are plans to add more of the native SAMtools functions. In addition, any compatible visualisation or processing tool can be connected to the database using JDBC/ODBC. One can even build a small bioinformatics-specific web app with Node.js for data exploration or visualisation. The MonetDB-R integration creates even more possibilities for advanced processing of alignment data, directly in the database, using the extended functionality delivered by R.
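The following Python sketch illustrates the general idea of combining regular SQL with a domain-specific function over alignment data. It is only a stand-in: it uses Python's built-in sqlite3 module with an in-memory database rather than MonetDB, and the table layout, the flag_is_set function and the sample rows are assumptions made for illustration, not the actual MonetDB/BAM schema or function names. The flag masks are taken from the SAM specification [3]: 4 marks an unmapped read and 16 marks a read aligned to the reverse strand.

```python
import sqlite3


def flag_is_set(flag: int, mask: int) -> int:
    """Return 1 if any bit of `mask` is set in the SAM FLAG, else 0.

    Hypothetical helper, registered below as a SQL function.
    """
    return int((flag & mask) != 0)


# In-memory SQLite database used purely as a stand-in for an analytical DBMS.
conn = sqlite3.connect(":memory:")
conn.create_function("flag_is_set", 2, flag_is_set)

# Assumed, simplified table layout (not the MonetDB/BAM schema).
conn.execute(
    "CREATE TABLE alignments ("
    "qname TEXT, flag INTEGER, rname TEXT, pos INTEGER, mapq INTEGER)"
)

# Made-up sample rows, for illustration only.
rows = [
    ("read1", 0, "ref1", 7, 30),    # mapped, forward strand
    ("read2", 16, "ref1", 9, 20),   # mapped, reverse strand
    ("read3", 4, "*", 0, 0),        # unmapped
    ("read4", 16, "ref2", 3, 40),   # mapped, reverse strand
]
conn.executemany("INSERT INTO alignments VALUES (?, ?, ?, ?, ?)", rows)

# Aggregation per reference: read count, mean mapping quality, and the
# number of reverse-strand reads, skipping unmapped reads.
query = """
SELECT rname,
       COUNT(*)                    AS n_reads,
       AVG(mapq)                   AS avg_mapq,
       SUM(flag_is_set(flag, 16))  AS n_reverse
FROM alignments
WHERE flag_is_set(flag, 4) = 0
GROUP BY rname
ORDER BY rname
"""
summary = conn.execute(query).fetchall()
for rname, n_reads, avg_mapq, n_reverse in summary:
    print(rname, n_reads, avg_mapq, n_reverse)
```

The point of the sketch is the shape of the query: filtering, grouping and aggregating are expressed declaratively in a few lines of SQL, while the format-specific logic lives in a small reusable function.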

For more information (including detailed how-to instructions) on the BAM/SAM module, go to the Life Science section on the website. To see the module in action, check out the video on YouTube [4]. For full details, see the paper "Genome sequence analysis with MonetDB: a case study on Ebola virus diversity" [1].

[1] Genome sequence analysis with MonetDB: a case study on Ebola virus diversity

[2] Life Sciences, CWI Amsterdam

[3] Sequence Alignment/Map Format Specification

[4] MonetDB BAMloader demo

[5] IGV, the Integrative Genomics Viewer