MonetDB Data Vaults (DV) are database-attached external files or file repositories1. Based on a concept similar to SQL/MED, they can be used for managing external data. One of the primary purposes of a data vault is to allow faster and easier processing of data stored in large file repositories.

While most of the DV extensions rely on the still experimental MonetDB SciQL - an array database extension and query language2, a few are available in "plain" SQL. This includes Data Vaults for mSEED (seismology), SAM/BAM (bioinformatics/genetics) and FITS (astronomy) formats. We'll illustrate the latter in more detail here.

Modern astronomical instruments generate a large number of Flexible Image Transport System (FITS) files, an open-standard file format for scientific images3. The FITS repositories are at a later stage used in science pipelines for detailed content-based analysis. Data Vaults provide a simple solution to one of the main challenges faced by astronomers using the FITS repositories, i.e. easy access to their content for modern database and statistical processing. The FITS content is made accessible in a two-step process:

  1. Attach a file (or file repository) as a Data Vault.
    • At this point, MonetDB will access the metadata, if any, and load it for each of reference in a domain specific database catalog.
  2. Selectively load the data from the attached objects.
    • The data is parsed and loaded into persistent tables in MonetDB. Often at limited cost due to a binary copy into scheme.
    • Alternatively, data can be dynamically loaded - only when a query requires it. This means that no processing time is spent on data, that is never queried.

Here's a small example how astronomers working with the Isaac Newton Telescope (INT) working on the The INT Photometric Hα Survey of the Northern Galactic Plane (IPHAS) use MonetDB DV and FITS to process imaging data of the Northern Milky Way in visible light4. Each FITS file represents an observation on only a very small portion of the sky, but since IPHAS surveys the galactic center, each image has a very high number of light sources, hence a lot of data.

create procedure listdir(dirname string) external name fits.listdir;
create procedure fitsattach(fname string) external name fits.attach;
create procedure fitsload(tname string) external name fits.load;
create procedure listdirpat(dirname string,pat string) external name fits.listdirpattern;

Then, attach a FITS file or directory with FITS file accessible on the server where the MonetDB server running:

call fitsattach('/bigdata/iphas/r564532_cast.fits');

As a result, a small catalog of FITS metadata becomes available for inspection as a Data Vault. Its content can be queried as follows:

SELECT * FROM fits_files;
SELECT * FROM fits_tables;
SELECT * FROM fits_columns;

Amongst others it will show all FITS objects, which can be selectively loaded like this:

call fitsload('r564532_cat_1');
call fitsload('r564532_cat_2');
call fitsload('r564532_cat_3');
call fitsload('r564532_cat_4');

At this point, the data from each extension is fully available for content-based querying.

 SELECT * FROM "r564532_cat_1";

To complete the sky survey5, astronomers have to associated light sources obtained from new observations (the data just loaded in the database) with existing light sources. The results are merged into a single big table for improved ease of use and allowing for a more advanced data exploration or mining. With the MonetDB/R integration enabled, data processing powered by R can be done directly in the database.

The plot below illustrates the DV performance when loading FITS files with varying source densities.

IPHAS DV num sources