Astronomical data analysis with MonetDB Data Vaults

MonetDB Data Vaults (DV) are database-attached external files or file repositories [1]. Based on a concept similar to SQL/MED, they can be used for managing external data. One of the primary purposes of a data vault is to allow faster and easier processing of data stored in large file repositories.

While most of the DV extensions rely on the still experimental MonetDB SciQL - an array database extension and query language [2], a few are available in “plain” SQL. This includes Data Vaults for mSEED (seismology), SAM/BAM bioinformatics/genetics) and FITS (astronomy) formats. We’ll illustrate the latter in more detail here.

Modern astronomical instruments generate a large number of Flexible Image Transport System FITS) files, an open-standard file format for scientific images [3]. The FITS repositories are at a later stage used in science pipelines for detailed content-based analysis. Data Vaults provide a simple solution to one of the main challenges faced by astronomers using the FITS repositories, i.e. easy access to their content for modern database and statistical processing. The FITS content is made accessible in a two-step process:

  1. Attach a file (or file repository) as a Data Vault.
    • At this point, MonetDB will access the metadata, if any, and load it for each of reference in a domain specific database catalog.
  2. Selectively load the data from the attached objects.
    • The data is parsed and loaded into persistent tables in MonetDB. Often at limited cost due to a binary copy into scheme.
    • Alternatively, data can be dynamically loaded - only when a query requires it. This means that no processing time is spent on data, that is never queried.

Here’s a small example how astronomers working with the Isaac Newton Telescope (INT) working on the The INT Photometric Hα Survey of the Northern Galactic Plane (IPHAS) use MonetDB DV and FITS to process imaging data of the Northern Milky Way in visible light [4]. Each FITS file represents an observation on only a very small portion of the sky, but since IPHAS surveys the galactic center, each image has a very high number of light sources, hence a lot of data.

In MonetDB up to version Oct2014-SP4, the SQL functions for handling FITS data were not yet predefined in the system catalog. Thus, as a first step, these need to be defined once (*).

create procedure listdir (dirname string) external name fits.listdir;
create procedure fitsattach (fname string) external name fits.attach;
create procedure fitsload (tname string) external name fits.load;
create procedure listdirpat (dirname string,pat string) external name fits.listdirpattern;

Then, attach a FITS file or directory with FITS file accessible on the server where the MonetDB server running:

call fitsattach('/bigdata/iphas/r564532_cast.fits');

As a result, a small catalog of FITS metadata becomes available for inspection as a Data Vault. Its content can be queried as follows:

SELECT * FROM fits_files;
SELECT * FROM fits_tables;
SELECT * FROM fits_columns;

Amongst others it will show all FITS objects, which can be selectively loaded like this:

call fitsload('r564532_cat_1');
call fitsload('r564532_cat_2');
call fitsload('r564532_cat_3');

At this point, the data from each extension is fully available for content-based querying.

SELECT * FROM "r564532_cat_1";

To complete the sky survey [5], astronomers have to associated light sources obtained from new observations (the data just loaded in the database) with existing light sources. The results are merged into a single big table for improved ease of use and allowing for a more advanced data exploration or mining. With the MonetDB/R integration enabled, data processing powered by R can be done directly in the database.

The plot below illustrates the DV performance when loading FITS files with varying source densities.

[IPHAS-DV-num_sources.png)

(*) As of the next major release on MonetDB, FITS DVs will be better supported and creating the FITS SQL functions will not longer be required.

[1] Data vaults: a symbiosis between database technology and scientific file repositories

[2] SciQL, A Query Language for Science Applications

[3] Flexible Image Transport System

[4] IPHAS - The INT/WFC Hα Survey of the Northern Galactic Plane

[5] Astronomical survey