Difference between revisions of "Astronomy: Bulk Source Association"

From MonetDB
Jump to navigationJump to search
Line 102: Line 102:
 
   SELECT idc, ids, dist FROM k3m_query((SELECT id, ra, decl, 0.01745329 FROM sourcelist));
 
   SELECT idc, ids, dist FROM k3m_query((SELECT id, ra, decl, 0.01745329 FROM sourcelist));
  
 +
for getting the matches between sources from the list and the catalog. And
  
for getting the matches between sources from the list and the catalog.
+
  SELECT * FROM k3m_free();
 +
 
 +
to release the memory again.

Revision as of 11:19, 18 January 2016

Introduction

In the near future several optical and radio telescopes will produce large field-of-view images at second to minute cadence. One of their common and main goals is to find transient and variable events on these short time scales. As a consequence of the high resolution and large images the number of sources is larger than ever before and may peak up to 300,000 sources per image. Analysis of time series data, called light curves in astronomy, of all the sources is essential to find new, varying and patterns in the source properties. Constructing these light curves in near real time requires fast cross matching of source lists with potential counterparts in known catalogues, having about 500 million to 1 billion sources.

It is this last part, the bulk association, where you need to cross match 300,000 sources with 1 billion sources, that is the most challenging query from a database point of view. We want to keep the processing time as short as possible, because above this we need more queries to run and the next image comes in pretty soon, and then the next and the next... A typical bulk association time should be below the 10% of the overall time to process the image.

The challenge is to have a quick as possible cross-match algorithm that can be executed from within the database engine. This might be by using plane, but smart, SQL or by implementing external C functions as UDFs in SQL.

Approaches

Solutions exist, both implemented in C and SQL, whereas the latter works for MonetDB using the zone algorithm and for PostgreSQL using GiST indexing. However, the C function, using kdtree indexing is roughly an order of magnitude faster, when we do not take into account the time to build of tree.

SQL

MonetDB

Or any other database

 DECLARE iradius, isint2 DOUBLE;
 SET iradius = CAST(0.5 AS DOUBLE)/3600; /* [degrees] */
 SET isint2 = 4 * SIN(RADIANS(0.5 * iradius)) * SIN(RADIANS(0.5 * iradius));
SELECT runcatid
      ,xtrsrcid
      ,3600*DEGREES(2*ASIN(SQRT(dist)/2)) AS dist_arcsec
  FROM (SELECT z0.id AS runcatid
              ,t0.id AS xtrsrcid
              ,  (z0.x - t0.x) * (z0.x - t0.x)
               + (z0.y - t0.y) * (z0.y - t0.y)
               + (z0.z - t0.z) * (z0.z - t0.z)
               AS dist
          FROM rc_zone z0 
              ,(SELECT id
                      ,dec_deg - iradius AS decmin
                      ,dec_deg + iradius AS decmax
                      ,ra_deg - alpha(dec_deg, iradius) AS ramin
                      ,ra_deg + alpha(dec_deg, iradius) AS ramax
                      ,x
                      ,y
                      ,z
                  FROM xtrsrc_548 
               ) t0
         WHERE z0."dec" BETWEEN t0.decmin AND t0.decmax
           AND z0.ra BETWEEN t0.ramin AND t0.ramax
       ) t1
  WHERE t1.dist < isint2
;

PostgreSQL (GiST)

Within PostgreSQL we can explore GiST indexing.

 SET work_mem TO '2GB';
 
 # 6371008.771415059454739093780517578125 / 206264.80624709636
 # need to divide our radius by the WGS84 radius as the last radius is used 
 # in the _ST_Expand function
 CREATE OR REPLACE FUNCTION M_DWithin(geography, geography, float8, boolean)
 RETURNS boolean
 AS 'SELECT $1 && _ST_Expand($2, $3 * 30.887522148508772) AND _ST_DWithin($1, $2, $3, $4)'
 LANGUAGE 'sql' IMMUTABLE
 ;
 CREATE TABLE new_match AS 
 SELECT DISTINCT ON (newsrc.id)  
        runcat.id as runcat_id
       ,newsrc.id as xtrsrc_id
       ,ST_DISTANCE(runcat.location, newsrc.location) as distance
   FROM runcat  
        RIGHT OUTER JOIN (SELECT * 
                            FROM xtrsrc 
                           WHERE xtrsrc.id < {0}
                         ) as newsrc 
        ON M_DWithin(runcat.location, newsrc.location, 1, false)
 ORDER BY newsrc.id, distance
 ;

External C as UDF for MonetDB/SQL

Pim Schellart developed K3Match, a C library (with Python bindings) for fast matching of points in 3D space. It uses 3-dimensional binary trees to find matches between large datasets in O(N log N) time.

The core of K3Match can be used within the MonetDB engine to build the k3-tree against which sources can be cross-matched.

Solution

Something like:

 SELECT * FROM k3m_build((SELECT id, ra, decl FROM catalog AS s));

for building a persistent tree in database memory. And something like

 SELECT idc, ids, dist FROM k3m_query((SELECT id, ra, decl, 0.01745329 FROM sourcelist));

for getting the matches between sources from the list and the catalog. And

 SELECT * FROM k3m_free();

to release the memory again.