DataFungi, from Rotting Data to Purified Information

Martin Kersten presented his idea for coping with ever-increasing big data growth in a keynote at the 32nd IEEE International Conference on Data Engineering, held May 16-20, 2016, in Helsinki, Finland.

To truly understand the complexities of global warming, gene diversity and the universe, the sciences increasingly rely on database technology to digest and manipulate large amounts of raw facts. This requires the daily injection of terabytes of data, synergy with existing large-scale science repositories, efficient support for the preferred computational paradigms (e.g. SQL, Python, R), and linked-in libraries (e.g. BLAS, GSL, NumPy). However, the staggering amount of science data collected creates a challenge that cannot be met by merely buying more hardware or growing elastically in the cloud. Scientific database management systems should change fundamentally.

To cope with big data growth, a DBMS should provide a data freshness decay model that keeps it functioning properly within the given bounds on storage, processing and responsiveness. After injection into a database, data should become subject to a data rotting process, or be distilled into meaningful half-products by a purification process. Data rotting uses data-agnostic techniques to remove data from the repository when the sustainability of the system is at risk. The countermeasure is to exploit domain knowledge to enable data purification, i.e. to replace raw data with sound statistical micro-models that reduce its resource claims. The project challenges the data durability axiom underlying all database systems. Instead, I propose that a DBMS may selectively forget raw data on its own initiative, ideally by harvesting micro-models and forgetting noisy facts. The experimental context is provided by emerging in-memory database technology, which provides a significant improvement over disk-based approaches.
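The two processes described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the names `rot`, `purify` and `MicroModel` are hypothetical, oldest-first eviction stands in for whatever data-agnostic decay policy a real system would use, and a count/mean/standard-deviation summary stands in for a domain-specific micro-model.

```python
import statistics
from dataclasses import dataclass


@dataclass
class MicroModel:
    """Hypothetical summary kept in place of the raw values."""
    count: int
    mean: float
    stdev: float


def rot(rows, budget):
    """Data rotting (assumed policy): keep only the `budget` freshest rows.

    Each row is a (timestamp, value) pair. Only the timestamp is inspected,
    so the eviction is agnostic to what the values mean.
    """
    if budget <= 0:
        return []
    return sorted(rows, key=lambda row: row[0])[-budget:]


def purify(values):
    """Data purification (assumed model): summarise raw values statistically."""
    return MicroModel(
        count=len(values),
        mean=statistics.fmean(values),
        stdev=statistics.pstdev(values),
    )


# Ten raw observations, stored as (timestamp, value) pairs.
rows = [(t, float(t % 5)) for t in range(10)]

# Harvest a micro-model first, then let the raw data rot down to the budget.
model = purify([value for _, value in rows])
rows = rot(rows, budget=4)

print(model)
print(rows)
```

In this sketch, aggregate queries such as a mean can still be answered from the micro-model after the raw rows are gone, which is the trade-off the text proposes: less storage in exchange for statistical rather than exact answers.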

If successful, the research will significantly bring down the resource requirements of scientific databases, provide fast and robust statistical query responses, and harvest usage patterns by identifying the laws of data. It creates a new substrate for data-driven scientific discoveries.