Exploratory Data Science over Raw Data
Problems and motivation to carry out the R & D project
Machine learning (ML) applications based on large data are increasingly applied in enterprises to improve value chains and gain competitive advantages. In contrast to traditional ML, however, the objectives are under-specified, allow for different types of analysis, and can leverage a wide variety of heterogeneous, distributed, and partially inaccessible data sources. Therefore, the typical industry data science process is exploratory. Data scientists investigate hypotheses, integrate the necessary data, run different analytics, and search for interesting patterns and models. Since the added value is unknown in advance, very little effort is invested in the systematic acquisition, integration, and preprocessing of the data. This lack of infrastructure results in redundant manual steps and inefficient computation. Furthermore, central consolidation is not always technically or economically desired or possible (e.g., for sensitive personal data). These scenarios share the need for federated execution and dedicated elimination of redundancy.
Innovation content compared to the state of the art / state of knowledge
The main idea of ExDRa is to investigate suitable system support for the exploratory data science process over heterogeneous and distributed raw data sources, showcased in a demonstrator for practical applications. In detail, this approach entails the following research aspects: (1) ad-hoc and federated data integration over raw data, (2) data organization and reuse of intermediates, (3) horizontal optimization over the entire data science life-cycle, and (4) query planning for partially accessible data. Use cases from the process industry will be provided by Siemens AG. In this context, there are large amounts of data, distributed over locations and appliances, whose consolidation is technically, economically, and legally limited.
Desired results and findings
The overall goal leads to four research goals:
- Data integration, data processing, and analysis over raw data need to be enabled via a suitable declarative specification of data sources and preprocessing steps, as well as efficient primitives for local and federated computation. In the context of exploratory data science, this requires sampling and incremental maintenance.
- Unnecessary redundancy and inefficiency of repeated computations need to be addressed via dedicated techniques for data organization and reuse. The high communication overhead of federated analysis could be further reduced by leveraging compression techniques and exploiting the performance-accuracy trade-off.
- We aim to improve the understanding of exploratory analysis results and simplify future analyses via systematic model management and optimization of experiments.
- Federated computation is an essential part of exploratory analysis over raw data. Accordingly, we intend to investigate system architectures, as well as query optimization and processing. To provide evidence of practical relevance, all results will be integrated and evaluated as part of a demonstrator software.
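
To make the notion of federated computation more concrete, the following is a minimal sketch, not the project's actual design. It assumes a simple setting in which each site holds (x, y) pairs that must not leave the site, and a coordinator fits a one-feature least-squares line from per-site aggregates only. All names (`LocalStats`, `local_stats`, `federated_fit`) and the sample data are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LocalStats:
    """Aggregates a site shares instead of its raw rows (hypothetical type)."""
    n: int
    sx: float
    sy: float
    sxx: float
    sxy: float


def local_stats(rows: Iterable[Tuple[float, float]]) -> LocalStats:
    """Runs at a site: reduce raw (x, y) rows to five sufficient statistics."""
    n, sx, sy, sxx, sxy = 0, 0.0, 0.0, 0.0, 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return LocalStats(n, sx, sy, sxx, sxy)


def federated_fit(stats: List[LocalStats]) -> Tuple[float, float]:
    """Runs at the coordinator: combine site aggregates into slope and intercept."""
    n = sum(s.n for s in stats)
    sx = sum(s.sx for s in stats)
    sy = sum(s.sy for s in stats)
    sxx = sum(s.sxx for s in stats)
    sxy = sum(s.sxy for s in stats)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values are constant or no data was provided")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two hypothetical sites whose raw rows stay local.
site_a = [(0.0, 1.0), (1.0, 3.0)]
site_b = [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

slope, intercept = federated_fit([local_stats(site_a), local_stats(site_b)])
print(slope, intercept)
```

Because the five sums are additive across sites, the federated result equals the fit over the centrally consolidated data, while only a constant-size summary per site is communicated. Real deployments would additionally have to consider whether such aggregates themselves leak sensitive information.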