Changes in cells undergoing transformation from "normal" to malignant cells (e.g. during infections) happen on many biological levels, such as the genome, transcriptome, proteome and metabolome. Following the central dogma of molecular biology and its extensions, these levels are highly interconnected and depend on each other. Within the MODAL:MedLab, we develop new mathematical methods that allow (1) identification of multivariate disease signatures that describe changes in multiple data sources and (2) development of multi-level models that embed these findings into the actual biological context. Both parts combined will eventually lead to a thorough understanding of the modeled process and open up the opportunity to use the respective model for diagnostic purposes for individuals, thus allowing high-throughput classification of biological samples. These techniques can then be adjusted to an individual by using his or her -omics data, which makes it possible to derive information about the individual's state, for example as a diagnostic tool for a certain disease that is captured by the data and the model. All algorithms will be implemented using state-of-the-art software frameworks that can cope with the very large data volumes.
The MODAL AG (MAG) is a ZIB spin-off that acts as a bridge between research and industry. MAG offers the students in this project access to real-world data and to expertise from leading hospitals and companies working in this field. Within the MAG infrastructure, students will have the opportunity to experience the creation of industry-strength technology and software solutions.
Building on state-of-the-art database technology, students will develop new machine-learning techniques to analyze massive medical data sets. First, students will learn the biological foundations needed to complete the project successfully. They will then use data from a large clinical trial to model medical phenomena based on ideas from the areas of compressed sensing, machine learning, and network-of-networks theory.
Background: Tumor diseases rank among the most frequent causes of death in Western countries, while the underlying pathogenic mechanisms remain incompletely understood and individual treatment options are lacking. Hence, early diagnosis of the disease and early relapse monitoring are currently the best available options to improve patient survival. This calls for two things: (1) identification of disease-specific sets of biological signals that reliably indicate a disease outbreak (or status) in an individual, which we call fingerprints of a disease, and (2) development of new classification methods that allow robust identification of these fingerprints in an individual's biological sample. In this project we will use -omics data sources, such as proteomics or genomics. The advantage of -omics data over classical (e.g. blood) markers is that, for example, a proteomics data set contains a snapshot of almost all proteins that are currently active in an individual, as opposed to the roughly 30 values analyzed in a classical blood test. Thus, it contains orders of magnitude more potential information and describes the medical state of an individual much more precisely. However, to date there is no gold standard for reliably and reproducibly analyzing these huge data sets and finding robust fingerprints that could be used for the ultimate task: (early) diagnostics of cancer.
Problems and (some) hope: -omics data is ultra-high-dimensional and very noisy, but only sparsely filled with information. Biological -omics data (e.g. proteomics or genomics data) is typically very large (millions of dimensions), which significantly increases the complexity of algorithms for analyzing the parameter space or even makes them infeasible. At the same time, this data exhibits a very particular structure: it is highly sparse. Thus the information content of this data is much lower than its nominal dimension suggests, which is the prerequisite for any dimension reduction with small loss of information.
However, the sparsity structure of this data is highly complex: the large entries cluster, with amplitudes forming Gaussian-like shapes, and the noise affecting the signal is by no means Gaussian, contrary to what is customarily assumed. In addition, the locations of these clusters differ slightly from sample set to sample set, so they do not match standard patterns such as joint sparsity. This means that, although the data is highly sparse, both the sparsity structure and the noise distribution are non-standard. Yet specifically adapted automatic dimension reduction strategies such as compressed sensing, which work without cumbersome by-hand identification of significant values, have never been developed, for instance, for proteomics data. In our project, such a dimension reduction step will be a crucial ingredient and will precede the analysis of the parameter space, thereby enabling low-complexity algorithms.
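To make the idea of sparsity-driven recovery concrete, here is a minimal sketch of the textbook baseline from compressed sensing: iterative soft thresholding (ISTA) for the LASSO problem. It is not the method to be developed in the project; it assumes plain sparsity and a least-squares data term, which is exactly the standard setting that the clustered structure and non-Gaussian noise described above depart from. The function names `soft_threshold` and `ista` and all parameter values are illustrative assumptions.

```python
import math


def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries with |v_i| <= t become 0."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def ista(A, y, lam, step=None, iters=500):
    """Approximately minimize 0.5*||A x - y||^2 + lam*||x||_1 (the LASSO problem).

    A is a list of rows (m x n), y a list of m measurements.
    This is the standard textbook scheme, shown for illustration only.
    """
    m, n = len(A), len(A[0])
    if step is None:
        # Conservative step size: the squared Frobenius norm of A is an
        # upper bound on the Lipschitz constant of the gradient.
        step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n
    for _ in range(iters):
        # Gradient of the least-squares term: A^T (A x - y).
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by the proximal map of the L1 penalty.
        x = soft_threshold([x[j] - step * grad[j] for j in range(n)], step * lam)
    return x
```

With a measurement matrix that has fewer rows than columns, `ista(A, y, lam)` returns an estimate in which most entries are exactly zero. Larger values of `lam` give sparser estimates at the cost of more shrinkage in the surviving entries.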
The major challenge in these applications is to extract a set of features, as small as possible, that accurately classifies the learning examples.
The goal: In this project we aim to develop a new method that can be used to solve this task: the identification of minimal, yet robust fingerprints from very high-dimensional, noisy -omics data. Our method will be based on ideas from the areas of compressed sensing and machine learning.
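As a rough illustration of what "minimal fingerprint" can mean in practice, the sketch below fits a logistic regression classifier with an L1 penalty by proximal gradient descent and reports the features whose weights remain nonzero. This is a generic sparse classification baseline, not the method the project aims to develop. The toy data, the function names `l1_logistic` and `fingerprint`, and the values of `lam`, `step` and `iters` are all hypothetical choices made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def l1_logistic(X, y, lam=0.1, step=0.5, iters=200):
    """Fit logistic regression with an L1 penalty on the weights.

    Uses proximal gradient descent; the intercept is not penalized.
    Returns (weights, intercept).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Residual between predicted probability and label.
            r = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += r
            for j in range(d):
                grad_w[j] += r * xi[j]
        b -= step * grad_b / n
        for j in range(d):
            v = w[j] - step * grad_w[j] / n
            # Soft thresholding sets small weights exactly to zero.
            w[j] = math.copysign(max(abs(v) - step * lam, 0.0), v)
    return w, b


def fingerprint(w):
    """Indices of features that keep a nonzero weight."""
    return [j for j, wj in enumerate(w) if wj != 0.0]


# Hypothetical toy data: feature 0 carries the class label,
# features 1 and 2 are balanced and uninformative.
patterns = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
X = [[1.0, a, c] for a, c in patterns] + [[-1.0, a, c] for a, c in patterns]
y = [1] * 4 + [0] * 4

w, b = l1_logistic(X, y)
selected = fingerprint(w)
```

In this toy setting only the informative feature survives in `selected`. Real -omics data would require far more scalable solvers and, as discussed above, penalties and noise models adapted to its non-standard structure.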
The prospective participant should:
- have a background in mathematics, bioinformatics or computer science,
- have experience with a high-level programming language (e.g. C++ or Java) and a statistical software package such as SPSS or R,
- have attended classes in the area of data mining or acquired the foundations of this field by some other means,
- be prepared to work with very large datasets from industry partners (which involves preprocessing, e.g. to overcome inconsistencies and incompleteness).
- Ideally, he or she is familiar with the biological background, has already worked with biological data sets, and has experience working in a Linux/Unix environment and collaborating on source code (e.g. using revision control systems).