Predictive Modeling from High-Dimensional, Sparsely and Irregularly Sampled, Longitudinal Data


Sponsoring Agency
National Science Foundation


Longitudinal data resulting from repeated observations of a set of individuals over time are commonplace in many applications, including health sciences, learning sciences, social sciences, life sciences, and economics. Such data present unprecedented opportunities to uncover the relationship between the time-varying patterns of certain measured variables (features or covariates) and outcomes of interest, e.g., economic meltdown, societal unrest, disease onset, or health risk. In real-world settings, the number of variables is often very large, and only a small subset of variables is typically recorded at any given time, resulting in sparse data with a high proportion of missing observations. Furthermore, such data exhibit complex correlations which, if not properly accounted for, can lead to misleading statistical inferences. Additional complications arise from the fact that the data exhibit abrupt discontinuities that are often driven by transitions between states that are not directly observable (e.g., from "healthy" to "infected"). The large size of the data sets demands methods that are scalable. In high-stakes applications such as healthcare, human interpretability of the predictive models is of paramount importance. The project will yield substantial advances over the current state of the art in scalable machine learning methods for predictive modeling of longitudinal outcomes from high-dimensional, irregularly sampled, sparse, longitudinal health data. The open-source implementations of the predictive modeling tools will find applications in many domains, including behavioral, social, environmental, economic, learning, and health sciences. The project will enhance the research-based training of a diverse group of graduate and undergraduate students in Data Sciences and Computer Science (especially Artificial Intelligence), areas of great national importance.
The educational activities associated with the project will help equip a diverse cadre of Data Scientists, AI experts, and researchers in health sciences, social sciences, learning sciences, and related areas with state-of-the-art machine learning tools for predictive modeling from longitudinal data. The project will produce a new graduate course, course modules, and sample projects on predictive modeling from longitudinal data, to be integrated into Data Sciences curricula.

The project will help introduce students from diverse backgrounds, including women and underrepresented minorities, to a broad range of educational, research, and career opportunities in Data Sciences. The broader impacts of the project will be further enhanced by broad dissemination of all research results (publications, software, data sets, course materials).

The project will develop a family of scalable deep kernel Gaussian process regression algorithms for interpretable predictive modeling from high-dimensional, sparsely and irregularly sampled, longitudinal data with complex, a priori unknown correlation structure. The resulting methods will be able to discover the patterns of transitions between unobserved or hidden states and account for abrupt discontinuities in outcomes. They will be able to explain their predictions by learning the underlying complex correlation structure exhibited by the data and by identifying not only the variables that drive the predictions, but also the temporal context in which they do so. The project will rigorously evaluate the resulting methods empirically with simulated longitudinal data (with different correlation structures, different missingness mechanisms, and different time-dependent variable importance), several benchmark longitudinal data sets, and, most importantly, de-identified longitudinal electronic health records data and socio-demographic data from real-world healthcare applications (in collaboration with clinical experts).
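
As a point of reference for the Gaussian process regression setting described above, the following is a minimal sketch in Python with NumPy of ordinary Gaussian process regression on a single irregularly sampled series. It is not the project's deep kernel method: it uses a fixed squared-exponential kernel, whereas a deep kernel approach would typically pass the inputs through a learned neural-network feature map before applying the kernel. The function names (rbf_kernel, gp_posterior), the hyperparameter values, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(t1, t2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of time points."""
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(t_obs, y_obs, t_new, noise=0.1, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at the times t_new.

    Hypothetical helper: hyperparameters are fixed rather than learned.
    """
    # Covariance of the noisy observations.
    K = rbf_kernel(t_obs, t_obs, length_scale) + noise**2 * np.eye(len(t_obs))
    # Cross-covariance between observed and new time points.
    K_s = rbf_kernel(t_obs, t_new, length_scale)
    # Prior covariance at the new time points.
    K_ss = rbf_kernel(t_new, t_new, length_scale)

    # Solve K @ alpha = y_obs via a Cholesky factorization for stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))

    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var


# Synthetic example (assumed data): 15 irregularly spaced, noisy samples.
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 10.0, size=15))
y_obs = np.sin(t_obs) + 0.1 * rng.standard_normal(t_obs.size)

# Predict on a regular grid, regardless of when samples were taken.
t_new = np.linspace(0.0, 10.0, 50)
mean, var = gp_posterior(t_obs, y_obs, t_new)
```

In this sketch the posterior variance grows in the gaps between observations, which is how a Gaussian process expresses uncertainty when sampling is sparse and irregular.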