Effective Labeled Data Generation via Generative Adversarial Learning

Suhang Wang
Dongwon Lee

Sponsoring Agency
National Science Foundation


Recent successes in applying deep learning to many challenging data science problems are due in part to the availability of large-scale labeled training data. However, creating large-scale labeled datasets is time-consuming, labor-intensive, costly, and often requires significant domain knowledge. Many real-world applications, therefore, come with limited label information (i.e., a small amount of labeled data or no labeled data at all). The lack of labeled training data thus remains one of the major roadblocks in applying deep learning techniques to challenging data science problems. On the other hand, recent advancements in generative adversarial learning have shown promising results in generating realistic data, which could offer a new perspective on alleviating the shortage of labeled training data. This project therefore explores effective labeled data generation via generative adversarial learning. The proposed research extends state-of-the-art labeled data generation and generative adversarial learning to a new frontier, investigates original problems that call for innovative solutions, and paves the way for a new research endeavor to effectively tame synthetic labeled data generation. As many real-world problems face the challenge of limited labeled data, the project has the potential to benefit applications in various disciplines such as Computer Science, Education, Politics, Healthcare, and Bioinformatics.

This project proposes novel approaches based on generative adversarial learning for effective labeled data generation to facilitate deep learning with limited label information, investigates the associated fundamental research issues, and develops effective algorithms. It has three primary research objectives. First, when a small amount of labeled data is available, it explores estimating the underlying data distribution from unlabeled data and incorporating the label information for labeled data generation, including scenarios with extremely imbalanced data and incomplete labels. Second, when labeled data is not available, it adopts alternative weak supervision (e.g., inaccurate labels, inexact labels, and pairwise constraints) for generating labeled data. Third, when neither labeled data nor weak supervision is available, it explores integrating human involvement into generative adversarial learning to provide supervision. Various means are planned to disseminate the project and its findings, such as web-enabled data and software repositories, books, journal and conference publications, special-purpose workshops or tutorials, and industrial collaborations. The project can be effectively integrated into undergraduate and graduate courses as well as student research projects.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Research Area
Artificial Intelligence and Big Data