Training Computers and Humans to Detect Misinformation by Combining Computational and Theoretical Analysis

Dongwon Lee

Sponsoring Agency
National Science Foundation


Awareness of online misinformation is increasingly important, especially when information is presented as a news story, because (a) people may over-trust content that looks like news and fail to critically evaluate it, and (b) such stories can be easily spread, amplifying the effect of misinformation. Using machine learning methods to analyze a large database of articles labeled as more or less likely to contain misinformation, along with theoretical analyses from the fields of communication, psychology, and information science, the project team will first characterize what distinguishes stories that are likely to contain misinformation from others. These characteristics will be used to build a tool that highlights the features of a given article that are known to correlate with misinformation; they will also be used to develop training materials to help people make these judgments. The tool and training materials will be tested through a series of experiments in which articles are evaluated by the tool and by people both before and after undergoing training. The goal is to have a positive impact on online discourse by improving both readers' and moderators' ability to reduce the impact of misinformation campaigns. The team will make the models, tools, and training materials publicly available for others to use in research, in classes, and online.

The team will use two main approaches to characterize articles that are more likely to contain misinformation. The first is a concept explication approach from the social sciences based on a deep analysis of research writing around information dissemination and evaluation. The second is a supervised machine learning approach to be trained on large datasets of labeled articles, including verified examples of misinformation. Both approaches will consider characteristics of the content; of its visual presentation; of the people who create, consume, and share it; and of the networks it moves through. These models will be translated into a set of weighted rules that combine the insights from the two approaches, then instantiated in Markov Logic Networks. Markov Logic Networks leverage the strengths of both first-order logic and probabilistic graphical models, allow for a variety of efficient inference methods, and have been applied to a number of related problems; the models will be evaluated offline against test data using standard machine learning techniques. Finally, the team will develop training materials based on existing work from the International Federation of Library Associations and Institutions and on heuristic guidelines derived from the modeling work in the first two tasks, evaluate them through the experiments described earlier, and disseminate them online along with the developed models.
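
To make the idea of weighted rules in a Markov Logic Network more concrete, the following is a minimal sketch, not the project's actual model. It assumes a single article described by three hypothetical yes/no atoms (Clickbait, UnverifiedSource, Misinfo) and three illustrative rules with made-up weights. In a Markov Logic Network, each possible world is scored by the exponential of the summed weights of the rules it satisfies, and the sketch computes a conditional probability by enumerating the worlds consistent with the evidence. Real systems use far more efficient inference than exhaustive enumeration.

```python
import math
from itertools import product

# Hypothetical atoms describing one article; none of these names come from the project.
ATOMS = ["Clickbait", "UnverifiedSource", "Misinfo"]


def implies(a, b):
    """Truth value of the logical implication a => b."""
    return (not a) or b


# Illustrative weighted rules: (weight, formula over a world).
# The weights are invented for this sketch, not learned from data.
RULES = [
    (1.5, lambda w: implies(w["Clickbait"], w["Misinfo"])),
    (2.0, lambda w: implies(w["UnverifiedSource"], w["Misinfo"])),
    (0.5, lambda w: not w["Misinfo"]),  # weak prior: most articles are fine
]


def score(world):
    """Unnormalized weight of a world: exp(sum of weights of satisfied rules)."""
    return math.exp(sum(weight for weight, rule in RULES if rule(world)))


def conditional(query, evidence):
    """P(query is True | evidence), by enumerating all unobserved atoms."""
    hidden = [a for a in ATOMS if a not in evidence]
    numerator = 0.0
    denominator = 0.0
    for values in product([False, True], repeat=len(hidden)):
        world = dict(evidence)
        world.update(zip(hidden, values))
        s = score(world)
        denominator += s
        if world[query]:
            numerator += s
    return numerator / denominator


p_both = conditional("Misinfo", {"Clickbait": True, "UnverifiedSource": True})
p_none = conditional("Misinfo", {"Clickbait": False, "UnverifiedSource": False})
print(f"P(Misinfo | both cues present) = {p_both:.3f}")
print(f"P(Misinfo | no cues present)   = {p_none:.3f}")
```

Because the rules are soft, an article that violates one is made less probable rather than ruled out, which is what would let hand-written rules from the theoretical analysis sit alongside weights learned from labeled data.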

Research Area
Artificial Intelligence and Big Data
Social and Organizational Informatics