Improving Technical Paper Database Search through Math-Aware Search Engines

C. Lee Giles

Sponsoring Agency
Sloan Foundation


Today's search engines use sophisticated techniques for searching on words but cannot make nuanced use of mathematical notation. This project aims to allow scientists, engineers, mathematicians, and students to locate technical information using words, mathematical notation, or a combination of the two. For example, a mathematician studying graph theory could use these new capabilities to find related applications in physics, ecology, and social network analysis, despite differences in the notation and terminology used in those disciplines. Given a large collection of technical documents, we will apply machine learning techniques to construct associations between the formulae and the words used to explain mathematical ideas, and to determine how to translate automatically between those two forms of expression. These associations and translations can then serve students who describe what they are looking for in words: the search engine will find documents that express those same ideas, even if only in mathematical notation. These new math-aware search engines will accelerate innovation by allowing searchers to discover information across technical disciplines and, by using mathematical notation as a pivot, even across human languages.

To accomplish these goals, the project will develop novel scalable techniques for indexing and retrieval of mathematical content in technical documents. These methods will accommodate a broad range of notational conventions, formats, and encodings. New context-based methods for inferring associations between formulae and related text will be used to build rich and flexible models of content equivalence. These equivalence models will be used in new ranking algorithms that integrate results found using words or using mathematical notation into a single ranked list. Open-source reference implementations will be shared publicly, and new test collections created to evaluate these implementations will be shared with other researchers. To gain experience with the use of these new capabilities, the project will add math-aware search to the CiteSeerX digital library of scientific literature. CiteSeerX is an open Web service that can be used to compare alternative retrieval methods in actual use. For further information see the project Web page:

Research Area
Artificial Intelligence and Big Data
Privacy and Security