CiteSeerX: Toward Sustainable Support of Scholarly Big Data

Researcher(s)

Sponsoring Agency
National Science Foundation

Summary

Access to the scientific and scholarly literature has changed radically in recent decades. Increasingly, researchers and scholars make their publications freely available on the Web. Taking advantage of this opportunity, new scientific search engines have been developed, such as Google Scholar, Semantic Scholar, and CiteSeer, now CiteSeerX. CiteSeerX has become one of the most comprehensive and widely used online public resources for the Computer and Information Science and Engineering (CISE) research community. Millions of CiteSeerX Portable Document Format (PDF) documents are indexed by Google. CiteSeerX is unique among digital library search engines. It is open access, almost all of its documents are harvested from the public Web, and users have full-text access to all documents searchable on its website. Moreover, it provides all automatically extracted metadata and citation context via an Open Archives Initiative (OAI) metadata service interface and bulk downloads on a public cloud, all under a Creative Commons license. This service is usually not available from other scholarly search engines. CiteSeerX performs automatic extraction and indexing of tables (in production), figures (developed), and algorithms (developed), capabilities rarely seen in other scholarly search engines. CiteSeerX provides its open source software and architecture on GitHub. At this time, none of the other above-mentioned systems release their digital library software.

Utilizing the established CiteSeerX infrastructure, this proposal aims to create a sustainable CiteSeerX system with new data resources and a much larger data collection. We will develop a new system that runs with low operational overhead and without a single point of failure, and that provides high-quality, enriched data and metadata in portable formats through accessible user interfaces. We will ingest all freely accessible scientific documents on the Web, currently estimated at 30 million. CiteSeerX will make high-quality metadata available through an accessible Web User Interface, an Application Programming Interface, and data dumps. SeerSuite, the platform on which CiteSeerX is built, will be refactored into an easily deployable and configurable scholarly digital library framework built on commercial-grade open source software. In addition, we will provide searchable semantic metadata, such as key phrases and disambiguated author names, and non-textual content, such as data from figures, tables, algorithms, and equations. For long-term sustainability, we will explore different monetization models. The result will be a refactored digital library search engine that provides stable, usable, and reliable data services on multiple types of scientific documents. It will be built on a portable, maintainable, and self-contained framework that can be deployed for other digital collections of research documents. Source code will be hosted at https://github.com/SeerLabs. System development and related research will be published in relevant venues and made publicly available.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Term
 -