Virtual Data Collaboratory: A Regional Cyberinfrastructure for Collaborative Data Intensive Science
This project develops a virtual data collaboratory that can be accessed by researchers, educators, and entrepreneurs across institutional and geographic boundaries, fostering community engagement and accelerating interdisciplinary research. A federated data system is created, using existing components and building upon existing cyberinfrastructure and resources in New Jersey and Pennsylvania. Seven universities are directly involved (the three Rutgers University campuses, Pennsylvania State University, the University of Pennsylvania, the University of Pittsburgh, Drexel University, Temple University, and the City University of New York); indirectly, other regional schools served by the New Jersey and Pennsylvania high-speed networks also participate. The system is applicable to several science and engineering domains, such as protein-DNA interaction and smart cities, and is likely to be extensible to other domains. The cyberinfrastructure is to be integrated into both graduate and undergraduate programs across several institutions.
The end product is a fully developed system for collaborative use by the research and education community. A data management and sharing system is constructed, based largely on commercial off-the-shelf technology. The storage system is based on the Hadoop Distributed File System (HDFS), a Java-based file system providing scalable and reliable data storage, designed to span large clusters of commodity servers. The Fedora and VIVO object-based storage systems are used, enabling linked data approaches. The system will be integrated with existing research data repositories, such as the Ocean Observatories Initiative and Protein Data Bank repositories. Regional high-performance computing and network infrastructure is leveraged, including New Jersey's Regional Education and Research Network (NJEdge), Pennsylvania's Keystone Initiative for Network Based Education and Research (KINBER), the Extreme Science and Engineering Discovery Environment (XSEDE) computing capabilities, Open Science Grid, and other NSF Campus Cyberinfrastructure investments. The project also develops a custom site federation and data services layer. The data services layer provides services for data linking, search, and sharing; coupling to computation, analytics, and visualization; mechanisms to attach unique Digital Object Identifiers (DOIs), archive data, and publish broadly to internal and wider audiences; and management of the long-term data lifecycle, ensuring immutable and authentic data and reproducible research.
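As an illustration of how a data services layer might tie a DOI to immutable, verifiable content, the following minimal sketch records a SHA-256 checksum alongside an identifier and later checks a copy of the data against it. This is an assumption for illustration only, not the project's actual design: the names `DataRecord`, `register`, and `verify` are hypothetical, and the DOI shown is a placeholder rather than a registered identifier.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DataRecord:
    """Hypothetical catalog entry linking an identifier to a content digest."""
    doi: str      # placeholder identifier, not a registered DOI
    sha256: str   # hex digest of the archived bytes
    size: int     # length of the archived bytes


def register(doi: str, payload: bytes) -> DataRecord:
    """Create a record for a payload at the time it is archived."""
    digest = hashlib.sha256(payload).hexdigest()
    return DataRecord(doi=doi, sha256=digest, size=len(payload))


def verify(record: DataRecord, payload: bytes) -> bool:
    """Return True only if the payload matches the recorded size and digest."""
    if len(payload) != record.size:
        return False
    return hashlib.sha256(payload).hexdigest() == record.sha256


if __name__ == "__main__":
    original = b"example dataset contents"
    record = register("10.0000/example", original)
    print(verify(record, original))            # unchanged copy
    print(verify(record, original + b"!"))     # altered copy
```

Storing the digest with the identifier lets any site in a federation confirm that the bytes it holds are the ones that were published, which is one common way to support authenticity and reproducibility claims.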