It's Not Just GitHub: Identifying Data and Software Sources Included in Publications

by Emily Escamilla, et al.

Paper publications are no longer the only form of research product. Due to recent initiatives by publication venues and funding institutions, open-access datasets and software products are increasingly considered research products, and URIs to these products are growing more prevalent in scholarly publications. However, as with all URIs, resources found on the live Web are not permanent. Archivists and institutions including Software Heritage, Internet Archive, and Zenodo are working to preserve data and software products as valuable parts of reproducibility, a cornerstone of scientific research. While some hosting platforms are well known and can be identified with regular expressions, researchers also host their data and software on a vast number of smaller, more niche platforms. If it is not feasible to manually identify all hosting platforms used by researchers, how can we identify URIs to open-access data and software (OADS) to aid in their preservation? We used a hybrid classifier to classify URIs as OADS URIs or non-OADS URIs. We found that URIs to Git hosting platforms (GHPs), including GitHub, GitLab, SourceForge, and Bitbucket, accounted for 33% of OADS URIs. Non-GHP OADS URIs are distributed across almost 50,000 unique hostnames. We determined that a hybrid classifier can identify OADS URIs on less common hosting platforms, which improves the discoverability of datasets and software products and so supports their preservation as research products needed for reproducibility.
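The abstract notes that well-known hosting platforms can be identified with regular expressions. The following is a minimal sketch of that idea only, not the authors' hybrid classifier, whose details are not given here. The function name `ghp_platform` and the pattern `GHP_HOST_PATTERN` are hypothetical names chosen for illustration, and the sketch assumes the four GHPs named in the abstract are served from their usual public hostnames.

```python
import re
from urllib.parse import urlsplit

# Assumption: the four GHPs named in the abstract, matched by their
# usual public hostnames. This list is illustrative, not exhaustive.
GHP_HOST_PATTERN = re.compile(
    r"(?:^|\.)(github\.com|gitlab\.com|sourceforge\.net|bitbucket\.org)$",
    re.IGNORECASE,
)


def ghp_platform(uri):
    """Return the matching GHP hostname for a URI, or None.

    Only the hostname is inspected, so a URI that merely mentions
    a platform name in its path or query is not matched.
    """
    host = urlsplit(uri).hostname
    if host is None:
        # No scheme/authority component, so there is no hostname to test.
        return None
    match = GHP_HOST_PATTERN.search(host)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    for uri in (
        "https://github.com/user/repo",
        "https://gist.github.com/user/abc123",
        "https://example.org/data/github.com",
    ):
        print(uri, "->", ghp_platform(uri))
```

A hostname rule like this covers only platforms that are already known. It says nothing about the non-GHP OADS URIs spread across almost 50,000 hostnames, which is the gap the abstract's classifier is meant to address.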


Reproducible Research is more than Publishing Research Artefacts: A Systematic Analysis of Jupyter Notebooks from Research Articles

With the advent of Open Science, researchers have started to publish the...

The Rise of GitHub in Scholarly Publications

The definition of scholarly content has expanded to include the data and...

From Data Processes to Data Products: Knowledge Infrastructures in Astronomy

We explore how astronomers take observational data from telescopes, proc...

Can scientists and their institutions become their own open access publishers?

This article offers a personal perspective on the current state of acade...

Linking Mathematical Software in Web Archives

The Web is our primary source of all kinds of information today. This in...

Forking Without Clicking: on How to Identify Software Repository Forks

The notion of software “fork” has been shifting over time from the (nega...

An Assessment Tool for Academic Research Managers in the Third World

The academic evaluation of the publication record of researchers is rele...
