Essential Guide to Sharing Code in Biomedical Research

Introduction

Research projects generate valuable assets such as data, analytic code, and computational models, but this value often goes unrealized due to limited sharing and curation (Vuorre & Crump, 2021). Sharing data and code is a growing practice in biomedical research (Loder et al., 2024), yet rates remain low: between 2016 and 2021, actual public data availability hovered around 2% and public code sharing stayed below 0.5% (Hamilton et al., 2023).

This article explores the importance of sharing, addresses common reservations among researchers, and provides practical advice on how to share effectively. At the core of this discussion is the idea that by increasing transparency and releasing code as open source, researchers not only meet the requirements of funding agencies and publications but also stimulate institutional, national, and global research progress.

 

Why Share Data and Source Code?

Expanding Research and Impact

Sharing research data and code can significantly expand the reach and impact of your work (Piwowar et al., 2007), while also enhancing the quality of science. Among other activities that drive discovery and replicability, Hamilton et al. (2023) and Vuorre and Crump (2021) suggest that sharing gives researchers opportunities to:

  • strengthen their methods
  • validate existing findings
  • answer questions not originally considered
  • synthesize existing datasets to accelerate discovery

 

Enhancing Research Quality

Analytical approaches can significantly influence results. For instance, Silberzahn et al. (2018) directed 29 teams to analyze the same dataset to address whether soccer referees were more likely to give red cards to dark-skin-toned players than to light-skin-toned players. Twenty teams (69%) observed a statistically significant positive effect, while 9 teams (31%) did not, demonstrating that well-intentioned, justifiable analytic choices can yield markedly different results.

 

Facilitating Secondary Analysis and Data Integration

Access to data allows researchers to conduct secondary analyses and integrate existing datasets, increasing statistical efficiency and accelerating discovery (Cheng & Phillips, 2014). For example, at the d3center, we are developing methods for integrating data across micro-randomized trials (MRTs) used to build just-in-time adaptive interventions (JITAIs), enabling mobile health researchers to leverage data from previous MRTs that employ similar interventions (Huch, 2024).

 

Compliance with Funding and Publishing Requirements

Increasingly, sharing data and code is becoming a matter of compliance. Many funding agencies and publications are reinforcing existing policies and creating new standards for data and code accessibility. For example, the NIH issued a new Data Management and Sharing (DMS) policy effective January 25, 2023, requiring all applicants planning to generate scientific data to submit a DMS Plan (NIH, 2023). Similarly, starting May 1, 2024, the BMJ will require authors to post data and code to a repository and submit relevant analytical code in a supplementary file (BMJ, 2024).

 

Takeaway

There are many reasons to share data and code. These examples illustrate that open science is at once a matter of increasing the visibility and impact of your work, improving the quality of science and advancing the discovery process, and complying with mandates to ensure science and the public benefit from your work.

 

Institutional Considerations

In addition to the numerous benefits of sharing data and code, there are several important considerations when dealing with code developed at the University of Michigan:

  • Institutional Ownership and Obligations: Code developed at U-M is owned by the University, even if its development was funded by a federal grant. Research paid for by industry may have different restrictions based on the contract. It’s essential to ensure there are no obligations to sponsors that would prohibit the release of software as open source. This requires a thorough review of all relevant contracts and agreements.
  • Team Alignment: Engaging in discussions with the Principal Investigator (PI) and the research team is crucial. This ensures everyone is in alignment on the decision to release the code as open source and understands the implications and benefits of doing so.
  • Forward IP Management and Commercialization: It’s important to consider the future management of intellectual property, including commercialization opportunities. One approach could be dual licensing, which allows for academic access while preserving commercial options. This strategy helps balance open access with potential revenue generation.
  • Collaboration with Innovation Partnerships: Engaging with Innovation Partnerships’ Michigan Open Source Software (MOSS) office is vital in navigating the open-source software release process. The MOSS office can provide guidance and support to ensure all legal, ethical, and practical considerations are addressed, aligning the release with institutional policies and goals.
  • License Compliance Review: Before open-source software is released, the MOSS office can help select the right type of license, particularly if researchers are using other open-source libraries in their software.

 

Considering these factors helps foster a collaborative, transparent, and legally sound approach to sharing data and code. Researchers can contribute to the open science movement while ensuring compliance with institutional policies by adhering to the following procedure before publishing code:

  1. Submit an invention report to the MOSS office
  2. For internal or non-commercial code, publish code to GitHub/GitLab (see repository template) with an open source license
  3. For commercial code or code for industry-sponsored projects, discuss with MOSS office before publishing code to ensure the correct license is selected

 

What Are the Barriers?

Despite the benefits, several barriers make researchers hesitant to share. Common concerns include:

  • Fear of releasing code that is not fully polished or "ready"
  • The burden of maintaining shared code and providing user support
  • Safeguarding intellectual property

These barriers can be managed with the right strategies and mindset. It's about setting expectations for progress and collaboration, not perfection.

 

Addressing Barriers while Sharing Effectively

Sharing data and source code effectively involves more than just making them available; it also means ensuring they are usable and valuable to others while protecting your research program from unnecessary risk. Here are some best practices:

 

Addressing Fear of Releasing Unpolished Code

One common concern among researchers is the fear of releasing code that is not fully polished or “ready,” which could lead to scrutiny. Here are some strategies to mitigate this fear:

Set Realistic Expectations: Clearly communicate the status of your code in the README file. Specify if the code is a work in progress and outline any known limitations or areas for improvement.

Encourage Collaboration: Emphasize the collaborative nature of open-source projects. Sharing code, even if not perfect, invites feedback and contributions from others, which can lead to improvements and innovations.

Document Thoroughly: Include a README file with your code that contains:

  • A summary and "quick start" guide
  • Expectations of when to contact you and the level of support you are able to provide (if any)
  • A list of authors/contributors
  • An open-source license notice and appropriate copyright notice (e.g. The Regents of the University of Michigan). The Depression Center has developed a template that researchers can use when creating new repositories.

Think of the README file as both a quick-start guide for other researchers and a marketing tool for your work. Discuss your other studies, link to more of your code, and link to your lab.

In addition to the README:

  • Include a LICENSE file containing a copy of the open-source license you selected.
  • Use comments throughout your code, especially to clarify assumptions and logic; assume that other researchers have no prior knowledge of your type of work or the programming language you used (see the sketch below).
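
To make the commenting advice concrete, here is a minimal, hypothetical sketch of the level of annotation that helps unfamiliar readers. The file name, column names, and model are illustrative placeholders, not part of any particular study.

```python
# A minimal sketch of the kind of commenting that helps other researchers
# follow analysis code. File names, column names, and the model itself are
# hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Load the de-identified analysis dataset. Assumption: one row per
# participant, with no missing values in the columns used below.
df = pd.read_csv("data/analysis_dataset.csv")

# Outcome is measured at week 12; baseline severity is included as a
# covariate because randomization was stratified on it.
model = smf.ols("outcome_week12 ~ treatment + baseline_severity", data=df)
results = model.fit()

# Print the full model summary, which includes coefficient estimates and
# 95% confidence intervals, so readers can judge effect sizes directly.
print(results.summary())
```

Comments like these record the assumptions a reader would otherwise have to reconstruct from the code itself.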

 

Addressing Concerns About Maintenance and Responsibility

For researchers worried about the ongoing maintenance of their shared code or the expectation to provide user support, several strategies can mitigate these concerns:

  • Clear Documentation: Providing comprehensive documentation can reduce the number of inquiries from users and help them solve issues independently.
  • Community Support Channels: Establish or engage with existing forums or other channels where users can help each other. Both GitHub and GitLab provide discussion boards, and the Depression Center’s Knowledge Base also hosts a discussion board for questions about code.
  • Versioning and Release Practices: Clearly state in the repository's license or documentation that the code is provided "as-is." Be specific about the level of support you can provide; for example, you might note that you or your team will address serious vulnerabilities but cannot respond to enhancement requests.
  • Limiting Contact Options: GitHub and GitLab allow researchers to turn off collaborative features such as issues, tasks, and pull requests.
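
As an aside on that last point, the same repository settings can also be changed programmatically. The sketch below uses GitHub's REST API ("Update a repository" endpoint) from Python; the owner and repository names and the GITHUB_TOKEN environment variable are hypothetical placeholders, and the equivalent toggles are available by hand on the repository's Settings page (GitLab exposes comparable project settings).

```python
# A hedged sketch of limiting contact options via the GitHub REST API.
# OWNER, REPO, and the GITHUB_TOKEN environment variable are placeholders.
import os
import requests

OWNER = "example-lab"      # hypothetical organization
REPO = "example-analysis"  # hypothetical repository

response = requests.patch(
    f"https://api.github.com/repos/{OWNER}/{REPO}",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "has_issues": False,    # turn off the issue tracker
        "has_projects": False,  # turn off project boards
        "has_wiki": False,      # turn off the wiki
    },
    timeout=30,
)
response.raise_for_status()
```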

 

Protecting Intellectual Property

To safeguard their work from misappropriation, researchers can take the following steps:

Licensing: Choosing an appropriate open-source license is crucial to dictate how others can use, modify, and distribute your code. This legal framework protects the creator’s rights while allowing for sharing. Here are key points about licensing:

  • Types of OSS Licenses: OSS licenses fall into two broad families: permissive and copyleft.
  • Permissive Licenses: Examples include Apache, BSD, and MIT. These place few conditions on reuse, so derivative works can be kept proprietary or patented.
  • Copyleft Licenses: Examples include GPLv3 and AGPLv3, with LGPLv3 as a weaker variant. These require derivative works to be distributed under the same license, and their patent clauses effectively prevent patents from being asserted against downstream users of the code.
  • NIH Requirements: Note that the NIH does not require open-source licenses, only that code be made available, particularly code needed for reproducibility.
  • GPL License Implications: Software that incorporates or links to GPL-licensed libraries generally must also be released under the GPL when it is distributed.

Patent: If applicable, filing patents for innovative algorithms or methodologies before releasing the code can protect commercial interests. Investigators should contact the Innovation Partnerships office for more information. Here are some considerations:

  • Rarity of Software Patents: Software patents are rare and difficult to obtain; only about 2% of software projects at U-M apply for patents.
  • License Recommendations: If a patent will be needed, use the Apache License; otherwise, GPLv3 is recommended.
  • Avoid Custom Licenses: Do not write your own license. Always choose a well-known license with a proven track record (e.g., Apache, BSD, MIT, GPL, AGPL, LGPL).

Contribution Tracking: Using version control systems like Git allows for clear attribution of contributions, which can help in asserting authorship and ownership. Here are some key points:

  • Citation Example with DOI: Include a citation example with a DOI in your repository to help ensure the code is cited appropriately. Once a DOI is attached, you can track citations of your code in other software and papers (learn more).
  • ORCID: Create an ORCID account and link it to Michigan Research Experts and GitHub/GitLab. This facilitates the automated discovery of publications and citations.

By following these guidelines, researchers can effectively protect their intellectual property while contributing to the open-source community.

 

Best Practices for Security and Preventing Data Leaks

Concerns about passwords and sensitive data being accidentally included in shared code are valid but manageable through best practices:

  • Use Environment Variables: Instead of hard-coding passwords or tokens in the source code, read them from environment variables. This practice keeps sensitive data out of your code base (learn more; see the sketch after this list).
  • Leverage Security Tools: GitHub and GitLab offer tools to help detect sensitive information before it is pushed to the repository, including security scans that flag leaked passwords and other confidential data. Repository visibility can also be set to private (only specific people can access) or internal (only University of Michigan users can access). See GitHub’s Best Practices for Preventing Data Leaks in Your Organization for more information.
  • Audits: ITS Information Assurance can help with periodic audits and security scans to review your code base for accidentally committed sensitive information and other vulnerabilities.
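
To illustrate the environment-variable recommendation above, here is a minimal sketch of one common pattern in Python. The variable names and the default value are hypothetical placeholders.

```python
# A minimal sketch of reading credentials from environment variables
# instead of hard-coding them. Variable names and defaults are placeholders.
import os

# Fail fast with a clear error if the secret is missing, rather than
# silently falling back to a credential embedded in the code.
API_TOKEN = os.environ["STUDY_API_TOKEN"]

# Non-sensitive settings can use a documented default.
DB_HOST = os.environ.get("STUDY_DB_HOST", "localhost")

# The secret never appears in the repository; only its length is printed here.
print(f"Connecting to {DB_HOST} with a token of length {len(API_TOKEN)}")
```

Collaborators then supply their own values locally (for example, through a .env file excluded from version control via .gitignore), so no secret ever enters the repository history.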

 

Take Action

We encourage all researchers to consider the wide-ranging benefits of sharing their data and code. Start small if necessary — perhaps by sharing data from past projects or smaller, less critical pieces of code. Use the guidelines and best practices outlined in this article to prepare your resources for sharing. Remember, the goal is to cultivate a more collaborative and transparent scientific environment.

Engage with platforms and communities that support open science. Starting this summer, investigators can list their code in the U-M Open Source Portfolio. To get listed, request a MOSS consultation from the Innovation Partnerships office. In addition, U-M faculty, staff, and students have free access to GitHub Enterprise, and Michigan Medicine and Medical School faculty and staff have free access to GitLab Enterprise. Investigators can also publish code under the Depression Center's GitHub Enterprise license, which is easier than setting up their own enterprise project.

Whether it’s depositing data in a well-regarded repository or sharing code via platforms like GitHub, every action you take builds towards a more open and productive scientific community.

Sharing data and source code is more than a best practice; it's a commitment to the future of science. By embracing open sharing, researchers not only comply with funding mandates but also contribute to a culture of transparency and cooperation that is essential for rapid scientific advancement. Let’s all take steps towards this goal, ensuring that our research can have the broadest possible impact.

 


References

  1. BMJ. (2024). Mandatory data and code sharing for research published by The BMJ. BMJ, 384, q324. https://doi.org/10.1136/bmj.q324
  2. Cheng, H. G., & Phillips, M. R. (2014). Secondary analysis of existing data: Opportunities and implementation. Shanghai Archives of Psychiatry, 26(6), 371–375. https://doi.org/10.11919/j.issn.1002-0829.21417
  3. Hamilton, D. G., Hong, K., Fraser, H., Rowhani-Farid, A., Fidler, F., & Page, M. J. (2023). Prevalence and predictors of data and code sharing in the medical and health sciences: Systematic review with meta-analysis of individual participant data. BMJ, 382, e075767. https://doi.org/10.1136/bmj-2023-075767
  4. Huch, E. (2024). Data integration methods for micro-randomized trials. https://doi.org/10.48550/arXiv.2403.13934
  5. Ohmann, C., Moher, D., Siebert, M., Motschall, E., & Naudet, F. (2021). Status, use and impact of sharing individual participant data from clinical trials: A scoping review. BMJ Open, 11(8), e049228. https://doi.org/10.1136/bmjopen-2021-049228
  6. Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLOS ONE, 2(3), e308. https://doi.org/10.1371/journal.pone.0000308
  7. Silberzahn, R., Uhlmann, E. L., Martin, D. P., et al. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337–356. https://doi.org/10.1177/2515245917747646
  8. Vuorre, M., & Crump, M. J. C. (2021). Sharing and organizing research products as R packages. Behavior Research Methods, 53(2), 792–802. https://doi.org/10.3758/s13428-020-01436-x
  9. National Institutes of Health. Data sharing approaches. https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/data-sharing-approaches
  10. National Institutes of Health. Best practices for sharing research software: FAQ. https://datascience.nih.gov/tools-and-analytics/best-practices-for-sharing-research-software-faq
  11. University of Michigan Innovation Partnerships. Michigan Open Source Software (MOSS). https://innovationpartnerships.umich.edu/resource/moss/
  12. Depression Center repository template. https://github.com/DepressionCenter/EFDC-Repo-Template/
  13. GitLab documentation. CI/CD variables. https://docs.gitlab.com/ee/ci/variables/
  14. GitHub documentation. Best practices for preventing data leaks in your organization. https://docs.github.com/en/code-security/getting-started/best-practices-for-preventing-data-leaks-in-your-organization
  15. Depression Center on GitHub. github.com/depressioncenter
  16. GNU licenses. https://www.gnu.org/licenses/

 

 

About the Author

Ian Burnette is a Senior Brand/Product Analyst at the University of Michigan’s Institute for Social Research. Ian co-leads research dissemination efforts for the d3center, a NIDA Center of Excellence. They specialize in developing innovative training products that equip scientists with novel methods for designing adaptive interventions that respond effectively to individuals' changing needs.
