Guidance for sharing mobile data in human-subject research studies

Summary

While the field of mobile technology is constantly evolving, researchers at the University of Michigan and Michigan Medicine are eager to share lessons learned and best practices. The Mobile Technology Research & Innovation Collaborative (MeTRIC) is a campus-wide collaboration through which Michigan experts can share knowledge and resources for utilizing mobile tech in research studies. This article is the result of conversations with MeTRIC members. It provides general guidance for sharing mobile data based on the lessons learned in U-M research studies, as well as evolving best practices from government and academic institutions worldwide.

Note: This article may be updated often as we discover new information and receive feedback from the research community.

Details

Mobile data should be open, publicly accessible, and easily discoverable.

To meet NIH requirements and to advance the fields of mobile technology and mental health, studies should find ways to make de-identified mobile data available to the public in a way that is:
1. Open (e.g. no login or payment required, data dictionaries are available, common data models used to allow interoperability)
2. Publicly accessible (e.g. not behind firewalls)
3. Easily discoverable (e.g. metadata is found through search engines or search pages)
Mobile data repositories should withstand the test of time:
1. Using long-term repositories such as University of Michigan's Deep Blue Data helps ensure that data is preserved for future research for at least 10 years
2. Considerations should be made in advance (during grant funding) for long-term storage costs, keeping in mind that mobile data can typically span 5-100MB per day per participant depending on the granularity and the specific sensors used
Data Management and Sharing Plans should be written with specifics of mobile technologies in mind, and should follow funder guidance. Learn about Data Management & Sharing Plan Resources.
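The storage-cost consideration above can be turned into a rough budget figure before a grant is submitted. The following is a minimal sketch, not an official calculator: the function name, the study size (200 participants, 365 days), and the 1 GB = 1024 MB convention are assumptions chosen for illustration. Only the 5-100 MB per participant per day range comes from this article.

```python
def estimate_storage_gb(participants: int, study_days: int, mb_per_day: float) -> float:
    """Return the estimated raw data volume in gigabytes (assuming 1 GB = 1024 MB)."""
    total_mb = participants * study_days * mb_per_day
    return total_mb / 1024


# Hypothetical study: 200 participants followed for 365 days.
low = estimate_storage_gb(200, 365, 5)     # lower bound from the guidance (5 MB/day)
high = estimate_storage_gb(200, 365, 100)  # upper bound from the guidance (100 MB/day)
print(f"Estimated raw volume: {low:.1f} GB to {high:.1f} GB")
```

The estimate covers raw data only. Derived tables, backups, and repository fees would need to be added separately.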

Studies that use mobile data should be easily reproducible.

Reproducibility in mobile studies is not 100% guaranteed, especially when using consumer-grade devices that have closed-source algorithms. However, to meet NIH requirements and to advance the fields of mobile technology and mental health, researchers should aim to make the studies as reproducible as possible while documenting unknowns and assumptions made.
Aim to share all code used for ETL, ELT, data pipelines, data flows, automation, etc. in public repositories such as GitHub
1. Because U-M is a publicly funded institution that receives federal funds, the code written to support studies is taxpayer-funded and belongs to the taxpayers
2. Sharing code publicly from the start can help researchers receive feedback from the global scientific community. Contributing code opens the doors to receive contributions in return
3. NIH-funded projects must make data-processing and data-transfer code publicly available, to ensure future reproducibility
Ensure all code is well documented
1. Create a "quick start" or "getting started" guide on how to set up the development environment, how to run your code, what inputs to use, and what outputs to expect
2. Include comments throughout your code, explaining any assumptions, what each code block does, and providing examples of expected input and output
3. Assume that a non-software developer will view your code, and help them understand what you are doing and why you are doing it that way
4. If possible, use software tools that generate code documentation based on your code and comments, and publish that code documentation alongside your data dictionary
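As an illustration of the documentation points above, here is a minimal sketch of a function whose docstring states its assumptions and shows an example input and output. The function name, the record layout, and the two listed assumptions are hypothetical and are not taken from any U-M study.

```python
from collections import defaultdict
from datetime import datetime


def daily_step_totals(records):
    """Sum minute-level step counts into one total per calendar day.

    Assumptions (written down so that other researchers can challenge them):
      * Timestamps are ISO 8601 strings in the participant's local time.
      * Minutes with no record contribute zero steps to the daily total.

    Input:  an iterable of (timestamp, steps) pairs, e.g.
            [("2024-01-01T08:00:00", 12), ("2024-01-01T08:01:00", 30)]
    Output: a dict mapping "YYYY-MM-DD" to total steps, e.g.
            {"2024-01-01": 42}
    """
    totals = defaultdict(int)
    for timestamp, steps in records:
        # Group by calendar date only; the time of day is not needed here.
        day = datetime.fromisoformat(timestamp).date().isoformat()
        totals[day] += steps
    return dict(totals)
```

Docstrings written this way can also be picked up by documentation generators, which supports publishing code documentation alongside the data dictionary.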
Whenever possible, use data analysis tools, libraries and frameworks that are open source and/or freely available to ease the burden of costs needed to reproduce the study
1. Pay close attention to the terms of licenses which may prevent sharing of code that used specific commercial libraries. U-M Fast Forward Innovation recommends using the GPL v3 open source license for your code, and using libraries that are released under a similar open source license
2. Using commercial statistical, analytics or modeling software is fine as long as any vendor-supplied functions utilized can be easily reproduced without the vendor's software. For example, functions based on mathematical formulas or industry standards, such as AVG() for averages, would allow doing the same thing in a different software package. Software-specific formulas that do not provide public documentation of the underlying algorithms would be problematic, as other researchers would not be able to reproduce the algorithms without using the exact same software package, which may be cost-prohibitive
Document the specific versions of libraries and all software tools used
1. If the implementation of a library or tool changes over time, knowing the specific version would allow researchers to reproduce the study by downloading that specific version
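One way to record versions is to capture them from the environment in which the analysis actually ran. The sketch below uses only the Python standard library. The helper name and the package list ("pandas", "numpy") are placeholders for illustration and should be replaced with whatever the study actually imports.

```python
import json
import platform
from importlib import metadata


def capture_versions(package_names):
    """Return the Python version and the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of silently skipping it.
            versions[name] = "not installed"
    return versions


# Hypothetical package list; replace with the libraries the study uses.
snapshot = capture_versions(["pandas", "numpy"])
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Saving this output next to the shared code gives other researchers a concrete list of versions to install when reproducing the study.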
Provide a complete data dictionary
1. Document all files, tables, columns, functions, stored procedures, data flows / data pipelines, and automation processes to meet NIH requirements and to ensure metadata is easy to find
2. At the very least, ensure you provide a document with the names of every table and column, data types, meaningful descriptions, and relationships between tables. See data dictionary examples here.
3. Make use of software tools to generate data dictionaries, but always review and provide descriptions for tables and columns instead of relying on the tools
4. Whenever possible, use enterprise data dictionary tools that provide a search page, so it is easy to find keywords in column names and comments
5. Whenever possible, metadata such as data dictionaries should be searchable through Internet search engines like Google, Bing and DuckDuckGo. This implies that files should not be compressed inside Zip or Tar/GZip files, and should not be in binary formats. Use enterprise data dictionary tools or standard formats like HTML, OpenDocument format, Excel, etc.
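The data dictionary points above can be sketched with a small script that reads table and column names from a database and combines them with hand-written descriptions. This is only an illustration under stated assumptions: it uses an in-memory SQLite database, and the heart_rate table, its columns, and the descriptions are invented for the example. A real study would point the script at its own database and would still need a person to review every description.

```python
import html
import sqlite3

# Hand-written descriptions; tools can list columns but cannot explain them.
DESCRIPTIONS = {
    ("heart_rate", "participant_id"): "De-identified participant identifier",
    ("heart_rate", "measured_at"): "Measurement time, ISO 8601, UTC",
    ("heart_rate", "bpm"): "Heart rate in beats per minute",
}


def build_data_dictionary(conn):
    """Return one row per column: (table, column, data type, description)."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            # Flag undocumented columns so a reviewer can fill them in.
            description = DESCRIPTIONS.get((table, column), "TODO: describe")
            rows.append((table, column, col_type, description))
    return rows


def to_html(rows):
    """Render the dictionary as a plain, uncompressed HTML table."""
    header = "<tr><th>Table</th><th>Column</th><th>Type</th><th>Description</th></tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{header}{body}</table>"


# Hypothetical schema used only to demonstrate the script.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE heart_rate (participant_id TEXT, measured_at TEXT, bpm INTEGER)"
)
dictionary = build_data_dictionary(conn)
print(to_html(dictionary))
```

Plain HTML is used here because it is a text format that search engines can index, in line with the recommendation against compressed or binary files.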
If applicable, provide a data model
1. Include a logical diagram or entity-relationship diagram (ERD) of the files/tables to show relationships between the different files/tables
2. Include data flow diagrams of any data pipelines used
Ensure that data visualizations are also easily reproducible
1. Any visualizations should either include the underlying data, or explain which tables/columns were used to generate them
2. Visualizations created in dashboard or reporting software such as Tableau or Power BI should allow access to underlying data, and the dashboard files themselves should be shared in a public repository
3. Visualizations and graphs created in statistical software packages like R should document the code used for generating them
4. Visualizations and graphs created programmatically should make the code available in a public repository, and should document the specific library versions used to create them

Resources

National Institutes of Health - Which Policies Apply to My Research?
National Institutes of Health - Simplified DMS Policy | Full Data Management and Sharing Policy
University of Michigan Library / Deep Blue - Make a Data Management Plan | DMP Boilerplate
Free Software Foundation - What is Free Software? (Can Free Software be Commercial?)
GNU Operating System - How to Choose a License for Your Own Work
GNU Operating System - A Quick Guide to GPLv3

About the Author

Gabriel Mongefranco is a Mobile Data Architect at the University of Michigan Eisenberg Family Depression Center. Gabriel has over a decade of experience in data analytics, dashboard design, automation, back end software development, database design, middleware and API architecture, and technical writing.
