Guidance for sharing mobile data in human-subject research studies

Summary

While the field of mobile technology is constantly evolving, researchers at the University of Michigan and Michigan Medicine are eager to share lessons learned and best practices. The Mobile Technology Research & Innovation Collaborative (MeTRIC) is a campus-wide collaboration through which Michigan experts can share knowledge and resources for utilizing mobile tech in research studies. This article is the result of conversations with MeTRIC members. It provides general guidance for sharing mobile data based on the lessons learned in U-M research studies, as well as evolving best practices from government and academic institutions worldwide.

Note: This article may be updated often as we discover new information and receive feedback from the research community.

 

Details

Mobile data should be open, publicly accessible, and easily discoverable.

  1. To meet NIH requirements and to advance the fields of mobile technology and mental health, studies should find ways to make de-identified mobile data available to the public in a way that is:
    1. Open (e.g. no login or payment required, data dictionaries are available, common data models used to allow interoperability)
    2. Publicly accessible (e.g. not behind firewalls)
    3. Easily discoverable (e.g. metadata is found through search engines or search pages)
       
  2. Mobile data repositories should withstand the test of time:
    1. Using long-term repositories such as University of Michigan's Deep Blue Data helps ensure that data is preserved for future research for at least 10 years
    2. Plan for long-term storage costs in advance (during grant funding), keeping in mind that mobile data typically ranges from 5 to 100 MB per day per participant, depending on the granularity and the specific sensors used
       
  3. Data Management and Sharing Plans should be written with specifics of mobile technologies in mind, and should follow NIH guidance
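
As a rough illustration of the storage planning mentioned above, the following sketch estimates total raw data volume from the 5 to 100 MB per day per participant range. The function name, the participant count, and the study length are hypothetical values chosen for the example, and the estimate ignores compression, backups, and derived datasets.

```python
def estimate_storage_gb(participants, study_days, mb_per_day):
    """Estimate raw storage in gigabytes (using 1 GB = 1000 MB).

    All inputs are assumptions supplied by the study team; this is a
    back-of-the-envelope calculation, not a pricing tool.
    """
    total_mb = participants * study_days * mb_per_day
    return total_mb / 1000


# Hypothetical study: 200 participants followed for 365 days.
low_gb = estimate_storage_gb(200, 365, 5)     # low end of the range
high_gb = estimate_storage_gb(200, 365, 100)  # high end of the range
print(f"Estimated raw storage: {low_gb:.0f} GB to {high_gb:.0f} GB")
```

For this hypothetical study the estimate ranges from 365 GB to 7,300 GB, a twentyfold spread, which is why the sensor list and sampling granularity should be settled before budgeting.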
     

Mobile Study Reproducibility Chart

 

Studies that use mobile data should be easily reproducible.

  1. Reproducibility in mobile studies is not 100% guaranteed, especially when using consumer-grade devices that have closed-source algorithms. However, to meet NIH requirements and to advance the fields of mobile technology and mental health, researchers should aim to make the studies as reproducible as possible while documenting unknowns and assumptions made
     
  2. Aim to share all code used for ETL, ELT, data pipelines, data flows, automation, etc. in public repositories such as GitHub
    1. Because U-M is a publicly funded institution that receives federal funds, the code written to support studies is taxpayer-funded and belongs to the taxpayers
    2. Sharing code publicly from the start can help researchers receive feedback from the global scientific community. Contributing code opens the doors to receive contributions in return
    3. NIH-funded projects must make data-processing and data-transfer code publicly available, to ensure future reproducibility
       
  3. Ensure all code is well documented
    1. Create a "quick start" or "getting started" guide on how to set up the development environment, how to run your code, what inputs to use, and what outputs to expect
    2. Include comments throughout your code, explaining any assumptions, what each code block does, and providing examples of expected input and output
    3. Assume that a non-software developer will view your code, and help them understand what you are doing and why you are doing it that way
    4. If possible, use software tools that generate code documentation based on your code and comments, and publish that code documentation alongside your data dictionary
       
  4. Whenever possible, use data analysis tools, libraries and frameworks that are open source and/or freely available to ease the burden of costs needed to reproduce the study
    1. Pay close attention to license terms, which may prevent sharing code that uses specific commercial libraries. U-M Fast Forward Innovation recommends using the GPL v3 open source license for your code, and using libraries that are released under a similar open source license
    2. Using commercial statistical, analytics or modeling software is fine as long as any vendor-supplied functions utilized can be easily reproduced without the vendor's software. For example, functions based on mathematical formulas or industry standards, such as AVG() for averages, can be replicated in a different software package. Software-specific formulas that do not provide public documentation of the underlying algorithms would be problematic, as other researchers would not be able to reproduce the algorithms without using the exact same software package, which may be cost-prohibitive
       
  5. Document the specific versions of libraries and all software tools used
    1. If the implementation of a library or tool changes over time, knowing the specific version would allow researchers to reproduce the study by downloading that specific version
       
  6. Provide a complete data dictionary
    1. Document all files, tables, columns, functions, stored procedures, data flows / data pipelines, and automation processes to meet NIH requirements and to ensure metadata is easy to find
    2. At the very least, ensure you provide a document with the names of every table and column, data types, meaningful descriptions, and relationships between tables. See data dictionary examples here.
    3. Make use of software tools to generate data dictionaries, but always review and provide descriptions for tables and columns instead of relying on the tools
    4. Whenever possible, use enterprise data dictionary tools that provide a search page, so it is easy to find keywords in column names and comments
    5. Whenever possible, metadata such as data dictionaries should be searchable through Internet search engines like Google, Bing and DuckDuckGo. This implies that files should not be compressed inside Zip or Tar/GZip files, and should not be in binary formats. Use enterprise data dictionary tools or standard formats like HTML, OpenDocument Format, Excel, etc.
       
  7. If applicable, provide a data model
    1. Include a logical diagram or entity-relationship diagram (ERD) of the files/tables to show relationships between the different files/tables
    2. Include data flow diagrams of any data pipelines used
       
  8. Ensure that data visualizations are also easily reproducible
    1. Any visualizations should either include the underlying data, or explain which tables/columns were used to generate them
    2. Visualizations created in dashboard or reporting software such as Tableau or Power BI should allow access to underlying data, and the dashboard files themselves should be shared in a public repository
    3. Visualizations and graphs created in statistical software packages like R should document the code used for generating them
    4. Visualizations and graphs created programmatically should make the code available in a public repository, and should document the specific library versions used to create them
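
One possible way to document the library and tool versions recommended above is to have the analysis code write them out automatically on each run. The sketch below uses only the Python standard library; the function name and the package list are hypothetical and should be replaced with whatever the study actually uses.

```python
import json
import platform
import sys
from importlib import metadata


def collect_versions(package_names):
    """Return the Python version, platform, and installed package versions.

    Packages that are not installed are recorded as None so that gaps
    are visible in the report instead of silently omitted.
    """
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }


if __name__ == "__main__":
    # Hypothetical package list, used here only for illustration.
    report = collect_versions(["pandas", "numpy", "matplotlib"])
    json.dump(report, sys.stdout, indent=2)
```

Saving this report next to the shared code and data dictionary gives other researchers a record of the exact environment to recreate. It is a starting point only, since it does not capture operating system packages or device firmware versions.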

 

 

About the Author

Gabriel Mongefranco is a Mobile Data Architect at the University of Michigan Eisenberg Family Depression Center. Gabriel has over a decade of experience in data analytics, dashboard design, automation, back end software development, database design, middleware and API architecture, and technical writing.
