EMA-CleanR: Ecological Momentary Assessment (EMA) Data Processing in R

DOI: 10.5281/zenodo.17982076

Summary

Research teams that use Ecological Momentary Assessment (EMA) surveys often need to clean and pre-process the resulting data before analysis. Dr. Sarah Sperry and Victoria Murphy of the Emotion and Temporal Dynamics (EmoTe) Lab at the University of Michigan created EMA-CleanR, an R-based program for efficient pre-processing, cleaning, and visualization of EMA survey data. This article documents how to use EMA-CleanR to pre-process EMA data.

 

Screenshot of EMA-CleanR output in HTML

Screenshot of EMA-CleanR visualizations

 

Setup and Usage

EMA-CleanR is written in R Markdown, which allows R code to be organized into sections and displayed alongside tables and plots, making the data analysis steps easier to follow. The code is easiest to view as HTML (a web page), which can be generated by opening the EMA-CleanR.Rmd file in RStudio and clicking Knit. The code takes as input a single CSV file named "EMA-Data.csv" with certain required columns. Each EMA item (question) must be in a separate column, and the column headings should begin with "EMA_". Customizable parameters can be configured in the YAML-formatted section at the top of the .Rmd file.
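
For illustration, the first few lines of an input CSV might look like the sketch below. The column names match the requirements described in this article; the participant IDs, survey names, timestamps, and item values are made up, and your export's datetime format may differ.

    participantidentifier,surveyname,start_datetime,end_datetime,EMA_01,EMA_02,EMA_03
    101,Morning Survey,2024-03-04 10:02:13,2024-03-04 10:04:55,3,1,2
    101,Afternoon Survey,2024-03-04 14:01:40,2024-03-04 14:03:02,2,4,1
    205,Morning Survey,2024-03-04 10:10:05,2024-03-04 10:12:47,1,1,3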

 

Project Setup

  • Download the code from GitHub at: https://github.com/DepressionCenter/EMA-CleanR

    File structure
     
  • Replace EMA-Data.csv with your own file.
    • Ensure it has at least these columns: participantidentifier, surveyname, start_datetime, end_datetime.
    • There should be one column per EMA item (question), and the column headings should start with "EMA_" (e.g. EMA_01, EMA_02, etc.). This prefix is configurable under Parameters (more on that later).
    • Each row represents one survey taken by one participant at one point in time.

  • Open EMA-CleanR.Rmd with RStudio. If asked, install any missing packages.

  • Edit the parameters at the top if needed (e.g. input file name), in the YAML section.

 

Configure Project Parameters

To customize the analysis for your particular study, set the project parameters at the top of EMA-CleanR.Rmd, in the YAML-formatted section. After changing the parameters, click the Knit button again to re-run the analysis.

  • input_file: The name of your input CSV. It defaults to "EMA-Data.csv" which contains sample data.
  • input_file_has_headers: Indicates whether your CSV has column headers.
  • output_dir: The sub-directory (relative to the .Rmd file) where the output CSVs will be stored.
  • late_survey_cutoff_hour: If you want to allow late survey responses after midnight, set this parameter to at least one hour before the start of your first daily survey. Any responses received between midnight and this hour will be counted under the previous day. In the sample data, the earliest survey occurs at 10AM, so this parameter is set to 9 to allow late responses up until 9am. To disable, set it to 0.
  • ignore_surveys: Specific survey names to ignore (e.g. practice surveys).
  • surveys_per_day: Number of surveys per day. In some EMA software, each daily prompt time must be scheduled as a separate survey. Defaults to 4.
  • total_surveys_in_study: Total number of surveys. In this example, 4 surveys per day x 28 days = 112 surveys.
  • ema_item_prefix: All EMA columns (one per question) must start with this prefix. Defaults to "EMA_" (e.g. EMA_01, EMA_02, EMA_03, etc.)
  • ema_item_labels: Optional: set up friendly names for each of your EMA items. For example, EMA_01: "Nervous/Anxious" will make the code display "Nervous/Anxious" in graphs instead of "EMA_01".
  • participant_group_map: Optional: diagnosis codes, cohorts or groups mapped to the first letter/digit of the participant ID. For example, "1": "Cohort 1" will group all participant IDs starting with 1 (10, 101, 10002, etc.) as "Cohort 1".
  • plot_colors: U-M Colors for graphs. Go Blue!
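
Taken together, a params block using the settings above might look like the following sketch. The values and exact structure are illustrative (for example, the output_dir value, the second item label, and the hex color codes are assumptions); check the shipped EMA-CleanR.Rmd for the actual defaults.

    params:
      input_file: "EMA-Data.csv"
      input_file_has_headers: true
      output_dir: "output"            # example value
      late_survey_cutoff_hour: 9      # 0 disables the late-survey correction
      ignore_surveys: ["Practice Survey", "Test Survey"]
      surveys_per_day: 4
      total_surveys_in_study: 112     # 4 surveys per day x 28 days
      ema_item_prefix: "EMA_"
      ema_item_labels:
        EMA_01: "Nervous/Anxious"
        EMA_02: "Sad/Down"            # example label
      participant_group_map:
        "1": "Cohort 1"
        "2": "Cohort 2"
      plot_colors: ["#00274C", "#FFCB05"]   # U-M blue and maize (example)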

 

Run the Analysis

  • In RStudio, open EMA-CleanR.Rmd and set it up per the instructions above.
  • To run the analysis, click the "Knit" button (or press Ctrl+Shift+K) to generate a new EMA-CleanR.html file, which contains a walk-through of the analysis and visualizations of your data. (A console-based alternative is sketched after this list.)

     
  • The output directory will contain exports of the data analysis in CSV format.

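
If you prefer to knit from the R console rather than the Knit button, rmarkdown::render() produces the same HTML report. This is a minimal sketch that assumes your working directory is the project folder; the alternate input file name is hypothetical.

    # Knit the report from the console (equivalent to clicking Knit in RStudio)
    library(rmarkdown)
    render("EMA-CleanR.Rmd", output_file = "EMA-CleanR.html")

    # Parameters in the YAML header can also be overridden at render time,
    # for example to point at a different input file (hypothetical name):
    render("EMA-CleanR.Rmd", params = list(input_file = "MyStudy-EMA.csv"))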

 

Navigating the Analysis

After running the analysis by clicking the Knit button in RStudio, you will be shown a web page containing the code, comments, tables and plots. This page is also saved locally as EMA-CleanR.html, which can be opened in any web browser.


  • Use the left menu to navigate through the different sections of the code.
  • Code blocks are displayed in gray, and the results of each block in white.
  • You can hide individual sections by clicking the Hide buttons. On the top right of the page, you can also click the blue Code button to show or hide all code blocks (useful for printing to PDF).

 

Files

  • /images: Contains screenshots and background images used in the demo files, this knowledge base article, and on GitHub.
  • /styles: Contains CSS style sheets with University of Michigan colors and digital accessibility features used when generating the HTML output.
  • .gitignore, .nojekyll: Internal files used by Git during check-in and by GitHub Pages when publishing the sample output. Do not modify.
  • index.html: Redirection page for the demo site. Do not modify.
  • LICENSE, NOTICE: Copyright and license information.
  • README.md: This is the home page shown in the GitHub repo.
  • EMA-Data.csv: Contains sample EMA data.
  • EMA-CleanR.Rmd: Contains the R/Markdown code for the project. Use this file to configure the input parameters and to run the analysis. Best opened in RStudio.
  • EMA-CleanR.html: Contains the output of the analysis in HTML. This file is refreshed every time you click the Knit button in the .Rmd file.

 

Code Walk-Through

The following is a short walk-through of the code. More information can be found in the comments throughout the code.

 

YAML Parameter Block

This block contains the project parameters used when running the analysis, in YAML format.

  • params: These are the input parameters to use when running the analysis. See the Parameters section above for more details.
  • output: These are configuration settings for the HTML output.
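
As an illustration of how these two keys fit together, the top of the .Rmd file is organized roughly like the sketch below. The title, html_document options, and style sheet path shown are examples of typical settings, not necessarily the exact ones in the repository.

    ---
    title: "EMA-CleanR"
    output:
      html_document:
        toc: true               # section menu on the left
        toc_float: true
        code_folding: show      # enables the Hide / Code buttons
        css: styles/umich.css   # hypothetical path; see the /styles folder
    params:
      input_file: "EMA-Data.csv"
      # ... remaining parameters (see Configure Project Parameters above)
    ---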

 

Project Setup

This section sets up the display options for code blocks going forward, reads the YAML parameters and converts them to R variables, and loads packages.

  • Global chunk options. Suppresses warnings/messages, echoes code for reproducibility.
  • Load required packages. Loads the R libraries used by the project (dplyr, corrplot, rlang, etc.). A full list of libraries is displayed.
    • Note that these external libraries each have their own licenses. Please refer to their individual websites or the R CRAN repository for license information.
  • Set project parameters. This section reads the parameters from YAML and adds them to the global variables. This makes project setup easier by unifying all parameters at the top, in an easy-to-read format.
    • The parameter values are shown in a white box, so everyone can see what assumptions were made when running the analysis.
    • Decision Point: Creates the output directory if it does not already exist.
  • Global functions. This section creates functions for common tasks used throughout the code, such as generating color gradients using the colors defined in the parameters.
  • Load data sets. This section reads the input file (EMA-Data.csv, or whichever file is named in the input_file parameter). A simplified sketch of these setup steps appears after this list.
    • The ingested row count is shown for a quick sanity check.
    • Assumption: This file already contains combined participant data, with one row per survey taken per participant. This is often the case when exporting directly from mobile technology platforms or clinical trial management systems.
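
A simplified sketch of what this setup section does; the variable and object names are illustrative, not necessarily those used in the actual code.

    # Promote selected YAML parameters to global variables
    input_file <- params$input_file
    output_dir <- params$output_dir

    # Decision point: create the output directory if it does not already exist
    if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)

    # Read the combined EMA export and show the ingested row count as a sanity check
    ema_raw <- read.csv(input_file, header = params$input_file_has_headers,
                        stringsAsFactors = FALSE)
    nrow(ema_raw)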

 

Cleanup EMA Data File

This section performs data cleaning, de-duplication, and grouping, and creates global variables for accessing the cleaned data.

Remove practice and test surveys

  • Filters out surveys named in ignore_surveys.
  • Assumption: Surveys to be excluded are correctly listed; could miss unexpected test names.
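
A minimal sketch of this step, reusing the ema_raw data frame from the setup sketch above (the actual code may use different object names):

    library(dplyr)

    # Drop practice/test surveys listed in the ignore_surveys parameter
    ema_clean <- ema_raw %>%
      filter(!surveyname %in% params$ignore_surveys)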

Sort dataset and filter invalid participant IDs

  • Removes rows with NA, empty, or invalid string identifiers ("NA", "null", etc.).
  • Special Case: Catches common issues with identifier entry errors.
  • The remaining rows after participant ID cleaning are shown.
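
A hedged sketch of the participant ID filter; the exact list of invalid placeholder strings checked by the real code is an assumption.

    # Placeholder strings that sometimes appear in identifier columns
    invalid_ids <- c("", "NA", "N/A", "null", "NULL")

    ema_clean <- ema_clean %>%
      arrange(participantidentifier, start_datetime) %>%
      filter(!is.na(participantidentifier),
             !trimws(as.character(participantidentifier)) %in% invalid_ids)

    nrow(ema_clean)  # remaining rows after participant ID cleaning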

Cleanup dates and de-duplicate

  • Parses start/end datetime as POSIX objects.
  • Creates date-only column (start_day).
  • Identifies and merges duplicate records (same participant, date, end time).
  • Decision Point: Merges partial responses via coalesce(); does not arbitrarily drop duplicates, but combines them.
  • The remaining rows after de-duplication are shown.
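
A simplified sketch of the date cleanup and de-duplication. The actual code may parse dates differently and merges duplicates with coalesce(); the summarise() call below approximates that by keeping the first non-missing value in each column.

    library(dplyr)
    library(lubridate)

    ema_clean <- ema_clean %>%
      mutate(
        start_datetime = ymd_hms(start_datetime),   # parse as POSIXct
        end_datetime   = ymd_hms(end_datetime),
        start_day      = as.Date(start_datetime)    # date-only column
      ) %>%
      # Duplicates = same participant, same day, same end time; merge rather than drop
      group_by(participantidentifier, start_day, end_datetime) %>%
      summarise(across(everything(), ~ first(na.omit(.x))), .groups = "drop")

    nrow(ema_clean)  # remaining rows after de-duplication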

Create participant groups

  • Prefix-based mapping for participant groups (diagnosis/cohort). The mapping is configured in the participant_group_map parameter.
  • Assumption: Group distinction is encoded in the participant ID format; the first character (letter or digit) of the participant ID determines the group.
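
A minimal sketch of the prefix-based grouping, assuming participant_group_map is a named list such as list("1" = "Cohort 1"); the new column names are illustrative.

    # Look up the group label from the first character of each participant ID
    group_lookup <- unlist(params$participant_group_map)

    ema_clean <- ema_clean %>%
      mutate(
        group_key         = substr(as.character(participantidentifier), 1, 1),
        participant_group = unname(group_lookup[group_key])
      )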

Identify and clean EMA items

  • Identifies survey items by prefix (per the ema_item_prefix parameter) and removes rows where all items are NA.
  • Decision Point: Only retains rows with some valid EMA response.
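
A minimal sketch of how the EMA item columns can be identified and empty rows dropped; the real code may differ in the details.

    # Columns whose names start with the configured prefix, e.g. "EMA_"
    ema_items <- grep(paste0("^", params$ema_item_prefix),
                      names(ema_clean), value = TRUE)

    # Keep only rows that have at least one non-missing EMA response
    ema_clean <- ema_clean %>%
      filter(if_any(all_of(ema_items), ~ !is.na(.x)))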

Create variables

  • Days in Study: Calculates day-in-study per participant, allowing for varying start dates.
  • Late Survey Correction: If a survey was completed after midnight but before late_survey_cutoff_hour, it is counted toward the previous day (implemented using lag()).
  • Survey Sequence: Assigns sequential survey number within each participant.
  • Weekday/Weekend Classification: Adds columns for day type for context in analyses.
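
A simplified sketch of these derived variables. Note that the actual code implements the late-survey correction with lag(), whereas the sketch below takes a simpler hour-based shortcut; the new column names are illustrative.

    library(dplyr)
    library(lubridate)

    ema_clean <- ema_clean %>%
      group_by(participantidentifier) %>%
      arrange(start_datetime, .by_group = TRUE) %>%
      mutate(
        # Count responses between midnight and the cutoff toward the previous day
        adjusted_day  = if_else(hour(start_datetime) < params$late_survey_cutoff_hour,
                                start_day - 1, start_day),
        # Day in study relative to each participant's own first survey
        day_in_study  = as.integer(adjusted_day - min(adjusted_day)) + 1,
        survey_number = row_number(),   # sequential survey within participant
        day_type      = if_else(weekdays(adjusted_day) %in% c("Saturday", "Sunday"),
                                "Weekend", "Weekday")
      ) %>%
      ungroup()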

Time to completion (TTC) flag

  • Calculates time difference (seconds) for survey completion.
  • Sets cutoff for unusually high completion times: mean + 2 SD.
    • TTCFlag_High: Flag for abnormally slow surveys.
  • Sets minimum for abnormally fast completion: < 1 second per EMA item.
    • TTCFlag_Low: Flag for possibly inattentive (too fast) responses.
  • Assumption: Typical response times cluster around group mean, distributions are reasonable.
  • Plots a histogram of time differences.
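
A minimal sketch of the time-to-completion flags, reusing the ema_items vector from the earlier sketch; the intermediate variable names are illustrative.

    # Seconds between survey start and end
    ema_clean <- ema_clean %>%
      mutate(ttc_seconds = as.numeric(difftime(end_datetime, start_datetime,
                                               units = "secs")))

    # Unusually slow: more than mean + 2 standard deviations
    high_cutoff <- mean(ema_clean$ttc_seconds, na.rm = TRUE) +
      2 * sd(ema_clean$ttc_seconds, na.rm = TRUE)

    # Unusually fast: less than 1 second per EMA item
    low_cutoff <- length(ema_items) * 1

    ema_clean <- ema_clean %>%
      mutate(TTCFlag_High = ttc_seconds > high_cutoff,
             TTCFlag_Low  = ttc_seconds < low_cutoff)

    # Histogram of time differences
    hist(ema_clean$ttc_seconds, main = "Time to completion (seconds)", xlab = "Seconds")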

 

Flags

See code comments for details.

 

Compliance

See code comments for details.

 

Item Distribution

See code comments for details.

 

Correlation

See code comments for details.
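
The project's exact approach lives in the code comments, but as a hedged illustration, a correlation matrix of the EMA items could be computed and drawn with corrplot (one of the packages loaded earlier) roughly like this, assuming the items are numeric:

    library(corrplot)

    # Pairwise correlations between EMA items, ignoring missing responses pairwise
    item_cor <- cor(ema_clean[, ema_items], use = "pairwise.complete.obs")
    corrplot(item_cor, method = "color", type = "lower")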

 

Output

See code comments for details.

 

 


About the Author

Gabriel Mongefranco is a Mobile Data Architect at the University of Michigan's Eisenberg Family Depression Center. Gabriel has over a decade of experience with automation, data analytics, database architecture, dashboard design, software development, and technical writing. He supports U-M researchers with data cleaning, data pipelines, automation and enterprise architecture for wearables and other mobile technologies.

