Developing data sets

Tags how-to


Overview of data collection and creation. 

NOTE: See the IRB and PEERRS training sites before you begin collecting or working with Human Subjects Data. If you have completed similar training at another institution, you can request a waiver.



Data Creation/Procurement

There are various trainings that may be relevant to all kinds of data.

Survey and Interview Collection

  1. Training:
    1. There is almost always a course on survey design and collection.
    2. Consulting is also available at the SRC.
    3. Coursera sometimes has courses on survey design and collection.
    4. Advisors and faculty mentors are also great resources for learning about collection.
  2. Tools and Services:
    1. Qualtrics — A survey tool for collecting data including voice data.
    2. Paper form/pdf to digital — Data entry service:
      1. Digital Divide Data
      2. Casting Words
      3. Rev Transcription
    3. Field Work
      1. LSA Field Equipment Evaluation Pool
      2. Consultation on this is also available via LSA Research Support.
      3. Tools that can help you record what you collected, where you collected it, and its metadata:
        1. For photos, for example, Lightroom
        2. For documents, for example, Atlas.ti

Purchasing data, or data with agreements that require signatures

See the UFA process and talk to your department about the purchase.

Scanning (OCR) to create data

  1. OCR: getting text and data from PDFs or images
  2. Training and support for OCR data extraction is available.
  3. Tools and Services:
    1. The Scholar Space has software and machines you can use.
      1. ABBYY
      2. Omnipage
      3. Adobe Professional
    2. You can purchase OCR software. If you use university funds, such as those from Rackham, we can help you reach out to UmichITAM to make your purchase and get the best discount possible. Some good options are:
      1. Adobe Professional
      2. Omnipage
      3. ABBYY
    3. LSA Research Support has several mouse scanners that do OCR to Word/CSV, including in other languages. You can try them before you buy, and they come with conversion software.
    4. The Library can assist with big projects.

Recordings as data

  1. Training for using recorders or doing digital recording:
  2. Tools and Services:
    1. Audacity
    2. Dragon mobile app
    3. Digital recorders
    4. Qualtrics
    5. Transcription Services for recordings:

Scraping for data

  1. Training:
    1. Big Data Summer Camp
    2. ARC: CSCAR events
    3. Python:
      2. this can be printed at the book station at the Library.
      3. Coursera has some great SI classes:
    4. Code 4 everyone
    5. Codecademy
    6. Coursera
  2. Tools and Services:
    1. Python tools:
      1. For parsing text, Beautiful Soup
      2. For filling web forms and getting data from web pages:
        1. RoboBrowser
        2. mechanize is another option, though it is not as reliable
    2. R tools:
      1. xml
      2. curl
    3. The Scholar Space suggests looking at Voyant or AntConc.
    4. wget for downloading pages; see the main site for wget.
    5. Check whether the source of your data has an API, which lets you write code to pull the data directly. Each API is different, and it takes some reading to find the right calls for what you need. For a consultation on how to approach this or any of the tools above, email
    6. Other options:
      1. For downloading multiple PDFs, there are plugins for browsers:
      2. Mass download of full websites:
        1. Mac
        2. Windows
      3. Pay for research programming from LSA
    7. Whatever you scrape, make sure you keep a replicable copy in an archive:
      1. The Wayback Machine: see this blog post. They also have an API for searching existing data. Note: Archive-It enables you to capture, manage, and search collections of digital content without any technical expertise or hosting facilities. Visit Archive-It to build and browse the collections.
      2. can do this as well.
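
As a rough illustration of the scraping steps above, the sketch below pulls PDF links out of a page and builds a lookup URL for the Wayback Machine's availability endpoint, so you can check whether an archived copy already exists. It uses Python's standard-library html.parser as a stand-in for Beautiful Soup so that nothing needs to be installed. The sample HTML, the example.org addresses, and the function names are all hypothetical, and the endpoint address reflects our understanding of the Internet Archive's public availability API, so check their documentation before relying on it.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones.
            self.links.append(urljoin(self.base_url, href))


def pdf_links(html, base_url):
    """Return absolute URLs of links ending in .pdf, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if url.lower().endswith(".pdf")]


def wayback_lookup_url(page_url):
    """Build a query URL for the Wayback Machine availability endpoint (assumed address)."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


# Hypothetical page content; in practice this would come from a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="reports/2019.pdf">2019 report</a>
  <a href="/about.html">About</a>
  <a href="https://example.org/data/2020.PDF">2020 report</a>
</body></html>
"""

if __name__ == "__main__":
    for link in pdf_links(SAMPLE_HTML, "https://example.org/library/"):
        print(link)
    print(wayback_lookup_url("https://example.org/library/"))
```

Nothing here touches the network. In practice you would first download the page (for example with wget) and pass its text to pdf_links, then keep the list of links alongside your data so the collection can be repeated.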

Downloading public datasets

  1. Training: Check the site that hosts the data you are requesting for codebooks, sample code, or data dictionaries you can use in your analysis.
  2. Tools and Services:
    1. Sources of Data (either free or free to you)
      1. Research Guide to finding sources for statistics
      2. Research Guide to finding geospatial data
    2. Talk to your subject specialist to learn what other data is available at U-M Libraries.
    3. Check out ICPSR.
    4. Browser add-on to get multiple files:
      1. FlashGot
      2. DownThemAll
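
Public datasets often store answers as numeric codes that only make sense next to the codebook or data dictionary. The sketch below shows one way to apply such a dictionary to a downloaded CSV file using only the Python standard library. The variable names, codes, and labels are made up for illustration; a real codebook from your data provider will differ.

```python
import csv
import io

# Hypothetical codebook: variable name -> {coded value: label}.
CODEBOOK = {
    "sex": {"1": "Male", "2": "Female", "9": "Not reported"},
    "region": {"1": "Northeast", "2": "Midwest", "3": "South", "4": "West"},
}

# Hypothetical extract of a downloaded data file.
SAMPLE_CSV = "id,sex,region,age\n101,1,2,34\n102,2,4,51\n103,9,7,29\n"


def decode_rows(csv_text, codebook):
    """Replace coded values with labels.

    Columns without a codebook entry, and codes the codebook does not
    list, are left unchanged so nothing is silently lost.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    decoded = []
    for row in reader:
        decoded.append(
            {col: codebook.get(col, {}).get(val, val) for col, val in row.items()}
        )
    return decoded


if __name__ == "__main__":
    for row in decode_rows(SAMPLE_CSV, CODEBOOK):
        print(row)
```

Leaving unknown codes untouched, as with the region value 7 in the sample, makes it easier to spot places where the data and the codebook disagree.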

Visualization and GIS

  1. Training
    1. The Clark Library is available by appointment. Email or visit their website.
    2. Classes (search for GIS)
    3. Online self-study. There are some free courses; talk to the Clark Library about what our agreement may also include.
  2. Tools and Services:
    1. LSA TS has GIS consulting and server infrastructure.
    2. Clark Library has a lab for GIS. You can install ArcGIS and use QGIS as well.
    3. Explore data visualization at the Digital Media Commons.


Data Security

The IRB's expectations for safely storing research data are laid out in this web page.

When you word your IRB application, imagine being in the shoes of a person reviewing it for the first time. The description of how sensitive data is handled from start (when it is gathered from participants) to finish (when it is expunged or archived) should be clear and complete. It could say something like this:

  • Initial data collection will be in the form of audio recordings of participants via one-on-one video interviews as permitted by U-M's sensitive data guide.

  • Audio recordings of the video meetings will be captured then downloaded to a laptop owned by U-M.

  • The laptop is configured and maintained by LSA TS appropriately for sensitive identifiable human subject research. It has full disk encryption, receives security updates within a week of release, and... (write out the relevant details)

  • The interviews and recording downloads/uploads will be done with the laptop on the MWireless network at (some U-M facility).

  • The audio recordings will be uploaded from the laptop to Dropbox at U-M, as permitted by U-M's sensitive data guide.

Note the language "as permitted by U-M's sensitive data guide" intentionally implies that the study team is using the service in the way stipulated in the sensitive data guide.

The IRB data security guidelines apply the same way to a personally owned computer if you are a student, but it will take more effort to ensure that the computer is appropriately secured. SPG 601.33 puts the onus on the student to secure the computer. The sensitive data guide provides some guidance as well.

Additionally, LSA TS offers consulting. Reach out if you have questions or you need to bring in secure data with restrictions on its use or disclosure.


Data Storage

  • ITS resources:
    1. Dropbox at U-M
    2. Google Docs
  • LSA TS resources
  • Secure data storage for field work:
    1. Encrypted external hard drives are a good idea, particularly when working in rural areas where network backup may not be feasible.
    2. PIN-protected and encrypted hard drives are available:


Data Backup and Archiving

  • Backup:
    1. LSA TS has backup for U-M owned resources.
    2. Here are the Top 10 reviews for backup options for Windows and Mac.
  • Archive:
    1. U-M Resources: Deep Blue
    2. Other: Many professional organizations offer archiving of data as do many journals. If you have grant funding, it may stipulate where you must put your data.
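
Whichever backup or archive option you choose, it helps to confirm that the copy matches the original. The sketch below is one possible approach, not an LSA TS tool: it records a SHA-256 checksum for every file in a folder and reports files that are missing or different in the copy. The function names are hypothetical and the folder paths in the usage lines are placeholders.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def compare_manifests(original, copy):
    """Return relative paths that are missing from the copy or whose contents differ."""
    return sorted(
        name for name, digest in original.items() if copy.get(name) != digest
    )


if __name__ == "__main__":
    # Placeholder folder names; replace with your own data and backup locations.
    source, backup = Path("research_data"), Path("backup_copy")
    if source.is_dir() and backup.is_dir():
        problems = compare_manifests(build_manifest(source), build_manifest(backup))
        print("Files to re-copy:", problems)
```

Saving the manifest next to the archived data also gives anyone who later retrieves the files a way to check that nothing changed in storage.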


Data Management Resource


Additional notes

Sometimes you may need funding to get your research data purchased or collected. Look to the following resources:


Article ID: 1709
Wed 5/27/20 10:04 AM
Tue 6/20/23 12:34 PM