Developing data sets

Data Creation/Procurement

Where do you get it?

 Caution
See the IRB and PEERRS training sites before you begin collecting or working with Human Subjects Data. If you have completed similar training at another institution, you can request a waiver.
 Note
There are various trainings that may be relevant to all kinds of data.

Survey and Interview Collection

  1. Training:
    1. There is almost always a course on survey design and collection.
    2. Consulting is also available at the SRC.
    3. Coursera sometimes has courses on survey design and collection.
    4. Advisors and faculty mentors are also great resources for learning about collection.
  2. Tools and Services:
    1. Qualtrics: a survey tool for collecting data, including voice data.
    2. BlueJeans for interviews: a video conference tool that is cleared for collecting qualitative research data from human subjects. See also BlueJeans for Qualitative Interviews.
    3. Paper form/PDF to digital (data entry services):
      1. Digital Divide Data
      2. Casting Words
      3. Rev Transcription
    4. Field Work
      1. LSA Field Equipment Evaluation Pool
      2. Consultation on this is also available via LSA Research Support.
      3. Tools that can help you record what you collected, where you collected it, and its metadata:
        1. For photos, for example, Lightroom
        2. For documents, for example, Atlas.ti

Purchasing data, or data with agreements that require signatures

See the UFA process and talk to your department about the purchase.

Scanning (OCR) to create data

  1. OCR extracts text and data from PDFs or images.
  2. Training and support for OCR data extraction is available.
  3. Tools and Services:
    1. The Scholar Space has software and machines you can use.
      1. ABBYY
      2. Omnipage
      3. Adobe Professional
    2. You can purchase OCR software. If you use university funds, such as those from Rackham, we can help you arrange the purchase through UmichITAM to get the best discount possible. Some good options are:
      1. Adobe Professional
      2. Omnipage
      3. ABBYY
    3. LSA Research Support has several mouse scanners that you can try before you buy. They do OCR to Word or CSV, handle other languages, and come with conversion software.
    4. The Library can assist for big projects.

Recordings as data

  1. Training for using recorders or doing digital recording: http://www.lib.umich.edu/digital-media-commons/training.
  2. Tools and Services:
    1. BlueJeans and BlueJeans for Qualitative Interviews
    2. Audacity
    3. Dragon mobile app
    4. Digital recorders
    5. Qualtrics
    6. Transcription Services for recordings:

Scraping for data

  1. Training:
    1. Big Data Summer Camp
    2. ARC: CSCAR events
    3. Python:
      1. http://learnpythonthehardway.org/
      2. http://www.pythonlearn.com/book.php (this book can be printed at the book station at the Library)
      3. Coursera has some great SI classes:
        1. https://www.coursera.org/specializations/data-science-python
        2. http://www.lsa.umich.edu/cg/cg_detail.aspx?content=2160SOC542001&termArray=f_17_2160
    4. Code 4 everyone
    5. Codecademy
    6. Coursera
  2. Tools and Services:
    1. Python tools:
      1. For parsing text: Beautiful Soup
      2. For filling in web forms and getting data from web pages:
        1. RoboBrowser: https://pypi.python.org/pypi/robobrowser
        2. mechanize is another option, though it is not as reliable.
    2. R tools:
      1. xml
      2. curl
    3. The Scholar Space suggests looking at Voyant or AntConc.
    4. wget for downloading pages; see the main site for wget.
    5. Check whether the source of your data has an API, which lets you write code to pull the data directly. Each API is different, and it takes some reading to find the right calls to get what you need. For a consultation on how to approach this or any of the tools above, email lsait@umich.edu.
    6. Other options:
      1. For downloading multiple PDFs, there are browser plugins:
        1. https://chrome.google.com/webstore/detail/multiple-file-download/opncjdadngnekakilfcgjlgbmekljdbm?hl=en-US
        2. https://addons.mozilla.org/en-US/firefox/addon/flashgot/
        3. https://chrome.google.com/webstore/detail/tab-save/lkngoeaeclaebmpkgapchgjdbaekacki
      2. Mass download of full websites:
        1. Mac
        2. Windows
      3. Pay for research programming from LSA
    7. Whatever you scrape, make sure you keep a replicable copy in an archive:
      1. The Wayback Machine (https://archive.org/web/). See this blog post (https://blog.archive.org/2017/01/25/see-something-save-something/); there is also an API for searching existing captures (https://archive.org/help/wayback_api.php). Note: Archive-It enables you to capture, manage and search collections of digital content without any technical expertise or hosting facilities. Visit Archive-It to build and browse the collections.
      2. archive.is (http://archive.is/) can do this as well.
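
As one illustration of pulling data from an API, the sketch below asks the Wayback Machine availability API whether a page you scraped already has an archived capture. It is a minimal example that uses only the Python standard library. The function names are made up for this example, and the response fields it reads (archived_snapshots, closest, available, url) are assumed from the Wayback API help page linked above, so check that page before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint described on the Wayback API help page.
WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the request URL for one page (hypothetical helper)."""
    params = {"url": page_url}
    if timestamp:
        # YYYYMMDD, for example the day you scraped the page.
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABLE + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the URL of the closest archived capture, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url, timestamp=None):
    """Query the API over the network and return the capture URL, if any."""
    with urlopen(build_query(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling lookup("example.com", "20200527") needs a network connection. If it returns None, the page has no capture near that date and you may want to save one yourself. Other APIs follow the same pattern (build a URL, fetch it, parse the response), but each has its own parameters and limits.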

Downloading public datasets

  1. Training: See the site for the data you are requesting to see if they have codebooks, sample coding or data dictionaries you can utilize in your analysis.
  2. Tools and Services:
    1. Sources of Data (either free or free to you)
      1. Research Guide to finding sources for statistics
      2. Research Guide to finding geospatial data
    2. Talk to your subject specialist to learn what other data is available at U-M Libraries.
    3. Check out ICPSR.
    4. Browser add-on to get multiple files:
      1. FlashGot
      2. DownThemAll
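
If a browser add-on is not an option, a short script can list the files that a dataset page links to. The sketch below uses Python's built-in html.parser as a stand-in for Beautiful Soup so that it runs without installing anything. The class name, the extension list, and the example.org address are illustrative assumptions, not part of any particular data site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FileLinkParser(HTMLParser):
    """Collect links whose targets end in one of the wanted extensions."""

    def __init__(self, base_url, extensions=(".pdf", ".csv", ".zip")):
        super().__init__()
        self.base_url = base_url
        self.extensions = extensions
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(self.extensions):
            # Resolve relative links against the page they came from.
            self.links.append(urljoin(self.base_url, href))


def find_file_links(html, base_url):
    """Return absolute URLs for every matching file link in the HTML."""
    parser = FileLinkParser(base_url)
    parser.feed(html)
    return parser.links


# A made-up listing page; in practice this would be the downloaded page source.
sample = """
<ul>
  <li><a href="codebook.pdf">Codebook</a></li>
  <li><a href="/data/2019.csv">2019 data</a></li>
  <li><a href="about.html">About</a></li>
</ul>
"""
for link in find_file_links(sample, "https://data.example.org/study/index.html"):
    print(link)
```

The resulting list can be passed to wget or a download loop. Links that carry a query string after the file name will not match the simple extension check, and you should confirm that the site's terms of use allow bulk downloads before fetching everything.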

Visualization and GIS

  1. Training
    1. The Clark Library is available by appointment. Email clarklibrary@umich.edu or visit their website.
    2. Classes (search for GIS)
    3. Online self-study. Some courses are free; ask the Clark Library what our agreement may also include.
  2. Tools and Services:
    1. LSA TS has GIS consulting and a Server infrastructure.
    2. Clark Library has a lab for GIS. You can install ArcGIS and use QGIS as well.
    3. Explore data visualization at the Digital Media Commons.

Data Security

The IRB's expectations for safely storing research data are laid out in this web page.

When you word your IRB application, imagine being in the shoes of a person reviewing it for the first time. The description of how sensitive data is handled from start (when it is gathered from participants) to finish (when it is expunged or archived) should be clear and complete. It does not have to detail how BlueJeans protects sensitive data, but, for example, it could say something like this:

  • Initial data collection will be in the form of audio recordings of participants via one-on-one video interviews over BlueJeans, as permitted by U-M's sensitive data guide.
  • Audio recordings of the video meetings will be captured in BlueJeans, then downloaded to a laptop owned by U-M.
  • The laptop is configured and maintained by LSA TS appropriately for sensitive identifiable human subject research. It has full disk encryption, receives security updates within a week of release, and... (write out the relevant details)
  • The interviews and recording downloads/uploads will be done with the laptop on the MWireless network at (some U-M facility).
  • The audio recordings will be uploaded from the laptop to Box at U-M, as permitted by U-M's sensitive data guide.

Note that the language "as permitted by U-M's sensitive data guide" intentionally implies that the study team is using the service in the way stipulated in the sensitive data guide.

A special note on the Box page includes this: "Be sure to use a Shared Account in U-M Box for sensitive university data."

IRB data security guidelines apply in the same way to a personally owned computer if you are a student, but it will take more work to ensure that the personal computer is appropriately secured. SPG 601.33 puts the onus on the student to secure the computer. The sensitive data guide provides some guidance as well.

Additionally, LSA TS offers consulting. Reach out if you have questions or you need to bring in secure data with restrictions on its use or disclosure.

Data Storage

  • ITS resources:
    1. U-M Box
    2. Google Docs
  • LSA TS resources (LSA Research Storage)
  • Secure data storage for field work:
    1. Encrypted external hard drives are a good idea, particularly when working in rural areas where network backup may not be feasible.
    2. PIN-protected and encrypted hard drives are available:
      1. http://www.ironkey.com/en-US/
      2. https://www.apricorn.com/

Data Backup and Archiving

  • Backup:
    1. LSA TS has backup for U-M owned resources.
    2. Here are the Top 10 reviews for backup options for Windows and Mac.
  • Archive:
    1. U-M Resources: Deep Blue
    2. Other: Many professional organizations offer archiving of data as do many journals. If you have grant funding, it may stipulate where you must put your data.
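
Whichever backup or archive you choose, it can help to confirm that the copy matches the original. The sketch below is one possible way to do that with SHA-256 checksums from Python's standard library; the function names are made up for this example, and it is not a feature of any of the services listed above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large recordings need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 checksum."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(original, copy):
    """List files that are missing from the copy or whose contents differ."""
    return sorted(
        name for name, checksum in original.items()
        if copy.get(name) != checksum
    )
```

Build one manifest for the working folder and one for the backup, then pass both to changed_files. An empty list means every original file is present in the copy with identical contents. Saving the manifest alongside archived data also gives later users a way to check their download.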

Data Management Resource

Additional notes

Sometimes you may need funding to get your research data purchased or collected. Look to the following resources:

Metadata

UFA Research Data Scraping OCR

Details

Article ID: 1709
Created: Wed 5/27/20 10:04 AM
Modified: Thu 10/1/20 10:28 AM