Getting text and data from PDFs or images with optical character recognition (OCR)

Questions

What computers, software, and other tools are available to extract the text and data I need from a PDF, JPG, or other document? Where is the best place to go for help?

Environment

  • Researchers
  • Paper documents that need to be converted into tables or searchable text for tagging, analysis or to be used as content in other documents.
  • Digital documents that need to be converted into tables or searchable text for tagging, analysis or to be used as content in other documents.

Answer

Occasionally, the data is already available in a digitized format. Speak with the subject specialist first, since you may be able to skip the extraction step altogether.

Expert Help: The ScholarSpace has the most software and support for optical character recognition (OCR). CSCAR and HPC support can also help if you are using programming methods.

Computers and Tools available:

  • The ScholarSpace has the best-equipped machines for running the software. Additionally, some of the software below can be installed on request on other U-M-owned computers, including some departmental computer labs and campus computing sites.
  • LSA Technology Services also has field equipment that you can try out.


Software:

  • ABBYY PDF Transformer is available for Mac and Windows and is licensed for Windows on university-owned resources. It can be installed in various labs with university-owned Windows machines.
  • PDFpenPro is available for Mac and is licensed for university-owned resources. It can be installed in various labs with university-owned Mac hardware.
  • Adobe Acrobat is available for Mac and Windows and is licensed by LSA for university-owned machines. It can be installed in various labs on university-owned equipment.
  • ABBYY FineReader is available for Mac and Windows. It is available for use in the ScholarSpace in the library. You must purchase a license to use it elsewhere.
  • OmniPage is available for Windows. It is available for use in the ScholarSpace in the library. You must purchase a license to use it elsewhere.
  • Scanner Pro 7 turns your iPhone into a scanner that performs OCR directly. It requires an app purchase.
  • Prizmo is available for Mac and iOS and scans directly to OCR. It requires a license purchase.
  • With more programming skills:
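
As one illustration of the programming route, OCR output often arrives as plain text that still has to be turned into a table. The following is a minimal Python sketch of that post-processing step only. It is not a ScholarSpace or CSCAR method. It assumes the OCR step has already produced text in which columns are separated by a tab or by two or more spaces. The function names and the sample data are made up for the example.

```python
import csv
import io
import re


def ocr_text_to_rows(text):
    """Split OCR'd plain text into rows of cells.

    Assumption: columns are separated by a tab or by two or more
    whitespace characters; single spaces stay inside a cell.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the OCR step
        rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text for a spreadsheet or analysis tool."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical OCR output, invented for this example.
sample = """
Site         Year   Count
North Field  2019   42
Hilltop      2020   17
"""

table = ocr_text_to_rows(sample)
print(rows_to_csv(table))
```

Real OCR output is usually messier than this sample (misread characters, uneven spacing, merged columns), so treat the splitting rule as a starting point and check the resulting table against the original document.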

Additional notes

The ScholarSpace has experts on using OCR to extract data.

Details

Article ID: 1827
Created
Wed 5/27/20 11:23 AM
Modified
Mon 8/31/20 9:16 AM