Extracting text and data from PDFs or images using optical character recognition (OCR)

Summary

Information on the software and other tools available for extracting text and data from a PDF, JPG, or other document, and where to go for help.

Environment

  • Researchers
  • Paper documents that need to be converted into tables or searchable text for tagging, analysis, or reuse as content in other documents.
  • Digital documents that need to be converted into tables or searchable text for tagging, analysis, or reuse as content in other documents.
  • ScholarSpace

Directions

Where to go for help

  • The ScholarSpace has the most software and support for optical character recognition (OCR). CSCAR and HPC support can also help if you are using programming methods.
    • Occasionally, the data is already available in digital form. Check with the subject specialist first, since that may save you the time of extraction.

Computers and Tools available

  • The ScholarSpace has the machines best set up to run the software. Some of the software below can also be installed, upon request, on various U-M owned computers, including some departmental computer labs and campus computing sites.
  • LSA Technology Services also has field equipment that you can test with.


Software

  • ABBYY PDF Transformer is available for Mac and Windows and is licensed for Windows on university-owned resources. It can be installed in various labs with university-owned Windows machines.
  • PDFpenPro is available for Mac and is licensed for university-owned resources. It can be installed in various labs with university-owned Mac hardware.
  • Adobe Acrobat is available for Mac and Windows and is licensed by LSA for university-owned machines. It can be installed in various labs on university-owned equipment.
  • ABBYY FineReader is available for Mac and Windows. It is available for use in the ScholarSpace in the library. To use it elsewhere, you must purchase a license.
  • OmniPage is available for Windows. It is available for use in the ScholarSpace in the library. To use it elsewhere, you must purchase a license.
  • Scanner Pro 7 turns your iOS device into a scanner that performs OCR directly. This requires an app purchase.
  • Prizmo is available for Mac and iOS and performs OCR directly on what it scans. This requires the purchase of a license.
  • With more programming skills: