Optical Character Recognition (OCR) and Working with Messy Text Data

May 19, 2022, 2 p.m. - 4 p.m.

Organizer - DataLab: Data Science and Informatics


Optical Character Recognition (OCR) is a set of computational techniques for converting scanned images of printed or handwritten text into computer-readable formats. OCR makes documents more searchable and enables analyses such as text mining and natural language processing (NLP). This workshop will provide an overview of existing and emerging tools for unlocking the text in scanned images, and will demonstrate practical techniques for OCR with Python using the Tesseract OCR engine. The workshop will also include a discussion and practical examples of evaluating OCR viability, as well as tips for using OCR-extracted data in NLP pipelines. This workshop qualifies as an elective for the Text Mining and NLP micro-credential through UC Davis GradPathways.
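As a rough illustration of what OCR with Python and Tesseract can look like, the sketch below reads one image and lightly cleans the result before it goes into an NLP pipeline. It is not taken from the workshop notebook. It assumes the third-party packages pytesseract and Pillow and the Tesseract binary are installed, and the function names ocr_image and clean_ocr_text are hypothetical.

```python
import re
import unicodedata


def ocr_image(path, lang="eng"):
    """Run Tesseract on one image file and return the raw text.

    Assumption: pytesseract, Pillow, and the Tesseract binary are installed.
    The imports are deferred so clean_ocr_text works without them.
    """
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def clean_ocr_text(raw):
    """Light cleanup of OCR output before passing it to an NLP pipeline."""
    # Normalize compatibility characters, e.g. the "fi" ligature becomes "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "recog-\nnition" -> "recognition".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that are empty after trimming.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

A call such as clean_ocr_text(ocr_image("page.png")) would chain the two steps, where "page.png" is a placeholder file name. The hyphen rule is deliberately simple: it also merges genuine compounds that happen to break at a line end, so real pipelines usually check the joined word against a dictionary.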

Learning Objectives

After this workshop, learners should be able to:

· Define "OCR"

· Describe an example of when OCR has aided “distant” (computational) reading and analysis

· List potential off-the-shelf solutions for simple OCR

· Identify possible technical challenges for performing OCR on a given document

· Describe an OCR workflow

· Use the course notebook to perform OCR on provided documents

· Assess OCR output and propose solutions for increasing accuracy
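One common way to assess OCR accuracy is the character error rate (CER): the edit distance between the OCR output and a hand-corrected reference transcription, divided by the length of the reference. The sketch below is a minimal standard-library version offered as an illustration, not as the workshop's own method. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance from hypothesis to reference, per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, character_error_rate("recognition", "recoqnition") counts one substitution over eleven reference characters. Comparing CER before and after a change, such as a different scan resolution or image preprocessing step, gives a simple way to check whether the change actually helped.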

Software: Python

Instructor: Arthur Koehl; TA: Tyler Shoemaker

Instructor Bio

Arthur Koehl is a research data scientist. He graduated from UC Davis with degrees in history, economics, and computer science. Prior to DataLab, he worked for several years as a scientific computing intern at the Center for BioImaging Sciences at the National University of Singapore, where he learned the basics of Linux system administration. His interests include natural language processing, computer vision, and web programming. At DataLab, he develops tools and provides technical expertise on interdisciplinary research projects with an emphasis on the humanities and social sciences.


Registration is closed for this event