Scalable OCR pipelines using Python, Tensorflow and Tesseract

Scale

06/12/2018 - 15:40 to 16:00

Palais Atelier

short talk (20 min)

Intermediate

Session abstract:

In this talk we take a trip through the world of text recognition with free software and walk step by step through the individual stages of a flexible and scalable OCR application. A live demo shows how Tesseract is used for text recognition and how a little pre-processing with OpenCV can significantly improve the quality of the results. The documents are then stored and indexed in Elasticsearch to enable full-text search. All of this takes just a few lines of code, written in the spirit of interactive programming with Jupyter.
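The stages named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the speaker's actual code: the function names, the choice of Otsu binarization as the pre-processing step, the index name `scanned-docs`, and the `text` field are all assumptions made for the example. It relies on the `opencv-python` and `pytesseract` packages plus an installed Tesseract binary for the OCR step, and it only builds the request body for Elasticsearch's bulk endpoint rather than sending it.

```python
import json


def ocr_image(path, lang="eng"):
    """Binarize a scanned page with OpenCV, then run Tesseract on it."""
    # Imported here so the payload helper below works without the OCR stack.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method picks a global threshold automatically (assumed
    # pre-processing step; the talk may use different operations).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang)


def to_bulk_payload(pages, index="scanned-docs"):
    """Build a newline-delimited body for Elasticsearch's _bulk endpoint.

    `pages` is an iterable of (document_id, recognized_text) pairs.
    """
    lines = []
    for doc_id, text in pages:
        # Each document needs an action line followed by its source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"text": text}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

In a notebook, one might call `ocr_image` on each scan, collect the results as `(id, text)` pairs, and send the string returned by `to_bulk_payload` to the cluster's `_bulk` endpoint with the content type `application/x-ndjson`.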

Agenda

Quirks and pitfalls in text recognition of scanned documents
Potential of pre-processing with OpenCV
Use Tesseract at scale
Quantify, compare and re-evaluate results
Use of TensorFlow in a production-ready application
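One common way to quantify and compare OCR results is the character error rate (CER): the edit distance between the recognized text and a hand-checked reference, divided by the length of the reference. The listing does not say which metric the talk uses, so the sketch below is an assumption chosen for illustration, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Computing this score for the same page with and without a pre-processing step gives a simple number for comparing the two variants; lower is better, and values above 1.0 are possible when the OCR output is much longer than the reference.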

Video:

#bbuzz 2018: Mark Keinhörster – Scalable OCR pipelines using Python, Tensorflow and Tesseract

Slides:

bbuzz_2018.pdf

Berlin Buzzwords
