Scalable OCR pipelines using Python, TensorFlow and Tesseract

Scale
06/12/2018 - 15:40 to 16:00
Palais Atelier
short talk (20 min)
Intermediate

Session abstract: 

In this talk we take a trip through the world of text recognition with free software and walk step by step through the individual stages of a flexible and scalable OCR application. A live demo shows how Tesseract is used for text recognition and how a little pre-processing with OpenCV can significantly improve recognition quality. The documents are then stored and indexed in Elasticsearch to allow full-text search. All of this takes just a few lines of code, in the spirit of interactive programming with Jupyter.
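
The stages named above (pre-process, recognize, index) could be sketched roughly as follows. This is not the code from the talk, only a minimal illustration under some assumptions: the pytesseract wrapper is used to call Tesseract, Otsu binarization stands in for whatever pre-processing the demo applies, and the helper names, the "scans" index name and the document fields are hypothetical.

```python
from pathlib import Path


def ocr_image(path, lang="eng"):
    """Binarize a scanned page with OpenCV, then run Tesseract on it."""
    # Imported lazily so the helpers below work without the OCR dependencies.
    import cv2
    import pytesseract

    image = cv2.imread(str(path))
    if image is None:
        raise FileNotFoundError(f"cannot read image: {path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method picks a global threshold automatically (assumed
    # pre-processing step; the talk may use something different).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang)


def build_document(path, text):
    """Shape one OCR result as a JSON-serializable document (hypothetical fields)."""
    return {"filename": Path(path).name, "text": text.strip()}


def index_document(es, doc, index="scans"):
    """Store the document for full-text search.

    `es` is expected to be an elasticsearch.Elasticsearch client; the
    `document=` keyword assumes a recent client version (older ones use `body=`).
    """
    return es.index(index=index, document=doc)
```

In a Jupyter cell, the three helpers would typically be chained per file, for example `index_document(es, build_document(p, ocr_image(p)))`, with `es` created against whatever cluster is available.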

Agenda

  • Quirks and pitfalls in text recognition of scanned documents
  • Potential of pre-processing with OpenCV
  • Use Tesseract at scale
  • Quantify, compare and re-evaluate results
  • Use of TensorFlow in a production-ready application
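
Two of the agenda items, running OCR at scale and quantifying results, lend themselves to a small illustration. The sketch below is an assumption about how this might look, not the approach presented in the talk: `ocr_many` fans a hypothetical per-file OCR function out over a thread pool, and `character_error_rate` scores recognized text against a hand-checked reference using Levenshtein distance, one common way to compare OCR output.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_many(paths, ocr_fn, max_workers=4):
    """Apply `ocr_fn` (any callable taking a path) to many files concurrently."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(ocr_fn, paths))  # map preserves input order
    return dict(zip(paths, texts))


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edits needed to turn the OCR output into the reference, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Comparing the error rate of the same page with and without pre-processing gives a concrete number for how much a given step helps. Threads are used here on the assumption that the OCR function spends its time in an external Tesseract process; a process pool or a job queue would be alternatives for heavier in-process work.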

Video: 

Slide: