Session abstract:
In this talk we take a tour through the world of text recognition with free software, going step by step through the individual components of a flexible and scalable OCR application. A live demo shows how Tesseract is used for text recognition and how a little pre-processing with OpenCV can significantly improve the quality of the results. The documents are then stored and indexed in Elasticsearch to enable full-text search. All of this takes just a few lines of code, in the spirit of interactive programming with Jupyter.
Agenda:
- Quirks and pitfalls in text recognition of scanned documents
- Potential of pre-processing with OpenCV
- Use Tesseract at scale
- Quantify, compare and re-evaluate results
- Use of TensorFlow in a production-ready application