Improving the Recognition Accuracy of Tesseract-OCR Engine on Nepali Text Images via Preprocessing
December 2020
Authors
Abstract
Document images scanned or captured by digital cameras on mobile phones suffer from a number of limitations, such as geometric distortion, loss of focus, uneven lighting conditions and low scanning resolution. These limitations often degrade image quality, which in turn lowers the recognition accuracy of OCR engines. This work focuses on improving the recognition accuracy of the Tesseract-OCR engine on Nepali document images via preprocessing. For this purpose, we developed an image preprocessing pipeline consisting of 8 steps and tested it on several Nepali text images collected from different sources, such as a Nepali news corpus, books and printed documents. Our test results showed that the recognition accuracy improved from 90.69%, 54.34% and 38.45% to 94.84%, 71.15% and 51.21% for high-, medium- and low-quality images, respectively.
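To make the size of the reported improvement easier to compare across quality levels, the gains can be worked out directly from the figures in the abstract. The short sketch below is only illustrative arithmetic on those six numbers; the variable names are hypothetical and nothing here reproduces the preprocessing pipeline itself.

```python
# Recognition accuracy (%) reported in the abstract, before and after preprocessing.
before = {"high": 90.69, "medium": 54.34, "low": 38.45}
after = {"high": 94.84, "medium": 71.15, "low": 51.21}

# Absolute gain in percentage points for each image-quality level.
gains = {quality: round(after[quality] - before[quality], 2) for quality in before}

for quality, gain in gains.items():
    print(f"{quality}-quality images: +{gain} percentage points")
```

On these figures, the largest absolute gain is for medium-quality images (16.81 points), followed by low-quality images (12.76 points) and high-quality images (4.15 points).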