Optical Character Recognition systems for Document Understanding
Mikuláš Bankovič 456421@mail.muni.cz
Faculty of Informatics, Masaryk University
October 10, 2022

Research sources
- A Survey of Deep Learning Approaches for OCR and Document Understanding [5]
- ICDAR 2019 - Scanned Receipts OCR and Information Extraction (SROIE) [2]

Computer Vision problems

OCR pipeline
Figure 1 given by Subramani et al. [5] shows different approaches to OCR systems.
Figure: Object detection and segmentation need transcription (a text recognition model).

Text Detection
- CRAFT, based on a CNN, specifically an FCN [1]
- Differentiable Binarization Network (DBNet) [3]

Comparisons
Figure: Speed and metric comparison

Text Recognition
- CRNN with different backbones (MobileNet, VGG)
- MASTER: CNN + Transformer [4]

MASTER
Figure: Speed comparison to previous SOTA
- Connectionist Temporal Classification (CTC) loss
- does not require character-level annotations, only word-level annotations
- high training parallelization compared to RNN

OCR frameworks
OCR frameworks allow users to run text detection and text recognition models without having to create their own pipeline and visualization tools.
- EasyOCR: many languages, only a few models
- DocTR: newer interface and a better variety of models, including the selected ones
Both have poor documentation and their code is still in its infancy.
Figure: We do not want to reinvent the wheel

DocTR
- DBNet (pretrained text detection)
- MASTER (we want to train it for our domain)
Figure: DocTR

EasyOCR model
Figure: Source code of the so far unnamed model implemented by EasyOCR

Invoice

Reconstruction

Born-Digital Dataset
- Scrape PDF invoices from the internet (provided for us)
- Process the PDF files with Python libraries (pdfminer, etc.) to extract bounding boxes and texts
- Upload to the HuggingFace Hub
We have 699585 pairs of cropped image and extracted text.
Figure: Example of a born-digital cropped word.
Custom model predicted: 7.1..2018.8.1..2018

Compare OCR on Born Digital dataset
OCR engine                  Exact  Partial  No text  FPS
easyocr_generation2         0.88   0.89     0.00     56.13
easyocr_generation1         0.81   0.82     0.01      3.99
doctr_vgg16_bn*             0.78   0.78     0.00     18.93
doctr_mobilenet_v3_large*   0.77   0.78     0.00     40.70
doctr_mobilenet_v3_small*   0.77   0.77     0.00     51.36
tesseract                   0.72   0.72     0.24      2.36
doctr_master                0.82   0.82     X        X
Table: Comparison of different recognition networks on the born-digital dataset
* These are all CRNN architectures with different backbones.

Training and validation loss
(a) DocTR custom model training loss
(b) DocTR custom model validation loss

Training and validation exact match
(a) DocTR custom model training exact match
(b) DocTR custom model validation exact match

Bibliography
[1] Youngmin Baek et al. “Character Region Awareness for Text Detection”. 2019. arXiv: 1904.01941 [cs.CV].
[2] ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction. URL: https://rrc.cvc.uab.es/?ch=13&com=introduction (visited on 07/21/2022).
[3] Minghui Liao et al. “Real-time Scene Text Detection with Differentiable Binarization”. URL: http://arxiv.org/abs/1911.08947 (visited on 07/21/2022).
[4] Ning Lu et al. “MASTER: Multi-aspect non-local network for scene text recognition”. DOI: 10.1016/j.patcog.2021.107980. URL: https://doi.org/10.1016/j.patcog.2021.107980.
[5] Nishant Subramani et al. “A Survey of Deep Learning Approaches for OCR and Document Understanding”. URL: https://arxiv.org/abs/2011.13534 (visited on 07/21/2022).

1. We created a Czech recognition dataset from invoices.
2. We trained a prototype custom model architecture.
3. We compared pretrained solutions and evaluated our prototype.
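
The slides do not define the Exact, Partial and No text columns of the comparison table. The following is a minimal sketch of how such scores could be computed over (prediction, target) string pairs. It assumes that "Exact" means plain string equality, that "Partial" means equality after a lenient normalization (here, hypothetically, ignoring case and whitespace), and that "No text" means the engine returned an empty string. The names `normalize` and `score_predictions` are illustrative only and are not taken from EasyOCR, DocTR, or the evaluation code behind the table.

```python
from typing import Callable, Dict, Iterable, Tuple


def normalize(text: str) -> str:
    """Hypothetical lenient normalization: drop all whitespace, ignore case."""
    return "".join(text.split()).lower()


def score_predictions(
    pairs: Iterable[Tuple[str, str]],
    lenient: Callable[[str], str] = normalize,
) -> Dict[str, float]:
    """Score (prediction, target) pairs.

    Returns the share of exact matches, the share of matches after the
    lenient normalization, and the share of empty predictions.
    """
    total = exact = partial = empty = 0
    for predicted, target in pairs:
        total += 1
        if predicted == target:
            exact += 1
        # Assumption: "partial" is equality after normalization.
        if lenient(predicted) == lenient(target):
            partial += 1
        # Assumption: "no text" is an empty or whitespace-only prediction.
        if not predicted.strip():
            empty += 1
    if total == 0:
        raise ValueError("no pairs to score")
    return {
        "exact": exact / total,
        "partial": partial / total,
        "no_text": empty / total,
    }


if __name__ == "__main__":
    sample = [
        ("7.1.2018", "7.1.2018"),  # exact match
        ("Faktura", "FAKTURA"),    # matches only after normalization
        ("", "1 250,00"),          # engine returned no text
        ("Kc", "Kč"),              # wrong under both criteria
    ]
    print(score_predictions(sample))
```

Passing a different `lenient` function changes what counts as a partial match, so the same loop can be reused once the intended definition is known.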