Optical Character Recognition systems for Document Understanding
Mikuláš Bankovič
456421@mail.muni.cz
Faculty of Informatics, Masaryk University
October 10, 2022
Research sources
A Survey of Deep Learning Approaches for OCR and Document
Understanding [5]
ICDAR 2019 - Scanned Receipts OCR and Information Extraction
(SROIE) [2]
Mikuláš Bankovič 456421@mail.muni.cz ·Optical Character Recognition systems for Document Understanding ·Octo
Computer Vision problems
OCR pipeline
Figure 1 from Subramani et al. [5] shows different approaches to
building OCR systems.
Figure: Object detection and segmentation must be followed by
transcription (a text recognition model)
Text Detection
CRAFT, based on a CNN, specifically a fully convolutional network (FCN) [1]
Differentiable Binarization Network (DBNet) [3]
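The core idea behind DBNet can be sketched in a few lines. The snippet below is a minimal illustration, not the DBNet implementation: it evaluates the approximate step function from the paper [3], B = 1 / (1 + exp(-k (P - T))), for a single pixel. The function name is hypothetical; k = 50 is the value reported in the paper.

```python
import math


def differentiable_binarization(p: float, t: float, k: float = 50.0) -> float:
    """Soft binarization of one pixel (illustrative sketch).

    p: predicted text probability in [0, 1]
    t: predicted threshold in [0, 1]
    k: amplifying factor (50 in the DBNet paper)
    """
    # A steep sigmoid approximates the hard step "p >= t" while staying
    # differentiable, so the threshold can be learned end to end.
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.9):
        print(p, differentiable_binarization(p, t=0.3))
```

With a large k the output is close to 0 or 1 everywhere except near p = t, which is what lets the network treat binarization as part of training instead of a fixed post-processing step.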
Comparisons
Figure: Speed and metric comparison
Text Recognition
CRNN with different backbones (MobileNet, VGG)
MASTER: CNN + Transformer [4]
MASTER
Figure: Speed comparison to previous SOTA
MASTER
Connectionist Temporal Classification (CTC) loss
requires only word-level annotations, not character-level
annotations
higher training parallelization than RNN-based models
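Why word-level annotations are enough under CTC can be seen from how a CTC output is decoded. The snippet below is a minimal sketch of greedy CTC decoding, not code from DocTR or EasyOCR. It assumes class index 0 is the blank symbol and indices 1..N map to the characters of a hypothetical `charset` string.

```python
from typing import Sequence

BLANK = 0  # assumption: index 0 is reserved for the CTC blank symbol


def ctc_greedy_decode(frame_ids: Sequence[int], charset: str, blank: int = BLANK) -> str:
    """Collapse per-frame argmax ids into a string (illustrative sketch).

    CTC decoding merges consecutive repeats and then drops blanks, so the
    model never needs to know where each character sits in the image.
    """
    chars = []
    previous = blank
    for idx in frame_ids:
        # Emit a character only when it is not blank and differs from the
        # previous frame; a blank between two equal ids keeps both.
        if idx != blank and idx != previous:
            chars.append(charset[idx - 1])
        previous = idx
    return "".join(chars)


if __name__ == "__main__":
    # Frames: a a _ a b b _ _ c
    print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 3], charset="abc"))
```

Because many frame sequences collapse to the same word, the loss can sum over all of them and only the word itself has to be annotated.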
OCR frameworks
OCR frameworks let users run text detection and text recognition
models without building their own pipeline and visualization
tools.
EasyOCR: many languages, but only a few models
DocTR: newer interface and a wider variety of models, including
the ones we selected
Both are poorly documented and their codebases are still immature.
Figure: We do not want to reinvent the wheel
DocTR
DBNet (pretrained text detection)
MASTER (which we want to train for our domain)
Figure: DocTR
EasyOCR model
Figure: Source code of the as-yet-unnamed model implemented by EasyOCR
Invoice
Reconstruction
Born-Digital Dataset
Scrape PDF invoices from the internet (provided to us)
Process the PDF files with Python libraries (pdfminer, etc.) to extract
bounding boxes and texts
Upload to the HuggingFace Hub
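One detail of the processing step is worth a sketch: pdfminer reports bounding boxes in PDF points (1/72 inch) with the origin at the bottom-left of the page, while image crops use pixels with the origin at the top-left. The helper below is a hypothetical illustration of that conversion, not the code used to build the dataset; the function name and the rendering DPI are assumptions.

```python
from typing import Tuple


def pdf_bbox_to_pixels(
    bbox: Tuple[float, float, float, float],
    page_height_pt: float,
    dpi: float = 72.0,
) -> Tuple[int, int, int, int]:
    """Convert a PDF bbox (x0, y0, x1, y1) to an image crop box.

    Input: PDF points, origin at bottom-left.
    Output: (left, top, right, bottom) in pixels, origin at top-left,
    for a page rendered at `dpi` dots per inch.
    """
    x0, y0, x1, y1 = bbox
    scale = dpi / 72.0  # 72 PDF points per inch
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the vertical axis: the top edge of the box is y1 in PDF space.
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom


if __name__ == "__main__":
    # A word box on a US Letter page (792 pt high) rendered at 144 DPI.
    print(pdf_bbox_to_pixels((72, 700, 144, 720), page_height_pt=792, dpi=144))
```

The resulting tuple can be passed to an image cropping routine to obtain the (cropped image, extracted text) pairs.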
Born-Digital Dataset
We have 699,585 pairs of (cropped image, extracted text).
Figure: Example of born-digital cropped word. Custom model predicted:
7.1..2018.8.1..2018
Comparison of OCR engines on the born-digital dataset
OCR engine                   Exact   Partial   No text   FPS
easyocr_generation2          0.88    0.89      0.00      56.13
easyocr_generation1          0.81    0.82      0.01       3.99
doctr_vgg16_bn*              0.78    0.78      0.00      18.93
doctr_mobilenet_v3_large*    0.77    0.78      0.00      40.70
doctr_mobilenet_v3_small*    0.77    0.77      0.00      51.36
tesseract                    0.72    0.72      0.24       2.36
doctr_master                 0.82    0.82      X          X
Table: Comparison of different recognition networks on the born-digital dataset
*These are all CRNN architectures with different backbones
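How such scores might be computed can be sketched as follows. This is a hedged illustration, not the evaluation code behind the table: it assumes "Exact" is the fraction of predictions identical to the reference text and "No text" is the fraction of empty predictions. The slides do not define "Partial", so it is left out. Function names and sample strings are hypothetical.

```python
from typing import Sequence


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


def no_text_rate(predictions: Sequence[str]) -> float:
    """Fraction of predictions that are empty or whitespace only."""
    empty = sum(1 for p in predictions if not p.strip())
    return empty / len(predictions)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    preds = ["7.1.2018", "", "Faktura"]
    refs = ["7.1.2018", "DPH", "faktura"]
    print(exact_match_rate(preds, refs), no_text_rate(preds))
```

Exact match is strict: a single wrong character or a case difference counts as a miss, which is why it suits short cropped words such as dates and amounts.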
Training and validation loss
(a) DocTR custom model training loss
(b) DocTR custom model validation loss
Training and validation exact match
(a) DocTR custom model training exact match
(b) DocTR custom model validation exact match
Bibliography
[1] Youngmin Baek et al. Character Region Awareness for Text
Detection. 2019. arXiv: 1904.01941 [cs.CV].
[2] ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR
and Information Extraction. URL:
https://rrc.cvc.uab.es/?ch=13&com=introduction (visited
on 07/21/2022).
[3] Minghui Liao et al. Real-time Scene Text Detection with
Differentiable Binarization. URL:
http://arxiv.org/abs/1911.08947 (visited on
07/21/2022).
[4] Ning Lu et al. MASTER: Multi-aspect non-local network for
scene text recognition. DOI:
10.1016/j.patcog.2021.107980. URL:
https://doi.org/10.1016/j.patcog.2021.107980.
[5] Nishant Subramani et al. A Survey of Deep Learning Approaches
for OCR and Document Understanding. URL:
https://arxiv.org/abs/2011.13534 (visited on
07/21/2022).
1. We created a Czech text recognition dataset from invoices.
2. We trained a prototype custom model architecture.
3. We compared pretrained solutions and evaluated our prototype.