CIVQA
Czech Invoice Visual Question Answering

Šárka Ščavnická
Faculty of Informatics, Masaryk University
November 9, 2023

Background: VRD

Visually Rich Documents

VRD
  Visually rich documents are documents whose semantic structure is determined not only by the text but also by the layout and visual elements of the document.

Figure: Example of a VRD [1]

Š. Ščavnická · CIVQA · November 9, 2023

Background: DVQA

Document visual question answering

DVQA
  DVQA seeks to obtain knowledge from the visual and textual parts of documents in order to answer questions.

The questions asked may relate to different parts of a VRD:
  - text
  - inserted images
  - tables
  - forms

Methodology: Dataset

Entity / Numeric / Textual / Pattern / Shape
  Invoice number: X
  Variable symbol: X
  Specific symbol: X
  Constant symbol: X
  Bank code: X X
  Account number: X X
  ICO: X X
  Total amount: X
  Invoice date: X X
  Due date: X X
  Name of supplier: X
  IBAN: X X X
  DIC: X X X
  QR code: X
  Supplier's address: X

Table: Categories of the entities in the CIVQA dataset

Tesseract and EasyOCR CIVQA dataset

Tesseract OCR
  - Developed at HP Research between 1984 and 1994
  - Open-source project since 2005
  - Can recognise more than 100 different languages, including Czech

EasyOCR
  - Python framework created by Jaided AI
  - Can recognise just over eighty languages, including Czech

Each of these dataset types comes in two versions:
  - human-readable
  - ready to use

Sliding window
  - Maximum length = 512 tokens
  - Special tokens: [CLS] question tokens [SEP] word tokens [SEP]
  - [PAD] token

Figure: Sliding window technique [2]
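The sliding window layout above can be illustrated with a minimal sketch. It works on plain token lists rather than a real tokenizer, and the `stride` parameter (the number of word tokens shared by neighbouring windows) and the function name `sliding_windows` are assumptions made for illustration; the slides only state the 512-token maximum and the special tokens.

```python
from typing import List

MAX_LENGTH = 512  # maximum sequence length stated on the slide


def sliding_windows(question: List[str], words: List[str],
                    max_length: int = MAX_LENGTH,
                    stride: int = 128) -> List[List[str]]:
    """Split word tokens into overlapping windows, each prefixed by the question.

    `stride` is a hypothetical overlap size, not a value taken from the slides.
    """
    # Room left for word tokens after [CLS], the question and two [SEP] tokens.
    capacity = max_length - len(question) - 3
    if capacity <= 0:
        raise ValueError("question is too long for the chosen max_length")
    if not 0 <= stride < capacity:
        raise ValueError("stride must be non-negative and smaller than the capacity")

    windows = []
    start = 0
    while True:
        chunk = words[start:start + capacity]
        tokens = ["[CLS]", *question, "[SEP]", *chunk, "[SEP]"]
        # Pad the window up to the fixed length.
        tokens += ["[PAD]"] * (max_length - len(tokens))
        windows.append(tokens)
        if start + capacity >= len(words):
            break
        # Consecutive windows share `stride` word tokens.
        start += capacity - stride
    return windows
```

The overlap is meant to reduce the chance that an answer is cut in half at a window boundary; how large it should be for invoices is left open here.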
Methodology: Models

Models
  - LayoutLMv2
  - LayoutXLM
      - Chinese, Japanese, Spanish, French, Italian, German, and Portuguese
  - LayoutLMv3
  - Impira LayoutLM Invoices
      - fine-tuned on the SQuAD and DocVQA datasets plus a proprietary dataset of invoices
  - Impira LayoutLM Document QA
      - fine-tuned on the SQuAD and DocVQA datasets

Experiments: Experiment 1

Tesseract OCR vs EasyOCR

Model            Precision  Recall   F1 score
LayoutXLM        0.7422     0.7117   0.7079
LayoutLMv2       0.6917     0.6750   0.6634
LayoutLMv3       0.6989     0.6382   0.6410
Impira QA        0.6773     0.6291   0.6313
Impira Invoice   0.6948     0.6440   0.6434

Table: CIVQA Tesseract OCR results

Model            Precision  Recall   F1 score
LayoutXLM        0.6636     0.6633   0.6455
LayoutLMv2       0.6323     0.6129   0.6011
LayoutLMv3       0.6370     0.6164   0.6065
Impira QA        0.6373     0.6015   0.5984
Impira Invoice   0.6345     0.6019   0.5962

Table: CIVQA EasyOCR results

Figure: The precision of the models in the first experiment.

Figure: Validation dataset of CIVQA Tesseract OCR: LayoutXLM model success rate by individual question, in percent.

Figure: The correct answer is on one line.

Figure: The correct answer spans multiple lines, so it was split.

Experiments: Experiment 2

CIVQA and unseen types of questions

In this set of experiments, our focus was on developing a practical and robust solution for unseen entities.

  - Invoice number: a numerical entity without a fixed shape.
  - ICO: a numerical entity with a given shape.
  - Supplier's address: a textual and numerical entity without a fixed shape.
  - IBAN: a textual and numerical entity with a fixed shape.
  - Due date: a numerical entity with a given shape.
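One way to picture the split into known and unseen entities is the sketch below. The record layout (a dict with an `entity` key), the snake_case entity identifiers and the function name `split_by_entity` are hypothetical; the slides do not describe how the dataset is stored.

```python
from typing import Dict, Iterable, List, Set, Tuple

# The five entity types listed above as unseen (identifiers are illustrative).
UNSEEN_ENTITIES: Set[str] = {
    "invoice_number",
    "ico",
    "supplier_address",
    "iban",
    "due_date",
}

Sample = Dict[str, str]


def split_by_entity(samples: Iterable[Sample],
                    unseen: Set[str] = UNSEEN_ENTITIES
                    ) -> Tuple[List[Sample], List[Sample]]:
    """Partition question-answer samples into (known, unknown) by entity type."""
    known: List[Sample] = []
    unknown: List[Sample] = []
    for sample in samples:
        # A sample is "unknown" when its entity type is held out of training.
        if sample["entity"] in unseen:
            unknown.append(sample)
        else:
            known.append(sample)
    return known, unknown
```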
Experiments: Experiment 2.1

Training with a subset of unknown data

In this experiment, we introduced different amounts of unknown data to the trained models. We chose 5%, 15%, 30% and 50% and compared the results to see how each amount affects the models.

Model            Precision  Recall   F1 score
LayoutXLM        0.19200    0.04128  0.05816
LayoutLMv2       0.03427    0.02695  0.02605
LayoutLMv3       0.10220    0.03411  0.04557
Impira QA        0.15120    0.04554  0.06520
Impira Invoice   0.13600    0.05304  0.07235

Table: CIVQA Tesseract OCR results on unknown entities

Model            Precision  Recall   F1 score
LayoutXLM        0.7002     0.6594   0.6617
LayoutLMv2       0.5944     0.5154   0.5192
LayoutLMv3       0.5793     0.5125   0.5254
Impira QA        0.6186     0.5356   0.5466
Impira Invoice   0.5999     0.5255   0.5369

Table: CIVQA Tesseract OCR results on unknown entities: 5%

Model            Precision  Recall   F1 score
LayoutXLM        0.7078     0.6911   0.6844
LayoutLMv2       0.6201     0.5717   0.5718
LayoutLMv3       0.6377     0.5755   0.5825
Impira QA        0.6491     0.5907   0.5935
Impira Invoice   0.6410     0.5849   0.5880

Table: CIVQA Tesseract OCR results on unknown entities: 15%

Model            Precision  Recall   F1 score
LayoutXLM        0.7297     0.7124   0.7069
LayoutLMv2       0.6852     0.6619   0.6552
LayoutLMv3       0.6751     0.6497   0.6465
Impira QA        0.6815     0.6464   0.6447
Impira Invoice   0.6772     0.6454   0.6421

Table: CIVQA Tesseract OCR results on unknown entities: 30%

Model            Precision  Recall   F1 score
LayoutXLM        0.7360     0.7106   0.7069
LayoutLMv2       0.6923     0.6488   0.6508
LayoutLMv3       0.6876     0.6573   0.6566
Impira QA        0.7004     0.6560   0.6566
Impira Invoice   0.6994     0.6720   0.6559

Table: CIVQA Tesseract OCR results on unknown entities: 50%

Figure: Validation dataset of CIVQA Tesseract OCR unknown: LayoutXLM model success rate by individual question, in percent, with 5% training.
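The Experiment 2.1 setup, in which 5%, 15%, 30% and 50% of the unknown data are introduced to the models, could be prepared roughly as follows. This is a sketch under assumptions: the slides do not say how the subsets were drawn, so uniform random sampling with a fixed seed and the helper name `sample_fraction` are illustrative choices only.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Shares of unknown data used in Experiment 2.1.
FRACTIONS = (0.05, 0.15, 0.30, 0.50)


def sample_fraction(unknown: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Draw a random subset containing roughly `fraction` of the unknown samples."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    size = round(len(unknown) * fraction)
    return rng.sample(list(unknown), size)


if __name__ == "__main__":
    unknown_samples = list(range(200))  # stand-in for real QA samples
    for fraction in FRACTIONS:
        subset = sample_fraction(unknown_samples, fraction)
        print(f"{fraction:.0%}: {len(subset)} samples")
```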
Figure: Validation dataset of CIVQA Tesseract OCR unknown: LayoutLMv3 model success rate by individual question, in percent, with 5% training.

Figure: Validation dataset of CIVQA Tesseract OCR unknown: LayoutXLM model success rate by individual question, in percent, with 50% training.

Experiments: Experiment 2.2

Training with a subset of unknown data concatenated with the known dataset

Model            Precision  Recall   F1 score
LayoutXLM        0.7069     0.6693   0.67
LayoutLMv2       0.6223     0.5726   0.5755
LayoutLMv3       0.6344     0.5528   0.5631
Impira QA        0.6318     0.5487   0.5670
Impira Invoice   0.6353     0.5577   0.5681

Table: CIVQA Tesseract OCR results on unknown entities concatenated with the known dataset: 5%

Model            Precision  Recall   F1 score
LayoutXLM        0.7002     0.6594   0.6617
LayoutLMv2       0.5944     0.5154   0.5192
LayoutLMv3       0.5793     0.5125   0.5254
Impira QA        0.6186     0.5356   0.5466
Impira Invoice   0.5999     0.5255   0.5369

Table: CIVQA Tesseract OCR results on unknown entities: 5% (values from Experiment 2.1, for comparison)

Model            Precision  Recall   F1 score
LayoutXLM        0.7238     0.664    0.6919
LayoutLMv2       0.6428     0.5615   0.5715
LayoutLMv3       0.6591     0.5831   0.5858
Impira QA        0.6629     0.5849   0.5879
Impira Invoice   0.6658     0.6391   0.6359

Table: CIVQA Tesseract OCR results on unknown entities concatenated with the known dataset: 15%

Model            Precision  Recall   F1 score
LayoutXLM        0.7078     0.6911   0.6844
LayoutLMv2       0.6201     0.5717   0.5718
LayoutLMv3       0.6377     0.5755   0.5825
Impira QA        0.6491     0.5907   0.5935
Impira Invoice   0.6410     0.5849   0.5880

Table: CIVQA Tesseract OCR results on unknown entities: 15% (values from Experiment 2.1, for comparison)
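To make the comparison between Experiment 2.2 (unknown subset concatenated with the known data) and Experiment 2.1 (unknown subset only) easier to read, the F1 differences can be computed directly from the tables. The sketch below only restates the reported F1 scores; the dictionary and function names are illustrative.

```python
from typing import Dict

# F1 scores copied from the Experiment 2.1 tables (unknown subset only).
F1_SUBSET_ONLY: Dict[str, Dict[str, float]] = {
    "5%": {"LayoutXLM": 0.6617, "LayoutLMv2": 0.5192, "LayoutLMv3": 0.5254,
           "Impira QA": 0.5466, "Impira Invoice": 0.5369},
    "15%": {"LayoutXLM": 0.6844, "LayoutLMv2": 0.5718, "LayoutLMv3": 0.5825,
            "Impira QA": 0.5935, "Impira Invoice": 0.5880},
}

# F1 scores copied from the Experiment 2.2 tables (subset + known data).
F1_CONCATENATED: Dict[str, Dict[str, float]] = {
    "5%": {"LayoutXLM": 0.67, "LayoutLMv2": 0.5755, "LayoutLMv3": 0.5631,
           "Impira QA": 0.5670, "Impira Invoice": 0.5681},
    "15%": {"LayoutXLM": 0.6919, "LayoutLMv2": 0.5715, "LayoutLMv3": 0.5858,
            "Impira QA": 0.5879, "Impira Invoice": 0.6359},
}


def f1_gain(share: str) -> Dict[str, float]:
    """Return F1(concatenated) minus F1(subset only) per model, to 4 decimals."""
    return {
        model: round(F1_CONCATENATED[share][model] - F1_SUBSET_ONLY[share][model], 4)
        for model in F1_SUBSET_ONLY[share]
    }


if __name__ == "__main__":
    for share in ("5%", "15%"):
        print(share, f1_gain(share))
```

With these values, the difference is positive for every model at 5%, while at 15% it is slightly negative for LayoutLMv2 and Impira QA.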
Figure: Validation dataset of CIVQA Tesseract OCR unknown: LayoutXLM model success rate by individual question, in percent, with 5% training.

Experiments: Experiment 2.3

DocVQA and CIVQA known dataset

Figure: LayoutXLM comparison

Bibliography

[1] Ishita Jaiswal, Ankur A. Patel. What is Intelligent Document Processing and How LayoutLM's Pre-Trained Model for Text and Image Understanding Works. Accessed on 08.11.2023. URL: https://www.ankursnewsletter.com/p/what-is-intelligent-document-processing.

[2] Long Nguyen. Sliding Window: A common technique to solve algorithmic problems involving String/Array. Accessed on 08.11.2023. URL: https://medium.com/swlh/sliding-window-a-common-technique-for-solving-algorithmic-problems-involving-string-array-44adf35e2d5d.

Thank You for Your Attention!