[Dávid Meluš]: Utilization of contextual information for post-OCR error correction using language models 23. 3. 2023
Abstract
In recent years, optical character recognition (OCR) has become an increasingly important technology in a wide range of applications, including document digitization and automated data entry. Despite these advances, OCR still makes errors, which lead to inaccuracies in digitized documents. Incorporating contextual information, such as the surrounding words and their meanings, can help correct these errors.
In particular, we use language models trained on a domain-specific dataset to exploit contextual information for post-OCR error correction.
In this presentation, we will discuss potential methods for post-OCR error correction that leverage contextual information to improve the accuracy of OCR output, as well as their challenges and limitations.
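To make the idea of context-based correction concrete, here is a minimal sketch in Python. It is not the method presented in the talk: it swaps in a toy add-one-smoothed bigram model built from a three-sentence corpus in place of a neural language model, and uses string similarity from the standard library to propose candidates. The corpus and the names `CORPUS`, `bigram_score`, and `correct` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from difflib import get_close_matches

# Toy stand-in for a domain-specific corpus (hypothetical; real data would be far larger).
CORPUS = (
    "the invoice total is due on the first day of the month . "
    "the invoice number is printed on the first page . "
    "the total amount is due on receipt of the invoice ."
).split()

unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB = sorted(unigrams)


def bigram_score(prev, word, nxt):
    """Add-one smoothed score of `word` given its left and right neighbours."""
    v = len(VOCAB)
    left = (bigrams[(prev, word)] + 1) / (unigrams[prev] + v)
    right = (bigrams[(word, nxt)] + 1) / (unigrams[word] + v)
    return left * right


def correct(tokens):
    """Replace out-of-vocabulary tokens with the similar vocabulary word
    that fits best between its neighbours."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in unigrams:
            continue  # known word: assume the OCR output is correct
        # Candidate generation by string similarity (assumption: errors are small edits).
        candidates = get_close_matches(tok, VOCAB, n=5, cutoff=0.6)
        if not candidates:
            continue  # nothing similar enough: leave the token untouched
        prev = out[i - 1] if i > 0 else "."
        nxt = tokens[i + 1] if i + 1 < len(tokens) else "."
        # Candidate ranking by context: this is where the language model comes in.
        out[i] = max(candidates, key=lambda c: bigram_score(prev, c, nxt))
    return out


print(correct("is dae on".split()))     # ['is', 'due', 'on']
print(correct("first dae of".split()))  # ['first', 'day', 'of']
```

The misread token `dae` is equally similar to `due` and `day`, so string similarity alone cannot decide between them; only the neighbouring words do. The sketch handles only non-word errors, so a misreading that happens to produce a valid vocabulary word passes through unchanged.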
Slides
Readings
- Gupta et al.: Unsupervised Multi-View Post-OCR Error Correction With Language Models, EMNLP 2021, https://aclanthology.org/2021.emnlp-main.680
- https://medium.com/doma/using-nlp-bert-to-improve-ocr-accuracy-385c98ae174c