'OCR 2.0' model converts images of text, formulas, notes, and shapes into editable text



Researchers have created a new universal optical character recognition (OCR) model called GOT (General OCR Theory). Their paper introduces the concept of OCR 2.0, which aims to combine the strengths of traditional OCR systems and large language models.

According to the researchers, an OCR 2.0 model uses a unified end-to-end architecture and requires fewer resources than large language models, while being versatile enough to recognize more than just plain text.

GOT's architecture consists of an image encoder with approximately 80 million parameters and a language decoder with 500 million parameters. The encoder compresses 1,024 x 1,024 pixel images into tokens, which the decoder then converts into text of up to 8,000 characters.
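To make the compression step concrete, here is a rough back-of-the-envelope sketch of how many tokens such an encoder might hand to the decoder. The article does not give the token count; the patch size of 16 and the extra 4x downsampling per side are assumptions chosen for illustration, and `vision_token_count` is a hypothetical helper, not part of GOT's code. Only the parameter counts come from the text above.

```python
# Parameter counts as reported in the article (approximate).
ENCODER_PARAMS = 80_000_000
DECODER_PARAMS = 500_000_000


def vision_token_count(image_size: int = 1024,
                       patch_size: int = 16,
                       downsample: int = 4) -> int:
    """Return the number of image tokens for a square input image.

    patch_size and downsample are ASSUMED values for illustration;
    the article only states that the image is compressed into tokens.
    """
    if image_size % (patch_size * downsample) != 0:
        raise ValueError("image_size must be divisible by patch_size * downsample")
    # Tokens per side after patching and further spatial downsampling.
    side = image_size // patch_size // downsample
    return side * side


tokens = vision_token_count()
total_params = ENCODER_PARAMS + DECODER_PARAMS
print(f"image tokens: {tokens}")
print(f"total parameters: {total_params / 1e6:.0f}M")
```

Under these assumptions, a 1,024 x 1,024 image becomes a few hundred tokens rather than about a million pixels, which is what lets a comparatively small decoder handle a full page.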

'OCR 2.0' unlocks automated processing of complex visual data in science, music, and analytics

The new model can recognize and convert various types of visual information into editable text. These include scene text and document text in English and Chinese, mathematical and chemical formulas, musical notes, simple geometric shapes, and diagrams with their components.

Three-stage GOT model architecture with vision encoder, linear layer, and language models for OCR 2.0 technology.

To optimize training, the researchers first trained only the encoder on text recognition tasks. They then added Alibaba's Qwen-0.5B as a decoder and fine-tuned the entire model with diverse, synthetic data. The team used rendering tools such as LaTeX, Mathpix-markdown-it, TikZ, Verovio, Matplotlib, and Pyecharts to generate millions of image-text pairs for training.
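The idea behind rendering-based synthetic data can be sketched in a few lines: start from a structured description, keep it as the ground-truth text, and render it to an image. The example below is a minimal illustration using Matplotlib, one of the tools named above. The function names, the JSON label format, and the file naming scheme are all hypothetical and are not taken from the researchers' pipeline.

```python
import json
import random


def make_chart_sample(rng: random.Random, n_bars: int = 4) -> dict:
    """Create a random bar-chart description (hypothetical format)."""
    labels = [f"item_{i}" for i in range(n_bars)]
    values = [rng.randint(1, 100) for _ in labels]
    return {"title": "Synthetic bar chart", "type": "bar",
            "data": dict(zip(labels, values))}


def ground_truth_text(sample: dict) -> str:
    """Serialize the description; this is the text the model should output."""
    return json.dumps(sample, sort_keys=True)


def render_chart(sample: dict, path: str) -> None:
    """Render the description to an image file (requires Matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(list(sample["data"].keys()), list(sample["data"].values()))
    ax.set_title(sample["title"])
    fig.savefig(path, dpi=100)
    plt.close(fig)


def build_pairs(n: int, seed: int = 0, render: bool = False) -> list:
    """Return (image_path, text) pairs; optionally write the images."""
    rng = random.Random(seed)
    pairs = []
    for i in range(n):
        sample = make_chart_sample(rng)
        path = f"chart_{i:05d}.png"
        if render:
            render_chart(sample, path)
        pairs.append((path, ground_truth_text(sample)))
    return pairs


pairs = build_pairs(3)
```

Calling `build_pairs(3, render=True)` would also write the PNG files to the current directory. Because the label is generated before the image, the pair is correct by construction, which is the main appeal of this kind of data over manually annotated scans.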

Three book pages in Chinese with OCR recognition and extracted text below, showing format retention across multiple pages.

The researchers report that GOT's modular design and synthetic data training allow for flexible expansion: new capabilities can be added without retraining the entire model, which makes updates and improvements to the system more efficient over time.

Text sources, rendering tools, and visual results for scientific and technical representations.

In the researchers' experiments, GOT performed well across various OCR tasks. It achieved top scores in document and scene text recognition and, in diagram recognition, outperformed both specialized models and large language models.