Handwriting Recognition: More Than Just MNIST

A Computationally Efficient Pipeline Approach to Full Page Offline Handwritten Text Recognition

Robert MacWha
Nerd For Tech


So there I was, watching Apple’s official release of iOS 15, when all of a sudden they presented a feature that struck me as genuinely impressive: iOS can now copy text straight out of images. That might not sound like much, but it just about blew my mind. Optical character recognition (OCR) technology has been able to handle printed text, and even basic handwritten text, for a while now. However, it has always been either hideously computationally expensive or extremely sensitive to the input photo. From what I was seeing, iOS’s implementation suffered from neither flaw. Needless to say: I was interested.

Demo of Apple’s Live Text software — Photo by CNBC

After a bit of searching, I came across a paper by Jonathan Chung and Thomas Delteil detailing an offline, full-page handwritten text recognition pipeline. Looking closer, this pipeline has two properties that let it perform very similarly to Apple’s Live Text. Firstly, being offline, the algorithm works on photos of handwritten text rather than on live pen strokes. Secondly, it handles full pages of handwriting instead of just single words or lines. This means it could, say, digitize a photo of a shopping list all by itself.

This article is a summary of the technologies used in their paper, as well as a few additions I’m planning to use when implementing the algorithm myself.

Overview

Despite often being the first project a new ML scientist completes, OCR remains a challenging task in real-world environments. This is because three relatively complex steps need to happen to produce useful results: the algorithm needs to locate the text in the image, split it up into manageable chunks, and perform handwriting recognition to determine which characters each chunk contains.

Passage identification

Example handwritten notebook with bounding boxes for passages — Photo by pure julia on Unsplash

Passage detection is the first step in full-page OCR: classifying which parts of the image contain text and drawing a bounding box around each text block. For simple environments, such as the one used in the paper, one can assume that only one text block exists, which means a convolutional model such as ResNet can be used to locate it. The model outputs four values corresponding to the x, y, width, and height of the bounding box.
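
As a rough illustration, here is a minimal PyTorch sketch of that idea (the paper itself uses MXNet, and the input resolution and sigmoid normalization are my own assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PassageDetector(nn.Module):
    """Regress a single text block's bounding box (x, y, w, h) from a page image."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet34(weights=None)
        # Swap the classification head for a 4-value box regressor
        backbone.fc = nn.Linear(backbone.fc.in_features, 4)
        self.backbone = backbone

    def forward(self, x):
        # Sigmoid keeps the box values normalized to [0, 1] relative to the page
        return torch.sigmoid(self.backbone(x))

model = PassageDetector()
pages = torch.randn(1, 3, 512, 512)  # dummy batch of page images
box = model(pages)                   # shape (1, 4): x, y, width, height
```

Trained with a regression loss (the paper uses mean squared error, as noted in the training details below), this single-box setup stays cheap while covering the simple one-passage case.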

For more complex images containing multiple blocks of text, this won’t work. Detection algorithms such as YOLO or SSDs can be used instead to find each chunk of the image that contains text, allowing the algorithm to detect multiple text boxes in a single image.

Line segmentation

Once the portions of the image containing text have been found, the next step is to segment each paragraph into individual lines. The paper opts to first detect bounding boxes for individual words, then combine those boxes using a clustering algorithm. This is done to minimize the chance of an entire line being missed.

Word bounding boxes are generated using a Single Shot Detector (SSD) model. The SSD architecture can generate multiple bounding boxes in real time: it first extracts a feature map using a backbone such as ResNet34, then draws bounding boxes around the extracted features. For more information, I’d recommend checking out this article by Jonathan Hui.
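
To get a feel for the moving parts, here’s how an off-the-shelf SSD from torchvision could be set up as a two-class (background vs. word) detector. This is a stand-in sketch, not the paper’s exact model:

```python
import torch
from torchvision.models.detection import ssd300_vgg16

# Two classes: background and "word" (a stand-in for the paper's SSD)
detector = ssd300_vgg16(weights=None, num_classes=2)
detector.eval()

with torch.no_grad():
    preds = detector([torch.rand(3, 300, 300)])  # one dummy page image

# Each prediction holds candidate boxes plus confidence scores
print(preds[0]["boxes"].shape, preds[0]["scores"].shape)
```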

To improve accuracy, post-processing steps are performed on the bounding boxes: word boxes whose heights are greater than their widths are discarded, as are boxes that overlap other boxes.
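
A minimal sketch of those two filters, with boxes as (x, y, w, h) tuples (the overlap threshold is my own assumption):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def filter_word_boxes(boxes, iou_thresh=0.5):
    """Drop boxes taller than they are wide, then drop heavy overlaps."""
    boxes = [b for b in boxes if b[3] <= b[2]]         # height > width: discard
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):  # overlapping: discard
            kept.append(b)
    return kept
```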

Example handwritten note with bounding boxes for words — Photo by pure julia on Unsplash

Once all the word bounding boxes have been detected, they can be clustered into lines based on their y-components. Since lines of text are generally written straight from left to right, the vertical extent of each bounding box can be used to determine whether two words sit on the same line: if two boxes overlap significantly in the y-direction, they are clustered into the same line.
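
A simple greedy version of that clustering might look like this, again with (x, y, w, h) boxes; the 50% overlap threshold is an illustrative choice:

```python
def y_overlap(a, b):
    """Fraction of the shorter box's height that boxes a and b share vertically."""
    top = max(a[1], b[1])
    bottom = min(a[1] + a[3], b[1] + b[3])
    return max(0.0, bottom - top) / min(a[3], b[3])

def cluster_into_lines(word_boxes, thresh=0.5):
    """Greedily group word boxes whose vertical extents overlap into lines."""
    lines = []
    for box in sorted(word_boxes, key=lambda b: b[1]):  # scan top-to-bottom
        for line in lines:
            if any(y_overlap(box, word) > thresh for word in line):
                line.append(box)
                break
        else:
            lines.append([box])  # no match: this box starts a new line
    return [sorted(line, key=lambda b: b[0]) for line in lines]  # left-to-right
```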

Example handwritten note with bounding boxes for lines — Photo by pure julia on Unsplash

To improve accuracy, the following post-processing steps are performed on the predicted lines.

  1. Lines smaller than some minimum are discarded
  2. Lines that exceed the bounds of the input image are discarded
  3. Lines that are substantially shorter than the median line length are discarded
  4. Lines that are significantly taller than the median height are split into two lines (accounts for double lines)
  5. Lines that overlap with other lines are discarded

All lines that pass these heuristics are used in the handwriting recognition step.
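
Sketched in code, the heuristics might look like the following (every threshold here is an illustrative assumption, not a value from the paper, and the overlap check from step 5 is omitted for brevity):

```python
import statistics

def filter_lines(lines, img_w, img_h, min_area=500):
    """Apply heuristics 1-4 to a list of (x, y, w, h) line boxes."""
    med_w = statistics.median(w for x, y, w, h in lines)
    med_h = statistics.median(h for x, y, w, h in lines)
    kept = []
    for x, y, w, h in lines:
        if w * h < min_area:                                  # 1. too small
            continue
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:  # 2. out of bounds
            continue
        if w < 0.5 * med_w:                                   # 3. far below median length
            continue
        if h > 1.8 * med_h:                                   # 4. double line: split in two
            kept += [(x, y, w, h / 2), (x, y + h / 2, w, h / 2)]
            continue
        kept.append((x, y, w, h))
    return kept
```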

Handwriting recognition

Handwriting recognition is performed using a convolutional bidirectional LSTM network (CNN-biLSTM). CNN-biLSTMs are model architectures for extracting features that are spatially aligned with the input, meaning the output features (letters) come out in the same order as the input features. The paper uses ResNet34 to extract features from each line image, then two biLSTMs to convert the feature encodings into an N x M array, where N is the maximum sequence length and M is the number of unique character classes.

Example architecture of a CNN-biLSTM network
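
For concreteness, here’s a hedged PyTorch sketch of the general shape of such a network (the paper’s exact layer sizes differ, and it was built in MXNet):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNBiLSTM(nn.Module):
    """ResNet features -> biLSTM -> per-timestep character logits."""

    def __init__(self, num_classes, hidden=256):
        super().__init__()
        resnet = models.resnet34(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # drop pool + fc
        self.rnn = nn.LSTM(512, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        f = self.cnn(x)          # (B, 512, H', W')
        f = f.mean(dim=2)        # collapse height -> (B, 512, W')
        f = f.permute(0, 2, 1)   # one timestep per horizontal slice of the line
        out, _ = self.rnn(f)     # (B, W', 2*hidden)
        return self.head(out)    # (B, N, M) character logits

logits = CNNBiLSTM(num_classes=80)(torch.randn(1, 3, 64, 512))  # (1, 16, 80)
```

Reading each horizontal slice of the line as one timestep is what keeps the output characters in the same order as the written ones.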

Language model denoiser

Language model denoisers are programs that transform the N x M output of the handwriting recognizer into a final string. The most common method for this is greedy search, which selects the character class with the maximum probability at each of the N steps. This is an intuitive way to solve the problem, but it tends to result in low accuracy. To improve accuracy, systems like beam search or pre-trained language models can be used: instead of looking at the probability of each classification individually, they consider the probability of whole sequences of characters. I can’t give a full explanation here, but for more information I’d recommend this article by Jason Brownlee.
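
Greedy search itself fits in a few lines. This sketch assumes a CTC-style output where class 0 is the blank token:

```python
import numpy as np

def greedy_decode(probs, charset, blank=0):
    """Collapse an (N, M) probability array into a string: take the argmax at
    each step, merge repeated classes, and drop blanks (CTC greedy decoding)."""
    best = probs.argmax(axis=1)
    chars, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)

probs = np.array([[0.1, 0.8, 0.1],    # step 1 -> class 1 ("a")
                  [0.1, 0.8, 0.1],    # repeated "a" is merged
                  [0.9, 0.05, 0.05],  # blank
                  [0.1, 0.1, 0.8]])   # class 2 ("b")
print(greedy_decode(probs, charset=["", "a", "b"]))  # -> "ab"
```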

The paper compares the results of seven language-denoising methods, including one that was custom-built. Their custom language model was pre-trained on a sequence-to-sequence dataset where the input had characters randomly added, removed, or replaced with similar-looking characters. This teaches the language model to fix mistakes caused by incorrectly classified characters. The N x M array is fed into this denoiser to generate the predicted output.
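
Building that kind of training data is easy to sketch; the look-alike map and noise rate below are my own illustrative choices:

```python
import random
import string

LOOKALIKES = {"o": "0", "l": "1", "s": "5", "e": "c", "u": "v"}  # illustrative map

def noise_text(text, p=0.05):
    """Corrupt a clean string by randomly deleting, inserting, or swapping
    characters for look-alikes, mimicking handwriting-recognizer errors."""
    out = []
    for ch in text:
        r = random.random()
        if r < p:                             # delete this character
            continue
        elif r < 2 * p:                       # insert a spurious character
            out.append(random.choice(string.ascii_lowercase))
            out.append(ch)
        elif r < 3 * p and ch in LOOKALIKES:  # replace with a look-alike
            out.append(LOOKALIKES[ch])
        else:
            out.append(ch)
    return "".join(out)

# (corrupted, clean) pairs teach a seq2seq model to undo recognizer mistakes
pairs = [(noise_text(s), s) for s in ["the quick brown fox", "a shopping list"]]
```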

Evaluation

The IAM dataset was used to evaluate the system. It contains roughly 1,500 pages of scanned, labelled handwritten documents and was split into a training and a testing set. The system was evaluated on its reported loss and accuracy, as well as on qualitative visual samples produced by running the validation data through the pipeline.

Training details

The paper uses Apache’s MXNet deep learning framework to build its networks. Each network was trained separately, all using the Adam optimizer. The following losses were used for each network:

  • Passage detection: Mean Squared Error
  • Word detection: Categorical Cross-Entropy
  • Handwriting recognition: CTC Loss
  • Language denoiser: Custom heuristics
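
As an illustration of the CTC piece (shown in PyTorch here rather than the paper’s MXNet, with arbitrary shapes):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)

# (time steps, batch, classes) log-probabilities, e.g. from the CNN-biLSTM
log_probs = torch.randn(100, 8, 80, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, 80, (8, 25))  # ground-truth character indices

loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 100, dtype=torch.long),
           target_lengths=torch.full((8,), 25, dtype=torch.long))
loss.backward()  # CTC aligns the 100 timesteps with the 25 target characters
```

CTC loss is the natural choice here because the recognizer emits far more timesteps than there are characters, and the alignment between the two is unknown.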

Because of the nature of the dataset, many common data augmentation methods were not available. However, the following were used:

  • Translations
  • Shearing
  • Occluding
  • Blanking of random words/lines
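
In torchvision terms, such a stack might look like the following (the parameters are assumptions, not values from the paper):

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for PIL page images
augment = T.Compose([
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), shear=5),  # translate + shear
    T.ToTensor(),
    T.RandomErasing(p=0.3, scale=(0.02, 0.1)),  # rough stand-in for occlusion/blanking
])
```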

Results

Based on qualitative observations, we can see that, in general, all steps function accurately. Passage detection does a reasonable job of locating the bounding box in each image, save for column three, where it misses the last line. Word detection likewise tends to locate enough words for the line recognizer to encompass the entire line, though many shorter words are missed.

Figure 3 — Example of full-page OCR. Input image -> Passage detection -> Word detection -> Line bounding boxes

Once lines are detected, the handwriting recognizer and custom denoiser do an excellent job of converting them into text. The custom denoiser [D] performs significantly better than the argmax [AM] and beam search [BS] methods. This is especially prominent as the handwriting becomes less legible, such as in sample (d). Interestingly, the language-modelling capabilities of the custom denoiser caused it to leave out some letters from odd words, such as the first ‘t’ in “desterted” in sample (c).

Figure 4 — Example line decodings: single line of text -> decoded output string

Analyzing the reported character error rates (CER) for each denoising method clearly shows that the custom denoiser is much more accurate than any of the other algorithms. However, it fails to beat other methods when the input image is cropped (i.e. when the system is fed images containing only the handwritten portion, rather than the full pages this method is designed for).

Table 1 — Character error rate (CER) for each text-denoising method

Memory and time requirements were also significantly lower for this method than for the alternatives: the paper’s pipeline takes roughly 1.5x less time than the fastest comparable method and uses ~3.5x less memory than Wigington et al.’s method.

Table 2 — Memory usage and time taken for each method

So, getting back to my original point, this technology is amazing. Off the top of my head, I can think of at least 20 uses for it. I’m really looking forward to working with it more in the future and implementing a full-fledged OCR system.

Thanks for reading my article! Feel free to check out my portfolio, message me on LinkedIn if you have anything to say, or follow me on Medium to get notified when I post another article.
