NLP: What it takes to design a full stack Deep Learning based receipts form filling system using NER?

Mageswaran D
11 min read · Nov 29, 2021

Online Colab Notebook for model training.

Use docker compose to launch the demo.

If we are lucky, getting a dataset, building a model, and evaluating it in a development environment is pretty easy with the tools we have to throw at the problem. The real pain comes when we want to move the model to production.

Three years back, for a client, we wanted to do a POC to retrieve 13 tags from client documents with a limited training set of around 2.5K documents.

Initially CRF models were used, which were not giving satisfactory results, so we moved towards Deep Learning models using Bidirectional LSTM networks. By using both character- and word-level embeddings we got some amazing results: we were able to retrieve 8 high-frequency tags with F1 scores around 70 to 90, while the rest of the tags were not up to the mark because of their low frequency and the individualities in them.

In the age of Transformers, I was thinking: why not revisit the topic and see how it performs on a receipts dataset, which is highly noisy and a very hard problem to tackle, as the images can have random noise in them along with varying camera angles.

Here I have outlined the components that are brought together to build an NER system, using scalable tools and frameworks but focused on running on a developer machine to get hands-on experience.

It’s not a complete system; however, it gives a glimpse of designing and building Deep Learning systems for production.

There are a lot of external resources; it’s up to the reader to decide how much to dive into them, or to skip them if they are familiar :)

1. Dataset

TAN O
WOON O
YANN O
BOOK company
TA company
.K(TAMAN company
DAYA) company
SDN company
BND company

789417-W O
NO.53 address
55 address
57 address
& address
59 address
address
JALAN address
SAGU address
18 address
TAMAN address
DAYA address
81100 address
JOHOR address
BAHRU address
JOHOR. address

DOCUMENT O
NO O
: O
TD01167104 O
DATE: O
25/12/2018 date
8:13:39 date

PM date
CASHIER: O
MANIS O
MEMBER: O
CASH O
BILL O
CODE/DESC O
PRICE O
DISC O
AMOUNT O
QTY O
RM O
9556939040116 O
KF O
MODELLING O
CLAY O
KIDDY O
FISH O
1 O
PC O
* O
9.000 total
0.00 O
9.00 total
TOTAL: O
ROUR O
DING O
ADJUSTMENT: O
ROUND O
D O
TOTAL O
(RM): O
CASH O
10.00 O
CHANGE O
1.00 O
GOODS O
SOLD O
ARE O
NOT O
RETURNABLE O
OR O
EXCHANGEABLE O
*** O
THANK O
YOU O
PLEASE O
COME O
AGAIN O
! O

The dataset we are going to use today is from the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE); a sample receipt annotated in the token/tag format is shown above.

Preparing the dataset is substantial work in itself, which I covered in my previous blog @ https://mageswaran1989.medium.com/how-to-build-custom-ner-huggingface-dataset-for-receipts-and-train-with-huggingface-transformers-6c954b84473c

It is highly recommended to read it.

2. Deep Learning Model with HuggingFace Transformers

distilbert-base-uncased is used as the Transformer model, loaded with the AutoModelForTokenClassification class of HuggingFace.
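A minimal sketch of wiring this up (the label list here is illustrative; the actual tag set comes out of the dataset preparation step covered in the previous blog):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative tag set; the real one is produced while preparing the SROIE dataset
labels = ["O", "company", "address", "date", "total"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)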

Training the model follows the usual routine, so I am gonna skip it.

There is an online Colab version which can be used to build the model @ https://colab.research.google.com/gist/Mageswaran1989/442d575d8f5ca11b7ae12b1b061a04d3/receiptsautoformfilling.ipynb

3. OCR

Google Tesseract, via the PyTesseract package, is used to convert the image into text, chosen for its simplicity and ease of use.

Tesseract is an easy option for most OCR tasks, though there are Deep Learning alternatives that can be trained for specific datasets.

Remember that Tesseract also comes with an LSTM engine which can be trained on custom datasets.

import pytesseract
import cv2

# OEM 3: default engine; PSM 6: assume a single uniform block of text
custom_config = r'--oem 3 --psm 6'
# Read the receipt image and run Tesseract OCR on it
img = cv2.imread(file_path)
text = pytesseract.image_to_string(img, lang='eng', config=custom_config)
  • The --oem argument, or OCR Engine Mode, controls the type of algorithm used by Tesseract.
tesseract --help-oem  # OCR Engine modes:
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
  • The --psm controls the automatic Page Segmentation Mode used by Tesseract.
tesseract --help-psm  # Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.

4. FastAPI Backend

When the user uploads the image, it needs to be converted to text, right? So the image is opened in the browser and the data is sent to a Python backend process to OCR it.

FastAPI is used as the REST API backend for its type-safety features and the ease of testing endpoints with its built-in features.

ALLOWED_ORIGINS = ["*"]
app.add_middleware(
CORSMiddleware,
allow_origins=ALLOWED_ORIGINS,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
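A rough sketch of the OCR endpoint, assuming the route path matches the one configured for the UI later in this post (the actual handler in the repo may differ):

import cv2
import numpy as np
import pytesseract
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/gyan42/ocr/engine/pytesseract/file")
async def ocr_with_pytesseract(file: UploadFile = File(...)):
    # Decode the uploaded bytes into an OpenCV image
    image_bytes = await file.read()
    img = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_COLOR)
    # Same Tesseract config used in the OCR section above
    text = pytesseract.image_to_string(img, lang="eng", config=r"--oem 3 --psm 6")
    return {"text": text}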

5. TorchServe

Ok, we have trained and got a model to use. How do we use the model to predict in the real world? How do we put it behind a REST API endpoint? How do we scale the model? How do we update the model after deploying?

I will leave you to explore TorchServe and its benefits ;)

To serve an NLP model, all we need are the model weights, the vocabulary used, and the tokenizer; luckily, the HuggingFace model saves all of these for us:

! ls ~/.gyan42/models/hf/sroie2019v1/
config.json    special_tokens_map.json    training_args.bin
pytorch_model.bin    tokenizer_config.json    vocab.txt

TorchServe has a tool called torch-model-archiver which packages the model as a self-contained archive that can then be served independently.

%set_env SERIALIZED_MODEL_FILE=/root/.gyan42/models/hf/sroie2019v1/pytorch_model.bin
!torch-model-archiver --force \
--model-name sroie2019v1 \
--version 1.0 \
--serialized-file $SERIALIZED_MODEL_FILE \
--handler gyan42/serving/handler/hf_transformer_handler.py \
--extra-files /root/.gyan42/models/hf/sroie2019v1/config.json,/root/.gyan42/models/hf/sroie2019v1/special_tokens_map.json,/root/.gyan42/models/hf/sroie2019v1/training_args.bin,/root/.gyan42/models/hf/sroie2019v1/tokenizer_config.json,/root/.gyan42/models/hf/sroie2019v1/vocab.txt \
--export-path /root/.gyan42/model-store/

Wait, what is hf_transformer_handler.py?

The part I like best about TorchServe is its ease of packaging models, unlike TensorFlow (no comments!).

So what typically needs to be done to handle an incoming HTTP request?

  • Load the model as part of initialization
  • Extract the text from incoming request
  • Preprocess the text
  • Run inference through the model
  • Post process the results
  • Send back the reply

This is what is done in hf_transformer_handler.py.

When TorchServe serves the model, it loads the handler, which has the knowledge of loading the artifacts for our model and how to process the incoming request. Neat and simple, isn’t it?
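As a rough sketch (assuming the handler subclasses TorchServe’s BaseHandler; the actual hf_transformer_handler.py in the repo may differ in details), it looks roughly like this:

import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import AutoTokenizer, AutoModelForTokenClassification

class HFTransformerHandler(BaseHandler):
    def initialize(self, context):
        # model_dir holds the files packaged via --serialized-file and --extra-files
        model_dir = context.system_properties.get("model_dir")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForTokenClassification.from_pretrained(model_dir)
        self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # Exact request parsing depends on how the client posts the payload
        payload = data[0].get("body") or data[0].get("data") or {}
        text = payload.get("text", "") if isinstance(payload, dict) else str(payload)
        return self.tokenizer(text.split(), is_split_into_words=True,
                              truncation=True, return_tensors="pt")

    def inference(self, inputs):
        with torch.no_grad():
            return self.model(**inputs).logits.argmax(dim=-1)

    def postprocess(self, outputs):
        # The real handler pairs each input token with its predicted tag;
        # here we simply return the tag names for a single request
        return [[self.model.config.id2label[int(i)] for i in outputs[0]]]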

torch-model-archiver creates a file with the extension .mar, which stands for Model Archive; it can then be served with:

torchserve --start --model-store data/model-store --models all --ts-config configs/torch_serve_config.properties --foreground

The above command loads all the models under the directory data/model-store.

TorchServe can be configured with three URLs, namely:

inference_address = http://0.0.0.0:6543
management_address = http://0.0.0.0:6544
metrics_address = http://0.0.0.0:6545

Model predictions are done at http://0.0.0.0:6543/predictions/{model_name}
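For example, a quick check from Python (the payload shape mirrors what the Vue UI sends later in this post):

import requests

resp = requests.post(
    "http://0.0.0.0:6543/predictions/sroie2019v1",
    json={"text": "TAN WOON YANN BOOK TA .K(TAMAN DAYA) SDN BND ..."},
    timeout=20,
)
print(resp.json())  # list of (token, tag) pairs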

The list of loaded models can be found at http://0.0.0.0:6544/models

The management URL has features to load, scale, and delete models over REST endpoints, which come in handy when we want to manage models remotely.

And don’t forget to configure CORS for serving the REST API endpoints:

# cors_allowed_origin is required to enable CORS, use '*' or your domain name
cors_allowed_origin=*
# required if you want to use preflight request
cors_allowed_methods=GET,POST,PUT,OPTIONS
# required if the request has an Access-Control-Request-Headers header
cors_allowed_headers=X-Custom-Header,content-type

As part of our demo we are loading the model statically; however, a more refined way is to load the model over the management URL. By storing the model in S3-like storage, it can be loaded dynamically.

curl -X POST  "http://localhost:6544/models?initial_workers=1&synchronous=true&url=s3://some_bucket/gyna42/model-store/sroie2019v1.mar"

TorchServe also comes with support for Kubernetes.

6. Vue3 WebUI

Without a UI, showcasing the whole model becomes command-line oriented, which is less intuitive.

Though Streamlit seems to be an attractive option, it lacks the power of customisation that we sometimes need.

I chose to learn Vue over other web frameworks, as usual, for its simplicity and ease of use.

  • Bulma CSS framework is used for HTML styling. Read about column styling here which is used to split the web page into different column segments
  • Axios is used for backend communications.
  • Nginx for reverse proxy and as web server.

Environment files are used to load different configurations at build time. I have used the following env files, namely linux, mac, maclocal and heroku, to set up the URLs accordingly.

VUE_APP_API_BASE_URL=http://localhost:8088
VUE_APP_TORCH_PRED_BASE_URL=http://0.0.0.0:6543
VUE_APP_TORCH_MGMNT_BASE_URL=http://0.0.0.0:6544

#End points
VUE_APP_API_OCR_TESSERACT=/gyan42/ocr/engine/pytesseract/file

Each Vue component consists of:

  • HTML
  • CSS: which plays a major role in decorating the web pages. One good thing is that there are well-defined CSS frameworks like https://bulma.io/ that give all the essential CSS classes for free
  • JavaScript with some predefined layout

To understand the Vue UI, some familiarity with the core Vue 3 concepts and with the HTML components we use below is necessary.

Once you get a working knowledge of all these components, understanding a Vue file becomes comfortable, but it takes time and practice to digest the HTML and JS interactions.

  • Sample Button
<button class="button is-link ml-5" v-on:click="onImageSample" > Sample a Test Image </button>
  • Once the sample button is clicked, it calls the onImageSample JS function, which samples an image from the static path and converts the blob data into file data, which is then passed to the backend for OCR
onImageSample() {

console.info("onImageSample")
this.startTime = performance.now()
this.timeElapsed = 0
this.modelTimeElapsed = 0
this.ocrTimeElapsed = 0

this.predictions = []

// The number of test images is 138!
const rndInt = Math.floor(Math.random() * 138) + 1
console.log(rndInt)

const blobUrlToFile = (blobUrl) => new Promise((resolve) => {
fetch(blobUrl).then((res) => {
res.blob().then((blob) => {
const fileName = blobUrl.split("/")[2]
console.info(fileName)
const file = new File([blob], fileName, {type: blob.type})
resolve(file)
})
})
})

blobUrlToFile(require("@/assets/images/test/"+rndInt+".jpg")).then( f => {
this.imageFileName = f
console.info(this.imageFileName )
this.createImage(this.imageFileName);
this.onRunOCR()
}
)
},
createImage(file) {
var reader = new FileReader();
reader.onload = (e) => {
this.imageFile = e.target.result;
};
reader.readAsDataURL(file);
},
}
  • Image holder: once the file is available through the variable imageFile, the image content is displayed on the UI
<div class="box" style="background-color:transparent;">
<p v-if="imageFile.length > 0"> <img v-bind:src="imageFile" /></p>
</div>
  • The image file data is then given to the OCR backend API, which returns the text data that gets stored in inTextData
async onRunOCR() {
this.status = "Running Tesseract"
this.ocrStartTime = performance.now()
console.info("Running Tesseract")
console.info(process.env.VUE_APP_API_BASE_URL, process.env.VUE_APP_API_OCR_TESSERACT)
let formData = new FormData();
formData.append('file', this.imageFileName);
console.info(this.imageFileName)
console.info(formData)
let headers = {
headers: {
'Content-Type': 'multipart/form-data'
},
timeout: 30000
}

api.fastapi
.post(process.env.VUE_APP_API_OCR_TESSERACT, formData, headers)

.then(res => {
console.info(res);
this.inTextData = res["data"]['text']
this.ocrTimeElapsed = performance.now() - this.ocrStartTime
this.predict()
})
.catch((err) => alert(err));
},
  • The text data is then passed to the model prediction API, which returns a list of (token, tag) pairs.
predict(){
this.status = "Running Transformer Model"
this.modelStartTime = performance.now()
api.torchserve.post("predictions/" + this.selectedTorchModel, {"text": this.inTextData}, {timeout: 20000})
.then(value => {
console.info(value["data"])
this.extractTags(value["data"])
})
},
  • The extractTags function converts the list of tuples into a list of map objects
  • Once the predictions are available in the predictions data variable, we iterate through it and create the label and input text boxes
<li v-for="prediction in predictions" :key="prediction.id">
<div class="field is-horizontal">
<div class="field-label is-normal">
<label class="label">{{prediction.tag}}</label>
</div>
<div class="field-body mr-5">
<div class="field">
<div class="control">
<input class="input" type="text" v-model=prediction.token>
</div>
</div>
</div>
</div>
<br>
</li>
  • Predictions are a list of map objects as follows:
predictions: [ {id: 1, tag: "TAG1", token: "TOKEN1"}, 
{id:2, tag: "TAG2", token: "TOKEN2"}]

7. Docker Images

A multi-stage build is used.

All docker files are found here:

COPY ui/nginx.conf /etc/nginx/nginx.conf
  • .env files are used with Vue builder for environment variables like URLs
  • https://gunicorn.org/ is used to launch the FastAPI backend as a daemon service
  • torchserve to launch Torch Serve REST API endpoints from a static model file

With that, we come to demo time: https://gitlab.com/gyan42/receipts-form-filling/-/tree/main#docker-compose

An attempt was made to run the app on the Heroku platform; however, as suspected, the RAM needs exceed 500MB, since we put Tesseract and the Transformer under one roof.

In our next POC, let’s build a model and serve it under 500MB, using PyTorch Mobile https://pytorch.org/mobile/home/ or quantization.
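For instance, dynamic quantization in PyTorch is a quick way to shrink the Transformer’s Linear weights to int8; a minimal sketch, reusing the model directory from earlier in this post:

import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "/root/.gyan42/models/hf/sroie2019v1"
)
# Quantize the Linear layers to int8 to reduce the memory footprint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "sroie2019v1-int8.pt")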

Thanks for reading and raise issues if something doesn’t work, happy to fix them any time!
