NLP: What it takes to design a full stack Deep Learning based receipts form filling system using NER?

Mageswaran D
11 min read · Nov 29, 2021

Online Colab Notebook for model training.

Use docker compose to launch the demo.

If we are lucky, getting a dataset, building a model, and evaluating it in a development environment is pretty easy with the tools we have to throw at the problem. The real pain comes when we want to move the model to production.

Three years back, for a client, we wanted to do a POC to retrieve 13 tags from client documents with a limited training set of around 2.5K documents.

Initially CRF models were used, which were not giving satisfactory results, so we moved towards Deep Learning models using Bidirectional LSTM networks. By using both character- and word-level embeddings we got some amazing results: we were able to retrieve 8 high-frequency tags with F1 scores around 70 to 90, while the rest of the tags were not up to the mark because of their low frequency and the individualities in them.

In the age of Transformers, I was thinking: why not revisit the topic and see how it performs on a receipts dataset, which is highly noisy and a very hard problem to tackle, as the images can have random noise in them along with varying camera angles.

Here I have outlined the components that are brought together to build an NER system, using scalable tools and frameworks but focused on running on a developer machine to get hands-on experience.

It’s not a complete system; however, it gives a glimpse of designing and building Deep Learning systems for production.

There are a lot of external resources; it’s up to the reader to decide how much to dive into them, or to skip them if they are familiar :)

1. Dataset

TAN O
WOON O
YANN O
BOOK company
TA company
.K(TAMAN company
DAYA) company
SDN company
BND company

789417-W O
NO.53 address
55 address
57 address
& address
59 address
address
JALAN address
SAGU address
18 address
TAMAN address
DAYA address
81100 address
JOHOR address
BAHRU address
JOHOR. address

DOCUMENT O
NO O
: O
TD01167104 O
DATE: O
25/12/2018 date
8:13:39 date

PM date
CASHIER: O
MANIS O
MEMBER: O
CASH O
BILL O
CODE/DESC O
PRICE O
DISC O
AMOUNT O
QTY O
RM O
9556939040116 O
KF O
MODELLING O
CLAY O
KIDDY O
FISH O
1 O
PC O
* O
9.000 total
0.00 O
9.00 total
TOTAL: O
ROUR O
DING O
ADJUSTMENT: O
ROUND O
D O
TOTAL O
(RM): O
CASH O
10.00 O
CHANGE O
1.00 O
GOODS O
SOLD O
ARE O
NOT O
RETURNABLE O
OR O
EXCHANGEABLE O
*** O
THANK O
YOU O
PLEASE O
COME O
AGAIN O
! O

The dataset we are going to use today is from the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE); a sample receipt annotated in the token/tag format is shown above.

Preparing the dataset is substantial work in itself, which I covered in my previous blog @ https://mageswaran1989.medium.com/how-to-build-custom-ner-huggingface-dataset-for-receipts-and-train-with-huggingface-transformers-6c954b84473c

It is highly recommended to read it.

2. Deep Learning Model with HuggingFace Transformers

distilbert-base-uncased is used as the Transformer model, loaded with the AutoModelForTokenClassification class of HuggingFace.
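A minimal sketch of wiring this up (the label list here is illustrative; the actual tag set comes out of the dataset preparation step covered in the previous blog):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative tag set; the real one is produced while preparing the SROIE dataset
labels = ["O", "company", "address", "date", "total"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)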

Training the model follows the usual routine, so I am gonna skip it.

There is an online Colab version which can be used to build the model @ https://colab.research.google.com/gist/Mageswaran1989/442d575d8f5ca11b7ae12b1b061a04d3/receiptsautoformfilling.ipynb

3. OCR

Google Tesseract, via the PyTesseract package, is used to convert the image into text, chosen for its simplicity and ease of use.

Tesseract is an easy option for most OCR tasks, though there are Deep Learning alternatives that can be trained for specific datasets.

Remember that Tesseract also comes with an LSTM engine which can be trained on custom datasets.

import pytesseract
import cv2

# OEM 3: default engine; PSM 6: assume a single uniform block of text
custom_config = r'--oem 3 --psm 6'
# Read the receipt image and run Tesseract OCR on it
img = cv2.imread(file_path)
text = pytesseract.image_to_string(img, lang='eng', config=custom_config)
  • The --oem argument, or OCR Engine Mode, controls the type of algorithm used by Tesseract.
tesseract --help-oem  # OCR Engine modes:
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
  • The --psm controls the automatic Page Segmentation Mode used by Tesseract.
tesseract --help-psm  # Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.

4. FastAPI Backend

When the user uploads the image, it needs to be converted to text, right? So the image is opened in the browser and the data is sent to a Python backend process to OCR it.

FastAPI is used as the REST API backend for its type-safety features and the ease of testing endpoints with its built-in features.

ALLOWED_ORIGINS = ["*"]
app.add_middleware(
CORSMiddleware,
allow_origins=ALLOWED_ORIGINS,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
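A rough sketch of the OCR endpoint, assuming the route path matches the one configured for the UI later in this post (the actual handler in the repo may differ):

import cv2
import numpy as np
import pytesseract
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/gyan42/ocr/engine/pytesseract/file")
async def ocr_with_pytesseract(file: UploadFile = File(...)):
    # Decode the uploaded bytes into an OpenCV image
    image_bytes = await file.read()
    img = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_COLOR)
    # Same Tesseract config used in the OCR section above
    text = pytesseract.image_to_string(img, lang="eng", config=r"--oem 3 --psm 6")
    return {"text": text}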

5. TorchServe

Ok, we have trained and got a model to use. How do we use the model to predict in the real world? How do we put it behind a REST API endpoint? How do we scale the model? How do we update the model after deploying?

I will leave you to explore TorchServe and its benefits ;)

To serve an NLP model, all we need are the model weights, the vocabulary used, and the tokenizer; luckily, the HuggingFace model saves all of these for us:

! ls ~/.gyan42/models/hf/sroie2019v1/
config.json    special_tokens_map.json    training_args.bin
pytorch_model.bin    tokenizer_config.json    vocab.txt

TorchServe has a tool called torch-model-archiver which packages the model as a self-contained archive that can then be served independently.

%set_env SERIALIZED_MODEL_FILE=/root/.gyan42/models/hf/sroie2019v1/pytorch_model.bin
!torch-model-archiver --force \
--model-name sroie2019v1 \
--version 1.0 \
--serialized-file $SERIALIZED_MODEL_FILE \
--handler gyan42/serving/handler/hf_transformer_handler.py \
--extra-files /root/.gyan42/models/hf/sroie2019v1/config.json,/root/.gyan42/models/hf/sroie2019v1/special_tokens_map.json,/root/.gyan42/models/hf/sroie2019v1/training_args.bin,/root/.gyan42/models/hf/sroie2019v1/tokenizer_config.json,/root/.gyan42/models/hf/sroie2019v1/vocab.txt \
--export-path /root/.gyan42/model-store/

Wait, what is hf_transformer_handler.py?

The part I like best about TorchServe is its ease of packaging models, unlike TensorFlow (no comments!).

So what typically needs to be done to handle an incoming HTTP request?

  • Load the model as part of initialization
  • Extract the text from incoming request
  • Preprocess the text
  • Run inference through the model
  • Post process the results
  • Send back the reply

This is what is done in hf_transformer_handler.py.

When TorchServe serves the model, it loads the handler, which has the knowledge of loading the artifacts for our model and how to process the incoming request. Neat and simple, isn’t it?
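As a rough sketch (assuming the handler subclasses TorchServe’s BaseHandler; the actual hf_transformer_handler.py in the repo may differ in details), it looks roughly like this:

import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import AutoTokenizer, AutoModelForTokenClassification

class HFTransformerHandler(BaseHandler):
    def initialize(self, context):
        # model_dir holds the files packaged via --serialized-file and --extra-files
        model_dir = context.system_properties.get("model_dir")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForTokenClassification.from_pretrained(model_dir)
        self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # Exact request parsing depends on how the client posts the payload
        payload = data[0].get("body") or data[0].get("data") or {}
        text = payload.get("text", "") if isinstance(payload, dict) else str(payload)
        return self.tokenizer(text.split(), is_split_into_words=True,
                              truncation=True, return_tensors="pt")

    def inference(self, inputs):
        with torch.no_grad():
            return self.model(**inputs).logits.argmax(dim=-1)

    def postprocess(self, outputs):
        # The real handler pairs each input token with its predicted tag;
        # here we simply return the tag names for a single request
        return [[self.model.config.id2label[int(i)] for i in outputs[0]]]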

torch-model-archiver creates a file with the extension .mar, which stands for Model Archive; it can then be served with:

torchserve --start --model-store data/model-store --models all --ts-config configs/torch_serve_config.properties --foreground

The above command loads all the models under the directory data/model-store.

TorchServe can be configured with three URLs, namely:

inference_address = http://0.0.0.0:6543
management_address = http://0.0.0.0:6544
metrics_address = http://0.0.0.0:6545

Model predictions are done at http://0.0.0.0:6543/predictions/{model_name}
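For example, a quick check from Python (the payload shape mirrors what the Vue UI sends later in this post):

import requests

resp = requests.post(
    "http://0.0.0.0:6543/predictions/sroie2019v1",
    json={"text": "TAN WOON YANN BOOK TA .K(TAMAN DAYA) SDN BND ..."},
    timeout=20,
)
print(resp.json())  # list of (token, tag) pairs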

The list of loaded models can be found at http://0.0.0.0:6544/models

The management URL has features to load, scale, and delete models over REST endpoints, which come in handy when we want to manage models remotely.

And don’t forget to configure CORS for serving the REST API endpoints:

# cors_allowed_origin is required to enable CORS, use '*' or your domain name
cors_allowed_origin=*
# required if you want to use preflight request
cors_allowed_methods=GET,POST,PUT,OPTIONS
# required if the request has an Access-Control-Request-Headers header
cors_allowed_headers=X-Custom-Header,content-type

As part of our demo we are loading the model statically; however, a more refined way is to load the model over the management URL. By storing the model in S3-like storage, it can be loaded dynamically.

curl -X POST  "http://localhost:6544/models?initial_workers=1&synchronous=true&url=s3://some_bucket/gyna42/model-store/sroie2019v1.mar"

TorchServe also comes with support for Kubernetes.

6. Vue3 WebUI

Without a UI, showcasing the whole model becomes command-line oriented, which is less intuitive.

Though Streamlit seems to be an attractive option, it lacks the power of customisation that we sometimes need.

I chose to learn Vue over other web frameworks, as usual, for its simplicity and ease of use.

  • Bulma CSS framework is used for HTML styling. Read about column styling here which is used to split the web page into different column segments
  • Axios is used for backend communications.
  • Nginx for reverse proxy and as web server.

Environment files are used to load different configurations at build time. I have used the following env files, namely linux, mac, maclocal and heroku, to set up the URLs accordingly.

VUE_APP_API_BASE_URL=http://localhost:8088
VUE_APP_TORCH_PRED_BASE_URL=http://0.0.0.0:6543
VUE_APP_TORCH_MGMNT_BASE_URL=http://0.0.0.0:6544

#End points
VUE_APP_API_OCR_TESSERACT=/gyan42/ocr/engine/pytesseract/file

Each Vue component consists of:

  • HTML
  • CSS: which plays a major role in decorating the web pages. One good thing is that there are well-defined CSS frameworks like https://bulma.io/ that give all the essential CSS classes for free
  • JavaScript with some predefined layout

To understand the Vue UI, some familiarity with the core Vue 3 concepts and with the HTML components we use below is necessary.

Once you get a working knowledge of all these components, understanding a Vue file becomes comfortable, but it takes time and practice to digest the HTML and JS interactions.

  • Sample Button
<button class="button is-link ml-5" v-on:click="onImageSample" > Sample a Test Image </button>
  • Once the sample button is clicked, it calls the onImageSample JS function, which samples an image from the static path and converts the blob data into file data, which is then passed to the backend for OCR
onImageSample() {

console.info("onImageSample")
this.startTime = performance.now()
this.timeElapsed = 0
this.modelTimeElapsed = 0
this.ocrTimeElapsed = 0

this.predictions = []

// The number of test images is 138!
const rndInt = Math.floor(Math.random() * 138) + 1
console.log(rndInt)

const blobUrlToFile = (blobUrl) => new Promise((resolve) => {
fetch(blobUrl).then((res) => {
res.blob().then((blob) => {
const fileName = blobUrl.split("/")[2]
console.info(fileName)
const file = new File([blob], fileName, {type: blob.type})
resolve(file)
})
})
})

blobUrlToFile(require("@/assets/images/test/"+rndInt+".jpg")).then( f => {
this.imageFileName = f
console.info(this.imageFileName )
this.createImage(this.imageFileName);
this.onRunOCR()
}
)
},
createImage(file) {
var reader = new FileReader();
reader.onload = (e) => {
this.imageFile = e.target.result;
};
reader.readAsDataURL(file);
},
}
  • Image holder: once the file is available through the variable imageFile, the image content is displayed on the UI
<div class="box" style="background-color:transparent;">
<p v-if="imageFile.length > 0"> <img v-bind:src="imageFile" /></p>
</div>
  • The image file data is then given to the OCR backend API, which returns the text data that gets stored in inTextData
async onRunOCR() {
this.status = "Running Tesseract"
this.ocrStartTime = performance.now()
console.info("Running Tesseract")
console.info(process.env.VUE_APP_API_BASE_URL, process.env.VUE_APP_API_OCR_TESSERACT)
let formData = new FormData();
formData.append('file', this.imageFileName);
console.info(this.imageFileName)
console.info(formData)
let headers = {
headers: {
'Content-Type': 'multipart/form-data'
},
timeout: 30000
}

api.fastapi
.post(process.env.VUE_APP_API_OCR_TESSERACT, formData, headers)

.then(res => {
console.info(res);
this.inTextData = res["data"]['text']
this.ocrTimeElapsed = performance.now() - this.ocrStartTime
this.predict()
})
.catch((err) => alert(err));
},
  • The text data is then passed to the model prediction API, which returns a list of (token, tag) pairs.
predict(){
this.status = "Running Transformer Model"
this.modelStartTime = performance.now()
api.torchserve.post("predictions/" + this.selectedTorchModel, {"text": this.inTextData}, {timeout: 20000})
.then(value => {
console.info(value["data"])
this.extractTags(value["data"])
})
},
  • The extractTags function converts the list of tuples into a list of map objects
  • Once the predictions are available in the predictions data variable, we iterate through it and create the label and input text boxes
<li v-for="prediction in predictions" :key="prediction.id">
<div class="field is-horizontal">
<div class="field-label is-normal">
<label class="label">{{prediction.tag}}</label>
</div>
<div class="field-body mr-5">
<div class="field">
<div class="control">
<input class="input" type="text" v-model=prediction.token>
</div>
</div>
</div>
</div>
<br>
</li>
  • Predictions are a list of map objects as follows:
predictions: [ {id: 1, tag: "TAG1", token: "TOKEN1"}, 
{id:2, tag: "TAG2", token: "TOKEN2"}]

7. Docker Images

A multi-stage build is used.

All docker files are found here:

COPY ui/nginx.conf /etc/nginx/nginx.conf
  • .env files are used with Vue builder for environment variables like URLs
  • https://gunicorn.org/ is used to launch the FastAPI backend as a daemon service
  • torchserve to launch Torch Serve REST API endpoints from a static model file

With that, we come to demo time: https://gitlab.com/gyan42/receipts-form-filling/-/tree/main#docker-compose

An attempt was made to run the app on the Heroku platform; however, as suspected, the RAM needs exceed 500MB, since we put Tesseract and the Transformer under one roof.

In our next POC, let’s build a model and serve it under 500MB, using PyTorch Mobile https://pytorch.org/mobile/home/ or quantization.
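For instance, dynamic quantization in PyTorch is a quick way to shrink the Transformer’s Linear weights to int8; a minimal sketch, reusing the model directory from earlier in this post:

import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "/root/.gyan42/models/hf/sroie2019v1"
)
# Quantize the Linear layers to int8 to reduce the memory footprint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "sroie2019v1-int8.pt")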

Thanks for reading and raise issues if something doesn’t work, happy to fix them any time!
