NLP: What does it take to design a full-stack Deep Learning based receipts form-filling system using NER?
Online Colab Notebook for model training.
Use docker compose to launch the demo.
If we are lucky, getting a dataset, building a model, and evaluating it in a development environment is pretty easy with the tools we have to throw at the problem. The real pain comes when we want to move the model to production.
Three years back, for a client, we wanted to do a POC to retrieve 13 tags from client documents with a limited training set of around 2.5K documents.
Initially CRF models were used, which were not giving satisfactory results, so we moved towards Deep Learning models using Bidirectional LSTM networks. By using both character- and word-level embeddings we got some amazing results: we were able to retrieve the 8 high-occurring tags with F1 scores around 70 to 90, while the rest of the tags were not up to the mark because of their low frequency and idiosyncrasies.
In the age of Transformers, I thought: why not revisit the topic and see how it performs on a receipts dataset, which is highly noisy and a very hard problem to tackle, as the images can contain random noise along with varying camera angles.
Here I have outlined the components that are brought together to build a NER system, one that uses scalable tools and frameworks but is focused on running on a developer machine to give a hands-on experience.
It’s not a complete system; however, it gives a glimpse of designing and building Deep Learning systems for production.
There are a lot of external resources; it’s up to the reader to decide how much to dive into them, or to skip them if they are familiar :)
1. Dataset
A sample receipt from the dataset, annotated as (token, tag) pairs:
TAN O
WOON O
YANN O
BOOK company
TA company
.K(TAMAN company
DAYA) company
SDN company
BND company
789417-W O
NO.53 address
55 address
57 address
& address
59 address
address
JALAN address
SAGU address
18 address
TAMAN address
DAYA address
81100 address
JOHOR address
BAHRU address
JOHOR. address
DOCUMENT O
NO O
: O
TD01167104 O
DATE: O
25/12/2018 date
8:13:39 date
PM date
CASHIER: O
MANIS O
MEMBER: O
CASH O
BILL O
CODE/DESC O
PRICE O
DISC O
AMOUNT O
QTY O
RM O
9556939040116 O
KF O
MODELLING O
CLAY O
KIDDY O
FISH O
1 O
PC O
* O
9.000 total
0.00 O
9.00 total
TOTAL: O
ROUR O
DING O
ADJUSTMENT: O
ROUND O
D O
TOTAL O
(RM): O
CASH O
10.00 O
CHANGE O
1.00 O
GOODS O
SOLD O
ARE O
NOT O
RETURNABLE O
OR O
EXCHANGEABLE O
*** O
THANK O
YOU O
PLEASE O
COME O
AGAIN O
! O
The dataset we are going to use today is from the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE).
Preparing the dataset is substantial work in itself, which I covered in my previous blog @ https://mageswaran1989.medium.com/how-to-build-custom-ner-huggingface-dataset-for-receipts-and-train-with-huggingface-transformers-6c954b84473c
It is highly recommended to read it.
- The HFSREIO2019Dataset class is a custom HuggingFace dataset wrapper, which downloads the SREIO2019 data from my GitHub repo and prepares a list of (token, tag) pairs.
- The CoNLL-formatted dataset is then loaded, and batches are created with DataCollatorForTokenClassification after preprocessing with the HFTokenizer.
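Since WordPiece tokenization splits a word like MODELLING into several subword pieces, the word-level tags must be realigned to the subword tokens before batching. A minimal sketch of that alignment logic (the function name align_tags and the helper itself are illustrative of what the preprocessing step does; the actual logic lives in HFTokenizer):

```python
IGNORE = -100  # label index ignored by PyTorch's cross-entropy loss

def align_tags(subwords, word_tag_ids):
    """Spread word-level tag ids over WordPiece subword tokens.

    The first piece of each word keeps the word's tag id; continuation
    pieces (prefixed with '##') get IGNORE so they don't contribute
    to the loss.
    """
    aligned, word_idx = [], -1
    for piece in subwords:
        if piece.startswith("##"):   # continuation of the previous word
            aligned.append(IGNORE)
        else:                        # first piece of a new word
            word_idx += 1
            aligned.append(word_tag_ids[word_idx])
    return aligned

# "MODELLING CLAY" -> ["modell", "##ing", "clay"], both words tagged 0 ("O")
print(align_tags(["modell", "##ing", "clay"], [0, 0]))  # [0, -100, 0]
```

Using -100 means the loss function skips continuation pieces, so each word is scored exactly once.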
2. Deep Learning Model with HuggingFace Transformers
distilbert-base-uncased is used as the Transformer model, loaded with the AutoModelForTokenClassification class of HuggingFace.
Training the model follows the usual routine, so I am going to skip it.
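Whichever training loop you use, it is worth reporting per-tag precision/recall/F1 rather than one aggregate number, since low-frequency tags (as in the LSTM experiment above) are exactly where NER models fall apart. A hand-rolled token-level sketch (the names gold, pred and tag_f1 are mine; for proper entity-level scores use a library such as seqeval):

```python
def tag_f1(gold, pred, tag):
    """Token-level precision/recall/F1 for a single tag."""
    tp = sum(1 for g, p in zip(gold, pred) if g == tag and p == tag)
    fp = sum(1 for g, p in zip(gold, pred) if g != tag and p == tag)
    fn = sum(1 for g, p in zip(gold, pred) if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["company", "company", "O", "date", "O"]
pred = ["company", "O",       "O", "date", "date"]
print(tag_f1(gold, pred, "date"))  # (0.5, 1.0, 0.6666666666666666)
```

Running this per tag quickly shows which labels carry the headline F1 and which are being missed.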
There is an online Colab version which can be used to build the model @ https://colab.research.google.com/gist/Mageswaran1989/442d575d8f5ca11b7ae12b1b061a04d3/receiptsautoformfilling.ipynb
3. OCR
Google Tesseract, via the PyTesseract package, is used to convert the image into text, for its simplicity and ease of use.
Tesseract is an easy option for most OCR tasks, though there are other Deep Learning options that can be trained for specific datasets.
Remember, Tesseract also comes with an LSTM engine which can be trained on custom datasets.
import pytesseract
import cv2

custom_config = r'--oem 3 --psm 6'
img = cv2.imread(file_path)
text = pytesseract.image_to_string(img, lang='eng', config=custom_config)
- The --oem argument, or OCR Engine Mode, controls the type of algorithm used by Tesseract.
tesseract --help-oem   # OCR Engine modes:
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
- The --psm argument controls the automatic Page Segmentation Mode used by Tesseract.
tesseract --help-psm   # Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
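Both flags end up as plain text in the config string handed to pytesseract, so a small helper can keep them readable and validated (a sketch; the helper name tess_config is mine, not part of pytesseract):

```python
def tess_config(oem=3, psm=6, extra=""):
    """Build the Tesseract CLI config string passed to pytesseract.

    oem: OCR Engine Mode (0-3), psm: Page Segmentation Mode (0-13).
    """
    if not (0 <= oem <= 3 and 0 <= psm <= 13):
        raise ValueError("oem must be 0-3 and psm 0-13")
    return f"--oem {oem} --psm {psm} {extra}".strip()

print(tess_config())        # --oem 3 --psm 6
print(tess_config(psm=11))  # --oem 3 --psm 11 (sparse text, e.g. crumpled receipts)
```

The returned string is what goes into the `config=` argument of `pytesseract.image_to_string`.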
4. FastAPI Backend
When the user uploads an image, it needs to be converted to text, right? So the image is opened in the browser and the data is sent to a Python backend process to OCR it.
FastAPI is used as the REST API backend for its type-safety features and the ease of testing endpoints with its built-in features.
- Setting up CORS is important to allow local testing; however, this needs to be changed to meet production requirements.
ALLOWED_ORIGINS = ["*"]

app.add_middleware(
CORSMiddleware,
allow_origins=ALLOWED_ORIGINS,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
- Pydantic is used to model the JSON requests, and this is where FastAPI shines over Flask.
- Keeping all endpoints in one file is no way to handle big projects, so what's the better way? The FastAPI router.
- The FastAPI router is used to keep the endpoint files as separate modules. This design greatly helps in maintaining and adding new features without altering the main file after the initial setup. Code @ https://gitlab.com/gyan42/receipts-form-filling/-/blob/main/api/routers/ocr/tesseract/tesseract.py
5. TorchServe
OK, so we have trained a model. How do we use it to predict in the real world? How do we put it behind a REST API endpoint? How do we scale it? How do we update it after deployment?
I will leave you to explore TorchServe and its benefits ;)
To serve an NLP model, all we need is the model weights, the vocabulary used, and the tokenizer; luckily, HuggingFace saves all of these for us:
!ls ~/.gyan42/models/hf/sroie2019v1/
config.json  special_tokens_map.json  training_args.bin  pytorch_model.bin  tokenizer_config.json  vocab.txt
TorchServe has a tool called torch-model-archiver which packages the model as a self-contained archive, which can then be invoked independently.
%set_env SERIALIZED_MODEL_FILE=/root/.gyan42/models/hf/sroie2019v1/pytorch_model.bin

!torch-model-archiver --force \
--model-name sroie2019v1 \
--version 1.0 \
--serialized-file $SERIALIZED_MODEL_FILE \
--handler gyan42/serving/handler/hf_transformer_handler.py \
--extra-files /root/.gyan42/models/hf/sroie2019v1/config.json,/root/.gyan42/models/hf/sroie2019v1/special_tokens_map.json,/root/.gyan42/models/hf/sroie2019v1/training_args.bin,/root/.gyan42/models/hf/sroie2019v1/tokenizer_config.json,/root/.gyan42/models/hf/sroie2019v1/vocab.txt \
--export-path /root/.gyan42/model-store/
Wait, what is hf_transformer_handler.py?
The part I like best about TorchServe is its ease of packaging models, unlike TensorFlow (no comments!).
So what typically needs to be done to handle an incoming HTTP request?
- Load the model as part of initialization
- Extract the text from incoming request
- Preprocess the text
- Run inference through the model
- Post process the results
- Send back the reply
This is what is done in hf_transformer_handler.py.
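The six steps above map directly onto TorchServe's handler contract. A stripped-down, dependency-free skeleton of the same shape (the method names follow TorchServe's initialize/preprocess/inference/postprocess convention, but the dummy model and the class itself are illustrative, not the real hf_transformer_handler.py):

```python
class TokenClassifierHandler:
    """Skeleton of a TorchServe-style custom handler."""

    def __init__(self):
        self.model = None
        self.initialized = False

    def initialize(self, context=None):
        # 1. Load model/tokenizer artifacts once, at worker start-up.
        #    (The real handler loads the HuggingFace model from the .mar extra-files.)
        self.model = lambda tokens: [(t, "O") for t in tokens]  # dummy stand-in
        self.initialized = True

    def preprocess(self, request):
        # 2-3. Extract and normalize the text from the request payload.
        return request.get("text", "").split()

    def inference(self, tokens):
        # 4. Run the model forward pass.
        return self.model(tokens)

    def postprocess(self, outputs):
        # 5-6. Shape (token, tag) pairs into the JSON reply.
        return [{"token": tok, "tag": tag} for tok, tag in outputs]

    def handle(self, request):
        if not self.initialized:
            self.initialize()
        return self.postprocess(self.inference(self.preprocess(request)))

handler = TokenClassifierHandler()
print(handler.handle({"text": "TOTAL: 9.00"}))
# [{'token': 'TOTAL:', 'tag': 'O'}, {'token': '9.00', 'tag': 'O'}]
```

TorchServe calls initialize once per worker and then routes each request through the same preprocess/inference/postprocess pipeline.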
When TorchServe serves the model, it loads the handler, which knows how to load the artifacts for our model and how to process the incoming request. Neat and simple, isn't it?
torch-model-archiver creates a file with the extension .mar, which stands for Model Archive File, which can then be served with:
torchserve --start --model-store data/model-store --models all --ts-config configs/torch_serve_config.properties --foreground
The above command loads all the models under the directory data/model-store.
TorchServe can be configured with three URLs, namely:
inference_address = http://0.0.0.0:6543
management_address = http://0.0.0.0:6544
metrics_address = http://0.0.0.0:6545
Model predictions are done at http://0.0.0.0:6543/predictions/{model_name}
The list of models can be fetched from http://0.0.0.0:6544/models
The management URL has features to load, scale, and delete models over REST endpoints, which comes in handy when we want to manage models remotely.
And don’t forget to configure CORS for the served REST API endpoints:
# cors_allowed_origin is required to enable CORS, use '*' or your domain name
cors_allowed_origin=*
# required if you want to use preflight request
cors_allowed_methods=GET,POST,PUT,OPTIONS
# required if the request has an Access-Control-Request-Headers header
cors_allowed_headers=X-Custom-Header,content-type
As part of our demo we load the model statically; however, a more refined way is to load the model over the management URL. By storing the model in S3-like storage, it can be loaded dynamically:
curl -X POST "http://localhost:6544/models?initial_workers=1&synchronous=true&url=s3://some_bucket/gyna42/model-store/sroie2019v1.mar"
TorchServe also comes with support for Kubernetes.
6. Vue3 WebUI
Without a UI, the whole model showcase becomes command-line oriented, which is less intuitive.
Though Streamlit seems an attractive option, it lacks the power of customisation that we sometimes need.
I chose to learn Vue over other web frameworks, as usual, for its simplicity and ease of use.
- The Bulma CSS framework is used for HTML styling. Read about column styling here, which is used to split the web page into different column segments.
- Axios is used for backend communications.
- Nginx for reverse proxy and as web server.
Environment files are used to load different configurations at build time. I have used the following env files, namely linux, mac, maclocal and heroku, to set up URLs accordingly.
VUE_APP_API_BASE_URL=http://localhost:8088
VUE_APP_TORCH_PRED_BASE_URL=http://0.0.0.0:6543
VUE_APP_TORCH_MGMNT_BASE_URL=http://0.0.0.0:6544
# Endpoints
VUE_APP_API_OCR_TESSERACT=/gyan42/ocr/engine/pytesseract/file
Each Vue component consists of:
- HTML
- CSS: which plays a major role in decorating the web pages. One good thing is that there are well-defined CSS frameworks like https://bulma.io/ that give all the essential CSS classes for free
- JavaScript with some predefined layout
To understand the Vue UI, the following Vue 3 concepts and HTML components are necessary. Once you get knowledge of all these components, understanding a Vue file becomes comfortable, but it takes time and practice to digest the HTML and JS interactions.
- Sample Button
<button class="button is-link ml-5" v-on:click="onImageSample" > Sample a Test Image </button>
- Once the sample button is clicked, it calls the onImageSample JS function, which samples an image from a static path and converts the blob data into a file, which is then passed to the backend for OCR:
onImageSample() {
console.info("onImageSample")
this.startTime = performance.now()
this.timeElapsed = 0
this.modelTimeElapsed = 0
this.ocrTimeElapsed = 0
this.predictions = []
// Number of test images is 138
const rndInt = Math.floor(Math.random() * 138) + 1
console.log(rndInt)
const blobUrlToFile = (blobUrl) => new Promise((resolve) => {
fetch(blobUrl).then((res) => {
res.blob().then((blob) => {
const fileName = blobUrl.split("/")[2]
console.info(fileName)
const file = new File([blob], fileName, {type: blob.type})
resolve(file)
})
})
})
blobUrlToFile(require("@/assets/images/test/"+rndInt+".jpg")).then( f => {
this.imageFileName = f
console.info(this.imageFileName )
this.createImage(this.imageFileName);
this.onRunOCR()
}
)
},
createImage(file) {
var reader = new FileReader();
reader.onload = (e) => {
this.imageFile = e.target.result;
};
reader.readAsDataURL(file);
},
}
- Image holder: once the file is available through the variable imageFile, the image content is displayed on the UI:
<div class="box" style="background-color:transparent;">
<p v-if="imageFile.length > 0"> <img v-bind:src="imageFile" /></p>
</div>
- The image file data is then given to the OCR backend API, which returns the text data, stored in inTextData:
async onRunOCR() {
this.status = "Running Tesseract"
this.ocrStartTime = performance.now()
console.info("Running Tesseract")
console.info(process.env.VUE_APP_API_BASE_URL, process.env.VUE_APP_API_OCR_TESSERACT)
let formData = new FormData();
formData.append('file', this.imageFileName);
console.info(this.imageFileName)
console.info(formData)
let headers = {
headers: {
'Content-Type': 'multipart/form-data'
},
timeout: 30000
}
api.fastapi
.post(process.env.VUE_APP_API_OCR_TESSERACT, formData, headers)
.then(res => {
console.info(res);
this.inTextData = res["data"]['text']
this.ocrTimeElapsed = performance.now() - this.ocrStartTime
this.predict()
})
.catch((err) => alert(err));
},
- The text data is then passed to the model prediction API, which returns a list of (token, tag) tuples.
predict(){
this.status = "Running Transformer Model"
this.modelStartTime = performance.now()
api.torchserve.post("predictions/" + this.selectedTorchModel, {"text": this.inTextData}, {timeout: 20000})
.then(value => {
console.info(value["data"])
this.extractTags(value["data"])
})
},
- The extractTags function converts the list of tuples into a list of map objects.
- Once the predictions are available in the predictions data variable, we iterate through it and create the label and input text boxes:
<li v-for="prediction in predictions" :key="prediction.id">
<div class="field is-horizontal">
<div class="field-label is-normal">
<label class="label">{{prediction.tag}}</label>
</div>
<div class="field-body mr-5">
<div class="field">
<div class="control">
<input class="input" type="text" v-model=prediction.token>
</div>
</div>
</div>
</div>
<br>
</li>
- Predictions are a list of map objects, as follows:
predictions: [ {id: 1, tag: "TAG1", token: "TOKEN1"},
{id:2, tag: "TAG2", token: "TOKEN2"}]
7. Docker Images
All Docker files can be found in the repository.
- A multistage build is used to reduce the size of the images
- Nginx is configured inside Docker as a reverse proxy, which is a must for the Vue UI to work out of Docker
COPY ui/nginx.conf /etc/nginx/nginx.conf
- .env files are used with the Vue builder for environment variables like URLs
- https://gunicorn.org/ is used to launch the FastAPI backend as a daemon service
- torchserve is used to launch the TorchServe REST API endpoints from a static model file
With that, we come to demo time: https://gitlab.com/gyan42/receipts-form-filling/-/tree/main#docker-compose
An attempt was made to run the app on the Heroku platform; however, as suspected, the RAM needs are greater than 500MB, since we put Tesseract and the Transformer under one roof.
For our next POC, let's build models and serve them under 500MB, with PyTorch Mobile https://pytorch.org/mobile/home/ or quantization.
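Dynamic quantization is the cheapest of those options: it converts the Linear layers, which hold most of a Transformer's weights, to int8 for roughly a 4x size reduction. A toy sketch on a stand-in model (the tiny Sequential network here is only a placeholder for the real Transformer):

```python
import torch
from torch import nn

# A toy stand-in for the Transformer: quantize its Linear layers to int8.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 16)
print(quantized(x).shape)  # torch.Size([1, 4])
```

The quantized model is a drop-in replacement for inference and can be archived and served the same way as before.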
Thanks for reading, and raise issues if something doesn't work; happy to fix them any time!