How to build a custom HuggingFace NER dataset for receipts and train it with the HuggingFace Transformers library?


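In the SROIE 2019 receipt dataset, the raw OCR annotation for each receipt lists one text box per line: the four corner coordinates of the box (x1,y1,x2,y2,x3,y3,x4,y4) followed by the transcribed text: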
72,25,326,25,326,64,72,64,TAN WOON YANN
50,82,440,82,440,121,50,121,BOOK TA .K(TAMAN DAYA) SDN BND
205,121,285,121,285,139,205,139,789417-W
110,144,383,144,383,163,110,163,NO.53 55,57 & 59, JALAN SAGU 18,
192,169,299,169,299,187,192,187,TAMAN DAYA,
162,193,334,193,334,211,162,211,81100 JOHOR BAHRU,
217,216,275,216,275,233,217,233,JOHOR.
50,342,279,342,279,359,50,359,DOCUMENT NO : TD01167104
50,372,96,372,96,390,50,390,DATE:
165,372,342,372,342,389,165,389,25/12/2018 8:13:39 PM
48,396,117,396,117,415,48,415,CASHIER:
164,397,215,397,215,413,164,413,MANIS
49,423,122,423,122,440,49,440,MEMBER:
191,460,298,460,298,476,191,476,CASH BILL
30,508,121,508,121,523,30,523,CODE/DESC
200,507,247,507,247,521,200,521,PRICE
276,506,306,506,306,522,276,522,DISC
374,507,441,507,441,521,374,521,AMOUNT
69,531,102,531,102,550,69,550,QTY
221,531,247,531,247,545,221,545,RM
420,529,443,529,443,547,420,547,RM
27,570,137,570,137,583,27,583,9556939040116
159,570,396,570,396,584,159,584,KF MODELLING CLAY KIDDY FISH
77,598,113,598,113,613,77,613,1 PC
138,597,148,597,148,607,138,607,*
202,597,245,597,245,612,202,612,9.000
275,598,309,598,309,612,275,612,0.00
411,596,443,596,443,613,411,613,9.00
245,639,293,639,293,658,245,658,TOTAL:
118,671,291,671,291,687,118,687,ROUR DING ADJUSTMENT:
408,669,443,669,443,684,408,684,0.00
86,704,292,704,292,723,86,723,ROUND D TOTAL (RM):
401,703,443,703,443,719,401,719,9.00
205,744,243,744,243,765,205,765,CASH
402,748,441,748,441,763,402,763,10.00
205,770,271,770,271,788,205,788,CHANGE
412,772,443,772,443,786,412,786,1.00
97,845,401,845,401,860,97,860,GOODS SOLD ARE NOT RETURNABLE OR
190,864,309,864,309,880,190,880,EXCHANGEABLE
142,883,353,883,353,901,142,901,***
137,903,351,903,351,920,137,920,***
202,942,292,942,292,959,202,959,THANK YOU
163,962,330,962,330,977,163,977,PLEASE COME AGAIN !
412,639,442,639,442,654,412,654,9.00
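Each line can be parsed by splitting on the first eight commas only, since the transcript itself may contain commas. A minimal sketch (parse_sroie_line is a hypothetical helper name):

def parse_sroie_line(line):
    # The first 8 comma-separated fields are the box corners; everything
    # after the 8th comma is the transcript, which may itself contain commas.
    parts = line.split(",", 8)
    coords = [int(v) for v in parts[:8]]
    text = parts[8]
    return coords, text

coords, text = parse_sroie_line("72,25,326,25,326,64,72,64,TAN WOON YANN")
# coords -> [72, 25, 326, 25, 326, 64, 72, 64], text -> "TAN WOON YANN"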
{
    "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
    "date": "25/12/2018",
    "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
    "total": "9.00"
}
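Alongside the OCR lines, SROIE provides this key-value ground truth for each receipt. To train a token classifier, the key-value pairs have to be converted into one tag per token. A deliberately naive sketch of such a conversion (real matching needs to be fuzzier, since OCR text can disagree with the ground truth; note "SDN BND" in the scan vs. "SDN BHD" in the JSON):

def tag_tokens(ocr_lines, ground_truth):
    # ocr_lines: list of (coords, text) pairs; ground_truth: the JSON above.
    # Tag every token of a line with an entity label when the whole line
    # occurs inside that entity's ground-truth value, otherwise "O".
    tagged = []
    for _, text in ocr_lines:
        label = "O"
        for entity, value in ground_truth.items():
            if text and text in value:
                label = entity
                break
        tagged.extend((token, label) for token in text.split())
    return tagged

Applying a (more forgiving) conversion of this kind to the receipt above yields the following token/tag pairs: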
TAN O
WOON O
YANN O
BOOK company
TA company
.K(TAMAN company
DAYA) company
SDN company
BND company
789417-W O
NO.53 address
55 address
57 address
& address
59 address
address
JALAN address
SAGU address
18 address
TAMAN address
DAYA address
81100 address
JOHOR address
BAHRU address
JOHOR. address
DOCUMENT O
NO O
: O
TD01167104 O
DATE: O
25/12/2018 date
8:13:39 date
PM date
CASHIER: O
MANIS O
MEMBER: O
CASH O
BILL O
CODE/DESC O
PRICE O
DISC O
AMOUNT O
QTY O
RM O
9556939040116 O
KF O
MODELLING O
CLAY O
KIDDY O
FISH O
1 O
PC O
* O
9.000 total
0.00 O
9.00 total
TOTAL: O
ROUR O
DING O
ADJUSTMENT: O
ROUND O
D O
TOTAL O
(RM): O
CASH O
10.00 O
CHANGE O
1.00 O
GOODS O
SOLD O
ARE O
NOT O
RETURNABLE O
OR O
EXCHANGEABLE O
*** O
THANK O
YOU O
PLEASE O
COME O
AGAIN O
! O

Yet another commonly used format is CoNLL; for more details see https://universaldependencies.org/format.html (this page documents CoNLL-U, the variant used by Universal Dependencies).

The de facto standard dataset for NER tasks is the CoNLL-2003 dataset:

https://www.clips.uantwerpen.be/conll2003/ner/

https://paperswithcode.com/dataset/conll-2003

Here is the GitHub link for the CoNLL-2003 dataset: https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003
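For reference, CoNLL-2003 stores one token per line, annotated with its part-of-speech tag, chunk tag and NER tag, with blank lines separating sentences:

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O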

from transformers import AutoModelForTokenClassification

# HFSREIO2019Dataset is a wrapper around SROIE2019 with extra helper methods such as id2word and word2id
hf_dataset = HFSREIO2019Dataset()
hf_model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(hf_dataset.labels))
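HFSREIO2019Dataset is custom code from the accompanying repository and is not shown here. A minimal sketch of how such a dataset can be assembled with the datasets library, assuming the token/tag pairs produced by the conversion step (train_dataset and label2id are hypothetical names):

from datasets import ClassLabel, Dataset, Features, Sequence, Value

labels = ["O", "company", "date", "address", "total"]
label2id = {label: i for i, label in enumerate(labels)}

features = Features({
    "id": Value("string"),
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=labels)),
})

# One example per receipt: the tokens and their tags encoded as ids.
train_dataset = Dataset.from_dict(
    {
        "id": ["0"],
        "tokens": [["TAN", "WOON", "YANN", "BOOK"]],
        "ner_tags": [[label2id[t] for t in ["O", "O", "O", "company"]]],
    },
    features=features,
)

Printing the train, test and validation splits of the wrapped dataset gives: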
Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 588
})
Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 82
})
Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 49
})
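After the tokenization and label-alignment step (sketched further below), the splits carry the encoded input_ids, attention_mask and labels columns: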
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'tokens'],
        num_rows: 588
    })
    validation: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'tokens'],
        num_rows: 49
    })
    test: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'tokens'],
        num_rows: 82
    })
})
# HFTokenizer is another wrapper from the accompanying code; it exposes the
# underlying pretrained tokenizer as hf_preprocessor.tokenizer
hf_preprocessor = HFTokenizer.init_vf(hf_pretrained_tokenizer_checkpoint=hf_pretrained_tokenizer_checkpoint)
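The labels column in the DatasetDict above comes from aligning the word-level tags with the subword tokens. A minimal sketch following the standard Transformers token-classification recipe, assuming hf_preprocessor.tokenizer is a fast tokenizer (raw_datasets stands for the untokenized DatasetDict):

tokenizer = hf_preprocessor.tokenizer

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)            # special tokens: ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(tags[word_idx])  # label only the first subword of a word
            else:
                label_ids.append(-100)            # remaining subwords of the same word
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)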
from transformers import DataCollatorForTokenClassification, Trainer, TrainingArguments
data_collator = DataCollatorForTokenClassification(hf_preprocessor.tokenizer)
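DataCollatorForTokenClassification pads both the inputs and the label sequences of each batch to a common length, filling the labels with -100 so the padded positions are ignored by the loss.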
args = TrainingArguments(
    "test-ner",                        # output directory
    evaluation_strategy="epoch",       # evaluate at the end of every epoch
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=max_epochs,
    weight_decay=0.01,
)
trainer = Trainer(
    hf_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=hf_preprocessor.tokenizer,
    compute_metrics=lambda p: compute_metrics(p=p, label_list=hf_dataset.labels),
)
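compute_metrics is not defined in the snippet above. A minimal sketch based on the standard Transformers token-classification example, using the seqeval metric:

import numpy as np
from datasets import load_metric

metric = load_metric("seqeval")

def compute_metrics(p, label_list):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Strip the -100 positions (special tokens and non-first subwords)
    # before scoring entity-level precision/recall/F1 with seqeval.
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return metric.compute(predictions=true_predictions, references=true_labels)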

trainer.train()
trainer.evaluate()
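Evaluation reports entity-level precision, recall and F1 per label, plus overall scores: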
{
    'date': {'precision': 0.6068376068376068, 'recall': 0.6893203883495146, 'f1': 0.6454545454545455, 'number': 103},
    'address': {'precision': 0.782608695652174, 'recall': 0.8, 'f1': 0.7912087912087912, 'number': 90},
    'company': {'precision': 0.7088607594936709, 'recall': 0.691358024691358, 'f1': 0.7, 'number': 81},
    'total': {'precision': 0.6605504587155964, 'recall': 0.7659574468085106, 'f1': 0.7093596059113301, 'number': 94},
    'overall_precision': 0.6826196473551638,
    'overall_recall': 0.7364130434782609,
    'overall_f1': 0.7084967320261438,
    'overall_accuracy': 0.969463275079296
}
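Out of the box the model reaches an overall F1 of about 0.71, with address the strongest entity (F1 ≈ 0.79) and date the weakest (F1 ≈ 0.65).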
