How to build a custom HuggingFace NER dataset for receipts and train it with the HuggingFace Transformers library?


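In the SROIE 2019 receipt dataset, the raw OCR annotation for each receipt lists one text box per line: the four corner coordinates of the box (x1,y1,x2,y2,x3,y3,x4,y4) followed by the transcribed text: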
72,25,326,25,326,64,72,64,TAN WOON YANN
50,82,440,82,440,121,50,121,BOOK TA .K(TAMAN DAYA) SDN BND
205,121,285,121,285,139,205,139,789417-W
110,144,383,144,383,163,110,163,NO.53 55,57 & 59, JALAN SAGU 18,
192,169,299,169,299,187,192,187,TAMAN DAYA,
162,193,334,193,334,211,162,211,81100 JOHOR BAHRU,
217,216,275,216,275,233,217,233,JOHOR.
50,342,279,342,279,359,50,359,DOCUMENT NO : TD01167104
50,372,96,372,96,390,50,390,DATE:
165,372,342,372,342,389,165,389,25/12/2018 8:13:39 PM
48,396,117,396,117,415,48,415,CASHIER:
164,397,215,397,215,413,164,413,MANIS
49,423,122,423,122,440,49,440,MEMBER:
191,460,298,460,298,476,191,476,CASH BILL
30,508,121,508,121,523,30,523,CODE/DESC
200,507,247,507,247,521,200,521,PRICE
276,506,306,506,306,522,276,522,DISC
374,507,441,507,441,521,374,521,AMOUNT
69,531,102,531,102,550,69,550,QTY
221,531,247,531,247,545,221,545,RM
420,529,443,529,443,547,420,547,RM
27,570,137,570,137,583,27,583,9556939040116
159,570,396,570,396,584,159,584,KF MODELLING CLAY KIDDY FISH
77,598,113,598,113,613,77,613,1 PC
138,597,148,597,148,607,138,607,*
202,597,245,597,245,612,202,612,9.000
275,598,309,598,309,612,275,612,0.00
411,596,443,596,443,613,411,613,9.00
245,639,293,639,293,658,245,658,TOTAL:
118,671,291,671,291,687,118,687,ROUR DING ADJUSTMENT:
408,669,443,669,443,684,408,684,0.00
86,704,292,704,292,723,86,723,ROUND D TOTAL (RM):
401,703,443,703,443,719,401,719,9.00
205,744,243,744,243,765,205,765,CASH
402,748,441,748,441,763,402,763,10.00
205,770,271,770,271,788,205,788,CHANGE
412,772,443,772,443,786,412,786,1.00
97,845,401,845,401,860,97,860,GOODS SOLD ARE NOT RETURNABLE OR
190,864,309,864,309,880,190,880,EXCHANGEABLE
142,883,353,883,353,901,142,901,***
137,903,351,903,351,920,137,920,***
202,942,292,942,292,959,202,959,THANK YOU
163,962,330,962,330,977,163,977,PLEASE COME AGAIN !
412,639,442,639,442,654,412,654,9.00
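Each line can be parsed by splitting on the first eight commas only, since the transcript itself may contain commas. A minimal sketch (parse_sroie_line is a hypothetical helper name):

def parse_sroie_line(line):
    # The first 8 comma-separated fields are the box corners; everything
    # after the 8th comma is the transcript, which may itself contain commas.
    parts = line.split(",", 8)
    coords = [int(v) for v in parts[:8]]
    text = parts[8]
    return coords, text

coords, text = parse_sroie_line("72,25,326,25,326,64,72,64,TAN WOON YANN")
# coords -> [72, 25, 326, 25, 326, 64, 72, 64], text -> "TAN WOON YANN"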
{
    "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
    "date": "25/12/2018",
    "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
    "total": "9.00"
}
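Alongside the OCR lines, SROIE provides this key-value ground truth for each receipt. To train a token classifier, the key-value pairs have to be converted into one tag per token. A deliberately naive sketch of such a conversion (real matching needs to be fuzzier, since OCR text can disagree with the ground truth; note "SDN BND" in the scan vs. "SDN BHD" in the JSON):

def tag_tokens(ocr_lines, ground_truth):
    # ocr_lines: list of (coords, text) pairs; ground_truth: the JSON above.
    # Tag every token of a line with an entity label when the whole line
    # occurs inside that entity's ground-truth value, otherwise "O".
    tagged = []
    for _, text in ocr_lines:
        label = "O"
        for entity, value in ground_truth.items():
            if text and text in value:
                label = entity
                break
        tagged.extend((token, label) for token in text.split())
    return tagged

Applying a (more forgiving) conversion of this kind to the receipt above yields the following token/tag pairs: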
TAN O
WOON O
YANN O
BOOK company
TA company
.K(TAMAN company
DAYA) company
SDN company
BND company
789417-W O
NO.53 address
55 address
57 address
& address
59 address
address
JALAN address
SAGU address
18 address
TAMAN address
DAYA address
81100 address
JOHOR address
BAHRU address
JOHOR. address
DOCUMENT O
NO O
: O
TD01167104 O
DATE: O
25/12/2018 date
8:13:39 date
PM date
CASHIER: O
MANIS O
MEMBER: O
CASH O
BILL O
CODE/DESC O
PRICE O
DISC O
AMOUNT O
QTY O
RM O
9556939040116 O
KF O
MODELLING O
CLAY O
KIDDY O
FISH O
1 O
PC O
* O
9.000 total
0.00 O
9.00 total
TOTAL: O
ROUR O
DING O
ADJUSTMENT: O
ROUND O
D O
TOTAL O
(RM): O
CASH O
10.00 O
CHANGE O
1.00 O
GOODS O
SOLD O
ARE O
NOT O
RETURNABLE O
OR O
EXCHANGEABLE O
*** O
THANK O
YOU O
PLEASE O
COME O
AGAIN O
! O

Yet another commonly used format is CoNLL; for more details see https://universaldependencies.org/format.html (this page documents CoNLL-U, the variant used by Universal Dependencies).

The de facto standard dataset for NER tasks is the CoNLL-2003 dataset:

https://www.clips.uantwerpen.be/conll2003/ner/

https://paperswithcode.com/dataset/conll-2003

Here is the GitHub link for the CoNLL-2003 dataset: https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003
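For reference, CoNLL-2003 stores one token per line, annotated with its part-of-speech tag, chunk tag and NER tag, with blank lines separating sentences:

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O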

from transformers import AutoModelForTokenClassification

# HFSREIO2019Dataset is a wrapper around SROIE2019 with extra helper methods such as id2word and word2id
hf_dataset = HFSREIO2019Dataset()
hf_model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(hf_dataset.labels))
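HFSREIO2019Dataset is custom code from the accompanying repository and is not shown here. A minimal sketch of how such a dataset can be assembled with the datasets library, assuming the token/tag pairs produced by the conversion step (train_dataset and label2id are hypothetical names):

from datasets import ClassLabel, Dataset, Features, Sequence, Value

labels = ["O", "company", "date", "address", "total"]
label2id = {label: i for i, label in enumerate(labels)}

features = Features({
    "id": Value("string"),
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=labels)),
})

# One example per receipt: the tokens and their tags encoded as ids.
train_dataset = Dataset.from_dict(
    {
        "id": ["0"],
        "tokens": [["TAN", "WOON", "YANN", "BOOK"]],
        "ner_tags": [[label2id[t] for t in ["O", "O", "O", "company"]]],
    },
    features=features,
)

Printing the train, test and validation splits of the wrapped dataset gives: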
Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 588
})
Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 82
})
Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 49
})
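After the tokenization and label-alignment step (sketched further below), the splits carry the encoded input_ids, attention_mask and labels columns: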
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'tokens'],
        num_rows: 588
    })
    validation: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'tokens'],
        num_rows: 49
    })
    test: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'tokens'],
        num_rows: 82
    })
})
# HFTokenizer is another wrapper from the accompanying code; it exposes the
# underlying pretrained tokenizer as hf_preprocessor.tokenizer
hf_preprocessor = HFTokenizer.init_vf(hf_pretrained_tokenizer_checkpoint=hf_pretrained_tokenizer_checkpoint)
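The labels column in the DatasetDict above comes from aligning the word-level tags with the subword tokens. A minimal sketch following the standard Transformers token-classification recipe, assuming hf_preprocessor.tokenizer is a fast tokenizer (raw_datasets stands for the untokenized DatasetDict):

tokenizer = hf_preprocessor.tokenizer

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)            # special tokens: ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(tags[word_idx])  # label only the first subword of a word
            else:
                label_ids.append(-100)            # remaining subwords of the same word
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)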
from transformers import DataCollatorForTokenClassification, Trainer, TrainingArguments
data_collator = DataCollatorForTokenClassification(hf_preprocessor.tokenizer)
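DataCollatorForTokenClassification pads both the inputs and the label sequences of each batch to a common length, filling the labels with -100 so the padded positions are ignored by the loss.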
args = TrainingArguments(
    "test-ner",                        # output directory
    evaluation_strategy="epoch",       # evaluate at the end of every epoch
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=max_epochs,
    weight_decay=0.01,
)
trainer = Trainer(
    hf_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=hf_preprocessor.tokenizer,
    compute_metrics=lambda p: compute_metrics(p=p, label_list=hf_dataset.labels),
)
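compute_metrics is not defined in the snippet above. A minimal sketch based on the standard Transformers token-classification example, using the seqeval metric:

import numpy as np
from datasets import load_metric

metric = load_metric("seqeval")

def compute_metrics(p, label_list):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Strip the -100 positions (special tokens and non-first subwords)
    # before scoring entity-level precision/recall/F1 with seqeval.
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return metric.compute(predictions=true_predictions, references=true_labels)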

trainer.train()
trainer.evaluate()
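Evaluation reports entity-level precision, recall and F1 per label, plus overall scores: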
{
    'date': {'precision': 0.6068376068376068, 'recall': 0.6893203883495146, 'f1': 0.6454545454545455, 'number': 103},
    'address': {'precision': 0.782608695652174, 'recall': 0.8, 'f1': 0.7912087912087912, 'number': 90},
    'company': {'precision': 0.7088607594936709, 'recall': 0.691358024691358, 'f1': 0.7, 'number': 81},
    'total': {'precision': 0.6605504587155964, 'recall': 0.7659574468085106, 'f1': 0.7093596059113301, 'number': 94},
    'overall_precision': 0.6826196473551638,
    'overall_recall': 0.7364130434782609,
    'overall_f1': 0.7084967320261438,
    'overall_accuracy': 0.969463275079296
}
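Out of the box the model reaches an overall F1 of about 0.71, with address the strongest entity (F1 ≈ 0.79) and date the weakest (F1 ≈ 0.65).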
