Big Data Play Ground For Engineers: Flask Annotation Tool for Text Classification

Mageswaran D
4 min read · Apr 18, 2020
Data Flow

This is part of the series called Big Data Playground for Engineers and the content page is here!

Git:

A fully functional code base and use case examples are up and running.

Repo: https://github.com/gyan42/spark-streaming-playground

Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html

Data without labels is of little interest for Machine Learning, and for Deep Learning it's a big no!

Preparing labelled data is not child's play…

  • Domain understanding is required
  • Lots of human hours need to be put in to prepare the dataset, accounting for human error in the process

More professional managed services are available like…

In this post, let's see how to create a simple Flask-based annotation tool that reads data from a PostgreSQL DB table and stores the label data back to it.

If you know the basics of Flask, that's more than enough to try this out and learn something from it.

Example Code Repo @

PostgreSQL DB

Let's start with setting up the PostgreSQL DB…

- install Ubuntu packages
`sudo apt-get install postgresql postgresql-contrib`

- check the version
`sudo -u postgres psql -c "SELECT version();"`
sudo su - postgres
psql  # launch the psql terminal
-- drop user sparkstreaming;
CREATE USER tagger WITH PASSWORD 'tagger';
\du
CREATE DATABASE taggerdb;
GRANT ALL PRIVILEGES ON DATABASE taggerdb TO tagger;
\list
\q
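With the user and database in place, connecting from Python is straightforward. A minimal sketch (the user/database names come from the psql commands above; host and port are assumed defaults, and psycopg2 is one common choice of driver):

```python
# Sketch: build a libpq-style DSN for the user/database created above.
# "tagger"/"taggerdb" match the psql commands; host/port are assumed defaults.
def make_dsn(user="tagger", password="tagger", dbname="taggerdb",
             host="localhost", port=5432):
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"

# With psycopg2 installed, a connection would then be:
# import psycopg2
# conn = psycopg2.connect(make_dsn())
```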

Upload the Data to DB

Run the script to upload the parquet data into the DB

python dataset_base.py

sudo su - postgres
psql
\dt
select count(*) from train_0;

Okay, good, now our data is ready.
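For reference, the core of such an upload script boils down to reading the parquet file with pandas and pushing it into a table. This is only a sketch of the idea (dataset_base.py's actual internals aren't shown here, and the function name is mine):

```python
import pandas as pd

def upload_df(df, table_name, con, chunksize=1000):
    """Append a DataFrame into a DB table in chunks.

    `con` can be any connection/engine pandas accepts — for PostgreSQL,
    typically a SQLAlchemy engine. A table name like 'train_0' matches
    the table queried above.
    """
    df.to_sql(table_name, con=con, if_exists="append",
              index=False, chunksize=chunksize)

# In the real script the DataFrame would come from a parquet file:
# df = pd.read_parquet("path/to/train_0.parquet")
# upload_df(df, "train_0", engine)
```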

The data is nothing but tweets collected with Tweepy, and the schema is as follows:

id_str : Tweet ID
created_at : Date and time
source object : Device Type
text : Tweet text
expanded_url : Url text
media_url_https : Media URL
hash : string hash value
slabel : Snorkel label (refer previous post for the details)
text_id : Integer IDs
label : Copy of slabel column gets updated with human annotation
id : int64
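Going by that schema, the backing table would look roughly like the DDL below. This is a sketch only — dataset_base.py may generate the table differently, and the SQL column types are assumptions:

```python
# Sketch: DDL matching the schema above (SQL types are assumptions).
CREATE_TRAIN_TABLE = """
CREATE TABLE IF NOT EXISTS train_0 (
    id              BIGINT,
    id_str          TEXT,     -- tweet ID
    created_at      TEXT,     -- date and time
    source          TEXT,     -- device type
    text            TEXT,     -- tweet text
    expanded_url    TEXT,     -- URL text
    media_url_https TEXT,     -- media URL
    hash            TEXT,     -- string hash value
    slabel          INTEGER,  -- Snorkel label
    text_id         INTEGER,  -- integer ID
    label           INTEGER   -- copy of slabel, updated by human annotation
);
"""
```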

Flask

python app.py

Check out the URL on your local machine @ http://0.0.0.0:8766/

As you can see, the web app reads the tables (train/dev/test) from the database and provides a way to annotate them.

It uses two colors to differentiate the classification labels, and the label values can be configured through a Gin config file called tagger.gin.

Gin config is used for configurations!

Let's dive into a little flashy stuff…

For sure the data is not going to fit on a single page, so we need to show a subset of the data per page, which is called pagination in web-app terms!
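At its core, pagination is just a LIMIT/OFFSET computation over the table. A minimal sketch (function and parameter names are mine, not from the repo; real code should parameterise the query):

```python
def page_window(page, per_page=20):
    """Map a 1-indexed page number to SQL LIMIT/OFFSET values."""
    if page < 1:
        raise ValueError("pages are 1-indexed")
    return per_page, (page - 1) * per_page

def page_query(table, page, per_page=20):
    # Shown inline for clarity; use query parameters in production code.
    limit, offset = page_window(page, per_page)
    return (f"SELECT text_id, text, label FROM {table} "
            f"ORDER BY text_id LIMIT {limit} OFFSET {offset}")
```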

Yet another small language in the toolbox…

Jinja templates are used to interpret the data sent from Python in the HTML page. They make it possible to work with loops and variables in an HTML page.
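For instance, looping over rows sent from Python looks like this in a Jinja2 template (a standalone sketch using jinja2 directly; in the app, the template would live in an HTML file rendered by Flask's render_template):

```python
from jinja2 import Template

# A stripped-down version of the kind of template the annotation page uses:
# loop over rows and show each tweet's text with its current label.
TEMPLATE = Template(
    "<ul>"
    "{% for row in rows %}"
    "<li>{{ row['text'] }} (label={{ row['label'] }})</li>"
    "{% endfor %}"
    "</ul>"
)

html = TEMPLATE.render(rows=[{"text": "hello", "label": 0},
                             {"text": "world", "label": 1}])
```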

Read a little about HTML forms @

The logic is to group the text, text IDs, label IDs, page number, offset, etc. as part of the HTML form (similar to a class object in Python, you could say).

When the page is rendered for the first time, pagination is applied and the details are sent to the layout, where some of them are lists of values; these values are then interpreted with Jinja2 templates on the HTML page.

When the user updates the labels and hits the Submit button, all the text, label, and page-info values are sent as POST form data to the same URL.

The POST request is then decoded: the form values are converted to dictionaries, and the key/value pairs are used to update the table through the PostgreSQL DB connection object.
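A sketch of that decoding step (the "label_&lt;text_id&gt;" field-naming convention here is hypothetical — the repo's actual form field names may differ):

```python
def form_to_label_updates(form):
    """Convert POSTed form fields like 'label_<text_id>' into
    (label, text_id) pairs, ready for a parameterised UPDATE."""
    updates = []
    for key, value in form.items():
        if key.startswith("label_"):
            text_id = int(key[len("label_"):])
            updates.append((int(value), text_id))
    return updates

# Each pair would then be applied through the DB connection, e.g.:
# cursor.executemany("UPDATE train_0 SET label = %s WHERE text_id = %s", updates)
```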

Download the Data from DB

Run the following command to download the data from PostgreSQL as parquet…

python dataset_base.py --mode=download
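As with the upload, the download side presumably boils down to read_sql plus to_parquet. A sketch, with an assumed function name:

```python
import pandas as pd

def download_table(table_name, con):
    """Read a whole table back into a DataFrame.

    `con` is any connection/engine pandas accepts; for PostgreSQL,
    typically a SQLAlchemy engine.
    """
    return pd.read_sql(f"SELECT * FROM {table_name}", con)

# The script would then persist the result as parquet:
# download_table("train_0", engine).to_parquet("train_0.parquet")
```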

Hope you enjoyed the blog! See you in the next post…
