Big Data Play Ground For Engineers: Flask Annotation Tool for Text Classification

Mageswaran D
4 min read · Apr 18, 2020
Data Flow

This is part of the series called Big Data Playground for Engineers and the content page is here!

Git:

A fully functional code base and use case examples are up and running.

Repo: https://github.com/gyan42/spark-streaming-playground

Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html

Data without labels is of little interest for Machine Learning, and for Deep Learning it's a big no!

Preparing labelled data is not child's play…

  • Domain understanding is required
  • Lots of human hours need to be put in to prepare the dataset, accounting for human error in the process

More professional managed services are available like…

In this post, let's see how to create a simple Flask-based annotation tool that reads data from a PostgreSQL DB table and stores the label data back to it.

If you know the basics of Flask, that's more than enough to try this out and learn something from it.

Example Code Repo @

PostgreSQL DB

Let's start with setting up the PostgreSQL DB…

- install Ubuntu packages
`sudo apt-get install postgresql postgresql-contrib`

- check the version
`sudo -u postgres psql -c "SELECT version();"`
sudo su - postgres
psql  # launch the psql terminal
-- drop user sparkstreaming;
CREATE USER tagger WITH PASSWORD 'tagger';
\du
CREATE DATABASE taggerdb;
GRANT ALL PRIVILEGES ON DATABASE taggerdb TO tagger;
\list
\q
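With the user and database in place, connecting from Python is straightforward. A minimal sketch (the user/database names come from the psql commands above; host and port are assumed defaults, and psycopg2 is one common choice of driver):

```python
# Sketch: build a libpq-style DSN for the user/database created above.
# "tagger"/"taggerdb" match the psql commands; host/port are assumed defaults.
def make_dsn(user="tagger", password="tagger", dbname="taggerdb",
             host="localhost", port=5432):
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"

# With psycopg2 installed, a connection would then be:
# import psycopg2
# conn = psycopg2.connect(make_dsn())
```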

Upload the Data to DB

Run the script to upload the parquet data into the DB

python dataset_base.py

sudo su - postgres
psql
\dt
select count(*) from train_0;

Okay, good, now our data is ready.
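For reference, the core of such an upload script boils down to reading the parquet file with pandas and pushing it into a table. This is only a sketch of the idea (dataset_base.py's actual internals aren't shown here, and the function name is mine):

```python
import pandas as pd

def upload_df(df, table_name, con, chunksize=1000):
    """Append a DataFrame into a DB table in chunks.

    `con` can be any connection/engine pandas accepts — for PostgreSQL,
    typically a SQLAlchemy engine. A table name like 'train_0' matches
    the table queried above.
    """
    df.to_sql(table_name, con=con, if_exists="append",
              index=False, chunksize=chunksize)

# In the real script the DataFrame would come from a parquet file:
# df = pd.read_parquet("path/to/train_0.parquet")
# upload_df(df, "train_0", engine)
```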

The data is nothing but tweets collected with Tweepy, and the schema is as follows:

id_str : Tweet ID
created_at : Date and time
source object : Device Type
text : Tweet text
expanded_url : Url text
media_url_https : Media URL
hash : string hash value
slabel : Snorkel label (refer previous post for the details)
text_id : Integer IDs
label : Copy of slabel column gets updated with human annotation
id : int64
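Going by that schema, the backing table would look roughly like the DDL below. This is a sketch only — dataset_base.py may generate the table differently, and the SQL column types are assumptions:

```python
# Sketch: DDL matching the schema above (SQL types are assumptions).
CREATE_TRAIN_TABLE = """
CREATE TABLE IF NOT EXISTS train_0 (
    id              BIGINT,
    id_str          TEXT,     -- tweet ID
    created_at      TEXT,     -- date and time
    source          TEXT,     -- device type
    text            TEXT,     -- tweet text
    expanded_url    TEXT,     -- URL text
    media_url_https TEXT,     -- media URL
    hash            TEXT,     -- string hash value
    slabel          INTEGER,  -- Snorkel label
    text_id         INTEGER,  -- integer ID
    label           INTEGER   -- copy of slabel, updated by human annotation
);
"""
```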

Flask

python app.py

Check out the URL on your local machine @ http://0.0.0.0:8766/

As you can see, the web app reads the tables (train/dev/test) from the database and provides a way to annotate them.

It uses two colors to differentiate the classification labels, and the label values can be configured through a Gin config file called tagger.gin.

Gin config is used for configurations!

Let's dive into a little flashy stuff…

For sure the data is not going to fit on a single page, so we need to show a subset of the data per page, which is called pagination in web-app terms!
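At its core, pagination is just a LIMIT/OFFSET computation over the table. A minimal sketch (function and parameter names are mine, not from the repo; real code should parameterise the query):

```python
def page_window(page, per_page=20):
    """Map a 1-indexed page number to SQL LIMIT/OFFSET values."""
    if page < 1:
        raise ValueError("pages are 1-indexed")
    return per_page, (page - 1) * per_page

def page_query(table, page, per_page=20):
    # Shown inline for clarity; use query parameters in production code.
    limit, offset = page_window(page, per_page)
    return (f"SELECT text_id, text, label FROM {table} "
            f"ORDER BY text_id LIMIT {limit} OFFSET {offset}")
```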

Yet another small language in the toolbox…

Jinja templates are used to interpret the data sent from Python in the HTML page. They make it possible to work with loops and variables in an HTML page.
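For instance, looping over rows sent from Python looks like this in a Jinja2 template (a standalone sketch using jinja2 directly; in the app, the template would live in an HTML file rendered by Flask's render_template):

```python
from jinja2 import Template

# A stripped-down version of the kind of template the annotation page uses:
# loop over rows and show each tweet's text with its current label.
TEMPLATE = Template(
    "<ul>"
    "{% for row in rows %}"
    "<li>{{ row['text'] }} (label={{ row['label'] }})</li>"
    "{% endfor %}"
    "</ul>"
)

html = TEMPLATE.render(rows=[{"text": "hello", "label": 0},
                             {"text": "world", "label": 1}])
```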

Read a little about HTML forms @

The logic is to group the text, text IDs, label IDs, page number, offset, etc. as part of the HTML form (similar to a class object in Python, you could say).

When the page is rendered for the first time, pagination is applied and the details are sent to the layout, where some of them are lists of values; these values are then interpreted with Jinja2 templates on the HTML page.

When the user updates the labels and hits the Submit button, all the text, label, and page-info values are sent as POST form data to the same URL.

The POST request is then decoded: the form values are converted to dictionaries, and the key/value pairs are used to update the table through the PostgreSQL DB connection object.
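A sketch of that decoding step (the "label_&lt;text_id&gt;" field-naming convention here is hypothetical — the repo's actual form field names may differ):

```python
def form_to_label_updates(form):
    """Convert POSTed form fields like 'label_<text_id>' into
    (label, text_id) pairs, ready for a parameterised UPDATE."""
    updates = []
    for key, value in form.items():
        if key.startswith("label_"):
            text_id = int(key[len("label_"):])
            updates.append((int(value), text_id))
    return updates

# Each pair would then be applied through the DB connection, e.g.:
# cursor.executemany("UPDATE train_0 SET label = %s WHERE text_id = %s", updates)
```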

Download the Data from DB

Run the following command to download the data from PostgreSQL as parquet…

python dataset_base.py --mode=download
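As with the upload, the download side presumably boils down to read_sql plus to_parquet. A sketch, with an assumed function name:

```python
import pandas as pd

def download_table(table_name, con):
    """Read a whole table back into a DataFrame.

    `con` is any connection/engine pandas accepts; for PostgreSQL,
    typically a SQLAlchemy engine.
    """
    return pd.read_sql(f"SELECT * FROM {table_name}", con)

# The script would then persist the result as parquet:
# download_table("train_0", engine).to_parquet("train_0.parquet")
```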

Hope you enjoyed the blog! See you in the next post…
