NLP: How to predict the next word in a search query? A full-stack N-Gram Model with Vue, FastAPI & Heroku
1. Dataset: google_wellformed_query, 25K examples
2. Model: Naive N-Gram probability model: single-machine version and PySpark version
3. Web UI: https://v3.vuejs.org/
4. Backend API: https://fastapi.tiangolo.com/
5. Docker and Docker Compose
6. Deployment on Heroku
Git: https://github.com/gyan42/autocomplete-ngram-model
Colab Notebook for Model training: https://colab.research.google.com/gist/Mageswaran1989/e49e043f09c1de89c6f433967be118a2/autocorrectmodel.ipynb
Live Demo: https://gyan42-autocompleter.herokuapp.com/
I hope I don’t have to set much context here, as the picture above depicts what we wanna achieve.
Who has won the _____
To fill in the above sentence, we need to know the current search trends and what is happening in the locality where the search is made. It can be anything from local election results to international game results.
Like any other ML model, the results are subject to the dataset we train on. Luckily we have google_wellformed_query, which we are gonna use for our exploration. We are not going to use any fancy ML algorithm here, just a simple bag-of-words probability model. And finally we are going to see how to use Apache Spark to produce the model artefacts needed for our predictions on larger datasets.
1. Dataset
To keep data handling simple, we are going to use the Datasets package from https://huggingface.co/datasets
Here is the online viewer for `google_wellformed_query`, which can be used for exploring the dataset.
from datasets import load_dataset
dataset = load_dataset('google_wellformed_query')
dataset['train']['content'][:2]
>> ['The European Union includes how many ?',
    'What are Mia Hamms accomplishment ?']
2. N-Gram Model
N-Gram models are statistical (probabilistic) language models that aim to assign probabilities to a given sequence of words. An n-gram is just a sequence of “n” words. For example, “Saurav” is a unigram and “Hi There” is a bigram.
Sentences -> Tokeniser (splitting sentences into words) -> Grouping tokens into tuples
So what does it do, basically? It estimates the probability of the next word given the previous set of words.
What is the word (unigram) probability?
What is the bi-gram probability?
What is the n-gram probability?
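Before the code, a quick sketch of the standard definitions (the k-smoothed version is what `_estimate_probability` implements further down):

P(w_i) = \frac{C(w_i)}{\sum_{w \in V} C(w)}

P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}

P(w_i \mid w_{i-n+1} \dots w_{i-1}) = \frac{C(w_{i-n+1} \dots w_i) + k}{C(w_{i-n+1} \dots w_{i-1}) + k\,|V|}

where C(·) is the number of times a token sequence occurs in the corpus, |V| is the vocabulary size, and k is the smoothing constant (k = 1 gives plain Laplace smoothing).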
def _count_n_grams(self, tokenized_sentences, ngram):
    '''
    Creates n-grams from tokenized sentences and counts them.
    Needs `from collections import defaultdict` and `from tqdm import tqdm`.
    '''
    freq = defaultdict(lambda: 0)
    for sentence in tqdm(tokenized_sentences, desc="NGrams"):
        # pad with start/end tokens so sentence boundaries become n-grams too
        sentence = [self._start_token] * ngram + sentence + [self._end_token]
        # number of n-gram windows in the padded sentence
        m = len(sentence) - ngram + 1
        for i in range(m):
            ngram_token = sentence[i:i+ngram]
            # tuples can't be used as keys in JSON, so join with a space
            freq[" ".join(ngram_token)] += 1
    return freq
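As a quick sanity check, here is what those counts look like on a toy sentence (a standalone sketch of the same logic, assuming start/end tokens `<s>` and `</s>`):

from collections import defaultdict

def count_n_grams(tokenized_sentences, ngram, start="<s>", end="</s>"):
    freq = defaultdict(int)
    for sentence in tokenized_sentences:
        padded = [start] * ngram + sentence + [end]
        for i in range(len(padded) - ngram + 1):
            freq[" ".join(padded[i:i + ngram])] += 1
    return dict(freq)

print(count_n_grams([["how", "many", "pairs"]], ngram=2))
# {'<s> <s>': 1, '<s> how': 1, 'how many': 1, 'many pairs': 1, 'pairs </s>': 1}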
Now comes the next question: how to predict the next word in my search query sentence? Simple, use an n-gram of your choice:
- Calculate the counts for the n-gram, e.g. bi-gram
- Calculate the counts for the n-gram + 1, e.g. tri-gram. By +1 we mean the next bigger n-gram.
- Apply the n-gram probability formula with k-smoothing, to handle out-of-vocabulary and skewed words
Try Kneser-Ney smoothing instead of plain Laplace smoothing!
start_tokens = ["how", "many", "pairs", "of"]
model.suggestions(start_tokens)
>> ['chromosomes', 'how', 'do', 'you', 'change']
So basically we have two counts, n-gram and n-gram+1; as an example let's take bi-gram and tri-gram. As the user types, we consider the previous bi-gram, here [“pairs”, “of”], and iterate through all words in the vocabulary, calculating the probability of each word following our n-gram tuple.
What is the probability of a word from the vocabulary given the previous bi-gram (pairs, of)?
def _estimate_probability(self, word, previous_ngram):
    '''
    k-smoothed probability of `word` following `previous_ngram`.
    '''
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    # n-grams are stored as space-joined strings (tuples can't be JSON keys)
    previous_ngram = " ".join(previous_ngram)
    previous_ngram_count = self._ngram_word_frequency.get(previous_ngram, 0)
    if previous_ngram_count == 0:
        # no match found for the entered words
        return 0
    denominator = previous_ngram_count + self._k * len(self._vocab)
    n_plus1_gram = previous_ngram + " " + word
    n_plus1_gram_count = self._ngram_plus1_word_frequency.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + self._k
    probability = numerator / denominator
    return probability
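For instance, with made-up counts: if C("pairs of") = 20, C("pairs of chromosomes") = 8, k = 1 and |V| = 1000, then P(chromosomes | pairs, of) = (8 + 1) / (20 + 1 × 1000) = 9/1020 ≈ 0.0088, and the word with the highest such score wins.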
Iterate through all the words in the vocabulary and calculate their probabilities; this is needed as we need to figure out which word has the highest probability for the given n-gram.
def _estimate_probabilities(self, previous_ngram):
    probabilities = {}
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    previous_ngram = " ".join(previous_ngram).lower()
    for word in self._vocab:
        probabilities[word] = self._estimate_probability(word, previous_ngram)
    return probabilities
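The `suggestions` method used throughout this post is not listed here; a minimal sketch of the ranking it performs could look like this (names and defaults are assumptions, not the repo's exact code):

def suggestions(self, previous_tokens, num_suggestions=5, start_with=None):
    # only the last n tokens form the context n-gram
    previous_ngram = previous_tokens[-self._ngram:]
    probabilities = self._estimate_probabilities(previous_ngram)
    if start_with:
        # optional prefix filter, e.g. only words starting with "ch"
        probabilities = {w: p for w, p in probabilities.items() if w.startswith(start_with)}
    # highest-probability words first
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:num_suggestions]]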
Apache Spark
Ok cool! We managed to build our model with some understanding of its maths (at least for now!). We are able to train on a small corpus and build the model quickly.
For our model to suggest next words, all we need is the vocab, the counts of the n-gram and the counts of the n-gram+1. Now, let's see how to use Apache Spark to get these counts on a bigger corpus.
import json
import pandas as pd
from pyspark.ml.feature import Tokenizer, NGram
from pyspark.sql import functions as F

class SparkAutoCorrectModel(object):
    def __init__(self, spark, dataset, ngram=2):
        self._spark = spark
        self._df = spark.createDataFrame(pd.DataFrame({"text": dataset.lines}))
        self._ngram = ngram
        self._tokenizer = Tokenizer(inputCol="text", outputCol="words")
        self._bigram = NGram(n=ngram, inputCol="words", outputCol="ngrams")
        self._trigram = NGram(n=ngram + 1, inputCol="words", outputCol="ngram_plus_one")

    def transform(self):
        df_tokenized = self._tokenizer.transform(self._df)
        ngram_df = self._bigram.transform(df_tokenized)
        ngram_df = self._trigram.transform(ngram_df)
        ngram_df.show()
        self._ngram_df = ngram_df

    def save_as_json(self, file_path):
        vocab = self._ngram_df.select(F.explode("words").alias("vocab")).collect()
        vocab = {row['vocab'] for row in vocab}
        vocab = list(vocab)
        ngram = self._ngram_df.select(F.explode("ngrams").alias("ngram")).groupBy("ngram").count().collect()
        ngram_word_frequency = {row['ngram']: row['count'] for row in ngram}
        ngram_plus_one = self._ngram_df.select(F.explode("ngram_plus_one").alias("ngram_plus_one_")).groupBy("ngram_plus_one_").count().collect()
        ngram_plus1_word_frequency = {row['ngram_plus_one_']: row['count'] for row in ngram_plus_one}
        data = {}
        data['ngram'] = self._ngram
        data['vocab'] = vocab
        data['ngram_word_frequency'] = ngram_word_frequency
        data['ngram_plus1_word_frequency'] = ngram_plus1_word_frequency
        with open(file_path, "w", encoding='utf-8') as file:
            json.dump(data, file, ensure_ascii=False, indent=4)
So what's happening here?
- Convert the list of dataset lines into a Spark DataFrame, with the lines in the `text` column
- Tokenize the `text` column with `Tokenizer`
- With `NGram`, create two new columns, `ngrams` and `ngram_plus_one`
- With the `explode` SQL built-in function, convert the arrays of tokens into rows and aggregate the tuple counts
- Collect the vocabulary, n-gram counts and n-gram+1 counts and create a JSON file that can be loaded into our model class (see the usage sketch below)
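Putting it together, a hedged usage sketch (the `dataset` object with a `.lines` list of raw sentences is an assumption carried over from the constructor above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("autocompleter").getOrCreate()
model = SparkAutoCorrectModel(spark, dataset, ngram=2)  # dataset.lines: list of sentences
model.transform()
model.save_as_json("data/trigram-autocompleter.json")  # the same file the FastAPI app loads later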
3. Vue Web UI
In recent times the https://streamlit.io/ library has become the go-to option for Data Science UIs, as it takes only an hour to build the UI you want for your proof-of-concept projects, all in Python of course. Nevertheless, it comes at the cost of losing customisation.
In my Googling I found Vue to match my object-oriented programming experience, with an easy learning curve to kick-start. It has a nice way of modelling data variables, user methods, signal emits & props for data sharing, and HTML handling.
Each Vue component consists of:
- HTML
- CSS: which plays a major role in decorating the web pages. One good thing is there are well-defined CSS frameworks like https://bulma.io/ that give all the essential CSS classes for free
- JavaScript with some predefined layout
To understand the Vue UI, familiarity with the basic Vue 3 concepts (components, data variables, methods, events) is necessary.
We need a search box and a list to display the suggestions.
<input class="input is-rounded"
       v-model="search"
       @input="onChange"
       @keyup.down="onArrowDown"
       @keyup.up="onArrowUp"
       @keyup.enter="onEnter"
       autocomplete="off"
       placeholder="Enter your text here for auto suggestion..."/>

<ul id="autocomplete-results" v-show="isOpen" class="autocomplete-results">
  <li class="loading" v-if="isLoading">
    Loading Results...
  </li>
  <li v-else v-for="(result, i) in results" :key="i" @click="setResult(result)"
      class="autocomplete-result" :class="{ 'is-active': i === arrowCounter }">
    {{result}}
  </li>
</ul>
What's happening?
- Bind the `search` data variable to the input box
- On user input, call the `onChange` method; similarly, on arrow key presses, call the respective methods
- The list is used to display the suggestions; show it only when we have suggestions from the backend, via the data variable `isOpen`
- If the backend is busy doing the computation, show “Loading Results...” based on the value of the data variable `isLoading`
- When suggestions are available, iterate through them and display them, with a style for the active selection
When the user inputs the search query, call the backend API:
// call the backend suggestions endpoint
axios.post(path, {text: cleandedQuery})
.then((res) => {
this.isLoading = false
this.results = res.data.tokens
console.info("suggestions", this.results)
})
When the user navigates the suggestions in the list and presses the enter key, add the current selection to our search query:
onEnter() {
  console.info("enter", this.arrowCounter)
  if (this.arrowCounter !== -1) {
    // drop the partially typed last word...
    if (this.search.split(" ").length > 1) {
      this.search = this.search.split(" ").slice(0, -1).join(" ")
    }
    // ...and replace it with the word selected from the suggestions
    this.search = this.search + " " + this.results[this.arrowCounter];
  }
  this.isOpen = false
  this.arrowCounter = -1
},
That's all there is to it for our intelligent search box 😉
4. FastAPI Backend
I have picked https://fastapi.tiangolo.com/ instead of traditional frameworks as it makes testing the API very easy and has all the batteries needed for quick prototyping.
- Set up the FastAPI main file with cross-origin policies
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,  # list of frontend origins allowed to call the API
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
- Initialise AutoCorrectModel and load one of the JSON files we got as a result of our training
auto_correct_model = AutoCorrectModel()
auto_correct_model.load_from_json("data/trigram-autocompleter.json")
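The `Text` request model used by the endpoint below is not shown in the post; a minimal sketch of what it presumably looks like (a Pydantic model with a single `text` field):

from pydantic import BaseModel

class Text(BaseModel):
    text: str  # the raw search query typed so far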
- Our endpoint is at “/suggestions”; it receives a string, splits it on spaces and returns a maximum of 10 candidates for the next word
@app.post("/suggestions")
def suggestions(data: Text):
    start_with = None
    tokens = data.text.split(" ")
    # pad with start tokens if the query is shorter than the n-gram size
    if len(tokens) < auto_correct_model._ngram:
        tokens = ['<s>'] * (auto_correct_model._ngram - len(tokens)) + tokens
    print("Inputs", tokens)
    res = auto_correct_model.suggestions(tokens, num_suggestions=10, start_with=start_with)
    print("Suggestions", res)
    return {"tokens": res}
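To sanity-check the endpoint locally, something like the following should work (the port is an assumption; the nginx config later in the post proxies to 5000):

import requests

resp = requests.post("http://localhost:5000/suggestions",
                     json={"text": "how many pairs of"})
print(resp.json())  # e.g. {"tokens": ["chromosomes", ...]}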
5. Docker
Multistage builds are used to speed up the Docker builds and keep the size of the images considerably low.
Check the README here for all build commands.
API : https://github.com/gyan42/autocomplete-ngram-model/blob/main/ops/api/Dockerfile
- In my trials, on Mac the Docker IP address is different from the host's even if we bridge the network. So I had to come up with two Docker images, one for Linux and one for Mac, each having its own backend API host address.
- On Mac it defaults to `192.168.99.100`, whereas on Linux it is `0.0.0.0` or `localhost`
- This is handled in the Vue .env file @ https://github.com/gyan42/autocomplete-ngram-model/tree/main/ui/autocomplete
UI: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops/ui
- Basic understanding of https://www.nginx.com/ is required as we use it as our webserver.
- Beginners guide @ http://nginx.org/en/docs/beginners_guide.html
- To run on a local machine we use this nginx.conf @ https://github.com/gyan42/autocomplete-ngram-model/blob/main/ui/autocomplete/nginx.conf
- It listens on port 80 and loads the index file from our Vue build directory
Docker Compose: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops
- Since we have two platforms to handle, we have two Docker Compose files.
6. Heroku
Let's host our front end and back end, along with our model, on https://www.heroku.com/
- Create an account
- Create an app
- Log in to Heroku and the Heroku container registry on the command line
- Build a single Docker image with both Vue and FastAPI @ https://github.com/gyan42/autocomplete-ngram-model/blob/main/ops/heroku/Dockerfile
- Heroku needs a different nginx file that takes in the Heroku-supplied port number for the UI, and we have configured a proxy to redirect backend API calls to the FastAPI endpoints
- As part of the Docker command, the `$PORT` number is replaced at launch time: sed -i -e 's/$PORT/'"$PORT"'/g' /etc/nginx/conf.d/default.conf
server {
    listen $PORT;
    root /usr/share/nginx/html;
    index index.html;

    location / {
        client_max_body_size 200M;
        root /app;
        index index.html;
        try_files $uri $uri/ /index.html;
    }

    location /suggestions {
        proxy_connect_timeout 6000;
        proxy_read_timeout 6000;
        proxy_pass http://127.0.0.1:5000;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}
- Push the Docker image to the Heroku registry
- Release the app (a command sketch follows below)
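For reference, the usual Heroku container flow looks roughly like this (a hedged sketch; `<app-name>` is the app you created above, and the exact commands live in the repo README):

heroku login
heroku container:login
heroku container:push web --app <app-name>     # build and push the image to Heroku's registry
heroku container:release web --app <app-name>  # release the pushed image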
Hope this post gives some pointers for hosting a Vue + FastAPI application on Heroku as well as on a local machine.
References:
- https://towardsdatascience.com/index-48563e4c1572
- https://www.kaggle.com/sauravmaheshkar/auto-completion-using-n-gram-models
- https://testdriven.io/blog/developing-a-single-page-app-with-fastapi-and-vuejs/
- https://testdriven.io/blog/deploying-flask-to-heroku-with-docker-and-gitlab/
- https://www.tutlinks.com/create-and-deploy-fastapi-app-to-heroku/