NLP: How to predict the next word in a search query? Full stack N-Gram Model with Vue, FastAPI & Heroku

Mageswaran D
8 min read · Nov 10, 2021


1. Dataset: google_wellformed_query, 25K examples

2. Model: naive n-gram probability model, in a single-machine version and a PySpark version

3. Web UI: https://v3.vuejs.org/

4. Backend API: https://fastapi.tiangolo.com/

5. Docker and Docker Compose

6. Deployment on Heroku

Git: https://github.com/gyan42/autocomplete-ngram-model

Colab Notebook for Model training: https://colab.research.google.com/gist/Mageswaran1989/e49e043f09c1de89c6f433967be118a2/autocorrectmodel.ipynb

Live Demo: https://gyan42-autocompleter.herokuapp.com/

I hope I don’t have to set much context here, as the example below depicts what we want to achieve.

Who has won the _____

To complete the above sentence we need to know the current search trends and what is happening in the locality where the search is made. It can be anything from local election results to international game results.

Like any other ML model, the results are subject to the dataset we train on. Luckily we have google_wellformed_query, which we are going to use for our exploration. We are not going to use any fancy ML algorithm here, just a simple n-gram probability model. Finally, we will see how to train on a larger dataset using Apache Spark and use the resulting artefacts with our model.

1. Dataset

To keep data handling simple we are going to use the Datasets package from https://huggingface.co/datasets

The online dataset viewer for `google_wellformed_query` can be used for exploring the dataset:

from datasets import load_dataset

dataset = load_dataset('google_wellformed_query')
dataset['train']['content'][:2]
>> ['The European Union includes how many ?',
    'What are Mia Hamms accomplishment ?']

2. N-Gram Model

N-gram models are statistical (probabilistic) language models that aim to assign probabilities to a given sequence of words. An n-gram is just a sequence of “n” words. For example, “Saurav” is a unigram and “Hi There” is a bigram.

Sentences -> Tokenisers (splitting sentences into words) -> Grouping tokens into tuples
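
As a quick illustration of that pipeline, here is a minimal sketch (using a plain whitespace tokeniser for simplicity; the project’s actual tokenisation may differ):

# Sentence -> tokens -> n-gram tuples (bigrams here)
sentence = "who has won the world cup"
tokens = sentence.split()                # ['who', 'has', 'won', 'the', 'world', 'cup']
bigrams = list(zip(tokens, tokens[1:]))  # [('who', 'has'), ('has', 'won'), ...]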

So what does it do, basically? It estimates the probability of the next word given the previous set of words.

What is a word (unigram) probability?

What is bi-gram probability?

What is n-gram probability?
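
These are the standard maximum-likelihood estimates, sketched here in LaTeX (C(\cdot) denotes a count over the training corpus, V the vocabulary):

P(w) = \frac{C(w)}{\sum_{w' \in V} C(w')}

P(w \mid w_1) = \frac{C(w_1\, w)}{C(w_1)}

P(w \mid w_1 \dots w_{n-1}) = \frac{C(w_1 \dots w_{n-1}\, w)}{C(w_1 \dots w_{n-1})}

In words: the unigram probability is a word’s relative frequency, and an n-gram probability is the count of the full n-word sequence divided by the count of its (n-1)-word prefix.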

from collections import defaultdict
from tqdm import tqdm

def _count_n_grams(self, tokenized_sentences, ngram):
    '''
    Creates n-grams from each tokenized sentence and counts them
    '''
    freq = defaultdict(int)
    for sentence in tqdm(tokenized_sentences, desc="NGrams"):
        # Pad with start/end tokens before sliding the n-gram window
        sentence = [self._start_token] * ngram + sentence + [self._end_token]
        m = len(sentence) - ngram + 1  # number of full n-gram windows
        for i in range(m):
            ngram_token = sentence[i:i + ngram]
            # tuples can't be used as keys in JSON, so join tokens with a space
            freq[" ".join(ngram_token)] += 1
    return freq
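
To see what the counting produces, here is a standalone toy run of the same windowing logic (not the class method itself):

from collections import defaultdict

# Bigram counts for one padded sentence, mirroring _count_n_grams
tokens = ["<s>", "<s>", "how", "many", "pairs", "of", "</s>"]
freq = defaultdict(int)
for i in range(len(tokens) - 2 + 1):
    freq[" ".join(tokens[i:i + 2])] += 1
print(dict(freq))  # {'<s> <s>': 1, '<s> how': 1, 'how many': 1, ...}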

Now comes the next question: how do we predict the next word in a search query? Simple, use the n-gram of your choice:

  • Calculate the counts for the n-gram, e.g. bi-gram
  • Calculate the counts for the n-gram + 1, e.g. tri-gram. By +1 we mean the next larger n-gram.
  • Apply the n-gram probability formula with k-smoothing, to handle out-of-vocabulary words and skewed counts

Try Kneser–Ney smoothing instead of plain Laplace smoothing!

start_tokens = ["how", "many", "pairs", "of"]
model.suggestions(start_tokens)
>> ['chromosomes', 'how', 'do', 'you', 'change']

So basically we have two sets of counts, for the n-gram and the n-gram + 1; as an example let’s take bi-gram and tri-gram. As the user types, we take the previous bi-gram, here [“pairs”, “of”], iterate through all words in the vocabulary, and calculate the probability of each word following our n-gram tuple.

What is the probability of a word from the vocabulary given the previous bi-gram (pairs, of)?
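
This is the add-k (Laplace-style) smoothed estimate that the code below computes, with V the vocabulary and k the smoothing constant:

P_k(w \mid \text{pairs of}) = \frac{C(\text{pairs of } w) + k}{C(\text{pairs of}) + k\,|V|}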

def _estimate_probability(self, word, previous_ngram):
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    previous_ngram = " ".join(previous_ngram)
    previous_ngram_count = self._ngram_word_frequency.get(previous_ngram, 0)
    if previous_ngram_count == 0:
        # No match found for the entered words
        return 0
    denominator = previous_ngram_count + self._k * len(self._vocab)
    n_plus1_gram = previous_ngram + " " + word
    n_plus1_gram_count = self._ngram_plus1_word_frequency.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + self._k
    return numerator / denominator

Iterate through all the words in the vocabulary and calculate their probabilities; this is needed as we need to figure out which word has the highest probability for the given n-gram.

def _estimate_probabilities(self, previous_ngram):
    probabilities = {}
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    previous_ngram = " ".join(previous_ngram).lower()
    for word in self._vocab:
        probabilities[word] = self._estimate_probability(word, previous_ngram)
    return probabilities
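
The suggestions method used earlier is then just a ranking over these probabilities. A minimal sketch of how it could look (the actual implementation in the repo may differ):

def suggestions(self, tokens, num_suggestions=5, start_with=None):
    # Score every vocabulary word against the last n-gram of the input
    previous_ngram = tokens[-self._ngram:]
    probabilities = self._estimate_probabilities(previous_ngram)
    if start_with:
        # Optionally keep only words with a given prefix
        probabilities = {w: p for w, p in probabilities.items() if w.startswith(start_with)}
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:num_suggestions]]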

Apache Spark

OK cool! We managed to build our model with some understanding of its maths (at least for now!). We are able to train on a small corpus and build the model quickly.

For our model to suggest next words, all we need is the vocab, the n-gram counts and the n-gram + 1 counts. Now let’s see how to use Apache Spark to get those counts on a bigger corpus.

import json

import pandas as pd
from pyspark.ml.feature import NGram, Tokenizer
from pyspark.sql import functions as F

class SparkAutoCorrectModel(object):
    def __init__(self, spark, dataset, ngram=2):
        self._spark = spark
        # `dataset` is expected to expose the corpus as a list of raw lines
        self._df = spark.createDataFrame(pd.DataFrame({"text": dataset.lines}))
        self._ngram = ngram
        self._tokenizer = Tokenizer(inputCol="text", outputCol="words")
        self._ngram_transformer = NGram(n=ngram, inputCol="words", outputCol="ngrams")
        self._ngram_plus1_transformer = NGram(n=ngram + 1, inputCol="words", outputCol="ngram_plus_one")

    def transform(self):
        df_tokenized = self._tokenizer.transform(self._df)
        ngram_df = self._ngram_transformer.transform(df_tokenized)
        ngram_df = self._ngram_plus1_transformer.transform(ngram_df)
        ngram_df.show()
        self._ngram_df = ngram_df

    def save_as_json(self, file_path):
        vocab = self._ngram_df.select(F.explode("words").alias("vocab")).collect()
        vocab = list({row['vocab'] for row in vocab})
        ngram = self._ngram_df.select(F.explode("ngrams").alias("ngram")).groupBy("ngram").count().collect()
        ngram_word_frequency = {row['ngram']: row['count'] for row in ngram}
        ngram_plus_one = self._ngram_df.select(F.explode("ngram_plus_one").alias("ngram_plus_one_")).groupBy("ngram_plus_one_").count().collect()
        ngram_plus1_word_frequency = {row['ngram_plus_one_']: row['count'] for row in ngram_plus_one}
        data = {
            'ngram': self._ngram,
            'vocab': vocab,
            'ngram_word_frequency': ngram_word_frequency,
            'ngram_plus1_word_frequency': ngram_plus1_word_frequency,
        }
        with open(file_path, "w", encoding='utf-8') as file:
            json.dump(data, file, ensure_ascii=False, indent=4)

So what’s happening here?

  • Convert the list of dataset lines into a Spark DataFrame, with the lines in a text column
  • Tokenize the text column with Tokenizer
  • With NGram, create two new columns, ngrams and ngram_plus_one
  • With the built-in SQL function explode, convert the arrays of tokens into rows and aggregate the tuple counts
  • Collect the vocabulary, the n-gram counts and the n-gram + 1 counts, and create a JSON file that can be loaded into our model class (a minimal driver is sketched below)
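
Putting it together, a driver for the class above might look like this (a minimal sketch; the corpus here is a stand-in, and SimpleNamespace just mimics an object exposing a .lines list):

from types import SimpleNamespace
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("autocomplete-ngrams").getOrCreate()

# Stand-in corpus; in practice this would be the google_wellformed_query lines
ds = SimpleNamespace(lines=["how many pairs of chromosomes do humans have",
                            "who has won the world cup"])

model = SparkAutoCorrectModel(spark, ds, ngram=2)
model.transform()
model.save_as_json("data/trigram-autocompleter.json")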

3. Vue Web UI

In recent times the https://streamlit.io/ library has become the go-to option for data science UIs, as it takes only an hour to build the UI you want for your proof-of-concept projects, all in Python of course. Nevertheless, it comes at the cost of losing customisation.

In my Googling I found Vue to match my object-oriented programming experience, and it is a framework with an easy learning curve to kick-start. It has a nice way of modelling data variables, user methods, signal emits & props for data sharing, and HTML handling.

Each Vue component consists of:

  • HTML
  • CSS: which plays a major role in decorating the web pages. One good thing is that there are well-defined CSS frameworks like https://bulma.io/ that give you all the essential CSS classes for free
  • JavaScript with some predefined layout

To understand the Vue UI, a few Vue 3 concepts are necessary: components, data variables, methods, emits and props.

We need a search box and a list to display the suggestions.

<input class="input is-rounded"
       v-model="search"
       @input="onChange"
       @keyup.down="onArrowDown"
       @keyup.up="onArrowUp"
       @keyup.enter="onEnter"
       autocomplete="off"
       placeholder="Enter your text here for auto suggestion..."/>

<ul id="autocomplete-results" v-show="isOpen" class="autocomplete-results">
  <li class="loading" v-if="isLoading">
    Loading Results...
  </li>
  <li v-else v-for="(result, i) in results" :key="i" @click="setResult(result)"
      class="autocomplete-result" :class="{ 'is-active': i === arrowCounter }">
    {{result}}
  </li>
</ul>

What’s happening?

  • Bind the search data variable to the input box
  • On user input call the onChange method; similarly, on arrow key presses call the respective methods
  • The list is used to display the suggestions; it is shown only when we have suggestions from the backend, via the data variable isOpen
  • If the backend is busy doing the computation, show “Loading Results...” based on the value of the data variable isLoading
  • When suggestions are available, iterate through them and display each one, with a style for the active selection

When the user inputs the search query, call the backend API:

// Backend
axios.post(path, {text: cleanedQuery})
  .then((res) => {
    this.isLoading = false
    this.results = res.data.tokens
    console.info("suggestions", this.results)
  })

When the user navigates the suggestions in the list and presses the enter key, add the current selection to our search query:

onEnter() {
  console.info("enter", this.arrowCounter)
  if (this.arrowCounter !== -1) {
    // Drop the partially typed last word, if any
    if (this.search.split(" ").length > 1) {
      this.search = this.search.split(" ").slice(0, -1).join(" ")
    }
    // Append the word selected from the suggestions to the search query
    this.search = this.search + " " + this.results[this.arrowCounter];
  }
  this.isOpen = false
  this.arrowCounter = -1
},

That’s all there is to it for our intelligent search box 😉

4. FastAPI Backend

I have picked https://fastapi.tiangolo.com/ over traditional frameworks as it makes testing the API very easy and has all the batteries needed for quick prototyping.

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
  • Initialise AutoCorrectModel and load one of the JSON files we got as a result of our training
auto_correct_model = AutoCorrectModel()
auto_correct_model.load_from_json("data/trigram-autocompleter.json")
  • Our endpoint is at “/suggestions”; it receives a string, splits it on spaces, and returns at most 10 candidates for the next word (a quick client-side test is sketched after the code)
@app.post("/suggestions")
def suggestions(data: Text):
    start_with = None
    tokens = data.text.split(" ")
    # Pad with start tokens if the query is shorter than the model's n-gram
    if len(tokens) < auto_correct_model._ngram:
        tokens = ['<s>'] * (auto_correct_model._ngram - len(tokens)) + tokens
    print("Inputs", tokens)
    res = auto_correct_model.suggestions(tokens, num_suggestions=10, start_with=start_with)
    print("Suggestions", res)
    return {"tokens": res}
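
A quick way to exercise the endpoint (a sketch using the requests library; the host and port are assumptions based on the local setup, where nginx proxies /suggestions to port 5000):

import requests

# Assumes the FastAPI app is reachable locally on port 5000
res = requests.post("http://localhost:5000/suggestions",
                    json={"text": "how many pairs of"})
print(res.json())  # e.g. {"tokens": ["chromosomes", ...]}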

5. Docker

Multi-stage builds are used to speed up the Docker builds and keep the size of the images considerably low.

Check the README here for all build commands.

API : https://github.com/gyan42/autocomplete-ngram-model/blob/main/ops/api/Dockerfile

  • In my trials, on Mac the Docker IP address is different from the host even if we bridge the network. So I had to come up with two Docker images, one for Linux and one for Mac, each having its own backend API host address.
  • On Mac it defaults to 192.168.99.100, whereas on Linux it is 0.0.0.0 or localhost
  • This is handled in the Vue .env file @ https://github.com/gyan42/autocomplete-ngram-model/tree/main/ui/autocomplete

UI: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops/ui

DockerCompose: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops

  • Since we have two platforms to handle, we have two Docker Compose files.

6. Heroku

Let’s host our front end and back end, along with our model, on https://www.heroku.com/

  • Create an account
  • Create an app
  • Log in to Heroku and the Heroku container registry on the command line
  • Build a Docker image with both Vue and FastAPI in a single image @ https://github.com/gyan42/autocomplete-ngram-model/blob/main/ops/heroku/Dockerfile
  • Heroku needs a different nginx config that takes in the Heroku-supplied port number for the UI, and we have configured a proxy to redirect backend API calls to the FastAPI endpoints
  • As part of the Docker command, the $PORT number is replaced at launch time: sed -i -e 's/$PORT/'"$PORT"'/g' /etc/nginx/conf.d/default.conf
server {
    listen $PORT;
    root /usr/share/nginx/html;
    index index.html;

    location / {
        client_max_body_size 200M;
        root /app;
        index index.html;
        try_files $uri $uri/ /index.html;
    }

    location /suggestions {
        proxy_connect_timeout 6000;
        proxy_read_timeout 6000;
        proxy_pass http://127.0.0.1:5000;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}
  • Push the Docker image to the Heroku registry
  • Release the app

Hope this post gives some pointers for hosting a Vue + FastAPI application on Heroku, as well as on your local machine.
