NLP: How to predict the next word in a search query? Full stack N-Gram Model with Vue, FastAPI & Heroku

1. Dataset: google_wellformed_query from https://huggingface.co/datasets

2. Model: Naive N-Gram probability model: single-machine version and PySpark version

3. Web UI: https://v3.vuejs.org/

4. Backend API: https://fastapi.tiangolo.com/

5. Docker and Docker Compose

6. Deployment on Heroku

Git: https://github.com/gyan42/autocomplete-ngram-model

Colab Notebook for Model training: https://colab.research.google.com/gist/Mageswaran1989/e49e043f09c1de89c6f433967be118a2/autocorrectmodel.ipynb

Live Demo: https://gyan42-autocompleter.herokuapp.com/

I hope I don't have to set any context here, as the picture above depicts what we wanna achieve.

Who has won the _____

To fill in the above sentence we need to know the current search trends and what's happening in the locality where the search is being made. It can be anything from local election results to international game results.

Like any other ML model, the results are subject to the dataset we train on. Luckily we have google_wellformed_query, which we are gonna use for our exploration. We are not going to use any fancy ML algorithm here, just a simple word-count (n-gram) probability model. We will also see how to use Apache Spark to produce the model artefacts that are needed for our predictions on large datasets.

And finally we are going to see how to take a large dataset, train using Spark, and use the resulting artefacts with our model.

1. Dataset

To keep data handling simple, we are going to use the Datasets package from https://huggingface.co/datasets

The online dataset viewer for `google_wellformed_query` can be used for exploring the dataset.

from datasets import load_dataset

dataset = load_dataset('google_wellformed_query')
dataset['train']['content'][:2]
>> ['The European Union includes how many ?',
    'What are Mia Hamms accomplishment ?']

2. N-Gram Model

N-Gram models are statistical (probabilistic) language models that aim to assign probabilities to a given sequence of words. An n-gram is just a sequence of “n” words. For example, “Saurav” is a unigram and “Hi There” is a bigram.

Sentences -> Tokenisers (splitting sentences into words) -> Grouping tokens into tuples
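As a tiny illustration of that pipeline, assuming a naive whitespace tokeniser:

sentence = "who has won the world cup"
tokens = sentence.split()                # ['who', 'has', 'won', 'the', 'world', 'cup']
bigrams = list(zip(tokens, tokens[1:]))  # [('who', 'has'), ('has', 'won'), ...]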

So what does it do, basically? It estimates the probability of the next word given the previous set of words.

What is word or unigram probability ?

What is bi-gram probability?

What is n-gram probability?
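For reference, here are the standard maximum-likelihood definitions, where C(·) is a count over the corpus and N is the total number of word tokens (the code that follows additionally applies k-smoothing):

P(w) = \frac{C(w)}{N}

P(w_t \mid w_{t-1}) = \frac{C(w_{t-1}\, w_t)}{C(w_{t-1})}

P(w_t \mid w_{t-n+1} \ldots w_{t-1}) = \frac{C(w_{t-n+1} \ldots w_t)}{C(w_{t-n+1} \ldots w_{t-1})}

Counting the n-grams is the first step: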

def _count_n_grams(self, tokenized_sentences, ngram):
    '''
    Creates n-grams from each tokenized sentence and counts them
    '''
    freq = defaultdict(lambda: 0)
    for sentence in tqdm(tokenized_sentences, desc="NGrams"):
        # pad with start/end tokens so sentence boundaries are counted too
        sentence = [self._start_token] * ngram + sentence + [self._end_token]
        # number of n-gram windows in the padded sentence
        m = len(sentence) - ngram + 1
        for i in range(m):
            ngram_token = sentence[i:i + ngram]
            # tuples can't be used as keys in JSON, so join tokens with a space
            freq[" ".join(ngram_token)] += 1
    return freq
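For instance, on a toy corpus with ngram=2 (assuming '<s>' and '<e>' are the model's start/end tokens and `model` is an instance of the class above), the counts come out as:

counts = model._count_n_grams([["who", "has", "won"]], ngram=2)
# {'<s> <s>': 1, '<s> who': 1, 'who has': 1, 'has won': 1, 'won <e>': 1}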

Now comes the next question: how do we predict the next word in our search query? Simple, use an n-gram of your choice.

Try Kneser-Ney smoothing instead of plain Laplace smoothing!

start_tokens = ["how", "many", "pairs", "of"]
model.suggestions(start_tokens)
>> ['chromosomes', 'how', 'do', 'you', 'change']

So basically we keep two sets of counts, n-gram and n-gram+1; as an example let's take bi-gram and tri-gram. As the user types, we take the previous bi-gram, here [“pairs”, “of”], and iterate through all words in the vocabulary, calculating the probability of each word combined with our n-gram tuple.

What is the probability of a word from the vocabulary given the previous bi-gram (pairs, of)?
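With the k-smoothing used in the code below (the self._k constant), the estimate for a candidate word such as “chromosomes” becomes:

P(\text{chromosomes} \mid \text{pairs of}) = \frac{C(\text{pairs of chromosomes}) + k}{C(\text{pairs of}) + k \cdot |V|}

where |V| is the vocabulary size.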

def _estimate_probability(self, word, previous_ngram):
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    previous_ngram = " ".join(previous_ngram)
    previous_ngram_count = self._ngram_word_frequency.get(previous_ngram, 0)
    if previous_ngram_count == 0:
        # no match found for the entered words
        return 0
    # k-smoothed estimate: (count(n+1-gram) + k) / (count(n-gram) + k * |V|)
    denominator = previous_ngram_count + self._k * len(self._vocab)
    n_plus1_gram = previous_ngram + " " + word
    n_plus1_gram_count = self._ngram_plus1_word_frequency.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + self._k
    probability = numerator / denominator
    return probability

Iterate through all the words in the vocabulary and calculate their probabilities; this is needed since we want to figure out which word has the highest probability for the given n-gram.

def _estimate_probabilities(self, previous_ngram):
    probabilities = {}
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    previous_ngram = " ".join(previous_ngram).lower()
    # score every word in the vocabulary against the previous n-gram
    for word in self._vocab:
        probabilities[word] = self._estimate_probability(word, previous_ngram)
    return probabilities
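The `suggestions` method used earlier then just ranks these probabilities. Here is a minimal sketch of how it could look, assuming the helpers above (the actual implementation in the repo may differ in details):

def suggestions(self, previous_tokens, num_suggestions=5, start_with=None):
    # use only the last n tokens the user has typed
    previous_ngram = previous_tokens[-self._ngram:]
    probabilities = self._estimate_probabilities(previous_ngram)
    if start_with:
        # keep only candidates matching the partially typed word
        probabilities = {w: p for w, p in probabilities.items()
                         if w.startswith(start_with)}
    # return the words with the highest estimated probability
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:num_suggestions]]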

Apache Spark

Ok cool! We managed to build our model with some understanding of its maths (at least for now!). We are able to train on a small corpus and build the model quickly.

For our model to suggest next words, all we need is the vocab, the n-gram counts and the n-gram+1 counts. Now, let's see how to use Apache Spark to get the counts for the n-grams and n-grams+1 on a bigger corpus.

import json
import pandas as pd
import pyspark.sql.functions as F
from pyspark.ml.feature import Tokenizer, NGram

class SparkAutoCorrectModel(object):
    def __init__(self,
                 spark,
                 dataset,
                 ngram=2):
        self._spark = spark
        self._df = spark.createDataFrame(pd.DataFrame({"text": dataset.lines}))
        self._ngram = ngram
        self._tokenizer = Tokenizer(inputCol="text", outputCol="words")
        self._ngram_transformer = NGram(n=ngram, inputCol="words", outputCol="ngrams")
        self._ngram_plus1_transformer = NGram(n=ngram + 1, inputCol="words", outputCol="ngram_plus_one")

    def transform(self):
        df_tokenized = self._tokenizer.transform(self._df)
        ngram_df = self._ngram_transformer.transform(df_tokenized)
        ngram_df = self._ngram_plus1_transformer.transform(ngram_df)
        ngram_df.show()
        self._ngram_df = ngram_df

    def save_as_json(self, file_path):
        # vocabulary: the set of all distinct tokens
        vocab = self._ngram_df.select(F.explode("words").alias("vocab")).collect()
        vocab = list({row['vocab'] for row in vocab})
        # n-gram counts
        ngram = self._ngram_df.select(F.explode("ngrams").alias("ngram")).groupBy("ngram").count().collect()
        ngram_word_frequency = {row['ngram']: row['count'] for row in ngram}
        # (n+1)-gram counts
        ngram_plus_one = self._ngram_df.select(F.explode("ngram_plus_one").alias("ngram_plus_one_")).groupBy("ngram_plus_one_").count().collect()
        ngram_plus1_word_frequency = {row['ngram_plus_one_']: row['count'] for row in ngram_plus_one}
        data = {
            'ngram': self._ngram,
            'vocab': vocab,
            'ngram_word_frequency': ngram_word_frequency,
            'ngram_plus1_word_frequency': ngram_plus1_word_frequency,
        }
        with open(file_path, "w", encoding='utf-8') as file:
            json.dump(data, file, ensure_ascii=False, indent=4)

So what's happening here? We tokenize each sentence, let Spark's NGram transformer produce the n-gram and n-gram+1 columns, count them with a groupBy, and dump the vocab and both frequency tables into the same JSON format the single-machine model loads.
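Putting it together, a minimal driver could look like this (assuming a `dataset` object exposing a `lines` attribute, as the constructor above expects):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("autocompleter").getOrCreate()
model = SparkAutoCorrectModel(spark, dataset, ngram=2)
model.transform()                                      # tokenize and build the n-gram columns
model.save_as_json("data/trigram-autocompleter.json")  # artefacts consumed by the API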

3. Vue Web UI

In recent times the https://streamlit.io/ library has become the go-to option for Data Science UIs, as it only takes an hour to build the UI you want for your proof-of-concept projects, all in Python of course; nevertheless, it comes at the cost of losing customisation.

In my Googling I found Vue to match my object-oriented programming experience, and it is a framework with an easy learning curve to kick-start. It has a nice way of modeling data variables, user methods, signal emits & props for data sharing, and HTML handling.

Each Vue component consists of a template, a script and a style block.

To understand the Vue UI, the following Vue 3 concepts are necessary: data variables, methods, and emits & props for data sharing.

We need a search box and a list to display the suggestions.

<input class="input is-rounded"
       v-model="search"
       @input="onChange"
       @keyup.down="onArrowDown"
       @keyup.up="onArrowUp"
       @keyup.enter="onEnter"
       autocomplete="off"
       placeholder="Enter your text here for auto suggestion..."/>

<ul id="autocomplete-results" v-show="isOpen" class="autocomplete-results">
  <li class="loading" v-if="isLoading">
    Loading Results...
  </li>
  <li v-else v-for="(result, i) in results" :key="i" @click="setResult(result)"
      class="autocomplete-result" :class="{ 'is-active': i === arrowCounter }">
    {{result}}
  </li>
</ul>

What's happening? The input box is bound to `search` via `v-model`; every keystroke fires `onChange`, the arrow keys move the highlighted suggestion, and enter picks it from the list rendered by the `v-for`.

When the user types the search query, we call the backend API:

// call the backend for suggestions
axios.post(path, {text: cleanedQuery})
  .then((res) => {
    this.isLoading = false
    this.results = res.data.tokens
    console.info("suggestions", this.results)
  })

When the user navigates the suggestions in the list and presses the enter key, we add the current selection to our search query:

onEnter() {
  console.info("enter", this.arrowCounter)
  if (this.arrowCounter !== -1) {
    // drop the partially typed last word...
    if (this.search.split(" ").length > 1) {
      this.search = this.search.split(" ").slice(0, -1).join(" ")
    }
    // ...and append the word selected from the suggestions
    this.search = this.search + " " + this.results[this.arrowCounter];
  }
  this.isOpen = false
  this.arrowCounter = -1
},

That's all there is to it for our intelligent search box 😉

4. FastAPI Backend

I have picked https://fastapi.tiangolo.com/ over traditional frameworks as it makes testing the API very easy and has all the batteries needed for quick prototyping.

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

auto_correct_model = AutoCorrectModel()
auto_correct_model.load_from_json("data/trigram-autocompleter.json")

@app.post("/suggestions")
def suggestions(data: Text):
    start_with = None
    tokens = data.text.split(" ")
    # left-pad with start tokens when the query is shorter than the n-gram size
    if len(tokens) < auto_correct_model._ngram:
        tokens = ['<s>'] * (auto_correct_model._ngram - len(tokens)) + tokens
    print("Inputs", tokens)
    res = auto_correct_model.suggestions(tokens, num_suggestions=10, start_with=start_with)
    print("Suggestions", res)
    return {"tokens": res}
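The `Text` request model is presumably just a pydantic schema with a single `text` field; here is a sketch of it along with a quick client call (the localhost port matches the proxy_pass target in the nginx config later):

from pydantic import BaseModel
import requests

class Text(BaseModel):
    text: str

# quick client-side sanity check against the running API
res = requests.post("http://localhost:5000/suggestions",
                    json={"text": "how many pairs of"})
print(res.json())  # e.g. {"tokens": ["chromosomes", ...]}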

5. Docker

Multi-stage builds are used to speed up the Docker builds and keep the size of the images considerably low.

Check the README here for all build commands.

API : https://github.com/gyan42/autocomplete-ngram-model/blob/main/ops/api/Dockerfile

UI: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops/ui

DockerCompose: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops

6. Heroku

Let's host our front end and back end, along with our model, on https://www.heroku.com/. Nginx serves the Vue build and proxies the /suggestions calls to the FastAPI process, with Heroku injecting the listening port via $PORT:

server {
    listen $PORT;
    root /usr/share/nginx/html;
    index index.html index.htm;

    location / {
        client_max_body_size 200M;
        root /app;
        index index.html;
        try_files $uri $uri/ /index.html;
    }

    location /suggestions {
        proxy_connect_timeout 6000;
        proxy_read_timeout 6000;
        proxy_pass http://127.0.0.1:5000;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}

Hope this post gives you some pointers for hosting a Vue + FastAPI application on Heroku as well as on a local machine.
