NLP: How to predict the next word in a search query? A full-stack N-Gram Model with Vue, FastAPI & Heroku
1. Dataset: google_wellformed_query, 25K examples
2. Model: Naive N-Gram probability model: single-machine version and PySpark version
3. Web UI: https://v3.vuejs.org/
4. Backend API: https://fastapi.tiangolo.com/
5. Docker and Docker Compose
6. Deployment on Heroku
Git: https://github.com/gyan42/autocomplete-ngram-model
Colab Notebook for Model training: https://colab.research.google.com/gist/Mageswaran1989/e49e043f09c1de89c6f433967be118a2/autocorrectmodel.ipynb
Live Demo: https://gyan42-autocompleter.herokuapp.com/
I hope I don’t have to set much context here, as the picture above depicts what we wanna achieve.
Who has won the _____
To fill in the above sentence, we need to know the current search trends and what is happening in the locality where the search is made. It can be anything from local election results to international game results.
Like any other ML model, the results are subject to the dataset we train on. Luckily we have google_wellformed_query, which we are gonna use for our exploration. We are not going to use any fancy ML algorithm here, just a simple bag-of-words probability model. And finally we are going to see how to use Apache Spark to produce the model artefacts needed for our predictions on larger datasets.
1. Dataset
To keep data handling simple, we are going to use the Datasets package from https://huggingface.co/datasets
Here is the online viewer for `google_wellformed_query`, which can be used for exploring the dataset.
from datasets import load_dataset
dataset = load_dataset('google_wellformed_query')
dataset['train']['content'][:2]
>> ['The European Union includes how many ?',
    'What are Mia Hamms accomplishment ?']
2. N-Gram Model
N-Gram models are statistical (probabilistic) language models that aim to assign probabilities to a given sequence of words. An n-gram is just a sequence of “n” words. For example, “Saurav” is a unigram and “Hi There” is a bigram.
Sentences -> Tokeniser (splitting sentences into words) -> Grouping tokens into tuples
So what does it do, basically? It estimates the probability of the next word given the previous set of words.
What is the word (unigram) probability?
What is the bi-gram probability?
What is the n-gram probability?
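Before the code, a quick sketch of the standard definitions (the k-smoothed version is what `_estimate_probability` implements further down):

P(w_i) = \frac{C(w_i)}{\sum_{w \in V} C(w)}

P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}

P(w_i \mid w_{i-n+1} \dots w_{i-1}) = \frac{C(w_{i-n+1} \dots w_i) + k}{C(w_{i-n+1} \dots w_{i-1}) + k\,|V|}

where C(·) is the number of times a token sequence occurs in the corpus, |V| is the vocabulary size, and k is the smoothing constant (k = 1 gives plain Laplace smoothing).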
def _count_n_grams(self, tokenized_sentences, ngram):
    '''
    Creates n-grams from tokenized sentences and counts them.
    Needs `from collections import defaultdict` and `from tqdm import tqdm`.
    '''
    freq = defaultdict(lambda: 0)
    for sentence in tqdm(tokenized_sentences, desc="NGrams"):
        # pad with start/end tokens so sentence boundaries become n-grams too
        sentence = [self._start_token] * ngram + sentence + [self._end_token]
        # number of n-gram windows in the padded sentence
        m = len(sentence) - ngram + 1
        for i in range(m):
            ngram_token = sentence[i:i+ngram]
            # tuples can't be used as keys in JSON, so join with a space
            freq[" ".join(ngram_token)] += 1
    return freq
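As a quick sanity check, here is what those counts look like on a toy sentence (a standalone sketch of the same logic, assuming start/end tokens `<s>` and `</s>`):

from collections import defaultdict

def count_n_grams(tokenized_sentences, ngram, start="<s>", end="</s>"):
    freq = defaultdict(int)
    for sentence in tokenized_sentences:
        padded = [start] * ngram + sentence + [end]
        for i in range(len(padded) - ngram + 1):
            freq[" ".join(padded[i:i + ngram])] += 1
    return dict(freq)

print(count_n_grams([["how", "many", "pairs"]], ngram=2))
# {'<s> <s>': 1, '<s> how': 1, 'how many': 1, 'many pairs': 1, 'pairs </s>': 1}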
Now comes the next question: how to predict the next word in my search query sentence? Simple, use an n-gram of your choice:
- Calculate the counts for the n-gram, e.g. bi-gram
- Calculate the counts for the n-gram + 1, e.g. tri-gram. By +1 we mean the next bigger n-gram.
- Apply the n-gram probability formula with k-smoothing, to handle out-of-vocabulary and skewed words
Try Kneser-Ney smoothing instead of plain Laplace smoothing!
start_tokens = ["how", "many", "pairs", "of"]
model.suggestions(start_tokens)
>> ['chromosomes', 'how', 'do', 'you', 'change']
So basically we have two counts, n-gram and n-gram+1; as an example let's take bi-gram and tri-gram. As the user types, we consider the previous bi-gram, here [“pairs”, “of”], and iterate through all words in the vocabulary, calculating the probability of each word following our n-gram tuple.
What is the probability of a word from the vocabulary given the previous bi-gram (pairs, of)?
def _estimate_probability(self, word, previous_ngram):
    '''
    k-smoothed probability of `word` following `previous_ngram`.
    '''
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    # n-grams are stored as space-joined strings (tuples can't be JSON keys)
    previous_ngram = " ".join(previous_ngram)
    previous_ngram_count = self._ngram_word_frequency.get(previous_ngram, 0)
    if previous_ngram_count == 0:
        # no match found for the entered words
        return 0
    denominator = previous_ngram_count + self._k * len(self._vocab)
    n_plus1_gram = previous_ngram + " " + word
    n_plus1_gram_count = self._ngram_plus1_word_frequency.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + self._k
    probability = numerator / denominator
    return probability
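For instance, with made-up counts: if C("pairs of") = 20, C("pairs of chromosomes") = 8, k = 1 and |V| = 1000, then P(chromosomes | pairs, of) = (8 + 1) / (20 + 1 × 1000) = 9/1020 ≈ 0.0088, and the word with the highest such score wins.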
Iterate through all the words in the vocabulary and calculate their probabilities; this is needed as we need to figure out which word has the highest probability for the given n-gram.
def _estimate_probabilities(self, previous_ngram):
    probabilities = {}
    if not isinstance(previous_ngram, list):
        previous_ngram = [previous_ngram]
    previous_ngram = " ".join(previous_ngram).lower()
    for word in self._vocab:
        probabilities[word] = self._estimate_probability(word, previous_ngram)
    return probabilities
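The `suggestions` method used throughout this post is not listed here; a minimal sketch of the ranking it performs could look like this (names and defaults are assumptions, not the repo's exact code):

def suggestions(self, previous_tokens, num_suggestions=5, start_with=None):
    # only the last n tokens form the context n-gram
    previous_ngram = previous_tokens[-self._ngram:]
    probabilities = self._estimate_probabilities(previous_ngram)
    if start_with:
        # optional prefix filter, e.g. only words starting with "ch"
        probabilities = {w: p for w, p in probabilities.items() if w.startswith(start_with)}
    # highest-probability words first
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:num_suggestions]]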
Apache Spark
Ok cool! We managed to build our model with some understanding of its maths (at least for now!). We are able to train on a small corpus and build the model quickly.
For our model to suggest next words, all we need is the vocab, the counts of the n-gram and the counts of the n-gram+1. Now, let's see how to use Apache Spark to get these counts on a bigger corpus.
import json
import pandas as pd
from pyspark.ml.feature import Tokenizer, NGram
from pyspark.sql import functions as F

class SparkAutoCorrectModel(object):
    def __init__(self, spark, dataset, ngram=2):
        self._spark = spark
        self._df = spark.createDataFrame(pd.DataFrame({"text": dataset.lines}))
        self._ngram = ngram
        self._tokenizer = Tokenizer(inputCol="text", outputCol="words")
        self._bigram = NGram(n=ngram, inputCol="words", outputCol="ngrams")
        self._trigram = NGram(n=ngram + 1, inputCol="words", outputCol="ngram_plus_one")

    def transform(self):
        df_tokenized = self._tokenizer.transform(self._df)
        ngram_df = self._bigram.transform(df_tokenized)
        ngram_df = self._trigram.transform(ngram_df)
        ngram_df.show()
        self._ngram_df = ngram_df

    def save_as_json(self, file_path):
        vocab = self._ngram_df.select(F.explode("words").alias("vocab")).collect()
        vocab = {row['vocab'] for row in vocab}
        vocab = list(vocab)
        ngram = self._ngram_df.select(F.explode("ngrams").alias("ngram")).groupBy("ngram").count().collect()
        ngram_word_frequency = {row['ngram']: row['count'] for row in ngram}
        ngram_plus_one = self._ngram_df.select(F.explode("ngram_plus_one").alias("ngram_plus_one_")).groupBy("ngram_plus_one_").count().collect()
        ngram_plus1_word_frequency = {row['ngram_plus_one_']: row['count'] for row in ngram_plus_one}
        data = {}
        data['ngram'] = self._ngram
        data['vocab'] = vocab
        data['ngram_word_frequency'] = ngram_word_frequency
        data['ngram_plus1_word_frequency'] = ngram_plus1_word_frequency
        with open(file_path, "w", encoding='utf-8') as file:
            json.dump(data, file, ensure_ascii=False, indent=4)
So what's happening here?
- Convert the list of dataset lines into a Spark DataFrame, with the lines in the `text` column
- Tokenize the `text` column with `Tokenizer`
- With `NGram`, create two new columns, `ngrams` and `ngram_plus_one`
- With the `explode` SQL built-in function, convert the arrays of tokens into rows and aggregate the tuple counts
- Collect the vocabulary, n-gram counts and n-gram+1 counts and create a JSON file that can be loaded into our model class (see the usage sketch below)
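Putting it together, a hedged usage sketch (the `dataset` object with a `.lines` list of raw sentences is an assumption carried over from the constructor above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("autocompleter").getOrCreate()
model = SparkAutoCorrectModel(spark, dataset, ngram=2)  # dataset.lines: list of sentences
model.transform()
model.save_as_json("data/trigram-autocompleter.json")  # the same file the FastAPI app loads later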
3. Vue Web UI
In recent times the https://streamlit.io/ library has become the go-to option for Data Science UIs, as it takes only an hour to build the UI you want for your proof-of-concept projects, all in Python of course. Nevertheless, it comes at the cost of losing customisation.
In my Googling I found Vue to match my object-oriented programming experience, with an easy learning curve to kick-start. It has a nice way of modelling data variables, user methods, signal emits & props for data sharing, and HTML handling.
Each Vue component consists of:
- HTML
- CSS: which plays a major role in decorating the web pages. One good thing is there are well-defined CSS frameworks like https://bulma.io/ that give all the essential CSS classes for free
- JavaScript with some predefined layout
To understand the Vue UI, familiarity with the basic Vue 3 concepts (components, data variables, methods, events) is necessary.
We need a search box and a list to display the suggestions.
<input class="input is-rounded"
       v-model="search"
       @input="onChange"
       @keyup.down="onArrowDown"
       @keyup.up="onArrowUp"
       @keyup.enter="onEnter"
       autocomplete="off"
       placeholder="Enter your text here for auto suggestion..."/>

<ul id="autocomplete-results" v-show="isOpen" class="autocomplete-results">
  <li class="loading" v-if="isLoading">
    Loading Results...
  </li>
  <li v-else v-for="(result, i) in results" :key="i" @click="setResult(result)"
      class="autocomplete-result" :class="{ 'is-active': i === arrowCounter }">
    {{result}}
  </li>
</ul>
What's happening?
- Bind the `search` data variable to the input box
- On user input, call the `onChange` method; similarly, on arrow key presses, call the respective methods
- The list is used to display the suggestions; show it only when we have suggestions from the backend, via the data variable `isOpen`
- If the backend is busy doing the computation, show “Loading Results...” based on the value of the data variable `isLoading`
- When suggestions are available, iterate through them and display them, with a style for the active selection
When the user inputs the search query, call the backend API:
// call the backend suggestions endpoint
axios.post(path, {text: cleandedQuery})
.then((res) => {
this.isLoading = false
this.results = res.data.tokens
console.info("suggestions", this.results)
})
When the user navigates the suggestions in the list and presses the enter key, add the current selection to our search query:
onEnter() {
  console.info("enter", this.arrowCounter)
  if (this.arrowCounter !== -1) {
    // drop the partially typed last word...
    if (this.search.split(" ").length > 1) {
      this.search = this.search.split(" ").slice(0, -1).join(" ")
    }
    // ...and replace it with the word selected from the suggestions
    this.search = this.search + " " + this.results[this.arrowCounter];
  }
  this.isOpen = false
  this.arrowCounter = -1
},
That's all there is to it for our intelligent search box 😉
4. FastAPI Backend
I have picked https://fastapi.tiangolo.com/ instead of traditional frameworks as it makes testing the API very easy and has all the batteries needed for quick prototyping.
- Set up the FastAPI main file with cross-origin policies
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,  # list of frontend origins allowed to call the API
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
- Initialise AutoCorrectModel and load one of the JSON files we got as a result of our training
auto_correct_model = AutoCorrectModel()
auto_correct_model.load_from_json("data/trigram-autocompleter.json")
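The `Text` request model used by the endpoint below is not shown in the post; a minimal sketch of what it presumably looks like (a Pydantic model with a single `text` field):

from pydantic import BaseModel

class Text(BaseModel):
    text: str  # the raw search query typed so far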
- Our endpoint is at “/suggestions”; it receives a string, splits it on spaces and returns a maximum of 10 candidates for the next word
@app.post("/suggestions")
def suggestions(data: Text):
    start_with = None
    tokens = data.text.split(" ")
    # pad with start tokens if the query is shorter than the n-gram size
    if len(tokens) < auto_correct_model._ngram:
        tokens = ['<s>'] * (auto_correct_model._ngram - len(tokens)) + tokens
    print("Inputs", tokens)
    res = auto_correct_model.suggestions(tokens, num_suggestions=10, start_with=start_with)
    print("Suggestions", res)
    return {"tokens": res}
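To sanity-check the endpoint locally, something like the following should work (the port is an assumption; the nginx config later in the post proxies to 5000):

import requests

resp = requests.post("http://localhost:5000/suggestions",
                     json={"text": "how many pairs of"})
print(resp.json())  # e.g. {"tokens": ["chromosomes", ...]}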
5. Docker
Multistage builds are used to speed up the Docker builds and keep the size of the images considerably low.
Check the README here for all build commands.
API : https://github.com/gyan42/autocomplete-ngram-model/blob/main/ops/api/Dockerfile
- In my trials, on Mac the Docker IP address is different from the host's even if we bridge the network. So I had to come up with two Docker images, one for Linux and one for Mac, each having its own backend API host address.
- On Mac it defaults to `192.168.99.100`, whereas on Linux it is `0.0.0.0` or `localhost`
- This is handled in the Vue .env file @ https://github.com/gyan42/autocomplete-ngram-model/tree/main/ui/autocomplete
UI: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops/ui
- Basic understanding of https://www.nginx.com/ is required as we use it as our webserver.
- Beginners guide @ http://nginx.org/en/docs/beginners_guide.html
- To run on a local machine we use this nginx.conf @ https://github.com/gyan42/autocomplete-ngram-model/blob/main/ui/autocomplete/nginx.conf
- It listens on port 80 and loads the index file from our Vue build directory
Docker Compose: https://github.com/gyan42/autocomplete-ngram-model/tree/main/ops
- Since we have two platforms to handle, we have two Docker Compose files.
6. Heroku
Let's host our front end and back end, along with our model, on https://www.heroku.com/
- Create an account
- Create an app
- Log in to Heroku and the Heroku container registry on the command line
- Build a single Docker image with both Vue and FastAPI @ https://github.com/gyan42/autocomplete-ngram-model/blob/main/ops/heroku/Dockerfile
- Heroku needs a different nginx file that takes in the Heroku-supplied port number for the UI, and we have configured a proxy to redirect backend API calls to the FastAPI endpoints
- As part of the Docker command, the `$PORT` number is replaced at launch time: sed -i -e 's/$PORT/'"$PORT"'/g' /etc/nginx/conf.d/default.conf
server {
    listen $PORT;
    root /usr/share/nginx/html;
    index index.html;

    location / {
        client_max_body_size 200M;
        root /app;
        index index.html;
        try_files $uri $uri/ /index.html;
    }

    location /suggestions {
        proxy_connect_timeout 6000;
        proxy_read_timeout 6000;
        proxy_pass http://127.0.0.1:5000;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}
- Push the Docker image to the Heroku registry
- Release the app (a command sketch follows below)
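For reference, the usual Heroku container flow looks roughly like this (a hedged sketch; `<app-name>` is the app you created above, and the exact commands live in the repo README):

heroku login
heroku container:login
heroku container:push web --app <app-name>     # build and push the image to Heroku's registry
heroku container:release web --app <app-name>  # release the pushed image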
Hope this post gives some pointers for hosting a Vue + FastAPI application on Heroku as well as on a local machine.
References:
- https://towardsdatascience.com/index-48563e4c1572
- https://www.kaggle.com/sauravmaheshkar/auto-completion-using-n-gram-models
- https://testdriven.io/blog/developing-a-single-page-app-with-fastapi-and-vuejs/
- https://testdriven.io/blog/deploying-flask-to-heroku-with-docker-and-gitlab/
- https://www.tutlinks.com/create-and-deploy-fastapi-app-to-heroku/