Building a Personal AI Assistant: Part 1

Using intent classification & entity extraction to understand natural language

Robert MacWha
Nerd For Tech

Photo by Thomas Kolnowski on Unsplash, edited

Siri was the first piece of technology I was truly ‘wowed’ by. Being able to interact naturally(ish) with a computer left a lasting impression, and that fascination has stuck with me over the past decade. I am consistently impressed by natural language technology, both when it works and when it doesn’t. That’s why I decided to spend the past month building my own virtual assistant, dubbed Project Aurras.

Over the course of my next three articles I will document how Aurras’s core components function. This article will focus on data preparation. The next two will focus on intent classification and entity extraction respectively. These systems, put together, can facilitate enough interactions for most personal use cases.

Structure of a Virtual Assistant

This virtual assistant, at a fundamental level, relies on two interlocking components: intent classification to understand context, and entity extraction to locate meaning.

Intent Classification

Intent classification is the sentence-level process that determines what the user wants to accomplish by saying a given sentence. As an example, if you were to tell your virtual assistant to ‘please turn off the lights,’ intent classification would analyze that prompt and determine that you intended to issue the ‘lights_off’ command.

Entity Extraction

Entity extraction is the token-level process that locates keywords related to your intent. As an example, if you issued the prompt ‘turn on the downstairs lights’, entity extraction would detect the keyword ‘downstairs’ as the location.
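To make the relationship between these two components concrete, here is a minimal sketch of the kind of structured result they produce together. The intent name, entity schema, and handler below are purely illustrative and are not Aurras’s actual code.

# a hypothetical parse of the prompt 'turn on the downstairs lights'
parsed_command = {
    'text': 'turn on the downstairs lights',
    'intent': 'lights_on',                    # sentence-level result from intent classification
    'entities': {'location': 'downstairs'},   # token-level result from entity extraction
}

# downstream code can dispatch on the intent and use the entities as arguments
def handle(command: dict):
    if command['intent'] == 'lights_on':
        location = command['entities'].get('location', 'all')
        print(f'Turning on the {location} lights')

handle(parsed_command)  # Turning on the downstairs lights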

One last thing before I begin. I have prepared a Colab notebook with all the code from this article, available here. If you have problems with the code then I recommend looking at the Colab notebook. Failing that, feel free to reach out to me and I’ll do my best to help.

Dataset preparation

Before beginning work on the virtual assistant, a dataset needs to be prepared. An easy way to get one would be to use a public intent classification dataset such as this one from rikhuijzer’s GitHub. However, relying on a pre-made dataset would limit the virtual assistant to understanding only the commands that exist in it. Because of this, I decided to create my own dataset procedurally.

Process for procedural dataset generation

Creating natural language datasets is difficult because, most of the time, every single sentence needs to be written by hand. This simply won’t do for our use case. At least a thousand data points will be needed, and it would take ages to write that many unique sentences. Instead, the structure of the sentences can be exploited to facilitate procedural generation.

To generate the dataset, a reliable way of producing sentences needs to be found. Luckily, this is a problem that has already been solved by the creators of Mad Libs. Mad Libs is a children's game where one is presented with a group of partially completed sentences and fills in the blank spaces with the right types of words, such as names or actions, resulting in a personalized story. By adopting this method, a dataset can be generated from a collection of sentence templates and word groups.

Using this method of creating sentence templates and then filling in the missing words will drastically increase our rate of data production. As an example, by using these three sentence templates, I was able to generate over four hundred thousand unique sentences.

  • what’s {number} {math_function} {number}
  • {prepend_request} calculate {number} {math_function} {number}
  • {prepend_request} tell me what {number} {math_function} {number} is

Most of these samples should be discarded so the final dataset stays varied rather than dominated by near-duplicates, but the point stands: just three templates can easily generate hundreds of thousands of unique sentences.
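To make this concrete, here is a minimal sketch of how such a mad-libs-style generator could work. The word groups below are made up for illustration; the full generator lives in the repo linked in the next section.

import random

# hypothetical word groups -- the real lists live in the repo's resources folder
word_groups = {
    'prepend_request': ['please', 'could you', 'hey aurras,'],
    'number': ['two', 'twelve point five', 'ninety'],
    'math_function': ['plus', 'minus', 'times', 'divided by'],
}

templates = [
    "what's {number} {math_function} {number}",
    '{prepend_request} calculate {number} {math_function} {number}',
    '{prepend_request} tell me what {number} {math_function} {number} is',
]

def fill(template: str) -> str:
    # replace each {slot} with a random word from its group, one slot at a time
    sentence = template
    while '{' in sentence:
        slot = sentence[sentence.index('{') + 1 : sentence.index('}')]
        sentence = sentence.replace('{' + slot + '}', random.choice(word_groups[slot]), 1)
    return sentence

# generate a batch of fills and deduplicate them
samples = {fill(random.choice(templates)) for _ in range(1000)}
print(len(samples), 'unique sentences, e.g.:', next(iter(samples)))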

Code for procedural dataset generation

I have already written code that can convert sets of sentence templates into datasets. This code can be found here in the dataset-generation branch. If you don’t want to create your own dataset, download the resources folder and use the pre-existing one, which contains samples for the following five intents:

  1. Calculate
  2. Light
  3. TodoAdd
  4. TodoGet
  5. WeatherGet

More samples can be added by following the instructions found in the ReadMe. Feel free to add more samples before continuing with this article; doing so will not change anything except the final results.

Intent classification overview

I know I said that we wouldn’t be working with the intent classifier in this article, but bear with me. Before starting the pre-processing step, one must first understand the relationship between pre-trained models and tokenizers.

In the field of NLP, pre-trained models are often large and complicated, which means they require input data that has been formatted in a very specific way. Furthermore, since most pre-trained models expect different data formats, one would have to write new formatting code for every NLP model used. To solve this problem, most NLP models, and indeed many other pre-trained models, come with tokenizers: objects which take care of all the formatting to save time and reduce the chance of errors.
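As a quick illustration of why the tokenizer must match the model it was trained with (this snippet is only for demonstration and is not part of the assistant’s code), the same sentence is encoded into completely different IDs by different checkpoints’ tokenizers:

from transformers import AutoTokenizer

sentence = 'turn off the lights'
for checkpoint in ['distilbert-base-uncased', 'bert-base-cased']:
    # each checkpoint ships with its own tokenizer and vocabulary
    tok = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, tok(sentence)['input_ids'])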

Tokenizing the dataset

Before using the intent classification dataset to train the model, it needs to be tokenized. Tokenization is the process of converting our dataset into a format that an NLP model can understand: numeric IDs.

For more information on tokenization, I suggest reading this article by HuggingFace. It gives a much better explanation of tokenization than I am able to.

To convert raw sentences into their tokenized forms, a tokenizer is used. In this case, the tokenizer comes from HuggingFace’s transformers library. Since the distilbert-base-uncased transformer will be used in the model, its tokenizer must also be used here.

To tokenize the dataset, it must first be loaded into Python. A tokenizer object must also be created.

from transformers import DistilBertTokenizerFast
from ast import literal_eval
import tensorflow as tf  # used below for the one-hot labels
import pandas as pd
import requests
import json

#* create the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

#* load in the training dataset
df_train = pd.read_csv('https://raw.githubusercontent.com/Robert-MacWha/NLP-Intent-Classification/Intent-Classification/resources/data/train.csv')

#* load the intent labels (json)
resp = requests.get('https://raw.githubusercontent.com/Robert-MacWha/NLP-Intent-Classification/Intent-Classification/resources/data/intent_labels.json')
intent_labels = json.loads(resp.text)
intent_count = len(intent_labels)

Once the dataset is loaded, the tokenizer can be used to convert text into numeric input IDs.

inputs = tokenizer(
    list(df_train['words']),    # the list of strings to tokenize
    max_length=128,             # pad every sentence to a fixed length of 128 tokens
    padding='max_length',       # use the max_length value above as the padding target
    return_attention_mask=True,
    return_token_type_ids=False,
    return_tensors='np'         # return numpy arrays
)
x_train_ids = inputs['input_ids']
x_train_attention = inputs['attention_mask']

The tokenizer returns two arrays, stored here as x_train_ids and x_train_attention. The x_train_ids array contains the tokenized sentences and can be fed straight into the model. The x_train_attention array tells the model which tokens are padding and which are not. Since a padding length of 128 was used for the tokenizer, all shorter sentences will have null tokens (0) appended to them until they reach that length.

# an example of a single sentence after being tokenized
print(x_train_ids[0])
# the first 7 numbers are text IDs, while the rest are padding
# [7993, 170, 11303, 1200, 2443, 1110, 3014, 0, 0, 0, 0, 0, ..., 0]

print(x_train_attention[0])
# ones denote parts of the sentence, zeros denote padded tokens
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ..., 0]

The attention mask tells the model which tokens carry information, allowing it to ignore the padding tokens entirely.

Lastly, the labels for the tokenized sentences need to be converted into one-hot vectors. This is done using TensorFlow's one_hot function and is necessary since the model’s outputs are one-hot classifications.

y_train_intents = tf.one_hot(df_train['intent_label'].values, intent_count)

Finally, let’s print out one of the data points to make sure that everything has gone according to plan.

# sample datapoint
print(f'Prompt: {df_train["words"][0]}')
print(f'Token IDs: {x_train_ids[0][:12]}...')
print(f'Attention mask: {x_train_attention[0][:12]}...')
print(f'One-hot Label: {y_train_intents[0]}')
# Prompt: can you calculate twelve point five plus two
# Token IDs: [101 2064 2017 18422 4376 2391 2274 4606 2048 102 0 0]...
# Attention mask: [1 1 1 1 1 1 1 1 1 1 0 0]...
# One-hot Label: [1. 0. 0. 0. 0.]

And that’s it! The prompt has been tokenized and padded, the attention mask matches up with the input IDs, and the label is correct. The next steps are to build and train an intent classification transformer, but that’s for part two of this mini-series. I’ll be releasing it in just a few days, so if you’ve finished with this then sit tight and follow me so you’ll get notified when I post part two.


Thanks for reading my article! Feel free to check out my portfolio, message me on LinkedIn if you have anything to say, or follow me on Medium to get notified when I post another article!
