NLP, Tokens and Cost Estimation with Azure OpenAI

Reeshabh Choudhary
10 min read · Jun 15, 2023

In this article, we discuss NLP (Natural Language Processing), tokenization, which lies at the core of NLP, and how the cost of Azure OpenAI models is estimated based on tokens.

What is a token and why is tokenization required?

Let us consider the following paragraph:

“In the heart of a dense, enchanting forest, there lived a curious and imaginative boy named Ethan. Surrounded by towering trees, he called the woods his home. Every day, Ethan embarked on captivating adventures, weaving tales with the creatures that shared his mystical world. In the mornings, he would wake up to the sweet melody of chirping birds and the soft rustling of leaves as sunlight streamed through the canopy. With his trusty backpack and a heart full of curiosity, Ethan would wander through the moss-covered paths, discovering hidden nooks and crannies. He would converse with playful squirrels, listen to the wise whispers of ancient oaks, and befriend the gentlest of deer. Through his bond with nature, Ethan felt an indescribable connection, as if the forest’s secrets were whispered directly to his soul. The woods embraced him, offering solace and inspiration in their ever-changing beauty. With each passing day, the boy and the forest intertwined their stories, creating a symphony of harmony between humanity and the natural world. Together, they reveled in the magic of the woods, where imagination thrived, and dreams found their wings.”

The above paragraph contains a lot of information. It tells the story of a boy named Ethan who lives in an enchanting forest.

So, which parts of the information would you consider salient features of the paragraph? To start with, you could extract the names of the characters, the setting of his home and the forest around it, the animals or birds he encounters every day, the time setting of the story, any particular actions by the main character, and so on.

Natural Language Processing (NLP)

As humans, we can process this information through intuition and experience. But for large amounts of data, our eyes may not suffice, and we reach for a computer to help us out. The computer performs feature engineering on the text presented to it, a field we now call Natural Language Processing (NLP).

With the vast amount of textual data available, NLP helps extract valuable information from unstructured text sources. NLP techniques can automatically analyze documents, emails, social media posts, news articles, and other textual data to identify and extract relevant entities, relationships, sentiments, and other valuable insights. This facilitates tasks such as text summarization, sentiment analysis, named entity recognition, and topic modeling. NLP also enables the automatic generation of human-like text, such as chatbot responses, personalized recommendations, news articles, and even creative writing. NLP models can also summarize long documents or articles, extracting the most important information and condensing it into a shorter form. This can save time and provide users with concise summaries of extensive textual content.

But, how does NLP do all this stuff? Tokenization.

Bag-of-words & Bag-of-n-grams

Tokenization is a fundamental step in NLP that breaks down a sequence of text into smaller units or tokens. For starters, we can use a list of word count statistics called a bag-of-words (BoW). For simple tasks such as classifying a document, word count statistics often suffice. This technique can also be used in information retrieval, where the goal is to retrieve the set of documents that are relevant to an input text query.

Let’s consider an example to illustrate the process:

Suppose we have a small corpus with three documents:

Document 1: “I love cats.”

Document 2: “Dogs are loyal.”

Document 3: “Cats and dogs make great pets.”

To create the BoW representation, we first construct a vocabulary by collecting all the unique words present in the corpus:

Vocabulary: [“I”, “love”, “cats”, “dogs”, “are”, “loyal”, “and”, “make”, “great”, “pets”]

Next, we convert each document into a vector based on the word counts. The length of the vector is equal to the size of the vocabulary.

Document 1: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

Document 2: [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

Document 3: [0, 0, 1, 1, 0, 0, 1, 1, 1, 1]

In the vector representation, the first entry corresponds to the count of “I,” the second entry corresponds to the count of “love,” and so on. For example, in Document 1, the count of “I” is 1, the count of “love” is 1, and the count of “cats” is 1, while the counts of other words are 0.

By using this approach, each document is transformed into a numerical vector that captures the frequency of words. This vector representation allows for various machine learning algorithms to operate on text data and make predictions based on the presence or absence of specific words in the documents.
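A minimal sketch of this counting in Python, using only the standard library and the toy corpus above (the tokenizer here is a deliberately naive lower-casing split, so the vocabulary is written in lower case):

from collections import Counter

docs = [
    "I love cats.",
    "Dogs are loyal.",
    "Cats and dogs make great pets.",
]

# Deliberately naive tokenizer: lower-case, strip the period, split on whitespace
def tokenize(text):
    return text.lower().replace(".", "").split()

vocab = ["i", "love", "cats", "dogs", "are", "loyal", "and", "make", "great", "pets"]

# One count vector per document, ordered by the vocabulary
vectors = [[Counter(tokenize(doc)).get(word, 0) for word in vocab] for doc in docs]

print(vectors)
# [[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
#  [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
#  [0, 0, 1, 1, 0, 0, 1, 1, 1, 1]]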

Caution: This is not a perfect approach, as breaking a sentence down into single words can destroy its semantic meaning. For example, “not bad” means somewhat good or decent; with this approach, however, it is counted as “not” and “bad”, two negative sentiments.

Hence, computer scientists improve upon this with approaches such as bag-of-n-grams, which retains more of the original sequence structure of the text and can therefore produce a more informative representation.

An n-gram is a sequence of n tokens. A word is essentially a 1-gram, also known as a unigram. After tokenization, the counting mechanism can collate individual tokens into word counts or count overlapping sequences as n-grams.

For example, let us consider the following sentence:

“I love eating ice cream.”

If we tokenize this sentence, we get the following individual tokens or unigrams:

[“I”, “love”, “eating”, “ice”, “cream”]

Now, let’s examine the different n-gram representations based on different values of n:

Unigrams (1-grams):

The unigram representation simply counts the occurrences of individual tokens.

Unigram counts: {“I”: 1, “love”: 1, “eating”: 1, “ice”: 1, “cream”: 1}

Bigrams (2-grams):

The bigram representation counts sequences of two adjacent tokens.

Bigram counts: {“I love”: 1, “love eating”: 1, “eating ice”: 1, “ice cream”: 1}

Trigrams (3-grams):

The trigram representation counts sequences of three adjacent tokens.

Trigram counts: {“I love eating”: 1, “love eating ice”: 1, “eating ice cream”: 1}

NOTE: Bag-of-n-grams produces a much larger feature set, which is more expensive to compute and to store in the model. The larger the value of n, the higher the cost of computation and storage.
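The same counting generalizes to n-grams. A short sketch, again with only the standard library, reproducing the unigram, bigram, and trigram counts shown above:

from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token sequence
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "eating", "ice", "cream"]

for n in (1, 2, 3):
    print(n, dict(Counter(ngrams(tokens, n))))
# 1 {'I': 1, 'love': 1, 'eating': 1, 'ice': 1, 'cream': 1}
# 2 {'I love': 1, 'love eating': 1, 'eating ice': 1, 'ice cream': 1}
# 3 {'I love eating': 1, 'love eating ice': 1, 'eating ice cream': 1}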

Alternative approaches and enhancements

With the features extracted, our next goal becomes cleaning them of noise, which in text feature engineering means separating out words that do not help convey the semantic meaning of a sentence, paragraph, or article. For example, articles and auxiliary verbs such as ‘a’, ‘an’, ‘am’, and ‘was’ add little to the meaning of a paragraph. Hence, we can filter out such words, known as stopwords.

We can also apply frequency-based filtering on top of stopword filtering. It helps identify rare words, which can inflate computation and storage costs without adding significant value. We can either filter out such words or aggregate their counts into a special bin, which can itself serve as an additional feature.
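A minimal sketch of both filters, assuming NLTK’s English stopword list is installed (nltk.download fetches it once); the frequency threshold here is an arbitrary illustration:

import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stopword list

stop_words = set(stopwords.words("english"))

tokens = "the boy and the forest intertwined their stories in the forest".split()

# 1. Stopword filtering: drop words that carry little semantic meaning
content_tokens = [t for t in tokens if t not in stop_words]

# 2. Frequency-based filtering: route rare words (count below a threshold) into a single bin
counts = Counter(content_tokens)
threshold = 2
rare_count = sum(c for w, c in counts.items() if c < threshold)
frequent = {w: c for w, c in counts.items() if c >= threshold}

print(frequent, {"<rare>": rare_count})  # e.g. {'forest': 2} {'<rare>': 3}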

ML engineers also use a process called stemming to reduce words to their base or root form, known as the stem. The goal of stemming is to normalize words so that different forms of the same word are treated as a single entity. For example, stemming can transform variations of a word like “run,” “runs,” and “running” into the common stem “run”.

Again, it is worth noting that stemming has a computational cost, and its usage should be driven by the requirement.
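A quick sketch using NLTK’s Porter stemmer (one common choice; other stemmers such as Snowball behave similarly):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["run", "runs", "running", "happiness", "stories"]
print([stemmer.stem(w) for w in words])
# Typically: ['run', 'run', 'run', 'happi', 'stori'] — note the stems need not be dictionary words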

Parsing and Tokenization

Digitally, a computer understands a text document as a string: a sequence of words is represented as a string, and a string itself is a sequence of characters. In the real world, text documents can contain more than words, such as HTML tags, JSON structures, images, etc. The task at hand, then, is to extract the sequence of words from this given string.

Parsing refers to the process of analyzing the structure and components of a text document to identify different elements and their relationships. It is necessary when the string contains more than plain text. For example, in HTML, tags such as <p> for paragraphs or <h1> for headings indicate the boundaries of different sections within the document.

The plain-text portion of the string then goes through tokenization. It involves turning a string, a sequence of characters, into a sequence of tokens. Tokens are typically words or groups of characters that carry meaning. Tokenization helps segment the text and create a more granular representation that can be processed further. There are different tokenizer algorithms that divide the given text into segments, and they can be used independently or in conjunction depending on the requirement.
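For instance, a minimal sketch of the parsing step, assuming the beautifulsoup4 library is installed: HTML tags are stripped so that only plain text moves on to tokenization.

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

html = "<html><body><h1>Forest Tales</h1><p>Ethan wandered the moss-covered paths.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
plain_text = soup.get_text(separator=" ", strip=True)

print(plain_text)  # 'Forest Tales Ethan wandered the moss-covered paths.'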

Following are some of the common tokenization algorithms; libraries to perform them (NLTK, Hugging Face, etc.) are widely available, and a short NLTK sketch follows the list:

1. Word Tokenizers: Word tokenizers split the text into individual words based on language-specific rules.

2. Sentence Tokenizers: Sentence tokenizers segment text into individual sentences. They identify sentence boundaries based on punctuation marks, such as periods, question marks, and exclamation marks.

3. Rule-Based Tokenizers: Rule-based tokenizers utilize predefined rules and patterns to determine token boundaries. These rules can be based on punctuation marks, whitespace, or specific language patterns.

4. Whitespace Tokenizers: Whitespace tokenizers split the text based on whitespace characters, such as spaces or tabs.
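A short sketch of the first, second, and fourth of these, assuming the nltk package and its “punkt” tokenizer data are installed; a plain str.split() stands in for a whitespace tokenizer:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

text = "Ethan loves the forest. He talks to squirrels!"

print(sent_tokenize(text))  # ['Ethan loves the forest.', 'He talks to squirrels!']
print(word_tokenize(text))  # ['Ethan', 'loves', 'the', 'forest', '.', 'He', 'talks', 'to', 'squirrels', '!']
print(text.split())         # whitespace tokenizer: ['Ethan', 'loves', 'the', 'forest.', 'He', ...]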

OpenAI and Tokenization

OpenAI, in collaboration with Microsoft, is a leading developer of Large Language Models, which are designed to understand and generate human-like text based on large-scale pre-training on diverse internet text sources.

OpenAI utilizes tokenization to a very large extent. The models operate at the token level, meaning they process and generate text on a token-by-token basis. Each token in the input text is assigned a numerical representation that the model uses for computation. The tokenization process allows the model to handle long sequences of text efficiently and helps capture the contextual relationships between words.

Byte-Pair Encoding

OpenAI utilizes a subword-based tokenizer called Byte-Pair Encoding (BPE). It splits words into smaller subword units, particularly focusing on rare or out-of-vocabulary words. The intention is to handle the long tail of words that may not appear frequently in the training data. The subword vocabulary is built by iteratively merging the most frequent pairs of characters or character sequences to form new subwords.

Let’s consider the word “unhappiness.”

Using the BPE algorithm, the subword vocabulary is built bottom-up: the word starts as a sequence of individual characters, and the most frequent adjacent pairs (as observed across the training corpus) are merged step by step. A simplified sequence might look like this:

Initial segmentation: [u, n, h, a, p, p, i, n, e, s, s]

  • Iteration 1: Merge the most frequent pairs of characters, e.g. “u” + “n” and “s” + “s”.

Segmentation: [un, h, a, p, p, i, n, e, ss]

  • Iteration 2: Merge further frequent pairs, e.g. “n” + “e” and then “ne” + “ss” into “ness”.

Segmentation: [un, h, a, p, p, i, ness]

  • Iteration 3: Merge the characters of the root into a common subword, e.g. “happi”.

Segmentation: [un, happi, ness]

Let us say our desired vocabulary size has been reached after iteration 3, so the algorithm stops merging, and “unhappiness” is represented by the subwords [un, happi, ness].

By representing “unhappiness” as a combination of subwords, the model can learn that it is composed of the prefix “un-” indicating negation, the root word “happy,” and the suffix “-ness” denoting a state or quality. This understanding enables the model to grasp the nuanced meaning of “unhappiness” as the opposite of happiness.
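To see how a production BPE tokenizer actually segments text, here is a minimal sketch using OpenAI’s open-source tiktoken library (assuming the cl100k_base encoding used by gpt-3.5-turbo); the exact subword pieces and token IDs depend on the learned vocabulary:

import tiktoken  # OpenAI's open-source BPE tokenizer library

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo

token_ids = enc.encode("unhappiness")
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)  # the numeric token IDs the model actually processes
print(pieces)     # the subword pieces, e.g. something like ['un', 'h', 'appiness'], depending on the vocabulary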

Cost Estimation before using Azure OpenAI

Azure OpenAI charges for its models on a per-1000-token basis. Following are the models and their respective costs, which may differ from region to region:

Standard Models (cost per 1000 tokens):

Text-Ada: ₹0.033029

Text-Babbage: ₹0.041286

Text-Curie: ₹0.165143

Text-Davinci: ₹1.651426

Codex Models (cost per 1000 tokens):

Code-Cushman: ₹1.981711

Code-Davinci: ₹8.257126

ChatGPT Model (cost per 1000 tokens):

ChatGPT (gpt-3.5-turbo): ₹0.165143

Note: Token costs apply to both input and output. Suppose we send an input prompt to OpenAI that is 1,000 tokens long and the output received from OpenAI is around 500 tokens long; we would then be charged for 1,500 tokens.
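To make the billing rule concrete, a small sketch (again using tiktoken, with the prompt and completion below as made-up examples) that counts input and output tokens separately and sums them:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding assumed for gpt-3.5-turbo

prompt = "Summarize the story of Ethan and the forest in two sentences."
completion = "Ethan, a curious boy, lives in an enchanted forest. He spends his days befriending its creatures."

prompt_tokens = len(enc.encode(prompt))
completion_tokens = len(enc.encode(completion))

billed_tokens = prompt_tokens + completion_tokens  # input and output tokens are both charged
print(prompt_tokens, completion_tokens, billed_tokens)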

Cost Estimation

Now, let us take an example: estimating the cost of an application that queries enterprise data.

For reference, a token is roughly four characters of typical English text. Let us assume an average of 1.2 tokens per word.

Assuming sentences run between 10 and 15 words on average: Average words per sentence = (10 + 15) / 2 = 12.5 words

For 1 sentence query: Average words per query = 12.5 words * 1 sentence = 12.5 words

For 2 sentence queries: Average words per query = 12.5 words * 2 sentences = 25 words

· For a 1-sentence query, the average tokens per query would be 12.5 words * 1.2 tokens per word = 15 tokens.

· For a 2-sentence query, the average tokens per query would be 25 words * 1.2 tokens per word = 30 tokens.
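The same per-query arithmetic in a couple of lines (the 1.2 tokens-per-word figure is the assumption stated above):

avg_words_per_sentence = (10 + 15) / 2   # 12.5 words
tokens_per_word = 1.2                    # assumed conversion factor

for sentences in (1, 2):
    words = avg_words_per_sentence * sentences
    print(sentences, "sentence(s):", words * tokens_per_word, "tokens")  # 15.0 and 30.0 tokens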

Calculation Assumption

Assuming 500 tokens per query (covering the prompt, any supplied context, and the response), with 25 users per day and 50 queries per user per day:

Tokens per user per day = 50 queries per user * 500 tokens per query = 25,000 tokens

Total tokens per day for all users = Tokens per user per day * Number of users per day = 25,000 tokens * 25 users = 625,000 tokens

Considering: ChatGPT (gpt-3.5-turbo) model with pricing of ₹0.165143 per 1000 tokens.

Cost per day = Total tokens per day / 1000 * Cost per 1000 tokens

Cost per day = 625,000 / 1000 * ₹0.165143

Cost per day ≈ ₹103.21

Monthly cost = Daily cost * Number of days in a month

Monthly cost = ₹103.21 * 30

Monthly cost ≈ ₹3,096.30 (~US$38)
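The same arithmetic as a short Python sketch, using the assumed figures above (25 users, 50 queries per user per day, 500 tokens per query, and the gpt-3.5-turbo rate quoted earlier):

# Hypothetical figures from the estimate above
users_per_day = 25
queries_per_user_per_day = 50
tokens_per_query = 500                 # assumed prompt + response tokens per query
price_per_1000_tokens = 0.165143       # INR, the gpt-3.5-turbo rate quoted above

tokens_per_day = users_per_day * queries_per_user_per_day * tokens_per_query
cost_per_day = tokens_per_day / 1000 * price_per_1000_tokens
monthly_cost = cost_per_day * 30

print(f"Tokens per day: {tokens_per_day:,}")    # 625,000
print(f"Cost per day:   ₹{cost_per_day:.2f}")   # ≈ ₹103.21
print(f"Cost per month: ₹{monthly_cost:.2f}")   # ≈ ₹3,096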

Important Notes:

The calculation above is just an estimate for the case where you use the LLM models to get your queries answered. In the case of fine-tuning the models, charges are based on factors such as training hours, hosting hours, and inference per 1,000 tokens. You can perform a cost analysis in Azure OpenAI Studio as per your use case.

References:

1. Feature Engineering for Machine Learning Principles and Techniques for Data Scientists- Alice Zheng, Amanda Casari

2. Byte-Pair Encoding: Subword-based tokenization | Towards Data Science

3. https://www.linkedin.com/pulse/querying-enterprise-data-using-azure-openai-cognitive-choudhary

4. Plan to manage costs for Azure OpenAI Service — Azure Cognitive Services | Microsoft Learn

#NLP #Tokenization #AzureOpenAI #ChatGPT #CostEstimation
