Token Limits of Azure OpenAI: Cost-Effective Querying Using Cheaper GPT Models

Reeshabh Choudhary
3 min read · Jun 24




In the previous two articles, we discussed what Azure Cognitive Search is and how to query enterprise data using Azure OpenAI models. We also covered tokenization in Large Language Models (LLMs) and how to estimate the cost of using Azure OpenAI models effectively.

In this article, we will discuss the limitations of Azure OpenAI models when querying your enterprise data, how to overcome them, and some effective strategies for getting good output from cheaper GPT models, saving your organization significant infrastructure cost.

Azure OpenAI Pricing and Token Limits

Pricing table with Token Limits

Problem Statement

Suppose an organization has around 50 documents with a cumulative size of 12 MB. The organization has uploaded the documents to Azure Blob Storage and indexed them using Azure Cognitive Search.

The process for these steps has already been described in detail in the previous article.

The organization has a user interface (UI) that passes a query to a server. The server fetches results from the search operation performed by Azure Cognitive Search and passes them as a prompt to a generative OpenAI client, which generates a response for the user.
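The flow above can be sketched as a simple prompt-assembly step. The function and the `"content"` field name below are illustrative assumptions, not taken from the organization's actual code:

```python
# Illustrative sketch: turn Cognitive Search results into a completion prompt.
# The "content" field name is an assumption about the search index schema.

def build_prompt(query: str, search_results: list[dict]) -> str:
    """Assemble a grounded prompt from Cognitive Search results."""
    context = "\n\n".join(doc["content"] for doc in search_results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

results = [
    {"content": "Azure Blob Storage stores unstructured data."},
    {"content": "Azure Cognitive Search indexes documents."},
]
prompt = build_prompt("What does Blob Storage store?", results)
print(prompt.startswith("Answer the question"))  # True
```

The assembled prompt is what gets sent to the OpenAI completion endpoint, and its size is exactly where the token problem arises.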


The organization has a cost constraint, so it is not using GPT-4 models to generate responses. Instead, it uses a lower-end model, say ‘text-davinci-003’, which has a limit of 4,097 tokens for the prompt and response combined.

However, the text returned by Azure Cognitive Search is often large, so the token limit is breached when the results are fed to the GPT model.
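Before summarizing anything, it helps to detect when the limit would be breached. OpenAI's `tiktoken` library gives exact counts; the sketch below instead uses the common rough approximation of about 4 characters per token so it stays dependency-free. The 500-token response budget is an assumption for illustration:

```python
# Rough guard against the 4,097-token combined limit of text-davinci-003.
# Uses the ~4-characters-per-token approximation (tiktoken gives exact counts).

MAX_TOKENS = 4097
RESPONSE_BUDGET = 500  # tokens reserved for the model's answer (assumed value)

def estimate_tokens(text: str) -> int:
    """Approximate token count: about 4 characters per token for English."""
    return max(1, len(text) // 4)

def truncate_prompt(text: str, max_prompt_tokens: int = MAX_TOKENS - RESPONSE_BUDGET) -> str:
    """Trim the prompt so prompt + response fit within the model limit."""
    if estimate_tokens(text) <= max_prompt_tokens:
        return text
    return text[: max_prompt_tokens * 4]

long_text = "word " * 5000  # roughly 6,250 estimated tokens, over the limit
short = truncate_prompt(long_text)
print(estimate_tokens(short) <= MAX_TOKENS - RESPONSE_BUDGET)  # True
```

Blind truncation loses information from the tail of the search results, which is why the summarization approaches discussed next are usually preferable.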


Developers have two options:

1. Convince the organization to use higher-end models, paying roughly three times the cost, for a seamless experience.

2. Perform some data-science processing to reduce the prompt size fed to the GPT model, accepting the computational cost in response time.

Here, we are focusing on the second approach.

Once the results are retrieved from Cognitive Search, we can do the following:

· Write a custom text-summarization function to extract a summary from the result text and then feed it to the GPT model. Pros: less computation time. Cons: the quality of the generated summary may be lower.

· Use pre-trained transformer models such as “google/pegasus-xsum” or “facebook/BART” to perform text summarization and feed the result into the GPT model’s prompt. Pros: the quality of the generated summary will be better. Cons: higher computation time.

Code examples of both approaches are implemented in the following Jupyter notebook:
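For the second approach, Hugging Face's `transformers` summarization pipeline can be used. The sketch below chunks long search results so each piece fits the model's input limit; the chunk size and generation lengths are illustrative assumptions, and the model download happens only when `summarize_chunks` is actually called:

```python
# Transformer-based summarization via the Hugging Face pipeline API.
# Chunk size and max_length/min_length values are illustrative assumptions.

def chunk_text(text: str, max_chars: int = 3000) -> list[str]:
    """Split long search results into pieces the model can accept."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_chunks(text: str) -> str:
    """Summarize each chunk with pegasus-xsum and join the summaries."""
    from transformers import pipeline  # pip install transformers

    summarizer = pipeline("summarization", model="google/pegasus-xsum")
    return " ".join(
        summarizer(chunk, max_length=60, min_length=10)[0]["summary_text"]
        for chunk in chunk_text(text)
    )
```

The joined summaries then become the context of the prompt sent to ‘text-davinci-003’, keeping the total well under the 4,097-token limit at the cost of the summarization model's inference time.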


While GPT models have a wide variety of use cases, balancing model token limits against organizational infrastructure cost remains a challenge. Hence, NLP techniques can be leveraged for text processing and cleaning, with the added computational cost accepted in such scenarios. In the coming days we may see efficient LLMs with scalable context sizes at lower cost, but until then, we are stuck with coding work 😉.




Reeshabh Choudhary

Software Architect and Developer | Author: Objects, Data & AI.