Cost Effective querying using Cheaper GPT models: LangChain Retrievers — Part 2

Reeshabh Choudhary
3 min read · Jul 14


Azure Cognitive Search & OpenAI:

NLP & Tokens:

Cost Effective querying using Cheaper GPT models — Part 1:


In the previous three articles, we discussed what Azure Cognitive Search is and how to query enterprise data using Azure OpenAI models. We also covered tokenization in Large Language Models (LLMs) and how to estimate the cost of using Azure OpenAI models. In the third article, we broke down the cost of different GPT models and their token limits. Building on that, we looked at summarization algorithms and the use of local LLMs to summarize search results before passing them to a GPT prompt for the final response.

In this article, we discuss how to leverage LangChain Retrievers and Chains to query large documents.

Azure Cognitive Search Retriever: A Vector DB alternative


A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to be able to store documents, only to return (or retrieve) them.
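To make the idea concrete, here is a minimal sketch of a retriever that holds no documents itself and only forwards the query to a search backend. The names `Document`, `SearchBackedRetriever` and `fake_search` are hypothetical and exist only for illustration; the method name `get_relevant_documents` mirrors the one LangChain retrievers expose, but this is not LangChain's own code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


class SearchBackedRetriever:
    """Retriever that stores nothing: it delegates every query to a search function."""

    def __init__(self, search: Callable[[str, int], List[str]], top_k: int = 3):
        self._search = search
        self.top_k = top_k

    def get_relevant_documents(self, query: str) -> List[Document]:
        # Wrap whatever text the backend returns into Document objects.
        return [Document(page_content=text) for text in self._search(query, self.top_k)]


def fake_search(query: str, top_k: int) -> List[str]:
    """Toy backend: returns texts sharing at least one word with the query."""
    corpus = [
        "Azure Cognitive Search indexes enterprise documents.",
        "GPT models have token limits.",
        "Refine chains update an answer iteratively.",
    ]
    words = set(query.lower().split())
    matches = [text for text in corpus if words & set(text.lower().split())]
    return matches[:top_k]
```

Swapping `fake_search` for a call to a hosted search service would leave the retriever itself unchanged, which is the point of the interface.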

We can use the default AzureCognitiveSearchRetriever from LangChain’s retriever library, which lets us run a search for our query input with minimal code. With this retriever, we can also do away with the code for chunking and indexing documents locally in a vector DB such as Chroma.

The obvious downside is that the Azure Cognitive Search service comes at a cost, but overall it is faster and more efficient than running the chunking and indexing process locally with ChromaDB.
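A minimal sketch of what that could look like is shown here. It assumes a LangChain release from around the time of writing, where the class is importable from `langchain.retrievers` and reads its settings from the three environment variables listed in the code; newer releases may expose it under a different module or name. The helper names `missing_settings` and `search`, and the index field name `content`, are assumptions made for this example.

```python
import os
from typing import List, Mapping

# Environment variables the retriever is assumed to read its connection settings from.
REQUIRED_SETTINGS = (
    "AZURE_COGNITIVE_SEARCH_SERVICE_NAME",
    "AZURE_COGNITIVE_SEARCH_INDEX_NAME",
    "AZURE_COGNITIVE_SEARCH_API_KEY",
)


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


def search(query: str, top_k: int = 3):
    """Fetch the top_k matching documents from the configured search index."""
    missing = missing_settings(os.environ)
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")

    # Imported lazily so this module still loads where LangChain is not installed.
    from langchain.retrievers import AzureCognitiveSearchRetriever

    # content_key names the index field holding the document text (assumed: "content").
    retriever = AzureCognitiveSearchRetriever(content_key="content", top_k=top_k)
    return retriever.get_relevant_documents(query)
```

With the three variables set, `search("your question here")` would return a list of LangChain documents whose `page_content` comes from the index.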

Response Processing using Recursive calls to LLMs via Chains

Once the results are retrieved from Cognitive Search, we can process them with one of the following chains:

· MAP-REDUCE: The map reduce documents chain first applies an LLM chain to each document individually (the Map step), treating the chain output as a new document. It then passes all the new documents to a separate combine documents chain to get a single output (the Reduce step).

· REFINE: The refine documents chain constructs a response by looping over the input documents and iteratively updating its answer. For each document, it passes all non-document inputs, the current document, and the latest intermediate answer to an LLM chain to get a new answer.
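The two strategies above can be sketched in plain Python to show their call patterns. This is an illustration under stated assumptions, not LangChain's implementation: `LLM` is a hypothetical prompt-in, text-out callable, the prompt wording is made up, and `CountingLLM` is a fake model that only records the prompts it receives.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, completion out


def map_reduce(docs: List[str], question: str, llm: LLM) -> str:
    # Map step: one independent LLM call per document.
    partials = [
        llm(f"Question: {question}\nDocument: {doc}\nAnswer from this document:")
        for doc in docs
    ]
    # Reduce step: one more call that combines the per-document outputs.
    combined = "\n".join(partials)
    return llm(f"Question: {question}\nPartial answers:\n{combined}\nFinal answer:")


def refine(docs: List[str], question: str, llm: LLM) -> str:
    answer = ""
    # Each call sees the current document plus the latest intermediate answer.
    for doc in docs:
        answer = llm(
            f"Question: {question}\nCurrent answer: {answer}\n"
            f"New document: {doc}\nRefined answer:"
        )
    return answer


class CountingLLM:
    """Fake model that records prompts so the call pattern is visible."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"answer-{len(self.prompts)}"
```

For N documents, this map-reduce sketch makes N + 1 calls, and the N map calls are independent of each other; refine makes N calls that must run in order. The sketch assumes the combined partial answers fit in a single reduce prompt. In LangChain itself, these strategies are selected through a `chain_type` argument such as `"map_reduce"` or `"refine"` on helpers like `load_qa_chain`.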

Sample Notebook

Please find the link to the sample notebook, where you can plug in your LLM with the proper API keys (managed via Azure Key Vault) and see the responses over your personal or enterprise documents.


While GPT models have a wide variety of use cases, balancing their token limits against the organizational cost of infrastructure remains a challenge. LLMs are effective, but because of token limits we have to make multiple calls to them, which results in slower response times. Hence, it is up to developers to mix and match local summarization with NLP (discussed in the previous part) and LLM retrievers and compression.





Reeshabh Choudhary

Software Architect and Developer | Author : Objects, Data & AI.