Querying Enterprise Data using Azure OpenAI and Cognitive Search
This article briefly discusses the components used to build an application that queries enterprise data with Azure OpenAI Service and Azure Cognitive Search. We look closely at how Cognitive Search indexes the data, which in turn makes querying faster and more efficient. Finally, we outline a Python example, with a link to the GitHub project and its Jupyter Notebook, that shows how to use Azure OpenAI and the Azure SDKs to generate prompts and responses.
Microsoft's collaboration with OpenAI is available through Azure OpenAI Service. OpenAI has pre-trained Large Language Models (LLMs) that are very good at understanding and generating text, and Azure OpenAI Service provides REST API access to these models. Users can access the service through the REST APIs, the Python SDK, or the web-based interface in Azure OpenAI Studio, and can apply the models to their specific requirements such as content generation, semantic search, and natural language to code translation.
One such use case is querying your enterprise data with Azure OpenAI Service. Enterprise data can be anything from text files to PDFs to CSVs. The LLMs are driven through prompts, and LangChain makes it easier to build those prompts and pass them to the models.
However, before we leverage LLMs and LangChain, we need to make the data accessible and readable to the LLMs, and this is where the Azure Cognitive Search service comes in.
Azure Cognitive Search is a cloud search service that provides developers with infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications. It has the following capabilities:
· A search engine for full text search.
· Indexing with lexical analysis and AI enrichment for content extraction and transformation, for example, extracting images from PDF files.
· Query syntax for various search paradigms such as text search, fuzzy search, geo-search, etc.
· Accessibility via REST APIs and SDKs.
· Easy integration with the data layer through Azure storage services such as Blob Storage.
Index in Cognitive Search
In Azure Cognitive Search, data is imported and indexed according to a schema, which means that a copy of the searchable data lives inside the search service, separate from the primary data stores.
In Cognitive Search, a document is a single unit of searchable data in the index, and the structure of a document is determined by the index schema. Indexing is an intake process that loads content into the Cognitive Search service and makes it searchable. Internally, inbound text is processed into tokens and stored in inverted indexes for fast scans.
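To make the idea of tokens and inverted indexes concrete, here is a minimal pure-Python sketch. It is not how Cognitive Search is implemented internally; the tokenizer, the document keys, and the sample texts are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, text in documents.items():
        for term in tokenize(text):
            index[term].add(key)
    return index


def search(index, query):
    """Return the keys of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, keyed by a document id.
documents = {
    "doc1": "Quarterly revenue report for the sales team",
    "doc2": "Onboarding guide for the engineering team",
}
index = build_inverted_index(documents)
print(search(index, "team report"))
```

A lookup only touches the entries for the query terms instead of scanning every document, which is why an inverted index makes full-text queries fast.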
NOTE: A search index should only contain searchable content.
The following is the typical outline of an index schema in Cognitive Search:
"@odata.context": "<Cognitive Search Service Endpoint>",
"@odata.etag": "<respective TAG ID>",
"name": "<Index NAME>",
The “fields” collection is typically the largest part of an index, where each field is named, assigned a data type, and attributed with allowable behaviors that determine how it is used. Fields are used for document identification (keys), for storing searchable text, and for supporting filters, facets, and sorts. Field attributes determine how a field is used, such as whether it takes part in full text search, faceted navigation, sort operations, and so forth.
Field attributes play a crucial role in defining the behavior and capabilities of data indexing processes. By specifying attributes for each field, you can enable various operations and optimizations to support different types of data access.
1. “Searchable”: When a field is marked as “searchable,” it indicates that the data within that field can be searched using full-text search techniques. This typically involves tokenizing the text into individual terms and creating inverted indices that map each term to the documents or records containing it. During a search operation, the inverted index is scanned to locate the relevant documents.
2. “Filterable”: A “filterable” attribute allows for filtering or querying based on the field’s original, unmodified value. This means that the field can be used as a criterion to narrow down the search results or retrieve specific records. Unlike full-text search, filtering doesn’t involve tokenization or scanning inverted indices. Instead, it relies on other data structures or indexing techniques that can efficiently process the original values.
3. “Sortable”: When a field is marked as “sortable,” it means that the data can be used to sort the search results or records in a particular order. Sorting typically requires an index structure that organizes the data in a way that facilitates efficient sorting operations. For example, a field containing numeric values or timestamps can be sorted in ascending or descending order to provide a meaningful order of the search results.
By defining these attributes during the indexing process, the search or retrieval system can create the necessary data structures and optimize the operations accordingly. These attributes are just a few examples of how fields can be configured to support different types of access and enable efficient searching, filtering, and sorting of data.
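As an illustration of the three attributes above, the sketch below declares a small "fields" collection as Python dictionaries and checks whether a request uses each field in a way its attributes allow. The attribute names and Edm types follow the Cognitive Search index schema; the field names and the `check_request` helper are hypothetical and exist only for this example.

```python
# Hypothetical fields collection for an index of PDF documents.
fields = [
    {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
    {"name": "content", "type": "Edm.String", "searchable": True},
    {"name": "category", "type": "Edm.String", "filterable": True},
    {
        "name": "last_modified",
        "type": "Edm.DateTimeOffset",
        "filterable": True,
        "sortable": True,
    },
]


def fields_with(attribute):
    """Return the names of the fields that have the given attribute enabled."""
    return [f["name"] for f in fields if f.get(attribute, False)]


def check_request(search_fields=(), filter_fields=(), order_by=()):
    """List every field that a request uses in a way its attributes forbid."""
    problems = []
    usage = (
        (search_fields, "searchable"),
        (filter_fields, "filterable"),
        (order_by, "sortable"),
    )
    for used, attribute in usage:
        allowed = set(fields_with(attribute))
        for name in used:
            if name not in allowed:
                problems.append(f"{name} is not {attribute}")
    return problems


print(fields_with("filterable"))
print(check_request(search_fields=["content"], order_by=["content"]))
```

In this toy schema, sorting by `content` is reported as a problem because the field was declared searchable but not sortable, which mirrors how the attributes chosen at index creation decide which operations a field can support.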
NOTE: Azure has its own implementation of the physical layer of an index, which is managed by Microsoft. Developers can access the index schema, query its content, and manage its size.
Indexer in Cognitive Search
After the data is imported and the index schema is defined, the indexing process is handled by an indexer. An indexer is a crawler that extracts searchable content from cloud data sources and populates a search index using field-to-field mappings between the source data and the search index. Multiple data sources are supported: an indexer can crawl data stores on Azure as well as outside of it.
Indexers can be run on demand or on a schedule, so that the index keeps up with changes in your data source. Indexers have a basic capability for change detection. On the first run, when the index is empty, an indexer reads all of the data provided in the table or container. On subsequent runs, the indexer can usually detect and retrieve just the data that has changed.
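The field-to-field mapping can be pictured with a small sketch. The `sourceFieldName` and `targetFieldName` keys mirror the shape of an indexer's field mappings, and the two `metadata_storage_*` names are blob metadata properties; the target field names and the `map_document` helper are assumptions for illustration, not the service's implementation.

```python
# Hypothetical mappings from blob properties to index fields.
field_mappings = [
    {"sourceFieldName": "metadata_storage_path", "targetFieldName": "id"},
    {"sourceFieldName": "metadata_storage_name", "targetFieldName": "file_name"},
]


def map_document(source_doc, mappings):
    """Rename mapped source fields; unmapped fields keep their own name."""
    source_to_target = {
        m["sourceFieldName"]: m["targetFieldName"] for m in mappings
    }
    return {
        source_to_target.get(name, name): value
        for name, value in source_doc.items()
    }


# A made-up record as an indexer might see it after reading a blob.
source_doc = {
    "metadata_storage_path": "https://example.blob.core.windows.net/docs/a.pdf",
    "metadata_storage_name": "a.pdf",
    "content": "Text extracted from the PDF",
}
print(map_document(source_doc, field_mappings))
```

Fields whose names already match, such as `content` here, need no explicit mapping in this sketch.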
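One common way to implement this kind of change detection is a "high-water mark": remember the latest modification time seen so far and, on the next run, pick up only items modified after it. The sketch below is a simplified model under that assumption; the item layout and the `select_changed` helper are hypothetical and do not reflect the indexer's actual internals.

```python
from datetime import datetime, timezone


def select_changed(items, high_water_mark):
    """Return the items modified after the mark, plus the updated mark."""
    changed = [
        item
        for item in items
        if high_water_mark is None or item["last_modified"] > high_water_mark
    ]
    # Keep the old mark when nothing changed.
    new_mark = max(
        (item["last_modified"] for item in changed), default=high_water_mark
    )
    return changed, new_mark


# Made-up blobs with their last-modified timestamps.
blobs = [
    {"name": "a.pdf", "last_modified": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"name": "b.pdf", "last_modified": datetime(2023, 6, 2, tzinfo=timezone.utc)},
]

# First run: the mark is empty, so everything is read.
first_batch, mark = select_changed(blobs, None)

# A new blob arrives; the next run picks up only that blob.
blobs.append(
    {"name": "c.pdf", "last_modified": datetime(2023, 6, 3, tzinfo=timezone.utc)}
)
second_batch, mark = select_changed(blobs, mark)
print([b["name"] for b in first_batch], [b["name"] for b in second_batch])
```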
Flow of operation of an Indexer:
Once the data is indexed, we can use the Azure SDK to run search queries against it.
Let us define the use case for the code example:
We have uploaded PDF files to Azure Blob Storage, and we use the Azure Search Documents client (the Cognitive Search SDK) in Python to query the data based on user input. Once the search results are obtained, we extract the relevant information from them and pass it to Azure OpenAI, using LangChain to create the prompt and obtain the response from the model.
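The flow described above can be sketched roughly as follows. This is not the code from the linked project: the `content` field name, the prompt wording, and both helper functions are assumptions. `search_documents` relies on the `azure-search-documents` package and a real search service, so only `build_prompt` runs without Azure credentials.

```python
def search_documents(endpoint, index_name, api_key, query, top=3):
    """Query Cognitive Search (needs the azure-search-documents package)."""
    # Imported here so the rest of the sketch works without the SDK installed.
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    client = SearchClient(
        endpoint=endpoint,
        index_name=index_name,
        credential=AzureKeyCredential(api_key),
    )
    return list(client.search(search_text=query, top=top))


def build_prompt(question, results, content_field="content", max_chars=1500):
    """Turn search results into a prompt that grounds the model's answer."""
    excerpts = []
    for number, result in enumerate(results, start=1):
        text = str(result.get(content_field, "")).strip()
        if text:
            # Truncate long documents so the prompt stays a manageable size.
            excerpts.append(f"[{number}] {text[:max_chars]}")
    context = "\n\n".join(excerpts) if excerpts else "(no matching documents)"
    return (
        "Answer the question using only the sources below. "
        "If the answer is not in the sources, say you do not know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up results standing in for what search_documents would return.
sample_results = [
    {"content": "Employees are entitled to 20 days of paid leave per year."},
    {"content": ""},
]
prompt = build_prompt("How many days of paid leave do I get?", sample_results)
print(prompt)
```

The resulting prompt string is what would then be sent to the Azure OpenAI deployment, for example through LangChain as the demo project does.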
The demo program with proper documentation can be accessed at the following link:
GitHub - reeshabh90/PDF-reader