⏺Automate Mundane Tasks in Enterprises using Embedding and Vector DB.

Reeshabh Choudhary
Dec 21, 2023

👷‍♂️ Software Architecture Series — Part 12.

💡To achieve better efficiency and keep sight of the larger picture, organizations should aspire to automate repetitive manual tasks so that their people have energy left to focus on the goals. The rule of thumb: if a task is manual and repetitive, it should be automated.

🎛In an organization, manpower can be tied up in any number of mundane tasks. A few examples:

- An e-commerce company like Amazon uses recommendation algorithms to suggest products based on a user's browsing history, purchases, and behavior.
- A retail organization can manage inventory levels with automated tracking systems that trigger orders or alerts when stock reaches predefined thresholds.
- In customer support, an incoming ticket can be classified into groups and routed to the appropriate department or agent based on its content or category.
- These days, almost all customer-facing websites leverage AI-powered chatbots to handle routine queries, provide instant responses, and free up manpower for more complex issues.
- Social media platforms use content-moderation algorithms to flag or remove inappropriate or spam content, ensuring compliance with community guidelines. They also recommend content based on user preferences; Netflix, for example, surfaces titles based on your previous watch history.
- HR teams can leverage filtering algorithms to screen resumes and rank them against a job description.
- Sending timely reminders to customers and booking appointments can also be automated.
- In healthcare, AI-powered systems can assist radiologists by automatically analyzing medical images (X-rays, MRIs) to detect anomalies or assist in diagnosis.

⏳Most of the tasks discussed above require some level of classification, filtering, clustering, etc., which can be implemented by leveraging embedding models and a vector DB. Vector DBs, which store vector representations of entities, have recently become a trend. Let us understand what embeddings are, what embedding models are, and how we can leverage a vector DB.

💻Computers understand numbers, while in real life we deal with text, images, sound, etc. To process and understand information from data, we need a numeric handle on its different aspects. For example, in a simple scenario, if we have to group students based on their performance, we can look at their marks in different subjects and cluster them as per our needs. Here, each student is represented by marks in four or five subjects, a handful of dimensions, and the grouping is straightforward (a minimal sketch follows). Real-life data, however, is far more complex: text, images, and sound carry a high degree of dimensionality, and it is not even possible to analyze those dimensions by simply plotting them. This is where embeddings come to play an important role.
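Here is a minimal sketch of that low-dimensional students case, using scikit-learn's KMeans. The marks matrix is made-up illustrative data, not taken from any real records.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one student; columns are marks in 4 subjects.
marks = np.array([
    [85, 90, 78, 92],   # high performer
    [40, 55, 48, 50],   # average performer
    [88, 84, 91, 89],   # high performer
    [35, 42, 50, 45],   # average performer
])

# Group the students into 2 performance clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(marks)
print(kmeans.labels_)   # e.g. [1 0 1 0] -- two performance groups
```

With only a few dimensions per student, this works directly on the raw numbers; the point of embeddings is to get data like text and images into a comparable numeric form first.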

📌Embedding

2D representation of students' data

Embeddings are numerical representations of data that intend to capture its semantic meaning. These numerical representations, called vectors, are stored in a vector DB. Embeddings position words or phrases close together in the vector space if they share similar meanings or contexts; words with related meanings or usage tend to have similar vector representations. Embeddings can also be created for images or sound. Once we have embeddings of the data we want to analyze, we can do a lot of creative things. For instance, in natural language processing (NLP), words with similar meanings will have similar embeddings. This gives us a way to quantify the 'similarity' between different words or entities, which is incredibly valuable when building complex models. When we embed images, visually similar images end up close to each other in the vector space.
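A minimal sketch of "similar meaning means nearby vectors", using cosine similarity. The 3-dimensional vectors here are toy values chosen purely for illustration; real embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 = same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king  = np.array([0.90, 0.80, 0.10])
queen = np.array([0.85, 0.82, 0.15])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))  # close to 1 -> related words
print(cosine_similarity(king, apple))  # much smaller -> unrelated words
```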

📌Embedding Models

There can be different embedding models capturing different semantic relationships between entities, and such smaller models can even be grouped together under a large language model. Pre-trained embedding models like Word2Vec, GloVe, and FastText, and transformer-based models like BERT or GPT, capture semantic relationships between words, phrases, or sentences in a high-dimensional space. BERT and GPT can even incorporate contextual information, understanding how the meaning of a word changes based on its context in a sentence or document. These models are trained on large datasets, learning from vast amounts of text. This enables them to capture general language patterns and nuances and to transfer that knowledge to new, smaller datasets or specific tasks with limited labeled data.

An LLM embedding model can capture many such relationships
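As a small sketch of putting a pre-trained model to work, the snippet below assumes the sentence-transformers package; 'all-MiniLM-L6-v2' is one commonly used general-purpose model, shown here only as an example.

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The invoice was paid on time.",
    "Payment for the bill was received promptly.",
    "The cat sat on the mat.",
]

# Each sentence becomes one 384-dimensional vector.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```

The first two sentences say roughly the same thing, so their vectors will score much closer to each other than to the third.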

Now, let us take the example of a recommendation system built with an embedding model and a vector DB. Say a company wants to rank the resumes submitted against a job description. Doing this manually is hard work, as the person(s) involved will need to go through each and every resume and sort them out. There is a better way to automate the process, one that surfaces the resumes closest in meaning to the job description and saves valuable time. We simply embed the list of resumes and store the embeddings in a vector DB like Chroma, Qdrant, etc. The job description itself is then embedded and compared against the stored resume embeddings in the same vector space. Similarity between the vectors (resume vs. JD) can be measured with metrics such as cosine similarity, Euclidean distance, or the dot product, and the resumes can then be ranked by their similarity score. A minimal code sketch of this pipeline follows the diagram below.

Recommendation System
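Below is a minimal sketch of the resume-ranking idea using Chroma, one of the vector DBs named above. Chroma's default embedding function embeds the text on insert; the resume texts and job description here are hypothetical.

```python
import chromadb

# In-memory instance, enough for illustration.
client = chromadb.Client()
resumes = client.create_collection(name="resumes")

# Store each resume; Chroma embeds the documents on insert.
resumes.add(
    ids=["r1", "r2", "r3"],
    documents=[
        "5 years of Python backend development, REST APIs, PostgreSQL.",
        "Graphic designer skilled in Photoshop and brand identity.",
        "Data engineer: Python, Spark, building ETL pipelines at scale.",
    ],
)

# Embed the job description and retrieve the nearest resumes,
# ranked by vector distance (smaller = more similar).
job_description = "Backend developer with Python and SQL experience."
ranked = resumes.query(query_texts=[job_description], n_results=3)

for rid, dist in zip(ranked["ids"][0], ranked["distances"][0]):
    print(rid, dist)
```

The query returns the resumes ordered by distance to the JD vector, which is exactly the ranking we wanted to produce by hand.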

🛒Summary:

Embeddings capture semantic meaning by positioning words, sentences, images, sound, etc. in a continuous vector space where similar semantics correspond to closer vector representations. Dimensionality-reduction techniques like PCA and UMAP can project this high-dimensional data down to low dimensions, which enables easier analysis and visualization (a small sketch follows). Embeddings are stored in a vector DB, which can be leveraged for fast retrieval of the vectors most similar or closest to a query vector, facilitating efficient similarity searches across massive datasets. Modern vector databases efficiently handle large volumes of vector data, scaling to millions or billions of vectors while maintaining search performance as the dataset grows, which is crucial for real-time applications and systems dealing with massive data. This aids in building recommendation systems that retrieve similar items or content based on user preferences or item embeddings, and it is equally useful in semantic search, efficiently retrieving similar documents, sentences, or word embeddings based on their semantic similarity. Clustering and classification can likewise be performed easily on top of a vector DB.
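A minimal sketch of the dimensionality-reduction point: projecting high-dimensional embeddings to 2D with PCA for inspection. The random 384-dimensional vectors below stand in for real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))  # 100 items, 384-dim each

# Project down to 2 dimensions so the data can be plotted.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)
print(points_2d.shape)  # (100, 2)
```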

#softwarearchitect #LLM #architecture #softwaredevelopment #AI #ML #recommendation #automation

Written by Reeshabh Choudhary

Software Architect and Developer | Author: Objects, Data & AI.
