Dealing with Complex Relationships? Try Graph Databases!
👷♂️ Software Architecture Series — Part 16.
Introduction
Edward Codd predicted back in 1972 that in the near future a great variety of languages would be proposed to interrogate and update databases. That prediction has stood the test of time, and the emergence of a great variety of databases (especially NoSQL) is itself significant in the realm of computer science. The internet led to an explosion of data creation in the 1990s, and the trend is still going strong and likely to get stronger for the foreseeable future. According to the latest estimates, 328.77 million terabytes of data are created each day. This growth has been exponential: by some estimates, around 90% of the world’s data was generated in the last couple of years alone.
We live in the age of Big Data, where creating content is not the only priority: powerful search engines that find relevant content across the web are a basic necessity of using the internet. Search engines such as Google and Bing provide semantic search, which aims to understand the intent and contextual meaning behind a search query rather than just matching keywords or phrases. These semantic search engines can recognize entities (such as people, places, organizations) and their relationships, allowing for more precise retrieval of information.
Facebook has around 3 billion monthly active users, of whom around 2 billion are active every day. These users post a wide variety of content every day, and most posts carry media (photos, videos, etc.) and location information. Moreover, social media platforms allow data to be integrated across platforms; for example, a TikTok user can post content on Facebook, Twitter, etc. With different entities connected to each other and the amount of data generated by these platforms increasing exponentially, how are enterprises supposed to make sense of the data?
Data is distributed widely across the internet, and enterprises need to make the relevant data available to various stakeholders within the organization. Within an enterprise, however, the data may not live on a single cloud provider’s platform but may be distributed across several. This strengthens the case for an intelligent data fabric in which different sources of data are interconnected and relevant data is accessed only by the entities entitled to it. The data fabric makes use of metadata, which is data about data. The idea is to discover data, understand the underlying relationships, track usage, and assess the value generated and the risks involved in using the data.
Relational data models fail to capture Relationships, but so do most Non-relational models!
In the early days, enterprises mostly generated structured data with well-defined relationships from their business operations, and stored it in relational databases. However, as the complexity of business operations increased over time, the structure of the data also grew more complex, to the point that it was no longer feasible to capture the relationships and semantics of entities through relational databases. With the advent of social media platforms, the complexity only increased. On top of that, the volume of data being generated grew tremendously.
Representing complex, dynamic, or highly interconnected relationships between entities is challenging within the relational data model. It requires multiple joins across tables, leading to complex queries and ultimately hurting performance. Moreover, the rigid schema of relational data models is ill-suited to capturing the irregularity of real-world entities and their interconnections.
Computer scientists looked for alternative ways to model entities and their relationships, and the resulting systems came to be grouped under the umbrella of NoSQL databases. These database management systems are broadly divided into five categories: column, document, key-value, graph, and time series, each suited to specific use cases.
Even in most NoSQL databases, such as key-value stores, document-oriented databases, or column-oriented databases, the emphasis is on storing individual, self-contained entities rather than capturing direct relationships between them. Although this design offers scalability and flexibility, it is inefficient at managing connected or graph-like data structures. If we try to capture relationships explicitly in these NoSQL databases through embedded references, we end up facing challenges similar to those of relational models when the data is complex and highly interconnected.
Graph Databases
We are interested here in graph databases, which aim to provide a natural representation of relationships among entities. With the help of graph databases, organizations can create a knowledge graph of their data with semantic context. This model goes beyond matching keywords to queries and provides an understanding of how real-world entities are interconnected. Organizations go on to create a metadata knowledge graph, which serves as a powerful tool for architects and data professionals to comprehend the intricate relationships and flow of data within an organization. Social network analysis is one of the organic ways to bring value to businesses and society. Creating a multi-dimensional metadata graph is a technically complex activity, but at the core of it lies the power of graph databases.
Graph databases represent data using nodes to signify entities (like people, products, places) and edges to denote relationships (many-to-many relationships) between these entities. Both nodes and edges can have associated properties or attributes that provide additional information. Popular DBs like Neo4j, GraphDB, FlockDB, InfiniteGraph, and others are designed to support graph storage and querying. These databases often offer a predefined library of graph algorithms, allowing users to perform operations and manipulations on the graph data. These algorithms can include pathfinding, centrality measures, clustering, and more, enabling powerful data manipulation and analysis within the graph.
Graphs at the core!
Real-world entities are interconnected; sometimes their relationships follow uniform rules and sometimes they are irregular. Graphs are a useful tool for understanding such diverse datasets. Whether in scientific projects or business analysis, graphs are used extensively for data analysis. When it comes to representing entities and their relationships on social networking platforms, graphs can be a better fit than relational databases.
A graph is an abstract mathematical representation of two or more entities somehow connected to each other. Nodes represent the objects or entities in picture and edges represent the relationships between them. The structure of how nodes and relationships are connected to each other makes a graph.
The figure above shows a sample graph model for representing entities of the Twitter platform, where nodes are drawn as circles and edges show the relationships between nodes. A point to note here is that different nodes have different labels. This is a popular form of graph model known as the labeled property graph. In this kind of model, both nodes and relationships can have their own properties, typically stored as key-value pairs. Node properties provide additional information about the entity the node represents, such as attributes, characteristics, or descriptive data. Similarly, relationship properties contain additional information about the relationship itself, such as timestamps, weights, or other relevant attributes. This structured representation allows for rich and flexible modeling of real-world scenarios where entities have attributes, relationships, and varying degrees of complexity in their interactions. The labeled property graph model is widely used in graph databases like Neo4j, OrientDB, and others, and it is simple, intuitive, and easy to understand. An online database management system that performs CRUD operations on graph data models is called a graph database management system (a graph database).
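To make the labeled property graph model concrete, here is a minimal sketch in Cypher, the query language introduced later in this article. The labels (User, Tweet), the relationship type (POSTED), and every property name are assumptions chosen for illustration; they are not taken from the figure.

```cypher
// Hypothetical labels, relationship type, and properties, for illustration only.
// Nodes carry a label plus key-value properties.
CREATE (u:User {handle: 'andy', joined: 2015})
CREATE (t:Tweet {text: 'Hello, graph!'})
// Relationships carry a type and can hold properties too (here a timestamp).
CREATE (u)-[:POSTED {at: timestamp()}]->(t);

// Read back node properties and the relationship property together.
MATCH (u:User)-[p:POSTED]->(t:Tweet)
RETURN u.handle, p.at, t.text;
```

The point of the sketch is that the relationship is a first-class record with its own data, rather than a foreign key or a join table.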
Graph Database Management System
Graph databases are often designed and optimized for online transaction processing (OLTP) systems, and they excel at handling the numerous small, frequent, and concurrent transactions typical of such systems. They can efficiently manage CRUD operations (i.e. creating, reading, updating, and deleting nodes and relationships) in scenarios with complex relationships between data points. Maintaining data consistency and integrity is crucial in OLTP systems such as e-commerce platforms, banking systems, and reservation systems. Many graph databases, Neo4j among them, support ACID (Atomicity, Consistency, Isolation, Durability) transactions, which ensures reliability and preserves the integrity of the data during transactions. Apart from that, graph databases are typically engineered for high operational availability, allowing systems to remain accessible during updates, maintenance, or failures. This is critical for OLTP systems that require continuous operation. However, we must note that graph databases are not limited to OLTP scenarios. They have proven versatile and valuable in other domains such as analytics, recommendation systems, fraud detection, and network analysis.
Graph databases such as Neo4j use native graph storage, which means the storage engine is specifically designed to store and manage graph structures efficiently. Amazon Neptune is likewise built on a purpose-built graph engine. Other systems, JanusGraph among them, map graph data onto general-purpose data stores such as Apache Cassandra or HBase, and some older graph databases sit on top of relational databases (like MySQL or PostgreSQL) or object-oriented databases. These non-native designs are generally not as well optimized for graph operations as databases with dedicated native graph storage engines.
From the user’s perspective, any database system that allows CRUD operations on graph data models, irrespective of the underlying implementation, falls under the umbrella of graph database management systems. However, the underlying implementation becomes a critical factor in judging the performance of these databases before using them in real-world solutions. In databases that use native graph processing, each node directly references its neighbouring nodes (index-free adjacency), so a traversal follows stored pointers from node to node instead of performing an index lookup at every hop. This design can significantly improve traversal performance, especially for highly connected data. However, we must not forget that every design comes with trade-offs.
Queries that don’t involve traversing relationships, such as global aggregations or complex filtering based on node properties, might not benefit directly from the index-free adjacency design. These queries might require additional computational effort or memory, because the direct connections between nodes do nothing to optimize them; they rely on indexes or scans instead.
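The contrast between the two kinds of query can be sketched in Cypher. This is an illustration under assumptions, not a benchmark: it assumes Person nodes connected by FOLLOWS relationships, like the example graph built later in this article, and how each query actually executes depends on the database and its query planner.

```cypher
// Traversal query: anchored at one node and walking outward along
// relationships, which is where index-free adjacency helps.
MATCH (andy:Person {name: 'Andy'})-[:FOLLOWS*1..2]->(reachable:Person)
RETURN DISTINCT reachable.name;

// Non-traversal query: a filter and aggregation over all Person nodes.
// No relationships are followed, so adjacency pointers offer no shortcut;
// the database has to use an index or scan the nodes.
MATCH (p:Person)
WHERE p.name STARTS WITH 'A'
RETURN count(p) AS matches;
```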
Capturing Relationships
As is evident from the social media example discussed above, real-world relationships between entities (typically many-to-many) are diverse and dynamic. In the social media domain, for example, connections might represent friendships, follows, likes, comments, or other types of interaction. Each type of relationship can have its own properties and behaviors. These relationships are not static either: over time a user may form new connections and break existing ones. Managing such dynamic relationships requires a flexible data model that can accommodate changes seamlessly.
There are more complex dimensions to capturing relationships in the social media domain. Two users may not be connected directly but may be part of a community group based on shared interests, geographical location, common activities, and more. Moreover, relationships between entities can carry diverse semantics. For example, two users from the same geographical area who interact and post together may have a closer relationship in real life than two users who live far apart and interact less. Social media platforms aim to capture such semantics and build users’ timelines from these insights. The schema-flexible nature of graph databases allows dynamic relationships to be represented without rigid schema definitions, making them well suited to scenarios where the structure and nature of relationships among entities are diverse and constantly changing.
Databases like Neo4j capture the roles of entities via labels. They even allow nodes to have multiple labels, which means a node can belong to several categories or have multiple roles simultaneously. For example, a node might have both a “User” label and an “Admin” label, indicating that it represents a user who also has administrative privileges. Labels help developers run efficient queries that find all nodes with a specific label or combination of labels. For example, a query for all nodes labelled “User” retrieves every node representing a user, regardless of the other roles those nodes might have.
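A minimal sketch of multi-label nodes in Cypher follows. The sample names are placeholders; the User and Admin labels come from the example in the text.

```cypher
// Placeholder data: one node with two labels, one node with a single label.
CREATE (:User:Admin {name: 'Andy'}),
       (:User {name: 'Jack'});

// Every node labelled User, regardless of any other labels it carries.
MATCH (u:User)
RETURN u.name;

// Only nodes that carry both labels.
MATCH (u:User:Admin)
RETURN u.name;
```

With only this data loaded, the first query returns both names and the second returns only Andy.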
Further, nodes can be declaratively indexed based on labels to quickly locate nodes that have specific labels. When a query involves filtering nodes based on labels, the index allows the database to directly access the subset of nodes belonging to those labels without needing to traverse the entire graph.
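As a sketch of declarative indexing, the statement below creates a property index scoped to a label. It assumes the index syntax of recent Neo4j versions (4.x and later); older versions use a different form, and the index name person_name is a made-up identifier.

```cypher
// Assumed syntax: Neo4j 4.x or later. 'person_name' is a hypothetical index name.
CREATE INDEX person_name IF NOT EXISTS
FOR (p:Person) ON (p.name);

// A lookup by label and property can now use the index to find its
// starting node instead of scanning every Person node.
MATCH (p:Person {name: 'Andy'})
RETURN p;
```

The index only helps the query find where to start; once an anchor node is found, the traversal itself proceeds along relationships.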
So far in our discussion, we can notice one significant aspect that has led to the adoption of graph databases in some real-world applications: “reducing semantic dissonance between our conceptualization of the world and the data model”. But to benefit from it, we must know how to model data as a graph.
Data Modelling in Graph Database
To do data modelling and perform CRUD operations on data, we need a query language. For our discussions, we shall be using Cypher, an expressive and declarative graph database query language. Cypher is used in the popular graph database Neo4j. A major benefit with a declarative language is we can declare the pattern that we would like to see retrieved, and then let the database worry about how to go about retrieving that data. It separates the concern of stating the problem from solving it, which also promotes better readability of the queries. Going to the example we had discussed earlier:
If we have to create the above relationship using Cypher, we will write something like this:
CREATE (andy:Person {name: 'Andy'})-[:FOLLOWS]->(jack:Person {name: 'Jack'}),
       (jack)-[:FOLLOWS]->(andy),
       (andy)-[:FOLLOWS]->(sam:Person {name: 'Sam'}),
       (sam)-[:FOLLOWS]->(jack),
       (jack)-[:FOLLOWS]->(sam)
The following is a breakdown of the syntax above:
- Node creation: (andy:Person {name: 'Andy'}) creates a node labeled Person with the property name set to 'Andy'. Here andy is a variable that refers to the node for the rest of the query; it is not stored in the database. The nodes for jack and sam are created in the same way.
- Relationship creation: ()-[:FOLLOWS]->() defines a directed relationship between two nodes, where the arrow (->) indicates the direction. The parentheses () stand for nodes, and [:FOLLOWS] gives the type of the relationship. In the example above, the relationships are [Andy follows Jack], [Jack follows Andy], [Andy follows Sam], [Sam follows Jack], and [Jack follows Sam].
- Multiple patterns in one query: commas separate the patterns inside a single CREATE clause, so all the nodes and relationships are created by one statement. For example:
(andy)-[:FOLLOWS]->(sam:Person {name: 'Sam'}), (sam)-[:FOLLOWS]->(jack)
- Reciprocal relationships:
(andy)-[:FOLLOWS]->(jack), (jack)-[:FOLLOWS]->(andy)
These two patterns create one relationship in each direction between andy and jack, forming a mutual “FOLLOWS” relationship between them.
- Node labels and properties:
:Person is a label assigned to a node. Labels help to categorize nodes and are useful for indexing and querying.
{name: 'Andy'} holds the properties assigned to the node. Properties store specific information about the node.
To query specific elements or patterns within the created graph, MATCH clause can be used, which is at the heart of most of the Cypher queries. Basic syntax structure for MATCH clause is as follows:
MATCH (pattern)
WHERE (conditions)
RETURN (what to retrieve)
- pattern: describes the structure to match in the graph, including nodes, relationships, and their directions. For example, (node)-[:RELATIONSHIP]->(otherNode) describes a pattern where node is related to otherNode via a RELATIONSHIP.
- WHERE (optional): filters the matched results using expressions or comparisons.
- RETURN: specifies the nodes, relationships, or properties to return as the query result.
NOTE: Multiple patterns can be specified in a single MATCH clause. Example:
MATCH (nodeA)-[:RELATIONSHIP]->(nodeB), (nodeC)-[:ANOTHER_RELATIONSHIP]->(nodeA)
RETURN nodeA, nodeB, nodeC
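Applied to the FOLLOWS graph created earlier, a two-pattern MATCH can find people who follow each other. This is a sketch, and the name comparison in WHERE is an added assumption used only to keep each mutual pair from being reported twice.

```cypher
// Two patterns in one MATCH: a follows b, and b follows a back.
MATCH (a:Person)-[:FOLLOWS]->(b:Person),
      (b)-[:FOLLOWS]->(a)
// Keep one ordering of each pair, e.g. (Andy, Jack) but not (Jack, Andy).
WHERE a.name < b.name
RETURN a.name, b.name;
```

Because both patterns share the variables a and b, they constrain each other: a row is produced only when both relationships exist between the same two nodes.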
Say, for example, we want to find nodes connected by a “FOLLOWS” relationship and retrieve the names of the nodes involved in that relationship. We can write the following query:
MATCH (a:Person)-[:FOLLOWS]->(b:Person)
RETURN a.name, b.name
The response is a table (result set) of name pairs. For the graph created by the CREATE statement above, it contains the following rows, in no guaranteed order:
a.name | b.name
Andy | Jack
Jack | Andy
Andy | Sam
Sam | Jack
Jack | Sam
The above query retrieves these pairs, showing who follows whom based on the established relationships. However, we must make a note of the fact that in Cypher, specifying the direction of a relationship in a MATCH pattern is not always mandatory, which gives flexibility in querying the graph data. This is useful when the direction of the relationship is not essential or when exploring bi-directional relationships. Consider an example:
MATCH (a:Person)-[r:RELATED_TO]-(b:Person)
RETURN a.name, r.relation, b.name
Here, the MATCH clause binds the relationship to the variable ‘r’ without specifying a direction, so Cypher matches a RELATED_TO relationship between nodes a and b whichever way it points. Each connected pair can therefore appear twice in the result, once in each orientation. If we want to search for a particular kind of relationship, we can modify the query as:
MATCH (a:Person)-[r:RELATED_TO]-(b:Person)
WHERE r.relation = "Brother"
RETURN a.name, b.name
This flexibility is particularly helpful when the direction of the relationship isn’t relevant. Note that in Neo4j every stored relationship still has a direction; what Cypher allows is ignoring that direction at query time. For example, in a social network platform where friendships are bidirectional (A is friends with B implies that B is friends with A), storing one relationship per direction would duplicate the data. Because queries can ignore direction, a single stored relationship is enough, which reduces redundancy and saves storage. A direction-agnostic approach also often leads to a cleaner and more straightforward data model: there is no need to manage and maintain separate relationships for each direction, so the schema is more intuitive and easier to understand. Queries become more concise and readable because they don’t need to account for the direction of relationships explicitly, which lets developers focus on the logical patterns they want to retrieve from the graph.
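A small sketch of the friendship case follows. The names Ann and Bob, the FRIENDS_WITH type, and the since property are all assumptions made up for this example.

```cypher
// Store the friendship once. Cypher requires a direction when creating
// a relationship, but here the chosen direction carries no meaning.
CREATE (ann:Person {name: 'Ann'})-[:FRIENDS_WITH {since: 2020}]->(bob:Person {name: 'Bob'});

// Reading without a direction finds Ann when starting from Bob...
MATCH (:Person {name: 'Bob'})-[:FRIENDS_WITH]-(friend:Person)
RETURN friend.name;

// ...and Bob when starting from Ann, using the same single relationship.
MATCH (:Person {name: 'Ann'})-[:FRIENDS_WITH]-(friend:Person)
RETURN friend.name;
```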
Cypher has other clauses similar to MATCH which can be used in querying and manipulating the data models. A concise list of clauses used in Cypher:
- WHERE: Filters pattern matching results based on specified conditions.
- CREATE: Creates nodes and relationships in the graph. (The older CREATE UNIQUE clause has been removed from current Neo4j versions in favor of MERGE.)
- MERGE: Checks for the existence of a pattern and either reuses existing elements or creates new ones.
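- DELETE: Removes nodes and relationships from the graph; DETACH DELETE removes a node together with its relationships. Properties and labels are removed with the REMOVE clause.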
- SET: Sets property values on nodes and relationships.
- FOREACH: Performs an updating action for each element in a list or collection.
- UNION: Merges results from multiple queries into a single result set.
- WITH: Chains subsequent parts of a query, allowing the forwarding of results from one part to the next, acting like a pipe in Unix.
- START: Specifies explicit starting points (deprecated and no longer available in current Neo4j versions; specify anchor points in a MATCH clause instead).
Most of the clauses look familiar from SQL, and it takes only a little intuition and imagination to work with data models using them. The more you practice, the better you get at it!
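A few of these clauses are sketched below against the FOLLOWS graph created earlier. The createdAt and city properties and the threshold of two followers are assumptions added for illustration.

```cypher
// MERGE: reuse the Person named 'Andy' if it exists, otherwise create it.
MERGE (andy:Person {name: 'Andy'})
ON CREATE SET andy.createdAt = timestamp()
// SET: add or overwrite a property (hypothetical 'city' property).
SET andy.city = 'Berlin';

// WITH: pipe an aggregated result into a later filter, much like
// GROUP BY followed by HAVING in SQL.
MATCH (follower:Person)-[:FOLLOWS]->(p:Person)
WITH p, count(follower) AS followers
WHERE followers >= 2
RETURN p.name, followers
ORDER BY followers DESC;

// DELETE: remove a single relationship while keeping both nodes.
MATCH (:Person {name: 'Andy'})-[r:FOLLOWS]->(:Person {name: 'Sam'})
DELETE r;
```

Using MERGE rather than CREATE for the first statement means that running it again does not produce a second Andy node.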
Summary
Based on the discussion above, graph databases are a good fit when the relationships between data elements are as important as the data elements themselves. They excel at managing relationships, allowing fast traversal and retrieval of connected data. Traversal-based queries (e.g., finding paths, analyzing networks) are optimized, which leads to quicker query response times. The schema is flexible, which allows the data model to be modified and adapted without major disruption. More importantly, graph data models reflect real-world scenarios more naturally, making it intuitive to represent and navigate complex relationships.
However, every decision comes with a cost, so we must be cautious in adopting graph database solutions. Storing relationships explicitly can result in higher storage requirements compared to certain other database models. Graph databases might also face scaling challenges when dealing with huge volumes of data and complex queries. If we are dealing with simple tabular data without intricate relationships, other database types such as relational databases might be more efficient. Understanding these benefits and trade-offs helps in making an informed decision when selecting the appropriate database technology.
#softwarearchitect #Graph #graphdatabases #architecture #softwaredevelopment #socialnetwork #connection