Gen-AI (VilimiAI) RAG with Knowledge Graph

Regupathi Thankaraj
9 min read · Jan 8, 2024


Build a corpus & medical knowledge graph by linking entities, extracting relationships, and incorporating external databases for enrichment.


A Knowledge Graph (KG), or any graph, comprises nodes and edges, where each node signifies a concept and each edge represents a relationship between a pair of these concepts. This article introduces a technique for transforming any text corpus into a Graph of Knowledge (GK), a term used interchangeably with KG in this demonstration to better convey the concept.

Knowledge Graph

A knowledge graph encapsulates the essence of relationships between two entities. Within this structure, nodes symbolize entities such as people, places, or events, while edges signify the connections between these entities. What sets knowledge graphs apart is their incorporation of a third element, commonly known as a predicate or edge label, which articulates the nature of the relationship.

A knowledge graph, like a smart network, shows how real-world things are connected. It is stored in a graph database and visualized as a graph structure, forming what we call a “knowledge graph.” Users can then chat with the graph data, much like a real-time chatbot conversation.

Knowledge Graphs serve various purposes. By applying graph algorithms, we can compute centralities for any node, offering insights into the significance of a concept within a body of work. Analyzing connected and disconnected sets of concepts, or determining communities of concepts, provides a thorough understanding of the subject matter. Knowledge Graphs enable us to uncover links between seemingly unrelated concepts.
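As a small illustration of the analyses mentioned above, the sketch below computes degree centrality (node importance) and connected components (disconnected sets of concepts) with only the standard library, on an invented concept graph:

```python
# Sketch: degree centrality and connected components on a toy concept
# graph.  The edges are illustrative, not real extraction output.
from collections import defaultdict, deque

edges = [("aspirin", "headache"), ("aspirin", "fever"),
         ("stock", "market")]  # two disconnected concept groups

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Degree centrality: fraction of the other nodes each node touches.
n = len(adj)
centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

# Connected components via breadth-first search.
def components(adj):
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print(centrality["aspirin"], len(components(adj)))
```

In a real application the same computations would run over the extracted KG, typically via a graph library or the graph database's own algorithms.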

Additionally, Knowledge Graphs can be leveraged for Graph Retrieval Augmented Generation (GRAG or GAG) and facilitate conversational interactions with documents. This approach often yields superior results compared to the conventional version of RAG, which has inherent limitations. For instance, relying on a simple semantic similarity search for context retrieval may not always be effective, especially when queries lack sufficient context or when relevant information is scattered across a vast text corpus.

High Level RAG Architecture

  1. Populate a vector database with encoded documents.
  2. Transform the query into a vector using a sentence transformer.
  3. Retrieve pertinent context from the vector database based on the input query.
  4. Use both the query and the retrieved context to prompt the LLM.
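The four steps above can be sketched as follows. This toy version uses a bag-of-words counter in place of a real sentence transformer and an in-memory list as the vector database; every helper name here is illustrative, not a real API.

```python
# Minimal RAG sketch: encode documents, encode the query, retrieve the
# closest context, and assemble the LLM prompt.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'encoder': bag-of-words counts stand in for a dense vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Populate a "vector database" with encoded documents.
docs = [
    "aspirin treats headache and reduces fever",
    "insulin regulates blood sugar in diabetes",
]
index = [(d, embed(d)) for d in docs]

# 2. Encode the query.  3. Retrieve the closest context.
query = "what drug treats a headache"
q_vec = embed(query)
context = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]

# 4. Combine query and context into a prompt for the LLM.
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
print(context)
```

A production pipeline would swap `embed` for a real sentence-transformer model and the list for an actual vector store.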

Limitations of RAG

A key drawback of RAG is its difficulty in providing precise responses to intricate and nuanced queries. This limitation stems from several factors:

  • Understanding User Intent: RAG systems may face difficulty in fully grasping the precise intent behind a user’s query, a crucial aspect for delivering accurate information to the LLM.
  • Dependency on Vector Embeddings: RAG heavily relies on vector embeddings to interpret and match queries with relevant information. While these embeddings are potent, they are not foolproof and can sometimes result in inaccuracies or oversimplifications in understanding the context of the query.
  • Black Box Nature: The process of generating and comparing vector embeddings is intricate and often not transparent. Given the potential for embeddings to have numerous dimensions, deciphering their structure and understanding their impact on similarity scores in semantic search poses a challenge.
  • Generic Training Data: Embedding models typically undergo training on generic datasets, potentially missing the specific nuances or contexts essential for certain queries. This can result in drawing superficial similarities between different content pieces.

Types of Knowledge Graphs

  • Encyclopedic KGs: This common type captures general knowledge by consolidating information from diverse sources such as encyclopedias, databases, and expert insights. For instance, Wikidata compiles extensive knowledge from Wikipedia articles, resulting in vast and diverse KGs with millions of entities and relations across multiple languages.
  • Common-sense KGs: Focused on everyday knowledge, these KGs encompass information about objects, events, and their relationships. They contribute to understanding fundamental, often implicit, knowledge we use in our daily lives. ConceptNet, for example, includes common-sense concepts and relationships, aiding computers in a more natural grasp of human language.
  • Domain-Specific KGs: Tailored to specific fields like medicine, finance, or biology, these KGs are smaller but highly precise and dependable. UMLS in the medical domain, for instance, contains detailed biomedical concepts and relationships, catering to specialized knowledge needs.
  • Multi-Modal KGs: Going beyond text, these KGs incorporate images, sounds, and videos, serving purposes such as image-text matching or visual question answering. Examples like IMGpedia and MMKG seamlessly blend textual and visual information for comprehensive knowledge representation.

Use cases in Search Engines

In the realm of search engines, KGs are pivotal in elevating search precision and relevance. By comprehending the relationships and context embedded within KGs, search engines transcend mere keyword matching, delving into the intent and profound meaning behind user queries. This evolution results in search outcomes that are not only more intuitive but also attuned to the context, fundamentally transforming the way we access information online.

Business Architecture for this Application

Data is sourced from various channels, encompassing unstructured data, flat files, and structured data with XML or JSON databases, traditional SQL databases, and more. This diverse data undergoes processing through multiple systems to extract entities and relationships, essential components of a Knowledge Graph. Instead of conventional methods like ETL, there’s a shift towards leveraging generative AI. This advanced approach not only automates the extraction of entities and relationships but also generates queries in Neo4j’s Cypher language. The result is an automatic integration of these elements into the Neo4j database, representing the left side of the diagram.

On the opposite side of the spectrum, customers interact with the Knowledge Graph generated with generative AI. This is achieved through web applications featuring text interfaces, enabling users to pose queries. Generative AI comes into play by transforming these questions into Cypher, the graph database query language. The query is executed against the database, the result is obtained, and then it undergoes another round of generative AI processing to convert it back into natural language.
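That round trip (question to Cypher, Cypher to result, result back to natural language) can be sketched as below. Both generative steps and the database call are stubbed with simple functions, since wiring in a real LLM and a Neo4j driver is beyond this demonstration.

```python
# Sketch: natural-language question -> Cypher -> result -> answer.
# All three stages are hypothetical stubs standing in for LLM calls
# and a Neo4j session.

def question_to_cypher(question: str) -> str:
    """Stub for 'generative AI turns the question into Cypher'."""
    if "treats" in question:
        return ("MATCH (d:Drug)-[:TREATS]->(c:Condition {name: 'headache'}) "
                "RETURN d.name")
    raise ValueError("unsupported question in this sketch")

def execute_cypher(cypher: str) -> list:
    """Stub for the Neo4j call; returns canned rows for the demo query."""
    return ["aspirin", "ibuprofen"]

def result_to_answer(question: str, rows: list) -> str:
    """Stub for the second generative pass back to natural language."""
    return f"Drugs that match: {', '.join(rows)}."

question = "Which drug treats headache?"
cypher = question_to_cypher(question)
answer = result_to_answer(question, execute_cypher(cypher))
print(answer)
```

In the real architecture, `question_to_cypher` and `result_to_answer` would each be a prompt to the generative model, and `execute_cypher` would run the generated query through a Neo4j session.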

In the middle layer, the graph database generates the schema from the corpus data as a Graph of Concepts, using nodes and edges. When you connect those, you can see the node-and-edge relationships as shown below.

Build Knowledge Graph

There are four steps involved, as listed below, though these will vary based on business needs and use-case scenarios.

  1. Identify and capture concepts and entities from the content. These elements represent the nodes in the system.
  2. Uncover relationships between the identified concepts, forming the edges of the structure.
  3. Populate a graph data structure or a graph database with the identified nodes (concepts) and edges (relations).
  4. Visualize the constructed graph for both analytical insights and potential artistic enjoyment.
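The first three steps above can be sketched with plain dictionaries as an in-memory graph. The triples are invented stand-ins for real extraction output; a production build would target a graph database such as Neo4j instead.

```python
# Sketch: populate an in-memory graph from extracted (subject,
# relation, object) triples.
from collections import defaultdict

# 1-2. Concepts (nodes) and relations (edges) from the corpus;
#      illustrative triples, not real extraction output.
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "reduces", "fever"),
    ("fever", "symptom_of", "infection"),
]

# 3. Populate a graph structure: adjacency map keyed by source node.
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
print(sorted(nodes))
```

Step 4, visualization, is covered later in this article; the same node and edge lists feed directly into a visualization library.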

The corpus data-flow diagram is given below; this flow will vary based on which database model you are using. For example, if you are using a graph DB and a data-science DB, the data will be stored in a backend system. If you are using an in-memory placeholder, you can use Pandas DataFrames, etc.

In the initial phase, the text corpus undergoes segmentation, with each segment assigned a unique chunk_id. Following this, a Large Language Model (LLM) is employed to extract concepts and their semantic relationships from each text chunk, assigning a weight of W1 to these relationships. It’s important to note that multiple relationships may exist between the same pair of concepts.

Subsequently, contextual proximity within the same text chunk is considered, establishing an additional relationship with a weight of W2 between concepts. This recognition extends to instances where the same concept pair appears in different chunks. To streamline the data, similar pairs are grouped, their weights are summed, and their relationships are concatenated. The outcome is a consolidated representation featuring a single edge for each distinct pair of concepts, complete with a specific weight and a list of relations as its identifier.
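The weighting and merging scheme just described can be sketched as follows. The extracted relations and the values of W1 and W2 are hard-coded stand-ins for real LLM output and tuned weights.

```python
# Sketch: semantic relations from the LLM get weight W1, co-occurrence
# within a chunk gets weight W2, and duplicate concept pairs are merged
# by summing weights and concatenating relations.
from collections import defaultdict
from itertools import combinations

W1, W2 = 4, 1  # illustrative weights

# (chunk_id, concept_a, concept_b, relation) as an LLM might return them
llm_edges = [
    ("c1", "aspirin", "headache", "treats"),
    ("c2", "aspirin", "headache", "relieves"),
    ("c2", "aspirin", "fever", "reduces"),
]

merged = defaultdict(lambda: {"weight": 0, "relations": []})

# Semantic relations: weight W1 each.
for _, a, b, rel in llm_edges:
    key = tuple(sorted((a, b)))
    merged[key]["weight"] += W1
    merged[key]["relations"].append(rel)

# Contextual proximity: every concept pair sharing a chunk gets W2.
by_chunk = defaultdict(set)
for chunk, a, b, _ in llm_edges:
    by_chunk[chunk].update((a, b))
for concepts in by_chunk.values():
    for a, b in combinations(sorted(concepts), 2):
        merged[(a, b)]["weight"] += W2
        merged[(a, b)]["relations"].append("contextual proximity")

print(merged[("aspirin", "headache")])
```

The result is one edge per distinct concept pair, carrying a summed weight and the concatenated list of relations, exactly the consolidated form described above.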

Now let’s run this GenAI model over every text chunk of the input data frame and convert the JSON into a Pandas DataFrame; here is what it looks like.

Each row in this representation signifies a relationship between a pair of concepts, serving as an edge connecting two nodes in our graph. Multiple edges or relationships may exist between the same pair of concepts. The count in the provided data frame serves as the weight, arbitrarily set to 4.

Integrating KG with LLM-RAG

The integration of Knowledge Graphs (KGs) with Large Language Models (LLMs) holds the promise of significantly enhancing the Retrieval Augmented Generation (RAG) process, resulting in improved knowledge representation and reasoning. This collaborative approach facilitates dynamic knowledge fusion, ensuring that real-world knowledge stays current and distinct from the text space. Consequently, the information provided during inference remains up-to-date and relevant.

Dynamic Knowledge Fusion

Consider a Knowledge Graph (KG) as a dynamic database accessible to Large Language Models (LLMs) for querying the latest and pertinent information. This approach proves highly effective in tasks such as question answering, where staying current is essential. The integration of this knowledge with LLMs is accomplished through advanced architectures, fostering a profound interaction between the text tokens and KG entities. This enriches the LLM’s responses with structured, factual data, elevating the quality of generated information.

KG Enhanced RAG

Elevating RAG techniques with Knowledge Graphs (KGs) involves searching for relevant facts within the KGs and presenting them as contextual information to the LLMs. This method empowers the generation of precise, diverse, and factual content. For instance, when an LLM is tasked with producing a response about a recent event, it can initially consult the KG for the latest facts before formulating its reply.
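The lookup-then-generate flow described above can be sketched as below. The KG and the entity matcher are toy stand-ins; a real system would query the graph database and use a proper entity linker.

```python
# Sketch of KG-enhanced RAG: look up facts for the entities mentioned
# in a question and prepend them as context before prompting the LLM.

kg_facts = {  # toy KG: entity -> known facts
    "aspirin": ["aspirin TREATS headache", "aspirin REDUCES fever"],
    "insulin": ["insulin REGULATES blood_sugar"],
}

def retrieve_facts(question: str) -> list:
    """Naive entity match: any KG entity mentioned in the question."""
    return [f for ent, facts in kg_facts.items()
            if ent in question.lower() for f in facts]

question = "What does aspirin do?"
facts = retrieve_facts(question)
prompt = "Facts:\n" + "\n".join(facts) + f"\nQuestion: {question}\nAnswer:"
print(prompt)
```

The assembled prompt grounds the LLM's reply in the KG's current facts, which is what keeps the generated answer up to date for recent events.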

Additionally, LLMs prove instrumental in crafting high-quality texts that accurately describe KG information. This holds immense potential for generating authentic narratives, dialogues, and stories. Whether by harnessing knowledge from LLMs or constructing extensive KG-text corpora, this process significantly enhances KG-to-text generation, particularly in scenarios with limited training data.

Reasoning with LLMs and KGs

The synergistic impact of LLMs and KGs becomes prominently clear in reasoning tasks. Employing LLMs to interpret textual questions and facilitate reasoning on KGs establishes a connection between textual and structural information, enhancing interpretability and reasoning capabilities. This cohesive approach finds applications across various domains, ranging from personalized recommendations in dialogue systems to strengthening task-specific training procedures through the incorporation of domain knowledge graphs.

Graph Visualization

The visualization stage adds an enjoyable dimension to this exercise, offering a unique artistic satisfaction. We’ve already determined the edge weights to influence their thickness, assigned colors to node communities, and established node degrees to determine their sizes.
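The visual mapping just described (edge weight to line thickness, node degree to node size) reduces to a small computation. The scaling factors below are arbitrary choices for illustration; a library such as pyvis would consume the resulting values.

```python
# Sketch: derive node sizes from degree and edge widths from weight.
from collections import Counter

edges = [  # (node_a, node_b, weight) from the merged edge list
    ("aspirin", "headache", 10),
    ("aspirin", "fever", 5),
    ("fever", "infection", 4),
]

degree = Counter()
for a, b, _ in edges:
    degree[a] += 1
    degree[b] += 1

node_sizes = {n: 10 + 5 * d for n, d in degree.items()}  # size grows with degree
edge_widths = [(a, b, 0.5 * w) for a, b, w in edges]     # width grows with weight
print(node_sizes["aspirin"], edge_widths[0][2])
```

Community colors would be assigned the same way, by mapping each node's community label to a palette entry before handing the graph to the renderer.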

live prototype

Explore the flexibility of zooming in and out, manipulating nodes and edges, and adjusting the graph physics using the slider panel at the bottom of the page. Witness how this dynamic graph facilitates the formulation of insightful questions and enhances comprehension of the subject matter!

Conclusion

Knowledge graphs prove highly effective when a blend of structured and unstructured data is needed to fuel RAG applications. This blog post has guided you through constructing a knowledge graph in a graph DB using GenAI functions on any text corpus, medical or otherwise. The neatly structured outputs from GenAI model functions make them an ideal choice for extracting organized information. For an optimal experience with LLMs in graph construction, define the graph schema in detail and incorporate an entity-disambiguation step after extraction. We hope this RAG KG approach supports the development of Graph Augmented Retrieval, contributing to the improvement of the overall RAG pipeline.

Incorporating Knowledge Graphs (KGs) into Retrieval Augmented Generation (RAG) systems presents significant potential. By harnessing the structured and interconnected data within KGs, we can substantially elevate the reasoning capabilities of existing RAG systems. This potent fusion holds the promise of mitigating the limitations inherent in current RAG pipelines, delivering responses that are more accurate, context-aware, and nuanced.

KGs function as a robust reservoir of information accessible to LLMs, enabling them to not only retrieve facts but also comprehend the relationships and underlying contexts associated with those facts. This heightened level of understanding is vital for the advancement of AI systems capable of more effective interactions with users, offering information that is not only pertinent but also profoundly insightful.

Please connect through 👨🏾‍💻 LinkedIn for further development.
