{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Knowledge Graphs for Complex Topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "**What is a knowledge graph**\n", "\n", "A knowledge graph, also known as a semantic network, represents a network of real-world entities—i.e. objects, events, situations, or concepts—and illustrates the relationship between them.\n", "\n", "A knowledge graph primarily consists of three elements: ``nodes``, ``edges``, and ``labels``. Nodes can represent any entity, be it an object, location, or individual. Edges establish the connection or relationship between these nodes. For instance, consider a node representing a popular author, \"J.K. Rowling\", and another node representing one of her books, \"Harry Potter\". The edge between these nodes could define the relationship as \"author of\", indicating that J.K. Rowling is the author of Harry Potter.\n", "\n", "**Knowledge graph applications**\n", "\n", "By using automated knowledge graphs, you can split hard topics into visually appealing and easy bits, making learning less scary and more helpful.\n", "\n", "some of the widely used examples are:\n", "- Search Engines: Knowledge graphs are used by search engines like Google to enhance search results with semantic-search information gathered from a wide variety of sources.\n", "- Recommendation Systems: They are used in recommendation systems to suggest products or services based on user's behavior and preferences.\n", "- Natural Language Processing: In NLP, knowledge graphs are used to understand and generate human language.\n", "- Data Integration: Knowledge graphs help in integrating data from different sources by understanding the relationship between them.\n", "- Artificial Intelligence and Machine Learning: They are used in AI and ML to provide context to data, which helps in better decision making." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup and Dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today, we're going to use the [`instructor`](https://github.com/jxnl/instructor) library to simplify the interaction between OpenAI and our code. Along with [Graphviz](https://graphviz.org) library to bring structure to our intricate subjects and have a graph visualization.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "!pip install instructor graphviz --quiet" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import instructor \n", "from openai import OpenAI\n", "\n", "client = instructor.patch(OpenAI())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install the Graphviz based on your operation system https://graphviz.org/download/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Node and Edge Classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We begin by modeling our knowledge graph with Node and Edge objects.\n", "\n", "Node objects represent key concepts or entities, while Edge objects signify the relationships between them." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from pydantic import BaseModel, Field\n", "from typing import List\n", "\n", "# The Node class represents key concepts or entities in our knowledge graph.\n", "# Each node has an id, a label, and a color.\n", "class Node(BaseModel):\n", " id: int\n", " label: str\n", " color: str\n", "\n", "# The Edge class signifies the relationships between nodes in our knowledge graph.\n", "# Each edge has a source node, a target node, a label, and a color.\n", "# By default, the color of an edge is set to \"black\".\n", "class Edge(BaseModel):\n", " source: int\n", " target: int\n", " label: str\n", " color: str = \"black\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### KnowledgeGraph Class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The KnowledgeGraph class integrates the nodes and edges, forming a comprehensive structure of our graph. It contains a list of nodes and a list of edges. Each node represents a key concept or entity, and each edge represents a relationship between two nodes.\n", "\n", "Later you'll notice that we model this class to be match the graphviz library's graph object.\n", "Making it easier to visualize our graph.\n", "\n", "The `visualize_knowledge_graph` function visualizes a knowledge graph. It accepts a `KnowledgeGraph` object as input, which includes nodes and edges. The function uses the `graphviz` library to create a directed graph (`Digraph`). Each node and edge from the `KnowledgeGraph` is added to the `Digraph` with their respective attributes (id, label, color). The graph is then rendered and displayed." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from graphviz import Digraph\n", "from IPython.display import display\n", "\n", "class KnowledgeGraph(BaseModel):\n", " nodes: List[Node] = Field(..., default_factory=list) # A list of nodes in the knowledge graph.\n", " edges: List[Edge] = Field(..., default_factory=list) # A list of edges in the knowledge graph.\n", "\n", "\n", " def visualize_knowledge_graph(self):\n", " dot = Digraph(comment=\"Knowledge Graph\")\n", "\n", " for node in self.nodes:\n", " dot.node(str(node.id), node.label, color=node.color)\n", " for edge in self.edges:\n", " dot.edge(str(edge.source), str(edge.target), label=edge.label, color=edge.color)\n", " \n", " return display(dot)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating the Knowledge Graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### generate_graph function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``generate_graph`` function uses OpenAI's model to create a KnowledgeGraph object from an input string.\n", "\n", "It requests the model to interpret the input as a detailed knowledge graph and uses the response to form the KnowledgeGraph object." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def generate_graph(input) -> KnowledgeGraph:\n", " return client.chat.completions.create(\n", " model=\"gpt-4-1106-preview\",\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": f\"Help me understand the following by describing it as a detailed knowledge graph: {input}\",\n", " }\n", " ],\n", " response_model=KnowledgeGraph,\n", " )" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "Neural Network\n", "\n", "\n", "\n", "2\n", "\n", "Input Layer\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "contains\n", "\n", "\n", "\n", "3\n", "\n", "Hidden Layers\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "contains\n", "\n", "\n", "\n", "4\n", "\n", "Output Layer\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "contains\n", "\n", "\n", "\n", "6\n", "\n", "Weights\n", "\n", "\n", "\n", "1->6\n", "\n", "\n", "uses\n", "\n", "\n", "\n", "7\n", "\n", "Bias\n", "\n", "\n", "\n", "1->7\n", "\n", "\n", "uses\n", "\n", "\n", "\n", "9\n", "\n", "Learning\n", "\n", "\n", "\n", "1->9\n", "\n", "\n", "performs\n", "\n", "\n", "\n", "5\n", "\n", "Neurons\n", "\n", "\n", "\n", "2->5\n", "\n", "\n", "composed of\n", "\n", "\n", "\n", "3->5\n", "\n", "\n", "composed of\n", "\n", "\n", "\n", "4->5\n", "\n", "\n", "composed of\n", "\n", "\n", "\n", "8\n", "\n", "Activation Function\n", "\n", "\n", "\n", "5->8\n", "\n", "\n", "applies\n", "\n", "\n", "\n", "9->6\n", "\n", "\n", "updates\n", "\n", "\n", "\n", "9->7\n", "\n", "\n", "updates\n", "\n", "\n", "\n", "10\n", "\n", "Backpropagation\n", "\n", "\n", "\n", "9->10\n", "\n", "\n", "involves\n", "\n", "\n", "\n", "11\n", "\n", "Loss Function\n", "\n", "\n", "\n", "10->11\n", "\n", "\n", "uses\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "generate_graph(\"Explain the concept of a neural network.\").visualize_knowledge_graph()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Iterative Graph Generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When dealing with extensive or segmented text inputs, processing them all at once might be challenging due to limitations in prompt length or the complexity of the content. In such scenarios, an iterative approach to building the knowledge graph proves beneficial. This method involves processing the text in smaller, manageable chunks, updating the graph with new information from each chunk." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are the benefits of this approach?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Scalability: This approach can handle large datasets by breaking them down into smaller, more manageable pieces.\n", "\n", "- Flexibility: It allows for dynamic updates to the graph, accommodating new information as it becomes available.\n", "\n", "- Efficiency: Processing smaller chunks of text can be more efficient and less prone to errors or omissions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What's different?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Previous example laid the foundation, while this new example will adds more complexity and functionality. The Node and Edge classes have been augmented with a __hash__ method, enabling these objects to be used in sets, thereby making it easier to handle duplicates." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "class Node(BaseModel):\n", " id: int\n", " label: str\n", " color: str\n", "\n", " def __hash__(self) -> int:\n", " return hash((id, self.label))\n", " \n", "class Edge(BaseModel):\n", " source: int\n", " target: int\n", " label: str\n", " color: str = \"black\"\n", "\n", " def __hash__(self) -> int:\n", " return hash((self.source, self.target, self.label))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KnowledgeGraph Class now have ``update`` and ``draw`` methods.\n", "\n", "The nodes and edges fields in the KnowledgeGraph class are now optional, providing more flexibility.\n", "\n", "``update``: This method allows for the combination and deduplication of two graphs.\n", "\n", "``draw``: includes a prefix parameter, facilitating the creation of different graph versions during iterations." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "from typing import Optional\n", "\n", "class KnowledgeGraph(BaseModel):\n", " # Optional list of nodes and edges in the knowledge graph\n", " nodes: Optional[List[Node]] = Field(..., default_factory=list)\n", " edges: Optional[List[Edge]] = Field(..., default_factory=list)\n", "\n", " def update(self, other: \"KnowledgeGraph\") -> \"KnowledgeGraph\":\n", " # This method updates the current graph with the other graph, deduplicating nodes and edges.\n", " return KnowledgeGraph(\n", " nodes=list(set(self.nodes + other.nodes)), # Combine and deduplicate nodes\n", " edges=list(set(self.edges + other.edges)), # Combine and deduplicate edges\n", " )\n", " \n", "\n", " def visualize_knowledge_graph(self):\n", " dot = Digraph(comment=\"Knowledge Graph\")\n", "\n", " for node in self.nodes:\n", " dot.node(str(node.id), node.label, color=node.color)\n", " for edge in self.edges:\n", " dot.edge(str(edge.source), str(edge.target), label=edge.label, color=edge.color)\n", " \n", " return display(dot)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate itrative graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new ``generate_graph`` function is designed to handle a list of inputs iteratively, updating the graph with each new piece of information.\n", "\n", "If you look carefully it looks liek a very common pattern in programming, a reduce, or fold function. A simple example could be iterating over a list of find the sum of all the elements squared.\n", "\n", "```python\n", "cur_state = 0\n", "for i in [1, 2, 3, 4, 5]:\n", " c += i**2\n", "print(c)\n", "```" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def generate_graph(input: List[str]) -> KnowledgeGraph:\n", " # Initialize an empty KnowledgeGraph\n", " cur_state = KnowledgeGraph()\n", "\n", " # Iterate over the input list\n", " for i, inp in enumerate(input):\n", " new_updates = client.chat.completions.create(\n", " model=\"gpt-3.5-turbo-16k\",\n", " messages=[\n", " {\n", " \"role\": \"system\",\n", " \"content\": \"\"\"You are an iterative knowledge graph builder.\n", " You are given the current state of the graph, and you must append the nodes and edges \n", " to it Do not procide any duplcates and try to reuse nodes as much as possible.\"\"\",\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": f\"\"\"Extract any new nodes and edges from the following:\n", " # Part {i}/{len(input)} of the input:\n", "\n", " {inp}\"\"\",\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": f\"\"\"Here is the current state of the graph:\n", " {cur_state.model_dump_json(indent=2)}\"\"\",\n", " },\n", " ],\n", " response_model=KnowledgeGraph,\n", " ) # type: ignore\n", "\n", " # Update the current state with the new updates\n", " cur_state = cur_state.update(new_updates)\n", "\n", " # Draw the current state of the graph\n", " cur_state.visualize_knowledge_graph() \n", " \n", " # Return the final state of the KnowledgeGraph\n", " return cur_state\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples Use Case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this approach, we process the text in manageable chunks, one at a time.\n", "\n", "This method is particularly beneficial when dealing with extensive text that may not fit into a single prompt.\n", "\n", "It is especially useful in scenarios such as constructing a knowledge graph for a complex topic, where the information is distributed across multiple documents or sections." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "Jason\n", "\n", "\n", "\n", "3\n", "\n", "physicist\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "is a\n", "\n", "\n", "\n", "2\n", "\n", "quantum mechanics\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "knows\n", "\n", "\n", "\n", "4\n", "\n", "professor\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "is a\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "Jason\n", "\n", "\n", "\n", "3\n", "\n", "physicist\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "is a\n", "\n", "\n", "\n", "2\n", "\n", "quantum mechanics\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "knows\n", "\n", "\n", "\n", "4\n", "\n", "professor\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "is a\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "Jason\n", "\n", "\n", "\n", "3\n", "\n", "physicist\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "is a\n", "\n", "\n", "\n", "2\n", "\n", "quantum mechanics\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "knows\n", "\n", "\n", "\n", "4\n", "\n", "professor\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "is a\n", "\n", "\n", "\n", "5\n", "\n", "Sarah\n", "\n", "\n", "\n", "5->1\n", "\n", "\n", "knows\n", "\n", "\n", "\n", "6\n", "\n", "student\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "is a student of\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "7\n", "\n", "University of Toronto\n", "\n", "\n", "\n", "8\n", "\n", "Canada\n", "\n", "\n", "\n", "7->8\n", "\n", "\n", "is in\n", "\n", "\n", "\n", "1\n", "\n", "Jason\n", "\n", "\n", "\n", "3\n", "\n", "physicist\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "is a\n", "\n", "\n", "\n", "2\n", "\n", "quantum mechanics\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "knows\n", "\n", "\n", "\n", "4\n", "\n", "professor\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "is a\n", "\n", "\n", "\n", "5\n", "\n", "Sarah\n", "\n", "\n", "\n", "5->7\n", "\n", "\n", "is a student at\n", "\n", "\n", "\n", "5->1\n", "\n", "\n", "knows\n", "\n", "\n", "\n", "6\n", "\n", "student\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "is a student of\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "text_chunks = [\n", " \"Jason knows a lot about quantum mechanics. He is a physicist. He is a professor\",\n", " \"Professors are smart.\",\n", " \"Sarah knows Jason and is a student of his.\",\n", " \"Sarah is a student at the University of Toronto. and UofT is in Canada.\",\n", "]\n", "\n", "graph: KnowledgeGraph = generate_graph(text_chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial shows how to generate and visualize a knowledge graph for complex topics. It also demonstrates how to extract graphic knowledge from the language model or provided text. The tutorial highlights the iterative process of building the knowledge graph by processing text in smaller chunks and updating the graph with new information.\n", "\n", "Using this approach, we can extract various things, including:\n", "\n", "1) People and their relationships in a story.\n", "\n", "```python\n", "class People(BaseModel):\n", " id: str\n", " name: str\n", " description: str\n", "\n", "class Relationship(BaseModel):\n", " id: str\n", " source: str\n", " target: str\n", " label: str\n", " description: str\n", "\n", "class Story(BaseModel):\n", " people: List[People]\n", " relationships: List[Relationship]\n", "```\n", "\n", "2) Task dependencies and action items from a transcript.\n", "\n", "```python\n", "class Task(BaseModel):\n", " id: str\n", " name: str\n", " description: str\n", "\n", "class Participant(BaseModel):\n", " id: str\n", " name: str\n", " description: str\n", "\n", "class Assignment(BaseModel):\n", " id: str\n", " source: str\n", " target: str\n", " label: str\n", " description: str\n", "\n", "class Transcript(BaseModel):\n", " tasks: List[Task]\n", " participants: List[Participant]\n", " assignments: List[Assignment]\n", "```\n", "\n", "3) Key concepts and their relationships from a research paper.\n", "4) Entities and their relationships from a news article.\n", "\n", "As an excercise, try to implement one of the above examples.\n", "\n", "All of them will follow an idea of iteratively extracting more and more information and accumulating it some state." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }