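Implementing Graph RAG with a knowledge graph

Graph retrieval-augmented generation (Graph RAG) is emerging as a powerful technique for generative AI applications that need to draw on domain-specific knowledge and related information. Graph RAG is an alternative to vector search methods, which rely on a vector database.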

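A knowledge graph is a knowledge system in which a graph database, such as Neo4j or Amazon Neptune, represents structured data. In a knowledge graph, the relationships between data points, called edges, are as meaningful as the data points themselves, called vertices or sometimes nodes. A knowledge graph makes it easy to traverse a network and to answer complex queries about connected data. Knowledge graphs are especially well suited to use cases involving chatbots, identity resolution, network analysis, recommendation engines, customer 360, and fraud detection.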
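To make the vertex and edge vocabulary concrete, here is a minimal sketch of a graph held in plain Python structures, with a breadth-first traversal over it. The names and relationship types are hypothetical and chosen only for illustration; a graph database stores and indexes this kind of data far more efficiently.

```python
from collections import defaultdict, deque

# Hypothetical edges for illustration: (source vertex, relationship, target vertex).
EDGES = [
    ("Ana", "WORKS_WITH", "Ben"),
    ("Ben", "WORKS_WITH", "Cho"),
    ("Cho", "MEMBER_OF", "Research"),
]

def build_adjacency(edges):
    """Index outgoing edges by their source vertex."""
    adjacency = defaultdict(list)
    for source, relationship, target in edges:
        adjacency[source].append((relationship, target))
    return adjacency

def reachable(adjacency, start):
    """Breadth-first traversal: every vertex reachable from start, in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for _relationship, neighbor in adjacency.get(vertex, []):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen

adjacency = build_adjacency(EDGES)
print(reachable(adjacency, "Ana"))  # ['Ana', 'Ben', 'Cho', 'Research']
```

The traversal follows edges rather than joining tables, which is the property that makes multi-hop questions cheap to express in a graph.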

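The Graph RAG approach uses the structured nature of a graph database to give greater depth and context to retrieved information about networks or complex relationships. When a graph database is paired with a large language model (LLM), a developer can automate significant parts of the graph creation process from unstructured data such as text. An LLM can process text data, identify entities, understand their relationships, and represent them in a graph structure.

There are many ways to create a Graph RAG application, for instance Microsoft's GraphRAG, or pairing GPT-4 with LlamaIndex. In this tutorial, you use Memgraph, an open source graph database solution, to create a RAG system with Meta's Llama 3 on watsonx. Memgraph uses Cypher, a declarative query language. It shares some similarities with SQL but focuses on nodes and relationships rather than tables and rows. You have Llama 3 both create and populate the graph database from unstructured text and query information in the database.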
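The contrast between SQL and Cypher can be sketched by asking the same question both ways. The example below runs the SQL version against an in-memory SQLite database and shows the Cypher version as a string only, since running it needs a graph database. The table layout, labels, and names are assumptions made up for this comparison.

```python
import sqlite3

# Hypothetical relational layout: one table per entity plus a join table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id TEXT PRIMARY KEY);
    CREATE TABLE team (id TEXT PRIMARY KEY);
    CREATE TABLE membership (person_id TEXT, team_id TEXT);
    INSERT INTO person VALUES ('Ana'), ('Ben');
    INSERT INTO team VALUES ('Research');
    INSERT INTO membership VALUES ('Ana', 'Research');
""")

# SQL spells out how to join tables to recover the relationship.
sql = """
    SELECT t.id
    FROM person AS p
    JOIN membership AS m ON m.person_id = p.id
    JOIN team AS t ON t.id = m.team_id
    WHERE p.id = ?
"""
teams = [row[0] for row in conn.execute(sql, ("Ana",))]
print(teams)  # ['Research']

# Cypher describes the pattern to match: node, relationship, node.
cypher = "MATCH (p:Person {id: 'Ana'})-[:MEMBER_OF]->(t:Team) RETURN t.id"
print(cypher)
```

In the relational version the relationship lives in a join table; in the graph version it is a first-class edge that the query names directly.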

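Step 1

While you can choose from several tools, this tutorial walks you through setting up an IBM account to use a Jupyter Notebook.

Log in to watsonx.ai with your IBM® Cloud account.

Create a watsonx.ai project.

You can get your project ID from within your project. Click the "Manage" tab. Then copy the project ID from the "Details" section of the "General" page. You need this ID for this tutorial.

Next, associate your project with the watsonx.ai Runtime.

a. Create a watsonx.ai Runtime service instance (choose the Lite plan, which is a free instance).

b. Generate an API key in watsonx.ai Runtime. Save this API key for use later in this tutorial.

c. Go to your project and select the "Manage" tab.

d. In the left tab, select "Services and integrations".

e. Select IBM services.

f. Select Associate service and pick watsonx.ai Runtime.

g. Associate the watsonx.ai Runtime with the project that you created in watsonx.ai.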

Step 2

Now you need to install Docker.

After you install Docker, install Memgraph by using its Docker container. On macOS or Linux, run the following command in a terminal:

curl https://install.memgraph.com | sh

On a Windows computer, use:

iwr https://windows.memgraph.com | iex

Follow the installation steps to get the Memgraph engine and Memgraph Lab up and running.

On your computer, create a fresh virtualenv for this project:

virtualenv kg_rag --python=python3.12

In the Python environment for your notebook, install the following Python libraries:

./kg_rag/bin/pip install langchain langchain-openai langchain_experimental langchain-community==0.3.15 neo4j langchain_ibm jupyterlab json-repair getpass4

Now you're ready to connect to Memgraph.
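Before connecting from the notebook, it can help to confirm that something is listening on the Bolt port. The sketch below uses only the standard library and assumes the default address that this tutorial uses later, localhost on port 7687. It checks TCP reachability only; it does not speak the Bolt protocol or verify credentials.

```python
import socket

def bolt_port_open(host="localhost", port=7687, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

if bolt_port_open():
    print("Something is listening on localhost:7687")
else:
    print("Nothing is listening on localhost:7687; is the Memgraph container running?")
```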

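Step 3

If you configured Memgraph with a username and password, set them here; otherwise, the defaults of no username and no password apply. That is not good practice for a production database, but it is not a problem for a local development environment that stores no sensitive data.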

import os
from langchain_community.chains.graph_qa.memgraph import MemgraphQAChain
from langchain_community.graphs import MemgraphGraph

url = os.environ.get("MEMGRAPH_URI", "bolt://localhost:7687")
username = os.environ.get("MEMGRAPH_USERNAME", "")
password = os.environ.get("MEMGRAPH_PASSWORD", "")

# Initialize the Memgraph connection
graph = MemgraphGraph(
    url=url, username=username, password=password, refresh_schema=True
)

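Now create a sample string that describes a dataset of relationships, which you can use to test the graph-generating capabilities of your LLM system. You could use more complex data sources, but this simple example helps demonstrate the approach.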

graph_text = """
John's title is Director of the Digital Marketing Group.
John works with Jane whose title is Chief Marketing Officer.
Jane works in the Executive Group.
Jane works with Sharon whose title is the Director of Client Outreach.
Sharon works in the Sales Group.
"""

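Enter the watsonx API key and the project ID from the first step:

from getpass import getpass

# Prompt without echoing the values to the notebook output
watsonx_api_key = getpass("watsonx API key: ")
os.environ["WATSONX_APIKEY"] = watsonx_api_key

watsonx_project_id = getpass("watsonx project ID: ")
os.environ["WATSONX_PROJECT_ID"] = watsonx_project_id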
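If you rerun the notebook often, a small helper can skip the prompt when the value is already in the environment. This is a sketch; `read_secret` is a hypothetical helper name, not part of any library used in this tutorial.

```python
import os
from getpass import getpass

def read_secret(name, prompt=getpass):
    """Return os.environ[name] if set; otherwise prompt once and cache it there."""
    value = os.environ.get(name)
    if not value:
        value = prompt(f"{name}: ")
        os.environ[name] = value
    return value

# Demonstration with a stand-in prompt so the snippet runs without blocking on input.
os.environ.pop("DEMO_SECRET", None)
first = read_secret("DEMO_SECRET", prompt=lambda _message: "typed-value")
second = read_secret("DEMO_SECRET", prompt=lambda _message: "ignored")
print(first, second)  # typed-value typed-value
```

In the notebook you would call `read_secret("WATSONX_APIKEY")` with the default `getpass` prompt.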

Now configure a WatsonxLLM instance to generate text. Keep the temperature fairly low and the maximum number of new tokens high, so that the model generates as much detail as possible without inventing entities or relationships that are not present in the text.

from langchain_ibm import WatsonxLLM
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames

graph_gen_parameters = {
    GenTextParamsMetaNames.DECODING_METHOD: "sample",
    GenTextParamsMetaNames.MAX_NEW_TOKENS: 1000,
    GenTextParamsMetaNames.MIN_NEW_TOKENS: 1,
    GenTextParamsMetaNames.TEMPERATURE: 0.3,
    GenTextParamsMetaNames.TOP_K: 10,
    GenTextParamsMetaNames.TOP_P: 0.8
}
watsonx_llm = WatsonxLLM(
    model_id="meta-llama/llama-3-3-70b-instruct",
    url="https://us-south.ml.cloud.ibm.com",
    project_id=os.getenv("WATSONX_PROJECT_ID"),
    params=graph_gen_parameters,
)
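To see why a low temperature discourages invented entities, the sketch below applies the textbook definitions of temperature scaling and top-k filtering to a handful of made-up token scores. It illustrates the general idea only; how watsonx applies these parameters internally is not described here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; the rest get probability 0."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_total = sum(probs[i] for i in keep)
    return [probs[i] / kept_total if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
for temperature in (1.0, 0.3):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
print(top_k_filter(softmax_with_temperature(logits, 0.3), k=2))
```

At the lower temperature most of the probability mass moves to the highest-scoring token, so sampling strays from the likeliest continuation less often.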

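The LLMGraphTransformer lets you set which types of nodes and relationships you want the LLM to generate. In your case, the text describes the employees of a company, the groups they work in, and their job titles. Restricting the LLM to just those entities makes it more likely that you get an accurate representation of the knowledge in the graph.

Calling convert_to_graph_documents has the LLMGraphTransformer build a knowledge graph from the text. This step produces graph documents: the nodes and relationships that are inserted into the graph database to represent the relevant context and entities.

from langchain_experimental.graph_transformers.llm import LLMGraphTransformer
from langchain_core.documents import Document

llm_transformer = LLMGraphTransformer(
    llm=watsonx_llm,
    allowed_nodes=["Person", "Title", "Group"],
    allowed_relationships=["TITLE", "COLLABORATES", "GROUP"]
)
documents = [Document(page_content=graph_text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)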
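The effect of an allow-list can be sketched without an LLM. The example below filters hypothetical extracted triples against the same node and relationship types. It shows the outcome you would expect from the restriction, not how LLMGraphTransformer enforces it internally; the triples are made up, and the last one is deliberately outside the schema.

```python
ALLOWED_NODES = {"Person", "Title", "Group"}
ALLOWED_RELATIONSHIPS = {"TITLE", "COLLABORATES", "GROUP"}

# Hypothetical raw extractions:
# (source id, source type, relationship, target id, target type).
candidates = [
    ("John", "Person", "TITLE", "Director of the Digital Marketing Group", "Title"),
    ("John", "Person", "COLLABORATES", "Jane", "Person"),
    ("Jane", "Person", "LIVES_IN", "Boston", "City"),  # outside the allowed schema
]

def keep_allowed(triples, allowed_nodes, allowed_relationships):
    """Drop any triple whose node types or relationship type are not allowed."""
    return [
        triple
        for triple in triples
        if triple[1] in allowed_nodes
        and triple[4] in allowed_nodes
        and triple[2] in allowed_relationships
    ]

kept = keep_allowed(candidates, ALLOWED_NODES, ALLOWED_RELATIONSHIPS)
for triple in kept:
    print(triple)
```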

Now clear any old data out of the Memgraph database and insert the new nodes and edges.

# make sure the database is empty
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")

# create knowledge graph
graph.add_graph_documents(graph_documents)
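To get a feel for what inserting a relationship looks like in Cypher, here is a sketch that builds a MERGE statement from a pair of (label, id) nodes. It is an illustration only: the statements that add_graph_documents actually issues may differ, and `cypher_string` and `merge_statement` are hypothetical helper names. In real code, query parameters are safer than building strings.

```python
def cypher_string(value):
    """Quote a Python string as a Cypher string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def merge_statement(source, relationship, target):
    """Build one MERGE statement for a (label, id) -> (label, id) relationship."""
    (source_label, source_id), (target_label, target_id) = source, target
    return (
        f"MERGE (a:{source_label} {{id: {cypher_string(source_id)}}}) "
        f"MERGE (b:{target_label} {{id: {cypher_string(target_id)}}}) "
        f"MERGE (a)-[:{relationship}]->(b)"
    )

statement = merge_statement(("Person", "John"), "COLLABORATES", ("Person", "Jane"))
print(statement)
# MERGE (a:Person {id: 'John'}) MERGE (b:Person {id: 'Jane'}) MERGE (a)-[:COLLABORATES]->(b)
```

MERGE creates a node or relationship only if no match exists, so repeating the statement does not duplicate data.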

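The extracted nodes and relationships are stored in the graph_documents object. You can print it as a string to inspect it.

print(f"{graph_documents}")

You can see the schema and data types created in the graph through its get_schema property.

graph.refresh_schema()
print(graph.get_schema)

The output looks like this: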

Node labels and properties (name and type) are:
- labels: (:Title)
properties:
- id: string
- labels: (:Group)
properties:
- id: string
- labels: (:Person)
properties:
- id: string

Nodes are connected with the following relationships:
(:Person)-[:COLLABORATES]->(:Person)
(:Person)-[:GROUP]->(:Group)
(:Person)-[:TITLE]->(:Title)
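The relationship lines in this schema are regular enough to parse, which is handy if you want to check programmatically that the graph contains only the types you allowed. The sketch below copies the three relationship lines from the output above into a string and extracts them with a regular expression; the pattern assumes this exact line format.

```python
import re

schema_text = """
(:Person)-[:COLLABORATES]->(:Person)
(:Person)-[:GROUP]->(:Group)
(:Person)-[:TITLE]->(:Title)
"""

# Matches lines of the form (:Label)-[:TYPE]->(:Label).
PATTERN = re.compile(r"\(:(\w+)\)-\[:(\w+)\]->\(:(\w+)\)")

def relationship_triples(text):
    """Return (source label, relationship type, target label) for each schema line."""
    return PATTERN.findall(text)

triples = relationship_triples(schema_text)
print(triples)

allowed = {"TITLE", "COLLABORATES", "GROUP"}
unexpected = [t for t in triples if t[1] not in allowed]
print(unexpected)  # []
```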

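You can also see the graph structure in the Memgraph Lab viewer:

Figure: The Memgraph network generated from the input text

The LLM did a reasonable job of creating the correct nodes and relationships. Now it's time to query the knowledge graph.

Step 4

Prompting the LLM correctly requires some prompt engineering. LangChain provides a FewShotPromptTemplate that can supply the LLM with examples in the prompt, helping it write correct and succinct Cypher syntax. The following code gives several examples of questions and the queries the LLM should produce. It also shows how to constrain the model's output to the query alone. An overly chatty LLM might add extra information that leads to an invalid Cypher query, so the prompt template asks the model to output only the query itself.

Adding an instructive prefix also helps constrain the model's behavior and makes it more likely that the LLM outputs correct Cypher syntax.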
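Prompting reduces chatty output but does not guarantee its absence. As a complementary safeguard, a post-processing step can try to pull the query out of a noisy completion. The following is a rough heuristic sketch, not part of LangChain or Memgraph: `extract_cypher` is a hypothetical helper that assumes the statement starts at an uppercase Cypher keyword and ends at the first blank line.

```python
import re

CYPHER_START = re.compile(r"\b(MATCH|RETURN|CALL|WITH|UNWIND)\b")
FENCE = "`" * 3  # Markdown code fence that some models wrap queries in

def extract_cypher(text):
    """Return the first run of lines starting at a Cypher keyword, or None."""
    cleaned = text.replace(FENCE + "cypher", "").replace(FENCE, "")
    match = CYPHER_START.search(cleaned)
    if match is None:
        return None
    statement = cleaned[match.start():]
    # Stop at the first blank line, where trailing commentary usually begins.
    statement = statement.split("\n\n", 1)[0]
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(statement.split())

noisy = (
    "Sure! Here is the query:\n"
    "MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title)\n"
    "RETURN t.id\n"
    "\n"
    "This query finds John's title."
)
print(extract_cypher(noisy))
# MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id
```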

from langchain_core.prompts import PromptTemplate, FewShotPromptTemplate

# Few-shot examples: doubled braces keep the literal Cypher braces from being
# read as template variables
examples = [
    {
        "question": "<|begin_of_text|>What group is Charles in?<|eot_id|>",
        "query": "<|begin_of_text|>MATCH (p:Person {{id: 'Charles'}})-[:GROUP]->(g:Group) RETURN g.id<|eot_id|>",
    },
    {
        "question": "<|begin_of_text|>Who does Paul work with?<|eot_id|>",
        "query": "<|begin_of_text|>MATCH (a:Person {{id: 'Paul'}})-[:COLLABORATES]->(p:Person) RETURN p.id<|eot_id|>",
    },
    {
        "question": "<|begin_of_text|>What title does Rico have?<|eot_id|>",
        "query": "<|begin_of_text|>MATCH (p:Person {{id: 'Rico'}})-[:TITLE]->(t:Title) RETURN t.id<|eot_id|>",
    }
]

example_prompt = PromptTemplate.from_template(
    "<|begin_of_text|>{query}<|eot_id|>"
)

prefix = """
Instructions:
- Respond with ONE and ONLY ONE query.
- Use provided node and relationship labels and property names from the
schema which describes the database's structure. Upon receiving a user
question, synthesize the schema to craft a precise Cypher query that
directly corresponds to the user's intent.
- Generate valid executable Cypher queries on top of Memgraph database.
Any explanation, context, or additional information that is not a part
of the Cypher query syntax should be omitted entirely.
- Use Memgraph MAGE procedures instead of Neo4j APOC procedures.
- Do not include any explanations or apologies in your responses. Only answer the question asked.
- Do not include additional questions. Only the original user question.
- Do not include any text except the generated Cypher statement.
- For queries that ask for information or functionalities outside the direct
generation of Cypher queries, use the Cypher query format to communicate
limitations or capabilities. For example: RETURN "I am designed to generate Cypher queries based on the provided schema only."

Here is the schema information

{schema}

With all the above information and instructions, generate Cypher query for the
user question.

The question is:

{question}

Below are a number of examples of questions and their corresponding Cypher queries."""

cypher_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix=prefix,
    suffix="User input: {question}\nCypher query: ",
    input_variables=["question", "schema"],
)
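The doubled braces in the examples make more sense once you see a few-shot prompt assembled in two formatting passes. The sketch below is a simplified model written with plain `str.format`, under the assumption that examples are filled in first and the combined text is formatted again with the user's variables. `render_few_shot` is a hypothetical function, not the LangChain implementation.

```python
demo_examples = [
    {
        "question": "What group is Charles in?",
        "query": "MATCH (p:Person {{id: 'Charles'}})-[:GROUP]->(g:Group) RETURN g.id",
    },
]

def render_few_shot(prefix, examples, example_template, suffix, **variables):
    """Assemble prefix, formatted examples, and suffix, then fill in the variables."""
    # First pass: example values are inserted verbatim, so {{ and }} survive.
    example_blocks = [example_template.format(**example) for example in examples]
    template = "\n\n".join([prefix, *example_blocks, suffix])
    # Second pass: user variables are filled in and {{ }} collapses to { }.
    return template.format(**variables)

prompt = render_few_shot(
    prefix="Schema:\n{schema}",
    examples=demo_examples,
    example_template="Question: {question}\nCypher: {query}",
    suffix="User input: {question}\nCypher query: ",
    schema="(:Person)-[:GROUP]->(:Group)",
    question="What group is Jane in?",
)
print(prompt)
```

With single braces in the example query, the second pass would treat `{id: 'Charles'}` as a template variable and fail.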

Next, you create a prompt to control how the LLM answers the question with the information returned from Memgraph. You give the LLM several examples and instructions on how to respond once it has context information back from the graph database.

qa_examples = [
    {
        "question": "<|begin_of_text|>What group is Charles in?<|eot_id|>",
        "context": "[{{'g.id': 'Executive Group'}}]",
        "response": "Charles is in the Executive Group<|eot_id|>"
    },
    {
        "question": "<|begin_of_text|>Who does Paul work with?<|eot_id|>",
        "context": "[{{'p.id': 'Greg'}}, {{'p2.id': 'Norma'}}]",
        "response": "Paul works with Greg and Norma<|eot_id|>"
    },
    {
        "question": "<|begin_of_text|>What title does Rico have?<|eot_id|>",
        "context": "[{{'t.id': 'Vice President of Sales'}}]",
        "response": "Vice President of Sales<|eot_id|>"
    }
]

qa_template = """
Use the provided question and context to create an answer.

Question: {question}

Context: {context}
Use only names, departments, or titles contained within {question} and {context}.
"""
# Render each example as question, context, and response
# (this layout is an assumption; an empty template would render every example as an empty string)
qa_example_prompt = PromptTemplate.from_template(
    "{question}\nContext: {context}\n{response}"
)

qa_prompt = FewShotPromptTemplate(
    examples=qa_examples,
    prefix=qa_template,
    input_variables=["question", "context"],
    example_prompt=qa_example_prompt,
    suffix=" "
)

Now it's time to create the question answering chain. MemgraphQAChain lets you set which LLM to use, which graph schema to use, and debugging options. A temperature of 0 together with a length penalty encourages the LLM to keep the generated Cypher query short and direct.

query_gen_parameters = {
    GenTextParamsMetaNames.DECODING_METHOD: "sample",
    GenTextParamsMetaNames.MAX_NEW_TOKENS: 100,
    GenTextParamsMetaNames.MIN_NEW_TOKENS: 1,
    GenTextParamsMetaNames.TEMPERATURE: 0.0,
    GenTextParamsMetaNames.TOP_K: 1,
    GenTextParamsMetaNames.TOP_P: 0.9,
    GenTextParamsMetaNames.LENGTH_PENALTY: {'decay_factor': 1.2, 'start_index': 20}
}

chain = MemgraphQAChain.from_llm(
    llm=WatsonxLLM(
        model_id="meta-llama/llama-3-3-70b-instruct",
        url="https://us-south.ml.cloud.ibm.com",
        # read the project ID from the environment instead of hard-coding it
        project_id=os.getenv("WATSONX_PROJECT_ID"),
        params=query_gen_parameters,
    ),
    graph=graph,
    allow_dangerous_requests=True,
    verbose=True,
    return_intermediate_steps=True,  # for debugging
    cypher_prompt=cypher_prompt,
    qa_prompt=qa_prompt,
)
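Conceptually, the chain runs three stages: turn the question into Cypher, run the query, and turn the rows into an answer. The sketch below mirrors that flow with stand-in functions so it runs without an LLM or a database. `run_graph_qa` and the `fake_*` helpers are hypothetical; only the shape of the returned dictionary follows the chain output shown in this tutorial.

```python
def run_graph_qa(question, generate_cypher, run_query, generate_answer):
    """Question -> Cypher -> rows -> answer, keeping the intermediate steps."""
    query = generate_cypher(question)
    context = run_query(query)
    result = generate_answer(question, context)
    return {
        "query": question,
        "result": result,
        "intermediate_steps": [{"query": query}, {"context": context}],
    }

# Stand-ins for the LLM calls and the database call.
def fake_generate_cypher(question):
    return "MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id"

def fake_run_query(query):
    return [{"t.id": "Director of the Digital Marketing Group"}]

def fake_generate_answer(question, context):
    return f"John's title is {context[0]['t.id']}."

output = run_graph_qa(
    "What is Johns title?", fake_generate_cypher, fake_run_query, fake_generate_answer
)
print(output["result"])
```

Keeping the intermediate steps in the result makes it easier to tell whether a wrong answer came from the generated query or from the answer-writing stage.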

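Now you can invoke the chain with a natural language question (note that your responses might differ slightly because LLMs are not fully deterministic).

chain.invoke("What is Johns title?")

This outputs: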

> Entering new MemgraphQAChain chain...
Generated Cypher:
 MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id
Full Context:
[{'t.id': 'Director of the Digital Marketing Group'}]

> Finished chain.
{'query': 'What is Johns title?',
 'result': ' \nAnswer: Director of the Digital Marketing Group.',
 'intermediate_steps': [{'query': " MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id"},
  {'context': [{'t.id': 'Director of the Digital Marketing Group'}]}]}

For the next question, ask the chain something slightly more complex:

chain.invoke("Who does John collaborate with?")

This should return:

> Entering new MemgraphQAChain chain...
Generated Cypher:
MATCH (p:Person {id: 'John'})-[:COLLABORATES]->(c:Person) RETURN c
Full Context:
[{'c': {'id': 'Jane'}}]

> Finished chain.
{'query': 'Who does John collaborate with?',
'result': ' \nAnswer: John collaborates with Jane.',
'intermediate_steps': [{'query': " MATCH (p:Person {id: 'John'})-[:COLLABORATES]->(c:Person) RETURN c"},
{'context': [{'c': {'id': 'Jane'}}]}]}

The correct answer is contained in the response. In some cases there may be extra text that you want to remove before returning the answer to the end user.
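One way to tidy the result before showing it to a user is to strip surrounding whitespace, a leading "Answer:" label, and any leftover end-of-turn token. This is a minimal sketch based on the outputs shown in this tutorial; `clean_answer` is a hypothetical helper, and other models may need different rules.

```python
def clean_answer(raw):
    """Strip whitespace, a leading 'Answer:' label, and any end-of-turn token."""
    text = raw.replace("<|eot_id|>", "").strip()
    label = "Answer:"
    if text.startswith(label):
        text = text[len(label):].strip()
    return text

print(clean_answer(" \nAnswer: Director of the Digital Marketing Group."))
# Director of the Digital Marketing Group.
```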

You can ask the Memgraph chain about Group relationships:

chain.invoke("What group is Jane in?")

This returns:

> Entering new MemgraphQAChain chain...
Generated Cypher:
MATCH (p:Person {id: 'Jane'})-[:GROUP]->(g:Group) RETURN g.id
Full Context:
[{'g.id': 'Executive Group'}]

> Finished chain.
{'query': 'What group is Jane in?',
'result': 'Jane is in Executive Group.',
'intermediate_steps': [{'query': " MATCH (p:Person {id: 'Jane'})-[:GROUP]->(g:Group) RETURN g.id"},
{'context': [{'g.id': 'Executive Group'}]}]}

This is the correct answer.

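Finally, ask the chain a question whose answer involves two people:

chain.invoke("Who does Jane collaborate with?")

The output should be:

> Entering new MemgraphQAChain chain...
Generated Cypher:
MATCH (p:Person {id: 'Jane'})-[:COLLABORATES]->(c:Person) RETURN c
Full Context:
[{'c': {'id': 'Sharon'}}]

> Finished chain.
{'query': 'Who does Jane collaborate with?',
'result': ' Jane collaborates with Sharon.',
'intermediate_steps': [{'query': " MATCH (p:Person {id: 'Jane'})-[:COLLABORATES]->(c:Person) RETURN c"},
{'context': [{'c': {'id': 'Sharon'}}]}]}

The output shown here names only Sharon. The generated query follows COLLABORATES edges in one direction, from Jane to Sharon, so it misses John, whose edge points from John to Jane. An undirected pattern, -[:COLLABORATES]-, would match both collaborators.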
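Edge direction matters for this last question. In the outputs above, the COLLABORATES edges run from John to Jane and from Jane to Sharon. The sketch below models those two edges as plain tuples and compares a directed lookup with one that ignores direction; it illustrates the matching behavior only and does not query Memgraph.

```python
# COLLABORATES edges as (source, target), taken from the chain outputs above.
COLLABORATES = [("John", "Jane"), ("Jane", "Sharon")]

def outgoing(person):
    """Like (p)-[:COLLABORATES]->(c): follow edges that start at person."""
    return [target for source, target in COLLABORATES if source == person]

def either_direction(person):
    """Like (p)-[:COLLABORATES]-(c): follow edges that touch person at either end."""
    forward = [target for source, target in COLLABORATES if source == person]
    backward = [source for source, target in COLLABORATES if target == person]
    return forward + backward

print(outgoing("Jane"))          # ['Sharon']
print(either_direction("Jane"))  # ['Sharon', 'John']
```

If collaboration should be treated as symmetric, one option is to add an example with an undirected pattern to the few-shot prompt.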

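Summary

In this tutorial, you built a Graph RAG application with Memgraph and watsonx that generates graph data structures and queries them. Using an LLM through watsonx, you extracted node and edge information from natural language source text and used it to populate a graph database. You then used watsonx to turn natural language questions about that source text into Cypher queries that extract information from the graph database. Using prompt engineering, the LLM turned the results from the Memgraph database into natural language responses.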
