Knowledge Base Migration in Practice 03: Permissions, Versions, and Sync, So the AI Doesn't Read the Wrong Knowledge

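In the first two parts we did two things:

Part 1: inventoried the knowledge sources and designed the taxonomy.
Part 2: cleaned the raw documents into processed/chunks.jsonl.

The question now: can these chunks be handed straight to RAG?

No.

An enterprise knowledge base still needs at least three gates:

Permissions: sales must not see sensitive finance policies, and ordinary customer service agents must not see the VIP compensation policy.
Versions: a question about an old product must not be answered from the new product manual, and a question about current policy must not cite a superseded one.
Sync: when a source document changes, the knowledge base has to know which files changed and which index entries need rebuilding.

This part builds a small runnable experiment. You will end up with: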

text

kb-migration/
  processed/chunks.jsonl
  tools/access_control.py
  tools/sync_state.json
  tools/sync_check.py

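Prepare a more realistic chunks.jsonl

Assume Part 2 has already produced processed/chunks.jsonl. To make the permission and version logic visible, we hand-write a sample whose records differ in exactly those fields.

processed/chunks.jsonl: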

json

{"id":"refund-2025","text":"2025 版售后政策：签收后 7 天内可申请无理由退货。","source":"docs/refund_policy.md","owner":"客服运营","domain":"business","topic":"after_sales","permission":"internal","department":"customer_service","version":"2025.12","effective_from":"2025-12-01","effective_to":"2026-03-31","lifecycle":"deprecated"}
{"id":"refund-2026","text":"2026 版售后政策：签收后 7 天内，商品未拆封且配件齐全，可申请无理由退货。","source":"docs/refund_policy.md","owner":"客服运营","domain":"business","topic":"after_sales","permission":"internal","department":"customer_service","version":"2026.04","effective_from":"2026-04-01","effective_to":"","lifecycle":"current"}
{"id":"vip-compensation","text":"VIP 客户补偿策略：补偿金额超过 100 元必须由客服主管审批。","source":"docs/vip_policy.md","owner":"客服主管","domain":"business","topic":"after_sales","permission":"department","department":"customer_service","version":"2026.04","effective_from":"2026-04-01","effective_to":"","lifecycle":"current"}
{"id":"finance-rule","text":"财务制度：月度成本报表仅财务部和管理层可见。","source":"docs/finance_policy.md","owner":"财务部","domain":"policy","topic":"finance","permission":"department","department":"finance","version":"2026.01","effective_from":"2026-01-01","effective_to":"","lifecycle":"current"}
{"id":"secret-discount","text":"特殊折扣策略：大客户续约可申请隐藏折扣。","source":"docs/discount_secret.md","owner":"销售总监","domain":"business","topic":"sales","permission":"restricted","allowed_users":["u_manager"],"version":"2026.02","effective_from":"2026-02-01","effective_to":"","lifecycle":"current"}

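These 5 chunks cover three permission types and two lifecycle states:

id	permission	lifecycle	Notes
`refund-2025`	internal	deprecated	Old refund policy; should not be retrieved by default
`refund-2026`	internal	current	Current after-sales policy
`vip-compensation`	department	current	Visible to the customer service department only
`finance-rule`	department	current	Visible to the finance department only
`secret-discount`	restricted	current	Visible to explicitly listed users only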

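Permissions are not a prompt; permissions are a pre-retrieval filter

Many people write this in the prompt:

text

Do not reveal knowledge the user is not authorized to access.

That sentence helps, but it is not enough.

The correct approach: knowledge the user is not authorized to see must never enter the model's context. The model cannot leak what it never saw.

In other words, permission filtering happens here:

text

user question -> load candidate chunks -> permission filter -> version filter -> retrieval/generation

and not here:

text

user question -> retrieve over all knowledge -> hand to the model -> hope the model does not leak

The second pipeline is high risk. If the prompt is bypassed (for example through prompt injection), or the model misreads the permission instructions, restricted content can leak.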
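To make the difference concrete, here is a minimal sketch of the first pipeline's last step. The `build_prompt` helper and the `allowed` predicate are hypothetical stand-ins, not the permission check this article builds later; the only point is that a filtered-out chunk never becomes part of the string sent to the model.

```python
from typing import Callable

def build_prompt(question: str, chunks: list[dict],
                 allowed: Callable[[dict], bool]) -> str:
    """Assemble the model context from permitted chunks only."""
    visible = [c for c in chunks if allowed(c)]  # filter BEFORE prompt assembly
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in visible)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Illustrative records, not the real chunks.jsonl content.
chunks = [
    {"id": "refund-2026", "permission": "internal", "text": "7-day no-reason returns."},
    {"id": "secret-discount", "permission": "restricted", "text": "Hidden renewal discount."},
]

# Hypothetical predicate: this caller may only read internal knowledge.
prompt = build_prompt(
    "What is the return window?",
    chunks,
    allowed=lambda c: c["permission"] == "internal",
)
print(prompt)
```

Whatever the model does with this prompt, it has nothing about the restricted chunk to repeat.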

Define the user context

Permission filtering needs to know who is asking.

We represent the user with a simple object:

python

{
    "user_id": "u_cs_001",
    "departments": ["customer_service"],
    "scopes": ["public", "internal"]
}

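Field meanings:

user_id: the user's ID, used for exact per-user grants on restricted chunks.
departments: the departments the user belongs to, used for department permission.
scopes: the general scopes the user holds, such as public and internal.

In a real system this information usually comes from the login session, a JWT, IAM, the company directory, or a back-office permission system.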
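As one possible illustration, the sketch below maps the claims of an already verified token to that user object. `user_from_claims` is a hypothetical helper; `sub` is the standard JWT subject claim, while `departments` and `scopes` are assumed custom claims that your identity provider may name differently. Signature verification is assumed to have happened before this function is called.

```python
def user_from_claims(claims: dict) -> dict:
    """Map verified token claims to the user context used for filtering.

    Assumption: "departments" and "scopes" are custom claims; only "sub"
    is a registered JWT claim.
    """
    user_id = claims.get("sub")
    if not user_id:
        raise ValueError("claims carry no subject; refusing to build a user context")
    return {
        "user_id": user_id,
        "departments": list(claims.get("departments", [])),
        # Missing scopes fall back to the least privilege, not to "internal".
        "scopes": list(claims.get("scopes", ["public"])),
    }

user = user_from_claims({
    "sub": "u_cs_001",
    "departments": ["customer_service"],
    "scopes": ["public", "internal"],
})
print(user)
```

Defaulting to `["public"]` when no scopes are present is a deliberate choice in this sketch: a token with incomplete claims should see less, not more.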

Write the permission and version filter script

tools/access_control.py:

python

import json
from pathlib import Path

CHUNKS_FILE = Path("processed/chunks.jsonl")

def load_chunks() -> list[dict]:
    """Read one JSON object per non-empty line."""
    return [
        json.loads(line)
        for line in CHUNKS_FILE.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

def can_access(user: dict, chunk: dict) -> bool:
    """Permission check; anything unrecognized is denied."""
    permission = chunk.get("permission")

    if permission == "public":
        return True

    if permission == "internal":
        return "internal" in user.get("scopes", [])

    if permission == "department":
        return chunk.get("department") in user.get("departments", [])

    if permission == "restricted":
        return user.get("user_id") in chunk.get("allowed_users", [])

    # Unknown or missing permission: fail closed.
    return False

def is_current(chunk: dict) -> bool:
    # Simplification: any non-empty effective_to counts as expired,
    # even when that date is still in the future.
    is_not_expired = chunk.get("effective_to") in (None, "")
    is_not_deprecated = chunk.get("lifecycle") != "deprecated"
    return is_not_expired and is_not_deprecated

def version_match(chunk: dict, requested_version: str | None = None) -> bool:
    # An explicit version bypasses the lifecycle check, so deprecated chunks stay queryable.
    if requested_version:
        return chunk.get("version") == requested_version
    return is_current(chunk)

def filter_chunks(user: dict, chunks: list[dict], requested_version: str | None = None) -> list[dict]:
    """Keep chunks that pass both the permission and the version check."""
    return [
        chunk for chunk in chunks
        if can_access(user, chunk) and version_match(chunk, requested_version)
    ]

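The script answers two questions:

can_access: may this user see this chunk?
version_match: is this chunk the version that should be used right now?

Note that permission and version are independent dimensions. A user being allowed to see a chunk does not make that chunk the currently effective version.

Add a test entry point

Append the following to the end of tools/access_control.py: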

python

def print_visible(user: dict, requested_version: str | None = None):
    chunks = load_chunks()
    visible = filter_chunks(user, chunks, requested_version)
    print(f"user={user['user_id']}, version={requested_version or 'current'}")
    for chunk in visible:
        print("-", chunk["id"], chunk["version"], chunk["permission"], chunk["text"])

if __name__ == "__main__":
    users = [
        {
            "user_id": "u_cs_001",
            "departments": ["customer_service"],
            "scopes": ["public", "internal"],
        },
        {
            "user_id": "u_fin_001",
            "departments": ["finance"],
            "scopes": ["public", "internal"],
        },
        {
            "user_id": "u_manager",
            "departments": ["sales"],
            "scopes": ["public", "internal"],
        },
    ]

    for user in users:
        print_visible(user)
        print()

    print("Explicitly requesting an old version:")
    print_visible(users[0], requested_version="2025.12")

Run it:

bash

python3 tools/access_control.py

You should see something like this (chunk text abbreviated with ...):

text

user=u_cs_001, version=current
- refund-2026 2026.04 internal 2026 版售后政策...
- vip-compensation 2026.04 department VIP 客户补偿策略...

user=u_fin_001, version=current
- refund-2026 2026.04 internal 2026 版售后政策...
- finance-rule 2026.01 department 财务制度...

user=u_manager, version=current
- refund-2026 2026.04 internal 2026 版售后政策...
- secret-discount 2026.02 restricted 特殊折扣策略...

Explicitly requesting an old version:
user=u_cs_001, version=2025.12
- refund-2025 2025.12 internal 2025 版售后政策...

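What this output shows:

The customer service user sees the current after-sales policy plus customer service department knowledge. Because vip-compensation is granted at department level, every customer_service member sees it; limiting it to supervisors would take a restricted grant with allowed_users.
The finance user sees the current after-sales policy and the finance policy, but not the customer service compensation policy.
The named user u_manager sees the restricted discount policy.
refund-2025 is not returned by default, only when the old version is requested explicitly.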

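Why version selection works this way

Version logic is the easiest part to underestimate.

Suppose the user asks:

text

Can a product bought in 2025 still be returned under the old policy?

Here you may need to look up the old version.

But if the user asks:

text

What are the current conditions for the 7-day no-reason return?

then only the current version should be searched.

So the default strategy should be:

No version passed: return only currently effective knowledge.
Version passed explicitly: return that version.
Old versions stay queryable, but never take part in current Q&A by default.

This keeps the AI from citing expired knowledge.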
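The first question above carries a purchase date rather than a version string. One way to serve it, sketched here under assumptions, is to pick the chunk whose effective window contains that date. `effective_on` is a hypothetical helper; it assumes effective_from and effective_to are ISO dates and that an empty effective_to means "still open". The permission filter would still have to run alongside it.

```python
from datetime import date

def effective_on(chunk: dict, day: date) -> bool:
    """True if `day` falls inside the chunk's effective window (inclusive)."""
    start = date.fromisoformat(chunk["effective_from"])
    end_raw = chunk.get("effective_to") or ""  # "" or None means open-ended
    end = date.fromisoformat(end_raw) if end_raw else date.max
    return start <= day <= end

# Only the date fields of the two refund chunks are reproduced here.
chunks = [
    {"id": "refund-2025", "effective_from": "2025-12-01", "effective_to": "2026-03-31"},
    {"id": "refund-2026", "effective_from": "2026-04-01", "effective_to": ""},
]

purchase_day = date(2025, 12, 20)
matches = [c["id"] for c in chunks if effective_on(c, purchase_day)]
print(matches)  # → ['refund-2025']
```

Extracting the date from the user's question is a separate problem and is not covered by this sketch.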

Wire the filter in before RAG retrieval

Real RAG systems usually take one of two approaches.

Approach 1: filter first, then run vector search.

python

visible_chunks = filter_chunks(user, all_chunks)
results = vector_search(query, visible_chunks)

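This suits small knowledge bases or in-memory retrieval.

Approach 2: write the permission and version fields into the vector store's metadata and add a filter at query time.

Pseudocode: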

python

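vector_store.search(
    query=query,
    filters={
        # Covers scope-level chunks only; department and restricted chunks
        # need extra clauses on "department" and "allowed_users".
        "permission": {"$in": user["scopes"]},
        # Default to current knowledge; filter on "version" instead
        # when the caller asks for a specific version.
        "lifecycle": "current",
    },
)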

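This suits production systems, because at scale loading every chunk and filtering it in application code is slow.

This part uses plain Python lists so the permission and version logic is easy to follow. Once the logic is confirmed, move it into the metadata filter of Chroma, Qdrant, Milvus, Elasticsearch, or pgvector.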
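When you do move the logic over, the user context has to become a single filter expression. The sketch below shows one way to build it. `build_filter` is a hypothetical helper, and the operator names (`$and`, `$or`, `$in`, `$contains`) follow a generic Mongo-style dialect chosen for illustration; every engine listed above has its own filter syntax and its own limits on list-valued metadata, so treat the output as a shape to translate, not as something to paste.

```python
from typing import Optional

def build_filter(user: dict, requested_version: Optional[str] = None) -> dict:
    """Translate the user context into one metadata filter expression."""
    access = [{"permission": "public"}]  # public is visible to everyone
    if "internal" in user.get("scopes", []):
        access.append({"permission": "internal"})
    if user.get("departments"):
        access.append({"$and": [
            {"permission": "department"},
            {"department": {"$in": user["departments"]}},
        ]})
    if user.get("user_id"):
        access.append({"$and": [
            {"permission": "restricted"},
            {"allowed_users": {"$contains": user["user_id"]}},
        ]})
    # An explicit version replaces the "current only" default.
    version = ({"version": requested_version} if requested_version
               else {"lifecycle": "current"})
    return {"$and": [{"$or": access}, version]}

user = {
    "user_id": "u_cs_001",
    "departments": ["customer_service"],
    "scopes": ["public", "internal"],
}
print(build_filter(user))
```

Keeping the permission branches in one `$or` mirrors the structure of can_access, which makes it easier to check that the store-side filter and the Python reference agree.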

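Sync: how to tell that a source document changed

A knowledge base is not a one-off import. Source documents get updated, so we need to track sync state.

We start with the simplest possible sync state file:

tools/sync_state.json: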

json

{
  "docs/refund_policy.md": {
    "last_hash": "",
    "last_synced_at": ""
  },
  "docs/vip_policy.md": {
    "last_hash": "",
    "last_synced_at": ""
  }
}

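Meaning:

last_hash: hash of the file content at the last sync.
last_synced_at: time of the last sync.

Whenever a file's current hash differs from last_hash, the file has changed and must be cleaned, chunked, and indexed again.

Write the sync check script

tools/sync_check.py: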

python

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("tools/sync_state.json")

def file_hash(path: Path) -> str:
    """SHA-256 of the raw file bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def load_state() -> dict:
    if not STATE_FILE.exists():
        return {}
    return json.loads(STATE_FILE.read_text(encoding="utf-8"))

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8")

def check_file(path: Path, state: dict) -> bool:
    """True if the file is new or its content differs from the last sync."""
    current_hash = file_hash(path)
    # as_posix() keeps keys like "docs/refund_policy.md" stable on Windows too.
    old_hash = state.get(path.as_posix(), {}).get("last_hash")
    return current_hash != old_hash

def mark_synced(path: Path, state: dict) -> None:
    state[path.as_posix()] = {
        "last_hash": file_hash(path),
        "last_synced_at": datetime.now(timezone.utc).isoformat(),
    }

def main() -> None:
    state = load_state()
    files = [Path("docs/refund_policy.md"), Path("docs/vip_policy.md")]

    changed_files = []
    for path in files:
        if not path.exists():
            print(f"skip missing: {path}")
            continue
        if check_file(path, state):
            changed_files.append(path)

    if not changed_files:
        print("no changes")
        return

    print("changed files:")
    for path in changed_files:
        print("-", path)
        # Demo only: a real pipeline marks a file synced after re-indexing succeeds.
        mark_synced(path, state)

    save_state(state)

if __name__ == "__main__":
    main()

Prepare two source documents:

bash

mkdir -p docs
echo 'refund policy v1' > docs/refund_policy.md
echo 'vip policy v1' > docs/vip_policy.md
python3 tools/sync_check.py

The first run prints:

text

changed files:
- docs/refund_policy.md
- docs/vip_policy.md

Run it again:

bash

python3 tools/sync_check.py

It should print:

text

no changes

Modify one file:

bash

echo 'refund policy v2' > docs/refund_policy.md
python3 tools/sync_check.py

You should see only:

text

changed files:
- docs/refund_policy.md

That completes the minimal change detection.
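The minimal script only skips a missing file, so a source document that has been deleted would leave its chunks in the index. A possible extension, sketched here as a pure function, separates new, changed, and deleted paths. `classify_changes` is a hypothetical helper and `docs/faq.md` is an invented path used only for the example; the `state` argument has the same shape as sync_state.json.

```python
def classify_changes(state: dict, current: dict) -> dict:
    """Compare stored hashes with the hashes of files that exist now.

    state:   {path: {"last_hash": ...}} as stored in sync_state.json
    current: {path: sha256 hex digest} for files present on disk
    """
    result = {"new": [], "changed": [], "deleted": []}
    for path, digest in current.items():
        old = state.get(path, {}).get("last_hash")
        if not old:
            result["new"].append(path)       # never synced before
        elif old != digest:
            result["changed"].append(path)   # content differs from last sync
    # Previously synced paths that no longer exist: their chunks should leave the index.
    result["deleted"] = [
        path for path, entry in state.items()
        if path not in current and entry.get("last_hash")
    ]
    return result

state = {
    "docs/refund_policy.md": {"last_hash": "aaa"},
    "docs/vip_policy.md": {"last_hash": "bbb"},
}
current = {"docs/refund_policy.md": "aaa2", "docs/faq.md": "ccc"}
changes = classify_changes(state, current)
print(changes)
```

Entries with an empty last_hash are ignored when looking for deletions, since a file that was never synced has nothing in the index to remove.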

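What to do after change detection

Detecting a change is not the end, and updating the state file is not enough. The real flow should be:

text

source document change detected
  -> re-parse the document
  -> re-clean and re-chunk
  -> generate new chunk ids or overwrite the old chunks
  -> update the vector index
  -> run the acceptance questions
  -> write a sync log entry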
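For the "overwrite the old chunks" and "update the vector index" steps, it helps to know exactly which chunk ids changed instead of re-indexing the whole file. The sketch below compares the chunks from before and after re-chunking. `chunk_digest` and `diff_chunks` are hypothetical helpers, and the choice of which fields go into the digest is an assumption; the sample records are illustrative.

```python
import hashlib

def chunk_digest(chunk: dict) -> str:
    """Hash the fields that affect retrieval: the text and the filter metadata."""
    keys = ("text", "permission", "department", "allowed_users", "version", "lifecycle")
    payload = "\x1f".join(str(chunk.get(k, "")) for k in keys)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff_chunks(old: list[dict], new: list[dict]) -> dict:
    """Return the chunk ids to upsert into and delete from the index."""
    old_map = {c["id"]: chunk_digest(c) for c in old}
    new_map = {c["id"]: chunk_digest(c) for c in new}
    upsert = [cid for cid, digest in new_map.items() if old_map.get(cid) != digest]
    delete = [cid for cid in old_map if cid not in new_map]
    return {"upsert": upsert, "delete": delete}

old = [
    {"id": "refund-2026", "text": "v1", "permission": "internal"},
    {"id": "vip-compensation", "text": "same", "permission": "department"},
    {"id": "old-faq", "text": "gone", "permission": "internal"},
]
new = [
    {"id": "refund-2026", "text": "v2", "permission": "internal"},
    {"id": "vip-compensation", "text": "same", "permission": "department"},
    {"id": "new-faq", "text": "hello", "permission": "internal"},
]
plan = diff_chunks(old, new)
print(plan)  # → {'upsert': ['refund-2026', 'new-faq'], 'delete': ['old-faq']}
```

Including the permission fields in the digest matters: a chunk whose text is unchanged but whose permission was tightened still has to be rewritten in the index.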

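High-risk knowledge, such as finance policies, legal terms, and compensation rules, should additionally go through human review:

text

change detected -> clean and chunk -> mark as draft -> human review -> enter the production index once approved

Do not let every document flow straight into production RAG the moment it is updated. Different knowledge carries different risk, so the sync strategy should differ too.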
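A risk-based strategy can start as a small routing function. The sketch below is one possible shape: `HIGH_RISK_TOPICS` and `review_status` are hypothetical, the topic names are assumptions to adapt to your own taxonomy, and treating every restricted chunk as high risk is a choice made for this example rather than a rule from the pipeline above.

```python
# Hypothetical risk rules: topics that need human review before going live.
HIGH_RISK_TOPICS = {"finance", "legal", "compensation"}

def review_status(chunk: dict) -> str:
    """Return the status a freshly synced chunk should start in."""
    sensitive = chunk.get("permission") == "restricted"
    if chunk.get("topic") in HIGH_RISK_TOPICS or sensitive:
        return "draft"     # held out of the production index until approved
    return "approved"      # low risk: may be indexed right away

chunks = [
    {"id": "finance-rule", "topic": "finance", "permission": "department"},
    {"id": "refund-2026", "topic": "after_sales", "permission": "internal"},
    {"id": "secret-discount", "topic": "sales", "permission": "restricted"},
]
statuses = {c["id"]: review_status(c) for c in chunks}
for chunk_id, status in statuses.items():
    print(chunk_id, "->", status)
```

Only chunks that come back as approved would be written to the production index; draft chunks wait in a review queue.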

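FAQ

Why can customer service see the internal after-sales policy?

Because the user's scopes include internal. An internal chunk is readable by every user who holds the internal scope, which in this sample is every company-internal user.

Why can't customer service see finance-rule?

Because finance-rule has department permission and its department is finance. The customer service user's departments list contains only customer_service, so the chunk is filtered out.

Why is refund-2025 hidden by default?

Because its lifecycle is deprecated and its effective_to is set. A default query returns only currently effective versions.

What is the difference between restricted and department?

department is a department-level grant, suited to "visible to finance" or "visible to customer service". restricted is a per-user grant, suited to sensitive policies that only a few people may see.

Can I use permission alone, without department?

Not recommended. permission=department only says the chunk is granted by department; which department may read it still has to come from the department field.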
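A small check at ingestion time can catch chunks whose permission lacks its companion field. The sketch below is one way to write it; `validate_chunk` is a hypothetical helper and the `orphan` record is invented for the example.

```python
def validate_chunk(chunk: dict) -> list[str]:
    """Return a list of metadata problems; an empty list means the chunk is usable."""
    problems = []
    permission = chunk.get("permission")
    if permission not in {"public", "internal", "department", "restricted"}:
        problems.append(f"unknown permission: {permission!r}")
    if permission == "department" and not chunk.get("department"):
        problems.append("permission=department requires a department")
    if permission == "restricted" and not chunk.get("allowed_users"):
        problems.append("permission=restricted requires allowed_users")
    return problems

good = {"id": "finance-rule", "permission": "department", "department": "finance"}
bad = {"id": "orphan", "permission": "department"}
print(validate_chunk(good))  # → []
print(validate_chunk(bad))   # → ['permission=department requires a department']
```

With the can_access function from this article, a chunk like `orphan` would simply be invisible to everyone, which is safe but easy to miss; validating up front turns that silent gap into a visible error.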

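Acceptance checklist

After finishing this part, you should be able to verify that:

Ordinary internal users cannot see restricted knowledge.
The customer service department sees customer service knowledge and not finance department knowledge.
The finance department sees finance knowledge and not the customer service compensation policy.
Only currently effective versions are returned by default.
Passing requested_version makes old versions queryable.
Unchanged file content does not trigger a repeat sync.
Changed file content is detected.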
