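chore: tidy up docs and IDE configuration

- .kiro/specs/ → docs/specs/ (migrated 41 historical requirement specs, removed .config.kiro)
- CLAUDE.md split into three layers: slimmed-down root file + apps/backend/CLAUDE.md + .claude/commands/
- Added two workflow commands: /spec-close and /pre-change
- Refreshed the DDL baseline (re-exported 11 files from the test database; dws 35→38 tables, biz 18→21 tables)
- Unified naming BD_Manual → BD_manual (48 files)
- Fixed 3 inconsistencies between docs and database (auth.users.status default value, scheduled_tasks fields, RLS view count)
- Added BD_manual_public_rbac_tables.md (8 RBAC/workflow tables in the public schema)
- Merged the biz.trigger_jobs docs (10→12 fields, standalone doc archived)
- Updated the docs/database/README.md index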

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Neo
2026-04-06 00:02:37 +08:00
parent 8228b3fa37
commit 70324d8542
185 changed files with 13595 additions and 1219 deletions


@@ -0,0 +1,596 @@
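# Design: Full-Chain Trace Logging for Development and Debugging
## Overview
Provides end-to-end request tracing for joint debugging of the mini-program frontend and backend. The backend collects fine-grained logs (spans) for every layer, from the moment an HTTP request enters down to the database query, and writes them to JSON Lines log files. admin-web provides a "Dev/Test Logs" section that supports filtering by time, type, and other dimensions and viewing the complete request chain.
Enabled only in development/test environments; disabled in production via a switch.
## Architecture
### Data flow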
```
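Mini-program request → FastAPI backend
TraceMiddleware (generate request_id, start timing)
CORS middleware (record MIDDLEWARE span)
ResponseWrapperMiddleware (record MIDDLEWARE span)
Auth layer (record AUTH span, including failure reason category)
Route handler (record ROUTE span)
Service layer (record SERVICE span, including function name and parameters)
Database layer (record DB_QUERY span, including SQL, parameters, row count, duration)
├─ Connection acquire (record DB_CONN span, including acquisition time)
└─ Connection release (record DB_CONN_RELEASE span)
[Branch] SSE streaming response (record SSE_START / SSE_EVENT / SSE_END spans)
├─ AI call (record AI_CALL span, including app_id, prompt length, token count)
└─ Streamed tokens (record SSE_EVENT span, including cumulative token count)
[Branch] Exception handling (record ERROR span, including exception type, stack trace, layer where it occurred)
Response returned (record HTTP_OUT span, including status code, duration summary, response body size)
TraceMiddleware writes to the JSON Lines file
admin-web reads the log files through the API → display
WebSocket connection → independent trace
├─ WS_CONNECT span (connection established)
├─ WS_MESSAGE span (each message)
└─ WS_DISCONNECT span (connection closed)
Background job → independent trace (job_id as root span)
├─ JOB_START span (job start)
├─ SERVICE / DB_QUERY span (internal calls)
└─ JOB_END / JOB_ERROR span (job finished/failed)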
```
### Core components
#### 1. Backend: trace collection system
##### 1.1 TraceContext (contextvars)
```python
# apps/backend/app/trace/context.py
import contextvars
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

trace_context: contextvars.ContextVar['TraceContext'] = contextvars.ContextVar('trace_context')


@dataclass
class TraceSpan:
    """A single trace node."""
    span_type: str  # HTTP_IN, AUTH, ROUTE, SERVICE, DB_QUERY, DB_CONN, DB_CONN_RELEASE,
                    # HTTP_OUT, ERROR, DB_ERROR, MIDDLEWARE, MIDDLEWARE_ERROR,
                    # SSE_START, SSE_EVENT, SSE_END, AI_CALL, AI_STREAM, AI_ERROR,
                    # WS_CONNECT, WS_MESSAGE, WS_DISCONNECT,
                    # JOB_START, JOB_END, JOB_ERROR
    module: str  # module path (e.g. "xcx_tasks")
    function: str  # function name (e.g. "get_task_list")
    description_zh: str  # Chinese description
    description_en: str  # English description
    params: dict[str, Any]  # parameters
    result_summary: str  # result summary
    duration_ms: float  # duration in milliseconds
    timestamp: str  # ISO timestamp
    extra: dict[str, Any] = field(default_factory=dict)  # extra info such as the SQL statement


@dataclass
class TraceContext:
    """Request-level trace context."""
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex[:12])
    trace_type: str = "http"  # http, sse, ws, job
    start_time: datetime = field(default_factory=datetime.now)
    method: str = ""
    path: str = ""
    user_id: int | None = None
    site_id: int | None = None
    spans: list[TraceSpan] = field(default_factory=list)

    def add_span(self, span: TraceSpan) -> None:
        self.spans.append(span)
```
##### 1.2 TraceMiddleware (ASGI middleware)
```python
# apps/backend/app/trace/middleware.py
# - Creates a TraceContext for each request and stores it in contextvars
# - Records the HTTP_IN span (method, path, query_params, body_preview)
# - Records the HTTP_OUT span when the request ends (status_code, duration, body_size)
# - Writes the complete trace to the log file
# - Writes X-Request-ID, X-Process-Time, X-DB-Queries, X-DB-Time to the response headers
```
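As a rough illustration of the header-writing part only, the following pure-ASGI sketch times a request and adds `X-Request-ID` and `X-Process-Time`. It deliberately omits span recording, DB counters, and log writing; `TraceMiddlewareSketch`, `hello_app`, and `call_once` are hypothetical names used for the demonstration.

```python
import asyncio
import time
import uuid


class TraceMiddlewareSketch:
    """Pure-ASGI middleware: times the request and adds tracing response headers."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Non-HTTP scopes (lifespan, websocket) pass through untouched in this sketch.
            await self.app(scope, receive, send)
            return
        request_id = uuid.uuid4().hex[:12]
        started = time.perf_counter()

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                elapsed_ms = (time.perf_counter() - started) * 1000
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", request_id.encode()))
                headers.append((b"x-process-time", f"{elapsed_ms:.2f}".encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def hello_app(scope, receive, send):
    """Tiny ASGI app used only to exercise the middleware."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call_once():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/api/xcx/tasks"}
    await TraceMiddlewareSketch(hello_app)(scope, receive, send)
    return sent
```

Working at the raw ASGI level keeps the wrapper independent of any framework helper class; whether the real middleware does the same is not stated in the design.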
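##### 1.3 trace_span decorators
```python
# apps/backend/app/trace/decorators.py
def trace_service(description_zh: str, description_en: str):
    """Decorator for service-layer functions; automatically records a span for each call."""
    # Records: module name, function name, parameter names and values, return value summary, duration


def trace_db(description_zh: str, description_en: str):
    """Decorator for database queries; automatically records a SQL span."""
    # Records: SQL statement, parameters, returned row count, duration
```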
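One possible shape for the service decorator is sketched below, assuming synchronous functions and an in-memory list (`recorded_spans`, hypothetical) in place of `TraceContext.add_span()`. The result summary format and the example function `get_task_list` are likewise assumptions made for illustration.

```python
import functools
import inspect
import time
from typing import Any, Callable

# Hypothetical in-memory sink standing in for TraceContext.add_span().
recorded_spans: list[dict[str, Any]] = []


def trace_service(description_zh: str, description_en: str) -> Callable:
    """Record one SERVICE span per call of the decorated synchronous function."""

    def decorator(func: Callable) -> Callable:
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Bind arguments to parameter names so the span stores name/value pairs.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            started = time.perf_counter()
            summary = "raised"  # kept if the function raises
            try:
                result = func(*args, **kwargs)
                summary = type(result).__name__
                return result
            finally:
                recorded_spans.append({
                    "span_type": "SERVICE",
                    "module": func.__module__,
                    "function": func.__name__,
                    "description_zh": description_zh,
                    "description_en": description_en,
                    "params": dict(bound.arguments),
                    "result_summary": summary,
                    "duration_ms": (time.perf_counter() - started) * 1000,
                })

        return wrapper

    return decorator


@trace_service("查询任务列表", "Query task list")
def get_task_list(site_id: int, status: str = "pending") -> list[int]:
    return [1, 2, 3]
```

Recording the span in a `finally` block means a span is emitted even when the service function raises, which fits the requirement that traces stay complete on errors.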
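##### 1.4 Database connection wrapper
```python
# apps/backend/app/trace/db_wrapper.py
# Wraps get_connection() and intercepts cursor.execute()
# Automatically records for each SQL statement:
# - The full SQL statement (parameterized)
# - Bound parameter values
# - Returned row count
# - Execution duration
# - Call origin (which service function)
# Also records the connection lifecycle:
# - DB_CONN span (connection acquisition duration)
# - DB_CONN_RELEASE span (connection release)
```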
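The interception of `cursor.execute()` could look roughly like the sketch below. It uses the standard-library `sqlite3` driver as a stand-in so the example runs anywhere; the design itself targets psycopg2, and `TracedCursor` is a hypothetical name. Note that `rowcount` semantics differ by driver: `sqlite3` reports `-1` for `SELECT` statements.

```python
import sqlite3
import time
from typing import Any


class TracedCursor:
    """Delegates to a DB-API cursor and records one DB_QUERY span per execute()."""

    def __init__(self, cursor: Any, spans: list[dict[str, Any]]):
        self._cursor = cursor
        self._spans = spans

    def execute(self, sql: str, params: tuple = ()) -> "TracedCursor":
        started = time.perf_counter()
        self._cursor.execute(sql, params)
        self._spans.append({
            "span_type": "DB_QUERY",
            "duration_ms": (time.perf_counter() - started) * 1000,
            "extra": {
                "sql": sql,
                "params": list(params),
                # Driver-reported count; sqlite3 gives -1 for SELECT.
                "row_count": self._cursor.rowcount,
            },
        })
        return self

    def __getattr__(self, name: str) -> Any:
        # Everything else (fetchall, close, ...) is forwarded to the real cursor.
        return getattr(self._cursor, name)
```

A usage example: `TracedCursor(conn.cursor(), spans).execute("SELECT 1").fetchall()` behaves like the plain cursor while appending a span to `spans`.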
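##### 1.5 Auth layer tracing
```python
# Adds spans in dependencies such as require_approved() / get_current_user()
# Records: token prefix, user_id, site_id, roles, whether auth passed
# On auth failure, records a detailed reason category:
# - AUTH_EXPIRED (token expired)
# - AUTH_INVALID (invalid token, signature error)
# - AUTH_MALFORMED (malformed token, missing fields)
# - AUTH_LIMITED (limited token used on a full endpoint)
# - AUTH_FORBIDDEN (insufficient role permissions)
```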
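##### 1.6 SSE streaming response tracing
```python
# apps/backend/app/trace/sse_wrapper.py
# Wraps the event_generator of StreamingResponse to trace the full SSE flow:
# - SSE_START span (stream start; records endpoint, user, chat_id)
# - SSE_EVENT span (each event: message/done/error; records cumulative token count)
# - SSE_END span (stream end: total tokens, total duration, whether it completed normally)
# Special handling for the AI call chain:
# - AI_CALL span (DashScope API call: app_id, prompt length, session_id)
# - AI_STREAM span (streamed token reception; recorded once every N tokens to avoid span explosion)
# - AI_ERROR span (AI call failure: error type, retry count)
```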
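The "record once every N tokens" idea can be sketched as an async generator wrapper. This is a simplified illustration that treats each yielded chunk as one token and stores spans in a plain list; `traced_stream`, `fake_tokens`, and `collect` are hypothetical names.

```python
import asyncio
from typing import Any, AsyncIterator


async def traced_stream(
    source: AsyncIterator[str],
    spans: list[dict[str, Any]],
    every_n: int = 3,
) -> AsyncIterator[str]:
    """Pass tokens through unchanged while recording SSE_START / SSE_EVENT / SSE_END."""
    spans.append({"span_type": "SSE_START"})
    total = 0
    completed = False
    try:
        async for token in source:
            total += 1
            if total % every_n == 0:
                # Sampled event: one span per every_n tokens instead of one per token.
                spans.append({"span_type": "SSE_EVENT", "cumulative_tokens": total})
            yield token
        completed = True
    finally:
        # Runs on normal completion, on errors, and when the client disconnects early.
        spans.append({"span_type": "SSE_END", "total_tokens": total, "completed": completed})


async def fake_tokens(count: int) -> AsyncIterator[str]:
    for index in range(count):
        yield f"t{index}"


async def collect(count: int) -> tuple[list[str], list[dict[str, Any]]]:
    spans: list[dict[str, Any]] = []
    tokens = [token async for token in traced_stream(fake_tokens(count), spans)]
    return tokens, spans
```

Sampling keeps the span count proportional to `total / every_n`, while `SSE_END` still carries the exact total.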
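##### 1.7 Exception/error tracing
```python
# apps/backend/app/trace/error_handler.py
# Integrated into the global exception handlers (http_exception_handler / unhandled_exception_handler)
# - ERROR span: records exception type, exception message, stack summary (first 5 lines), layer where it occurred
# - Distinguishes HTTPException (business error) from uncaught exceptions (system error)
# - Database exceptions (psycopg2.Error) are classified separately (DB_ERROR span)
# - Ensures the trace is still written correctly on exceptions (the exception handler calls TraceWriter)
```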
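##### 1.8 WebSocket tracing
```python
# apps/backend/app/trace/ws_wrapper.py
# Wraps WebSocket endpoints to trace the full connection lifecycle:
# - WS_CONNECT span (connection established: execution_id, client info)
# - WS_MESSAGE span (message push: message count, cumulative bytes; recorded once every N messages)
# - WS_DISCONNECT span (connection closed: reason, total messages, total duration)
# WebSocket traces use a separate request_id (ws_ prefix) to distinguish them from HTTP traces
```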
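##### 1.9 Background job tracing
```python
# apps/backend/app/trace/job_wrapper.py
# Wraps the job handlers registered in lifespan to trace background task execution:
# - JOB_START span (task start: job_name, trigger time)
# - Internal SERVICE / DB_QUERY spans are automatically associated with the job trace
# - JOB_END span (task finished normally: duration, processed record count)
# - JOB_ERROR span (task failed: exception type, stack summary)
# Job traces use a separate request_id (job_ prefix) and are written to the same log files
# In admin-web they can be filtered by trace type (http/sse/ws/job)
```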
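##### 1.10 Middleware layer tracing
```python
# TraceMiddleware records the execution time of the middleware chain:
# - MIDDLEWARE span: ResponseWrapperMiddleware execution time
# - If response wrapping fails (JSON parse error), records a MIDDLEWARE_ERROR span
# - Records the response body size (to detect abnormally large responses)
```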
#### 2. Log file scheme
##### 2.1 File organization
```
export/dev-trace-logs/
├── 2026-03-22/
│ ├── trace_2026-03-22_00.jsonl # split by hour
│ ├── trace_2026-03-22_01.jsonl
│ └── ...
├── 2026-03-23/
│ └── ...
└── _index.json # index file (date → file list → record count)
```
##### 2.2 Single log record format (JSON Lines)
```json
{
"request_id": "a1b2c3d4e5f6",
"timestamp": "2026-03-22T14:30:15.123",
"method": "POST",
"path": "/api/xcx/tasks",
"status_code": 200,
"total_duration_ms": 45,
"user_id": 7,
"site_id": 1,
"db_query_count": 2,
"db_total_ms": 20,
"error": null,
"spans": [
{
"span_type": "HTTP_IN",
"module": "trace.middleware",
"function": "TraceMiddleware.__call__",
"description_zh": "接收请求 POST /api/xcx/tasks",
"description_en": "Received request POST /api/xcx/tasks",
"params": {"query": {"status": "pending", "page": "1"}, "body_preview": ""},
"result_summary": "",
"duration_ms": 0,
"timestamp": "2026-03-22T14:30:15.123"
},
{
"span_type": "AUTH",
"module": "auth.dependencies",
"function": "require_approved",
"description_zh": "JWT 鉴权通过:用户ID=7, 门店ID=1, 角色=[coach]",
"description_en": "JWT auth passed: user_id=7, site_id=1, roles=[coach]",
"params": {"token_prefix": "eyJ..."},
"result_summary": "approved",
"duration_ms": 2,
"timestamp": "2026-03-22T14:30:15.125"
},
{
"span_type": "SERVICE",
"module": "services.task_manager",
"function": "get_task_list_v2",
"description_zh": "调用任务管理服务:查询待处理任务",
"description_en": "Called task manager service: query pending tasks",
"params": {"user_id": 7, "site_id": 1, "status": "pending"},
"result_summary": "返回 15 条任务",
"duration_ms": 38,
"timestamp": "2026-03-22T14:30:15.126"
},
{
"span_type": "DB_QUERY",
"module": "services.task_manager",
"function": "get_task_list_v2",
"description_zh": "查询任务表:按门店和状态筛选",
"description_en": "Query tasks table: filter by site and status",
"params": {"site_id": 1, "status": "pending"},
"result_summary": "15 行",
"duration_ms": 12,
"timestamp": "2026-03-22T14:30:15.128",
"extra": {
"sql": "SELECT id, customer_name, task_type, ... FROM biz.tasks WHERE site_id = $1 AND status = $2",
"params": [1, "pending"],
"row_count": 15
}
},
{
"span_type": "HTTP_OUT",
"module": "trace.middleware",
"function": "TraceMiddleware.__call__",
"description_zh": "响应返回 200 OK,耗时 45ms",
"description_en": "Response sent 200 OK, took 45ms",
"params": {},
"result_summary": "200 OK, 3.2KB body",
"duration_ms": 45,
"timestamp": "2026-03-22T14:30:15.168"
}
]
}
```
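##### 2.3 File splitting strategy
- One directory per date: `YYYY-MM-DD/`
- One file per hour: `trace_YYYY-MM-DD_HH.jsonl`
- A single file exceeding 10MB is rotated automatically: `trace_YYYY-MM-DD_HH_001.jsonl`
- The index file `_index.json` records the record count and size of each file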
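The naming rules above can be expressed as a small path helper. This is a minimal sketch; `trace_log_path` and its `rotation` parameter are hypothetical names, and the sketch assumes the rotation suffix is a zero-padded three-digit counter as in the `_001` example.

```python
from datetime import datetime
from pathlib import Path


def trace_log_path(base_dir: str, ts: datetime, rotation: int = 0) -> Path:
    """Build <base_dir>/YYYY-MM-DD/trace_YYYY-MM-DD_HH[_NNN].jsonl for a timestamp."""
    day = ts.strftime("%Y-%m-%d")
    name = f"trace_{day}_{ts:%H}"
    if rotation > 0:
        # rotation=1 yields the first rotated file, e.g. ..._14_001.jsonl
        name += f"_{rotation:03d}"
    return Path(base_dir) / day / f"{name}.jsonl"
```

For example, a timestamp of 2026-03-22 14:30 with `rotation=0` maps to `2026-03-22/trace_2026-03-22_14.jsonl` under the base directory.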
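##### 2.4 Cleanup strategy
- Automatic cleanup: runs a check daily in the early morning and deletes logs older than N days (default 7 days)
- Manual cleanup: admin-web provides cleanup by date range
- Configuration: `DEV_TRACE_LOG_RETENTION_DAYS=7` (.env)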
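A retention-based cleanup might be sketched as follows. `cleanup_old_logs` is a hypothetical helper; it assumes "older than N days" means a directory date strictly earlier than `today - N days`, and it leaves updating `_index.json` to the caller.

```python
import shutil
from datetime import date, datetime, timedelta
from pathlib import Path


def cleanup_old_logs(base_dir: Path, retention_days: int, today: date) -> list[str]:
    """Delete date directories older than the retention window; return their names."""
    cutoff = today - timedelta(days=retention_days)
    deleted: list[str] = []
    for child in sorted(base_dir.iterdir()):
        if not child.is_dir():
            continue  # skips _index.json and other files
        try:
            day = datetime.strptime(child.name, "%Y-%m-%d").date()
        except ValueError:
            continue  # not a date directory, leave it alone
        if day < cutoff:
            shutil.rmtree(child)
            deleted.append(child.name)
    return deleted
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the retention boundary easy to test.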
#### 3. Backend API (log reading + coverage)
```
GET /api/admin/dev-trace/coverage # Get the most recent coverage scan result
POST /api/admin/dev-trace/coverage/scan # Manually trigger a coverage scan
GET /api/admin/dev-trace/dates # List the dates that have logs
GET /api/admin/dev-trace/requests # Query the request list by conditions
?date=2026-03-22
&start_time=14:00
&end_time=15:00
&trace_type=http|sse|ws|job # New: filter by trace type
&method=POST
&path_contains=tasks
&status_code=200
&min_duration=100
&has_error=true # New: only requests with errors
&span_type=DB_QUERY,ERROR # New: requests containing specific span types
&page=1&page_size=50
GET /api/admin/dev-trace/request/{id} # Get the full span chain of a single request
POST /api/admin/dev-trace/cleanup # Manually clean up logs in a date range
GET /api/admin/dev-trace/settings # Get log settings (retention days, switch state)
PUT /api/admin/dev-trace/settings # Update log settings
```
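The filter semantics of the request list (every supplied filter must match, followed by pagination) could be sketched like this. `matches` and `query_requests` are hypothetical helpers that operate on already-parsed trace records and cover only a subset of the query parameters.

```python
from typing import Any, Optional


def matches(
    record: dict[str, Any],
    *,
    method: Optional[str] = None,
    path_contains: Optional[str] = None,
    status_code: Optional[int] = None,
    min_duration: Optional[float] = None,
    has_error: Optional[bool] = None,
) -> bool:
    """Return True only if the record satisfies every filter that was supplied."""
    if method is not None and record["method"] != method:
        return False
    if path_contains is not None and path_contains not in record["path"]:
        return False
    if status_code is not None and record["status_code"] != status_code:
        return False
    if min_duration is not None and record["total_duration_ms"] < min_duration:
        return False
    if has_error is not None and (record["error"] is not None) != has_error:
        return False
    return True


def query_requests(
    records: list[dict[str, Any]], page: int = 1, page_size: int = 50, **filters: Any
) -> dict[str, Any]:
    """Filter first, then slice out the requested page (pages are 1-based)."""
    hits = [record for record in records if matches(record, **filters)]
    start = (page - 1) * page_size
    return {"total": len(hits), "items": hits[start:start + page_size]}
```

Returning `total` alongside the page lets the frontend render pagination without a second query.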
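#### 4. admin-web: Dev/Test Logs section
##### 4.1 Page structure
Left-right split layout:
- Left: request list (time, method/type, path, status code, duration, DB query count)
- Right: full span chain tree of the selected request (shown with hierarchical indentation)
- Top filter bar (date, time range, trace type [HTTP/SSE/WS/Job], method, path keyword, status code, minimum duration, span_type filter)
- Span type color coding: HTTP=blue, AUTH=orange, SERVICE=green, DB=purple, ERROR=red, SSE=cyan, WS=yellow, JOB=gray
##### 4.2 Coverage dashboard (top of page)
A trace coverage status bar is shown at the top of the DevTrace page: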
```
┌─────────────────────────────────────────────────────────────────┐
│ 📊 Trace coverage: Routes 10/11 (91%) | Service 7/23 (30%) | Job 4/4 (100%) │
│ Uncovered: xcx_test, fdw_queries, matching, application, ... [🔄 Scan] │
└─────────────────────────────────────────────────────────────────┘
```
Backend scan logic (`apps/backend/app/trace/coverage.py`):
```python
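# Scan dimensions:
# 1. Route coverage: scan the route functions in app/routers/xcx_*.py and
# compare them against TraceMiddleware's route prefix matching rules to decide whether they are in trace scope
# 2. Service coverage: scan all public functions (not starting with _) under app/services/ and
# check whether they carry the @trace_service decorator
# 3. Job coverage: scan the job handlers registered in lifespan and
# check whether they are wrapped by job_wrapper
# 4. SSE/WS coverage: scan SSE/WS endpoints and check whether the corresponding wrapper is integrated
# Output structure: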
{
"scan_time": "2026-03-22T14:30:00",
"routes": {
"total": 11, "covered": 10,
"uncovered": ["xcx_test"],
"details": [{"name": "xcx_tasks", "covered": true, "functions": 4}, ...]
},
"services": {
"total": 23, "covered": 7,
"uncovered": ["fdw_queries.get_member_data", "matching.find_best_match", ...],
"details": [{"module": "task_manager", "total": 5, "covered": 5}, ...]
},
"jobs": {"total": 4, "covered": 4, "uncovered": []},
"sse_endpoints": {"total": 1, "covered": 1, "uncovered": []},
"ws_endpoints": {"total": 1, "covered": 1, "uncovered": []}
}
```
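Scan triggers:
- Manual scan: clicking the "Scan" button on the admin-web page calls the API and runs the scan immediately
- Scheduled scan: the backend scans once at startup, then periodically at the configured interval (default 1 hour)
- Scan results are cached in memory; the API returns the most recent result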
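The service-coverage dimension could be implemented with the standard `ast` module, roughly as sketched below. `scan_service_coverage` is a hypothetical helper; this version only inspects top-level functions of a single source string and matches the decorator by name.

```python
import ast
from typing import Any


def scan_service_coverage(source: str, decorator_name: str = "trace_service") -> dict[str, Any]:
    """Report which public top-level functions carry the given decorator."""
    tree = ast.parse(source)
    covered: list[str] = []
    uncovered: list[str] = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):
            continue  # private helpers are out of scope
        names = set()
        for decorator in node.decorator_list:
            # @trace_service(...) is an ast.Call; a bare @trace_service is an ast.Name.
            target = decorator.func if isinstance(decorator, ast.Call) else decorator
            if isinstance(target, ast.Name):
                names.add(target.id)
            elif isinstance(target, ast.Attribute):
                names.add(target.attr)
        (covered if decorator_name in names else uncovered).append(node.name)
    return {
        "total": len(covered) + len(uncovered),
        "covered": len(covered),
        "uncovered": uncovered,
    }


SAMPLE_MODULE = '''
@trace_service("a", "b")
def get_task_list(): ...

def find_best_match(): ...

def _helper(): ...
'''
```

Parsing source text instead of importing modules avoids executing application code during a scan, at the cost of missing decorators applied dynamically.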
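##### 4.3 Settings panel
- Log switch (enable/disable)
- Retention days
- Auto-cleanup switch
- Manual cleanup (by date range)
- Disk usage statistics
- Coverage scan interval (minutes)
#### 5. Switch mechanism
##### 5.1 Environment variables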
```env
DEV_TRACE_ENABLED=true # master switch
DEV_TRACE_LOG_DIR=export/dev-trace-logs # log directory
DEV_TRACE_LOG_RETENTION_DAYS=7 # retention days for automatic cleanup
DEV_TRACE_LOG_SQL=true # whether to record full SQL
DEV_TRACE_LOG_PARAMS=true # whether to record function parameter values
```
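##### 5.2 Runtime switch
- The admin-web settings panel can toggle tracing dynamically (the API modifies in-memory state)
- No backend restart is required
- After a restart, settings fall back to the .env configuration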
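The interplay between `.env` values and the in-memory runtime switch can be illustrated as follows. `TraceSettings`, `load_settings`, and the set of accepted truthy strings are assumptions for this sketch, as is defaulting `DEV_TRACE_ENABLED` to off when it is unset.

```python
from dataclasses import dataclass, replace
from typing import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings of "enabled"


@dataclass(frozen=True)
class TraceSettings:
    enabled: bool
    log_dir: str
    retention_days: int
    log_sql: bool
    log_params: bool


def load_settings(env: Mapping[str, str]) -> TraceSettings:
    """Build settings from an environment-like mapping (e.g. os.environ)."""

    def flag(name: str, default: bool) -> bool:
        raw = env.get(name)
        return default if raw is None else raw.strip().lower() in TRUTHY

    return TraceSettings(
        enabled=flag("DEV_TRACE_ENABLED", False),  # assumption: off unless set
        log_dir=env.get("DEV_TRACE_LOG_DIR", "export/dev-trace-logs"),
        retention_days=int(env.get("DEV_TRACE_LOG_RETENTION_DAYS", "7")),
        log_sql=flag("DEV_TRACE_LOG_SQL", True),
        log_params=flag("DEV_TRACE_LOG_PARAMS", True),
    )


env = {"DEV_TRACE_ENABLED": "true", "DEV_TRACE_LOG_RETENTION_DAYS": "3"}
startup = load_settings(env)               # values read at process start
runtime = replace(startup, enabled=False)  # runtime switch: in-memory only
after_restart = load_settings(env)         # a restart re-reads the .env values
```

Keeping the settings object immutable and swapping in a replaced copy makes the runtime change atomic from the point of view of request handlers that read it.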
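## Considerations
### Performance impact
- JSON Lines append-only writes keep IO overhead very low
- contextvars is lock-free and thread-safe
- Decorator overhead: about 0.01ms per span (negligible)
- File writes are asynchronous (a write failure does not affect request handling)
### Security
- Only the admin role can access the log API
- SQL parameter values are recorded in the logs (acceptable in development; turned off in production)
- Only the token prefix is recorded, never the full value
### Relationship to existing systems
- Does not affect the existing `ResponseWrapperMiddleware` (the trace middleware sits outside it)
- Does not affect the existing logging configuration
- Log file paths follow the `export-paths` convention
### Implementation scope
- Full coverage of xcx_* routes (login, tasks, notes, performance, AI conversation, customers, coaches, dashboards, configuration)
- Full tracing of the SSE streaming endpoint (the AI conversation stream in xcx_chat)
- Connection lifecycle tracing for the WebSocket endpoint (/ws/logs)
- Execution tracing for background jobs (task_generator, task_expiry, recall_detector, note_reclassifier)
- Full-chain exception/error tracing (business exceptions + system exceptions + database exceptions)
- Database connection lifecycle tracing (acquire/release)
- Middleware-layer duration tracing
- Can later be extended to admin_* routes
- Service-layer decorators are added as needed (services involved in joint debugging first)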
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Request ID uniqueness
*For any* sequence of N HTTP requests processed by the TraceMiddleware, all N generated request_id values shall be distinct.
**Validates: Requirement 1.1**
### Property 2: Span order preservation
*For any* sequence of TraceSpan objects added to a TraceContext, the spans list shall preserve the insertion order (i.e., `spans[i].timestamp <= spans[i+1].timestamp` for all valid i).
**Validates: Requirement 1.5**
### Property 3: TraceSpan structural completeness
*For any* TraceSpan instance regardless of span_type, the serialized JSON output shall contain all required fields: span_type, module, function, description_zh, description_en, params, result_summary, duration_ms, timestamp, and extra. Additionally, the top-level trace record shall contain request_id, timestamp, method, path, status_code, total_duration_ms, user_id, site_id, db_query_count, db_total_ms, error, and spans.
**Validates: Requirements 2.4, 3.1, 3.2**
### Property 4: Token prefix truncation
*For any* JWT token string of any length, the AUTH span shall record only a prefix (not exceeding a fixed maximum length) and the recorded value shall not equal the complete token when the token exceeds that length.
**Validates: Requirement 2.5**
### Property 5: JSON serialization round-trip consistency
*For any* valid TraceContext object, serializing it to a JSON line then parsing that JSON line back shall produce an equivalent data structure (all field values preserved).
**Validates: Requirement 3.5**
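A round-trip check of this kind might be written as below, using reduced stand-ins for the real dataclasses. `Span`, `Trace`, `to_json_line`, and `from_json_line` are hypothetical names; the real TraceContext has more fields, including a `datetime` that would need explicit conversion to a string before serialization.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Span:
    span_type: str
    duration_ms: float
    extra: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    request_id: str
    spans: list[Span] = field(default_factory=list)


def to_json_line(trace: Trace) -> str:
    """Serialize to one line; ensure_ascii=False keeps Chinese text readable in the file."""
    return json.dumps(asdict(trace), ensure_ascii=False)


def from_json_line(line: str) -> Trace:
    data = json.loads(line)
    return Trace(
        request_id=data["request_id"],
        spans=[Span(**span) for span in data["spans"]],
    )


original = Trace(
    request_id="a1b2c3d4e5f6",
    spans=[Span("DB_QUERY", 12.5, {"sql": "SELECT 1", "note": "查询任务表"})],
)
restored = from_json_line(to_json_line(original))
```

Dataclass equality compares field by field, so `restored == original` is a direct statement of the round-trip property for this reduced model.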
### Property 6: Log file path generation
*For any* timestamp, the generated log directory name shall match the format `YYYY-MM-DD/` and the generated log file name shall match the format `trace_YYYY-MM-DD_HH.jsonl`, where the date and hour components correspond to the input timestamp.
**Validates: Requirements 4.1, 4.2**
### Property 7: Index file consistency
*For any* sequence of log write operations, the `_index.json` file shall accurately reflect the current state: every referenced file exists, every existing log file is referenced, and the record count and file size for each entry match the actual file.
**Validates: Requirements 4.4, 4.5**
### Property 8: Cleanup retention correctness
*For any* set of date directories and a configured retention period of N days, after cleanup executes, all directories with dates older than N days from today shall be deleted, all directories within the retention window shall be preserved, and the `_index.json` shall not reference any deleted directories.
**Validates: Requirements 5.2, 5.4**
### Property 9: API filter correctness
*For any* set of stored trace records and any combination of filter parameters (date, time range, method, path keyword, status code, minimum duration), all returned results shall satisfy every specified filter criterion, and no matching record shall be omitted from the results (within pagination bounds).
**Validates: Requirement 6.2**
### Property 10: Trace write-read round-trip consistency
*For any* TraceContext written to a log file, querying the Trace_API with the corresponding request_id shall return a record equivalent to the original TraceContext (all fields and spans preserved).
**Validates: Requirement 6.3**
### Property 11: Admin permission enforcement
*For any* Trace_API endpoint and any user without admin role, the API shall return a 403 Forbidden response.
**Validates: Requirements 6.5, 6.6**
### Property 12: Settings update round-trip consistency
*For any* valid settings update (enabled status, retention days, SQL logging flag, parameter logging flag), after a PUT to the settings API, a subsequent GET shall return the updated values.
**Validates: Requirement 7.2**
### Property 13: No trace output when the switch is off
*For any* HTTP request processed while the trace system is disabled (either via `DEV_TRACE_ENABLED=false` or runtime switch off), no TraceContext shall be created, no spans shall be recorded, and no log file entries shall be written.
**Validates: Requirements 8.2, 8.3**
### Property 14: Feature switches control span content
*For any* DB_QUERY span when `DEV_TRACE_LOG_SQL` is false, the span shall not contain the full SQL statement. *For any* SERVICE span when `DEV_TRACE_LOG_PARAMS` is false, the span shall not contain function parameter values.
**Validates: Requirements 8.5, 8.6**
### Property 15: Route prefix filtering
*For any* HTTP request, the TraceMiddleware shall produce trace data if and only if the request path matches the `xcx_*` route prefix. Non-matching requests shall produce no trace output.
**Validates: Requirements 11.1, 11.2**
### Property 16: Trace completeness on exceptions
*For any* HTTP request that results in an exception (HTTPException or unhandled), the trace record shall contain an ERROR or DB_ERROR span with the exception type and message, and the HTTP_OUT span shall still be recorded with the correct error status code.
**Validates: Requirements 12.1, 12.2, 12.3**
### Property 17: SSE streaming trace completeness
*For any* SSE streaming response, the trace record shall contain SSE_START and SSE_END spans. The SSE_END span's total_tokens shall equal the sum of tokens reported in SSE_EVENT spans. If an error occurs during streaming, an AI_ERROR span shall be present.
**Validates: Requirements 13.1, 13.2, 13.3**
### Property 18: WebSocket trace lifecycle
*For any* WebSocket connection, the trace record shall contain a WS_CONNECT span and a WS_DISCONNECT span. The WS_DISCONNECT span's total_messages shall be consistent with the number of WS_MESSAGE spans recorded.
**Validates: Requirements 14.1, 14.2**
### Property 19: Background job trace completeness
*For any* background job execution, the trace record shall contain a JOB_START span and either a JOB_END or JOB_ERROR span. Internal SERVICE and DB_QUERY spans shall be associated with the same job trace request_id.
**Validates: Requirements 15.1, 15.2**
### Property 20: Auth failure reason classification
*For any* authentication failure, the AUTH span shall contain a failure_reason field with one of the defined categories (AUTH_EXPIRED, AUTH_INVALID, AUTH_MALFORMED, AUTH_LIMITED, AUTH_FORBIDDEN), and the reason shall accurately reflect the actual failure cause.
**Validates: Requirement 12.4**
### Property 21: Database connection lifecycle pairing
*For any* database connection acquired during a request, there shall be exactly one DB_CONN span (connection open) and one DB_CONN_RELEASE span (connection close). The DB_CONN_RELEASE timestamp shall be >= the DB_CONN timestamp.
**Validates: Requirement 16.1**
### Property 22: Coverage scan consistency
*For any* set of route files, service modules, and job handlers in the backend codebase, the coverage scanner shall correctly identify all public functions and accurately report which ones have trace decorators/wrappers applied. The total count shall equal covered + uncovered for each category.
**Validates: Requirements 18.1, 18.2**


@@ -0,0 +1,231 @@
# Requirements Document
## Introduction
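The full-chain dev/debug trace logging system (dev-trace-log) provides end-to-end request tracing for joint debugging of the mini-program frontend and backend. The backend collects fine-grained logs (spans) for every layer, from the moment an HTTP request enters down to the database query, and writes them to JSON Lines log files. admin-web provides a "Dev/Test Logs" section that supports multi-dimensional filtering and viewing the complete request chain. The system is enabled only in development/test environments and is disabled in production via a switch.
## Glossary
- **Trace_System**: the full-chain log collection system, including the middleware, decorators, database wrapper, and related components
- **TraceMiddleware**: ASGI middleware that creates a trace context for each request, records the HTTP_IN/HTTP_OUT spans, and writes the log file
- **TraceContext**: request-level trace context based on contextvars; stores the request_id, the span list, and related information
- **TraceSpan**: a single trace node recording the function call information of one layer (type, module, function name, parameters, duration, etc.)
- **Span_Type**: span type enumeration, including HTTP_IN, AUTH, ROUTE, SERVICE, DB_QUERY, HTTP_OUT
- **JSON_Lines_File**: log file with the `.jsonl` extension; each line is one complete request trace record in JSON format
- **Log_Writer**: log file writer component responsible for splitting files by date/hour, rotation, and index maintenance
- **Trace_API**: the set of admin-web backend API endpoints providing log query, cleanup, settings, and related functions
- **Admin_Web_Trace_Page**: the "Dev/Test Logs" section in admin-web, showing the request list and the span chain tree
- **Runtime_Switch**: runtime dynamic switch that enables/disables log collection by modifying in-memory state through the API
## Requirements
### Requirement 1: Request-level trace context management
**User Story:** As a backend developer, I want every HTTP request to automatically get a unique trace context, so that all spans across the request chain are tied to the same request_id.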
#### Acceptance Criteria
1. WHEN an HTTP request enters the FastAPI application, THE TraceMiddleware SHALL create a new TraceContext with a unique request_id and store it in contextvars
2. WHEN the TraceContext is created, THE TraceMiddleware SHALL record an HTTP_IN span containing the request method, path, query parameters, and body preview
3. WHEN the HTTP response is sent, THE TraceMiddleware SHALL record an HTTP_OUT span containing the status code, total duration, and response body size
4. WHEN the HTTP response is sent, THE TraceMiddleware SHALL include X-Request-ID, X-Process-Time, X-DB-Queries, and X-DB-Time in the response headers
5. THE TraceContext SHALL maintain an ordered list of TraceSpan objects appended during the request lifecycle
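### Requirement 2: Multi-layer span collection
**User Story:** As a backend developer, I want function calls in the four layers (auth, routing, service, database) to record spans automatically, so that I can see the complete request handling chain while debugging.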
#### Acceptance Criteria
1. WHEN a request passes through the authentication layer, THE Trace_System SHALL record an AUTH span containing token parse result, user_id, site_id, roles, and approval status
2. WHEN a decorated Service function is called, THE Trace_System SHALL record a SERVICE span containing the module name, function name, parameter names and values, return value summary, and duration
3. WHEN a database query is executed, THE Trace_System SHALL record a DB_QUERY span containing the full parameterized SQL statement, bound parameter values, returned row count, execution duration, and calling source function
4. THE TraceSpan SHALL contain the following fields: span_type, module, function, description_zh, description_en, params, result_summary, duration_ms, timestamp, and extra
5. WHEN the AUTH span records a token, THE Trace_System SHALL record only the token prefix, not the complete token value
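### Requirement 3: JSON Lines log serialization and writing
**User Story:** As a backend developer, I want the complete trace data of each request to be written to a log file in JSON Lines format, so that it can later be read and displayed through the API.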
#### Acceptance Criteria
1. WHEN a request completes, THE Log_Writer SHALL serialize the TraceContext into a single JSON line containing request_id, timestamp, method, path, status_code, total_duration_ms, user_id, site_id, db_query_count, db_total_ms, error, and spans array
2. WHEN serializing a TraceSpan, THE Log_Writer SHALL include all TraceSpan fields in the JSON output
3. THE Log_Writer SHALL write log entries by appending to the current JSON Lines file asynchronously
4. IF the log file write operation fails, THEN THE Log_Writer SHALL not affect the HTTP request processing or response
5. THE Log_Writer SHALL produce valid JSON on each line such that parsing then re-serializing a log entry produces an equivalent JSON object (round-trip property)
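### Requirement 4: Log file splitting and rotation
**User Story:** As an operations engineer, I want log files to be split automatically by date and hour and rotated when a single file grows too large, so that disk space stays manageable.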
#### Acceptance Criteria
1. THE Log_Writer SHALL organize log files into date-based directories using the format `YYYY-MM-DD/`
2. THE Log_Writer SHALL split log files by hour using the naming format `trace_YYYY-MM-DD_HH.jsonl`
3. WHEN a log file exceeds 10MB, THE Log_Writer SHALL rotate to a new file with an incremented suffix using the format `trace_YYYY-MM-DD_HH_NNN.jsonl`
4. THE Log_Writer SHALL maintain an `_index.json` file recording the file list, record count, and file size for each date directory
5. WHEN a new log entry is written, THE Log_Writer SHALL update the `_index.json` file to reflect the current state
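### Requirement 5: Automatic log cleanup
**User Story:** As an operations engineer, I want the system to clean up expired logs automatically, so that disk usage does not grow without bound.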
#### Acceptance Criteria
1. THE Trace_System SHALL execute an automatic cleanup check daily at midnight
2. WHEN the automatic cleanup runs, THE Trace_System SHALL delete all log directories older than the configured retention days (default 7 days)
3. THE Trace_System SHALL read the retention period from the environment variable `DEV_TRACE_LOG_RETENTION_DAYS`
4. WHEN a date directory is deleted during cleanup, THE Trace_System SHALL update the `_index.json` file accordingly
### Requirement 6: Backend log query API
**User Story:** As the admin-web frontend, I want to query log data through an API, so that the page can show the request list and chain details.
#### Acceptance Criteria
1. WHEN a GET request is made to `/api/admin/dev-trace/dates`, THE Trace_API SHALL return a list of dates that have log data available
2. WHEN a GET request is made to `/api/admin/dev-trace/requests` with filter parameters (date, start_time, end_time, method, path_contains, status_code, min_duration, page, page_size), THE Trace_API SHALL return a paginated list of matching request summaries
3. WHEN a GET request is made to `/api/admin/dev-trace/request/{id}`, THE Trace_API SHALL return the complete trace record including all spans for the specified request_id
4. WHEN a POST request is made to `/api/admin/dev-trace/cleanup` with a date range, THE Trace_API SHALL delete log files within the specified date range and return the cleanup result
5. THE Trace_API SHALL require admin role authentication for all endpoints
6. IF a non-admin user attempts to access any Trace_API endpoint, THEN THE Trace_API SHALL return a 403 Forbidden response
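### Requirement 7: Log settings management API
**User Story:** As an administrator, I want to read and modify the log system settings through an API, so that I can adjust logging behavior dynamically at runtime.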
#### Acceptance Criteria
1. WHEN a GET request is made to `/api/admin/dev-trace/settings`, THE Trace_API SHALL return the current settings including enabled status, retention days, SQL logging flag, and parameter logging flag
2. WHEN a PUT request is made to `/api/admin/dev-trace/settings` with updated values, THE Trace_API SHALL update the in-memory runtime configuration without requiring a server restart
3. WHEN the server restarts, THE Trace_System SHALL reset all runtime settings to the values defined in the `.env` file
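### Requirement 8: Switch mechanism
**User Story:** As a developer, I want to enable/disable log collection through environment variables and a runtime switch, so that logging is off in production and flexibly controlled in development.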
#### Acceptance Criteria
1. THE Trace_System SHALL read the master switch from the environment variable `DEV_TRACE_ENABLED`
2. WHILE `DEV_TRACE_ENABLED` is set to false, THE TraceMiddleware SHALL skip all trace context creation, span recording, and log file writing
3. WHEN the Runtime_Switch is toggled via the settings API, THE Trace_System SHALL immediately start or stop trace collection without server restart
4. THE Trace_System SHALL read the log directory path from the environment variable `DEV_TRACE_LOG_DIR`
5. WHERE the `DEV_TRACE_LOG_SQL` option is set to false, THE Trace_System SHALL omit full SQL statements from DB_QUERY spans
6. WHERE the `DEV_TRACE_LOG_PARAMS` option is set to false, THE Trace_System SHALL omit function parameter values from SERVICE spans
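### Requirement 9: admin-web Dev/Test Logs page
**User Story:** As an administrator, I want to view the request list and the complete span chain tree in admin-web, so that I can quickly locate frontend/backend integration problems.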
#### Acceptance Criteria
1. WHEN the admin navigates to the dev trace log page, THE Admin_Web_Trace_Page SHALL display a left-right split layout with request list on the left and span detail on the right
2. WHEN the admin applies filters (date, time range, HTTP method, path keyword, status code, minimum duration), THE Admin_Web_Trace_Page SHALL query the Trace_API and display matching requests
3. WHEN the admin selects a request from the list, THE Admin_Web_Trace_Page SHALL display the complete span chain as a hierarchical tree with indentation showing the call depth
4. THE Admin_Web_Trace_Page SHALL display each request entry with timestamp, HTTP method, path, status code, total duration, and DB query count
5. THE Admin_Web_Trace_Page SHALL display each span with span_type, function name, description (zh), duration, and relevant extra information (SQL for DB_QUERY spans)
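### Requirement 10: admin-web settings panel
**User Story:** As an administrator, I want to manage the log system settings in admin-web, so that I can control log collection behavior and the cleanup strategy.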
#### Acceptance Criteria
1. WHEN the admin opens the settings panel, THE Admin_Web_Trace_Page SHALL display the current log enabled status, retention days, auto-cleanup toggle, and disk usage statistics
2. WHEN the admin toggles the log enabled switch, THE Admin_Web_Trace_Page SHALL call the settings API and reflect the updated state immediately
3. WHEN the admin initiates a manual cleanup with a date range, THE Admin_Web_Trace_Page SHALL call the cleanup API and display the cleanup result (deleted file count and freed space)
4. WHEN the admin modifies the retention days, THE Admin_Web_Trace_Page SHALL call the settings API and confirm the update
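### Requirement 11: Implementation scope and route coverage
**User Story:** As a developer, I want phase one to cover tracing for all xcx_* routes, so that I get complete debugging information during mini-program integration.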
#### Acceptance Criteria
1. THE TraceMiddleware SHALL intercept all requests matching the `xcx_*` route prefix (login, tasks, notes, performance, AI conversation)
2. WHEN a request matches a non-xcx route, THE TraceMiddleware SHALL skip trace collection for that request
3. THE Trace_System SHALL not interfere with the existing ResponseWrapperMiddleware or logging configuration
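### Requirement 12: Full-chain exception/error tracing
**User Story:** As a developer, I want any exception raised during request handling to be fully recorded in the trace, so that I can quickly find which layer failed and why.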
#### Acceptance Criteria
1. WHEN an HTTPException is raised during request processing, THE Trace_System SHALL record an ERROR span containing the exception type, status code, detail message, and the layer where it occurred
2. WHEN an unhandled exception occurs, THE Trace_System SHALL record an ERROR span containing the exception type, message, and stack trace summary (first 5 lines)
3. WHEN a database exception (psycopg2.Error) occurs, THE Trace_System SHALL record a DB_ERROR span containing the PostgreSQL error code, message, and the SQL statement that caused it
4. WHEN authentication fails, THE AUTH span SHALL include a failure_reason field categorized as one of: AUTH_EXPIRED (token expired), AUTH_INVALID (signature error), AUTH_MALFORMED (missing fields), AUTH_LIMITED (limited token on full endpoint), AUTH_FORBIDDEN (insufficient role)
5. WHEN an exception occurs, THE TraceMiddleware SHALL still record the HTTP_OUT span with the error status code and ensure the complete trace is written to the log file
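### Requirement 13: SSE streaming response tracing
**User Story:** As a developer, I want the entire SSE streaming response of AI conversations to be traced, so that I can see every step of the AI call chain (prompt construction, API call, token stream, completion/error).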
#### Acceptance Criteria
1. WHEN an SSE streaming endpoint is called, THE Trace_System SHALL record an SSE_START span containing the endpoint, user info, and chat_id
2. WHEN the AI API (DashScope) is called, THE Trace_System SHALL record an AI_CALL span containing the app_id, prompt length, and session_id
3. DURING SSE token streaming, THE Trace_System SHALL record SSE_EVENT spans at regular intervals (every N tokens) containing the cumulative token count, to avoid span explosion
4. WHEN the SSE stream completes, THE Trace_System SHALL record an SSE_END span containing total token count, total duration, and whether it completed normally
5. WHEN an error occurs during SSE streaming, THE Trace_System SHALL record an AI_ERROR span containing the error type, message, and retry count
6. THE SSE trace SHALL use trace_type="sse" to distinguish from regular HTTP traces
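### Requirement 14: WebSocket connection tracing
**User Story:** As a developer, I want the full lifecycle of WebSocket connections to be traced, so that I can see the complete process of connection setup, message push, and disconnect.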
#### Acceptance Criteria
1. WHEN a WebSocket connection is established, THE Trace_System SHALL record a WS_CONNECT span containing the execution_id and client information
2. DURING WebSocket message pushing, THE Trace_System SHALL record WS_MESSAGE spans at regular intervals (every N messages) containing the cumulative message count and byte count
3. WHEN a WebSocket connection is closed, THE Trace_System SHALL record a WS_DISCONNECT span containing the disconnect reason, total message count, and total duration
4. THE WebSocket trace SHALL use trace_type="ws" and a request_id with "ws_" prefix to distinguish from HTTP traces
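### Requirement 15: Background job execution tracing
**User Story:** As a developer, I want the execution of background scheduled jobs (task_generator, task_expiry, recall_detector, note_reclassifier) to be traced, so that I can troubleshoot background job problems.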
#### Acceptance Criteria
1. WHEN a background job starts execution, THE Trace_System SHALL create a new TraceContext with trace_type="job" and record a JOB_START span containing the job name and trigger time
2. DURING job execution, THE Trace_System SHALL record SERVICE and DB_QUERY spans associated with the job's TraceContext (via contextvars)
3. WHEN a job completes successfully, THE Trace_System SHALL record a JOB_END span containing the duration and processed record count
4. WHEN a job fails with an exception, THE Trace_System SHALL record a JOB_ERROR span containing the exception type, message, and stack trace summary
5. THE job trace SHALL use a request_id with "job_" prefix and be written to the same log files as HTTP traces
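One way to satisfy criteria 1 to 5 is to bind a fresh trace to a `contextvars.ContextVar` for the duration of the job, so that nested calls attach their spans without receiving the trace as an argument. The sketch below is an assumption-laden illustration: `JobTrace`, `record_span`, and this `job_wrapper` signature are hypothetical, and the wrapper swallows the exception only to keep the demo short.

```python
import contextvars
import time
import traceback
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class JobTrace:
    """Hypothetical stand-in for the project's TraceContext."""
    request_id: str
    trace_type: str = "job"
    spans: list[dict[str, Any]] = field(default_factory=list)


current_trace: contextvars.ContextVar[Optional[JobTrace]] = contextvars.ContextVar(
    "current_trace", default=None
)


def record_span(span_type: str, **extra: Any) -> None:
    """Attach a span to whichever trace is active in this context, if any."""
    trace = current_trace.get()
    if trace is not None:
        trace.spans.append({"span_type": span_type, **extra})


def job_wrapper(job_name: str, handler: Callable[[], int]) -> Callable[[], JobTrace]:
    """Wrap a job handler that returns its processed record count."""
    def run() -> JobTrace:
        trace = JobTrace(request_id=f"job_{uuid.uuid4().hex[:12]}")
        token = current_trace.set(trace)
        start = time.perf_counter()
        record_span("JOB_START", job_name=job_name, trigger_time=time.time())
        try:
            processed = handler()
            record_span("JOB_END", processed=processed,
                        duration_ms=(time.perf_counter() - start) * 1000)
        except Exception as exc:  # demo only: a real wrapper may re-raise
            record_span("JOB_ERROR", exception_type=type(exc).__name__,
                        message=str(exc),
                        stack_summary=traceback.format_exc().splitlines()[-5:])
        finally:
            current_trace.reset(token)
        return trace
    return run


def sample_job() -> int:
    record_span("DB_QUERY", sql="SELECT 1")  # lands in the job's trace
    return 3


def failing_job() -> int:
    raise ValueError("boom")
```

Calling `job_wrapper("task_generator", sample_job)()` returns a trace whose spans are JOB_START, DB_QUERY, JOB_END, all under one `job_`-prefixed request_id.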
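### Requirement 16: Database Connection Lifecycle Tracing
**User Story:** As a developer, I want to see when each database connection is acquired and released, so that I can detect connection leaks or bottlenecks in connection acquisition.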
#### Acceptance Criteria
1. WHEN a database connection is acquired via get_connection(), THE Trace_System SHALL record a DB_CONN span containing the connection acquisition duration
2. WHEN a database connection is released (closed), THE Trace_System SHALL record a DB_CONN_RELEASE span
3. EACH DB_CONN span SHALL be paired with exactly one DB_CONN_RELEASE span within the same trace
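The pairing rule in criterion 3 follows naturally if acquisition and release are recorded by one context manager. The sketch below is illustrative only: `traced_connection` and `connections_balanced` are hypothetical names, and `sqlite3` stands in for the project's real database driver so the example runs without a server.

```python
import sqlite3
import time
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def traced_connection(connect: Callable[[], Any],
                      spans: list[dict[str, Any]]) -> Iterator[Any]:
    """Record DB_CONN on acquisition and DB_CONN_RELEASE on close."""
    start = time.perf_counter()
    conn = connect()
    spans.append({"span_type": "DB_CONN",
                  "acquire_ms": (time.perf_counter() - start) * 1000})
    try:
        yield conn
    finally:
        conn.close()
        spans.append({"span_type": "DB_CONN_RELEASE"})


def connections_balanced(spans: list[dict[str, Any]]) -> bool:
    """True when every DB_CONN has a later DB_CONN_RELEASE and none is left open."""
    open_count = 0
    for span in spans:
        if span["span_type"] == "DB_CONN":
            open_count += 1
        elif span["span_type"] == "DB_CONN_RELEASE":
            open_count -= 1
            if open_count < 0:  # a release without a preceding acquire
                return False
    return open_count == 0


def demo() -> tuple[bool, list[dict[str, Any]]]:
    spans: list[dict[str, Any]] = []
    with traced_connection(lambda: sqlite3.connect(":memory:"), spans) as conn:
        conn.execute("SELECT 1").fetchone()
        balanced_while_open = connections_balanced(spans)
    return balanced_while_open, spans
```

A trace for which `connections_balanced` returns False after the request has finished points to a leaked connection.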
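### Requirement 17: Middleware Layer Tracing
**User Story:** As a developer, I want to see how long the middleware chain takes to execute, so that I can detect performance bottlenecks in the middleware layer.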
#### Acceptance Criteria
1. THE Trace_System SHALL record a MIDDLEWARE span for ResponseWrapperMiddleware containing its execution duration
2. IF the ResponseWrapperMiddleware fails to wrap a response (e.g., JSON parse error), THE Trace_System SHALL record a MIDDLEWARE_ERROR span containing the error details
3. THE MIDDLEWARE span SHALL include the response body size for monitoring abnormally large responses
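Duration and response body size can both be measured by a pure ASGI wrapper that intercepts `send`. The following is a sketch under stated assumptions: `MiddlewareTimer` and `hello_app` are hypothetical, only the standard ASGI message types are used, and the MIDDLEWARE_ERROR path from criterion 2 is omitted.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

Message = dict[str, Any]
Receive = Callable[[], Awaitable[Message]]
Send = Callable[[Message], Awaitable[None]]


class MiddlewareTimer:
    """Records one MIDDLEWARE span per HTTP request handled by the wrapped app."""

    def __init__(self, app: Callable[..., Awaitable[None]],
                 spans: list[dict[str, Any]]) -> None:
        self.app = app
        self.spans = spans

    async def __call__(self, scope: dict[str, Any], receive: Receive, send: Send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        body_size = 0

        async def counting_send(message: Message) -> None:
            nonlocal body_size
            # Response bodies may arrive in several chunks; sum them all.
            if message["type"] == "http.response.body":
                body_size += len(message.get("body", b""))
            await send(message)

        await self.app(scope, receive, counting_send)
        self.spans.append({
            "span_type": "MIDDLEWARE",
            "duration_ms": (time.perf_counter() - start) * 1000,
            "body_size": body_size,
        })


async def hello_app(scope: dict[str, Any], receive: Receive, send: Send) -> None:
    """Tiny ASGI app standing in for the wrapped middleware chain."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b'{"ok": true}'})


async def demo() -> list[dict[str, Any]]:
    spans: list[dict[str, Any]] = []

    async def receive() -> Message:
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message: Message) -> None:
        pass  # discard output; the demo only inspects the recorded span

    await MiddlewareTimer(hello_app, spans)({"type": "http", "path": "/xcx_tasks"},
                                            receive, send)
    return spans
```

Counting bytes at the `send` boundary keeps the measurement independent of how the inner middleware builds its response.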
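### Requirement 18: Trace Coverage Scanning and Display
**User Story:** As a developer, I want the top of the admin-web log page to show the trace system's current coverage of routes, services, jobs, and other modules, so that after adding a module I can promptly spot functions that are not yet traced.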
#### Acceptance Criteria
1. THE Trace_System SHALL provide a coverage scanner that inspects the backend codebase and reports: (a) xcx_* route coverage (which route files are in trace scope), (b) Service function coverage (which public functions in `app/services/` have `@trace_service` decorator), (c) Job handler coverage (which registered jobs are wrapped by `job_wrapper`), (d) SSE/WS endpoint coverage (which endpoints have trace wrappers)
2. THE coverage scanner SHALL report for each category: total count, covered count, uncovered list with module and function names
3. WHEN a GET request is made to `/api/admin/dev-trace/coverage`, THE Trace_API SHALL return the most recent scan result, including scan_time and per-category totals and details
4. WHEN a POST request is made to `/api/admin/dev-trace/coverage/scan`, THE Trace_API SHALL execute a fresh scan immediately and return the updated result
5. THE Trace_System SHALL execute an automatic coverage scan on server startup and periodically at a configurable interval (default 60 minutes)
6. THE Admin_Web_Trace_Page SHALL display a coverage summary bar at the top of the DevTrace page showing per-category coverage percentages and a list of uncovered items
7. THE Admin_Web_Trace_Page SHALL provide a manual "Scan" button that triggers an immediate coverage scan via the API
8. THE coverage scan interval SHALL be configurable via the settings API and settings panel
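The Service-function part of the scan (criterion 1b) can be approximated with the standard `ast` module. The sketch below is not the project's `coverage.py`: `scan_service_coverage` and `SAMPLE` are hypothetical, and it only looks at top-level functions in a single source string.

```python
import ast
from typing import Any, Optional


def _decorator_name(node: ast.expr) -> Optional[str]:
    """Return the bare name of a decorator such as @x, @x(...), or @mod.x(...)."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return None


def scan_service_coverage(source: str,
                          decorator: str = "trace_service") -> dict[str, Any]:
    """Report which public top-level functions carry the given decorator."""
    covered: list[str] = []
    uncovered: list[str] = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):  # private helpers are out of scope
            continue
        names = [_decorator_name(d) for d in node.decorator_list]
        (covered if decorator in names else uncovered).append(node.name)
    return {"total": len(covered) + len(uncovered),
            "covered": len(covered),
            "uncovered": uncovered}


# Hypothetical service module used only to exercise the scanner.
SAMPLE = '''
@trace_service("zh description", "create task")
def create_task(payload): ...

async def list_tasks(): ...

def _helper(): ...
'''
```

Because the scan parses source text rather than importing modules, it does not execute service code and can run on startup without side effects.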


@@ -0,0 +1,434 @@
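# Implementation Plan: Development and Debugging Full-Chain Logging System (dev-trace-log)
## Overview
Implementation proceeds in five phases ordered by dependency: infrastructure layer → core collection layer (HTTP + auth + DB + middleware) → extended collection layer (SSE + WebSocket + background jobs + exceptions) → backend API + frontend pages → wrap-up. A checkpoint at the end of each phase verifies the work.
The backend uses Python (FastAPI + contextvars); the frontend uses TypeScript (React + Vite + Ant Design).
Coverage: full-chain HTTP requests + SSE streaming responses + WebSocket connections + background jobs + exceptions/errors + database connection lifecycle + middleware layer.
## Tasks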
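### Phase 1: Infrastructure Layer (Trace Core Modules)
- [x] 1. Environment variables and configuration module
  - [x] 1.1 Add trace-related environment variables to `.env` and `.env.template`
    - `DEV_TRACE_ENABLED=true` (master switch)
    - `DEV_TRACE_LOG_DIR=export/dev-trace-logs` (log directory)
    - `DEV_TRACE_LOG_RETENTION_DAYS=7` (retention period in days for automatic cleanup)
    - `DEV_TRACE_LOG_SQL=true` (whether to record full SQL)
    - `DEV_TRACE_LOG_PARAMS=true` (whether to record function parameter values)
    - _Requirements: 8.1, 8.4, 8.5, 8.6, 5.3_
  - [x] 1.2 Create `apps/backend/app/trace/__init__.py` and `apps/backend/app/trace/config.py`
    - Define a `TraceConfig` class that reads all configuration items from environment variables
    - Implement runtime switches (in-memory state) that can be changed dynamically through the API
    - After a restart, values fall back to the `.env` configuration
    - _Requirements: 8.1, 8.2, 8.3, 8.5, 8.6_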
- [x] 2. TraceContext and TraceSpan data models
  - [x] 2.1 Create `apps/backend/app/trace/context.py`
    - Define the `TraceSpan` dataclass (span_type, module, function, description_zh, description_en, params, result_summary, duration_ms, timestamp, extra)
    - span_type supports all types: HTTP_IN, AUTH, ROUTE, SERVICE, DB_QUERY, DB_CONN, DB_CONN_RELEASE, HTTP_OUT, ERROR, DB_ERROR, MIDDLEWARE, MIDDLEWARE_ERROR, SSE_START, SSE_EVENT, SSE_END, AI_CALL, AI_STREAM, AI_ERROR, WS_CONNECT, WS_MESSAGE, WS_DISCONNECT, JOB_START, JOB_END, JOB_ERROR
    - Define the `TraceContext` dataclass (request_id, trace_type[http/sse/ws/job], start_time, method, path, user_id, site_id, spans)
    - Store the request-scoped TraceContext in a `contextvars.ContextVar`
    - `request_id` format: HTTP uses uuid hex[:12]; WS uses the `ws_` prefix; Job uses the `job_` prefix
    - Provide an `add_span()` method for appending spans
    - _Requirements: 1.1, 1.5, 2.4, 13.6, 14.4, 15.5_
  - [x] 2.2 Write property test: Request ID uniqueness
    - **Property 1: Request ID uniqueness**
    - Use Hypothesis to generate N TraceContext instances and verify that all request_id values are distinct
    - **Validates: Requirement 1.1**
  - [x] 2.3 Write property test: Span order preservation
    - **Property 2: Span order preservation**
    - Generate a random sequence of TraceSpan objects, add them in order with add_span, and verify that the spans list preserves insertion order
    - **Validates: Requirement 1.5**
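- [x] 3. JSON Lines log writer
  - [x] 3.1 Create `apps/backend/app/trace/writer.py`
    - Implement a `TraceWriter` class that serializes a TraceContext to JSON and appends it to a `.jsonl` file
    - One directory per date (`YYYY-MM-DD/`), one file per hour (`trace_YYYY-MM-DD_HH.jsonl`)
    - Rotate automatically when a single file exceeds 10MB (`trace_YYYY-MM-DD_HH_NNN.jsonl`)
    - Maintain an `_index.json` index file (file list, record counts, file sizes)
    - Writes are asynchronous; a write failure must not affect request handling
    - Serialized output contains the full field set: request_id, trace_type, timestamp, method, path, status_code, total_duration_ms, user_id, site_id, db_query_count, db_total_ms, error, spans
    - _Requirements: 3.1, 3.2, 3.3, 3.4, 4.1, 4.2, 4.3, 4.4, 4.5_
  - [x] 3.2 Write property test: TraceSpan structural completeness
    - **Property 3: TraceSpan structural completeness**
    - Use Hypothesis to generate a TraceSpan of any span_type and verify that the serialized JSON contains all required fields
    - Generate an arbitrary TraceContext and verify that the top-level JSON contains all required fields
    - **Validates: Requirements 2.4, 3.1, 3.2**
  - [x] 3.3 Write property test: JSON serialization round-trip consistency
    - **Property 5: JSON serialization round-trip consistency**
    - Generate an arbitrary TraceContext, serialize it to a JSON line → parse it back into a dict → serialize again, and verify that the two outputs are equivalent
    - **Validates: Requirement 3.5**
  - [x] 3.4 Write property test: Log file path generation
    - **Property 6: Log file path generation**
    - Generate an arbitrary datetime and verify that the directory name matches `YYYY-MM-DD/` and the file name matches `trace_YYYY-MM-DD_HH.jsonl`
    - **Validates: Requirements 4.1, 4.2**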
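- [x] 4. Automatic log cleanup module
  - [x] 4.1 Create `apps/backend/app/trace/cleanup.py`
    - Implement an automatic cleanup check that runs daily in the early morning (can be registered as a scheduled task in lifespan)
    - Delete date directories older than the configured retention period
    - Update `_index.json` after cleanup
    - _Requirements: 5.1, 5.2, 5.3, 5.4_
  - [x] 4.2 Write property test: Cleanup retention correctness
    - **Property 8: Cleanup retention correctness**
    - Generate a random set of date directories and a retention period of N days; verify that after cleanup the expired directories are deleted, directories within the retention period are kept, and the index references no deleted directory
    - **Validates: Requirements 5.2, 5.4**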
- [x] 5. Checkpoint: infrastructure layer verification
  - Run `cd C:\NeoZQYY && pytest tests/ -v` and make sure the property tests pass
  - Make sure the TraceContext, TraceWriter, TraceConfig, and cleanup modules work independently
  - Ask the user if questions arise.
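The file layout from task 3.1 and the retention rule from task 4.1 reduce to two small pure functions, which is also what makes them easy to property-test. This is a sketch with assumed names (`trace_file_path`, `expired_dirs`); treating rotation index 0 as "no suffix" and keeping a directory that is exactly N days old are assumptions, not something the plan specifies.

```python
from datetime import date, datetime, timedelta
from pathlib import PurePosixPath


def trace_file_path(log_dir: str, ts: datetime, rotation: int = 0) -> PurePosixPath:
    """Build <log_dir>/YYYY-MM-DD/trace_YYYY-MM-DD_HH[_NNN].jsonl for a timestamp."""
    day = ts.strftime("%Y-%m-%d")
    suffix = f"_{rotation:03d}" if rotation else ""
    return PurePosixPath(log_dir) / day / f"trace_{day}_{ts:%H}{suffix}.jsonl"


def expired_dirs(dir_names: list[str], today: date, retention_days: int) -> list[str]:
    """Return the date directories strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in dir_names:
        try:
            day = date.fromisoformat(name)
        except ValueError:
            continue  # not a date directory, e.g. _index.json
        if day < cutoff:
            expired.append(name)
    return expired
```

Keeping both functions free of file-system access lets Hypothesis generate arbitrary datetimes and directory sets without touching the disk.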
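### Phase 2: Core Collection Layer (HTTP + Auth + DB + Middleware)
- [x] 6. TraceMiddleware (ASGI middleware)
  - [x] 6.1 Create `apps/backend/app/trace/middleware.py`
    - Implement an ASGI middleware that intercepts requests under the xcx_* route prefixes
    - Non-xcx routes are skipped directly and no TraceContext is created
    - Check the DEV_TRACE_ENABLED switch; when it is off, skip all collection
    - On request entry: create a TraceContext, store it in contextvars, and record an HTTP_IN span
    - On request completion: record an HTTP_OUT span (status_code, duration, body_size)
    - Write the response headers X-Request-ID, X-Process-Time, X-DB-Queries, X-DB-Time
    - Record a MIDDLEWARE span (ResponseWrapperMiddleware execution time)
    - If response wrapping fails, record a MIDDLEWARE_ERROR span
    - Call TraceWriter to write the complete trace
    - _Requirements: 1.1, 1.2, 1.3, 1.4, 8.2, 8.3, 11.1, 11.2, 17.1, 17.2, 17.3_
  - [x] 6.2 Register TraceMiddleware in `apps/backend/app/main.py`
    - Add it outside ResponseWrapperMiddleware
    - Must not affect the existing ResponseWrapperMiddleware or logging configuration
    - _Requirements: 11.3_
  - [x] 6.3 Write property test: Route prefix filtering
    - **Property 15: Route prefix filtering**
    - Generate random request paths and verify that only paths with an xcx_* prefix produce trace data
    - **Validates: Requirements 11.1, 11.2**
  - [x] 6.4 Write property test: No trace output when the switch is off
    - **Property 13: No trace output when the switch is off**
    - With DEV_TRACE_ENABLED=false, verify that no TraceContext is created, no span is recorded, and no log is written
    - **Validates: Requirements 8.2, 8.3**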
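- [x] 7. Decorator and auth layer tracing
  - [x] 7.1 Create `apps/backend/app/trace/decorators.py`
    - Implement the `trace_service(description_zh, description_en)` decorator
    - Record a SERVICE span: module name, function name, parameter names and values, return value summary, duration
    - When DEV_TRACE_LOG_PARAMS=false, omit parameter values
    - _Requirements: 2.2, 8.6_
  - [x] 7.2 Modify `apps/backend/app/auth/dependencies.py`
    - Add an AUTH span in get_current_user() / get_current_user_or_limited()
    - Record: token prefix (not the full token), user_id, site_id, roles, approval status
    - On auth failure, record a detailed reason classification: AUTH_EXPIRED / AUTH_INVALID / AUTH_MALFORMED / AUTH_LIMITED / AUTH_FORBIDDEN
    - Write user_id and site_id into the TraceContext
    - _Requirements: 2.1, 2.5, 12.4_
  - [x] 7.3 Write property test: Token prefix truncation
    - **Property 4: Token prefix truncation**
    - Generate JWT tokens of arbitrary length and verify that the AUTH span records only the prefix
    - **Validates: Requirement 2.5**
  - [x] 7.4 Write property test: Feature switches control span content
    - **Property 14: Feature switches control span content**
    - With DEV_TRACE_LOG_SQL=false, the DB_QUERY span does not contain the full SQL
    - With DEV_TRACE_LOG_PARAMS=false, the SERVICE span does not contain parameter values
    - **Validates: Requirements 8.5, 8.6**
  - [x] 7.5 Write property test: Auth failure reason classification
    - **Property 20: Auth failure reason classification**
    - Simulate each kind of auth failure and verify that the AUTH span's failure_reason is classified correctly
    - **Validates: Requirement 12.4**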
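- [x] 8. Database connection wrapping and lifecycle tracing
  - [x] 8.1 Create `apps/backend/app/trace/db_wrapper.py`
    - Wrap cursor.execute() to record a DB_QUERY span (SQL, parameters, row count, duration, call site)
    - Wrap get_connection() to record a DB_CONN span (connection acquisition time)
    - Wrap connection close to record a DB_CONN_RELEASE span
    - On database exceptions, record a DB_ERROR span (PostgreSQL error code, message, triggering SQL)
    - When DEV_TRACE_LOG_SQL=false, omit the full SQL
    - _Requirements: 2.3, 8.5, 12.3, 16.1, 16.2, 16.3_
  - [x] 8.2 Modify `apps/backend/app/database.py`
    - Integrate the trace db_wrapper into get_connection() (wrap only when trace is enabled)
    - Must not affect the existing database connection logic
    - _Requirements: 2.3, 16.1_
  - [x] 8.3 Write property test: Database connection lifecycle pairing
    - **Property 21: Database connection lifecycle pairing**
    - Verify that every DB_CONN span has a matching DB_CONN_RELEASE span
    - **Validates: Requirement 16.3**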
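- [x] 9. Full-chain exception/error tracing
  - [x] 9.1 Create `apps/backend/app/trace/error_handler.py`
    - Integrate with the global exception handlers (http_exception_handler / unhandled_exception_handler)
    - HTTPException → ERROR span (exception type, status_code, detail, layer where it occurred)
    - Uncaught exception → ERROR span (exception type, message, stack summary limited to the first 5 lines)
    - Make sure the HTTP_OUT span still records the error status code when an exception occurs
    - Make sure the trace is still written in full to the log file when an exception occurs
    - _Requirements: 12.1, 12.2, 12.5_
  - [x] 9.2 Modify `apps/backend/app/middleware/response_wrapper.py`
    - Call the trace error_handler from the exception handlers to record an ERROR span
    - Must not affect the existing exception handling logic or response format
    - _Requirements: 12.1, 12.2_
  - [x] 9.3 Write property test: Trace completeness on exceptions
    - **Property 16: Trace completeness on exceptions**
    - Simulate each kind of exception (HTTPException, uncaught exception, database exception) and verify that the trace contains an ERROR/DB_ERROR span and that the HTTP_OUT span is still present
    - **Validates: Requirements 12.1, 12.2, 12.3**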
- [x] 10. Decorator integration for xcx_* routes and the Service layer
  - [x] 10.1 Add the `@trace_service` decorator to xcx_* route handler functions
    - Covers: xcx_auth, xcx_tasks, xcx_notes, xcx_performance, xcx_chat, xcx_customers, xcx_coaches, xcx_board, xcx_config, xcx_ai_cache
    - Add Chinese and English descriptions to every route function
    - _Requirements: 11.1_
  - [x] 10.2 Add the `@trace_service` decorator to key Service layer functions
    - Prioritize the services involved in integration debugging: task_manager, note_service, performance_service, coach_service, customer_service, board_service, chat_service
    - _Requirements: 2.2_
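- [x] 11. Checkpoint: core collection layer verification
  - Run `cd C:\NeoZQYY && pytest tests/ -v` and make sure the property tests pass
  - Make sure xcx_* requests produce a complete span chain (HTTP_IN → AUTH → SERVICE → DB_QUERY → DB_CONN → HTTP_OUT)
  - Make sure failing requests produce an ERROR span
  - Make sure the middleware layer has a MIDDLEWARE span
  - Ask the user if questions arise.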
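The `trace_service` decorator from task 7.1 could look roughly like the following. Everything beyond the decorator's name and its two description arguments is an assumption: `SPANS` and `CONFIG` are hypothetical stand-ins for the active TraceContext and for `DEV_TRACE_LOG_PARAMS`, only synchronous functions are handled, and a call that raises records no span here.

```python
import functools
import inspect
import time
from typing import Any, Callable

SPANS: list[dict[str, Any]] = []   # stand-in for the current TraceContext
CONFIG = {"log_params": True}      # stand-in for DEV_TRACE_LOG_PARAMS


def trace_service(description_zh: str, description_en: str) -> Callable:
    """Record one SERVICE span per call of the decorated function."""
    def decorator(func: Callable) -> Callable:
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            # With parameter logging off, keep the names but drop the values.
            params = dict(bound.arguments) if CONFIG["log_params"] else list(bound.arguments)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            SPANS.append({
                "span_type": "SERVICE",
                "module": func.__module__,
                "function": func.__name__,
                "description_zh": description_zh,
                "description_en": description_en,
                "params": params,
                "result_summary": repr(result)[:80],
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return wrapper
    return decorator


@trace_service("zh description", "list tasks")
def list_tasks(site_id: int, limit: int = 20) -> list[int]:
    """Hypothetical service function used only for the demo."""
    return [1, 2]
```

Binding arguments through `inspect.signature` means positional and keyword calls produce the same `params` mapping, including defaults the caller did not pass.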
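### Phase 3: Extended Collection Layer (SSE + WebSocket + Background Jobs)
- [x] 12. SSE streaming response tracing
  - [x] 12.1 Create `apps/backend/app/trace/sse_wrapper.py`
    - Wrap the SSE event_generator to trace the whole streaming response
    - SSE_START span: stream start (endpoint, user, chat_id)
    - AI_CALL span: DashScope API call (app_id, prompt length, session_id)
    - SSE_EVENT span: recorded once every 10 tokens (to avoid span explosion), with the cumulative token count
    - SSE_END span: stream end (total token count, total duration, whether it completed normally)
    - AI_ERROR span: AI call failure (error type, message)
    - Set trace_type to "sse"
    - _Requirements: 13.1, 13.2, 13.3, 13.4, 13.5, 13.6_
  - [x] 12.2 Modify `apps/backend/app/routers/xcx_chat.py`
    - Integrate the SSE trace wrapper into the chat_stream() endpoint
    - Inject the trace context inside event_generator()
    - Must not affect the existing SSE event format or error handling logic
    - _Requirements: 13.1, 13.2_
  - [x] 12.3 Write property test: SSE streaming trace completeness
    - **Property 17: SSE streaming trace completeness**
    - Verify that an SSE trace contains SSE_START and SSE_END spans and that the SSE_END total_tokens equals the SSE_EVENT cumulative count
    - **Validates: Requirements 13.1, 13.2, 13.3**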
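- [x] 13. WebSocket connection tracing
  - [x] 13.1 Create `apps/backend/app/trace/ws_wrapper.py`
    - Wrap the WebSocket endpoint to trace the full connection lifecycle
    - WS_CONNECT span: connection established (execution_id, client information)
    - WS_MESSAGE span: recorded once every N messages (cumulative message count, byte count)
    - WS_DISCONNECT span: connection closed (reason, total message count, total duration)
    - Set trace_type to "ws"; request_id uses the `ws_` prefix
    - _Requirements: 14.1, 14.2, 14.3, 14.4_
  - [x] 13.2 Modify `apps/backend/app/ws/logs.py`
    - Integrate the WS trace wrapper into the ws_logs() endpoint
    - Must not affect the existing WebSocket logic or the TaskExecutor subscription mechanism
    - _Requirements: 14.1_
  - [x] 13.3 Write property test: WebSocket trace lifecycle
    - **Property 18: WebSocket trace lifecycle**
    - Verify that a WS trace contains WS_CONNECT and WS_DISCONNECT spans and that the message counts agree
    - **Validates: Requirements 14.1, 14.2**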
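- [x] 14. Background job execution tracing
  - [x] 14.1 Create `apps/backend/app/trace/job_wrapper.py`
    - Wrap job handler functions to trace background job execution
    - JOB_START span: job start (job_name, trigger time)
    - Inner SERVICE / DB_QUERY spans are associated with the job trace automatically (via contextvars)
    - JOB_END span: job completed normally (duration, processed record count)
    - JOB_ERROR span: job failed (exception type, stack summary)
    - Set trace_type to "job"; request_id uses the `job_` prefix
    - _Requirements: 15.1, 15.2, 15.3, 15.4, 15.5_
  - [x] 14.2 Modify the job registration in the `apps/backend/app/main.py` lifespan
    - Wrap the 4 job handlers with the trace job_wrapper: task_generator, task_expiry_check, recall_completion_check, note_reclassify_backfill
    - Must not affect the existing job logic
    - _Requirements: 15.1_
  - [x] 14.3 Write property test: Background job trace completeness
    - **Property 19: Background job trace completeness**
    - Verify that a job trace contains JOB_START and JOB_END/JOB_ERROR spans and that inner spans share the same request_id
    - **Validates: Requirements 15.1, 15.2**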
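- [x] 15. Checkpoint: extended collection layer verification
  - Run `cd C:\NeoZQYY && pytest tests/ -v` and make sure the property tests pass
  - Make sure SSE streaming endpoints produce a complete trace (SSE_START → AI_CALL → SSE_EVENT → SSE_END)
  - Make sure WebSocket connections produce a complete trace (WS_CONNECT → WS_MESSAGE → WS_DISCONNECT)
  - Make sure background jobs produce a complete trace (JOB_START → SERVICE/DB_QUERY → JOB_END)
  - Ask the user if questions arise.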
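### Phase 4: Backend API + Frontend Pages
- [x] 16. Backend API routes
  - [x] 16.1 Create `apps/backend/app/routers/admin_dev_trace.py`
    - `GET /api/admin/dev-trace/dates`: returns the list of dates that have log data
    - `GET /api/admin/dev-trace/requests`: paginated query of the request list by filter (date, start_time, end_time, trace_type, method, path_contains, status_code, min_duration, has_error, span_type, page, page_size)
    - `GET /api/admin/dev-trace/request/{id}`: returns the complete trace record for the given request_id (including all spans)
    - `POST /api/admin/dev-trace/cleanup`: manually cleans up logs by date range
    - `GET /api/admin/dev-trace/settings`: returns the current settings
    - `PUT /api/admin/dev-trace/settings`: updates runtime settings (no restart required)
    - `GET /api/admin/dev-trace/coverage`: returns the most recent coverage scan result
    - `POST /api/admin/dev-trace/coverage/scan`: manually triggers a coverage scan
    - All endpoints require admin role authentication; non-admin users receive 403
    - _Requirements: 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 7.1, 7.2, 7.3, 18.3, 18.4_
  - [x] 16.2 Create `apps/backend/app/trace/coverage.py`
    - Implement the coverage scanner, which uses AST parsing + inspect to check the following dimensions:
    - Route coverage: scan the route functions in `app/routers/xcx_*.py` and determine whether they fall within the TraceMiddleware route prefix scope
    - Service coverage: scan all public functions under `app/services/` (names not starting with `_`) and check for the `@trace_service` decorator
    - Job coverage: scan the job handlers registered in lifespan and check whether they are wrapped by `job_wrapper`
    - SSE/WS coverage: scan SSE/WS endpoints and check whether the corresponding wrapper is integrated
    - Output structure: total, covered, and the uncovered list (with module and function names) for each dimension
    - Scan results are cached in memory and can be refreshed manually
    - Scan once automatically on service startup, then periodically at the configured interval (default 60 minutes)
    - _Requirements: 18.1, 18.2, 18.5_
  - [x] 16.3 Register the admin_dev_trace router in `apps/backend/app/main.py`
    - _Requirements: 6.1_
  - [ ]* 16.4 Write property test: API filter correctness
    - **Property 9: API filter correctness**
    - Generate a random set of trace records and a combination of filter parameters; verify that the returned results satisfy every filter condition and that nothing is missed
    - **Validates: Requirement 6.2**
  - [ ]* 16.5 Write property test: Trace write-read round-trip consistency
    - **Property 10: Trace write-read round-trip consistency**
    - Write a TraceContext to the log file, query it through the API by request_id, and verify that the returned record is equivalent to the original data
    - **Validates: Requirement 6.3**
  - [ ]* 16.6 Write property test: Admin permission enforcement
    - **Property 11: Admin permission enforcement**
    - Simulate a non-admin user accessing every Trace_API endpoint and verify that each returns 403
    - **Validates: Requirements 6.5, 6.6**
  - [ ]* 16.7 Write property test: Settings update round-trip consistency
    - **Property 12: Settings update round-trip consistency**
    - Generate random valid setting values, update them with PUT, read them back with GET, and verify that the returned values match the updated values
    - **Validates: Requirement 7.2**
  - [ ]* 16.8 Write property test: Index file consistency
    - **Property 7: Index file consistency**
    - After a series of log writes, verify that every file referenced in `_index.json` exists, every existing log file is referenced, and the record counts and file sizes match
    - **Validates: Requirements 4.4, 4.5**
  - [ ]* 16.9 Write property test: Coverage scan consistency
    - **Property 22: Coverage scan consistency**
    - Verify that total == covered + len(uncovered) for every dimension of the scan result
    - Verify that every function in the uncovered list really lacks the corresponding decorator
    - **Validates: Requirements 18.1, 18.2**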
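- [x] 17. Checkpoint: backend API verification
  - Run `cd C:\NeoZQYY && pytest tests/ -v` and make sure the property tests pass
  - Run `cd apps/backend && pytest tests/ -v` and make sure the backend tests pass
  - Make sure all 8 API endpoints respond normally and permission checks take effect
  - Ask the user if questions arise.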
- [x] 18. Frontend API layer and type definitions
  - [x] 18.1 Create `apps/admin-web/src/api/devTrace.ts`
    - Wrap all dev-trace API calls (dates, requests, request/{id}, cleanup, settings, coverage, coverage/scan)
    - _Requirements: 9.1, 10.1, 18.3, 18.4_
  - [x] 18.2 Create `apps/admin-web/src/types/devTrace.ts`
    - Define the TypeScript types: TraceRequest, TraceSpan, TraceDetail, TraceSettings, TraceFilter, TraceCoverage, CoverageCategory
    - Include enums for all span_type and trace_type values
    - _Requirements: 9.4, 9.5, 18.6_
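- [x] 19. Development test log page
  - [x] 19.1 Create `apps/admin-web/src/pages/DevTrace.tsx`
    - Two-column layout (Ant Design Layout)
    - Top of page: coverage status bar (Alert component) showing the coverage percentage for each of the route/Service/Job/SSE/WS dimensions, the list of uncovered items, and a manual "Scan" button
    - Left: request list (Table) showing timestamp, trace_type, method, path, status_code, duration, db_query_count
    - Right: complete span chain tree of the selected request (indented by level, color-coded by span_type)
    - Top filter bar: date, time range, trace_type, method, path keyword, status code, minimum duration, has_error, span_type
    - DB_QUERY spans show the SQL statement; ERROR spans are highlighted in red
    - _Requirements: 9.1, 9.2, 9.3, 9.4, 9.5, 18.6, 18.7_
  - [x] 19.2 Register the DevTrace page route and sidebar menu item in `apps/admin-web/src/App.tsx`
    - _Requirements: 9.1_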
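- [x] 20. Settings panel
  - [x] 20.1 Implement the settings panel in the DevTrace page (Drawer or Modal)
    - Logging switch (Switch; calls the settings API and takes effect immediately)
    - Retention days setting (InputNumber + save button)
    - SQL recording switch and parameter recording switch
    - Manual cleanup (DateRangePicker + cleanup button; shows the cleanup result)
    - Disk usage statistics display
    - Coverage scan interval setting (InputNumber, in minutes, default 60)
    - _Requirements: 10.1, 10.2, 10.3, 10.4, 18.8_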
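- [x] 21. Checkpoint: frontend page verification
  - Make sure all frontend components render correctly and the API call layer works
  - Make sure filtering, the list, the span tree, and the settings panel interact smoothly
  - Ask the user if questions arise.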
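The filtering behind `GET /api/admin/dev-trace/requests` (task 16.1, Property 9) can be reasoned about as a conjunction of optional predicates followed by pagination. The sketch below is illustrative: `matches`, `query`, and `SAMPLE` are hypothetical, the record field names follow the serialized fields listed in task 3.1, the date and time-range filters are left out, and using `None` for a job trace's method, path, and status code is an assumption.

```python
from typing import Any, Optional


def matches(trace: dict[str, Any], *, trace_type: Optional[str] = None,
            method: Optional[str] = None, path_contains: Optional[str] = None,
            status_code: Optional[int] = None, min_duration: Optional[float] = None,
            has_error: Optional[bool] = None, span_type: Optional[str] = None) -> bool:
    """A filter left as None is ignored; all supplied filters must hold."""
    if trace_type is not None and trace["trace_type"] != trace_type:
        return False
    if method is not None and trace["method"] != method:
        return False
    if path_contains is not None and path_contains not in (trace["path"] or ""):
        return False
    if status_code is not None and trace["status_code"] != status_code:
        return False
    if min_duration is not None and trace["total_duration_ms"] < min_duration:
        return False
    if has_error is not None and bool(trace.get("error")) != has_error:
        return False
    if span_type is not None and all(s["span_type"] != span_type for s in trace["spans"]):
        return False
    return True


def query(traces: list[dict[str, Any]], page: int = 1, page_size: int = 20,
          **filters: Any) -> dict[str, Any]:
    """Apply the filters, then return one page plus the total hit count."""
    hits = [t for t in traces if matches(t, **filters)]
    start = (page - 1) * page_size
    return {"total": len(hits), "items": hits[start:start + page_size]}


SAMPLE = [
    {"request_id": "a1", "trace_type": "http", "method": "GET", "path": "/xcx_tasks/list",
     "status_code": 200, "total_duration_ms": 35.0, "error": None,
     "spans": [{"span_type": "DB_QUERY"}]},
    {"request_id": "b2", "trace_type": "http", "method": "POST", "path": "/xcx_notes",
     "status_code": 500, "total_duration_ms": 820.0, "error": "ValueError",
     "spans": [{"span_type": "ERROR"}]},
    {"request_id": "job_c3", "trace_type": "job", "method": None, "path": None,
     "status_code": None, "total_duration_ms": 1500.0, "error": None,
     "spans": [{"span_type": "JOB_START"}, {"span_type": "JOB_END"}]},
]
```

Reporting `total` before slicing lets the admin-web table show the full hit count while rendering a single page.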
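### Phase 5: Wrap-up
- [x] 22. Frontend-backend integration and verification
  - [x] 22.1 Start the backend service and use the test database to verify the full request-response chain of each endpoint
    - Send xcx_* requests and verify that trace log files are generated correctly
    - Verify that the JSON response structure matches the Schema definitions (camelCase serialization)
    - Verify that admin permission checks take effect on real requests
    - Verify that response headers include X-Request-ID, X-Process-Time, X-DB-Queries, X-DB-Time
    - _Requirements: 1.4, 6.5_
  - [x] 22.2 Frontend integration verification
    - Confirm that the DevTrace page calls the API correctly and renders the request list and span chain tree
    - Verify that filtering by trace_type (HTTP/SSE/WS/Job) works
    - Verify that the settings panel switches take effect immediately
    - Verify that the frontend does not crash in empty-data or degraded scenarios
    - _Requirements: 9.2, 9.3, 10.2_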
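- [x] 23. Documentation sync
  - [x] 23.1 Update the backend API reference
    - Add documentation for the admin_dev_trace router module (8 endpoints) to `apps/backend/docs/API-REFERENCE.md`
    - Update the router module summary in `apps/backend/README.md` (add a description of the trace module)
    - _Requirements: 6.1_
  - [x] 23.2 Update the architecture documentation
    - Add a description of the trace module architecture to `docs/architecture/backend-architecture.md`
  - [x] 23.3 Update the documentation map
    - Add an entry for this module to `docs/DOCUMENTATION-MAP.md`
    - _Convention: doc-map.md_
  - [x] 23.4 Update the deployment documentation
    - Add a description of the `DEV_TRACE_LOG_DIR` path to `docs/deployment/EXPORT-PATHS.md`
    - _Convention: export-paths.md_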
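- [x] 24. Final checkpoint: full verification
  - Run the monorepo property tests: `cd C:\NeoZQYY && pytest tests/ -v`
  - Run the backend unit tests: `cd apps/backend && pytest tests/ -v`
  - Make sure all property tests (Property 1-22) and unit tests pass
  - Make sure the API documentation, backend README, architecture documentation, documentation map, and deployment documentation have all been updated
  - Make sure the frontend pages run correctly against the real backend
  - Make sure the full chain is covered: HTTP + SSE + WS + Job + exceptions + DB connections + middleware
  - Make sure the coverage scan correctly identifies covered and uncovered modules/functions
  - Ask the user if questions arise.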
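- [x] 25. Service cleanup
  - [x] 25.1 Close the browser, stop the backend and frontend services, and clean up resources
    - Stop the uvicorn backend process (controlPwshProcess stop)
    - Stop the frontend dev server (controlPwshProcess stop)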
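## Notes
- Subtasks marked with `*` are optional (property tests) and may be skipped to speed up the MVP
- Each task references specific requirement numbers to ensure traceability
- Property tests verify general correctness properties (Property 1-22); unit tests verify specific boundary conditions
- Checkpoint tasks ensure incremental verification and keep problems from accumulating
- Full-chain coverage: HTTP requests + SSE streaming + WebSocket + background jobs + exceptions/errors + DB connection lifecycle + middleware layer
- This spec is a full-stack spec; wrap-up follows `spec-closing-checklist.md`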