Neo-ZQYY/docs/specs/dev-trace-log/requirements.md
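Neo 70324d8542 chore: tidy up docs and IDE configuration
- .kiro/specs/ → docs/specs/ (migrated 41 historical requirement specs, removed .config.kiro)
- CLAUDE.md split into three layers: slimmed-down root file + apps/backend/CLAUDE.md + .claude/commands/
- Added two workflow commands: /spec-close and /pre-change
- Refreshed the DDL baseline (re-exported 11 files from the test database; dws 35→38 tables, biz 18→21 tables)
- Unified naming BD_Manual → BD_manual (48 files)
- Fixed 3 inconsistencies between docs and database (auth.users.status default value, scheduled_tasks fields, RLS view count)
- Added BD_manual_public_rbac_tables.md (8 RBAC/workflow tables in the public schema)
- Merged the biz.trigger_jobs documentation (10→12 fields; standalone document archived)
- Updated the docs/database/README.md index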

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 00:02:37 +08:00

Requirements Document

Introduction

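The full-chain development and debugging log system (dev-trace-log) provides end-to-end request tracing for joint debugging of the mini-program frontend and backend. The backend collects fine-grained log entries (spans) for every layer, from the moment an HTTP request enters down to the database queries it issues, and writes them to JSON Lines log files. admin-web provides a "Dev Test Log" section that supports multi-dimensional filtering and viewing of the complete request chain. The system is enabled only in development and test environments; in production it is turned off via a switch.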

Glossary

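  • Trace_System: The full-chain log collection system, comprising the middleware, decorators, database wrapper, and related components
  • TraceMiddleware: ASGI middleware that creates a trace context for each request, records the HTTP_IN/HTTP_OUT spans, and writes the log file
  • TraceContext: Request-scoped trace context based on contextvars; stores the request_id, the span list, and related information
  • TraceSpan: A single trace node recording a function call at one layer (type, module, function name, parameters, duration, etc.)
  • Span_Type: Enumeration of span types, including HTTP_IN, AUTH, ROUTE, SERVICE, DB_QUERY, and HTTP_OUT
  • JSON_Lines_File: A log file with the .jsonl extension in which each line is one complete request trace record in JSON format
  • Log_Writer: The log file writing component, responsible for splitting files by date/hour, rotation, and index maintenance
  • Trace_API: The set of admin-web backend API endpoints providing log query, cleanup, settings, and related functions
  • Admin_Web_Trace_Page: The "Dev Test Log" section in admin-web, which shows the request list and the span chain tree
  • Runtime_Switch: A runtime toggle that enables or disables log collection by modifying in-memory state through the API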

Requirements

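Requirement 1: Request-Level Trace Context Management

User Story: As a backend developer, I want every HTTP request to automatically get a unique trace context, so that all spans across the request's full chain are associated with the same request_id.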

Acceptance Criteria

  1. WHEN an HTTP request enters the FastAPI application, THE TraceMiddleware SHALL create a new TraceContext with a unique request_id and store it in contextvars
  2. WHEN the TraceContext is created, THE TraceMiddleware SHALL record an HTTP_IN span containing the request method, path, query parameters, and body preview
  3. WHEN the HTTP response is sent, THE TraceMiddleware SHALL record an HTTP_OUT span containing the status code, total duration, and response body size
  4. WHEN the HTTP response is sent, THE TraceMiddleware SHALL include X-Request-ID, X-Process-Time, X-DB-Queries, and X-DB-Time in the response headers
  5. THE TraceContext SHALL maintain an ordered list of TraceSpan objects appended during the request lifecycle
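
The criteria above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the helper names `begin_trace` and `end_trace`, the reduced `TraceSpan` field set, and the header value formats are assumptions, and a real TraceMiddleware would call equivalent logic from its ASGI entry point.

```python
import contextvars
import time
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceSpan:
    # Reduced field set for illustration; Requirement 2 lists the full one.
    span_type: str
    function: str
    duration_ms: float = 0.0
    extra: dict[str, Any] = field(default_factory=dict)


@dataclass
class TraceContext:
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.perf_counter)
    spans: list[TraceSpan] = field(default_factory=list)

    def add_span(self, span: TraceSpan) -> None:
        # Appending keeps spans in the order they were recorded.
        self.spans.append(span)


# One context per request; each asyncio task sees its own value.
_current_trace = contextvars.ContextVar("current_trace", default=None)


def begin_trace(method: str, path: str) -> TraceContext:
    """Hypothetical helper: create the context and record HTTP_IN."""
    ctx = TraceContext()
    ctx.add_span(TraceSpan("HTTP_IN", "begin_trace", extra={"method": method, "path": path}))
    _current_trace.set(ctx)
    return ctx


def end_trace(status_code: int) -> dict[str, str]:
    """Hypothetical helper: record HTTP_OUT and build the response headers."""
    ctx = _current_trace.get()
    if ctx is None:
        return {}
    elapsed_ms = (time.perf_counter() - ctx.started_at) * 1000
    db_spans = [s for s in ctx.spans if s.span_type == "DB_QUERY"]
    ctx.add_span(TraceSpan("HTTP_OUT", "end_trace", elapsed_ms, {"status_code": status_code}))
    return {
        "X-Request-ID": ctx.request_id,
        "X-Process-Time": f"{elapsed_ms:.1f}ms",
        "X-DB-Queries": str(len(db_spans)),
        "X-DB-Time": f"{sum(s.duration_ms for s in db_spans):.1f}ms",
    }
```

Because the context lives in a `ContextVar`, decorators and database wrappers deeper in the call stack can reach it without it being passed as an argument.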

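Requirement 2: Multi-Layer Span Collection

User Story: As a backend developer, I want function calls in the four layers (authentication, routing, Service, and database) to be recorded as spans automatically, so that I can see the complete request processing chain when debugging.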

Acceptance Criteria

  1. WHEN a request passes through the authentication layer, THE Trace_System SHALL record an AUTH span containing token parse result, user_id, site_id, roles, and approval status
  2. WHEN a decorated Service function is called, THE Trace_System SHALL record a SERVICE span containing the module name, function name, parameter names and values, return value summary, and duration
  3. WHEN a database query is executed, THE Trace_System SHALL record a DB_QUERY span containing the full parameterized SQL statement, bound parameter values, returned row count, execution duration, and calling source function
  4. THE TraceSpan SHALL contain the following fields: span_type, module, function, description_zh, description_en, params, result_summary, duration_ms, timestamp, and extra
  5. WHEN the AUTH span records a token, THE Trace_System SHALL record only the token prefix, not the complete token value

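Requirement 3: JSON Lines Log Serialization and Writing

User Story: As a backend developer, I want the complete trace data of each request written to a log file in JSON Lines format, so that it can later be read and displayed through the API.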

Acceptance Criteria

  1. WHEN a request completes, THE Log_Writer SHALL serialize the TraceContext into a single JSON line containing request_id, timestamp, method, path, status_code, total_duration_ms, user_id, site_id, db_query_count, db_total_ms, error, and spans array
  2. WHEN serializing a TraceSpan, THE Log_Writer SHALL include all TraceSpan fields in the JSON output
  3. THE Log_Writer SHALL write log entries by appending to the current JSON Lines file asynchronously
  4. IF the log file write operation fails, THEN THE Log_Writer SHALL not affect the HTTP request processing or response
  5. THE Log_Writer SHALL produce valid JSON on each line such that parsing then re-serializing a log entry produces an equivalent JSON object (round-trip property)
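
The serialization, failure-isolation, and round-trip criteria might look like the following sketch. The function names are hypothetical, and the write is synchronous for brevity; to satisfy criterion 3 it would need to be moved off the request path, for example with `asyncio.to_thread`.

```python
import json
from pathlib import Path


def to_json_line(record: dict) -> str:
    # ensure_ascii=False keeps Chinese descriptions readable in the file;
    # default=str stringifies values (e.g. datetimes) json cannot encode.
    return json.dumps(record, ensure_ascii=False, default=str, separators=(",", ":"))


def append_record(path: Path, record: dict) -> bool:
    """Append one JSON line; return False instead of raising on failure."""
    try:
        line = to_json_line(record)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as fh:
            fh.write(line + "\n")
        return True
    except (OSError, TypeError, ValueError):
        # Criterion 4: a logging failure must never break the request.
        return False
```

Since `json.dumps` escapes newlines inside strings, each record is guaranteed to occupy exactly one line.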

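Requirement 4: Log File Splitting and Rotation

User Story: As an operations engineer, I want log files to be split automatically by date and hour and rotated automatically when a single file grows too large, so that disk space stays manageable.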

Acceptance Criteria

  1. THE Log_Writer SHALL organize log files into date-based directories using the format YYYY-MM-DD/
  2. THE Log_Writer SHALL split log files by hour using the naming format trace_YYYY-MM-DD_HH.jsonl
  3. WHEN a log file exceeds 10MB, THE Log_Writer SHALL rotate to a new file with an incremented suffix using the format trace_YYYY-MM-DD_HH_NNN.jsonl
  4. THE Log_Writer SHALL maintain an _index.json file recording the file list, record count, and file size for each date directory
  5. WHEN a new log entry is written, THE Log_Writer SHALL update the _index.json file to reflect the current state
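
One way to pick the target file under the naming rules above is sketched here. `pick_log_file` is a hypothetical helper, and it assumes that NNN is a zero-padded counter starting at 001, which the criteria do not state explicitly. Maintenance of `_index.json` is left out.

```python
from datetime import datetime
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # 10MB rotation threshold from criterion 3


def pick_log_file(root: Path, now: datetime, max_bytes: int = MAX_BYTES) -> Path:
    """Return the file the next entry should be appended to."""
    day_dir = root / now.strftime("%Y-%m-%d")
    stem = f"trace_{now:%Y-%m-%d_%H}"
    candidate = day_dir / f"{stem}.jsonl"
    suffix = 0
    # Skip over files that have already grown past the threshold.
    while candidate.exists() and candidate.stat().st_size > max_bytes:
        suffix += 1
        candidate = day_dir / f"{stem}_{suffix:03d}.jsonl"
    return candidate
```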

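Requirement 5: Automatic Log Cleanup

User Story: As an operations engineer, I want the system to clean up expired logs automatically, so that disk usage does not grow without bound.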

Acceptance Criteria

  1. THE Trace_System SHALL execute an automatic cleanup check daily at midnight
  2. WHEN the automatic cleanup runs, THE Trace_System SHALL delete all log directories older than the configured retention days (default 7 days)
  3. THE Trace_System SHALL read the retention period from the environment variable DEV_TRACE_LOG_RETENTION_DAYS
  4. WHEN a date directory is deleted during cleanup, THE Trace_System SHALL update the _index.json file accordingly
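
The retention rule can be illustrated as below. `cleanup_expired` is a hypothetical name, and the sketch assumes "older than" means strictly before `today - retention_days`; the scheduling at midnight and the `_index.json` update are omitted.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path


def cleanup_expired(root: Path, today: date, retention_days: int = 7) -> list[str]:
    """Delete YYYY-MM-DD directories older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    removed = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        try:
            day = date.fromisoformat(entry.name)
        except ValueError:
            continue  # not a date directory; leave it alone
        if day < cutoff:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

In a deployment the `retention_days` argument would come from `DEV_TRACE_LOG_RETENTION_DAYS`, as criterion 3 requires.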

Requirement 6: Backend Log Query API

User Story: As the admin-web frontend, I want to query log data through an API, so that the page can display the request list and chain details.

Acceptance Criteria

  1. WHEN a GET request is made to /api/admin/dev-trace/dates, THE Trace_API SHALL return a list of dates that have log data available
  2. WHEN a GET request is made to /api/admin/dev-trace/requests with filter parameters (date, start_time, end_time, method, path_contains, status_code, min_duration, page, page_size), THE Trace_API SHALL return a paginated list of matching request summaries
  3. WHEN a GET request is made to /api/admin/dev-trace/request/{id}, THE Trace_API SHALL return the complete trace record including all spans for the specified request_id
  4. WHEN a POST request is made to /api/admin/dev-trace/cleanup with a date range, THE Trace_API SHALL delete log files within the specified date range and return the cleanup result
  5. THE Trace_API SHALL require admin role authentication for all endpoints
  6. IF a non-admin user attempts to access any Trace_API endpoint, THEN THE Trace_API SHALL return a 403 Forbidden response
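
The filtering and pagination behind the `/requests` endpoint could be sketched as a pure function over already-parsed records. The function name and the response shape (`total`, `page`, `page_size`, `items`) are assumptions; the date and time-range filters are left out because they mostly determine which files get read.

```python
def query_requests(records, method=None, path_contains=None, status_code=None,
                   min_duration=None, page=1, page_size=20):
    """Filter parsed trace records and return one page of matches."""

    def matches(r):
        # A filter set to None is treated as "not applied".
        return (
            (method is None or r["method"] == method.upper())
            and (path_contains is None or path_contains in r["path"])
            and (status_code is None or r["status_code"] == status_code)
            and (min_duration is None or r["total_duration_ms"] >= min_duration)
        )

    hits = [r for r in records if matches(r)]
    start = (page - 1) * page_size
    return {
        "total": len(hits),
        "page": page,
        "page_size": page_size,
        "items": hits[start:start + page_size],
    }
```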

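Requirement 7: Log Settings Management API

User Story: As an administrator, I want to read and modify the log system's settings through an API, so that log behavior can be adjusted dynamically at runtime.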

Acceptance Criteria

  1. WHEN a GET request is made to /api/admin/dev-trace/settings, THE Trace_API SHALL return the current settings including enabled status, retention days, SQL logging flag, and parameter logging flag
  2. WHEN a PUT request is made to /api/admin/dev-trace/settings with updated values, THE Trace_API SHALL update the in-memory runtime configuration without requiring a server restart
  3. WHEN the server restarts, THE Trace_System SHALL reset all runtime settings to the values defined in the .env file

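Requirement 8: Switch Mechanism

User Story: As a developer, I want to control whether log collection is enabled through environment variables and a runtime switch, so that logging is off in production and flexibly controllable in development.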

Acceptance Criteria

  1. THE Trace_System SHALL read the master switch from the environment variable DEV_TRACE_ENABLED
  2. WHILE DEV_TRACE_ENABLED is set to false, THE TraceMiddleware SHALL skip all trace context creation, span recording, and log file writing
  3. WHEN the Runtime_Switch is toggled via the settings API, THE Trace_System SHALL immediately start or stop trace collection without server restart
  4. THE Trace_System SHALL read the log directory path from the environment variable DEV_TRACE_LOG_DIR
  5. WHERE the DEV_TRACE_LOG_SQL option is set to false, THE Trace_System SHALL omit full SQL statements from DB_QUERY spans
  6. WHERE the DEV_TRACE_LOG_PARAMS option is set to false, THE Trace_System SHALL omit function parameter values from SERVICE spans
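
The interaction between the environment defaults and the in-memory Runtime_Switch (Requirements 7 and 8) might be modeled as follows. The environment variable names come from this document; everything else, including the default values, the accepted truthy strings, and the fallback log directory, is an assumption.

```python
import os
from dataclasses import dataclass, replace


def _env_bool(name: str, default: bool) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TraceSettings:
    enabled: bool
    log_sql: bool
    log_params: bool
    retention_days: int
    log_dir: str


def load_settings() -> TraceSettings:
    """Read the baseline from the environment (defaults are assumptions)."""
    return TraceSettings(
        enabled=_env_bool("DEV_TRACE_ENABLED", False),
        log_sql=_env_bool("DEV_TRACE_LOG_SQL", True),
        log_params=_env_bool("DEV_TRACE_LOG_PARAMS", True),
        retention_days=int(os.environ.get("DEV_TRACE_LOG_RETENTION_DAYS", "7")),
        log_dir=os.environ.get("DEV_TRACE_LOG_DIR", "logs/dev-trace"),
    )


_runtime = load_settings()


def current_settings() -> TraceSettings:
    return _runtime


def update_settings(**changes) -> TraceSettings:
    """What a PUT to the settings endpoint would do: memory only, no restart."""
    global _runtime
    _runtime = replace(_runtime, **changes)
    return _runtime


def reset_settings() -> TraceSettings:
    """Simulates a restart: runtime overrides are discarded."""
    global _runtime
    _runtime = load_settings()
    return _runtime


def should_trace() -> bool:
    return _runtime.enabled
```

Keeping the settings in an immutable dataclass and swapping the whole object means a request never observes a half-applied update.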

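Requirement 9: admin-web Dev Test Log Page

User Story: As an administrator, I want to view the request list and the complete span chain tree in admin-web, so that I can quickly locate frontend-backend integration problems.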

Acceptance Criteria

  1. WHEN the admin navigates to the dev trace log page, THE Admin_Web_Trace_Page SHALL display a left-right split layout with request list on the left and span detail on the right
  2. WHEN the admin applies filters (date, time range, HTTP method, path keyword, status code, minimum duration), THE Admin_Web_Trace_Page SHALL query the Trace_API and display matching requests
  3. WHEN the admin selects a request from the list, THE Admin_Web_Trace_Page SHALL display the complete span chain as a hierarchical tree with indentation showing the call depth
  4. THE Admin_Web_Trace_Page SHALL display each request entry with timestamp, HTTP method, path, status code, total duration, and DB query count
  5. THE Admin_Web_Trace_Page SHALL display each span with span_type, function name, description (zh), duration, and relevant extra information (SQL for DB_QUERY spans)

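Requirement 10: admin-web Settings Panel

User Story: As an administrator, I want to manage the log system's settings in admin-web, so that I can control log collection behavior and the cleanup policy.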

Acceptance Criteria

  1. WHEN the admin opens the settings panel, THE Admin_Web_Trace_Page SHALL display the current log enabled status, retention days, auto-cleanup toggle, and disk usage statistics
  2. WHEN the admin toggles the log enabled switch, THE Admin_Web_Trace_Page SHALL call the settings API and reflect the updated state immediately
  3. WHEN the admin initiates a manual cleanup with a date range, THE Admin_Web_Trace_Page SHALL call the cleanup API and display the cleanup result (deleted file count and freed space)
  4. WHEN the admin modifies the retention days, THE Admin_Web_Trace_Page SHALL call the settings API and confirm the update

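Requirement 11: Implementation Scope and Route Coverage

User Story: As a developer, I want phase one to cover tracing for all xcx_* routes, so that I get complete debugging information during mini-program integration.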

Acceptance Criteria

  1. THE TraceMiddleware SHALL intercept all requests matching the xcx_* route prefix (login, tasks, notes, performance, AI conversation)
  2. WHEN a request matches a non-xcx route, THE TraceMiddleware SHALL skip trace collection for that request
  3. THE Trace_System SHALL not interfere with the existing ResponseWrapperMiddleware or logging configuration

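Requirement 12: Full-Chain Exception and Error Tracing

User Story: As a developer, I want any exception raised during request processing to be recorded in full in the trace, so that I can quickly determine which layer the error occurred in and why.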

Acceptance Criteria

  1. WHEN an HTTPException is raised during request processing, THE Trace_System SHALL record an ERROR span containing the exception type, status code, detail message, and the layer where it occurred
  2. WHEN an unhandled exception occurs, THE Trace_System SHALL record an ERROR span containing the exception type, message, and stack trace summary (first 5 lines)
  3. WHEN a database exception (psycopg2.Error) occurs, THE Trace_System SHALL record a DB_ERROR span containing the PostgreSQL error code, message, and the SQL statement that caused it
  4. WHEN authentication fails, THE AUTH span SHALL include a failure_reason field categorized as one of: AUTH_EXPIRED (token expired), AUTH_INVALID (signature error), AUTH_MALFORMED (missing fields), AUTH_LIMITED (limited token on full endpoint), AUTH_FORBIDDEN (insufficient role)
  5. WHEN an exception occurs, THE TraceMiddleware SHALL still record the HTTP_OUT span with the error status code and ensure the complete trace is written to the log file
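
Criterion 2 (unhandled exception with a stack trace summary limited to the first 5 lines) can be illustrated with the standard `traceback` module. The helper name `error_span` and the layout of the `extra` dictionary are assumptions.

```python
import traceback


def error_span(exc: BaseException, layer: str, max_lines: int = 5) -> dict:
    """Build an ERROR span with a truncated stack trace summary."""
    formatted = traceback.format_exception(type(exc), exc, exc.__traceback__)
    # format_exception returns chunks that may hold several lines each,
    # so flatten before truncating to the first max_lines lines.
    lines = "".join(formatted).splitlines()
    return {
        "span_type": "ERROR",
        "extra": {
            "exception_type": type(exc).__name__,
            "message": str(exc),
            "layer": layer,
            "stack_summary": lines[:max_lines],
        },
    }
```

Taking the first lines, as the criterion states, favors the outermost frames; for deep stacks the frame that actually raised may fall outside the summary.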

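Requirement 13: SSE Streaming Response Tracing

User Story: As a developer, I want the entire SSE streaming response of an AI conversation to be traced, so that I can see every step of the AI call chain (prompt construction, API call, token stream, completion/error).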

Acceptance Criteria

  1. WHEN an SSE streaming endpoint is called, THE Trace_System SHALL record an SSE_START span containing the endpoint, user info, and chat_id
  2. WHEN the AI API (DashScope) is called, THE Trace_System SHALL record an AI_CALL span containing the app_id, prompt length, and session_id
  3. WHILE SSE tokens are streaming, THE Trace_System SHALL record SSE_EVENT spans at regular intervals (every N tokens) containing the cumulative token count, to avoid span explosion
  4. WHEN the SSE stream completes, THE Trace_System SHALL record an SSE_END span containing total token count, total duration, and whether it completed normally
  5. WHEN an error occurs during SSE streaming, THE Trace_System SHALL record an AI_ERROR span containing the error type, message, and retry count
  6. THE SSE trace SHALL use trace_type="sse" to distinguish from regular HTTP traces
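
The sampling idea in criterion 3 (one SSE_EVENT span per N tokens rather than per token) is sketched below with a plain generator. The wrapper name, the default of N = 50, and the assumption that one yielded chunk equals one token are illustrative only; a real SSE endpoint would most likely use an async generator with the same structure.

```python
import time


def trace_token_stream(tokens, spans, every_n=50):
    """Yield tokens unchanged while recording sampled SSE spans."""
    spans.append({"span_type": "SSE_START"})
    start = time.perf_counter()
    count = 0
    completed = False
    try:
        for tok in tokens:
            count += 1
            if count % every_n == 0:
                # Sampled progress marker carrying the cumulative count.
                spans.append({"span_type": "SSE_EVENT", "extra": {"token_count": count}})
            yield tok
        completed = True
    finally:
        # Also runs when the stream raises or the consumer closes it
        # early, so SSE_END is always recorded.
        spans.append({
            "span_type": "SSE_END",
            "duration_ms": (time.perf_counter() - start) * 1000,
            "extra": {"total_tokens": count, "completed": completed},
        })
```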

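Requirement 14: WebSocket Connection Tracing

User Story: As a developer, I want the full lifecycle of a WebSocket connection to be traced, so that I can see the complete process of connection establishment, message pushing, and disconnection.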

Acceptance Criteria

  1. WHEN a WebSocket connection is established, THE Trace_System SHALL record a WS_CONNECT span containing the execution_id and client information
  2. WHILE WebSocket messages are being pushed, THE Trace_System SHALL record WS_MESSAGE spans at regular intervals (every N messages) containing the cumulative message count and byte count
  3. WHEN a WebSocket connection is closed, THE Trace_System SHALL record a WS_DISCONNECT span containing the disconnect reason, total message count, and total duration
  4. THE WebSocket trace SHALL use trace_type="ws" and a request_id with "ws_" prefix to distinguish from HTTP traces

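Requirement 15: Background Job Execution Tracing

User Story: As a developer, I want the execution of background scheduled jobs (task_generator, task_expiry, recall_detector, note_reclassifier) to be traced, so that I can troubleshoot problems in background jobs.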

Acceptance Criteria

  1. WHEN a background job starts execution, THE Trace_System SHALL create a new TraceContext with trace_type="job" and record a JOB_START span containing the job name and trigger time
  2. WHILE a job is executing, THE Trace_System SHALL record SERVICE and DB_QUERY spans associated with the job's TraceContext (via contextvars)
  3. WHEN a job completes successfully, THE Trace_System SHALL record a JOB_END span containing the duration and processed record count
  4. WHEN a job fails with an exception, THE Trace_System SHALL record a JOB_ERROR span containing the exception type, message, and stack trace summary
  5. THE job trace SHALL use a request_id with "job_" prefix and be written to the same log files as HTTP traces

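Requirement 16: Database Connection Lifecycle Tracing

User Story: As a developer, I want to see when each database connection is acquired and released, so that I can detect connection leaks or acquisition bottlenecks.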

Acceptance Criteria

  1. WHEN a database connection is acquired via get_connection(), THE Trace_System SHALL record a DB_CONN span containing the connection acquisition duration
  2. WHEN a database connection is released (closed), THE Trace_System SHALL record a DB_CONN_RELEASE span
  3. EACH DB_CONN span SHALL be paired with exactly one DB_CONN_RELEASE span within the same trace
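
A context manager is one way to guarantee the pairing in criterion 3. In this sketch, `traced_connection` and `unpaired_connections` are hypothetical helpers, and `connect` stands in for the project's `get_connection()`; the only thing assumed about the connection object is a `close()` method.

```python
import time
from contextlib import contextmanager


@contextmanager
def traced_connection(connect, spans):
    """Record DB_CONN on acquire and DB_CONN_RELEASE on release."""
    start = time.perf_counter()
    conn = connect()
    spans.append({"span_type": "DB_CONN",
                  "duration_ms": (time.perf_counter() - start) * 1000})
    try:
        yield conn
    finally:
        # Runs even if the body raises, so every DB_CONN gets its pair.
        conn.close()
        spans.append({"span_type": "DB_CONN_RELEASE"})


def unpaired_connections(spans) -> int:
    """A positive result means connections were acquired but not released."""
    opened = sum(1 for s in spans if s["span_type"] == "DB_CONN")
    released = sum(1 for s in spans if s["span_type"] == "DB_CONN_RELEASE")
    return opened - released
```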

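Requirement 17: Middleware Layer Tracing

User Story: As a developer, I want to see the execution time of the middleware chain, so that I can detect performance bottlenecks in the middleware layer.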

Acceptance Criteria

  1. THE Trace_System SHALL record a MIDDLEWARE span for ResponseWrapperMiddleware containing its execution duration
  2. IF the ResponseWrapperMiddleware fails to wrap a response (e.g., JSON parse error), THEN THE Trace_System SHALL record a MIDDLEWARE_ERROR span containing the error details
  3. THE MIDDLEWARE span SHALL include the response body size for monitoring abnormally large responses

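Requirement 18: Trace Coverage Scanning and Display

User Story: As a developer, I want to see, at the top of the admin-web log page, the trace system's current coverage of routes, Services, Jobs, and other modules, so that functions not yet hooked into tracing are spotted promptly after new modules are added.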

Acceptance Criteria

  1. THE Trace_System SHALL provide a coverage scanner that inspects the backend codebase and reports: (a) xcx_* route coverage (which route files are in trace scope), (b) Service function coverage (which public functions in app/services/ have @trace_service decorator), (c) Job handler coverage (which registered jobs are wrapped by job_wrapper), (d) SSE/WS endpoint coverage (which endpoints have trace wrappers)
  2. THE coverage scanner SHALL report for each category: total count, covered count, uncovered list with module and function names
  3. WHEN a GET request is made to /api/admin/dev-trace/coverage, THE Trace_API SHALL return the most recent scan result, including scan_time and per-category totals and details
  4. WHEN a POST request is made to /api/admin/dev-trace/coverage/scan, THE Trace_API SHALL execute a fresh scan immediately and return the updated result
  5. THE Trace_System SHALL execute an automatic coverage scan on server startup and periodically at a configurable interval (default 60 minutes)
  6. THE Admin_Web_Trace_Page SHALL display a coverage summary bar at the top of the DevTrace page showing per-category coverage percentages and a list of uncovered items
  7. THE Admin_Web_Trace_Page SHALL provide a manual "Scan" button that triggers an immediate coverage scan via the API
  8. THE coverage scan interval SHALL be configurable via the settings API and settings panel