Commit before frontend-backend integration debugging, 20260223

This commit is contained in:
Neo
2026-02-23 23:02:20 +08:00
parent 254ccb1e77
commit fafc95e64c
1142 changed files with 10366960 additions and 36957 deletions


@@ -0,0 +1,131 @@
- [P20260221-115116] 2026-02-21 11:51:16 +0800
- summary: CONTEXT TRANSFER: We are continuing a conversation that had gotten too long. Here is a summary: --- ## TASK 1: Modify "D…
- prompt:
```text
CONTEXT TRANSFER: We are continuing a conversation that had gotten too long. Here is a summary:
---
## TASK 1: Modify "Data Flow Structure Analysis" Hook — Whitelist Refactoring
**STATUS**: in-progress
**USER QUERIES**: 1 (the main request)
**DETAILS**:
The user requested three changes to the existing `dataflow-analyze` hook and its underlying `gen_dataflow_report.py`:
1. **Add to whitelist**: the DWD dimension-table SCD2 fields (`valid_from`, `valid_to`, `is_current`, `etl_loaded_at`, `etl_batch_id`) and the API's nested `siteProfile` info
2. **Refactor whitelist logic**: All whitelist fields must still be checked (not skipped). The term "跳过" (skip) must never be used — only "白名单" (whitelist). Whitelist fields are folded/collapsed in the report (not shown in detailed tables), with a note explaining the whitelist reason.
3. **Report sections affected**: Section 1.1 (API↔ODS↔DWD field diff) and Section 3 (per-table details)
**What was implemented**:
- Added global whitelist constants and helper functions at top of `gen_dataflow_report.py`: `WHITELIST_ETL_META_COLS`, `WHITELIST_DWD_SCD2_COLS`, `WHITELIST_API_NESTED_PREFIXES`, plus `is_whitelist_etl_meta()`, `is_whitelist_scd2()`, `is_whitelist_api_nested()`, `whitelist_reason()`
- Refactored `_write_field_diff_report()` to separate whitelist items from non-whitelist items in all 4 categories (API→ODS unmapped, ODS no JSON source, ODS→DWD unmapped, DWD no ODS source)
- Added `_write_whitelist_summary()` helper for collapsed whitelist display
- Updated `_write_dwd_section()` to use `is_whitelist_scd2()` instead of hardcoded tuple
- Updated `_write_api_section()` to use `is_whitelist_api_nested()` instead of hardcoded `siteProfile.` check
- Removed all uses of old `etl_meta_cols` local variable
- Removed all instances of "跳过" from the report generator
- Updated hook file to v4.0.0 with new prompt describing whitelist rules
**What still needs verification**:
- The `gen_dataflow_report.py` changes compile clean (verified via `py_compile` and `getDiagnostics`)
- But the report has NOT been regenerated/tested end-to-end yet
- The Section 3 per-table details (`_write_api_section`, `_write_ods_section`, `_write_dwd_section`) still show ALL fields in their tables (including whitelist fields). The user's requirement says whitelist fields should also be folded in Section 3. This part may need additional work — currently only Section 1.1 diff details fold whitelist fields.
**NEXT STEPS**:
- Verify whether Section 3 per-table detail tables also need whitelist folding (the user said: "3. For each item under the per-table details, whitelist fields are not placed in the detailed tables"). If so, `_write_api_section`, `_write_ods_section`, and `_write_dwd_section` need modification to exclude whitelist fields from their main tables and add a summary note instead.
- Run the full pipeline to test: `python scripts/ops/analyze_dataflow.py` then `python scripts/ops/gen_dataflow_report.py`
**FILEPATHS**:
- `scripts/ops/gen_dataflow_report.py` (modified)
- `.kiro/hooks/dataflow-analyze.kiro.hook` (modified, v4.0.0)
---
## TASK 2: Create New Hook "ETL Data Consistency Check"
**STATUS**: in-progress
**USER QUERIES**: 1 (the main request)
**DETAILS**:
The user requested a new hook that:
1. Finds the last successful ETL task from logs
2. Uses black-box approach to produce a sampling report similar to `field_level_report_20260220_233443.md` but more detailed
3. Compares actual data across API→ODS→DWD→DWS/INDEX for every table and field, reporting data differences
**What was implemented**:
- Created `.kiro/hooks/etl-data-consistency.kiro.hook` (v1.0.0, userTriggered)
- Created `scripts/ops/etl_consistency_check.py` with full implementation:
- `find_latest_log()` / `parse_etl_log()` — find and parse latest ETL log
- `load_api_json_records()` — read API JSON from FETCH_ROOT
- `check_api_vs_ods()` — field-level + value-level comparison (API JSON vs ODS DB)
- `check_ods_vs_dwd()` — column comparison + value sampling (ODS vs DWD)
- `check_dwd_vs_dws()` — DWS table overview + numeric column sanity checks
- `generate_report()` — full Markdown report with 5 sections
- `main()` — orchestration entry point
- Script uses `PG_DSN` (readonly), `FETCH_ROOT`, `LOG_ROOT`, `ETL_REPORT_ROOT` from `.env`
- Output: `ETL_REPORT_ROOT/consistency_check_<timestamp>.md`
- Both files pass `py_compile` and `getDiagnostics` checks
**What still needs verification/improvement**:
- Script has NOT been run yet — needs end-to-end testing
- The `COUNT(DISTINCT id)` query reads `cur.fetchone()["cnt"]`, but without an explicit alias PostgreSQL names the result column `count`, not `cnt` — this will raise a `KeyError` at runtime. Fix the SQL alias or the dict key.
- The user wanted the report to be "more detailed than field_level_report" — current implementation does value sampling but doesn't include per-field statistics (NULL rate, min/max, distinct count) like `field_level_report.py` does. May need enhancement.
- DWS check is basic (row counts + numeric sanity) — doesn't do actual DWD→DWS aggregation verification
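For reference, the alias fix might look like this (demonstrated on an in-memory SQLite table purely to show the alias behaviour; the real script runs against PostgreSQL via `PG_DSN`, where the unaliased aggregate comes back under the default name `count`):

```python
import sqlite3

def distinct_id_count(conn: sqlite3.Connection, table: str) -> int:
    """The explicit "AS cnt" makes row["cnt"] safe; without an alias,
    PostgreSQL exposes the aggregate as "count", which is what caused
    the KeyError described above."""
    row = conn.execute(f"SELECT COUNT(DISTINCT id) AS cnt FROM {table}").fetchone()
    return row["cnt"]

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # dict-style row access, like a dict cursor
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])
print(distinct_id_count(conn, "t"))  # 2 distinct ids
```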
**NEXT STEPS**:
- Fix the `COUNT(DISTINCT id)` SQL alias bug in `check_api_vs_ods()` and `check_ods_vs_dwd()` — change `cur.fetchone()["cnt"]` to `cur.fetchone()["count"]` or add `AS cnt` alias
- Run `uv run python scripts/ops/etl_consistency_check.py` to test end-to-end
- Consider enhancing DWS verification to compare aggregated DWD values against DWS values
- Consider adding per-field statistics similar to `field_level_report.py`
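The suggested DWD→DWS enhancement could re-aggregate the DWD measure per key and diff it against the stored DWS value, along these lines (all table and column names are hypothetical placeholders, and SQLite stands in for PostgreSQL here):

```python
import sqlite3

def verify_dws_aggregation(conn, dwd, dws, key_col, measure_col, tol=1e-6):
    """Recompute SUM(measure) per key from DWD and diff against DWS.

    Returns a list of (key, recomputed, stored) mismatches.
    """
    sql = (
        f"SELECT d.{key_col}, SUM(d.{measure_col}) AS recomputed, "
        f"s.{measure_col} AS stored "
        f"FROM {dwd} d JOIN {dws} s USING ({key_col}) "
        f"GROUP BY d.{key_col}, s.{measure_col} "
        f"ORDER BY d.{key_col}"
    )
    return [(k, r, s) for k, r, s in conn.execute(sql) if abs(r - s) > tol]

# Demo with placeholder tables: site 2's stored DWS value disagrees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwd_fact (site_id INTEGER, revenue REAL)")
conn.executemany("INSERT INTO dwd_fact VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.0)])
conn.execute("CREATE TABLE dws_daily (site_id INTEGER, revenue REAL)")
conn.executemany("INSERT INTO dws_daily VALUES (?, ?)", [(1, 15.0), (2, 9.0)])
print(verify_dws_aggregation(conn, "dwd_fact", "dws_daily", "site_id", "revenue"))
```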
**FILEPATHS**:
- `scripts/ops/etl_consistency_check.py` (new)
- `.kiro/hooks/etl-data-consistency.kiro.hook` (new)
---
## USER CORRECTIONS AND INSTRUCTIONS:
- Language: All explanatory text must be in simplified Chinese (简体中文)
- Never use "跳过" (skip) — only use "白名单" (whitelist)
- Whitelist fields must still be checked/counted — they are only folded in report display
- Output paths must come from `.env` environment variables via `_env_paths.get_output_path()` — never hardcode
- Scripts in `scripts/ops/` use `_env_paths.py` for path resolution
- Database connections use `PG_DSN` from `.env` (readonly mode for reports)
- The background context document is at `export/ETL-Connectors/feiqiu/REPORTS/context_handoff_task2.md`
- Existing mapping tables (`ODS_TASK_TO_TABLE`, `DWD_TO_ODS`) are defined in `scripts/ops/blackbox_test_report.py` and were replicated in the new script
- The project uses `uv run python` to execute scripts
- `scripts/ops/` scripts must `load_dotenv` from root `.env` and raise errors on missing env vars
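Those path/env conventions might be sketched like this (`require_env` and this `get_output_path` signature are illustrative guesses; the real helper lives in `scripts/ops/_env_paths.py` and calls `dotenv.load_dotenv` on the root `.env` first):

```python
import os
from pathlib import Path

def require_env(name: str) -> str:
    """Fail fast on a missing .env variable instead of silently defaulting."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required .env variable: {name}")
    return value

def get_output_path(root_var: str, filename: str) -> Path:
    """Resolve an output file under a .env-configured root (never hardcoded)."""
    return Path(require_env(root_var)) / filename
```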
## Files to read
- `scripts/ops/gen_dataflow_report.py` (modified — whitelist refactoring, verify Section 3 folding requirement)
- `scripts/ops/etl_consistency_check.py` (new — has SQL alias bug to fix, needs testing)
- `.kiro/hooks/dataflow-analyze.kiro.hook` (modified hook definition)
- `.kiro/hooks/etl-data-consistency.kiro.hook` (new hook definition)
- `export/ETL-Connectors/feiqiu/REPORTS/context_handoff_task2.md` (background context)
- `scripts/ops/_env_paths.py` (shared path utility)
- `scripts/ops/blackbox_test_report.py` (reference for mapping tables and report patterns)
- `scripts/ops/field_level_report.py` (reference for the "more detailed" report the user wants)
USER QUERIES(most recent first):
1. Here is the background info: `export\ETL-Connectors\feiqiu\REPORTS\context_handoff_task2.md`. Modify a hook ("Data Flow Structure Analysis"). Requirements: - Whitelist/skip: add the DWD dimension-table SCD2 fields, and the nested `siteProfile` info in the API. - Refactor the handling logic for all whitelisted and skipped fields: all of these fields must still be checked normally and must never be skipped during checks. The earlier use of the word "跳过" (skip) was imprecise; after this change do not use that word, use only "白名单" (whitelist). The whitelist's effect: for each item under report section "1.1 API↔ODS↔DWD field comparison differences" and under section "3. Per-table details", whitelist fields are not placed in the detailed tables, with a note that details are not expanded because of the whitelist or another specific reason. Add a new hook (name it yourself) that works similarly to "Data Flow Structure Analysis". Requirements: 1) Get the last successfully executed ETL task. 2) Use black-box means to produce a sampling report similar to `export\ETL-Connectors\feiqiu\REPORTS\field_level_report_20260220_233443.md`, but more detailed (see below). 3) For this task, check every field of every table across API - ODS - DWD - DWS/INDEX, comparing each table and field up and down the data flow (actual database values and actual JSON values) to see whether any data fails to match in this task, and report and enumerate any data differences.
2. Output:
Command executed successfully with no output.
Exit Code: 0
3. Output:
Command executed successfully with no output.
Exit Code: 0
4. (identical in content to user query 1 above)
---
METADATA:
The previous conversation had 2 messages.
INSTRUCTIONS:
Continue working until the user query has been fully addressed. Do not ask for clarification - proceed with the work based on the context provided.
IMPORTANT: you need to read the files listed in the "Files to read" section
```