Commit before frontend-backend integration debugging, 20260223

This commit is contained in:
Neo
2026-02-23 23:02:20 +08:00
parent 254ccb1e77
commit fafc95e64c
1142 changed files with 10366960 additions and 36957 deletions


@@ -0,0 +1,131 @@
- [P20260221-115116] 2026-02-21 11:51:16 +0800
- summary: CONTEXT TRANSFER: We are continuing a conversation that had gotten too long. Here is a summary: --- ## TASK 1: Modify "D…
- prompt:
```text
CONTEXT TRANSFER: We are continuing a conversation that had gotten too long. Here is a summary:
---
## TASK 1: Modify "Data Flow Structure Analysis" Hook — Whitelist Refactoring
**STATUS**: in-progress
**USER QUERIES**: 1 (the main request)
**DETAILS**:
The user requested three changes to the existing `dataflow-analyze` hook and its underlying `gen_dataflow_report.py`:
1. **Add to whitelist**: the DWD dimension-table SCD2 fields (`valid_from`, `valid_to`, `is_current`, `etl_loaded_at`, `etl_batch_id`) and the API's nested `siteProfile` info
2. **Refactor whitelist logic**: All whitelist fields must still be checked (not skipped). The term "跳过" (skip) must never be used — only "白名单" (whitelist). Whitelist fields are folded/collapsed in the report (not shown in detailed tables), with a note explaining the whitelist reason.
3. **Report sections affected**: Section 1.1 (API↔ODS↔DWD field diff) and Section 3 (per-table details)
**What was implemented**:
- Added global whitelist constants and helper functions at top of `gen_dataflow_report.py`: `WHITELIST_ETL_META_COLS`, `WHITELIST_DWD_SCD2_COLS`, `WHITELIST_API_NESTED_PREFIXES`, plus `is_whitelist_etl_meta()`, `is_whitelist_scd2()`, `is_whitelist_api_nested()`, `whitelist_reason()`
- Refactored `_write_field_diff_report()` to separate whitelist items from non-whitelist items in all 4 categories (API→ODS unmapped, ODS no JSON source, ODS→DWD unmapped, DWD no ODS source)
- Added `_write_whitelist_summary()` helper for collapsed whitelist display
- Updated `_write_dwd_section()` to use `is_whitelist_scd2()` instead of hardcoded tuple
- Updated `_write_api_section()` to use `is_whitelist_api_nested()` instead of hardcoded `siteProfile.` check
- Removed all uses of old `etl_meta_cols` local variable
- Removed all instances of "跳过" from the report generator
- Updated hook file to v4.0.0 with new prompt describing whitelist rules
**What still needs verification**:
- The `gen_dataflow_report.py` changes compile clean (verified via `py_compile` and `getDiagnostics`)
- But the report has NOT been regenerated/tested end-to-end yet
- The Section 3 per-table details (`_write_api_section`, `_write_ods_section`, `_write_dwd_section`) still show ALL fields in their tables (including whitelist fields). The user's requirement says whitelist fields should also be folded in Section 3. This part may need additional work — currently only Section 1.1 diff details fold whitelist fields.
**NEXT STEPS**:
- Verify whether Section 3 per-table detail tables also need whitelist folding (the user said: "3. For each item under the per-table details, whitelist fields are not placed in the detailed tables"). If so, `_write_api_section`, `_write_ods_section`, and `_write_dwd_section` need modification to exclude whitelist fields from their main tables and add a summary note instead.
- Run the full pipeline to test: `python scripts/ops/analyze_dataflow.py` then `python scripts/ops/gen_dataflow_report.py`
**FILEPATHS**:
- `scripts/ops/gen_dataflow_report.py` (modified)
- `.kiro/hooks/dataflow-analyze.kiro.hook` (modified, v4.0.0)
---
## TASK 2: Create New Hook "ETL Data Consistency Check"
**STATUS**: in-progress
**USER QUERIES**: 1 (the main request)
**DETAILS**:
The user requested a new hook that:
1. Finds the last successful ETL task from logs
2. Uses black-box approach to produce a sampling report similar to `field_level_report_20260220_233443.md` but more detailed
3. Compares actual data across API→ODS→DWD→DWS/INDEX for every table and field, reporting data differences
**What was implemented**:
- Created `.kiro/hooks/etl-data-consistency.kiro.hook` (v1.0.0, userTriggered)
- Created `scripts/ops/etl_consistency_check.py` with full implementation:
- `find_latest_log()` / `parse_etl_log()` — find and parse latest ETL log
- `load_api_json_records()` — read API JSON from FETCH_ROOT
- `check_api_vs_ods()` — field-level + value-level comparison (API JSON vs ODS DB)
- `check_ods_vs_dwd()` — column comparison + value sampling (ODS vs DWD)
- `check_dwd_vs_dws()` — DWS table overview + numeric column sanity checks
- `generate_report()` — full Markdown report with 5 sections
- `main()` — orchestration entry point
- Script uses `PG_DSN` (readonly), `FETCH_ROOT`, `LOG_ROOT`, `ETL_REPORT_ROOT` from `.env`
- Output: `ETL_REPORT_ROOT/consistency_check_<timestamp>.md`
- Both files pass `py_compile` and `getDiagnostics` checks
**What still needs verification/improvement**:
- Script has NOT been run yet — needs end-to-end testing
- The `COUNT(DISTINCT id)` query reads `cur.fetchone()["cnt"]`, but without an explicit alias PostgreSQL names the result column `count`, not `cnt` — this will raise a `KeyError` at runtime. Fix the SQL alias or the dict key.
- The user wanted the report to be "more detailed than field_level_report" — current implementation does value sampling but doesn't include per-field statistics (NULL rate, min/max, distinct count) like `field_level_report.py` does. May need enhancement.
- DWS check is basic (row counts + numeric sanity) — doesn't do actual DWD→DWS aggregation verification
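For reference, the alias fix might look like this (demonstrated on an in-memory SQLite table purely to show the alias behaviour; the real script runs against PostgreSQL via `PG_DSN`, where the unaliased aggregate comes back under the default name `count`):

```python
import sqlite3

def distinct_id_count(conn: sqlite3.Connection, table: str) -> int:
    """The explicit "AS cnt" makes row["cnt"] safe; without an alias,
    PostgreSQL exposes the aggregate as "count", which is what caused
    the KeyError described above."""
    row = conn.execute(f"SELECT COUNT(DISTINCT id) AS cnt FROM {table}").fetchone()
    return row["cnt"]

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # dict-style row access, like a dict cursor
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])
print(distinct_id_count(conn, "t"))  # 2 distinct ids
```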
**NEXT STEPS**:
- Fix the `COUNT(DISTINCT id)` SQL alias bug in `check_api_vs_ods()` and `check_ods_vs_dwd()` — change `cur.fetchone()["cnt"]` to `cur.fetchone()["count"]` or add `AS cnt` alias
- Run `uv run python scripts/ops/etl_consistency_check.py` to test end-to-end
- Consider enhancing DWS verification to compare aggregated DWD values against DWS values
- Consider adding per-field statistics similar to `field_level_report.py`
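The suggested DWD→DWS enhancement could re-aggregate the DWD measure per key and diff it against the stored DWS value, along these lines (all table and column names are hypothetical placeholders, and SQLite stands in for PostgreSQL here):

```python
import sqlite3

def verify_dws_aggregation(conn, dwd, dws, key_col, measure_col, tol=1e-6):
    """Recompute SUM(measure) per key from DWD and diff against DWS.

    Returns a list of (key, recomputed, stored) mismatches.
    """
    sql = (
        f"SELECT d.{key_col}, SUM(d.{measure_col}) AS recomputed, "
        f"s.{measure_col} AS stored "
        f"FROM {dwd} d JOIN {dws} s USING ({key_col}) "
        f"GROUP BY d.{key_col}, s.{measure_col} "
        f"ORDER BY d.{key_col}"
    )
    return [(k, r, s) for k, r, s in conn.execute(sql) if abs(r - s) > tol]

# Demo with placeholder tables: site 2's stored DWS value disagrees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwd_fact (site_id INTEGER, revenue REAL)")
conn.executemany("INSERT INTO dwd_fact VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.0)])
conn.execute("CREATE TABLE dws_daily (site_id INTEGER, revenue REAL)")
conn.executemany("INSERT INTO dws_daily VALUES (?, ?)", [(1, 15.0), (2, 9.0)])
print(verify_dws_aggregation(conn, "dwd_fact", "dws_daily", "site_id", "revenue"))
```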
**FILEPATHS**:
- `scripts/ops/etl_consistency_check.py` (new)
- `.kiro/hooks/etl-data-consistency.kiro.hook` (new)
---
## USER CORRECTIONS AND INSTRUCTIONS:
- Language: All explanatory text must be in simplified Chinese (简体中文)
- Never use "跳过" (skip) — only use "白名单" (whitelist)
- Whitelist fields must still be checked/counted — they are only folded in report display
- Output paths must come from `.env` environment variables via `_env_paths.get_output_path()` — never hardcode
- Scripts in `scripts/ops/` use `_env_paths.py` for path resolution
- Database connections use `PG_DSN` from `.env` (readonly mode for reports)
- The background context document is at `export/ETL-Connectors/feiqiu/REPORTS/context_handoff_task2.md`
- Existing mapping tables (`ODS_TASK_TO_TABLE`, `DWD_TO_ODS`) are defined in `scripts/ops/blackbox_test_report.py` and were replicated in the new script
- The project uses `uv run python` to execute scripts
- `scripts/ops/` scripts must `load_dotenv` from root `.env` and raise errors on missing env vars
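Those path/env conventions might be sketched like this (`require_env` and this `get_output_path` signature are illustrative guesses; the real helper lives in `scripts/ops/_env_paths.py` and calls `dotenv.load_dotenv` on the root `.env` first):

```python
import os
from pathlib import Path

def require_env(name: str) -> str:
    """Fail fast on a missing .env variable instead of silently defaulting."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required .env variable: {name}")
    return value

def get_output_path(root_var: str, filename: str) -> Path:
    """Resolve an output file under a .env-configured root (never hardcoded)."""
    return Path(require_env(root_var)) / filename
```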
## Files to read
- `scripts/ops/gen_dataflow_report.py` (modified — whitelist refactoring, verify Section 3 folding requirement)
- `scripts/ops/etl_consistency_check.py` (new — has SQL alias bug to fix, needs testing)
- `.kiro/hooks/dataflow-analyze.kiro.hook` (modified hook definition)
- `.kiro/hooks/etl-data-consistency.kiro.hook` (new hook definition)
- `export/ETL-Connectors/feiqiu/REPORTS/context_handoff_task2.md` (background context)
- `scripts/ops/_env_paths.py` (shared path utility)
- `scripts/ops/blackbox_test_report.py` (reference for mapping tables and report patterns)
- `scripts/ops/field_level_report.py` (reference for the "more detailed" report the user wants)
USER QUERIES(most recent first):
1. Here is the background info: `export\ETL-Connectors\feiqiu\REPORTS\context_handoff_task2.md`. Modify a hook ("Data Flow Structure Analysis"). Requirements: - Whitelist/skip: add the DWD dimension-table SCD2 fields, and the nested `siteProfile` info in the API. - Refactor the handling logic for all whitelisted and skipped fields: all of these fields must still be checked normally and must never be skipped during checks. The earlier use of the word "跳过" (skip) was imprecise; after this change do not use that word, use only "白名单" (whitelist). The whitelist's effect: for each item under report section "1.1 API↔ODS↔DWD field comparison differences" and under section "3. Per-table details", whitelist fields are not placed in the detailed tables, with a note that details are not expanded because of the whitelist or another specific reason. Add a new hook (name it yourself) that works similarly to "Data Flow Structure Analysis". Requirements: 1) Get the last successfully executed ETL task. 2) Use black-box means to produce a sampling report similar to `export\ETL-Connectors\feiqiu\REPORTS\field_level_report_20260220_233443.md`, but more detailed (see below). 3) For this task, check every field of every table across API - ODS - DWD - DWS/INDEX, comparing each table and field up and down the data flow (actual database values and actual JSON values) to see whether any data fails to match in this task, and report and enumerate any data differences.
2. Output:
Command executed successfully with no output.
Exit Code: 0
3. Output:
Command executed successfully with no output.
Exit Code: 0
4. (identical in content to user query 1 above)
---
METADATA:
The previous conversation had 2 messages.
INSTRUCTIONS:
Continue working until the user query has been fully addressed. Do not ask for clarification - proceed with the work based on the context provided.
IMPORTANT: you need to read the files listed in the "Files to read" section
```