Compare commits

6 Commits
main ... dev

Author SHA1 Message Date
Neo  90fb63feaf  Tidy up SQL comments   2025-12-13 08:26:09 +08:00
Neo  0ab040b9fb  Tidy up the project    2025-12-09 05:43:04 +08:00
Neo  0c29bd41f8  Tidy up the project    2025-12-09 05:42:57 +08:00
Neo  561c640700  DWD finished           2025-12-09 04:57:05 +08:00
Neo  f301cc1fd5  Update assorted files  2025-12-06 00:17:42 +08:00
Neo  6f1d163a99  DWD docs confirmed     2025-12-05 18:57:20 +08:00
57 changed files with 40337 additions and 5252 deletions

File diff suppressed because it is too large.

README.md

@@ -1,78 +1,57 @@
- # Billiard-Hall ETL System
- Data collection and ingestion for billiard-hall store operations: pull orders, payments, members, inventory, and more from upstream APIs, land them in ODS first, then clean them into fact/dimension tables, with run tracking, incremental cursors, data-quality checks, and test scaffolding.
- ## Core features
- - **Two-stage pipeline**: raw retention in ODS plus cleaned DWD/fact tables, supporting replay and reruns.
- - **Task registration and scheduling**: `TaskRegistry` manages task codes; `ETLScheduler` owns cursors, run records, and failure isolation.
- - **Shared foundation**: configuration (defaults + `.env` + CLI overrides), an API client with pagination/retry, a database wrapper with batch upsert, SCD2 dimension handling, quality checks.
- - **Testing and replay**: ONLINE/OFFLINE mode switching; `run_tests.py`/`test_presets.py` support parameterized tests; `MANUAL_INGEST` re-ingests archived JSON into ODS.
- - **Installable**: `setup.py` / `entry_point` provide the `etl-billiards` command, or run `python -m cli.main` directly.
- ## Repository layout (excerpt)
- - `etl_billiards/config`: defaults, environment-variable parsing, config loading.
- - `etl_billiards/api`: HTTP client with built-in retry/pagination.
- - `etl_billiards/database`: connection management, batch upsert.
- - `etl_billiards/tasks`: business tasks (ORDERS, PAYMENTS, ...), ODS tasks, DWD tasks, manual replay; `base_task.py`/`base_dwd_task.py` provide the templates.
- - `etl_billiards/loaders`: fact/dimension/ODS loaders; `scd/` holds SCD2.
- - `etl_billiards/orchestration`: scheduler, task registry, cursors, run tracking.
- - `etl_billiards/scripts`: test runner, database connectivity probe, preset test commands.
- - `etl_billiards/tests`: unit/integration tests and offline JSON archives.
- ## Supported task codes
- - **Fact/dimension**: `ORDERS`, `PAYMENTS`, `REFUNDS`, `INVENTORY_CHANGE`, `COUPON_USAGE`, `MEMBERS`, `ASSISTANTS`, `PRODUCTS`, `TABLES`, `PACKAGES_DEF`, `TOPUPS`, `TABLE_DISCOUNT`, `ASSISTANT_ABOLISH`, `LEDGER`, `TICKET_DWD`, `PAYMENTS_DWD`, `MEMBERS_DWD`.
- - **Raw ODS collection**: `ODS_ORDER_SETTLE`, `ODS_TABLE_USE`, `ODS_ASSISTANT_LEDGER`, `ODS_ASSISTANT_ABOLISH`, `ODS_GOODS_LEDGER`, `ODS_PAYMENT`, `ODS_REFUND`, `ODS_COUPON_VERIFY`, `ODS_MEMBER`, `ODS_MEMBER_CARD`, `ODS_PACKAGE`, `ODS_INVENTORY_STOCK`, `ODS_INVENTORY_CHANGE`.
- - **Auxiliary**: `MANUAL_INGEST` (replays archived JSON into ODS).
- ## Quick start
- 1. **Requirements**: Python 3.10+, PostgreSQL. Run commands from the `etl_billiards/` directory.
- 2. **Install dependencies**
- ```bash
- cd etl_billiards
- pip install -r requirements.txt
- # development mode: pip install -e .
- ```
- 3. **Configure `.env`**
- ```bash
- cp .env.example .env
- # key entries
- PG_DSN=postgresql://user:pwd@host:5432/LLZQ
- API_BASE=https://api.example.com
- API_TOKEN=your_token
- STORE_ID=2790685415443269
- EXPORT_ROOT=/path/to/export
- LOG_ROOT=/path/to/logs
- ```
- Configuration precedence: defaults < environment variables/.env < CLI arguments.
- 4. **Run tasks**
- ```bash
- # run the default task set
- python -m cli.main
- # pick tasks as needed (comma-separated)
- python -m cli.main --tasks ODS_ORDER_SETTLE,ORDERS,PAYMENTS
- # dry-run example (no transaction commit)
- python -m cli.main --tasks ORDERS --dry-run
- # Windows batch
- ..\run_etl.bat --tasks PAYMENTS
- ```
- 5. **Check the output**: log and export directories are set by `LOG_ROOT` and `EXPORT_ROOT`; run tracking and cursor records are written to the `etl_admin.*` tables.
- ## Data and run flow
- - The CLI parses arguments → `AppConfig.load()` assembles the config → `ETLScheduler` creates the DB/API/cursor/run-tracker dependencies.
- - The scheduler instantiates tasks by code, reads and advances cursors, and persists run records.
- - Task template: determine the time window → call the API / read ODS → parse and validate → loader batch upsert/SCD2 → quality check → commit and write back the cursor.
- ## Testing and replay
- - Unit/integration tests: `pytest` or `python scripts/run_tests.py --suite online`.
- - Preset combinations: `python scripts/run_tests.py --preset offline_realdb` (see `scripts/test_presets.py`).
- - Offline mode: `TEST_MODE=OFFLINE TEST_JSON_ARCHIVE_DIR=... pytest tests/unit/test_etl_tasks_offline.py`.
- - Database connectivity: `python scripts/test_db_connection.py --dsn postgresql://... --query "SELECT 1"`.
- ## Other notes
- - `.env.example` lists every common setting; `config/defaults.py` records defaults and task-window configuration.
- - `loaders/ods/generic.py` lands data in ODS given just primary-key and column definitions; `tasks/manual_ingest_task.py` quickly re-ingests archived JSON into the matching ODS tables.
- - To add a task, implement it under `tasks/` and register it in `orchestration/task_registry.py` to reuse the scheduling machinery.
- ## Candidates for trimming/archiving
- - Drafts, old backups, and debug scripts under `tmp/` and `tmp/etl_billiards_misc/` are reference-only; nothing at runtime depends on them.
- - The repo root keeps only the essentials (README, requirements, run_etl.*, .env/.env.example); other temporary files have been moved to tmp.
+ # Feiqiu ETL System (ODS → DWD)
+ ETL for store operations: pulls (or offline-ingests) upstream JSON, lands it in ODS first, then cleans and loads DWD (SCD2 dimensions, incremental facts), and emits a quality-check report.
+ ## Quick run (offline sample JSON)
+ 1) Environment: Python 3.10+, PostgreSQL; key `.env` entries: `PG_DSN=postgresql://local-Python:Neo-local-1991125@100.64.0.4:5432/LLZQ-test`, `INGEST_SOURCE_DIR=C:\dev\LLTQ\export\test-json-doc`.
+ 2) Install dependencies:
+ ```bash
+ cd etl_billiards
+ pip install -r requirements.txt
+ ```
+ 3) One-shot ODS → DWD → quality check:
+ ```bash
+ python -m etl_billiards.cli.main --tasks INIT_ODS_SCHEMA,INIT_DWD_SCHEMA --pipeline-flow INGEST_ONLY
+ python -m etl_billiards.cli.main --tasks MANUAL_INGEST --pipeline-flow INGEST_ONLY --ingest-source "C:\dev\LLTQ\export\test-json-doc"
+ python -m etl_billiards.cli.main --tasks DWD_LOAD_FROM_ODS --pipeline-flow INGEST_ONLY
+ python -m etl_billiards.cli.main --tasks DWD_QUALITY_CHECK --pipeline-flow INGEST_ONLY
+ # report: etl_billiards/reports/dwd_quality_report.json
+ ```
+ ## Directory and file roles
+ - Repo root: `etl_billiards/` main code; `requirements.txt` dependencies; `run_etl.sh/.bat` launch scripts; `.env/.env.example` configuration; `tmp/` drafts/debugging/backups.
+ - etl_billiards/ main tree:
+   - `config/`: `defaults.py` default values, `env_parser.py` parses .env, `settings.py` unified config loading.
+   - `api/`: `client.py` HTTP requests, retries, pagination.
+   - `database/`: `connection.py` connection wrapper, `operations.py` batch upsert; DDL in `schema_ODS_doc.sql` / `schema_dwd_doc.sql`.
+   - `tasks/`: business tasks:
+     - `init_schema_task.py`: INIT_ODS_SCHEMA / INIT_DWD_SCHEMA;
+     - `manual_ingest_task.py`: sample JSON → ODS;
+     - `dwd_load_task.py`: ODS → DWD (mappings, SCD2/incremental facts);
+     - other tasks added as needed.
+   - `loaders/`: ODS/DWD/SCD2 loader implementations.
+   - `scd/`: `scd2_handler.py` maintains SCD2 dimension history.
+   - `quality/`: quality checkers (row-count/amount reconciliation).
+   - `orchestration/`: `scheduler.py` scheduling; `task_registry.py` task registration; `run_tracker.py` run records.
+   - `scripts/`: rebuild/test/connectivity tools.
+   - `docs/`: `ods_to_dwd_mapping.md` mapping notes, `ods_sample_json.md` sample-JSON notes, `dwd_quality_check.md` quality-check notes.
+   - `reports/`: quality output (e.g. `dwd_quality_report.json`).
+   - `tests/`: unit/integration tests; `utils/`: shared utilities.
+   - `backups/` (if present): backups of key files.
+ ## Business flow and file relationships
+ 1) Scheduling entry: `cli/main.py` parses the CLI → `orchestration/scheduler.py` creates tasks per `task_registry.py` → initializes the DB/API/Config context.
+ 2) ODS: `init_schema_task.py` runs `schema_ODS_doc.sql` to create tables; `manual_ingest_task.py` reads JSON from `INGEST_SOURCE_DIR` and batch-upserts it into ODS.
+ 3) DWD: `init_schema_task.py` runs `schema_dwd_doc.sql` to create tables; `dwd_load_task.py` cleans ODS into DWD per `TABLE_MAP`/`FACT_MAPPINGS` (dimensions via SCD2 in `scd/scd2_handler.py`, facts loaded incrementally by time/watermark).
+ 4) Quality: the quality task reads ODS/DWD, compares row counts and amounts, and writes `reports/dwd_quality_report.json`.
+ 5) Configuration: `config/defaults.py` + `.env` + CLI arguments are layered (see the sketch after this diff); online HTTP goes through `api/client.py`, DB access through `database/connection.py`.
+ 6) Docs: `docs/ods_to_dwd_mapping.md` records the field mappings; `docs/ods_sample_json.md` describes the sample data structures for side-by-side debugging.
+ ## Current status (2025-12-09)
+ - The sample JSON is fully ingested; DWD row counts match ODS.
+ - The category dimension is flattened (top + sub categories): `dim_goods_category` has 26 rows (category_level/leaf populated).
+ - Most remaining nulls come from empty source data; confirm upstream availability before backfilling.
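
The configuration layering referenced in the new README (defaults < environment variables/.env < CLI arguments) can be pictured with a minimal sketch. This is illustrative only; the project's real loader is `config/settings.py::AppConfig.load()`, and the key names here are assumptions:

```python
# Illustrative three-layer config merge (defaults < .env/env vars < CLI);
# the real implementation is config/settings.py::AppConfig.load().
import os

DEFAULTS = {"pg_dsn": "", "store_id": None, "pipeline_flow": "FULL"}
ENV_KEYS = {"pg_dsn": "PG_DSN", "store_id": "STORE_ID", "pipeline_flow": "PIPELINE_FLOW"}

def load_config(cli_args: dict) -> dict:
    cfg = dict(DEFAULTS)                                  # layer 1: defaults
    for key, env_name in ENV_KEYS.items():
        if os.environ.get(env_name):
            cfg[key] = os.environ[env_name]               # layer 2: environment/.env
    cfg.update({k: v for k, v in cli_args.items() if v is not None})  # layer 3: CLI
    return cfg
```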

README_FULL.md (new file)

@@ -0,0 +1,216 @@
# Feiqiu ETL System (ODS → DWD): Detailed Edition
> This is the project's detailed reference. It tracks the current code and covers the ODS tasks, DWD loading, quality checks, and development/extension notes.
---
## 1. Project overview
ETL for store operations: collects orders, payments, members, inventory, and more from the upstream API or from offline JSON, lands the data in **ODS** first, then cleans and loads **DWD** (SCD2 dimensions, incremental facts), and emits a quality-check report. The project uses a modular, layered architecture (config, API, database, loader/SCD, quality, orchestration, CLI, tests), all driven through the CLI.
---
## 2. Quick start (offline sample JSON)
**Requirements**: Python 3.10+, PostgreSQL; key `.env` entries:
- `PG_DSN=postgresql://local-Python:Neo-local-1991125@100.64.0.4:5432/LLZQ-test`
- `INGEST_SOURCE_DIR=C:\dev\LLTQ\export\test-json-doc`
**Install dependencies**
```bash
cd etl_billiards
pip install -r requirements.txt
```
**One-shot ODS → DWD → quality check (offline replay)**
```bash
# initialize ODS + DWD
python -m etl_billiards.cli.main --tasks INIT_ODS_SCHEMA,INIT_DWD_SCHEMA --pipeline-flow INGEST_ONLY
# ingest the sample JSON into ODS (INGEST_SOURCE_DIR in .env can override)
python -m etl_billiards.cli.main --tasks MANUAL_INGEST --pipeline-flow INGEST_ONLY --ingest-source "C:\dev\LLTQ\export\test-json-doc"
# load DWD from ODS
python -m etl_billiards.cli.main --tasks DWD_LOAD_FROM_ODS --pipeline-flow INGEST_ONLY
# quality-check report
python -m etl_billiards.cli.main --tasks DWD_QUALITY_CHECK --pipeline-flow INGEST_ONLY
# report output: etl_billiards/reports/dwd_quality_report.json
```
> The steps can also run individually:
> - schema only: `python -m etl_billiards.cli.main --tasks INIT_ODS_SCHEMA`
> - ODS ingest only: `python -m etl_billiards.cli.main --tasks MANUAL_INGEST`
> - DWD load only: `python -m etl_billiards.cli.main --tasks INIT_DWD_SCHEMA,DWD_LOAD_FROM_ODS`
---
## 3. Configuration and paths
- Sample data directory: `C:\dev\LLTQ\export\test-json-doc` (overridable via `INGEST_SOURCE_DIR` in `.env`).
- Log/export directories: `LOG_ROOT` and `EXPORT_ROOT` (in `.env`).
- Report: `etl_billiards/reports/dwd_quality_report.json`.
- DDL: `etl_billiards/database/schema_ODS_doc.sql` and `etl_billiards/database/schema_dwd_doc.sql`.
- Task registration: `etl_billiards/orchestration/task_registry.py` (INIT_ODS_SCHEMA, MANUAL_INGEST, INIT_DWD_SCHEMA, DWD_LOAD_FROM_ODS, and DWD_QUALITY_CHECK are enabled by default).
**Security note**: keep database credentials in `.env` or a managed secret store, and use a least-privilege account in production.
---
## 4. Directory structure and key files
- Repo root: `etl_billiards/` main code; `requirements.txt` dependencies; `run_etl.sh/.bat` launch scripts; `.env/.env.example` configuration; `tmp/` draft/debug archive.
- `config/`: `defaults.py` default values, `env_parser.py` parses .env, `settings.py` AppConfig unified loading.
- `api/`: `client.py` HTTP requests, retries, pagination.
- `database/`: `connection.py` connection wrapper; `operations.py` batch upsert; DDL SQL (ODS/DWD).
- `tasks/`:
  - `init_schema_task.py` (INIT_ODS_SCHEMA/INIT_DWD_SCHEMA);
  - `manual_ingest_task.py` (sample JSON → ODS);
  - `dwd_load_task.py` (ODS → DWD mappings, SCD2/incremental facts);
  - other tasks added as needed.
- `loaders/`: ODS/DWD/SCD2 loader implementations.
- `scd/`: `scd2_handler.py` maintains SCD2 dimension history.
- `quality/`: quality checkers (row-count/amount reconciliation).
- `orchestration/`: `scheduler.py` scheduling; `task_registry.py` registration; `run_tracker.py` run records; `cursor_manager.py` watermark management.
- `scripts/`: rebuild/test/connectivity tools.
- `docs/`: `ods_to_dwd_mapping.md` mapping notes; `ods_sample_json.md` sample-JSON notes; `dwd_quality_check.md` quality-check notes.
- `reports/`: quality output (e.g. `dwd_quality_report.json`).
- `tests/`: unit/integration tests; `utils/`: shared utilities; `backups/`: backups (if present).
---
## 5. Architecture and flow
Execution chain (control flow):
1) The CLI (`cli/main.py`) parses arguments → builds the AppConfig → initializes logging/DB connections;
2) The orchestration layer (`scheduler.py`) instantiates tasks from the registry in `task_registry.py` and sets up the run_uuid, cursor (watermark), and context;
3) The task base-class template:
   - get the time window/watermark (cursor_manager);
   - fetch data: online mode calls `api/client.py` (pagination, retries supported); offline mode reads JSON files directly;
   - parse and validate: type conversion, required-field checks (task-internal parse/validate);
   - load: loaders (`loaders/`) perform batch upsert/SCD2/incremental writes (on top of `database/operations.py`);
   - quality check (when needed): the quality module compares row counts, amounts, and so on;
   - update the watermark and run record (`run_tracker.py`), then commit or roll back the transaction.
Data flow and dependencies:
- Configuration: `config/defaults.py` + `.env` + CLI arguments layer into the AppConfig.
- API access: `api/client.py` handles pagination/retries; offline ingest reads files directly.
- DB access: `database/connection.py` provides the connection context; `operations.py` does batch upsert/paged writes (see the upsert sketch below).
- ODS: `manual_ingest_task.py` reads JSON → ODS tables (keeping payload/source/timestamps).
- DWD: `dwd_load_task.py` selects fields from ODS per `TABLE_MAP`/`FACT_MAPPINGS`; dimensions go through SCD2 (`scd/scd2_handler.py`), facts load incrementally (field expressions such as JSON ->> and CAST are supported).
- Quality: the `quality` module (or the related task) compares ODS/DWD row counts, amounts, and so on, and writes to `reports/`.
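
A minimal sketch of the batch-upsert pattern used by the loaders. It assumes psycopg2 and a table with a declared primary key; the helper name and signature are hypothetical, the real logic lives in `database/operations.py`:

```python
# Minimal batch-upsert sketch (hypothetical helper; the project's real code
# is in database/operations.py). Rows are written in pages; on PK conflict,
# the non-key columns are updated from the incoming row.
from psycopg2.extras import execute_values

def batch_upsert(conn, table, columns, pk_cols, rows, page_size=500):
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in pk_cols
    )
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES %s "
        f"ON CONFLICT ({', '.join(pk_cols)}) DO UPDATE SET {updates}"
    )
    with conn.cursor() as cur:
        execute_values(cur, sql, rows, page_size=page_size)
```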
---
## 6. ODS → DWD strategy
1. ODS keeps the raw record: source primary key, payload, time/source metadata.
2. DWD cleans: dimensions via SCD2, facts incrementally by time/watermark; field types, units, and enums standardized, with traceability fields preserved.
3. Unified business keys: site_id, member_id, table_id, order_settle_id, order_trade_no, and friends share one naming scheme.
4. No over-aggregation: DWD stays at detail/lightly-cleaned grain; aggregation belongs in DWS/reports.
5. De-nesting: arrays explode into child tables/rows; recurring profiles are lifted into dimensions.
6. Long-term evolution: prefer adding columns/tables over reworking existing structures. (An SCD2 sketch follows this list.)
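
The SCD2 handling lives in `scd/scd2_handler.py`; a minimal sketch of the usual close-and-insert pattern follows, with hypothetical table and attribute names (`dim_member` exists per the quality report, but `member_name`/`member_level` are illustrative):

```python
# Minimal SCD2 close-and-insert sketch (hypothetical columns; the project's
# real logic is in scd/scd2_handler.py). On attribute change: close the
# current row, then insert a new current version.
CLOSE_SQL = """
UPDATE billiards_dwd.dim_member
   SET valid_to = %(loaded_at)s, is_current = FALSE
 WHERE member_id = %(member_id)s
   AND is_current
   AND (member_name, member_level) IS DISTINCT FROM (%(member_name)s, %(member_level)s);
"""

INSERT_SQL = """
INSERT INTO billiards_dwd.dim_member
        (member_id, member_name, member_level, valid_from, valid_to, is_current)
SELECT %(member_id)s, %(member_name)s, %(member_level)s, %(loaded_at)s, NULL, TRUE
 WHERE NOT EXISTS (                      -- skip when an identical current row exists
        SELECT 1 FROM billiards_dwd.dim_member
         WHERE member_id = %(member_id)s AND is_current);
"""

def scd2_upsert(cur, row):
    cur.execute(CLOSE_SQL, row)    # closes the old version only if attributes changed
    cur.execute(INSERT_SQL, row)   # inserts a new current version if none remains
```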
---
## 7. Common CLI invocations
```bash
# run every registered task
python -m etl_billiards.cli.main
# run specific tasks
python -m etl_billiards.cli.main --tasks INIT_ODS_SCHEMA,MANUAL_INGEST
# override the DSN
python -m etl_billiards.cli.main --pg-dsn "postgresql://user:pwd@host:5432/db"
# override the API
python -m etl_billiards.cli.main --api-base "https://api.example.com" --api-token "..."
# dry run (no writes)
python -m etl_billiards.cli.main --dry-run --tasks DWD_LOAD_FROM_ODS
```
---
## 8. Testing (ONLINE / OFFLINE)
- `TEST_MODE=ONLINE`: calls the real API; full E/T/L.
- `TEST_MODE=OFFLINE`: reads offline JSON from `TEST_JSON_ARCHIVE_DIR`; Transform + Load only.
- `TEST_DB_DSN`: when set, integration tests hit a real database; otherwise an in-memory/temporary one is used.
Examples:
```bash
TEST_MODE=ONLINE pytest tests/unit/test_etl_tasks_online.py
TEST_MODE=OFFLINE TEST_JSON_ARCHIVE_DIR=tests/source-data-doc pytest tests/unit/test_etl_tasks_offline.py
python scripts/test_db_connection.py --dsn postgresql://user:pwd@host:5432/db --query "SELECT 1"
```
---
## 9. Development and extension
- New task: subclass BaseTask under `tasks/`, implement `get_task_code`/`execute`, and register it in `orchestration/task_registry.py`.
- New loader/checker: follow `loaders/` and `quality/`, reusing the batch-upsert/quality interfaces.
- Configuration: `config/defaults.py` + `.env` + CLI layering; new settings must be declared in both defaults and env_parser (sketched below).
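
A hypothetical sketch of declaring one new setting in both layers, as the last bullet requires; the real structures in `config/defaults.py` and `config/env_parser.py` may be shaped differently:

```python
# Hypothetical two-layer declaration of a setting (structure is illustrative).
import os

# config/defaults.py: give the setting a default value
DEFAULTS = {"pipeline": {"ingest_source_dir": r"C:\dev\LLTQ\export\test-json-doc"}}

# config/env_parser.py: map the environment variable onto the config key
def apply_env(cfg: dict, env=os.environ) -> dict:
    if env.get("INGEST_SOURCE_DIR"):
        cfg["pipeline"]["ingest_source_dir"] = env["INGEST_SOURCE_DIR"]
    return cfg
```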
---
## 10. Bringing ODS tasks online
- Task seeding script: `etl_billiards/database/seed_ods_tasks.sql` (replace store_id, then run `psql "$PG_DSN" -f ...`).
- Confirm the required ODS tasks are enabled in `etl_admin.etl_task`.
- Offline replay: `scripts/rebuild_ods_from_json` (if present) rebuilds ODS from local JSON.
- Unit tests: `pytest etl_billiards/tests/unit/test_ods_tasks.py`.
---
## 11. ODS table overview (data paths)
| ODS table | API path | Data list path |
| ------------------------------------ | ------------------------------------------------- | ----------------------------- |
| assistant_accounts_master | /PersonnelManagement/SearchAssistantInfo | data.assistantInfos |
| assistant_service_records | /AssistantPerformance/GetOrderAssistantDetails | data.orderAssistantDetails |
| assistant_cancellation_records | /AssistantPerformance/GetAbolitionAssistant | data.abolitionAssistants |
| goods_stock_movements | /GoodsStockManage/QueryGoodsOutboundReceipt | data.queryDeliveryRecordsList |
| goods_stock_summary | /TenantGoods/GetGoodsStockReport | data |
| group_buy_packages | /PackageCoupon/QueryPackageCouponList | data.packageCouponList |
| group_buy_redemption_records | /Site/GetSiteTableUseDetails | data.siteTableUseDetailsList |
| member_profiles | /MemberProfile/GetTenantMemberList | data.tenantMemberInfos |
| member_balance_changes | /MemberProfile/GetMemberCardBalanceChange | data.tenantMemberCardLogs |
| member_stored_value_cards | /MemberProfile/GetTenantMemberCardList | data.tenantMemberCards |
| payment_transactions | /PayLog/GetPayLogListPage | data |
| platform_coupon_redemption_records | /Promotion/GetOfflineCouponConsumePageList | data |
| recharge_settlements | /Site/GetRechargeSettleList | data.settleList |
| refund_transactions | /Order/GetRefundPayLogList | data |
| settlement_records | /Site/GetAllOrderSettleList | data.settleList |
| settlement_ticket_details | /Order/GetOrderSettleTicketNew | full JSON |
| site_tables_master | /Table/GetSiteTables | data.siteTables |
| stock_goods_category_tree | /TenantGoodsCategory/QueryPrimarySecondaryCategory| data.goodsCategoryList |
| store_goods_master | /TenantGoods/GetGoodsInventoryList | data.orderGoodsList |
| store_goods_sales_records | /TenantGoods/GetGoodsSalesList | data.orderGoodsLedgers |
| table_fee_discount_records | /Site/GetTaiFeeAdjustList | data.taiFeeAdjustInfos |
| table_fee_transactions | /Site/GetSiteTableOrderDetails | data.siteTableUseDetailsList |
| tenant_goods_master | /TenantGoods/QueryTenantGoods | data.tenantGoodsList |
> Full field-level mappings live in `docs/` and the ODS/DWD DDL; a helper for the dotted list paths above is sketched after this table.
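
A hypothetical helper illustrating the "data list path" column: walk a dotted path such as `data.assistantInfos` into a parsed API response. The project's actual ingest code may differ:

```python
# Walk a dotted path (e.g. "data.assistantInfos") into a parsed response.
# Illustrative only; not the project's actual extraction code.
from typing import Any

def extract_list(payload: dict, dotted_path: str) -> list[Any]:
    node: Any = payload
    for key in dotted_path.split("."):
        node = node[key]                 # KeyError means an unexpected response shape
    return node if isinstance(node, list) else [node]

# extract_list(resp, "data.assistantInfos") -> rows for assistant_accounts_master
```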
---
## 12. DWD dimension and modeling essentials
1. One grain, one business key: a DWD table carries exactly one business event/grain; never mix grains.
2. Understand the business chain before modeling; don't mechanically build a table per JSON list.
3. Unified business keys: site_id, member_id, table_id, order_settle_id, order_trade_no, and friends must be named consistently.
4. Keep detail, don't over-aggregate; aggregation belongs in DWS/reports.
5. Clean and standardize while preserving traceability fields (source PK, time, amount, payload).
6. De-nest and decouple: explode arrays into child rows, lift recurring profiles into dimensions (see the sketch after this list).
7. Evolve by adding columns/tables, minimizing disruption to existing structures.
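
A hypothetical de-nesting sketch for point 6: splitting a settlement JSON (a header with a nested goods array) into one header row and child rows. Field names are illustrative, not the project's actual schema:

```python
# Illustrative header/line split; field names are assumptions.
def explode_settlement(doc: dict) -> tuple[dict, list[dict]]:
    head = {
        "order_settle_id": doc["id"],
        "site_id": doc["siteProfile"]["id"],     # recurring profile -> dimension
        "settle_time": doc["settleTime"],
    }
    lines = [
        {
            "order_settle_id": doc["id"],        # FK back to the header grain
            "goods_id": g["goodsId"],
            "ledger_amount": g["ledgerAmount"],
        }
        for g in doc.get("goodsLedgers", [])     # array -> child rows
    ]
    return head, lines
```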
---
## 13. Current status (2025-12-09)
- The sample JSON is fully ingested; DWD row counts match ODS.
- The category dimension is flattened (top + sub categories): `dim_goods_category` has 26 rows (category_level/leaf populated).
- Some empty fields are empty in the source; confirm upstream before backfilling.
---
## 14. Candidates for trimming/archiving
- Drafts, old backups, and debug scripts in `tmp/` and `tmp/etl_billiards_misc/` are reference-only and do not affect runs.
- The repo root keeps only the essentials (README, requirements, run_etl.*, .env/.env.example); other temporary files have moved to tmp.
---
## 15. FAQ
- Empty fields: if a mapping exists and the source column is non-null yet the field stays empty, re-check the upstream JSON; SCD2 dimensions merge on full loads.
- DSN/paths: make sure `PG_DSN` and `INGEST_SOURCE_DIR` in `.env` match the local setup.
- Adding tasks: implement under `tasks/` and register in `task_registry.py`; update the DDL and mappings when needed.
- Permissions/runtime: check the network and account permissions; scripts need execute permission (e.g. `chmod +x run_etl.sh`).

.env

@@ -1,53 +1,49 @@
- # Database configuration (real database)
+ # -*- coding: utf-8 -*-
+ # File notes: ETL environment variables (read by config/env_parser.py) for the database connection, directories, and run parameters.
+ # Database DSN (config/env_parser.py -> db.dsn); required by every task
  PG_DSN=postgresql://local-Python:Neo-local-1991125@100.64.0.4:5432/LLZQ-test
+ # Database connect timeout in seconds (config/env_parser.py -> db.connect_timeout_sec)
  PG_CONNECT_TIMEOUT=10
+ # To split the DSN: PG_HOST=... PG_PORT=... PG_NAME=... PG_USER=... PG_PASSWORD=...
- # API configuration (fill in when using the real API)
- API_BASE=https://api.example.com
- API_TOKEN=your_token_here
- # API_TIMEOUT=20
- # API_PAGE_SIZE=200
- # API_RETRY_MAX=3
- # Application configuration
+ # Store/tenant id (config/env_parser.py -> app.store_id); used in task scheduling records
  STORE_ID=2790685415443269
- # TIMEZONE=Asia/Taipei
- # SCHEMA_OLTP=billiards
- # SCHEMA_ETL=etl_admin
- # Path configuration
- EXPORT_ROOT=C:\dev\LLTQ\export\JSON
+ # Timezone identifier (config/env_parser.py -> app.timezone)
+ TIMEZONE=Asia/Taipei
+ # API base URL (config/env_parser.py -> api.base_url); used by FETCH tasks
+ API_BASE=https://api.example.com
+ # API auth token (config/env_parser.py -> api.token); used by FETCH tasks
+ API_TOKEN=your_token_here
+ # API request timeout in seconds (config/env_parser.py -> api.timeout_sec)
+ API_TIMEOUT=20
+ # API page size (config/env_parser.py -> api.page_size)
+ API_PAGE_SIZE=200
+ # API max retry attempts (config/env_parser.py -> api.retries.max_attempts)
+ API_RETRY_MAX=3
+ # Log root (config/env_parser.py -> io.log_root); Init/task runs write logs here
  LOG_ROOT=C:\dev\LLTQ\export\LOG
- FETCH_ROOT=
- INGEST_SOURCE_DIR=
- WRITE_PRETTY_JSON=false
- PGCLIENTENCODING=utf8
- # ETL configuration
+ # JSON export root (config/env_parser.py -> io.export_root); FETCH output and INIT staging
+ EXPORT_ROOT=C:\dev\LLTQ\export\JSON
+ # Local output directory for FETCH mode (config/env_parser.py -> pipeline.fetch_root)
+ FETCH_ROOT=C:\dev\LLTQ\export\JSON
+ # Local ingest JSON directory (config/env_parser.py -> pipeline.ingest_source_dir); used by MANUAL_INGEST/INGEST_ONLY
+ INGEST_SOURCE_DIR=C:\dev\LLTQ\export\test-json-doc
+ # Pretty-printed JSON output switch (config/env_parser.py -> io.write_pretty_json)
+ WRITE_PRETTY_JSON=false
+ # Pipeline flow: FULL / FETCH_ONLY / INGEST_ONLY (config/env_parser.py -> pipeline.flow)
+ PIPELINE_FLOW=FULL
+ # Explicit task list, comma-separated, overrides the default (config/env_parser.py -> run.tasks)
+ # RUN_TASKS=INIT_ODS_SCHEMA,MANUAL_INGEST
+ # Window/compensation parameters (config/env_parser.py -> run.*)
  OVERLAP_SECONDS=120
  WINDOW_BUSY_MIN=30
  WINDOW_IDLE_MIN=180
  IDLE_START=04:00
  IDLE_END=16:00
  ALLOW_EMPTY_RESULT_ADVANCE=true
- # Cleaning configuration
- LOG_UNKNOWN_FIELDS=true
- HASH_ALGO=sha1
- STRICT_NUMERIC=true
- ROUND_MONEY_SCALE=2
- # Test/offline mode (ONLINE recommended when wired to the real database)
- TEST_MODE=ONLINE
- TEST_JSON_ARCHIVE_DIR=tests/source-data-doc
- TEST_JSON_TEMP_DIR=/tmp/etl_billiards_json_tmp
- # Test database
- TEST_DB_DSN=postgresql://local-Python:Neo-local-1991125@100.64.0.4:5432/LLZQ-test
- # ODS rebuild script settings (development use)
- JSON_DOC_DIR=C:\dev\LLTQ\export\test-json-doc
- ODS_INCLUDE_FILES=
- ODS_DROP_SCHEMA_FIRST=true
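
A minimal sketch of how the window parameters above (OVERLAP_SECONDS, WINDOW_BUSY_MIN, WINDOW_IDLE_MIN, IDLE_START/IDLE_END) can combine into one incremental window; the helper is hypothetical, the real logic lives in the cursor/scheduler code:

```python
# Hypothetical window computation; assumes the idle window does not cross
# midnight (IDLE_START=04:00 < IDLE_END=16:00, as configured above).
from datetime import datetime, time, timedelta

def next_window(last_end: datetime, now: datetime,
                overlap_s=120, busy_min=30, idle_min=180,
                idle_start=time(4, 0), idle_end=time(16, 0)):
    """Start a bit before the previous end (the overlap guards against late
    rows); use the wider window during the idle hours."""
    idle = idle_start <= now.time() < idle_end
    span = timedelta(minutes=idle_min if idle else busy_min)
    start = last_end - timedelta(seconds=overlap_s)
    return start, min(start + span, now)
```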


@@ -1,837 +0,0 @@
# Billiard-Hall ETL System (Modular Edition): Merged Documentation
This document merges the former separate docs (`INDEX.md`, `QUICK_START.md`, `ARCHITECTURE.md`, `MIGRATION_GUIDE.md`, `PROJECT_STRUCTURE.md`, `README.md`, and so on). It keeps only material about the **current project itself**: project description, directory layout, architecture, data and control flow, migration and extension guides; it omits change history and refactoring narrative.
---
## 1. Project overview
The billiard-hall ETL system is a store-operations ETL project that pulls orders, payments, members, and other data from external business APIs; parses, validates, applies SCD2 handling and quality checks; writes to a PostgreSQL database; and supports incremental sync and task-run tracking.
The system uses a modular, layered architecture. Core traits:
- Modular directory layout (config, database, API, models, loaders, SCD2, quality checks, orchestration, tasks, CLI, utilities, tests) with clear layering.
- Complete configuration management: defaults + environment variables + CLI arguments with multi-layer overrides.
- Reusable database access layer (connection management, batch-upsert wrapper).
- API client with retry and pagination support.
- Type-safe data parsing and validation modules.
- SCD2 dimension history management.
- Data-quality checks (e.g. balance-consistency check).
- An orchestration layer for unified scheduling, cursor management, and run tracking.
- A CLI entry point that manages task execution, with task filtering, dry-run mode, and more.
---
## 2. Quick start
### 2.1 Environment
- Python: 3.10+ recommended
- Database: PostgreSQL
- OS: Windows / Linux / macOS
```bash
# enter the project directory after cloning/downloading
cd etl_billiards/
ls -la
```
The top of the directory tree looks like this (details in chapter 4):
- `config/` - configuration management
- `database/` - database access
- `api/` - API client
- `tasks/` - ETL task implementations
- `cli/` - command-line entry point
- `docs/` - technical documentation
### 2.2 Install dependencies
```bash
pip install -r requirements.txt
```
Key dependencies (see the actual `requirements.txt`):
- `psycopg2-binary`: PostgreSQL driver
- `requests`: HTTP client
- `python-dateutil`: date/time handling
- `tzdata`: timezone data
### 2.3 Configure environment variables
Copy and edit the template:
```bash
cp .env.example .env
# edit .env with your preferred editor
```
`.env` example (minimal configuration):
```bash
# database
PG_DSN=postgresql://user:password@localhost:5432/....
# API
API_BASE=https://api.example.com
API_TOKEN=your_token_here
# store/application
STORE_ID=2790685415443269
TIMEZONE=Asia/Taipei
# directories
EXPORT_ROOT=/path/to/export
LOG_ROOT=/path/to/logs
```
> Defaults for every setting live in `config/defaults.py`; the effective configuration is the three-layer merge of defaults + environment variables + CLI arguments.
### 2.4 Run your first task
Run via the CLI entry point:
```bash
# run all tasks
python -m cli.main
# orders task only
python -m cli.main --tasks ORDERS
# orders + payments
python -m cli.main --tasks ORDERS,PAYMENTS
# Windows script
run_etl.bat --tasks ORDERS
# Linux / macOS script
./run_etl.sh --tasks ORDERS
```
### 2.5 Inspect the results
- Log directory, set by `LOG_ROOT`, for example:
```bash
ls -la C:\dev\LLTQ\export\LOG/
```
- Export directory, set by `EXPORT_ROOT`, for example:
```bash
ls -la C:\dev\LLTQ\export\JSON/
```
---
## 3. Common commands and development tools
### 3.1 Common CLI commands
```bash
# run all tasks
python -m cli.main
# run specific tasks
python -m cli.main --tasks ORDERS,PAYMENTS,MEMBERS
# use a custom database
python -m cli.main --pg-dsn "postgresql://user:password@host:5432/db"
# use a custom API endpoint
python -m cli.main --api-base "https://api.example.com" --api-token "..."
# dry run (no database writes)
python -m cli.main --dry-run --tasks ORDERS
```
### 3.2 IDE / code-quality tooling example (VSCode)
`.vscode/settings.json` example:
```json
{
"python.linting.enabled": true,
"python.linting.pylintEnabled": true,
"python.formatting.provider": "black",
"python.testing.pytestEnabled": true
}
```
Formatting and linting:
```bash
pip install black isort pylint
black .
isort .
pylint etl_billiards/
```
### 3.3 Testing
```bash
# install test dependencies (as needed)
pip install pytest pytest-cov
# run all tests
pytest
# unit tests only
pytest tests/unit/
# generate a coverage report
pytest --cov=. --cov-report=html
```
Test examples (see the actual project):
- `tests/unit/test_config.py`: configuration unit tests
- `tests/unit/test_parsers.py`: parser unit tests
- `tests/integration/test_database.py`: database integration tests
#### 3.3.1 Test modes (ONLINE / OFFLINE)
- With `TEST_MODE=ONLINE` (the default), tests simulate the live API and execute the full E/T/L.
- With `TEST_MODE=OFFLINE`, tests read archived JSON from `TEST_JSON_ARCHIVE_DIR` and do Transform + Load only, which is handy for verifying that local archives still replay.
- `TEST_JSON_ARCHIVE_DIR`: offline JSON archive directory (e.g. `tests/source-data-doc` or a CI snapshot).
- `TEST_JSON_TEMP_DIR`: temporary JSON output directory for tests, isolating each run's data.
- `TEST_DB_DSN`: optional; when set, unit tests connect to this PostgreSQL DSN and really write to the database; when empty, tests use an in-memory fake DB and need no database.
Example commands:
```bash
# online mode over all tasks
TEST_MODE=ONLINE pytest tests/unit/test_etl_tasks_online.py
# offline mode over all tasks, using archived JSON
TEST_MODE=OFFLINE TEST_JSON_ARCHIVE_DIR=tests/source-data-doc pytest tests/unit/test_etl_tasks_offline.py
# script with composed parameters (example: online + orders cases only)
python scripts/run_tests.py --suite online --mode ONLINE --keyword ORDERS
# script connecting to a real test database, replaying offline mode
python scripts/run_tests.py --suite offline --mode OFFLINE --db-dsn postgresql://user:pwd@localhost:5432/testdb
# preset commands from the "command repository"
python scripts/run_tests.py --preset offline_realdb
python scripts/run_tests.py --list-presets  # inspect or customize scripts/test_presets.py
```
#### 3.3.2 Scripted test combinations (`run_tests.py` / `test_presets.py`)
- `scripts/run_tests.py` is the unified pytest entry point: it adds the project root to `sys.path` and offers `--suite online/offline/integration`, `--tests` (custom paths), `--mode`, `--db-dsn`, `--json-archive`, `--json-temp`, `--keyword/-k`, `--pytest-args`, and `--env KEY=VALUE`, which compose freely like building blocks;
- `--preset foo` reads the `PRESETS["foo"]` entry from `scripts/test_presets.py` and overlays it on the current command; `--list-presets` and `--dry-run` let you review or merely print the command;
- running `python scripts/test_presets.py` directly executes the presets listed in `AUTO_RUN_PRESETS` in order; with `--preset x --dry-run` it only prints the corresponding command.
`test_presets.py` acts as a command repository. Each preset is a dict; the common fields are:
| Field | Purpose |
| ---------------------------- | ------------------------------------------------------------------ |
| `suite` | reuse the built-in `run_tests.py` suites (online/offline/integration); multi-select |
| `tests` | append arbitrary pytest paths, e.g. `tests/unit/test_config.py` |
| `mode` | override `TEST_MODE` (ONLINE / OFFLINE) |
| `db_dsn` | override `TEST_DB_DSN` to hit a real test database |
| `json_archive` / `json_temp` | configure the offline JSON archive and temp directories |
| `keyword` | maps to `pytest -k` keyword filtering |
| `pytest_args` | extra pytest arguments, e.g. `-vv --maxfail=1` |
| `env` | extra environment variables, e.g. `["STORE_ID=123"]` |
| `preset_meta` | descriptive text for the scenario |
Example: the `offline_realdb` preset sets `TEST_MODE=OFFLINE`, points the archive at `tests/source-data-doc`, and connects to the test database via `db_dsn`. Run `python scripts/run_tests.py --preset offline_realdb` or `python scripts/test_presets.py --preset offline_realdb` to reuse that combination and keep local, CI, and production replay scripts consistent.
#### 3.3.3 Quick database connectivity check
`python scripts/test_db_connection.py` is the lightest PostgreSQL connectivity probe: it uses `TEST_DB_DSN` by default (or `--dsn`), connects, and runs `SELECT 1 AS ok` (customizable via `--query`). Typical uses:
```bash
# read TEST_DB_DSN from .env / environment variables
python scripts/test_db_connection.py
# specify the DSN ad hoc and inspect the task configuration table
python scripts/test_db_connection.py --dsn postgresql://user:pwd@host:5432/.... --query "SELECT count(*) FROM etl_admin.etl_task"
```
An exit code of 0 means the connection and query succeeded; on non-zero, start with the database part of chapter 8 ("Troubleshooting": network, firewall, account permissions) before running the full ETL.
---
## 4. Project structure and files
### 4.1 Directory tree
```text
etl_billiards/
├── README.md                  # project overview and usage
├── MIGRATION_GUIDE.md         # migration guide from the legacy version
├── requirements.txt           # Python dependencies
├── setup.py                   # install configuration
├── .env.example               # environment-variable template
├── .gitignore                 # Git ignore rules
├── run_etl.sh                 # Linux/Mac launch script
├── run_etl.bat                # Windows launch script
├── config/                    # configuration management
│   ├── __init__.py
│   ├── defaults.py            # default config values
│   ├── env_parser.py          # environment-variable parser
│   └── settings.py            # main configuration class
├── database/                  # database access layer
│   ├── __init__.py
│   ├── connection.py          # connection management
│   └── operations.py          # batch-operation wrappers
├── api/                       # HTTP API client
│   ├── __init__.py
│   └── client.py              # API client (retry + pagination)
├── models/                    # data-model layer
│   ├── __init__.py
│   ├── parsers.py             # type parsers
│   └── validators.py          # data validators
├── loaders/                   # data-loader layer
│   ├── __init__.py
│   ├── base_loader.py         # loader base class
│   ├── dimensions/            # dimension-table loaders
│   │   ├── __init__.py
│   │   └── member.py          # member dimension loader
│   └── facts/                 # fact-table loaders
│       ├── __init__.py
│       ├── order.py           # order fact loader
│       └── payment.py         # payment record loader
├── scd/                       # SCD2 handling layer
│   ├── __init__.py
│   └── scd2_handler.py        # SCD2 history handler
├── quality/                   # data-quality layer
│   ├── __init__.py
│   ├── base_checker.py        # checker base class
│   └── balance_checker.py     # balance-consistency checker
├── orchestration/             # ETL orchestration layer
│   ├── __init__.py
│   ├── scheduler.py           # ETL scheduler
│   ├── task_registry.py       # task registry (factory pattern)
│   ├── cursor_manager.py      # cursor manager
│   └── run_tracker.py         # run tracker
├── tasks/                     # ETL task layer
│   ├── __init__.py
│   ├── base_task.py           # task base class (template method)
│   ├── orders_task.py         # orders ETL task
│   ├── payments_task.py       # payments ETL task
│   └── members_task.py        # members ETL task
├── cli/                       # command-line interface layer
│   ├── __init__.py
│   └── main.py                # CLI entry point
├── utils/                     # utilities
│   ├── __init__.py
│   └── helpers.py             # shared helper functions
├── tests/                     # tests
│   ├── __init__.py
│   ├── unit/                  # unit tests
│   │   ├── __init__.py
│   │   ├── test_config.py
│   │   └── test_parsers.py
│   ├── testdata_json/         # test JSON files for cleaning/ingest
│   │   └── XX.json
│   └── integration/           # integration tests
│       ├── __init__.py
│       └── test_database.py
└── docs/                      # documentation
    └── ARCHITECTURE.md        # architecture design doc
```
### 4.2 Module responsibilities
- **config/**
  - Unified config entry point; defaults, environment variables, and CLI arguments layer over each other.
- **database/**
  - Wraps the PostgreSQL connection and batch operations (insert, update, upsert, and so on).
- **api/**
  - Uniform wrapper over the upstream business API with retries, pagination, and timeout control.
- **models/**
  - Type parsers (timestamps, amounts, integers, ...) and business-level validators.
- **loaders/**
  - Loading logic for fact and dimension tables (batch upsert, write-result statistics, ...).
- **scd/**
  - SCD2 history management for dimension-style data (validity ranges, version flags, ...).
- **quality/**
  - Quality-check strategies, e.g. balance consistency and record-count alignment.
- **orchestration/**
  - Task scheduling, task registration, cursor management (incremental windows), run tracking.
- **tasks/**
  - Concrete business tasks (orders, payments, members, ...) wrapping the full fetch → process → write → record flow.
- **cli/**
  - Command-line entry point; parses arguments and starts the scheduler.
- **utils/**
  - Miscellaneous helper functions.
- **tests/**
  - Unit and integration tests.
---
## 5. Architecture and flow
### 5.1 Layered architecture diagram
```text
┌─────────────────────────────────────┐
│            CLI layer                │ <- cli/main.py
└─────────────┬───────────────────────┘
┌─────────────▼───────────────────────┐
│        Orchestration layer          │ <- orchestration/
│   (Scheduler, TaskRegistry, ...)    │
└─────────────┬───────────────────────┘
┌─────────────▼───────────────────────┐
│            Tasks layer              │ <- tasks/
│   (OrdersTask, PaymentsTask, ...)   │
└───┬─────────┬─────────┬─────────────┘
    │         │         │
    ▼         ▼         ▼
┌────────┐ ┌─────┐ ┌──────────┐
│Loaders │ │ SCD │ │ Quality  │ <- loaders/, scd/, quality/
└────────┘ └─────┘ └──────────┘
      ┌───────▼────────┐
      │     Models     │ <- models/
      └───────┬────────┘
      ┌───────▼────────┐
      │   API client   │ <- api/
      └───────┬────────┘
      ┌───────▼────────┐
      │Database access │ <- database/
      └───────┬────────┘
      ┌───────▼────────┐
      │     Config     │ <- config/
      └────────────────┘
```
### 5.2 Layer responsibilities (current design)
- **CLI layer (`cli/`)**
  - Parses command-line arguments (task list, dry-run, config overrides, and so on).
  - Initializes config and logging, then hands off to the orchestration layer.
- **Orchestration layer (`orchestration/`)**
  - `scheduler.py`: selects the tasks to run from config and CLI arguments and controls ordering and parallelism.
  - `task_registry.py`: the task registry; creates task instances by task code (factory pattern).
  - `cursor_manager.py`: manages incremental cursors (time windows / id cursors).
  - `run_tracker.py`: records each run's status, statistics, and errors.
- **Task layer (`tasks/`)**
  - `base_task.py`: defines the task execution template (template-method pattern): get the window, call upstream, parse/validate, write, update the cursor.
  - `orders_task.py` / `payments_task.py` / `members_task.py`: concrete task logic (orders, payments, members).
- **Loader / SCD / quality layers**
  - `loaders/`: per-target-table upsert/insert/update logic.
  - `scd/scd2_handler.py`: SCD2 history management for dimension tables.
  - `quality/`: data-quality checks such as balance reconciliation.
- **Model layer (`models/`)**
  - `parsers.py`: type conversion (string → timestamp, Decimal, int, ...).
  - `validators.py`: field- and record-level validation.
- **API layer (`api/client.py`)**
  - Wraps HTTP calls; handles retries, timeouts, and pagination.
- **Database layer (`database/`)**
  - Manages connections and contexts.
  - Provides batch insert/update/upsert interfaces.
- **Config layer (`config/`)**
  - Defines configuration defaults.
  - Parses environment variables with type conversion.
  - Exposes a single unified configuration object.
### 5.3 Design patterns in use
- Factory: task registration/creation (`TaskRegistry`).
- Template method: task execution flow (`BaseTask`).
- Strategy: different loaders/checkers implement different strategies.
- Dependency injection: tasks receive `db`, `api`, `config`, and friends through their constructors.
### 5.4 Data and control flow
Overall flow:
1. The CLI parses arguments and loads the configuration.
2. The Scheduler builds the database connection, API client, and other dependencies.
3. The Scheduler walks the task configuration, fetching task classes from the `TaskRegistry` and instantiating them.
4. Every task runs the same template:
   - read the cursor / time window;
   - call the API to fetch data (paginated where needed);
   - parse and validate the data;
   - write to the database through loaders (fact tables / dimensions / SCD2);
   - run quality checks;
   - update the cursor and run record.
5. After all tasks finish, connections are released and the process exits.
### 5.5 Error-handling strategy
- A single task's failure does not affect the other tasks (a sketch of the pattern follows).
- Database errors automatically roll back the current transaction.
- Failed API requests retry per configuration; once retries are exhausted, the error is recorded and the task stops.
- All errors are written to the logs and the run-tracking table for later diagnosis.
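
A minimal sketch of that failure-isolation pattern: each task runs in its own try/except, the transaction is rolled back on error, and the loop moves on. The `tracker.finish` name is hypothetical; the real loop lives in `scheduler.py`:

```python
# Illustrative failure isolation; method names are assumptions.
import logging

def run_all(tasks, db, tracker):
    for task in tasks:
        try:
            result = task.execute()
            tracker.finish(task.get_task_code(), "SUCCESS", result)
        except Exception as exc:              # one task's failure stays local
            db.rollback()                     # drop the partial transaction
            tracker.finish(task.get_task_code(), "FAIL", {"error": str(exc)})
            logging.exception("task %s failed", task.get_task_code())
```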
### 5.6 Two-stage ODS + DWD strategy (new)
To support backfill/replay and the later DWD wide-table build, the project adds a `billiards_ods` schema plus a dedicated set of ODS tasks/loaders:
- **ODS tables**: `billiards_ods.ods_order_settle`, `ods_table_use_detail`, `ods_assistant_ledger`, `ods_assistant_abolish`, `ods_goods_ledger`, `ods_payment`, `ods_refund`, `ods_coupon_verify`, `ods_member`, `ods_member_card`, `ods_package_coupon`, `ods_inventory_stock`, `ods_inventory_change`. Every record stores `store_id + source primary key + payload JSON + fetched_at + source_endpoint`, among other metadata.
- **Generic loader**: `loaders/ods/generic.py::GenericODSLoader` wraps the `INSERT ... ON CONFLICT ...` and batch-write logic; callers only supply the column names and PK columns (see the sketch after this section).
- **ODS tasks**: `tasks/ods_tasks.py` defines a set of tasks via `OdsTaskSpec` (`ODS_ORDER_SETTLE`, `ODS_PAYMENT`, `ODS_ASSISTANT_LEDGER`, and so on), auto-registered in the `TaskRegistry` and runnable directly with `python -m cli.main --tasks ODS_ORDER_SETTLE,ODS_PAYMENT`.
- **Two-stage pipeline**:
  1. Stage 1 (ODS): call the API / read archived offline JSON and write the raw records into the ODS tables, keeping pagination, fetch-time, and source-file metadata.
  2. Stage 2 (DWD/DIM): the order, payment, coupon, and other fact tasks will switch to reading payloads from ODS, parse/validate them, and write into the `billiards.fact_*` and `dim_*` tables, avoiding repeat pulls from the upstream API.
> The new unit test `tests/unit/test_ods_tasks.py` covers the ingest paths for `ODS_ORDER_SETTLE` and `ODS_PAYMENT` and can serve as the template for extending other ODS tasks.
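
A sketch in the spirit of `GenericODSLoader`: building the upsert statement from just a table name, column list, and PK columns. The function name and signature are assumptions, not the real class API:

```python
# Illustrative SQL builder for a generic ODS upsert; the real implementation
# is loaders/ods/generic.py::GenericODSLoader.
def build_ods_upsert(table: str, columns: list[str], pk_cols: list[str]) -> str:
    cols = ", ".join(columns)
    params = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c not in pk_cols)
    return (
        f"INSERT INTO billiards_ods.{table} ({cols}) VALUES ({params}) "
        f"ON CONFLICT ({', '.join(pk_cols)}) DO UPDATE SET {updates}"
    )

# build_ods_upsert("ods_payment",
#                  ["store_id", "pay_id", "payload", "fetched_at"],
#                  ["store_id", "pay_id"])
```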
---
## 6. Migration guide (from the legacy script to the current project)
This chapter explains how to migrate from the old single-file script (e.g. `task_merged.py`) to the current modular project. It is usage guidance for the current project and does not dwell on historical comparisons.
### 6.1 Core function mapping
| Legacy function / class | New location | Purpose |
| --------------------- | ----------------------------------------------------- | ---------- |
| `DEFAULTS` dict | `config/defaults.py` | config defaults |
| `build_config()` | `config/settings.py::AppConfig.load()` | config loading |
| `Pg` class | `database/connection.py::DatabaseConnection` | DB connection |
| `http_get_json()` | `api/client.py::APIClient.get()` | API request |
| `paged_get()` | `api/client.py::APIClient.get_paginated()` | paginated request |
| `parse_ts()` | `models/parsers.py::TypeParser.parse_timestamp()` | timestamp parsing |
| `upsert_fact_order()` | `loaders/facts/order.py::OrderLoader.upsert_orders()` | order loading |
| `scd2_upsert()` | `scd/scd2_handler.py::SCD2Handler.upsert()` | SCD2 handling |
| `run_task_orders()` | `tasks/orders_task.py::OrdersTask.execute()` | orders task |
| `main()` | `cli/main.py::main()` | main entry |
### 6.2 Typical migration steps
1. **Configuration**
   - Move settings hard-coded in `DEFAULTS` or inside the script into `.env` and `config/defaults.py`.
   - Load configuration uniformly through `AppConfig.load()`.
2. **Parallel-run validation**
```bash
# legacy script
python task_merged.py --tasks ORDERS
# new project
python -m cli.main --tasks ORDERS
```
Compare the exported tables and logs from both versions to confirm consistency.
3. **Migrating custom logic**
   - Custom cleaning logic from the old script → the matching `loaders/` module or task class.
   - Custom tasks → implement under `tasks/` and register in `task_registry`.
   - Custom API calls → extend `api/client.py` or wrap a dedicated service class.
4. **Gradual cutover**
   - First run both in parallel in the test environment.
   - Then switch production tasks to the new version step by step.
---
## 7. Development and extension guide (current project)
### 7.1 Adding a new task
1. Create the task class under `tasks/`:
```python
from .base_task import BaseTask

class MyTask(BaseTask):
    def get_task_code(self) -> str:
        return "MY_TASK"

    def execute(self) -> dict:
        # 1. get the time window
        window_start, window_end, _ = self._get_time_window()
        # 2. fetch data from the API
        records, _ = self.api.get_paginated(...)
        # 3. parse / validate
        parsed = [self._parse(r) for r in records]
        # 4. load the data (MyLoader is your own loader class)
        loader = MyLoader(self.db)
        inserted, updated, _ = loader.upsert(parsed)
        # 5. commit and return the result
        self.db.commit()
        return self._build_result("SUCCESS", {
            "inserted": inserted,
            "updated": updated,
        })
```
2. Register it in `orchestration/task_registry.py`:
```python
from tasks.my_task import MyTask
default_registry.register("MY_TASK", MyTask)
```
3. Enable it in the task configuration table (example):
```sql
INSERT INTO etl_admin.etl_task (task_code, store_id, enabled)
VALUES ('MY_TASK', 123456, TRUE);
```
### 7.2 Adding a new loader
```python
from loaders.base_loader import BaseLoader

class MyLoader(BaseLoader):
    def upsert(self, records: list) -> tuple:
        # RETURNING (xmax = 0) distinguishes fresh inserts from updates
        sql = "INSERT INTO table_name (...) VALUES (...) ON CONFLICT (...) DO UPDATE SET ... RETURNING (xmax = 0) AS inserted"
        inserted, updated = self.db.batch_upsert_with_returning(
            sql, records, page_size=self._batch_size()
        )
        return (inserted, updated, 0)
```
### 7.3 Adding a new quality checker
1. Implement the checker under `quality/`, inheriting from `base_checker.py`.
2. Invoke it from the task or scheduling flow to verify data after the write.
### 7.4 Extending parsing and validation
- Add new type-parsing methods in `models/parsers.py`.
- Add new rules in `models/validators.py` (enum checks, cross-field checks, and so on).
---
## 8. Troubleshooting
### 8.1 Database connection failure
```text
Error: could not connect to server
```
What to check:
- Verify `PG_DSN` and the related database settings.
- Confirm the database service is up and the network is reachable.
### 8.2 API request timeouts
```text
Error: requests.exceptions.Timeout
```
What to check:
- Verify the `API_BASE` address and network connectivity.
- Raise the timeout and retry counts in the configuration as appropriate.
### 8.3 Module import errors
```text
Error: ModuleNotFoundError
```
What to check:
- Make sure you run from the project root (the directory containing the `etl_billiards/` package).
- Or install in editable mode with `pip install -e .`.
### 8.4 Permission problems
```text
Error: Permission denied
```
What to check:
- The script lacks execute permission: `chmod +x run_etl.sh`.
- On Windows, run as administrator or adjust permissions on the log/export directories.
---
## 9. Pre-run checklist
Before a real run, confirm:
- [ ] Python 3.10+ is installed.
- [ ] `pip install -r requirements.txt` has been run.
- [ ] `.env` is configured correctly (database, API, store id, paths, and so on).
- [ ] The PostgreSQL database is reachable.
- [ ] The API service is reachable and the credentials are valid.
- [ ] The `LOG_ROOT` and `EXPORT_ROOT` directories exist and are writable.
---
## 10. Notes
- This document merges the former quick-start, project-structure, architecture, and migration guides and serves as the project's single reference.
- If it needs splitting again, cut along the chapters, e.g. "Quick start", "Architecture", "Migration guide", "Development and extension".
## 11. Run/debug modes
- Production keeps only "task mode": registered tasks (ETL/ODS) run via the scheduler/CLI; no debug scripts.
- Helper scripts usable in development/debugging (remove or disable before go-live):
  - `python -m etl_billiards.scripts.rebuild_ods_from_json`: rebuilds `billiards_ods` from a local JSON directory, for offline initialization/verification. Environment variables: `PG_DSN` (required), `JSON_DOC_DIR` (optional, default `C:\dev\LLTQ\export\test-json-doc`), `INCLUDE_FILES` (comma-separated filenames), `DROP_SCHEMA_FIRST` (default true).
- If a script must stay in production, document its purpose and disable conditions in the ops runbook to prevent misuse.
## 12. Bringing ODS tasks online
- Task seeding: `etl_billiards/database/seed_ods_tasks.sql` lists the currently enabled ODS tasks. Replace its `store_id` with the real store and run:
```
psql "$PG_DSN" -f etl_billiards/database/seed_ods_tasks.sql
```
`ON CONFLICT` keeps enabled=true and avoids duplicates.
- Scheduling: confirm the required ODS tasks are enabled in `etl_admin.etl_task` (task codes are in the seed script); the scheduler or CLI `--tasks` can then invoke them.
- Offline backfill: development can use `rebuild_ods_from_json` to initialize ODS from sample JSON (use with care in production); it dedupes on `(source_file, record_index)` by default.
- Testing: `pytest etl_billiards/tests/unit/test_ods_tasks.py` covers the core ODS tasks; set `ETL_SKIP_DOTENV=1` during tests to skip reading the local .env.
## 13. ODS table mapping overview
| ODS table | API path | Data list path |
| ------------------------------------ | ---------------------------------------------------- | ----------------------------- |
| `assistant_accounts_master` | `/PersonnelManagement/SearchAssistantInfo` | data.assistantInfos |
| `assistant_service_records` | `/AssistantPerformance/GetOrderAssistantDetails` | data.orderAssistantDetails |
| `assistant_cancellation_records` | `/AssistantPerformance/GetAbolitionAssistant` | data.abolitionAssistants |
| `goods_stock_movements` | `/GoodsStockManage/QueryGoodsOutboundReceipt` | data.queryDeliveryRecordsList |
| `goods_stock_summary` | `/TenantGoods/GetGoodsStockReport` | data |
| `group_buy_packages` | `/PackageCoupon/QueryPackageCouponList` | data.packageCouponList |
| `group_buy_redemption_records` | `/Site/GetSiteTableUseDetails` | data.siteTableUseDetailsList |
| `member_profiles` | `/MemberProfile/GetTenantMemberList` | data.tenantMemberInfos |
| `member_balance_changes` | `/MemberProfile/GetMemberCardBalanceChange` | data.tenantMemberCardLogs |
| `member_stored_value_cards` | `/MemberProfile/GetTenantMemberCardList` | data.tenantMemberCards |
| `payment_transactions` | `/PayLog/GetPayLogListPage` | data |
| `platform_coupon_redemption_records` | `/Promotion/GetOfflineCouponConsumePageList` | data |
| `recharge_settlements` | `/Site/GetRechargeSettleList` | data.settleList |
| `refund_transactions` | `/Order/GetRefundPayLogList` | data |
| `settlement_records` | `/Site/GetAllOrderSettleList` | data.settleList |
| `settlement_ticket_details` | `/Order/GetOrderSettleTicketNew` | (full raw JSON) |
| `site_tables_master` | `/Table/GetSiteTables` | data.siteTables |
| `stock_goods_category_tree` | `/TenantGoodsCategory/QueryPrimarySecondaryCategory` | data.goodsCategoryList |
| `store_goods_master` | `/TenantGoods/GetGoodsInventoryList` | data.orderGoodsList |
| `store_goods_sales_records` | `/TenantGoods/GetGoodsSalesList` | data.orderGoodsLedgers |
| `table_fee_discount_records` | `/Site/GetTaiFeeAdjustList` | data.taiFeeAdjustInfos |
| `table_fee_transactions` | `/Site/GetSiteTableOrderDetails` | data.siteTableUseDetailsList |
| `tenant_goods_master` | `/TenantGoods/QueryTenantGoods` | data.tenantGoodsList |
## 14. ODS environment variables / defaults
- `.env` / environment variables:
  - `JSON_DOC_DIR`: sample JSON directory for ODS (development/backfill).
  - `ODS_INCLUDE_FILES`: restrict the imported filenames (comma-separated, without .json).
  - `ODS_DROP_SCHEMA_FIRST`: true/false; whether to rebuild the schema first.
  - `ETL_SKIP_DOTENV`: set to 1 in tests/CI to skip reading the local .env.
- `ods` defaults in `config/defaults.py`:
  - `json_doc_dir`: `C:\dev\LLTQ\export\test-json-doc`
  - `include_files`: `""`
  - `drop_schema_first`: `True`
---
## 15. DWD dimensions and "business events"
1. A single, atomic grain
   - One DWD table carries exactly one business grain, for example:
     - one record = one checkout;
     - one record = one table-fee ledger entry;
     - one record = one assistant service session;
     - one record = one member balance change.
   - A table must not mix "order header" with "order line", nor hold part summary and part detail.
   - Once the grain is fixed, every column must match it:
     - the checkout header table should not carry per-item goods detail;
     - the goods-detail table should not carry order-level totals.
   - This is the single most important DWD rule.
2. Model the business process, not the JSON list
   - First map out the real business chain:
     - open / move / close table → table-fee ledger;
     - assistant joins the table → assistant service ledger / cancellation event;
     - ordering → goods sales ledger;
     - top-up / spend → balance change / recharge order;
     - checkout → settlement header + payment / refund ledgers;
     - group-buy / platform coupons → redemption ledger.
3. Explicit primary keys, unified foreign keys
   - Every DWD table needs a business primary key (even if it is just the API's id); never rely on database auto-increment.
   - Every field for the same concept must share one name and meaning:
     - store: always site_id, mapping to siteProfile.id;
     - member: always member_id, mapping to member_profiles.id (system_member_id gets its own column);
     - table: always table_id, mapping to site_tables_master.id;
     - settlement: always order_settle_id;
     - order: always order_trade_no, and so on.
   - Otherwise joining the tables later, in DWS or for AI, becomes painful.
4. Keep detail; don't over-aggregate
   - DWD fact tables stay at detail grain:
     - don't compute daily/weekly/monthly rollups in DWD; that is DWS's job;
     - don't collapse several events into one row (e.g. a table holding both daily rollups and single transactions).
   - When aggregation is needed, build DWS subject-area wide tables:
     - dws_member_day_profile, dws_site_day_summary, and so on.
   - DWD is responsible only for the fine-grained truth.
5. Clean and standardize, but stay traceable
   - Cleaning that DWD must do:
     - type conversion: string times → timestamps, amounts unified as decimal, booleans as 0/1;
     - unit normalization: seconds vs minutes, yuan vs fen, all unified;
     - enum standardization: status/type codes get fixed meanings in DWD, with enum dimension tables when necessary.
   - At the same time guarantee:
     - every DWD record traces back to ODS;
     - source-system primary keys are preserved;
     - original time/amount fields are preserved (never overwritten).
6. Flatten and de-nest
   - Typical JSON structure: pagination shell + header + detail array + assorted nested objects (siteProfile, tableProfile, goodsLedgers, ...).
   - The DWD rules:
     - drop the pagination shell;
     - split arrays into child tables (header table / line table);
     - lift recurring profiles into dimension tables (store, table, goods, member, ...).
   - The goal: every DWD table is a flat two-dimensional table with no nested JSON.
7. A stable, extensible model
   - Keep DWD structures as stable as possible; satisfy new needs by:
     - adding columns;
     - adding fact/dimension tables;
     - deriving metrics in DWS;
   - rather than repeatedly reworking existing DWD tables.
   - This also matters for feeding an LLM later: prompts and schema understanding should change as little as possible.

File diff suppressed because it is too large.

File diff suppressed because it is too large.


@@ -0,0 +1,105 @@
-- File notes: DDL for the etl_admin scheduling metadata (kept standalone so the init task can run it on its own).
-- Contains the task registry, cursor table, and run-record table, with column comments throughout.
CREATE SCHEMA IF NOT EXISTS etl_admin;
CREATE TABLE IF NOT EXISTS etl_admin.etl_task (
task_id BIGSERIAL PRIMARY KEY,
task_code TEXT NOT NULL,
store_id BIGINT NOT NULL,
enabled BOOLEAN DEFAULT TRUE,
cursor_field TEXT,
window_minutes_default INT DEFAULT 30,
overlap_seconds INT DEFAULT 120,
page_size INT DEFAULT 200,
retry_max INT DEFAULT 3,
params JSONB DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (task_code, store_id)
);
COMMENT ON TABLE etl_admin.etl_task IS 'Task registry: the task list the scheduler runs (matches the task codes in task_registry).';
COMMENT ON COLUMN etl_admin.etl_task.task_code IS 'Task code; must match the task code in the source.';
COMMENT ON COLUMN etl_admin.etl_task.store_id IS 'Store/tenant grain, separating multi-store execution.';
COMMENT ON COLUMN etl_admin.etl_task.enabled IS 'Whether this task is enabled.';
COMMENT ON COLUMN etl_admin.etl_task.cursor_field IS 'Incremental cursor field name (optional).';
COMMENT ON COLUMN etl_admin.etl_task.window_minutes_default IS 'Default time window in minutes.';
COMMENT ON COLUMN etl_admin.etl_task.overlap_seconds IS 'Window overlap in seconds, guarding against missed rows.';
COMMENT ON COLUMN etl_admin.etl_task.page_size IS 'Default page size.';
COMMENT ON COLUMN etl_admin.etl_task.retry_max IS 'Maximum API retry attempts.';
COMMENT ON COLUMN etl_admin.etl_task.params IS 'Task-level custom parameters (JSON).';
COMMENT ON COLUMN etl_admin.etl_task.created_at IS 'Creation time.';
COMMENT ON COLUMN etl_admin.etl_task.updated_at IS 'Last update time.';
CREATE TABLE IF NOT EXISTS etl_admin.etl_cursor (
cursor_id BIGSERIAL PRIMARY KEY,
task_id BIGINT NOT NULL REFERENCES etl_admin.etl_task(task_id) ON DELETE CASCADE,
store_id BIGINT NOT NULL,
last_start TIMESTAMPTZ,
last_end TIMESTAMPTZ,
last_id BIGINT,
last_run_id BIGINT,
extra JSONB DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (task_id, store_id)
);
COMMENT ON TABLE etl_admin.etl_cursor IS 'Task cursor table: the incremental window and last run per task/store.';
COMMENT ON COLUMN etl_admin.etl_cursor.task_id IS 'References etl_task.task_id.';
COMMENT ON COLUMN etl_admin.etl_cursor.store_id IS 'Store/tenant grain.';
COMMENT ON COLUMN etl_admin.etl_cursor.last_start IS 'Start of the previous window (including the overlap offset).';
COMMENT ON COLUMN etl_admin.etl_cursor.last_end IS 'End of the previous window.';
COMMENT ON COLUMN etl_admin.etl_cursor.last_id IS 'Largest primary key / cursor value processed last time (optional).';
COMMENT ON COLUMN etl_admin.etl_cursor.last_run_id IS 'Last run id, referencing etl_run.run_id.';
COMMENT ON COLUMN etl_admin.etl_cursor.extra IS 'Additional cursor info (JSON).';
COMMENT ON COLUMN etl_admin.etl_cursor.created_at IS 'Creation time.';
COMMENT ON COLUMN etl_admin.etl_cursor.updated_at IS 'Last update time.';
CREATE TABLE IF NOT EXISTS etl_admin.etl_run (
run_id BIGSERIAL PRIMARY KEY,
run_uuid TEXT NOT NULL,
task_id BIGINT NOT NULL REFERENCES etl_admin.etl_task(task_id) ON DELETE CASCADE,
store_id BIGINT NOT NULL,
status TEXT NOT NULL,
started_at TIMESTAMPTZ DEFAULT now(),
ended_at TIMESTAMPTZ,
window_start TIMESTAMPTZ,
window_end TIMESTAMPTZ,
window_minutes INT,
overlap_seconds INT,
fetched_count INT DEFAULT 0,
loaded_count INT DEFAULT 0,
updated_count INT DEFAULT 0,
skipped_count INT DEFAULT 0,
error_count INT DEFAULT 0,
unknown_fields INT DEFAULT 0,
export_dir TEXT,
log_path TEXT,
request_params JSONB DEFAULT '{}'::jsonb,
manifest JSONB DEFAULT '{}'::jsonb,
error_message TEXT,
extra JSONB DEFAULT '{}'::jsonb
);
COMMENT ON TABLE etl_admin.etl_run IS 'Run-record table: the window, status, counters, and log path of each task execution.';
COMMENT ON COLUMN etl_admin.etl_run.run_uuid IS 'Unique id of this scheduled run.';
COMMENT ON COLUMN etl_admin.etl_run.task_id IS 'References etl_task.task_id.';
COMMENT ON COLUMN etl_admin.etl_run.store_id IS 'Store/tenant grain.';
COMMENT ON COLUMN etl_admin.etl_run.status IS 'Run status (SUCC/FAIL/PARTIAL, etc.).';
COMMENT ON COLUMN etl_admin.etl_run.started_at IS 'Start time.';
COMMENT ON COLUMN etl_admin.etl_run.ended_at IS 'End time.';
COMMENT ON COLUMN etl_admin.etl_run.window_start IS 'Start of this run''s window.';
COMMENT ON COLUMN etl_admin.etl_run.window_end IS 'End of this run''s window.';
COMMENT ON COLUMN etl_admin.etl_run.window_minutes IS 'Window span in minutes.';
COMMENT ON COLUMN etl_admin.etl_run.overlap_seconds IS 'Window overlap in seconds.';
COMMENT ON COLUMN etl_admin.etl_run.fetched_count IS 'Records fetched/read.';
COMMENT ON COLUMN etl_admin.etl_run.loaded_count IS 'Records inserted.';
COMMENT ON COLUMN etl_admin.etl_run.updated_count IS 'Records updated.';
COMMENT ON COLUMN etl_admin.etl_run.skipped_count IS 'Records skipped.';
COMMENT ON COLUMN etl_admin.etl_run.error_count IS 'Records in error.';
COMMENT ON COLUMN etl_admin.etl_run.unknown_fields IS 'Unknown-field count from the cleaning stage.';
COMMENT ON COLUMN etl_admin.etl_run.export_dir IS 'Fetch/export directory.';
COMMENT ON COLUMN etl_admin.etl_run.log_path IS 'Log path.';
COMMENT ON COLUMN etl_admin.etl_run.request_params IS 'Request parameters (JSON).';
COMMENT ON COLUMN etl_admin.etl_run.manifest IS 'Run output manifest/statistics (JSON).';
COMMENT ON COLUMN etl_admin.etl_run.error_message IS 'Error message (on failure).';
COMMENT ON COLUMN etl_admin.etl_run.extra IS 'Reserved extension field.';
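
Given this DDL, recent runs can be inspected with a plain query; a minimal sketch assuming psycopg2 (the query follows directly from the tables above, the DSN is a placeholder):

```python
# Inspect the latest run records joined to their task codes.
import psycopg2

SQL = """
SELECT t.task_code, r.status, r.window_start, r.window_end,
       r.fetched_count, r.loaded_count, r.error_message
  FROM etl_admin.etl_run r
  JOIN etl_admin.etl_task t USING (task_id)
 ORDER BY r.started_at DESC
 LIMIT 20;
"""

with psycopg2.connect("postgresql://user:pwd@host:5432/db") as conn:
    with conn.cursor() as cur:
        cur.execute(SQL)
        for row in cur.fetchall():
            print(row)
```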

File diff suppressed because it is too large.

etl_billiards/database/seed_ods_tasks.sql

@@ -1,34 +1,35 @@
  -- Register the new ODS tasks into etl_admin.etl_task (replace store_id as needed)
  -- Usage (example):
  --   psql "$PG_DSN" -f etl_billiards/database/seed_ods_tasks.sql
  -- or run this file's contents directly in psql.
  WITH target_store AS (
      SELECT 2790685415443269::bigint AS store_id  -- TODO: replace with the real store_id
  ),
  task_codes AS (
      SELECT unnest(ARRAY[
-         'ODS_ASSISTANT_ACCOUNTS',
+         'assistant_accounts_master',
-         'ODS_ASSISTANT_LEDGER',
+         'assistant_service_records',
-         'ODS_ASSISTANT_ABOLISH',
+         'assistant_cancellation_records',
-         'ODS_INVENTORY_CHANGE',
+         'goods_stock_movements',
          'ODS_INVENTORY_STOCK',
          'ODS_PACKAGE',
          'ODS_GROUP_BUY_REDEMPTION',
          'ODS_MEMBER',
          'ODS_MEMBER_BALANCE',
-         'ODS_MEMBER_CARD',
+         'member_stored_value_cards',
          'ODS_PAYMENT',
          'ODS_REFUND',
-         'ODS_COUPON_VERIFY',
+         'platform_coupon_redemption_records',
-         'ODS_RECHARGE_SETTLE',
+         'recharge_settlements',
          'ODS_TABLES',
          'ODS_GOODS_CATEGORY',
          'ODS_STORE_GOODS',
-         'ODS_TABLE_DISCOUNT',
+         'table_fee_discount_records',
          'ODS_TENANT_GOODS',
          'ODS_SETTLEMENT_TICKET',
-         'ODS_ORDER_SETTLE'
+         'settlement_records',
+         'INIT_ODS_SCHEMA'
      ]) AS task_code
  )
  INSERT INTO etl_admin.etl_task (task_code, store_id, enabled)
@@ -36,4 +37,3 @@ SELECT t.task_code, s.store_id, TRUE
  FROM task_codes t CROSS JOIN target_store s
  ON CONFLICT (task_code, store_id) DO UPDATE
    SET enabled = EXCLUDED.enabled;


@@ -0,0 +1,52 @@
{
"source_counts": {
"assistant_accounts_master.json": 2,
"assistant_cancellation_records.json": 2,
"assistant_service_records.json": 2,
"goods_stock_movements.json": 2,
"goods_stock_summary.json": 161,
"group_buy_packages.json": 2,
"group_buy_redemption_records.json": 2,
"member_balance_changes.json": 2,
"member_profiles.json": 2,
"member_stored_value_cards.json": 2,
"payment_transactions.json": 200,
"platform_coupon_redemption_records.json": 200,
"recharge_settlements.json": 2,
"refund_transactions.json": 11,
"settlement_records.json": 2,
"settlement_ticket_details.json": 193,
"site_tables_master.json": 2,
"stock_goods_category_tree.json": 2,
"store_goods_master.json": 2,
"store_goods_sales_records.json": 2,
"table_fee_discount_records.json": 2,
"table_fee_transactions.json": 2,
"tenant_goods_master.json": 2
},
"ods_counts": {
"member_profiles": 199,
"member_balance_changes": 200,
"member_stored_value_cards": 200,
"recharge_settlements": 75,
"settlement_records": 200,
"assistant_cancellation_records": 15,
"assistant_accounts_master": 50,
"assistant_service_records": 200,
"site_tables_master": 71,
"table_fee_discount_records": 200,
"table_fee_transactions": 200,
"goods_stock_movements": 200,
"stock_goods_category_tree": 9,
"goods_stock_summary": 161,
"payment_transactions": 200,
"refund_transactions": 11,
"platform_coupon_redemption_records": 200,
"tenant_goods_master": 156,
"group_buy_packages": 17,
"group_buy_redemption_records": 200,
"settlement_ticket_details": 193,
"store_goods_master": 161,
"store_goods_sales_records": 200
}
}

etl_billiards/orchestration/task_registry.py

@@ -15,10 +15,14 @@ from tasks.table_discount_task import TableDiscountTask
  from tasks.assistant_abolish_task import AssistantAbolishTask
  from tasks.ledger_task import LedgerTask
  from tasks.ods_tasks import ODS_TASK_CLASSES
- from tasks.ticket_dwd_task import TicketDwdTask
  from tasks.manual_ingest_task import ManualIngestTask
  from tasks.payments_dwd_task import PaymentsDwdTask
  from tasks.members_dwd_task import MembersDwdTask
+ from tasks.init_schema_task import InitOdsSchemaTask
+ from tasks.init_dwd_schema_task import InitDwdSchemaTask
+ from tasks.dwd_load_task import DwdLoadTask
+ from tasks.ticket_dwd_task import TicketDwdTask
+ from tasks.dwd_quality_task import DwdQualityTask

  class TaskRegistry:
      """Task registration and factory."""
@@ -64,5 +68,9 @@ default_registry.register("TICKET_DWD", TicketDwdTask)
  default_registry.register("MANUAL_INGEST", ManualIngestTask)
  default_registry.register("PAYMENTS_DWD", PaymentsDwdTask)
  default_registry.register("MEMBERS_DWD", MembersDwdTask)
+ default_registry.register("INIT_ODS_SCHEMA", InitOdsSchemaTask)
+ default_registry.register("INIT_DWD_SCHEMA", InitDwdSchemaTask)
+ default_registry.register("DWD_LOAD_FROM_ODS", DwdLoadTask)
+ default_registry.register("DWD_QUALITY_CHECK", DwdQualityTask)
  for code, task_cls in ODS_TASK_CLASSES.items():
      default_registry.register(code, task_cls)
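
The registry above follows the factory pattern described in the docs; a minimal sketch of the shape (`register`/`create` and the constructor arguments are assumptions; check `task_registry.py` for the real API):

```python
# Illustrative registry shape; method names are assumptions.
class TaskRegistry:
    """Task registration and factory."""

    def __init__(self):
        self._tasks = {}

    def register(self, code, task_cls):
        self._tasks[code] = task_cls          # task code -> class

    def create(self, code, **deps):
        return self._tasks[code](**deps)      # instantiate with injected deps

default_registry = TaskRegistry()
# default_registry.register("DWD_LOAD_FROM_ODS", DwdLoadTask)
# task = default_registry.create("DWD_LOAD_FROM_ODS", db=db, api=api, config=cfg)
```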

etl_billiards/reports/dwd_quality_report.json

@@ -0,0 +1,692 @@
{
"generated_at": "2025-12-09T05:21:24.745244",
"tables": [
{
"dwd_table": "billiards_dwd.dim_site",
"ods_table": "billiards_ods.table_fee_transactions",
"count": {
"dwd": 1,
"ods": 200,
"diff": -199
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_site_ex",
"ods_table": "billiards_ods.table_fee_transactions",
"count": {
"dwd": 1,
"ods": 200,
"diff": -199
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_table",
"ods_table": "billiards_ods.site_tables_master",
"count": {
"dwd": 71,
"ods": 71,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_table_ex",
"ods_table": "billiards_ods.site_tables_master",
"count": {
"dwd": 71,
"ods": 71,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_assistant",
"ods_table": "billiards_ods.assistant_accounts_master",
"count": {
"dwd": 50,
"ods": 50,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_assistant_ex",
"ods_table": "billiards_ods.assistant_accounts_master",
"count": {
"dwd": 50,
"ods": 50,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_member",
"ods_table": "billiards_ods.member_profiles",
"count": {
"dwd": 199,
"ods": 199,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_member_ex",
"ods_table": "billiards_ods.member_profiles",
"count": {
"dwd": 199,
"ods": 199,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_member_card_account",
"ods_table": "billiards_ods.member_stored_value_cards",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "balance",
"dwd_sum": 31061.03,
"ods_sum": 31061.03,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dim_member_card_account_ex",
"ods_table": "billiards_ods.member_stored_value_cards",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "deliveryfeededuct",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dim_tenant_goods",
"ods_table": "billiards_ods.tenant_goods_master",
"count": {
"dwd": 156,
"ods": 156,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_tenant_goods_ex",
"ods_table": "billiards_ods.tenant_goods_master",
"count": {
"dwd": 156,
"ods": 156,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_store_goods",
"ods_table": "billiards_ods.store_goods_master",
"count": {
"dwd": 161,
"ods": 161,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_store_goods_ex",
"ods_table": "billiards_ods.store_goods_master",
"count": {
"dwd": 161,
"ods": 161,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_goods_category",
"ods_table": "billiards_ods.stock_goods_category_tree",
"count": {
"dwd": 26,
"ods": 9,
"diff": 17
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_groupbuy_package",
"ods_table": "billiards_ods.group_buy_packages",
"count": {
"dwd": 17,
"ods": 17,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dim_groupbuy_package_ex",
"ods_table": "billiards_ods.group_buy_packages",
"count": {
"dwd": 17,
"ods": 17,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_settlement_head",
"ods_table": "billiards_ods.settlement_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_settlement_head_ex",
"ods_table": "billiards_ods.settlement_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_table_fee_log",
"ods_table": "billiards_ods.table_fee_transactions",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "adjust_amount",
"dwd_sum": 1157.45,
"ods_sum": 1157.45,
"diff": 0.0
},
{
"column": "coupon_promotion_amount",
"dwd_sum": 11244.49,
"ods_sum": 11244.49,
"diff": 0.0
},
{
"column": "ledger_amount",
"dwd_sum": 18107.0,
"ods_sum": 18107.0,
"diff": 0.0
},
{
"column": "member_discount_amount",
"dwd_sum": 1149.19,
"ods_sum": 1149.19,
"diff": 0.0
},
{
"column": "real_table_charge_money",
"dwd_sum": 5705.06,
"ods_sum": 5705.06,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_table_fee_log_ex",
"ods_table": "billiards_ods.table_fee_transactions",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "fee_total",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "mgmt_fee",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "service_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "used_card_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_table_fee_adjust",
"ods_table": "billiards_ods.table_fee_discount_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "ledger_amount",
"dwd_sum": 20650.84,
"ods_sum": 20650.84,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_table_fee_adjust_ex",
"ods_table": "billiards_ods.table_fee_discount_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_store_goods_sale",
"ods_table": "billiards_ods.store_goods_sales_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "cost_money",
"dwd_sum": 22.3,
"ods_sum": 22.3,
"diff": 0.0
},
{
"column": "ledger_amount",
"dwd_sum": 4583.0,
"ods_sum": 4583.0,
"diff": 0.0
},
{
"column": "real_goods_money",
"dwd_sum": 3791.0,
"ods_sum": 3791.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_store_goods_sale_ex",
"ods_table": "billiards_ods.store_goods_sales_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "coupon_deduct_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "discount_money",
"dwd_sum": 792.0,
"ods_sum": 792.0,
"diff": 0.0
},
{
"column": "member_discount_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "option_coupon_deduct_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "option_member_discount_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "point_discount_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "point_discount_money_cost",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "push_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_assistant_service_log",
"ods_table": "billiards_ods.assistant_service_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "coupon_deduct_money",
"dwd_sum": 626.83,
"ods_sum": 626.83,
"diff": 0.0
},
{
"column": "ledger_amount",
"dwd_sum": 63251.37,
"ods_sum": 63251.37,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_assistant_service_log_ex",
"ods_table": "billiards_ods.assistant_service_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "manual_discount_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "member_discount_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "service_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_assistant_trash_event",
"ods_table": "billiards_ods.assistant_cancellation_records",
"count": {
"dwd": 15,
"ods": 15,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_assistant_trash_event_ex",
"ods_table": "billiards_ods.assistant_cancellation_records",
"count": {
"dwd": 15,
"ods": 15,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_member_balance_change",
"ods_table": "billiards_ods.member_balance_changes",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_member_balance_change_ex",
"ods_table": "billiards_ods.member_balance_changes",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "refund_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_groupbuy_redemption",
"ods_table": "billiards_ods.group_buy_redemption_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "coupon_money",
"dwd_sum": 12266.0,
"ods_sum": 12266.0,
"diff": 0.0
},
{
"column": "ledger_amount",
"dwd_sum": 12049.53,
"ods_sum": 12049.53,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_groupbuy_redemption_ex",
"ods_table": "billiards_ods.group_buy_redemption_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "assistant_promotion_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "assistant_service_promotion_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "goods_promotion_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "recharge_promotion_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "reward_promotion_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "table_service_promotion_money",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_platform_coupon_redemption",
"ods_table": "billiards_ods.platform_coupon_redemption_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "coupon_money",
"dwd_sum": 11956.0,
"ods_sum": 11956.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_platform_coupon_redemption_ex",
"ods_table": "billiards_ods.platform_coupon_redemption_records",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_recharge_order",
"ods_table": "billiards_ods.recharge_settlements",
"count": {
"dwd": 74,
"ods": 74,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_recharge_order_ex",
"ods_table": "billiards_ods.recharge_settlements",
"count": {
"dwd": 74,
"ods": 74,
"diff": 0
},
"amounts": []
},
{
"dwd_table": "billiards_dwd.dwd_payment",
"ods_table": "billiards_ods.payment_transactions",
"count": {
"dwd": 200,
"ods": 200,
"diff": 0
},
"amounts": [
{
"column": "pay_amount",
"dwd_sum": 10863.0,
"ods_sum": 10863.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_refund",
"ods_table": "billiards_ods.refund_transactions",
"count": {
"dwd": 11,
"ods": 11,
"diff": 0
},
"amounts": [
{
"column": "channel_fee",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "pay_amount",
"dwd_sum": -62186.0,
"ods_sum": -62186.0,
"diff": 0.0
}
]
},
{
"dwd_table": "billiards_dwd.dwd_refund_ex",
"ods_table": "billiards_ods.refund_transactions",
"count": {
"dwd": 11,
"ods": 11,
"diff": 0
},
"amounts": [
{
"column": "balance_frozen_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "card_frozen_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "refund_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
},
{
"column": "round_amount",
"dwd_sum": 0.0,
"ods_sum": 0.0,
"diff": 0.0
}
]
}
],
"note": "行数/金额核对,金额字段基于列名包含 amount/money/fee/balance 的数值列自动扫描。"
}
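
The report is plain JSON, so it can double as a release gate. A minimal sketch that fails the run on any drift, assuming the default report path and a half-cent tolerance for float-summed amounts (both are assumptions, not project conventions):

```python
# Fail fast when the quality report shows any row-count or amount drift.
import json
from pathlib import Path

report = json.loads(
    Path("etl_billiards/reports/dwd_quality_report.json").read_text(encoding="utf-8")
)
failures = []
for t in report["tables"]:
    if t["count"]["diff"] != 0:
        failures.append(f"{t['dwd_table']}: row-count diff {t['count']['diff']}")
    for a in t["amounts"]:
        if abs(a["diff"]) > 0.005:  # tolerance for float-summed amounts
            failures.append(f"{t['dwd_table']}.{a['column']}: amount diff {a['diff']}")
if failures:
    raise SystemExit("DWD quality check failed:\n" + "\n".join(failures))
print(f"{len(report['tables'])} table pairs reconciled cleanly")
```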

27
etl_billiards/run_ods.bat Normal file
View File

@@ -0,0 +1,27 @@
@echo off
REM -*- coding: utf-8 -*-
REM Purpose: rebuild the ODS schema (INIT_ODS_SCHEMA) and ingest the sample JSON (MANUAL_INGEST) in one shot
REM Configuration: PG_DSN and INGEST_SOURCE_DIR from .env, or override via the INGEST_DIR variable below
setlocal
cd /d %~dp0
REM Edit INGEST_DIR below to override the sample-JSON directory
set "INGEST_DIR=C:\dev\LLTQ\export\test-json-doc"
echo [INIT_ODS_SCHEMA] starting, source dir=%INGEST_DIR%
python -m cli.main --tasks INIT_ODS_SCHEMA --pipeline-flow INGEST_ONLY --ingest-source "%INGEST_DIR%"
if errorlevel 1 (
echo INIT_ODS_SCHEMA failed, aborting
exit /b 1
)
echo [MANUAL_INGEST] starting, source dir=%INGEST_DIR%
python -m cli.main --tasks MANUAL_INGEST --pipeline-flow INGEST_ONLY --ingest-source "%INGEST_DIR%"
if errorlevel 1 (
echo MANUAL_INGEST failed, aborting
exit /b 1
)
echo All done.
endlocal

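The batch script is Windows-only; the same two steps can be driven from Python on any platform. A hedged sketch (the cli.main module path and flags mirror the script above; the sample directory is an assumption, and like the batch file it should run from the etl_billiards directory):

```python
# Run INIT_ODS_SCHEMA then MANUAL_INGEST via the CLI, stopping on first failure.
import subprocess
import sys

INGEST_DIR = r"C:\dev\LLTQ\export\test-json-doc"  # override as needed

def run_step(task: str) -> None:
    """Run one CLI task; exit non-zero if the step fails."""
    cmd = [
        sys.executable, "-m", "cli.main",
        "--tasks", task,
        "--pipeline-flow", "INGEST_ONLY",
        "--ingest-source", INGEST_DIR,
    ]
    print(f"[{task}] starting, source dir={INGEST_DIR}")
    if subprocess.run(cmd).returncode != 0:
        raise SystemExit(f"{task} failed, aborting")

for step in ("INIT_ODS_SCHEMA", "MANUAL_INGEST"):
    run_step(step)
print("All done.")
```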
View File

@@ -1,4 +1,4 @@
 # -*- coding: utf-8 -*-
 """Populate PRD DWD tables from ODS payload snapshots."""
 from __future__ import annotations
@@ -16,9 +16,9 @@ SQL_STEPS: list[tuple[str, str]] = [
 INSERT INTO billiards_dwd.dim_tenant (tenant_id, tenant_name, status)
 SELECT DISTINCT tenant_id, 'default' AS tenant_name, 'active' AS status
 FROM (
-SELECT tenant_id FROM billiards_ods.ods_order_settle
+SELECT tenant_id FROM billiards_ods.settlement_records
 UNION SELECT tenant_id FROM billiards_ods.ods_order_receipt_detail
-UNION SELECT tenant_id FROM billiards_ods.ods_member_profile
+UNION SELECT tenant_id FROM billiards_ods.member_profiles
 ) s
 WHERE tenant_id IS NOT NULL
 ON CONFLICT (tenant_id) DO UPDATE SET updated_at = now();
@@ -30,7 +30,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 INSERT INTO billiards_dwd.dim_site (site_id, tenant_id, site_name, status)
 SELECT DISTINCT site_id, MAX(tenant_id) AS tenant_id, 'default' AS site_name, 'active' AS status
 FROM (
-SELECT site_id, tenant_id FROM billiards_ods.ods_order_settle
+SELECT site_id, tenant_id FROM billiards_ods.settlement_records
 UNION SELECT site_id, tenant_id FROM billiards_ods.ods_order_receipt_detail
 UNION SELECT site_id, tenant_id FROM billiards_ods.ods_table_info
 ) s
@@ -84,7 +84,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 """
 INSERT INTO billiards_dwd.dim_member_card_type (card_type_id, card_type_name, discount_rate)
 SELECT DISTINCT card_type_id, card_type_name, discount_rate
-FROM billiards_ods.ods_member_card
+FROM billiards_ods.member_stored_value_cards
 WHERE card_type_id IS NOT NULL
 ON CONFLICT (card_type_id) DO UPDATE SET
 card_type_name = EXCLUDED.card_type_name,
@@ -119,10 +119,10 @@ SQL_STEPS: list[tuple[str, str]] = [
 prof.wechat_id,
 prof.alipay_id,
 prof.remarks
-FROM billiards_ods.ods_member_profile prof
+FROM billiards_ods.member_profiles prof
 LEFT JOIN (
 SELECT DISTINCT site_id, member_id, card_type_id AS member_type_id, card_type_name AS member_type_name
-FROM billiards_ods.ods_member_card
+FROM billiards_ods.member_stored_value_cards
 ) card
 ON prof.site_id = card.site_id AND prof.member_id = card.member_id
 WHERE prof.member_id IS NOT NULL
@@ -167,7 +167,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 """
 INSERT INTO billiards_dwd.dim_assistant (assistant_id, assistant_name, mobile, status)
 SELECT DISTINCT assistant_id, assistant_name, mobile, status
-FROM billiards_ods.ods_assistant_account
+FROM billiards_ods.assistant_accounts_master
 WHERE assistant_id IS NOT NULL
 ON CONFLICT (assistant_id) DO UPDATE SET
 assistant_name = EXCLUDED.assistant_name,
@@ -181,7 +181,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 """
 INSERT INTO billiards_dwd.dim_pay_method (pay_method_code, pay_method_name, is_stored_value, status)
 SELECT DISTINCT pay_method_code, pay_method_name, FALSE AS is_stored_value, 'active' AS status
-FROM billiards_ods.ods_payment_record
+FROM billiards_ods.payment_transactions
 WHERE pay_method_code IS NOT NULL
 ON CONFLICT (pay_method_code) DO UPDATE SET
 pay_method_name = EXCLUDED.pay_method_name,
@@ -250,7 +250,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 final_table_fee,
 FALSE AS is_canceled,
 NULL::TIMESTAMPTZ AS cancel_time
-FROM billiards_ods.ods_table_use_log
+FROM billiards_ods.table_fee_transactions_log
 ON CONFLICT (site_id, ledger_id) DO NOTHING;
 """,
 ),
@@ -325,7 +325,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 pay_time,
 relate_type,
 relate_id
-FROM billiards_ods.ods_payment_record
+FROM billiards_ods.payment_transactions
 ON CONFLICT (site_id, pay_id) DO NOTHING;
 """,
 ),
@@ -346,7 +346,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 refund_amount,
 refund_time,
 status
-FROM billiards_ods.ods_refund_record
+FROM billiards_ods.refund_transactions
 ON CONFLICT (site_id, refund_id) DO NOTHING;
 """,
 ),
@@ -369,7 +369,7 @@ SQL_STEPS: list[tuple[str, str]] = [
 balance_before,
 balance_after,
 change_time
-FROM billiards_ods.ods_balance_change
+FROM billiards_ods.member_balance_changes
 ON CONFLICT (site_id, change_id) DO NOTHING;
 """,
 ),
@@ -423,3 +423,4 @@ def main() -> int:
 if __name__ == "__main__":
 raise SystemExit(main())

View File

@@ -0,0 +1,117 @@
# -*- coding: utf-8 -*-
"""
ODS JSON 字段核对脚本:对照当前数据库中的 ODS 表字段,检查示例 JSON默认目录 C:\\dev\\LLTQ\\export\\test-json-doc
是否包含同名键,并输出每表未命中的字段,便于补充映射或确认确实无源字段。
使用方法:
set PG_DSN=postgresql://... # 如 .env 中配置
python -m etl_billiards.scripts.check_ods_json_vs_table
"""
from __future__ import annotations
import json
import os
import pathlib
from typing import Dict, Set, Tuple
import psycopg2
from etl_billiards.tasks.manual_ingest_task import ManualIngestTask
def _flatten_keys(obj, prefix: str = "") -> Set[str]:
"""递归展开 JSON 所有键路径,返回形如 data.assistantInfos.id 的集合。列表不保留索引,仅继续向下展开。"""
keys: Set[str] = set()
if isinstance(obj, dict):
for k, v in obj.items():
new_prefix = f"{prefix}.{k}" if prefix else k
keys.add(new_prefix)
keys |= _flatten_keys(v, new_prefix)
elif isinstance(obj, list):
for item in obj:
keys |= _flatten_keys(item, prefix)
return keys
def _load_json_keys(path: pathlib.Path) -> Tuple[Set[str], dict[str, Set[str]]]:
"""读取单个 JSON 文件并返回展开后的键集合以及末段->路径列表映射,若文件不存在或无法解析则返回空集合。"""
if not path.exists():
return set(), {}
data = json.loads(path.read_text(encoding="utf-8"))
paths = _flatten_keys(data)
last_map: dict[str, Set[str]] = {}
for p in paths:
last = p.split(".")[-1].lower()
last_map.setdefault(last, set()).add(p)
return paths, last_map
def _load_ods_columns(dsn: str) -> Dict[str, Set[str]]:
"""从数据库读取 billiards_ods.* 的列名集合,按表返回。"""
conn = psycopg2.connect(dsn)
cur = conn.cursor()
cur.execute(
"""
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema='billiards_ods'
ORDER BY table_name, ordinal_position
"""
)
result: Dict[str, Set[str]] = {}
for table, col in cur.fetchall():
result.setdefault(table, set()).add(col.lower())
cur.close()
conn.close()
return result
def main() -> None:
"""主流程:遍历 FILE_MAPPING 中的 ODS 表,检查 JSON 键覆盖情况并打印报告。"""
dsn = os.environ.get("PG_DSN")
json_dir = pathlib.Path(os.environ.get("JSON_DOC_DIR", r"C:\dev\LLTQ\export\test-json-doc"))
ods_cols_map = _load_ods_columns(dsn)
print(f"使用 JSON 目录: {json_dir}")
print(f"连接 DSN: {dsn}")
print("=" * 80)
for keywords, ods_table in ManualIngestTask.FILE_MAPPING:
table = ods_table.split(".")[-1]
cols = ods_cols_map.get(table, set())
file_name = f"{keywords[0]}.json"
file_path = json_dir / file_name
keys_full, path_map = _load_json_keys(file_path)
key_last_parts = set(path_map.keys())
missing: Set[str] = set()
extra_keys: Set[str] = set()
present: Set[str] = set()
for col in sorted(cols):
if col in key_last_parts:
present.add(col)
else:
missing.add(col)
for k in key_last_parts:
if k not in cols:
extra_keys.add(k)
print(f"[{table}] 文件={file_name} 列数={len(cols)} JSON键(末段)覆盖={len(present)}/{len(cols)}")
if missing:
print(" 未命中列:", ", ".join(sorted(missing)))
else:
print(" 未命中列: 无")
if extra_keys:
extras = []
for k in sorted(extra_keys):
paths = ", ".join(sorted(path_map.get(k, [])))
extras.append(f"{k} ({paths})")
print(" JSON 仅有(表无此列):", "; ".join(extras))
else:
print(" JSON 仅有(表无此列): 无")
print("-" * 80)
if __name__ == "__main__":
main()
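
The dotted key-path expansion is easiest to see on a toy payload. A standalone rerun of the same flattening logic, for illustration only:

```python
# Dict keys become dotted paths, list indexes are dropped, and every
# intermediate path is kept, exactly as in _flatten_keys above.
def flatten_keys(obj, prefix=""):
    keys = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            path = f"{prefix}.{k}" if prefix else k
            keys.add(path)
            keys |= flatten_keys(v, path)
    elif isinstance(obj, list):
        for item in obj:
            keys |= flatten_keys(item, prefix)
    return keys

sample = {"data": {"assistantInfos": [{"id": 1, "name": "a"}]}}
print(sorted(flatten_keys(sample)))
# ['data', 'data.assistantInfos', 'data.assistantInfos.id', 'data.assistantInfos.name']
```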

View File

@@ -0,0 +1,907 @@
# -*- coding: utf-8 -*-
"""DWD 装载任务:从 ODS 增量写入 DWD维度 SCD2事实按时间增量"""
from __future__ import annotations
from datetime import datetime
from typing import Any, Dict, Iterable, List, Sequence
from psycopg2.extras import RealDictCursor
from .base_task import BaseTask, TaskContext
class DwdLoadTask(BaseTask):
"""负责 DWD 装载:维度表做 SCD2 合并,事实表按时间增量写入。"""
# DWD -> ODS 表映射ODS 表名已与示例 JSON 前缀统一)
TABLE_MAP: dict[str, str] = {
# Dimensions
# Site: sourced from the siteprofile snapshot in the table-fee ledger to pick up org/address fields
"billiards_dwd.dim_site": "billiards_ods.table_fee_transactions",
"billiards_dwd.dim_site_ex": "billiards_ods.table_fee_transactions",
"billiards_dwd.dim_table": "billiards_ods.site_tables_master",
"billiards_dwd.dim_table_ex": "billiards_ods.site_tables_master",
"billiards_dwd.dim_assistant": "billiards_ods.assistant_accounts_master",
"billiards_dwd.dim_assistant_ex": "billiards_ods.assistant_accounts_master",
"billiards_dwd.dim_member": "billiards_ods.member_profiles",
"billiards_dwd.dim_member_ex": "billiards_ods.member_profiles",
"billiards_dwd.dim_member_card_account": "billiards_ods.member_stored_value_cards",
"billiards_dwd.dim_member_card_account_ex": "billiards_ods.member_stored_value_cards",
"billiards_dwd.dim_tenant_goods": "billiards_ods.tenant_goods_master",
"billiards_dwd.dim_tenant_goods_ex": "billiards_ods.tenant_goods_master",
"billiards_dwd.dim_store_goods": "billiards_ods.store_goods_master",
"billiards_dwd.dim_store_goods_ex": "billiards_ods.store_goods_master",
"billiards_dwd.dim_goods_category": "billiards_ods.stock_goods_category_tree",
"billiards_dwd.dim_groupbuy_package": "billiards_ods.group_buy_packages",
"billiards_dwd.dim_groupbuy_package_ex": "billiards_ods.group_buy_packages",
# Facts
"billiards_dwd.dwd_settlement_head": "billiards_ods.settlement_records",
"billiards_dwd.dwd_settlement_head_ex": "billiards_ods.settlement_records",
"billiards_dwd.dwd_table_fee_log": "billiards_ods.table_fee_transactions",
"billiards_dwd.dwd_table_fee_log_ex": "billiards_ods.table_fee_transactions",
"billiards_dwd.dwd_table_fee_adjust": "billiards_ods.table_fee_discount_records",
"billiards_dwd.dwd_table_fee_adjust_ex": "billiards_ods.table_fee_discount_records",
"billiards_dwd.dwd_store_goods_sale": "billiards_ods.store_goods_sales_records",
"billiards_dwd.dwd_store_goods_sale_ex": "billiards_ods.store_goods_sales_records",
"billiards_dwd.dwd_assistant_service_log": "billiards_ods.assistant_service_records",
"billiards_dwd.dwd_assistant_service_log_ex": "billiards_ods.assistant_service_records",
"billiards_dwd.dwd_assistant_trash_event": "billiards_ods.assistant_cancellation_records",
"billiards_dwd.dwd_assistant_trash_event_ex": "billiards_ods.assistant_cancellation_records",
"billiards_dwd.dwd_member_balance_change": "billiards_ods.member_balance_changes",
"billiards_dwd.dwd_member_balance_change_ex": "billiards_ods.member_balance_changes",
"billiards_dwd.dwd_groupbuy_redemption": "billiards_ods.group_buy_redemption_records",
"billiards_dwd.dwd_groupbuy_redemption_ex": "billiards_ods.group_buy_redemption_records",
"billiards_dwd.dwd_platform_coupon_redemption": "billiards_ods.platform_coupon_redemption_records",
"billiards_dwd.dwd_platform_coupon_redemption_ex": "billiards_ods.platform_coupon_redemption_records",
"billiards_dwd.dwd_recharge_order": "billiards_ods.recharge_settlements",
"billiards_dwd.dwd_recharge_order_ex": "billiards_ods.recharge_settlements",
"billiards_dwd.dwd_payment": "billiards_ods.payment_transactions",
"billiards_dwd.dwd_refund": "billiards_ods.refund_transactions",
"billiards_dwd.dwd_refund_ex": "billiards_ods.refund_transactions",
}
SCD_COLS = {"scd2_start_time", "scd2_end_time", "scd2_is_current", "scd2_version"}
FACT_ORDER_CANDIDATES = [
"fetched_at",
"pay_time",
"create_time",
"update_time",
"occur_time",
"settle_time",
"start_use_time",
]
# Special column mappings: DWD column name -> source column expression (optional CAST)
FACT_MAPPINGS: dict[str, list[tuple[str, str, str | None]]] = {
# Dimension tables (fill primary-key/column-name gaps)
"billiards_dwd.dim_site": [
("org_id", "siteprofile->>'org_id'", None),
("shop_name", "siteprofile->>'shop_name'", None),
("site_label", "siteprofile->>'site_label'", None),
("full_address", "siteprofile->>'full_address'", None),
("address", "siteprofile->>'address'", None),
("longitude", "siteprofile->>'longitude'", "numeric"),
("latitude", "siteprofile->>'latitude'", "numeric"),
("tenant_site_region_id", "siteprofile->>'tenant_site_region_id'", None),
("business_tel", "siteprofile->>'business_tel'", None),
("site_type", "siteprofile->>'site_type'", None),
("shop_status", "siteprofile->>'shop_status'", None),
("tenant_id", "siteprofile->>'tenant_id'", None),
],
"billiards_dwd.dim_site_ex": [
("auto_light", "siteprofile->>'auto_light'", None),
("attendance_enabled", "siteprofile->>'attendance_enabled'", None),
("attendance_distance", "siteprofile->>'attendance_distance'", None),
("prod_env", "siteprofile->>'prod_env'", None),
("light_status", "siteprofile->>'light_status'", None),
("light_type", "siteprofile->>'light_type'", None),
("light_token", "siteprofile->>'light_token'", None),
("address", "siteprofile->>'address'", None),
("avatar", "siteprofile->>'avatar'", None),
("wifi_name", "siteprofile->>'wifi_name'", None),
("wifi_password", "siteprofile->>'wifi_password'", None),
("customer_service_qrcode", "siteprofile->>'customer_service_qrcode'", None),
("customer_service_wechat", "siteprofile->>'customer_service_wechat'", None),
("fixed_pay_qrcode", "siteprofile->>'fixed_pay_qrCode'", None),
("longitude", "siteprofile->>'longitude'", "numeric"),
("latitude", "siteprofile->>'latitude'", "numeric"),
("tenant_site_region_id", "siteprofile->>'tenant_site_region_id'", None),
("site_type", "siteprofile->>'site_type'", None),
("site_label", "siteprofile->>'site_label'", None),
("shop_status", "siteprofile->>'shop_status'", None),
("create_time", "siteprofile->>'create_time'", "timestamptz"),
("update_time", "siteprofile->>'update_time'", "timestamptz"),
],
"billiards_dwd.dim_table": [
("table_id", "id", None),
("site_table_area_name", "areaname", None),
("tenant_table_area_id", "site_table_area_id", None),
],
"billiards_dwd.dim_table_ex": [
("table_id", "id", None),
("table_cloth_use_time", "table_cloth_use_time", None),
],
"billiards_dwd.dim_assistant": [("assistant_id", "id", None), ("user_id", "staff_id", None)],
"billiards_dwd.dim_assistant_ex": [
("assistant_id", "id", None),
("introduce", "introduce", None),
("group_name", "group_name", None),
("light_equipment_id", "light_equipment_id", None),
],
"billiards_dwd.dim_member": [("member_id", "id", None)],
"billiards_dwd.dim_member_ex": [
("member_id", "id", None),
("register_site_name", "site_name", None),
],
"billiards_dwd.dim_member_card_account": [("member_card_id", "id", None)],
"billiards_dwd.dim_member_card_account_ex": [
("member_card_id", "id", None),
("tenant_name", "tenantname", None),
("tenantavatar", "tenantavatar", None),
("card_no", "card_no", None),
("bind_password", "bind_password", None),
("use_scene", "use_scene", None),
("tableareaid", "tableareaid", None),
("goodscategoryid", "goodscategoryid", None),
],
"billiards_dwd.dim_tenant_goods": [
("tenant_goods_id", "id", None),
("category_name", "categoryname", None),
],
"billiards_dwd.dim_tenant_goods_ex": [
("tenant_goods_id", "id", None),
("remark_name", "remark_name", None),
("goods_bar_code", "goods_bar_code", None),
("commodity_code_list", "commodity_code", None),
("is_in_site", "isinsite", "boolean"),
],
"billiards_dwd.dim_store_goods": [
("site_goods_id", "id", None),
("category_level1_name", "onecategoryname", None),
("category_level2_name", "twocategoryname", None),
("created_at", "create_time", None),
("updated_at", "update_time", None),
("avg_monthly_sales", "average_monthly_sales", None),
("batch_stock_qty", "stock", None),
("sale_qty", "sale_num", None),
("total_sales_qty", "total_sales", None),
],
"billiards_dwd.dim_store_goods_ex": [
("site_goods_id", "id", None),
("goods_barcode", "goods_bar_code", None),
("stock_qty", "stock", None),
("stock_secondary_qty", "stock_a", None),
("safety_stock_qty", "safe_stock", None),
("site_name", "sitename", None),
("goods_cover_url", "goods_cover", None),
("provisional_total_cost", "total_purchase_cost", None),
("is_discountable", "able_discount", None),
("freeze_status", "freeze", None),
("remark", "remark", None),
("days_on_shelf", "days_available", None),
("sort_order", "sort", None),
],
"billiards_dwd.dim_goods_category": [
("category_id", "id", None),
("tenant_id", "tenant_id", None),
("category_name", "category_name", None),
("alias_name", "alias_name", None),
("parent_category_id", "pid", None),
("business_name", "business_name", None),
("tenant_goods_business_id", "tenant_goods_business_id", None),
("sort_order", "sort", None),
("open_salesman", "open_salesman", None),
("is_warehousing", "is_warehousing", None),
("category_level", "CASE WHEN pid = 0 THEN 1 ELSE 2 END", None),
("is_leaf", "CASE WHEN categoryboxes IS NULL OR jsonb_array_length(categoryboxes)=0 THEN 1 ELSE 0 END", None),
],
"billiards_dwd.dim_groupbuy_package": [
("groupbuy_package_id", "id", None),
("package_template_id", "package_id", None),
("coupon_face_value", "coupon_money", None),
("duration_seconds", "duration", None),
],
"billiards_dwd.dim_groupbuy_package_ex": [
("groupbuy_package_id", "id", None),
("table_area_id", "table_area_id", None),
("tenant_table_area_id", "tenant_table_area_id", None),
("usable_range", "usable_range", None),
("table_area_id_list", "table_area_id_list", None),
("package_type", "type", None),
],
# Fact-table primary keys and notable column differences
"billiards_dwd.dwd_table_fee_log": [("table_fee_log_id", "id", None)],
"billiards_dwd.dwd_table_fee_log_ex": [
("table_fee_log_id", "id", None),
("salesman_name", "salesman_name", None),
],
"billiards_dwd.dwd_table_fee_adjust": [
("table_fee_adjust_id", "id", None),
("table_id", "site_table_id", None),
("table_area_id", "tenant_table_area_id", None),
("table_area_name", "tableprofile->>'table_area_name'", None),
("adjust_time", "create_time", None),
],
"billiards_dwd.dwd_table_fee_adjust_ex": [
("table_fee_adjust_id", "id", None),
("ledger_name", "ledger_name", None),
],
"billiards_dwd.dwd_store_goods_sale": [("store_goods_sale_id", "id", None), ("discount_price", "discount_money", None)],
"billiards_dwd.dwd_store_goods_sale_ex": [
("store_goods_sale_id", "id", None),
("option_value_name", "option_value_name", None),
("open_salesman_flag", "opensalesman", "integer"),
("salesman_name", "salesman_name", None),
("salesman_org_id", "sales_man_org_id", None),
("legacy_order_goods_id", "ordergoodsid", None),
("site_name", "sitename", None),
("legacy_site_id", "siteid", None),
],
"billiards_dwd.dwd_assistant_service_log": [
("assistant_service_id", "id", None),
("assistant_no", "assistantno", None),
("site_assistant_id", "order_assistant_id", None),
("level_name", "levelname", None),
("skill_name", "skillname", None),
],
"billiards_dwd.dwd_assistant_service_log_ex": [
("assistant_service_id", "id", None),
("assistant_name", "assistantname", None),
("ledger_group_name", "ledger_group_name", None),
("trash_applicant_name", "trash_applicant_name", None),
("trash_reason", "trash_reason", None),
("salesman_name", "salesman_name", None),
("table_name", "tablename", None),
],
"billiards_dwd.dwd_assistant_trash_event": [
("assistant_trash_event_id", "id", None),
("assistant_no", "assistantname", None),
("abolish_amount", "assistantabolishamount", None),
("charge_minutes_raw", "pdchargeminutes", None),
("site_id", "siteid", None),
("table_id", "tableid", None),
("table_area_id", "tableareaid", None),
("assistant_name", "assistantname", None),
("trash_reason", "trashreason", None),
("create_time", "createtime", None),
],
"billiards_dwd.dwd_assistant_trash_event_ex": [
("assistant_trash_event_id", "id", None),
("table_area_name", "tablearea", None),
("table_name", "tablename", None),
],
"billiards_dwd.dwd_member_balance_change": [
("balance_change_id", "id", None),
("balance_before", "before", None),
("change_amount", "account_data", None),
("balance_after", "after", None),
("card_type_name", "membercardtypename", None),
("change_time", "create_time", None),
("member_name", "membername", None),
("member_mobile", "membermobile", None),
],
"billiards_dwd.dwd_member_balance_change_ex": [
("balance_change_id", "id", None),
("pay_site_name", "paysitename", None),
("register_site_name", "registersitename", None),
],
"billiards_dwd.dwd_groupbuy_redemption": [("redemption_id", "id", None)],
"billiards_dwd.dwd_groupbuy_redemption_ex": [
("redemption_id", "id", None),
("table_area_name", "tableareaname", None),
("site_name", "sitename", None),
("table_name", "tablename", None),
("goods_option_price", "goodsoptionprice", None),
("salesman_name", "salesman_name", None),
("salesman_org_id", "sales_man_org_id", None),
("ledger_group_name", "ledger_group_name", None),
],
"billiards_dwd.dwd_platform_coupon_redemption": [("platform_coupon_redemption_id", "id", None)],
"billiards_dwd.dwd_platform_coupon_redemption_ex": [
("platform_coupon_redemption_id", "id", None),
("coupon_cover", "coupon_cover", None),
],
"billiards_dwd.dwd_payment": [("payment_id", "id", None), ("pay_date", "pay_time", "date")],
"billiards_dwd.dwd_refund": [("refund_id", "id", None)],
"billiards_dwd.dwd_refund_ex": [
("refund_id", "id", None),
("tenant_name", "tenantname", None),
("channel_payer_id", "channel_payer_id", None),
("channel_pay_no", "channel_pay_no", None),
],
# Settlement head (settlement_records: source columns are lowercase camel case without underscores, so explicit mappings are needed)
"billiards_dwd.dwd_settlement_head": [
("order_settle_id", "id", None),
("tenant_id", "tenantid", None),
("site_id", "siteid", None),
("site_name", "sitename", None),
("table_id", "tableid", None),
("settle_name", "settlename", None),
("order_trade_no", "settlerelateid", None),
("create_time", "createtime", None),
("pay_time", "paytime", None),
("settle_type", "settletype", None),
("revoke_order_id", "revokeorderid", None),
("member_id", "memberid", None),
("member_name", "membername", None),
("member_phone", "memberphone", None),
("member_card_account_id", "tenantmembercardid", None),
("member_card_type_name", "membercardtypename", None),
("is_bind_member", "isbindmember", None),
("member_discount_amount", "memberdiscountamount", None),
("consume_money", "consumemoney", None),
("table_charge_money", "tablechargemoney", None),
("goods_money", "goodsmoney", None),
("real_goods_money", "realgoodsmoney", None),
("assistant_pd_money", "assistantpdmoney", None),
("assistant_cx_money", "assistantcxmoney", None),
("adjust_amount", "adjustamount", None),
("pay_amount", "payamount", None),
("balance_amount", "balanceamount", None),
("recharge_card_amount", "rechargecardamount", None),
("gift_card_amount", "giftcardamount", None),
("coupon_amount", "couponamount", None),
("rounding_amount", "roundingamount", None),
("point_amount", "pointamount", None),
],
"billiards_dwd.dwd_settlement_head_ex": [
("order_settle_id", "id", None),
("serial_number", "serialnumber", None),
("settle_status", "settlestatus", None),
("can_be_revoked", "canberevoked", "boolean"),
("revoke_order_name", "revokeordername", None),
("revoke_time", "revoketime", None),
("is_first_order", "isfirst", "boolean"),
("service_money", "servicemoney", None),
("cash_amount", "cashamount", None),
("card_amount", "cardamount", None),
("online_amount", "onlineamount", None),
("refund_amount", "refundamount", None),
("prepay_money", "prepaymoney", None),
("payment_method", "paymentmethod", None),
("coupon_sale_amount", "couponsaleamount", None),
("all_coupon_discount", "allcoupondiscount", None),
("goods_promotion_money", "goodspromotionmoney", None),
("assistant_promotion_money", "assistantpromotionmoney", None),
("activity_discount", "activitydiscount", None),
("assistant_manual_discount", "assistantmanualdiscount", None),
("point_discount_price", "pointdiscountprice", None),
("point_discount_cost", "pointdiscountcost", None),
("is_use_coupon", "isusecoupon", "boolean"),
("is_use_discount", "isusediscount", "boolean"),
("is_activity", "isactivity", "boolean"),
("operator_name", "operatorname", None),
("salesman_name", "salesmanname", None),
("order_remark", "orderremark", None),
("operator_id", "operatorid", None),
("salesman_user_id", "salesmanuserid", None),
],
# Recharge settlements (recharge_settlements: same column style as settlement_records)
"billiards_dwd.dwd_recharge_order": [
("recharge_order_id", "id", None),
("tenant_id", "tenantid", None),
("site_id", "siteid", None),
("member_id", "memberid", None),
("member_name_snapshot", "membername", None),
("member_phone_snapshot", "memberphone", None),
("tenant_member_card_id", "tenantmembercardid", None),
("member_card_type_name", "membercardtypename", None),
("settle_relate_id", "settlerelateid", None),
("settle_type", "settletype", None),
("settle_name", "settlename", None),
("is_first", "isfirst", None),
("pay_amount", "payamount", None),
("refund_amount", "refundamount", None),
("point_amount", "pointamount", None),
("cash_amount", "cashamount", None),
("payment_method", "paymentmethod", None),
("create_time", "createtime", None),
("pay_time", "paytime", None),
],
"billiards_dwd.dwd_recharge_order_ex": [
("recharge_order_id", "id", None),
("site_name_snapshot", "sitename", None),
("salesman_name", "salesmanname", None),
("order_remark", "orderremark", None),
("revoke_order_name", "revokeordername", None),
("settle_status", "settlestatus", None),
("is_bind_member", "isbindmember", "boolean"),
("is_activity", "isactivity", "boolean"),
("is_use_coupon", "isusecoupon", "boolean"),
("is_use_discount", "isusediscount", "boolean"),
("can_be_revoked", "canberevoked", "boolean"),
("online_amount", "onlineamount", None),
("balance_amount", "balanceamount", None),
("card_amount", "cardamount", None),
("coupon_amount", "couponamount", None),
("recharge_card_amount", "rechargecardamount", None),
("gift_card_amount", "giftcardamount", None),
("prepay_money", "prepaymoney", None),
("consume_money", "consumemoney", None),
("goods_money", "goodsmoney", None),
("real_goods_money", "realgoodsmoney", None),
("table_charge_money", "tablechargemoney", None),
("service_money", "servicemoney", None),
("activity_discount", "activitydiscount", None),
("all_coupon_discount", "allcoupondiscount", None),
("goods_promotion_money", "goodspromotionmoney", None),
("assistant_promotion_money", "assistantpromotionmoney", None),
("assistant_pd_money", "assistantpdmoney", None),
("assistant_cx_money", "assistantcxmoney", None),
("assistant_manual_discount", "assistantmanualdiscount", None),
("coupon_sale_amount", "couponsaleamount", None),
("member_discount_amount", "memberdiscountamount", None),
("point_discount_price", "pointdiscountprice", None),
("point_discount_cost", "pointdiscountcost", None),
("adjust_amount", "adjustamount", None),
("rounding_amount", "roundingamount", None),
("operator_id", "operatorid", None),
("operator_name_snapshot", "operatorname", None),
("salesman_user_id", "salesmanuserid", None),
("salesman_name", "salesmanname", None),
("order_remark", "orderremark", None),
("table_id", "tableid", None),
("serial_number", "serialnumber", None),
("revoke_order_id", "revokeorderid", None),
("revoke_order_name", "revokeordername", None),
("revoke_time", "revoketime", None),
],
}
def get_task_code(self) -> str:
"""返回任务编码。"""
return "DWD_LOAD_FROM_ODS"
def extract(self, context: TaskContext) -> dict[str, Any]:
"""准备运行所需的上下文信息。"""
return {"now": datetime.now()}
def load(self, extracted: dict[str, Any], context: TaskContext) -> dict[str, Any]:
"""遍历映射关系,维度执行 SCD2 合并,事实表按时间增量插入。"""
now = extracted["now"]
summary: List[Dict[str, Any]] = []
with self.db.conn.cursor(cursor_factory=RealDictCursor) as cur:
for dwd_table, ods_table in self.TABLE_MAP.items():
dwd_cols = self._get_columns(cur, dwd_table)
ods_cols = self._get_columns(cur, ods_table)
if not dwd_cols:
self.logger.warning("跳过 %s,未能获取 DWD 列信息", dwd_table)
continue
if self._table_base(dwd_table).startswith("dim_"):
processed = self._merge_dim_scd2(cur, dwd_table, ods_table, dwd_cols, ods_cols, now)
summary.append({"table": dwd_table, "mode": "SCD2", "processed": processed})
else:
dwd_types = self._get_column_types(cur, dwd_table, "billiards_dwd")
ods_types = self._get_column_types(cur, ods_table, "billiards_ods")
inserted = self._merge_fact_increment(
cur, dwd_table, ods_table, dwd_cols, ods_cols, dwd_types, ods_types
)
summary.append({"table": dwd_table, "mode": "INCREMENT", "inserted": inserted})
self.db.conn.commit()
return {"tables": summary}
# ---------------------- helpers ----------------------
def _get_columns(self, cur, table: str) -> List[str]:
"""获取指定表的列名(小写)。"""
schema, name = self._split_table_name(table, default_schema="billiards_dwd")
cur.execute(
"""
SELECT column_name
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
""",
(schema, name),
)
return [r["column_name"].lower() for r in cur.fetchall()]
def _get_primary_keys(self, cur, table: str) -> List[str]:
"""获取表的主键列名列表。"""
schema, name = self._split_table_name(table, default_schema="billiards_dwd")
cur.execute(
"""
SELECT kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
AND tc.table_schema = kcu.table_schema
AND tc.table_name = kcu.table_name
WHERE tc.table_schema = %s
AND tc.table_name = %s
AND tc.constraint_type = 'PRIMARY KEY'
ORDER BY kcu.ordinal_position
""",
(schema, name),
)
return [r["column_name"].lower() for r in cur.fetchall()]
def _get_column_types(self, cur, table: str, default_schema: str) -> Dict[str, str]:
"""获取列的数据类型information_schema.data_type"""
schema, name = self._split_table_name(table, default_schema=default_schema)
cur.execute(
"""
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
""",
(schema, name),
)
return {r["column_name"].lower(): r["data_type"].lower() for r in cur.fetchall()}
def _build_column_mapping(
self, dwd_table: str, pk_cols: Sequence[str], ods_cols: Sequence[str]
) -> Dict[str, tuple[str, str | None]]:
"""合并显式 FACT_MAPPINGS 与主键兜底映射。"""
mapping_entries = self.FACT_MAPPINGS.get(dwd_table, [])
mapping: Dict[str, tuple[str, str | None]] = {
dst.lower(): (src, cast_type) for dst, src, cast_type in mapping_entries
}
ods_set = {c.lower() for c in ods_cols}
for pk in pk_cols:
pk_lower = pk.lower()
if pk_lower not in mapping and pk_lower not in ods_set and "id" in ods_set:
mapping[pk_lower] = ("id", None)
return mapping
def _fetch_source_rows(
self, cur, table: str, columns: Sequence[str], where_sql: str = "", params: Sequence[Any] | None = None
) -> List[Dict[str, Any]]:
"""从源表读取指定列,返回小写键的字典列表。"""
schema, name = self._split_table_name(table, default_schema="billiards_ods")
cols_sql = ", ".join(f'"{c}"' for c in columns)
sql = f'SELECT {cols_sql} FROM "{schema}"."{name}" {where_sql}'
cur.execute(sql, params or [])
rows = []
for r in cur.fetchall():
rows.append({k.lower(): v for k, v in r.items()})
return rows
def _expand_goods_category_rows(self, rows: list[Dict[str, Any]]) -> list[Dict[str, Any]]:
"""将分类表中的 categoryboxes 元素展开为子类记录。"""
expanded: list[Dict[str, Any]] = []
for r in rows:
expanded.append(r)
boxes = r.get("categoryboxes")
if isinstance(boxes, list):
for child in boxes:
if not isinstance(child, dict):
continue
child_row: Dict[str, Any] = {}
# Inherit tenant and business-category info from the parent
child_row["tenant_id"] = r.get("tenant_id")
child_row["business_name"] = child.get("business_name", r.get("business_name"))
child_row["tenant_goods_business_id"] = child.get(
"tenant_goods_business_id", r.get("tenant_goods_business_id")
)
# Merge the child-category fields
child_row.update(child)
# Default parent-child relationship
child_row.setdefault("pid", r.get("id"))
# Derive level/leaf flags
child_boxes = child_row.get("categoryboxes")
if not isinstance(child_boxes, list):
is_leaf = 1
else:
is_leaf = 1 if len(child_boxes) == 0 else 0
child_row.setdefault("category_level", 2)
child_row.setdefault("is_leaf", is_leaf)
expanded.append(child_row)
return expanded
def _merge_dim_scd2(
self,
cur,
dwd_table: str,
ods_table: str,
dwd_cols: Sequence[str],
ods_cols: Sequence[str],
now: datetime,
) -> int:
"""对维表执行 SCD2 合并:对比变更关闭旧版并插入新版。"""
pk_cols = self._get_primary_keys(cur, dwd_table)
if not pk_cols:
raise ValueError(f"{dwd_table} 未配置主键,无法执行 SCD2 合并")
mapping = self._build_column_mapping(dwd_table, pk_cols, ods_cols)
ods_set = {c.lower() for c in ods_cols}
table_sql = self._format_table(ods_table, "billiards_ods")
# Build SELECT expressions, supporting JSON/expression mappings
select_exprs: list[str] = []
added: set[str] = set()
for col in dwd_cols:
lc = col.lower()
if lc in self.SCD_COLS:
continue
if lc in mapping:
src, cast_type = mapping[lc]
select_exprs.append(f"{self._cast_expr(src, cast_type)} AS \"{lc}\"")
added.add(lc)
elif lc in ods_set:
select_exprs.append(f'"{lc}" AS "{lc}"')
added.add(lc)
# The category dimension additionally needs categoryboxes to expand child categories
if dwd_table == "billiards_dwd.dim_goods_category" and "categoryboxes" not in added and "categoryboxes" in ods_set:
select_exprs.append('"categoryboxes" AS "categoryboxes"')
added.add("categoryboxes")
# Primary-key fallback: make sure the PK columns are selected
for pk in pk_cols:
lc = pk.lower()
if lc not in added:
if lc in mapping:
src, cast_type = mapping[lc]
select_exprs.append(f"{self._cast_expr(src, cast_type)} AS \"{lc}\"")
elif lc in ods_set:
select_exprs.append(f'"{lc}" AS "{lc}"')
added.add(lc)
if not select_exprs:
return 0
sql = f"SELECT {', '.join(select_exprs)} FROM {table_sql}"
cur.execute(sql)
rows = [{k.lower(): v for k, v in r.items()} for r in cur.fetchall()]
# Special case: expand child categories for the category dimension
if dwd_table == "billiards_dwd.dim_goods_category":
rows = self._expand_goods_category_rows(rows)
inserted_or_updated = 0
seen_pk = set()
for row in rows:
mapped_row: Dict[str, Any] = {}
for col in dwd_cols:
lc = col.lower()
if lc in self.SCD_COLS:
continue
value = row.get(lc)
if value is None and lc in mapping:
src, _ = mapping[lc]
value = row.get(src.lower())
mapped_row[lc] = value
pk_key = tuple(mapped_row.get(pk) for pk in pk_cols)
if pk_key in seen_pk:
continue
seen_pk.add(pk_key)
if self._upsert_scd2_row(cur, dwd_table, dwd_cols, pk_cols, mapped_row, now):
inserted_or_updated += 1
return len(rows)
def _upsert_scd2_row(
self,
cur,
dwd_table: str,
dwd_cols: Sequence[str],
pk_cols: Sequence[str],
src_row: Dict[str, Any],
now: datetime,
) -> bool:
"""SCD2 合并:若有变更则关闭旧版并插入新版本。"""
pk_values = [src_row.get(pk) for pk in pk_cols]
if any(v is None for v in pk_values):
self.logger.warning("跳过 %s:主键缺失 %s", dwd_table, dict(zip(pk_cols, pk_values)))
return False
where_clause = " AND ".join(f'"{pk}" = %s' for pk in pk_cols)
table_sql = self._format_table(dwd_table, "billiards_dwd")
cur.execute(
f"SELECT * FROM {table_sql} WHERE {where_clause} AND COALESCE(scd2_is_current,1)=1 LIMIT 1",
pk_values,
)
current = cur.fetchone()
if current:
current = {k.lower(): v for k, v in current.items()}
if current and not self._is_row_changed(current, src_row, dwd_cols):
return False
if current:
version = (current.get("scd2_version") or 1) + 1
self._close_current_dim(cur, dwd_table, pk_cols, pk_values, now)
else:
version = 1
self._insert_dim_row(cur, dwd_table, dwd_cols, src_row, now, version)
return True
def _close_current_dim(self, cur, table: str, pk_cols: Sequence[str], pk_values: Sequence[Any], now: datetime) -> None:
"""关闭当前版本,标记 scd2_is_current=0 并填充结束时间。"""
set_sql = "scd2_end_time = %s, scd2_is_current = 0"
where_clause = " AND ".join(f'"{pk}" = %s' for pk in pk_cols)
table_sql = self._format_table(table, "billiards_dwd")
cur.execute(f"UPDATE {table_sql} SET {set_sql} WHERE {where_clause} AND COALESCE(scd2_is_current,1)=1", [now, *pk_values])
def _insert_dim_row(
self,
cur,
table: str,
dwd_cols: Sequence[str],
src_row: Dict[str, Any],
now: datetime,
version: int,
) -> None:
"""插入新的 SCD2 版本行。"""
insert_cols: List[str] = []
placeholders: List[str] = []
values: List[Any] = []
for col in sorted(dwd_cols):
lc = col.lower()
insert_cols.append(f'"{lc}"')
placeholders.append("%s")
if lc == "scd2_start_time":
values.append(now)
elif lc == "scd2_end_time":
values.append(datetime(9999, 12, 31, 0, 0, 0))
elif lc == "scd2_is_current":
values.append(1)
elif lc == "scd2_version":
values.append(version)
else:
values.append(src_row.get(lc))
table_sql = self._format_table(table, "billiards_dwd")
sql = f'INSERT INTO {table_sql} ({", ".join(insert_cols)}) VALUES ({", ".join(placeholders)})'
cur.execute(sql, values)
def _is_row_changed(self, current: Dict[str, Any], incoming: Dict[str, Any], dwd_cols: Sequence[str]) -> bool:
"""比较非 SCD2 列,判断是否存在变更。"""
for col in dwd_cols:
lc = col.lower()
if lc in self.SCD_COLS:
continue
if current.get(lc) != incoming.get(lc):
return True
return False
def _merge_fact_increment(
self,
cur,
dwd_table: str,
ods_table: str,
dwd_cols: Sequence[str],
ods_cols: Sequence[str],
dwd_types: Dict[str, str],
ods_types: Dict[str, str],
) -> int:
"""事实表按时间增量插入,默认按列名交集写入。"""
mapping_entries = self.FACT_MAPPINGS.get(dwd_table) or []
mapping: Dict[str, tuple[str, str | None]] = {
dst.lower(): (src, cast_type) for dst, src, cast_type in mapping_entries
}
mapping_dest = [dst for dst, _, _ in mapping_entries]
insert_cols: List[str] = list(mapping_dest)
for col in dwd_cols:
if col in self.SCD_COLS:
continue
if col in insert_cols:
continue
if col in ods_cols:
insert_cols.append(col)
pk_cols = self._get_primary_keys(cur, dwd_table)
ods_set = {c.lower() for c in ods_cols}
existing_lower = [c.lower() for c in insert_cols]
for pk in pk_cols:
pk_lower = pk.lower()
if pk_lower in existing_lower:
continue
if pk_lower in ods_set:
insert_cols.append(pk)
existing_lower.append(pk_lower)
elif "id" in ods_set:
insert_cols.append(pk)
existing_lower.append(pk_lower)
mapping[pk_lower] = ("id", None)
# Deduplicate while preserving column order
seen_cols: set[str] = set()
ordered_cols: list[str] = []
for col in insert_cols:
lc = col.lower()
if lc not in seen_cols:
seen_cols.add(lc)
ordered_cols.append(col)
insert_cols = ordered_cols
if not insert_cols:
self.logger.warning("跳过 %s:未找到可插入的列", dwd_table)
return 0
order_col = self._pick_order_column(dwd_cols, ods_cols)
where_sql = ""
params: List[Any] = []
dwd_table_sql = self._format_table(dwd_table, "billiards_dwd")
ods_table_sql = self._format_table(ods_table, "billiards_ods")
if order_col:
cur.execute(f'SELECT COALESCE(MAX("{order_col}"), %s) FROM {dwd_table_sql}', ("1970-01-01",))
row = cur.fetchone() or {}
watermark = list(row.values())[0] if row else "1970-01-01"
where_sql = f'WHERE "{order_col}" > %s'
params.append(watermark)
default_cols = [c for c in insert_cols if c.lower() not in mapping]
default_expr_map: Dict[str, str] = {}
if default_cols:
default_exprs = self._build_fact_select_exprs(default_cols, dwd_types, ods_types)
default_expr_map = dict(zip(default_cols, default_exprs))
select_exprs: List[str] = []
for col in insert_cols:
key = col.lower()
if key in mapping:
src, cast_type = mapping[key]
select_exprs.append(self._cast_expr(src, cast_type))
else:
select_exprs.append(default_expr_map[col])
select_cols_sql = ", ".join(select_exprs)
insert_cols_sql = ", ".join(f'"{c}"' for c in insert_cols)
sql = f'INSERT INTO {dwd_table_sql} ({insert_cols_sql}) SELECT {select_cols_sql} FROM {ods_table_sql} {where_sql}'
pk_cols = self._get_primary_keys(cur, dwd_table)
if pk_cols:
pk_sql = ", ".join(f'"{c}"' for c in pk_cols)
sql += f" ON CONFLICT ({pk_sql}) DO NOTHING"
cur.execute(sql, params)
return cur.rowcount
def _pick_order_column(self, dwd_cols: Iterable[str], ods_cols: Iterable[str]) -> str | None:
"""选择用于增量的时间列(需同时存在于 DWD 与 ODS"""
lower_cols = {c.lower() for c in dwd_cols} & {c.lower() for c in ods_cols}
for candidate in self.FACT_ORDER_CANDIDATES:
if candidate.lower() in lower_cols:
return candidate.lower()
return None
def _build_fact_select_exprs(
self,
insert_cols: Sequence[str],
dwd_types: Dict[str, str],
ods_types: Dict[str, str],
) -> List[str]:
"""构造事实表 SELECT 列表,需要时做类型转换。"""
numeric_types = {"integer", "bigint", "smallint", "numeric", "double precision", "real", "decimal"}
text_types = {"text", "character varying", "varchar"}
exprs = []
for col in insert_cols:
d_type = dwd_types.get(col)
o_type = ods_types.get(col)
if d_type in numeric_types and o_type in text_types:
exprs.append(f"CAST(NULLIF(CAST(\"{col}\" AS text), '') AS numeric):: {d_type}")
else:
exprs.append(f'"{col}"')
return exprs
def _split_table_name(self, name: str, default_schema: str) -> tuple[str, str]:
"""拆分 schema.table若无 schema 则补默认 schema。"""
parts = name.split(".")
if len(parts) == 2:
return parts[0], parts[1].lower()
return default_schema, name.lower()
def _table_base(self, name: str) -> str:
"""获取不含 schema 的表名。"""
return name.split(".")[-1]
def _format_table(self, name: str, default_schema: str) -> str:
"""返回带引号的 schema.table 名称。"""
schema, table = self._split_table_name(name, default_schema)
return f'"{schema}"."{table}"'
def _cast_expr(self, col: str, cast_type: str | None) -> str:
"""构造带可选 CAST 的列表达式。"""
if col.upper() == "NULL":
base = "NULL"
else:
is_expr = not col.isidentifier() or "->" in col or "#>>" in col or "::" in col or "'" in col
base = col if is_expr else f'"{col}"'
if cast_type:
cast_lower = cast_type.lower()
if cast_lower in {"bigint", "integer", "numeric", "decimal"}:
return f"CAST(NULLIF(CAST({base} AS text), '') AS numeric):: {cast_type}"
if cast_lower == "timestamptz":
return f"({base})::timestamptz"
return f"{base}::{cast_type}"
return base
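
The SCD2 path above (_upsert_scd2_row, _close_current_dim, _insert_dim_row) reduces to one compare, one UPDATE, and one INSERT per changed key. A condensed sketch against a hypothetical dimension table, assuming a dict-style cursor such as RealDictCursor:

```python
# SCD2 merge in miniature: close the open version on change, insert version n+1.
from datetime import datetime

HI_DATE = datetime(9999, 12, 31)  # same open-ended end date as the task uses

def scd2_upsert(cur, table, pk_col, pk_val, new_row, now):
    """Return True if a new version was written, False if the row is unchanged."""
    cur.execute(
        f'SELECT * FROM {table} WHERE "{pk_col}" = %s AND scd2_is_current = 1',
        (pk_val,),
    )
    current = cur.fetchone()  # dict-like row (e.g. RealDictRow)
    if current and all(current.get(k) == v for k, v in new_row.items()):
        return False  # unchanged: leave the current version open
    version = (current["scd2_version"] + 1) if current else 1
    if current:
        cur.execute(
            f'UPDATE {table} SET scd2_end_time = %s, scd2_is_current = 0 '
            f'WHERE "{pk_col}" = %s AND scd2_is_current = 1',
            (now, pk_val),
        )
    cols = [*new_row, "scd2_start_time", "scd2_end_time", "scd2_is_current", "scd2_version"]
    vals = [*new_row.values(), now, HI_DATE, 1, version]
    placeholders = ", ".join(["%s"] * len(vals))
    cur.execute(f'INSERT INTO {table} ({", ".join(cols)}) VALUES ({placeholders})', vals)
    return True
```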

View File

@@ -0,0 +1,105 @@
# -*- coding: utf-8 -*-
"""DWD 质量核对任务:按 dwd_quality_check.md 输出行数/金额对照报表。"""
from __future__ import annotations
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Tuple
from psycopg2.extras import RealDictCursor
from .base_task import BaseTask, TaskContext
from .dwd_load_task import DwdLoadTask
class DwdQualityTask(BaseTask):
"""对 ODS 与 DWD 进行行数、金额对照核查,生成 JSON 报表。"""
REPORT_PATH = Path("etl_billiards/reports/dwd_quality_report.json")
AMOUNT_KEYWORDS = ("amount", "money", "fee", "balance")
def get_task_code(self) -> str:
"""返回任务编码。"""
return "DWD_QUALITY_CHECK"
def extract(self, context: TaskContext) -> dict[str, Any]:
"""准备运行时上下文。"""
return {"now": datetime.now()}
def load(self, extracted: dict[str, Any], context: TaskContext) -> dict[str, Any]:
"""输出行数/金额差异报表到本地文件。"""
report: Dict[str, Any] = {
"generated_at": extracted["now"].isoformat(),
"tables": [],
"note": "行数/金额核对,金额字段基于列名包含 amount/money/fee/balance 的数值列自动扫描。",
}
with self.db.conn.cursor(cursor_factory=RealDictCursor) as cur:
for dwd_table, ods_table in DwdLoadTask.TABLE_MAP.items():
count_info = self._compare_counts(cur, dwd_table, ods_table)
amount_info = self._compare_amounts(cur, dwd_table, ods_table)
report["tables"].append(
{
"dwd_table": dwd_table,
"ods_table": ods_table,
"count": count_info,
"amounts": amount_info,
}
)
self.REPORT_PATH.parent.mkdir(parents=True, exist_ok=True)
self.REPORT_PATH.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
self.logger.info("DWD 质检报表已生成:%s", self.REPORT_PATH)
return {"report_path": str(self.REPORT_PATH)}
# ---------------------- helpers ----------------------
def _compare_counts(self, cur, dwd_table: str, ods_table: str) -> Dict[str, Any]:
"""统计两端行数并返回差异。"""
dwd_schema, dwd_name = self._split_table_name(dwd_table, default_schema="billiards_dwd")
ods_schema, ods_name = self._split_table_name(ods_table, default_schema="billiards_ods")
cur.execute(f'SELECT COUNT(1) AS cnt FROM "{dwd_schema}"."{dwd_name}"')
dwd_cnt = cur.fetchone()["cnt"]
cur.execute(f'SELECT COUNT(1) AS cnt FROM "{ods_schema}"."{ods_name}"')
ods_cnt = cur.fetchone()["cnt"]
return {"dwd": dwd_cnt, "ods": ods_cnt, "diff": dwd_cnt - ods_cnt}
def _compare_amounts(self, cur, dwd_table: str, ods_table: str) -> List[Dict[str, Any]]:
"""扫描金额相关列,生成 ODS 与 DWD 的汇总对照。"""
dwd_schema, dwd_name = self._split_table_name(dwd_table, default_schema="billiards_dwd")
ods_schema, ods_name = self._split_table_name(ods_table, default_schema="billiards_ods")
dwd_amount_cols = self._get_numeric_amount_columns(cur, dwd_schema, dwd_name)
ods_amount_cols = self._get_numeric_amount_columns(cur, ods_schema, ods_name)
common_amount_cols = sorted(set(dwd_amount_cols) & set(ods_amount_cols))
results: List[Dict[str, Any]] = []
for col in common_amount_cols:
cur.execute(f'SELECT COALESCE(SUM("{col}"),0) AS val FROM "{dwd_schema}"."{dwd_name}"')
dwd_sum = cur.fetchone()["val"]
cur.execute(f'SELECT COALESCE(SUM("{col}"),0) AS val FROM "{ods_schema}"."{ods_name}"')
ods_sum = cur.fetchone()["val"]
results.append({"column": col, "dwd_sum": float(dwd_sum or 0), "ods_sum": float(ods_sum or 0), "diff": float(dwd_sum or 0) - float(ods_sum or 0)})
return results
def _get_numeric_amount_columns(self, cur, schema: str, table: str) -> List[str]:
"""获取列名包含金额关键词的数值型字段。"""
cur.execute(
"""
SELECT column_name
FROM information_schema.columns
WHERE table_schema = %s
AND table_name = %s
AND data_type IN ('numeric','double precision','integer','bigint','smallint','real','decimal')
""",
(schema, table),
)
cols = [r["column_name"].lower() for r in cur.fetchall()]
return [c for c in cols if any(key in c for key in self.AMOUNT_KEYWORDS)]
def _split_table_name(self, name: str, default_schema: str) -> Tuple[str, str]:
"""拆分 schema 与表名,缺省使用 default_schema。"""
parts = name.split(".")
if len(parts) == 2:
return parts[0], parts[1]
return default_schema, name
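
For spot checks outside the task, the amount-column scan can run standalone. A sketch mirroring _get_numeric_amount_columns (the DSN and example table are placeholders):

```python
# List numeric columns whose names contain an amount keyword for any table.
import psycopg2

AMOUNT_KEYWORDS = ("amount", "money", "fee", "balance")

def amount_columns(dsn: str, schema: str, table: str) -> list[str]:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT column_name FROM information_schema.columns
            WHERE table_schema = %s AND table_name = %s
              AND data_type IN ('numeric','double precision','integer',
                                'bigint','smallint','real','decimal')
            """,
            (schema, table),
        )
        cols = [r[0].lower() for r in cur.fetchall()]
    return [c for c in cols if any(k in c for k in AMOUNT_KEYWORDS)]

# e.g. amount_columns(dsn, "billiards_dwd", "dwd_payment") -> ['pay_amount']
```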

View File

@@ -0,0 +1,36 @@
# -*- coding: utf-8 -*-
"""初始化 DWD Schema执行 schema_dwd_doc.sql可选先 DROP SCHEMA。"""
from __future__ import annotations
from pathlib import Path
from typing import Any
from .base_task import BaseTask, TaskContext
class InitDwdSchemaTask(BaseTask):
"""通过调度执行 DWD schema 初始化。"""
def get_task_code(self) -> str:
"""返回任务编码。"""
return "INIT_DWD_SCHEMA"
def extract(self, context: TaskContext) -> dict[str, Any]:
"""读取 DWD SQL 文件与参数。"""
base_dir = Path(__file__).resolve().parents[1] / "database"
dwd_path = Path(self.config.get("schema.dwd_file", base_dir / "schema_dwd_doc.sql"))
if not dwd_path.exists():
raise FileNotFoundError(f"未找到 DWD schema 文件: {dwd_path}")
drop_first = self.config.get("dwd.drop_schema_first", False)
return {"dwd_sql": dwd_path.read_text(encoding="utf-8"), "dwd_file": str(dwd_path), "drop_first": drop_first}
def load(self, extracted: dict[str, Any], context: TaskContext) -> dict:
"""可选 DROP schema再执行 DWD DDL。"""
with self.db.conn.cursor() as cur:
if extracted["drop_first"]:
cur.execute("DROP SCHEMA IF EXISTS billiards_dwd CASCADE;")
self.logger.info("已执行 DROP SCHEMA billiards_dwd CASCADE")
self.logger.info("执行 DWD schema 文件: %s", extracted["dwd_file"])
cur.execute(extracted["dwd_sql"])
return {"executed": 1, "files": [extracted["dwd_file"]]}

View File

@@ -0,0 +1,73 @@
# -*- coding: utf-8 -*-
"""任务:初始化运行环境,执行 ODS 与 etl_admin 的 DDL并准备日志/导出目录。"""
from __future__ import annotations
from pathlib import Path
from typing import Any
from .base_task import BaseTask, TaskContext
class InitOdsSchemaTask(BaseTask):
"""通过调度执行初始化:创建必要目录,执行 ODS 与 etl_admin 的 DDL。"""
def get_task_code(self) -> str:
"""返回任务编码。"""
return "INIT_ODS_SCHEMA"
def extract(self, context: TaskContext) -> dict[str, Any]:
"""读取 SQL 文件路径,收集需创建的目录。"""
base_dir = Path(__file__).resolve().parents[1] / "database"
ods_path = Path(self.config.get("schema.ods_file", base_dir / "schema_ODS_doc.sql"))
admin_path = Path(self.config.get("schema.etl_admin_file", base_dir / "schema_etl_admin.sql"))
if not ods_path.exists():
raise FileNotFoundError(f"找不到 ODS schema 文件: {ods_path}")
if not admin_path.exists():
raise FileNotFoundError(f"找不到 etl_admin schema 文件: {admin_path}")
log_root = Path(self.config.get("io.log_root") or self.config["io"]["log_root"])
export_root = Path(self.config.get("io.export_root") or self.config["io"]["export_root"])
fetch_root = Path(self.config.get("pipeline.fetch_root") or self.config["pipeline"]["fetch_root"])
ingest_dir = Path(self.config.get("pipeline.ingest_source_dir") or fetch_root)
return {
"ods_sql": ods_path.read_text(encoding="utf-8"),
"admin_sql": admin_path.read_text(encoding="utf-8"),
"ods_file": str(ods_path),
"admin_file": str(admin_path),
"dirs": [log_root, export_root, fetch_root, ingest_dir],
}
def load(self, extracted: dict[str, Any], context: TaskContext) -> dict:
"""执行 DDL 并创建必要目录。
安全提示:
ODS DDL 文件可能携带头部说明或异常注释,为避免因非 SQL 文本导致执行失败,这里会做一次轻量清洗后再执行。
"""
for d in extracted["dirs"]:
Path(d).mkdir(parents=True, exist_ok=True)
self.logger.info("已确保目录存在: %s", d)
# 处理 ODS SQL去掉头部说明行以及易出错的 COMMENT ON 行(如 CamelCase 未加引号)
ods_sql_raw: str = extracted["ods_sql"]
drop_idx = ods_sql_raw.find("DROP SCHEMA")
if drop_idx > 0:
ods_sql_raw = ods_sql_raw[drop_idx:]
cleaned_lines: list[str] = []
for line in ods_sql_raw.splitlines():
if line.strip().upper().startswith("COMMENT ON "):
continue
cleaned_lines.append(line)
ods_sql = "\n".join(cleaned_lines)
with self.db.conn.cursor() as cur:
self.logger.info("执行 etl_admin schema 文件: %s", extracted["admin_file"])
cur.execute(extracted["admin_sql"])
self.logger.info("执行 ODS schema 文件: %s", extracted["ods_file"])
cur.execute(ods_sql)
return {
"executed": 2,
"files": [extracted["admin_file"], extracted["ods_file"]],
"dirs_prepared": [str(p) for p in extracted["dirs"]],
}
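
Since the cleanup silently drops lines, it is worth seeing in isolation. The same logic as a standalone helper with a tiny demo (a restatement of the task code above, not a separate utility in the repo):

```python
# Strip any prose header before the first DROP SCHEMA and skip COMMENT ON
# lines, which fail when CamelCase identifiers are left unquoted.
def clean_ods_sql(raw: str) -> str:
    drop_idx = raw.find("DROP SCHEMA")
    if drop_idx > 0:
        raw = raw[drop_idx:]
    return "\n".join(
        line for line in raw.splitlines()
        if not line.strip().upper().startswith("COMMENT ON ")
    )

demo = (
    "Header notes\n"
    "DROP SCHEMA IF EXISTS billiards_ods CASCADE;\n"
    "COMMENT ON TABLE SomeTable IS 'x';\n"
    "CREATE SCHEMA billiards_ods;"
)
print(clean_ods_sql(demo))
# DROP SCHEMA IF EXISTS billiards_ods CASCADE;
# CREATE SCHEMA billiards_ods;
```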

File diff suppressed because it is too large Load Diff

View File

@@ -1,4 +1,4 @@
 # -*- coding: utf-8 -*-
 from .base_dwd_task import BaseDwdTask
 from loaders.dimensions.member import MemberLoader
 from models.parsers import TypeParser
@@ -7,7 +7,7 @@ import json
 class MembersDwdTask(BaseDwdTask):
 """
 DWD Task: Process Member Records from ODS to Dimension Table
-Source: billiards_ods.ods_member_profile
+Source: billiards_ods.member_profiles
 Target: billiards.dim_member
 """
@@ -29,7 +29,7 @@ class MembersDwdTask(BaseDwdTask):
 # Iterate ODS Data
 batches = self.iter_ods_rows(
-table_name="billiards_ods.ods_member_profile",
+table_name="billiards_ods.member_profiles",
 columns=["site_id", "member_id", "payload", "fetched_at"],
 start_time=window_start,
 end_time=window_end
@@ -87,3 +87,4 @@ class MembersDwdTask(BaseDwdTask):
 except Exception as e:
 self.logger.warning(f"Error parsing member: {e}")
 return None

View File

@@ -1,4 +1,4 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
"""ODS ingestion tasks.""" """ODS ingestion tasks."""
from __future__ import annotations from __future__ import annotations
@@ -62,11 +62,11 @@ class BaseOdsTask(BaseTask):
def execute(self) -> dict: def execute(self) -> dict:
spec = self.SPEC spec = self.SPEC
self.logger.info("开始执行 %s (ODS)", spec.code) self.logger.info("寮€濮嬫墽琛?%s (ODS)", spec.code)
store_id = TypeParser.parse_int(self.config.get("app.store_id")) store_id = TypeParser.parse_int(self.config.get("app.store_id"))
if not store_id: if not store_id:
raise ValueError("app.store_id 未配置,无法执行 ODS 任务") raise ValueError("app.store_id 鏈厤缃紝鏃犳硶鎵ц ODS 浠诲姟")
page_size = self.config.get("api.page_size", 200) page_size = self.config.get("api.page_size", 200)
params = self._build_params(spec, store_id) params = self._build_params(spec, store_id)
@@ -122,13 +122,13 @@ class BaseOdsTask(BaseTask):
counts["fetched"] += len(page_records) counts["fetched"] += len(page_records)
self.db.commit() self.db.commit()
self.logger.info("%s ODS 任务完成: %s", spec.code, counts) self.logger.info("%s ODS 浠诲姟瀹屾垚: %s", spec.code, counts)
return self._build_result("SUCCESS", counts) return self._build_result("SUCCESS", counts)
except Exception: except Exception:
self.db.rollback() self.db.rollback()
counts["errors"] += 1 counts["errors"] += 1
self.logger.error("%s ODS 任务失败", spec.code, exc_info=True) self.logger.error("%s ODS 浠诲姟澶辫触", spec.code, exc_info=True)
raise raise
def _build_params(self, spec: OdsTaskSpec, store_id: int) -> dict: def _build_params(self, spec: OdsTaskSpec, store_id: int) -> dict:
@@ -201,7 +201,7 @@ class BaseOdsTask(BaseTask):
value = self._extract_value(record, col_spec) value = self._extract_value(record, col_spec)
if value is None and col_spec.required: if value is None and col_spec.required:
self.logger.warning( self.logger.warning(
"%s 缺少必填字段 %s,原始记录: %s", "%s 缂哄皯蹇呭~瀛楁 %s锛屽師濮嬭褰? %s",
spec.code, spec.code,
col_spec.column, col_spec.column,
record, record,
@@ -265,9 +265,38 @@ def _int_col(name: str, *sources: str, required: bool = False) -> ColumnSpec:
) )
def _decimal_col(name: str, *sources: str) -> ColumnSpec:
"""??????????????"""
return ColumnSpec(
column=name,
sources=sources,
transform=lambda v: TypeParser.parse_decimal(v, 2),
)
def _bool_col(name: str, *sources: str) -> ColumnSpec:
"""??????????????0/1?true/false ???"""
def _to_bool(value):
if value is None:
return None
if isinstance(value, bool):
return value
s = str(value).strip().lower()
if s in {"1", "true", "t", "yes", "y"}:
return True
if s in {"0", "false", "f", "no", "n"}:
return False
return bool(value)
return ColumnSpec(column=name, sources=sources, transform=_to_bool)
ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = ( ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
OdsTaskSpec( OdsTaskSpec(
code="ODS_ASSISTANT_ACCOUNTS", code="ODS_ASSISTANT_ACCOUNT",
class_name="OdsAssistantAccountsTask", class_name="OdsAssistantAccountsTask",
table_name="billiards_ods.assistant_accounts_master", table_name="billiards_ods.assistant_accounts_master",
endpoint="/PersonnelManagement/SearchAssistantInfo", endpoint="/PersonnelManagement/SearchAssistantInfo",
@@ -281,10 +310,10 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_fetched_at=False, include_fetched_at=False,
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
description="助教账号档案 ODSSearchAssistantInfo -> assistantInfos 原始 JSON", description="鍔╂暀璐﹀彿妗f ODS锛歋earchAssistantInfo -> assistantInfos 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_ORDER_SETTLE", code="ODS_SETTLEMENT_RECORDS",
class_name="OdsOrderSettleTask", class_name="OdsOrderSettleTask",
table_name="billiards_ods.settlement_records", table_name="billiards_ods.settlement_records",
endpoint="/Site/GetAllOrderSettleList", endpoint="/Site/GetAllOrderSettleList",
@@ -299,7 +328,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="结账记录 ODSGetAllOrderSettleList -> settleList 原始 JSON", description="缁撹处璁板綍 ODS锛欸etAllOrderSettleList -> settleList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_TABLE_USE", code="ODS_TABLE_USE",
@@ -317,7 +346,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="台费计费流水 ODSGetSiteTableOrderDetails -> siteTableUseDetailsList 原始 JSON", description="鍙拌垂璁¤垂娴佹按 ODS锛欸etSiteTableOrderDetails -> siteTableUseDetailsList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_ASSISTANT_LEDGER", code="ODS_ASSISTANT_LEDGER",
@@ -334,7 +363,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_fetched_at=False, include_fetched_at=False,
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
description="助教服务流水 ODSGetOrderAssistantDetails -> orderAssistantDetails 原始 JSON", description="鍔╂暀鏈嶅姟娴佹按 ODS锛欸etOrderAssistantDetails -> orderAssistantDetails 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_ASSISTANT_ABOLISH", code="ODS_ASSISTANT_ABOLISH",
@@ -351,10 +380,10 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_fetched_at=False, include_fetched_at=False,
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
description="助教废除记录 ODSGetAbolitionAssistant -> abolitionAssistants 原始 JSON", description="鍔╂暀搴熼櫎璁板綍 ODS锛欸etAbolitionAssistant -> abolitionAssistants 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_GOODS_LEDGER", code="ODS_STORE_GOODS_SALES",
class_name="OdsGoodsLedgerTask", class_name="OdsGoodsLedgerTask",
table_name="billiards_ods.store_goods_sales_records", table_name="billiards_ods.store_goods_sales_records",
endpoint="/TenantGoods/GetGoodsSalesList", endpoint="/TenantGoods/GetGoodsSalesList",
@@ -369,7 +398,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="门店商品销售流水 ODSGetGoodsSalesList -> orderGoodsLedgers 原始 JSON", description="闂ㄥ簵鍟嗗搧閿€鍞祦姘?ODS锛欸etGoodsSalesList -> orderGoodsLedgers 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_PAYMENT", code="ODS_PAYMENT",
@@ -386,7 +415,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="支付流水 ODSGetPayLogListPage 原始 JSON", description="鏀粯娴佹按 ODS锛欸etPayLogListPage 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_REFUND", code="ODS_REFUND",
@@ -403,10 +432,10 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="退款流水 ODSGetRefundPayLogList 原始 JSON", description="閫€娆炬祦姘?ODS锛欸etRefundPayLogList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_COUPON_VERIFY", code="ODS_PLATFORM_COUPON",
class_name="OdsCouponVerifyTask", class_name="OdsCouponVerifyTask",
table_name="billiards_ods.platform_coupon_redemption_records", table_name="billiards_ods.platform_coupon_redemption_records",
endpoint="/Promotion/GetOfflineCouponConsumePageList", endpoint="/Promotion/GetOfflineCouponConsumePageList",
@@ -420,7 +449,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="平台/团购券核销 ODSGetOfflineCouponConsumePageList 原始 JSON", description="骞冲彴/鍥㈣喘鍒告牳閿€ ODS锛欸etOfflineCouponConsumePageList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_MEMBER", code="ODS_MEMBER",
@@ -438,7 +467,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="会员档案 ODSGetTenantMemberList -> tenantMemberInfos 原始 JSON", description="浼氬憳妗f ODS锛欸etTenantMemberList -> tenantMemberInfos 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_MEMBER_CARD", code="ODS_MEMBER_CARD",
@@ -456,7 +485,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="会员储值卡 ODSGetTenantMemberCardList -> tenantMemberCards 原始 JSON", description="浼氬憳鍌ㄥ€煎崱 ODS锛欸etTenantMemberCardList -> tenantMemberCards 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_MEMBER_BALANCE", code="ODS_MEMBER_BALANCE",
@@ -474,7 +503,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="会员余额变动 ODSGetMemberCardBalanceChange -> tenantMemberCardLogs 原始 JSON", description="浼氬憳浣欓鍙樺姩 ODS锛欸etMemberCardBalanceChange -> tenantMemberCardLogs 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_RECHARGE_SETTLE", code="ODS_RECHARGE_SETTLE",
@@ -483,19 +512,83 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
endpoint="/Site/GetRechargeSettleList", endpoint="/Site/GetRechargeSettleList",
data_path=("data",), data_path=("data",),
list_key="settleList", list_key="settleList",
pk_columns=(), pk_columns=(_int_col("recharge_order_id", "settleList.id", "id", required=True),),
extra_columns=(
_int_col("tenant_id", "settleList.tenantId", "tenantId"),
_int_col("site_id", "settleList.siteId", "siteId", "siteProfile.id"),
ColumnSpec("site_name_snapshot", sources=("siteProfile.shop_name", "settleList.siteName")),
_int_col("member_id", "settleList.memberId", "memberId"),
ColumnSpec("member_name_snapshot", sources=("settleList.memberName", "memberName")),
ColumnSpec("member_phone_snapshot", sources=("settleList.memberPhone", "memberPhone")),
_int_col("tenant_member_card_id", "settleList.tenantMemberCardId", "tenantMemberCardId"),
ColumnSpec("member_card_type_name", sources=("settleList.memberCardTypeName", "memberCardTypeName")),
_int_col("settle_relate_id", "settleList.settleRelateId", "settleRelateId"),
_int_col("settle_type", "settleList.settleType", "settleType"),
ColumnSpec("settle_name", sources=("settleList.settleName", "settleName")),
_int_col("is_first", "settleList.isFirst", "isFirst"),
_int_col("settle_status", "settleList.settleStatus", "settleStatus"),
_decimal_col("pay_amount", "settleList.payAmount", "payAmount"),
_decimal_col("refund_amount", "settleList.refundAmount", "refundAmount"),
_decimal_col("point_amount", "settleList.pointAmount", "pointAmount"),
_decimal_col("cash_amount", "settleList.cashAmount", "cashAmount"),
_decimal_col("online_amount", "settleList.onlineAmount", "onlineAmount"),
_decimal_col("balance_amount", "settleList.balanceAmount", "balanceAmount"),
_decimal_col("card_amount", "settleList.cardAmount", "cardAmount"),
_decimal_col("coupon_amount", "settleList.couponAmount", "couponAmount"),
_decimal_col("recharge_card_amount", "settleList.rechargeCardAmount", "rechargeCardAmount"),
_decimal_col("gift_card_amount", "settleList.giftCardAmount", "giftCardAmount"),
_decimal_col("prepay_money", "settleList.prepayMoney", "prepayMoney"),
_decimal_col("consume_money", "settleList.consumeMoney", "consumeMoney"),
_decimal_col("goods_money", "settleList.goodsMoney", "goodsMoney"),
_decimal_col("real_goods_money", "settleList.realGoodsMoney", "realGoodsMoney"),
_decimal_col("table_charge_money", "settleList.tableChargeMoney", "tableChargeMoney"),
_decimal_col("service_money", "settleList.serviceMoney", "serviceMoney"),
_decimal_col("activity_discount", "settleList.activityDiscount", "activityDiscount"),
_decimal_col("all_coupon_discount", "settleList.allCouponDiscount", "allCouponDiscount"),
_decimal_col("goods_promotion_money", "settleList.goodsPromotionMoney", "goodsPromotionMoney"),
_decimal_col("assistant_promotion_money", "settleList.assistantPromotionMoney", "assistantPromotionMoney"),
_decimal_col("assistant_pd_money", "settleList.assistantPdMoney", "assistantPdMoney"),
_decimal_col("assistant_cx_money", "settleList.assistantCxMoney", "assistantCxMoney"),
_decimal_col("assistant_manual_discount", "settleList.assistantManualDiscount", "assistantManualDiscount"),
_decimal_col("coupon_sale_amount", "settleList.couponSaleAmount", "couponSaleAmount"),
_decimal_col("member_discount_amount", "settleList.memberDiscountAmount", "memberDiscountAmount"),
_decimal_col("point_discount_price", "settleList.pointDiscountPrice", "pointDiscountPrice"),
_decimal_col("point_discount_cost", "settleList.pointDiscountCost", "pointDiscountCost"),
_decimal_col("adjust_amount", "settleList.adjustAmount", "adjustAmount"),
_decimal_col("rounding_amount", "settleList.roundingAmount", "roundingAmount"),
_int_col("payment_method", "settleList.paymentMethod", "paymentMethod"),
_bool_col("can_be_revoked", "settleList.canBeRevoked", "canBeRevoked"),
_bool_col("is_bind_member", "settleList.isBindMember", "isBindMember"),
_bool_col("is_activity", "settleList.isActivity", "isActivity"),
_bool_col("is_use_coupon", "settleList.isUseCoupon", "isUseCoupon"),
_bool_col("is_use_discount", "settleList.isUseDiscount", "isUseDiscount"),
_int_col("operator_id", "settleList.operatorId", "operatorId"),
ColumnSpec("operator_name_snapshot", sources=("settleList.operatorName", "operatorName")),
_int_col("salesman_user_id", "settleList.salesManUserId", "salesmanUserId", "salesManUserId"),
ColumnSpec("salesman_name", sources=("settleList.salesManName", "salesmanName", "settleList.salesmanName")),
ColumnSpec("order_remark", sources=("settleList.orderRemark", "orderRemark")),
_int_col("table_id", "settleList.tableId", "tableId"),
_int_col("serial_number", "settleList.serialNumber", "serialNumber"),
_int_col("revoke_order_id", "settleList.revokeOrderId", "revokeOrderId"),
ColumnSpec("revoke_order_name", sources=("settleList.revokeOrderName", "revokeOrderName")),
ColumnSpec("revoke_time", sources=("settleList.revokeTime", "revokeTime")),
ColumnSpec("create_time", sources=("settleList.createTime", "createTime")),
ColumnSpec("pay_time", sources=("settleList.payTime", "payTime")),
ColumnSpec("site_profile", sources=("siteProfile",)),
),
include_site_column=False, include_site_column=False,
include_source_endpoint=False, include_source_endpoint=True,
include_page_no=False, include_page_no=False,
include_page_size=False, include_page_size=False,
include_fetched_at=False, include_fetched_at=True,
include_record_index=True, include_record_index=False,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=None,
requires_window=False, requires_window=False,
description="会员充值结算 ODSGetRechargeSettleList -> settleList 原始 JSON", description="?????? ODS?GetRechargeSettleList -> data.settleList ????",
), ),
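# Each ColumnSpec above lists its sources in priority order: the extractor
# presumably tries the nested "settleList.payAmount" path first, then falls
# back to the top-level "payAmount" key when the replayed JSON is already flat.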
OdsTaskSpec( OdsTaskSpec(
code="ODS_PACKAGE", code="ODS_GROUP_PACKAGE",
class_name="OdsPackageTask", class_name="OdsPackageTask",
table_name="billiards_ods.group_buy_packages", table_name="billiards_ods.group_buy_packages",
endpoint="/PackageCoupon/QueryPackageCouponList", endpoint="/PackageCoupon/QueryPackageCouponList",
@@ -510,7 +603,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="团购套餐定义 ODSQueryPackageCouponList -> packageCouponList 原始 JSON", description="鍥㈣喘濂楅瀹氫箟 ODS锛歈ueryPackageCouponList -> packageCouponList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_GROUP_BUY_REDEMPTION", code="ODS_GROUP_BUY_REDEMPTION",
@@ -528,7 +621,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="团购套餐核销 ODSGetSiteTableUseDetails -> siteTableUseDetailsList 原始 JSON", description="鍥㈣喘濂楅鏍搁攢 ODS锛欸etSiteTableUseDetails -> siteTableUseDetailsList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_INVENTORY_STOCK", code="ODS_INVENTORY_STOCK",
@@ -545,7 +638,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="库存汇总 ODSGetGoodsStockReport 原始 JSON", description="搴撳瓨姹囨€?ODS锛欸etGoodsStockReport 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_INVENTORY_CHANGE", code="ODS_INVENTORY_CHANGE",
@@ -562,7 +655,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_fetched_at=False, include_fetched_at=False,
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
description="库存变化记录 ODSQueryGoodsOutboundReceipt -> queryDeliveryRecordsList 原始 JSON", description="搴撳瓨鍙樺寲璁板綍 ODS锛歈ueryGoodsOutboundReceipt -> queryDeliveryRecordsList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_TABLES", code="ODS_TABLES",
@@ -580,7 +673,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="台桌维表 ODSGetSiteTables -> siteTables 原始 JSON", description="鍙版缁磋〃 ODS锛欸etSiteTables -> siteTables 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_GOODS_CATEGORY", code="ODS_GOODS_CATEGORY",
@@ -598,7 +691,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="库存商品分类树 ODSQueryPrimarySecondaryCategory -> goodsCategoryList 原始 JSON", description="搴撳瓨鍟嗗搧鍒嗙被鏍?ODS锛歈ueryPrimarySecondaryCategory -> goodsCategoryList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_STORE_GOODS", code="ODS_STORE_GOODS",
@@ -616,10 +709,10 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="门店商品档案 ODSGetGoodsInventoryList -> orderGoodsList 原始 JSON", description="闂ㄥ簵鍟嗗搧妗f ODS锛欸etGoodsInventoryList -> orderGoodsList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_TABLE_DISCOUNT", code="ODS_TABLE_FEE_DISCOUNT",
class_name="OdsTableDiscountTask", class_name="OdsTableDiscountTask",
table_name="billiards_ods.table_fee_discount_records", table_name="billiards_ods.table_fee_discount_records",
endpoint="/Site/GetTaiFeeAdjustList", endpoint="/Site/GetTaiFeeAdjustList",
@@ -634,7 +727,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="台费折扣/调账 ODSGetTaiFeeAdjustList -> taiFeeAdjustInfos 原始 JSON", description="鍙拌垂鎶樻墸/璋冭处 ODS锛欸etTaiFeeAdjustList -> taiFeeAdjustInfos 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_TENANT_GOODS", code="ODS_TENANT_GOODS",
@@ -652,7 +745,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
include_record_index=True, include_record_index=True,
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
description="租户商品档案 ODSQueryTenantGoods -> tenantGoodsList 原始 JSON", description="绉熸埛鍟嗗搧妗f ODS锛歈ueryTenantGoods -> tenantGoodsList 鍘熷 JSON",
), ),
OdsTaskSpec( OdsTaskSpec(
code="ODS_SETTLEMENT_TICKET", code="ODS_SETTLEMENT_TICKET",
@@ -671,7 +764,7 @@ ODS_TASK_SPECS: Tuple[OdsTaskSpec, ...] = (
conflict_columns_override=("source_file", "record_index"), conflict_columns_override=("source_file", "record_index"),
requires_window=False, requires_window=False,
include_site_id=False, include_site_id=False,
description="结账小票详情 ODSGetOrderSettleTicketNew 原始 JSON", description="缁撹处灏忕エ璇︽儏 ODS锛欸etOrderSettleTicketNew 鍘熷 JSON",
), ),
) )
@@ -725,7 +818,7 @@ class OdsSettlementTicketTask(BaseOdsTask):
if not candidates: if not candidates:
self.logger.info( self.logger.info(
"%s: 窗口[%s ~ %s] 未发现需要抓取的小票", "%s: 绐楀彛[%s ~ %s] 鏈彂鐜伴渶瑕佹姄鍙栫殑灏忕エ",
spec.code, spec.code,
context.window_start, context.window_start,
context.window_end, context.window_end,
@@ -755,7 +848,7 @@ class OdsSettlementTicketTask(BaseOdsTask):
counts["updated"] += updated counts["updated"] += updated
self.db.commit() self.db.commit()
self.logger.info( self.logger.info(
"%s: 小票抓取完成,候选=%s 插入=%s 更新=%s 跳过=%s", "%s: 灏忕エ鎶撳彇瀹屾垚锛屽€欓€?%s 鎻掑叆=%s 鏇存柊=%s 璺宠繃=%s",
spec.code, spec.code,
len(candidates), len(candidates),
inserted, inserted,
@@ -767,7 +860,7 @@ class OdsSettlementTicketTask(BaseOdsTask):
except Exception: except Exception:
counts["errors"] += 1 counts["errors"] += 1
self.db.rollback() self.db.rollback()
self.logger.error("%s: 小票抓取失败", spec.code, exc_info=True) self.logger.error("%s: 灏忕エ鎶撳彇澶辫触", spec.code, exc_info=True)
raise raise
# ------------------------------------------------------------------ helpers # ------------------------------------------------------------------ helpers
@@ -782,7 +875,7 @@ class OdsSettlementTicketTask(BaseOdsTask):
try: try:
rows = self.db.query(sql) rows = self.db.query(sql)
except Exception: except Exception:
self.logger.warning("查询已有小票失败,按空集处理", exc_info=True) self.logger.warning("鏌ヨ宸叉湁灏忕エ澶辫触锛屾寜绌洪泦澶勭悊", exc_info=True)
return set() return set()
return { return {
@@ -819,7 +912,7 @@ class OdsSettlementTicketTask(BaseOdsTask):
try: try:
rows = self.db.query(sql, params) rows = self.db.query(sql, params)
except Exception: except Exception:
self.logger.warning("读取支付流水以获取结算单ID失败将尝试调用支付接口回退", exc_info=True) self.logger.warning("璇诲彇鏀粯娴佹按浠ヨ幏鍙栫粨绠楀崟ID澶辫触锛屽皢灏濊瘯璋冪敤鏀粯鎺ュ彛鍥為€€", exc_info=True)
return set() return set()
return { return {
@@ -853,7 +946,7 @@ class OdsSettlementTicketTask(BaseOdsTask):
if relate_id: if relate_id:
candidate_ids.add(relate_id) candidate_ids.add(relate_id)
except Exception: except Exception:
self.logger.warning("调用支付接口获取结算单ID失败当前批次将跳过回退来源", exc_info=True) self.logger.warning("璋冪敤鏀粯鎺ュ彛鑾峰彇缁撶畻鍗旾D澶辫触锛屽綋鍓嶆壒娆″皢璺宠繃鍥為€€鏉ユ簮", exc_info=True)
return candidate_ids return candidate_ids
def _fetch_ticket_payload(self, order_settle_id: int): def _fetch_ticket_payload(self, order_settle_id: int):
@@ -869,10 +962,10 @@ class OdsSettlementTicketTask(BaseOdsTask):
payload = response payload = response
except Exception: except Exception:
self.logger.warning( self.logger.warning(
"调用小票接口失败 orderSettleId=%s", order_settle_id, exc_info=True "璋冪敤灏忕エ鎺ュ彛澶辫触 orderSettleId=%s", order_settle_id, exc_info=True
) )
if isinstance(payload, dict) and isinstance(payload.get("data"), list) and len(payload["data"]) == 1: if isinstance(payload, dict) and isinstance(payload.get("data"), list) and len(payload["data"]) == 1:
# Local stubs/replays may wrap the response in a single-element list; unwrap it to match the real structure # Local stubs/replays may wrap the response in a single-element list; unwrap it to match the real structure
payload = payload["data"][0] payload = payload["data"][0]
return payload return payload
@@ -899,27 +992,29 @@ def _build_task_class(spec: OdsTaskSpec) -> Type[BaseOdsTask]:
ENABLED_ODS_CODES = { ENABLED_ODS_CODES = {
"ODS_ASSISTANT_ACCOUNTS", "ODS_ASSISTANT_ACCOUNT",
"ODS_ASSISTANT_LEDGER", "ODS_ASSISTANT_LEDGER",
"ODS_ASSISTANT_ABOLISH", "ODS_ASSISTANT_ABOLISH",
"ODS_INVENTORY_CHANGE", "ODS_INVENTORY_CHANGE",
"ODS_INVENTORY_STOCK", "ODS_INVENTORY_STOCK",
"ODS_PACKAGE", "ODS_GROUP_PACKAGE",
"ODS_GROUP_BUY_REDEMPTION", "ODS_GROUP_BUY_REDEMPTION",
"ODS_MEMBER", "ODS_MEMBER",
"ODS_MEMBER_BALANCE", "ODS_MEMBER_BALANCE",
"ODS_MEMBER_CARD", "ODS_MEMBER_CARD",
"ODS_PAYMENT", "ODS_PAYMENT",
"ODS_REFUND", "ODS_REFUND",
"ODS_COUPON_VERIFY", "ODS_PLATFORM_COUPON",
"ODS_RECHARGE_SETTLE", "ODS_RECHARGE_SETTLE",
"ODS_TABLE_USE",
"ODS_TABLES", "ODS_TABLES",
"ODS_GOODS_CATEGORY", "ODS_GOODS_CATEGORY",
"ODS_STORE_GOODS", "ODS_STORE_GOODS",
"ODS_TABLE_DISCOUNT", "ODS_TABLE_FEE_DISCOUNT",
"ODS_STORE_GOODS_SALES",
"ODS_TENANT_GOODS", "ODS_TENANT_GOODS",
"ODS_SETTLEMENT_TICKET", "ODS_SETTLEMENT_TICKET",
"ODS_ORDER_SETTLE", "ODS_SETTLEMENT_RECORDS",
} }
ODS_TASK_CLASSES: Dict[str, Type[BaseOdsTask]] = { ODS_TASK_CLASSES: Dict[str, Type[BaseOdsTask]] = {
@@ -931,3 +1026,4 @@ ODS_TASK_CLASSES: Dict[str, Type[BaseOdsTask]] = {
ODS_TASK_CLASSES["ODS_SETTLEMENT_TICKET"] = OdsSettlementTicketTask ODS_TASK_CLASSES["ODS_SETTLEMENT_TICKET"] = OdsSettlementTicketTask
__all__ = ["ODS_TASK_CLASSES", "ODS_TASK_SPECS", "BaseOdsTask", "ENABLED_ODS_CODES"] __all__ = ["ODS_TASK_CLASSES", "ODS_TASK_SPECS", "BaseOdsTask", "ENABLED_ODS_CODES"]

View File

@@ -1,4 +1,4 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
from .base_dwd_task import BaseDwdTask from .base_dwd_task import BaseDwdTask
from loaders.facts.payment import PaymentLoader from loaders.facts.payment import PaymentLoader
from models.parsers import TypeParser from models.parsers import TypeParser
@@ -29,7 +29,7 @@ class PaymentsDwdTask(BaseDwdTask):
# Iterate ODS Data # Iterate ODS Data
batches = self.iter_ods_rows( batches = self.iter_ods_rows(
table_name="billiards_ods.ods_payment_record", table_name="billiards_ods.payment_transactions",
columns=["site_id", "pay_id", "payload", "fetched_at"], columns=["site_id", "pay_id", "payload", "fetched_at"],
start_time=window_start, start_time=window_start,
end_time=window_end end_time=window_end
@@ -136,3 +136,4 @@ class PaymentsDwdTask(BaseDwdTask):
except Exception as e: except Exception as e:
self.logger.warning(f"Error parsing payment: {e}") self.logger.warning(f"Error parsing payment: {e}")
return None return None

View File

@@ -1,4 +1,4 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
"""Unit tests for the new ODS ingestion tasks.""" """Unit tests for the new ODS ingestion tasks."""
import logging import logging
import os import os
@@ -22,21 +22,21 @@ def _build_config(tmp_path):
return create_test_config("ONLINE", archive_dir, temp_dir) return create_test_config("ONLINE", archive_dir, temp_dir)
def test_ods_assistant_accounts_ingest(tmp_path): def test_assistant_accounts_masters_ingest(tmp_path):
"""Ensure ODS_ASSISTANT_ACCOUNTS task stores raw payload with record_index dedup keys.""" """Ensure assistant_accounts_masterS task stores raw payload with record_index dedup keys."""
config = _build_config(tmp_path) config = _build_config(tmp_path)
sample = [ sample = [
{ {
"id": 5001, "id": 5001,
"assistant_no": "A01", "assistant_no": "A01",
"nickname": "小张", "nickname": "灏忓紶",
} }
] ]
api = FakeAPIClient({"/PersonnelManagement/SearchAssistantInfo": sample}) api = FakeAPIClient({"/PersonnelManagement/SearchAssistantInfo": sample})
task_cls = ODS_TASK_CLASSES["ODS_ASSISTANT_ACCOUNTS"] task_cls = ODS_TASK_CLASSES["assistant_accounts_masterS"]
with get_db_operations() as db_ops: with get_db_operations() as db_ops:
task = task_cls(config, db_ops, api, logging.getLogger("test_ods_assistant_accounts")) task = task_cls(config, db_ops, api, logging.getLogger("test_assistant_accounts_masters"))
result = task.execute() result = task.execute()
assert result["status"] == "SUCCESS" assert result["status"] == "SUCCESS"
@@ -49,21 +49,21 @@ def test_ods_assistant_accounts_ingest(tmp_path):
assert '"id": 5001' in row["payload"] assert '"id": 5001' in row["payload"]
def test_ods_inventory_change_ingest(tmp_path): def test_goods_stock_movements_ingest(tmp_path):
"""Ensure ODS_INVENTORY_CHANGE task stores raw payload with record_index dedup keys.""" """Ensure goods_stock_movements task stores raw payload with record_index dedup keys."""
config = _build_config(tmp_path) config = _build_config(tmp_path)
sample = [ sample = [
{ {
"siteGoodsStockId": 123456, "siteGoodsStockId": 123456,
"stockType": 1, "stockType": 1,
"goodsName": "测试商品", "goodsName": "娴嬭瘯鍟嗗搧",
} }
] ]
api = FakeAPIClient({"/GoodsStockManage/QueryGoodsOutboundReceipt": sample}) api = FakeAPIClient({"/GoodsStockManage/QueryGoodsOutboundReceipt": sample})
task_cls = ODS_TASK_CLASSES["ODS_INVENTORY_CHANGE"] task_cls = ODS_TASK_CLASSES["goods_stock_movements"]
with get_db_operations() as db_ops: with get_db_operations() as db_ops:
task = task_cls(config, db_ops, api, logging.getLogger("test_ods_inventory_change")) task = task_cls(config, db_ops, api, logging.getLogger("test_goods_stock_movements"))
result = task.execute() result = task.execute()
assert result["status"] == "SUCCESS" assert result["status"] == "SUCCESS"
@@ -75,7 +75,7 @@ def test_ods_inventory_change_ingest(tmp_path):
assert '"siteGoodsStockId": 123456' in row["payload"] assert '"siteGoodsStockId": 123456' in row["payload"]
def test_ods_member_profiles_ingest(tmp_path): def test_member_profiles_ingest(tmp_path):
"""Ensure ODS_MEMBER task stores tenantMemberInfos raw JSON.""" """Ensure ODS_MEMBER task stores tenantMemberInfos raw JSON."""
config = _build_config(tmp_path) config = _build_config(tmp_path)
sample = [{"tenantMemberInfos": [{"id": 101, "mobile": "13800000000"}]}] sample = [{"tenantMemberInfos": [{"id": 101, "mobile": "13800000000"}]}]
@@ -110,14 +110,14 @@ def test_ods_payment_ingest(tmp_path):
def test_ods_settlement_records_ingest(tmp_path): def test_ods_settlement_records_ingest(tmp_path):
"""Ensure ODS_ORDER_SETTLE task stores settleList raw JSON.""" """Ensure settlement_records task stores settleList raw JSON."""
config = _build_config(tmp_path) config = _build_config(tmp_path)
sample = [{"data": {"settleList": [{"id": 701, "orderTradeNo": 8001}]}}] sample = [{"data": {"settleList": [{"id": 701, "orderTradeNo": 8001}]}}]
api = FakeAPIClient({"/Site/GetAllOrderSettleList": sample}) api = FakeAPIClient({"/Site/GetAllOrderSettleList": sample})
task_cls = ODS_TASK_CLASSES["ODS_ORDER_SETTLE"] task_cls = ODS_TASK_CLASSES["settlement_records"]
with get_db_operations() as db_ops: with get_db_operations() as db_ops:
task = task_cls(config, db_ops, api, logging.getLogger("test_ods_order_settle")) task = task_cls(config, db_ops, api, logging.getLogger("test_settlement_records"))
result = task.execute() result = task.execute()
assert result["status"] == "SUCCESS" assert result["status"] == "SUCCESS"
@@ -158,3 +158,4 @@ def test_ods_settlement_ticket_by_payment_relate_ids(tmp_path):
and call.get("params", {}).get("orderSettleId") == 9001 and call.get("params", {}).get("orderSettleId") == 9001
for call in api.calls for call in api.calls
) )

1361
tmp/20251121-task.txt Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,321 @@
# -*- coding: utf-8 -*-
"""鎵嬪伐绀轰緥鏁版嵁鐏屽叆锛氭寜 schema_ODS_doc.sql 涓婚敭/鍞竴閿壒閲忓啓鍏?ODS銆?""
from __future__ import annotations
import json
import os
from datetime import datetime
from typing import Any, Iterable
from psycopg2.extras import Json
from .base_task import BaseTask
class ManualIngestTask(BaseTask):
"""湴绀轰緥 JSON 鐏屽叆 ODS锛岀淇濊鍚嶃佷富閿佹彃鍏ュ垪涓?schema_ODS_doc.sql 瀵归綈銆?""
def __init__(self, config, db_connection, api_client, logger):
"""鍒濆鍖栫紦瀛橈紝閬垮厤閲嶅鏌ヨ琛ㄧ粨鏋勩€?""
super().__init__(config, db_connection, api_client, logger)
self._table_columns_cache: dict[str, list[str]] = {}
# File keyword -> target table (matches the sample JSON names under C:\dev\LLTQ\export\temp\source-data-doc)
FILE_MAPPING: list[tuple[tuple[str, ...], str]] = [
(("会员档案", "member_profiles"), "billiards_ods.member_profiles"),
(("余额变更记录", "member_balance_changes"), "billiards_ods.member_balance_changes"),
(("储值卡列表", "member_stored_value_cards"), "billiards_ods.member_stored_value_cards"),
(("充值记录", "recharge_settlements"), "billiards_ods.recharge_settlements"),
(("结账记录", "settlement_records"), "billiards_ods.settlement_records"),
(("助教废除", "assistant_cancellation_records"), "billiards_ods.assistant_cancellation_records"),
(("助教账号", "assistant_accounts_master"), "billiards_ods.assistant_accounts_master"),
(("助教流水", "assistant_service_records"), "billiards_ods.assistant_service_records"),
(("台桌列表", "site_tables_master"), "billiards_ods.site_tables_master"),
(("台费打折", "table_fee_discount_records"), "billiards_ods.table_fee_discount_records"),
(("台费流水", "table_fee_transactions"), "billiards_ods.table_fee_transactions"),
(("库存变化记录1", "goods_stock_movements"), "billiards_ods.goods_stock_movements"),
(("库存变化记录2", "stock_goods_category_tree"), "billiards_ods.stock_goods_category_tree"),
(("库存汇总", "goods_stock_summary"), "billiards_ods.goods_stock_summary"),
(("支付记录", "payment_transactions"), "billiards_ods.payment_transactions"),
(("退款记录", "refund_transactions"), "billiards_ods.refund_transactions"),
(("平台验券记录", "platform_coupon_redemption_records"), "billiards_ods.platform_coupon_redemption_records"),
(("团购套餐流水", "group_buy_redemption_records"), "billiards_ods.group_buy_packages_ledger"),
(("团购套餐", "group_buy_packages"), "billiards_ods.group_buy_packages"),
(("小票详情", "settlement_ticket_details"), "billiards_ods.settlement_ticket_details"),
(("门店商品档案", "store_goods_master"), "billiards_ods.store_goods_master"),
(("商品档案", "tenant_goods_master"), "billiards_ods.tenant_goods_master"),
(("门店商品销售记录", "store_goods_sales_records"), "billiards_ods.store_goods_sales_records"),
]
# Table specs: pk = primary-key column (None means plain insert without conflict update); json_cols = fields stored in their own JSONB columns
TABLE_SPECS: dict[str, dict[str, Any]] = {
"billiards_ods.member_profiles": {"pk": "id"},
"billiards_ods.member_balance_changes": {"pk": "id"},
"billiards_ods.member_stored_value_cards": {"pk": "id"},
"billiards_ods.recharge_settlements": {"pk": None, "json_cols": ["settleList", "siteProfile"]},
"billiards_ods.settlement_records": {"pk": None, "json_cols": ["settleList", "siteProfile"]},
"billiards_ods.assistant_cancellation_records": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.assistant_accounts_master": {"pk": "id"},
"billiards_ods.assistant_service_records": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.site_tables_master": {"pk": "id"},
"billiards_ods.table_fee_discount_records": {"pk": "id", "json_cols": ["siteProfile", "tableProfile"]},
"billiards_ods.table_fee_transactions": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.goods_stock_movements": {"pk": "siteGoodsStockId"},
"billiards_ods.stock_goods_category_tree": {"pk": "id", "json_cols": ["categoryBoxes"]},
"billiards_ods.goods_stock_summary": {"pk": "siteGoodsId"},
"billiards_ods.payment_transactions": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.refund_transactions": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.platform_coupon_redemption_records": {"pk": "id"},
"billiards_ods.tenant_goods_master": {"pk": "id"},
"billiards_ods.group_buy_packages": {"pk": "id"},
"billiards_ods.group_buy_packages_ledger": {"pk": "id"},
"billiards_ods.settlement_ticket_details": {
"pk": "orderSettleId",
"json_cols": ["memberProfile", "orderItem", "tenantMemberCardLogs"],
},
"billiards_ods.store_goods_master": {"pk": "id"},
"billiards_ods.store_goods_sales_records": {"pk": "id"},
}
def get_task_code(self) -> str:
"""杩斿洖浠诲姟缂栫爜銆?""
return "MANUAL_INGEST"
def execute(self, cursor_data: dict | None = None) -> dict:
"""浠庣ず鑼冪洰褰曡鍙?JSON锛屾寜琛?涓婚敭鎵归噺鍏ュ簱銆?""
data_dir = (
self.config.get("manual.data_dir")
or self.config.get("pipeline.ingest_source_dir")
or r"c:\dev\LLTQ\ETL\feiqiu-ETL\etl_billiards\tests\testdata_json"
)
if not os.path.exists(data_dir):
self.logger.error("Data directory not found: %s", data_dir)
return {"status": "error", "message": "Directory not found"}
counts = {"fetched": 0, "inserted": 0, "updated": 0, "skipped": 0, "errors": 0}
for filename in sorted(os.listdir(data_dir)):
if not filename.endswith(".json"):
continue
filepath = os.path.join(data_dir, filename)
try:
with open(filepath, "r", encoding="utf-8") as fh:
raw_entries = json.load(fh)
except Exception:
counts["errors"] += 1
self.logger.exception("Failed to read %s", filename)
continue
if not isinstance(raw_entries, list):
raw_entries = [raw_entries]
records = self._extract_records(raw_entries)
if not records:
counts["skipped"] += 1
continue
target_table = self._match_by_filename(filename)
if not target_table:
self.logger.warning("No mapping found for file: %s", filename)
counts["skipped"] += 1
continue
self.logger.info("Ingesting %s into %s", filename, target_table)
try:
inserted, updated = self._ingest_table(target_table, records, filename)
counts["inserted"] += inserted
counts["updated"] += updated
counts["fetched"] += len(records)
except Exception:
counts["errors"] += 1
self.logger.exception("Error processing %s", filename)
self.db.rollback()
continue
try:
self.db.commit()
except Exception:
self.db.rollback()
raise
return {"status": "SUCCESS", "counts": counts}
# ------------------------------------------------------------------ helpers
def _match_by_filename(self, filename: str) -> str | None:
"""鏍规嵁鏂囦欢鍚嶅叧閿瘝鎵惧埌鐩爣琛ㄣ?""
for keywords, table in self.FILE_MAPPING:
if any(keyword and keyword in filename for keyword in keywords):
return table
return None
def _extract_records(self, raw_entries: Iterable[Any]) -> list[dict]:
"""鍏煎澶氱 JSON 缁撴瀯锛屾彁鍙栨垚璁板綍鍒楄〃銆?""
records: list[dict] = []
for entry in raw_entries:
if isinstance(entry, dict):
# If the entry has "data" but also other keys (e.g. orderSettleId), prefer the outer layer so the primary key is not lost
preferred = entry
if "data" in entry and not any(k not in {"data", "code"} for k in entry.keys()):
preferred = entry["data"]
data = preferred
if isinstance(data, dict):
list_used = False
for v in data.values():
if isinstance(v, list) and v and isinstance(v[0], dict):
records.extend(v)
list_used = True
break
if list_used:
continue
if isinstance(data, list) and data and isinstance(data[0], dict):
records.extend(data)
elif isinstance(data, dict):
records.append(data)
elif isinstance(entry, list):
records.extend([item for item in entry if isinstance(item, dict)])
return records
def _get_table_columns(self, table: str) -> list[str]:
"""鏌ヨ伅_schema锛岃幏鍙栫洰鏍囪鐨勫叏閮ㄥ垪鍚嶏紙鎸夐搴忥級銆?""
if table in self._table_columns_cache:
return self._table_columns_cache[table]
if "." in table:
schema, name = table.split(".", 1)
else:
schema, name = "public", table
sql = """
SELECT column_name, data_type, udt_name
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position
"""
with self.db.conn.cursor() as cur:
cur.execute(sql, (schema, name))
cols = [(r[0], (r[1] or "").lower(), (r[2] or "").lower()) for r in cur.fetchall()]
self._table_columns_cache[table] = cols
return cols
def _ingest_table(self, table: str, records: list[dict], source_file: str) -> tuple[int, int]:
"""鏋勯€?INSERT/ON CONFLICT 璇彞骞舵壒閲忔墽琛屻€?""
spec = self.TABLE_SPECS.get(table)
if not spec:
raise ValueError(f"No table spec for {table}")
pk_col = spec.get("pk")
json_cols = set(spec.get("json_cols", []))
json_cols_lower = {c.lower() for c in json_cols}
columns_info = self._get_table_columns(table)
columns = [c[0] for c in columns_info]
db_json_cols_lower = {
c[0].lower() for c in columns_info if c[1] in ("json", "jsonb") or c[2] in ("json", "jsonb")
}
pk_col_db = None
if pk_col:
pk_col_db = next((c for c in columns if c.lower() == pk_col.lower()), pk_col)
placeholders = ", ".join(["%s"] * len(columns))
col_list = ", ".join(f'"{c}"' for c in columns)
sql = f'INSERT INTO {table} ({col_list}) VALUES ({placeholders})'
if pk_col_db:
update_cols = [c for c in columns if c != pk_col_db]
set_clause = ", ".join(f'"{c}"=EXCLUDED."{c}"' for c in update_cols)
sql += f' ON CONFLICT ("{pk_col_db}") DO UPDATE SET {set_clause}'
sql += " RETURNING (xmax = 0) AS inserted"
params = []
now = datetime.now()
json_dump = lambda v: json.dumps(v, ensure_ascii=False) # noqa: E731
for rec in records:
merged_rec = rec if isinstance(rec, dict) else {}
# Unwrap nested data -> data.data layers to backfill missing fields
data_part = merged_rec.get("data")
while isinstance(data_part, dict):
merged_rec = {**data_part, **merged_rec}
data_part = data_part.get("data")
pk_val = self._get_value_case_insensitive(merged_rec, pk_col) if pk_col else None
if pk_col and (pk_val is None or pk_val == ""):
continue
row_vals = []
for col_name, data_type, udt in columns_info:
col_lower = col_name.lower()
if col_lower == "payload":
row_vals.append(Json(rec, dumps=json_dump))
continue
if col_lower == "source_file":
row_vals.append(source_file)
continue
if col_lower == "fetched_at":
row_vals.append(merged_rec.get(col_name, now))
continue
value = self._normalize_scalar(self._get_value_case_insensitive(merged_rec, col_name))
if col_lower in json_cols_lower or col_lower in db_json_cols_lower:
row_vals.append(Json(value, dumps=json_dump) if value is not None else None)
continue
casted = self._cast_value(value, data_type)
row_vals.append(casted)
params.append(tuple(row_vals))
if not params:
return 0, 0
inserted = 0
updated = 0
with self.db.conn.cursor() as cur:
for row in params:
cur.execute(sql, row)
try:
flag = cur.fetchone()[0]
except Exception:
flag = None
if flag:
inserted += 1
else:
updated += 1
return inserted, updated
def _get_value_case_insensitive(self, record: dict, col: str):
"""蹇界暐澶у皬鍐欒幏鍙栧硷紝鍏煎 information_schema 灏忓啓鍒楀悕涓?JSON 鍘熷澶у皬鍐欍?""
if record is None:
return None
if col is None:
return None
if col in record:
return record.get(col)
col_lower = col.lower()
for k, v in record.items():
if isinstance(k, str) and k.lower() == col_lower:
return v
return None
def _normalize_scalar(self, value):
"""灏嗙┖瀛楃涓叉爣鍑嗗寲涓?None锛岄伩鍏嶆暟鍊?鏃堕棿瀛楁绫诲瀷閿欒銆?""
if value == "" or value == "{}" or value == "[]":
return None
return value
def _cast_value(self, value, data_type: str):
"""鏍规嵁鍒楃被鍨嬪仛杞婚噺杞崲锛岄伩鍏嶇被鍨嬩笉鍖归厤銆?""
if value is None:
return None
dt = (data_type or "").lower()
if dt in ("integer", "bigint", "smallint"):
if isinstance(value, bool):
return int(value)
try:
return int(value)
except Exception:
return None
if dt in ("numeric", "double precision", "real", "decimal"):
if isinstance(value, bool):
return int(value)
try:
return float(value)
except Exception:
return None
if dt.startswith("timestamp") or dt in ("date", "time", "interval"):
# Date/time columns only accept strings; any other type becomes None
return value if isinstance(value, str) else None
return value

View File

@@ -0,0 +1,347 @@
# -*- coding: utf-8 -*-
"""手工示例数据灌入:按 schema_ODS_doc.sql 的表结构写入 ODS。"""
from __future__ import annotations
import json
import os
from datetime import datetime
from typing import Any, Iterable
from psycopg2.extras import Json
from .base_task import BaseTask
class ManualIngestTask(BaseTask):
"""本地示例 JSON 灌入 ODS确保表名/主键/插入列与 schema_ODS_doc.sql 对齐。"""
FILE_MAPPING: list[tuple[tuple[str, ...], str]] = [
(("member_profiles",), "billiards_ods.member_profiles"),
(("member_balance_changes",), "billiards_ods.member_balance_changes"),
(("member_stored_value_cards",), "billiards_ods.member_stored_value_cards"),
(("recharge_settlements",), "billiards_ods.recharge_settlements"),
(("settlement_records",), "billiards_ods.settlement_records"),
(("assistant_cancellation_records",), "billiards_ods.assistant_cancellation_records"),
(("assistant_accounts_master",), "billiards_ods.assistant_accounts_master"),
(("assistant_service_records",), "billiards_ods.assistant_service_records"),
(("site_tables_master",), "billiards_ods.site_tables_master"),
(("table_fee_discount_records",), "billiards_ods.table_fee_discount_records"),
(("table_fee_transactions",), "billiards_ods.table_fee_transactions"),
(("goods_stock_movements",), "billiards_ods.goods_stock_movements"),
(("stock_goods_category_tree",), "billiards_ods.stock_goods_category_tree"),
(("goods_stock_summary",), "billiards_ods.goods_stock_summary"),
(("payment_transactions",), "billiards_ods.payment_transactions"),
(("refund_transactions",), "billiards_ods.refund_transactions"),
(("platform_coupon_redemption_records",), "billiards_ods.platform_coupon_redemption_records"),
(("group_buy_redemption_records",), "billiards_ods.group_buy_redemption_records"),
(("group_buy_packages",), "billiards_ods.group_buy_packages"),
(("settlement_ticket_details",), "billiards_ods.settlement_ticket_details"),
(("store_goods_master",), "billiards_ods.store_goods_master"),
(("tenant_goods_master",), "billiards_ods.tenant_goods_master"),
(("store_goods_sales_records",), "billiards_ods.store_goods_sales_records"),
]
TABLE_SPECS: dict[str, dict[str, Any]] = {
"billiards_ods.member_profiles": {"pk": "id"},
"billiards_ods.member_balance_changes": {"pk": "id"},
"billiards_ods.member_stored_value_cards": {"pk": "id"},
"billiards_ods.recharge_settlements": {"pk": "id"},
"billiards_ods.settlement_records": {"pk": "id"},
"billiards_ods.assistant_cancellation_records": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.assistant_accounts_master": {"pk": "id"},
"billiards_ods.assistant_service_records": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.site_tables_master": {"pk": "id"},
"billiards_ods.table_fee_discount_records": {"pk": "id", "json_cols": ["siteProfile", "tableProfile"]},
"billiards_ods.table_fee_transactions": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.goods_stock_movements": {"pk": "siteGoodsStockId"},
"billiards_ods.stock_goods_category_tree": {"pk": "id", "json_cols": ["categoryBoxes"]},
"billiards_ods.goods_stock_summary": {"pk": "siteGoodsId"},
"billiards_ods.payment_transactions": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.refund_transactions": {"pk": "id", "json_cols": ["siteProfile"]},
"billiards_ods.platform_coupon_redemption_records": {"pk": "id"},
"billiards_ods.tenant_goods_master": {"pk": "id"},
"billiards_ods.group_buy_packages": {"pk": "id"},
"billiards_ods.group_buy_redemption_records": {"pk": "id"},
"billiards_ods.settlement_ticket_details": {
"pk": "orderSettleId",
"json_cols": ["memberProfile", "orderItem", "tenantMemberCardLogs"],
},
"billiards_ods.store_goods_master": {"pk": "id"},
"billiards_ods.store_goods_sales_records": {"pk": "id"},
}
def get_task_code(self) -> str:
"""返回任务编码。"""
return "MANUAL_INGEST"
def execute(self, cursor_data: dict | None = None) -> dict:
"""从目录读取 JSON按表定义批量入库。"""
data_dir = (
self.config.get("manual.data_dir")
or self.config.get("pipeline.ingest_source_dir")
or r"c:\dev\LLTQ\ETL\feiqiu-ETL\etl_billiards\tests\testdata_json"
)
if not os.path.exists(data_dir):
self.logger.error("Data directory not found: %s", data_dir)
return {"status": "error", "message": "Directory not found"}
counts = {"fetched": 0, "inserted": 0, "updated": 0, "skipped": 0, "errors": 0}
for filename in sorted(os.listdir(data_dir)):
if not filename.endswith(".json"):
continue
filepath = os.path.join(data_dir, filename)
try:
with open(filepath, "r", encoding="utf-8") as fh:
raw_entries = json.load(fh)
except Exception:
counts["errors"] += 1
self.logger.exception("Failed to read %s", filename)
continue
entries = raw_entries if isinstance(raw_entries, list) else [raw_entries]
records = self._extract_records(entries)
if not records:
counts["skipped"] += 1
continue
target_table = self._match_by_filename(filename)
if not target_table:
self.logger.warning("No mapping found for file: %s", filename)
counts["skipped"] += 1
continue
self.logger.info("Ingesting %s into %s", filename, target_table)
try:
inserted, updated = self._ingest_table(target_table, records, filename)
counts["inserted"] += inserted
counts["updated"] += updated
counts["fetched"] += len(records)
except Exception:
counts["errors"] += 1
self.logger.exception("Error processing %s", filename)
self.db.rollback()
continue
try:
self.db.commit()
except Exception:
self.db.rollback()
raise
return {"status": "SUCCESS", "counts": counts}
def _match_by_filename(self, filename: str) -> str | None:
"""根据文件名关键字匹配目标表。"""
for keywords, table in self.FILE_MAPPING:
if any(keyword and keyword in filename for keyword in keywords):
return table
return None
def _extract_records(self, raw_entries: Iterable[Any]) -> list[dict]:
"""兼容多层 data/list 包装,抽取记录列表。"""
records: list[dict] = []
for entry in raw_entries:
if isinstance(entry, dict):
preferred = entry
if "data" in entry and not any(k not in {"data", "code"} for k in entry.keys()):
preferred = entry["data"]
data = preferred
if isinstance(data, dict):
# Special-case settleList (recharge/settlement records): expand the inner settleList entries under data.settleList and drop the outer wrapper
if "settleList" in data:
settle_list_val = data.get("settleList")
if isinstance(settle_list_val, dict):
settle_list_iter = [settle_list_val]
elif isinstance(settle_list_val, list):
settle_list_iter = settle_list_val
else:
settle_list_iter = []
handled = False
for item in settle_list_iter or []:
if not isinstance(item, dict):
continue
inner = item.get("settleList")
merged = dict(inner) if isinstance(inner, dict) else dict(item)
# Keep siteProfile for later field backfill; it is not stored as its own row
site_profile = data.get("siteProfile")
if isinstance(site_profile, dict):
merged.setdefault("siteProfile", site_profile)
records.append(merged)
handled = True
if handled:
continue
list_used = False
for v in data.values():
if isinstance(v, list) and v and isinstance(v[0], dict):
records.extend(v)
list_used = True
break
if list_used:
continue
if isinstance(data, list) and data and isinstance(data[0], dict):
records.extend(data)
elif isinstance(data, dict):
records.append(data)
elif isinstance(entry, list):
records.extend([item for item in entry if isinstance(item, dict)])
return records
def _get_table_columns(self, table: str) -> list[tuple[str, str, str]]:
"""查询 information_schema获取目标表列信息。"""
cache = getattr(self, "_table_columns_cache", {})
if table in cache:
return cache[table]
if "." in table:
schema, name = table.split(".", 1)
else:
schema, name = "public", table
sql = """
SELECT column_name, data_type, udt_name
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position
"""
with self.db.conn.cursor() as cur:
cur.execute(sql, (schema, name))
cols = [(r[0], (r[1] or "").lower(), (r[2] or "").lower()) for r in cur.fetchall()]
cache[table] = cols
self._table_columns_cache = cache
return cols
def _ingest_table(self, table: str, records: list[dict], source_file: str) -> tuple[int, int]:
"""构建 INSERT/ON CONFLICT 语句并批量执行。"""
spec = self.TABLE_SPECS.get(table)
if not spec:
raise ValueError(f"No table spec for {table}")
pk_col = spec.get("pk")
json_cols = set(spec.get("json_cols", []))
json_cols_lower = {c.lower() for c in json_cols}
columns_info = self._get_table_columns(table)
columns = [c[0] for c in columns_info]
db_json_cols_lower = {
c[0].lower() for c in columns_info if c[1] in ("json", "jsonb") or c[2] in ("json", "jsonb")
}
pk_col_db = None
if pk_col:
pk_col_db = next((c for c in columns if c.lower() == pk_col.lower()), pk_col)
placeholders = ", ".join(["%s"] * len(columns))
col_list = ", ".join(f'"{c}"' for c in columns)
sql = f'INSERT INTO {table} ({col_list}) VALUES ({placeholders})'
if pk_col_db:
update_cols = [c for c in columns if c != pk_col_db]
set_clause = ", ".join(f'"{c}"=EXCLUDED."{c}"' for c in update_cols)
sql += f' ON CONFLICT ("{pk_col_db}") DO UPDATE SET {set_clause}'
sql += " RETURNING (xmax = 0) AS inserted"
params = []
now = datetime.now()
json_dump = lambda v: json.dumps(v, ensure_ascii=False) # noqa: E731
for rec in records:
merged_rec = rec if isinstance(rec, dict) else {}
data_part = merged_rec.get("data")
while isinstance(data_part, dict):
merged_rec = {**data_part, **merged_rec}
data_part = data_part.get("data")
# For recharge/settlement tables, backfill store info from siteProfile
if table in {
"billiards_ods.recharge_settlements",
"billiards_ods.settlement_records",
}:
site_profile = merged_rec.get("siteProfile") or merged_rec.get("site_profile")
if isinstance(site_profile, dict):
merged_rec.setdefault("tenantid", site_profile.get("tenant_id") or site_profile.get("tenantId"))
merged_rec.setdefault("siteid", site_profile.get("id") or site_profile.get("siteId"))
merged_rec.setdefault("sitename", site_profile.get("shop_name") or site_profile.get("siteName"))
pk_val = self._get_value_case_insensitive(merged_rec, pk_col) if pk_col else None
if pk_col and (pk_val is None or pk_val == ""):
continue
row_vals = []
for col_name, data_type, udt in columns_info:
col_lower = col_name.lower()
if col_lower == "payload":
row_vals.append(Json(rec, dumps=json_dump))
continue
if col_lower == "source_file":
row_vals.append(source_file)
continue
if col_lower == "fetched_at":
row_vals.append(merged_rec.get(col_name, now))
continue
value = self._normalize_scalar(self._get_value_case_insensitive(merged_rec, col_name))
if col_lower in json_cols_lower or col_lower in db_json_cols_lower:
row_vals.append(Json(value, dumps=json_dump) if value is not None else None)
continue
casted = self._cast_value(value, data_type)
row_vals.append(casted)
params.append(tuple(row_vals))
if not params:
return 0, 0
inserted = 0
updated = 0
with self.db.conn.cursor() as cur:
for row in params:
cur.execute(sql, row)
flag = cur.fetchone()[0]
if flag:
inserted += 1
else:
updated += 1
return inserted, updated
@staticmethod
def _get_value_case_insensitive(record: dict, col: str | None):
"""忽略大小写获取值,兼容 information_schema 与 JSON 原始字段。"""
if record is None or col is None:
return None
if col in record:
return record.get(col)
col_lower = col.lower()
for k, v in record.items():
if isinstance(k, str) and k.lower() == col_lower:
return v
return None
@staticmethod
def _normalize_scalar(value):
"""将空字符串/空 JSON 规范为 None避免类型转换错误。"""
if value == "" or value == "{}" or value == "[]":
return None
return value
@staticmethod
def _cast_value(value, data_type: str):
"""根据列类型做简单转换,保证批量插入兼容。"""
if value is None:
return None
dt = (data_type or "").lower()
if dt in ("integer", "bigint", "smallint"):
if isinstance(value, bool):
return int(value)
try:
return int(value)
except Exception:
return None
if dt in ("numeric", "double precision", "real", "decimal"):
if isinstance(value, bool):
return int(value)
try:
return float(value)
except Exception:
return None
if dt.startswith("timestamp") or dt in ("date", "time", "interval"):
return value if isinstance(value, str) else None
return value

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,634 @@
# -*- coding: utf-8 -*-
import ast
import json
import re
from collections import deque
from pathlib import Path
ROOT = Path(r"C:\dev\LLTQ\ETL\feiqiu-ETL")
SQL_PATH = ROOT / "etl_billiards" / "database" / "schema_dwd_doc.sql"
DOC_DIR = Path(r"C:\dev\LLTQ\export\test-json-doc")
DWD_TASK_PATH = ROOT / "etl_billiards" / "tasks" / "dwd_load_task.py"
SCD_COLS = {"scd2_start_time", "scd2_end_time", "scd2_is_current", "scd2_version"}
SITEPROFILE_FIELD_PURPOSE = {
"id": "门店 ID用于门店维度关联。",
"org_id": "组织/机构 ID用于组织维度归属。",
"shop_name": "门店名称,用于展示与查询。",
"site_label": "门店标签(如 A/B 店),用于展示与分组。",
"full_address": "门店详细地址,用于展示与地理信息。",
"address": "门店地址简称/快照,用于展示。",
"longitude": "经度,用于定位与地图展示。",
"latitude": "纬度,用于定位与地图展示。",
"tenant_site_region_id": "租户下门店区域 ID用于区域维度分析。",
"business_tel": "门店电话,用于联系信息展示。",
"site_type": "门店类型枚举,用于门店分类。",
"shop_status": "门店状态枚举,用于营业状态标识。",
"tenant_id": "租户/品牌 ID用于商户维度过滤与关联。",
"auto_light": "是否启用自动灯控配置,用于门店设备策略。",
"attendance_enabled": "是否启用考勤功能,用于门店考勤配置。",
"attendance_distance": "考勤允许距离(米),用于考勤打卡限制。",
"prod_env": "环境标识(生产/测试),用于区分配置环境。",
"light_status": "灯控状态/开关,用于灯控设备管理。",
"light_type": "灯控类型,用于设备类型区分。",
"light_token": "灯控控制令牌,用于对接灯控服务。",
"avatar": "门店头像/图片 URL用于展示。",
"wifi_name": "门店 WiFi 名称,用于展示与引导。",
"wifi_password": "门店 WiFi 密码,用于展示与引导。",
"customer_service_qrcode": "客服二维码 URL用于引导联系。",
"customer_service_wechat": "客服微信号,用于引导联系。",
"fixed_pay_qrCode": "固定收款码二维码URL用于收款引导。",
"create_time": "门店创建时间(快照字段)。",
"update_time": "门店更新时间(快照字段)。",
}
def _escape_sql(s: str) -> str:
return (s or "").replace("'", "''")
def _first_sentence(text: str, max_len: int = 140) -> str:
s = re.sub(r"\s+", " ", (text or "").strip())
if not s:
return ""
parts = re.split(r"[。;;]\s*", s)
s = parts[0].strip() if parts else s
if len(s) > max_len:
s = s[: max_len - 1] + ""
return s
def normalize_key(s: str) -> str:
return re.sub(r"[_\-\s]", "", (s or "").lower())
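# e.g. normalize_key("Site_Profile") -> "siteprofile"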
def snake_to_lower_camel(s: str) -> str:
parts = re.split(r"[_\-\s]+", s)
if not parts:
return s
first = parts[0].lower()
rest = "".join(p[:1].upper() + p[1:] for p in parts[1:] if p)
return first + rest
def snake_to_upper_camel(s: str) -> str:
parts = re.split(r"[_\-\s]+", s)
return "".join(p[:1].upper() + p[1:] for p in parts if p)
def find_key_in_record(record: dict, token: str) -> str | None:
if not isinstance(record, dict):
return None
if token in record:
return token
norm_to_key = {normalize_key(k): k for k in record.keys()}
candidates = [
token,
token.lower(),
token.upper(),
snake_to_lower_camel(token),
snake_to_upper_camel(token),
]
# Common variants: siteProfile/siteprofile
if normalize_key(token) == "siteprofile":
candidates.extend(["siteProfile", "siteprofile"])
for c in candidates:
nk = normalize_key(c)
if nk in norm_to_key:
return norm_to_key[nk]
return None
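# e.g. find_key_in_record({"siteProfile": {}}, "site_profile") -> "siteProfile"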
def parse_dwd_task_mappings(path: Path):
mod = ast.parse(path.read_text(encoding="utf-8"))
table_map = None
fact_mappings = None
for node in mod.body:
if isinstance(node, ast.ClassDef) and node.name == "DwdLoadTask":
for stmt in node.body:
if isinstance(stmt, ast.Assign) and len(stmt.targets) == 1 and isinstance(stmt.targets[0], ast.Name):
name = stmt.targets[0].id
if name == "TABLE_MAP":
table_map = ast.literal_eval(stmt.value)
elif name == "FACT_MAPPINGS":
fact_mappings = ast.literal_eval(stmt.value)
if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name):
name = stmt.target.id
if name == "TABLE_MAP":
table_map = ast.literal_eval(stmt.value)
elif name == "FACT_MAPPINGS":
fact_mappings = ast.literal_eval(stmt.value)
if not isinstance(table_map, dict) or not isinstance(fact_mappings, dict):
raise RuntimeError("Failed to parse TABLE_MAP/FACT_MAPPINGS from dwd_load_task.py")
return table_map, fact_mappings
def parse_columns_from_ddl(create_sql: str):
start = create_sql.find("(")
end = create_sql.rfind(")")
body = create_sql[start + 1 : end]
cols = []
for line in body.splitlines():
s = line.strip().rstrip(",")
if not s:
continue
if s.upper().startswith("PRIMARY KEY"):
continue
if s.upper().startswith("CONSTRAINT "):
continue
m = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)\s+", s)
if not m:
continue
name = m.group(1)
if name.upper() in {"PRIMARY", "UNIQUE", "FOREIGN", "CHECK"}:
continue
cols.append(name.lower())
return cols
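# Behavior sketch (constraint lines are skipped):
#   parse_columns_from_ddl(
#       "CREATE TABLE IF NOT EXISTS dim_site (\nsite_id BIGINT,\nsite_name TEXT,\nPRIMARY KEY (site_id)\n);"
#   ) -> ["site_id", "site_name"]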
def _find_best_record_list(data, required_norm_keys: set[str]):
best = None
best_score = -1.0
best_path: list[str] = []
q = deque([(data, 0, [])])
visited = 0
while q and visited < 25000:
node, depth, path = q.popleft()
visited += 1
if depth > 10:
continue
if isinstance(node, list):
if node and all(isinstance(x, dict) for x in node[:3]):
scores = []
for x in node[:5]:
keys_norm = {normalize_key(k) for k in x.keys()}
scores.append(len(keys_norm & required_norm_keys))
score = sum(scores) / max(1, len(scores))
if score > best_score:
best_score = score
best = node
best_path = path
for x in node[:10]:
q.append((x, depth + 1, path))
else:
for x in node[:120]:
q.append((x, depth + 1, path))
elif isinstance(node, dict):
for k, v in list(node.items())[:160]:
q.append((v, depth + 1, path + [str(k)]))
node_str = ".".join(best_path) if best_path else "$"
return best or [], node_str
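# Breadth-first scan over the parsed JSON: every list of dicts is scored by how
# many of its normalized keys overlap required_norm_keys, and the best-scoring
# list wins. The returned node string is the dotted dict path to that list
# (e.g. "data.list", names illustrative), or "$" for the document root.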
def _format_example(value, max_len: int = 120) -> str:
if value is None:
return "NULL"
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
s = value.strip()
if len(s) > max_len:
s = s[: max_len - 1] + "…"
return s
if isinstance(value, dict):
keys = list(value)[:6]
mini = {k: value.get(k) for k in keys}
rendered = json.dumps(mini, ensure_ascii=False)
if len(value) > len(keys):
rendered = rendered[:-1] + ", …}"
if len(rendered) > max_len:
rendered = rendered[: max_len - 1] + "…"
return rendered
if isinstance(value, list):
if not value:
return "[]"
rendered = json.dumps(value[0], ensure_ascii=False)
if len(value) > 1:
rendered = f"[{rendered}, …] (len={len(value)})"
else:
rendered = f"[{rendered}]"
if len(rendered) > max_len:
rendered = rendered[: max_len - 1] + "…"
return rendered
s = str(value)
if len(s) > max_len:
s = s[: max_len - 1] + "…"
return s
def _infer_purpose(table: str, col: str, json_path: str | None) -> str:
lcol = col.lower()
if lcol in SCD_COLS:
if lcol == "scd2_start_time":
return "SCD2 开始时间(版本生效起点),用于维度慢变追踪。"
if lcol == "scd2_end_time":
return "SCD2 结束时间(默认 9999-12-31 表示当前版本),用于维度慢变追踪。"
if lcol == "scd2_is_current":
return "SCD2 当前版本标记1=当前0=历史),用于筛选最新维度记录。"
if lcol == "scd2_version":
return "SCD2 版本号(自增),用于与时间段一起避免版本重叠。"
if json_path and json_path.startswith("siteProfile."):
sf = json_path.split(".", 1)[1]
return SITEPROFILE_FIELD_PURPOSE.get(sf, "门店快照字段,用于门店维度补充信息。")
if lcol.endswith("_id"):
return "标识类 ID 字段,用于关联/定位相关实体。"
if lcol.endswith("_time") or lcol.endswith("time") or lcol.endswith("_date"):
return "时间/日期字段,用于记录业务时间与统计口径对齐。"
if any(k in lcol for k in ["amount", "money", "fee", "price", "deduct", "cost", "balance"]):
return "金额字段,用于计费/结算/核算等金额计算。"
if any(k in lcol for k in ["count", "num", "number", "seconds", "qty", "quantity"]):
return "数量/时长字段,用于统计与计量。"
if lcol.endswith("_name") or lcol.endswith("name"):
return "名称字段,用于展示与辅助识别。"
if lcol.endswith("_status") or lcol == "status":
return "状态枚举字段,用于标识业务状态。"
if lcol.startswith("is_") or lcol.startswith("can_"):
return "布尔/开关字段,用于表示是否/可用性等业务开关。"
# Table-level fallback
if table.startswith("dim_"):
return "维度字段,用于补充维度属性。"
return "明细字段,用于记录事实取值。"
def _parse_json_extract(expr: str):
# e.g. siteprofile->>'org_id'
m = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)\s*->>\s*'([^']+)'\s*$", expr)
if not m:
return None
base = m.group(1)
field = m.group(2)
if normalize_key(base) == "siteprofile":
base = "siteProfile"
return base, field
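# Example: _parse_json_extract("siteprofile->>'org_id'") -> ("siteProfile", "org_id").
# Anything that is not a single col->>'field' JSON extraction returns None.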
def build_table_comment(table: str, source_ods: str | None, source_json_base: str | None) -> str:
table_l = table.lower()
if table_l.startswith("dim_"):
kind = "DWD 维度表"
else:
kind = "DWD 明细事实表"
extra = "扩展字段表" if table_l.endswith("_ex") else ""
if source_ods and source_json_base:
src = (
f"ODS 来源表:{source_ods}(对应 JSON{source_json_base}.json分析{source_json_base}-Analysis.md"
f"装载/清洗逻辑参考etl_billiards/tasks/dwd_load_task.pyDwdLoadTask"
)
else:
src = "来源:由 ODS 清洗装载生成(详见 DWD 装载任务)。"
return f"{kind}{('' + extra + '') if extra else ''}{table_l}{src}"
def get_source_info(table_l: str, table_map: dict) -> tuple[str | None, str | None]:
key = f"billiards_dwd.{table_l}"
source_ods = table_map.get(key)
if not source_ods:
return None, None
json_base = source_ods.split(".")[-1]
return source_ods, json_base
def build_column_mappings(table_l: str, cols: list[str], fact_mappings: dict) -> dict[str, tuple[str | None, str | None]]:
# return col -> (json_path, src_expr)
mapping_list = fact_mappings.get(f"billiards_dwd.{table_l}") or []
explicit = {dwd_col.lower(): src_expr for dwd_col, src_expr, _cast in mapping_list}
casts = {dwd_col.lower(): cast for dwd_col, _src_expr, cast in mapping_list}
out: dict[str, tuple[str | None, str | None]] = {}
for c in cols:
if c in SCD_COLS:
out[c] = (None, None)
continue
src_expr = explicit.get(c, c)
cast = casts.get(c)
json_path = None
parsed = _parse_json_extract(src_expr)
if parsed:
base, field = parsed
json_path = f"{base}.{field}"
else:
# Derived columns (e.g. pay_date cast from pay_time) keep the source expression as the JSON path
json_path = src_expr
out[c] = (json_path, src_expr)
return out
def load_json_records(json_base: str, required_norm_keys: set[str]):
json_path = DOC_DIR / f"{json_base}.json"
data = json.loads(json_path.read_text(encoding="utf-8"))
return _find_best_record_list(data, required_norm_keys)
def pick_example_from_record(record: dict, json_path: str | None):
if not json_path:
return None
if json_path.startswith("siteProfile."):
base_key = find_key_in_record(record, "siteProfile")
base = record.get(base_key) if base_key else None
if isinstance(base, dict):
field = json_path.split(".", 1)[1]
return base.get(field)
return None
# plain key
key = find_key_in_record(record, json_path)
if key:
return record.get(key)
# fallback: try match by normalized name
nk = normalize_key(json_path)
for k in record.keys():
if normalize_key(k) == nk:
return record.get(k)
return None
def resolve_json_field_display(records: list, json_path: str | None, cast: str | None = None) -> str:
if not json_path:
return ""
if json_path.startswith("siteProfile."):
return json_path
actual_key = None
for r in records[:80]:
if not isinstance(r, dict):
continue
k = find_key_in_record(r, json_path)
if k:
actual_key = k
break
base = actual_key or json_path
if cast == "date":
return f"{base}派生DATE({base})"
if cast == "boolean":
return f"{base}派生BOOLEAN({base})"
if cast in {"numeric", "timestamptz"}:
return f"{base}派生CAST({base} AS {cast})"
return base
def resolve_ods_source_field(records: list, src_expr: str | None, cast: str | None = None) -> str:
if not src_expr:
return ""
parsed = _parse_json_extract(src_expr)
if parsed:
base, field = parsed
# Normalize casing for display
if normalize_key(base) == "siteprofile":
base = "siteProfile"
return f"{base}.{field}"
# Plain field: prefer the actual JSON key name (casing/camelCase)
actual = None
for r in records[:80]:
if not isinstance(r, dict):
continue
k = find_key_in_record(r, src_expr)
if k:
actual = k
break
base = actual or src_expr
if cast == "date":
return f"{base}派生DATE({base})"
if cast == "boolean":
return f"{base}派生BOOLEAN({base})"
if cast in {"numeric", "timestamptz"}:
return f"{base}派生CAST({base} AS {cast})"
return base
def resolve_json_field_triplet(
json_file: str | None,
record_node: str | None,
records: list,
json_path: str | None,
cast: str | None = None,
) -> str:
if not json_file:
json_file = ""
node = record_node or "$"
if not json_path:
return f"{json_file} - 无 - 无"
if json_path.startswith("siteProfile."):
base_key = None
field_key = None
for r in records[:80]:
if not isinstance(r, dict):
continue
base_key = find_key_in_record(r, "siteProfile")
if base_key:
base = r.get(base_key)
if isinstance(base, dict):
raw_field = json_path.split(".", 1)[1]
# Best-effort match of the sub-field's actual casing
if raw_field in base:
field_key = raw_field
else:
nk = normalize_key(raw_field)
for k in base.keys():
if normalize_key(k) == nk:
field_key = k
break
break
base_key = base_key or "siteProfile"
field_key = field_key or json_path.split(".", 1)[1]
node = f"{node}.{base_key}" if node else base_key
field = field_key
else:
actual = None
for r in records[:80]:
if isinstance(r, dict):
actual = find_key_in_record(r, json_path)
if actual:
break
field = actual or json_path
if cast == "date":
field = f"{field}派生DATE({field})"
elif cast == "boolean":
field = f"{field}派生BOOLEAN({field})"
elif cast in {"numeric", "timestamptz"}:
field = f"{field}派生CAST({field} AS {cast})"
return f"{json_file} - {node} - {field}"
def main():
table_map, fact_mappings = parse_dwd_task_mappings(DWD_TASK_PATH)
raw = SQL_PATH.read_text(encoding="utf-8", errors="replace")
newline = "\r\n" if "\r\n" in raw else "\n"
# strip all sql comments and existing COMMENT ON statements, incl. DO-block comment exec lines
kept_lines = []
for line in raw.splitlines(True):
if line.lstrip().startswith("--"):
continue
if re.match(r"^\s*COMMENT ON\s+(TABLE|COLUMN)\s+", line, re.I):
continue
if "COMMENT ON COLUMN" in line or "COMMENT ON TABLE" in line:
# remove legacy execute format lines too
continue
kept_lines.append(line)
clean = "".join(kept_lines)
create_re = re.compile(
r"(^\s*CREATE TABLE IF NOT EXISTS\s+(?P<table>[A-Za-z0-9_]+)\s*\([\s\S]*?\)\s*;)",
re.M,
)
out_parts = []
last = 0
count_tables = 0
for m in create_re.finditer(clean):
stmt = m.group(1)
table = m.group("table").lower()
out_parts.append(clean[last : m.end()])
cols = parse_columns_from_ddl(stmt)
source_ods, json_base = get_source_info(table, table_map)
# derive required keys
required_norm = set()
col_map = build_column_mappings(table, cols, fact_mappings)
# cast map for json field display
cast_map = {
dwd_col.lower(): cast
for dwd_col, _src_expr, cast in (fact_mappings.get(f"billiards_dwd.{table}") or [])
}
src_expr_map = {
dwd_col.lower(): src_expr
for dwd_col, src_expr, _cast in (fact_mappings.get(f"billiards_dwd.{table}") or [])
}
for c, (jp, _src) in col_map.items():
if not jp:
continue
if jp.startswith("siteProfile."):
required_norm.add(normalize_key("siteProfile"))
else:
required_norm.add(normalize_key(jp))
records = []
record_node = "$"
if json_base and (DOC_DIR / f"{json_base}.json").exists():
try:
records, record_node = load_json_records(json_base, required_norm)
except Exception:
records = []
record_node = "$"
table_comment = build_table_comment(table, source_ods, json_base)
comment_lines = [f"COMMENT ON TABLE billiards_dwd.{table} IS '{_escape_sql(table_comment)}';"]
for c in cols:
jp, _src = col_map.get(c, (None, None))
if c in SCD_COLS:
if c == "scd2_start_time":
ex = "2025-11-10T00:00:00+08:00"
elif c == "scd2_end_time":
ex = "9999-12-31T00:00:00+00:00"
elif c == "scd2_is_current":
ex = "1"
else:
ex = "1"
json_field = "无 - DWD慢变元数据 - 无"
ods_src = "DWD慢变元数据"
else:
# pick example from first records
ex_val = None
for r in records[:80]:
v = pick_example_from_record(r, jp)
if v not in (None, ""):
ex_val = v
break
ex = _format_example(ex_val)
json_field = resolve_json_field_triplet(
f"{json_base}.json" if json_base else None,
record_node,
records,
jp,
cast_map.get(c),
)
src_expr = src_expr_map.get(c, jp)
ods_src = resolve_ods_source_field(records, src_expr, cast_map.get(c))
purpose = _first_sentence(_infer_purpose(table, c, jp), 140)
func = purpose
if "用于" not in func:
func = "用于" + func.rstrip("")
if source_ods:
ods_table_only = source_ods.split(".")[-1]
ods_src_display = f"{ods_table_only} - {ods_src}"
else:
ods_src_display = f"无 - {ods_src}"
comment = (
f"【说明】{purpose}"
f" 【示例】{ex}{func})。"
f" 【ODS来源】{ods_src_display}"
f" 【JSON字段】{json_field}"
)
comment_lines.append(
f"COMMENT ON COLUMN billiards_dwd.{table}.{c} IS '{_escape_sql(comment)}';"
)
out_parts.append(newline + newline + (newline.join(comment_lines)) + newline + newline)
last = m.end()
count_tables += 1
out_parts.append(clean[last:])
result = "".join(out_parts)
# collapse extra blank lines
result = re.sub(r"(?:\r?\n){4,}", newline * 3, result)
backup = SQL_PATH.with_suffix(SQL_PATH.suffix + ".bak")
if not backup.exists():
backup.write_text(raw, encoding="utf-8")
SQL_PATH.write_text(result, encoding="utf-8")
print(f"Rewrote comments for {count_tables} tables: {SQL_PATH}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,560 @@
# -*- coding: utf-8 -*-
import json
import re
from pathlib import Path
from collections import defaultdict
SQL_PATH = Path(r"C:\dev\LLTQ\ETL\feiqiu-ETL\etl_billiards\database\schema_ODS_doc.sql")
DOC_DIR = Path(r"C:\dev\LLTQ\export\test-json-doc")
TABLE_CN = {
"member_profiles": "会员档案/会员账户信息",
"member_balance_changes": "会员余额变更流水",
"member_stored_value_cards": "会员储值/卡券账户列表",
"recharge_settlements": "充值结算记录",
"settlement_records": "结账/结算记录",
"assistant_cancellation_records": "助教作废/取消记录",
"assistant_accounts_master": "助教档案主数据",
"assistant_service_records": "助教服务流水",
"site_tables_master": "门店桌台主数据",
"table_fee_discount_records": "台费折扣记录",
"table_fee_transactions": "台费流水",
"goods_stock_movements": "商品库存变动流水",
"stock_goods_category_tree": "商品分类树",
"goods_stock_summary": "商品库存汇总",
"payment_transactions": "支付流水",
"refund_transactions": "退款流水",
"platform_coupon_redemption_records": "平台券核销/使用记录",
"tenant_goods_master": "租户商品主数据",
"group_buy_packages": "团购套餐主数据",
"group_buy_redemption_records": "团购核销记录",
"settlement_ticket_details": "结算小票明细",
"store_goods_master": "门店商品主数据",
"store_goods_sales_records": "门店商品销售流水",
}
COMMON_FIELD_PURPOSE = {
"tenant_id": "租户/品牌 ID用于商户维度过滤与关联。",
"site_id": "门店 ID用于门店维度过滤与关联。",
"register_site_id": "会员注册门店 ID用于归属门店维度关联。",
"site_name": "门店名称快照,用于直接展示。",
"id": "本表主键 ID用于唯一标识一条记录。",
"system_member_id": "系统级会员 ID跨门店/跨卡种统一到‘人’的维度)。",
"order_trade_no": "订单交易号,用于串联同一订单下的各类消费明细。",
"order_settle_id": "订单结算/结账主键,用于关联结算记录与小票明细。",
"order_pay_id": "关联支付流水的主键 ID用于追溯支付明细。",
"point": "积分余额,用于记录会员积分取值。",
"growth_value": "成长值/成长积分,用于会员成长与等级评估。",
"referrer_member_id": "推荐人会员 ID用于记录会员推荐/拉新关系。",
"create_time": "记录创建时间(业务侧产生时间)。",
"status": "状态枚举,用于标识记录当前业务状态。",
"user_status": "用户状态枚举,用于标识会员账户/用户可用状态。",
"is_delete": "逻辑删除标记0=否1=是)。",
"payload": "完整原始 JSON 记录快照,用于回溯与二次解析。",
"source_file": "ETL 元数据:原始导出文件名,用于数据追溯。",
"source_endpoint": "ETL 元数据:采集来源(接口/文件路径),用于数据追溯。",
"fetched_at": "ETL 元数据:采集/入库时间戳,用于口径对齐与增量处理。",
}
ETL_META_FIELDS = {"source_file", "source_endpoint", "fetched_at"}
def _first_sentence(text: str, max_len: int = 120) -> str:
s = re.sub(r"\s+", " ", (text or "").strip())
if not s:
return ""
parts = re.split(r"[。;;]\s*", s)
s = parts[0].strip() if parts else s
if len(s) > max_len:
s = s[: max_len - 1] + "…"
return s
def _escape_sql(s: str) -> str:
return (s or "").replace("'", "''")
def normalize_key(s: str) -> str:
return re.sub(r"[_\-\s]", "", (s or "").lower())
def snake_to_lower_camel(s: str) -> str:
parts = re.split(r"[_\-\s]+", s)
if not parts:
return s
first = parts[0].lower()
rest = "".join(p[:1].upper() + p[1:] for p in parts[1:] if p)
return first + rest
def snake_to_upper_camel(s: str) -> str:
parts = re.split(r"[_\-\s]+", s)
return "".join(p[:1].upper() + p[1:] for p in parts if p)
def find_key_in_record(record: dict, token: str) -> str | None:
if not isinstance(record, dict) or not token:
return None
if token in record:
return token
norm_to_key = {normalize_key(k): k for k in record.keys()}
candidates = [
token,
token.lower(),
token.upper(),
snake_to_lower_camel(token),
snake_to_upper_camel(token),
]
for c in candidates:
nk = normalize_key(c)
if nk in norm_to_key:
return norm_to_key[nk]
return None
def _infer_purpose(_table: str, col: str) -> str:
if col in COMMON_FIELD_PURPOSE:
return COMMON_FIELD_PURPOSE[col]
lower = col.lower()
if lower.endswith("_id"):
return "标识类 ID 字段,用于关联/定位相关实体。"
if lower.endswith("_time") or lower.endswith("time"):
return "时间字段,用于记录业务时间点/发生时间。"
if any(k in lower for k in ["amount", "money", "fee", "price", "deduct", "cost"]):
return "金额字段,用于计费/结算/分摊等金额计算。"
if any(k in lower for k in ["count", "num", "number", "seconds", "qty"]):
return "数量/时长字段,用于统计与计量。"
if lower.endswith("_name") or lower.endswith("name"):
return "名称字段,用于展示与辅助识别。"
if lower.endswith("_code") or lower.endswith("code"):
return "编码/枚举字段,用于表示类型、等级或业务枚举。"
if lower.startswith("is_") or lower.startswith("able_") or lower.startswith("can_"):
return "布尔/开关字段,用于表示权限、可用性或状态开关。"
return "来自 JSON 导出的原始字段,用于保留业务取值。"
def _format_example(value, max_len: int = 120) -> str:
if value is None:
return "NULL"
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
s = value.strip()
if len(s) > max_len:
s = s[: max_len - 1] + "…"
return s
if isinstance(value, list):
if not value:
return "[]"
sample = value[0]
rendered = json.dumps(sample, ensure_ascii=False)
if len(value) > 1:
rendered = f"[{rendered}, …] (len={len(value)})"
else:
rendered = f"[{rendered}]"
if len(rendered) > max_len:
rendered = rendered[: max_len - 1] + "…"
return rendered
if isinstance(value, dict):
keys = list(value)[:6]
mini = {k: value.get(k) for k in keys}
rendered = json.dumps(mini, ensure_ascii=False)
if len(value) > len(keys):
rendered = rendered[:-1] + ", …}"
if len(rendered) > max_len:
rendered = rendered[: max_len - 1] + "…"
return rendered
rendered = str(value)
if len(rendered) > max_len:
rendered = rendered[: max_len - 1] + "…"
return rendered
def _find_best_record_list(data, columns):
cols = set(columns)
best = None
best_score = -1
queue = [(data, 0)]
visited = 0
while queue and visited < 20000:
node, depth = queue.pop(0)
visited += 1
if depth > 8:
continue
if isinstance(node, list):
if node and all(isinstance(x, dict) for x in node[:3]):
scores = []
for x in node[:5]:
scores.append(len(set(x.keys()) & cols))
score = sum(scores) / max(1, len(scores))
if score > best_score:
best_score = score
best = node
for x in node[:10]:
queue.append((x, depth + 1))
else:
for x in node[:50]:
queue.append((x, depth + 1))
elif isinstance(node, dict):
for v in list(node.values())[:80]:
queue.append((v, depth + 1))
return best
def _find_best_record_list_and_node(data, columns):
cols = set(columns)
best = None
best_score = -1
best_path = []
queue = [(data, 0, [])]
visited = 0
while queue and visited < 25000:
node, depth, path = queue.pop(0)
visited += 1
if depth > 10:
continue
if isinstance(node, list):
if node and all(isinstance(x, dict) for x in node[:3]):
scores = []
for x in node[:5]:
scores.append(len(set(x.keys()) & cols))
score = sum(scores) / max(1, len(scores))
if score > best_score:
best_score = score
best = node
best_path = path
for x in node[:10]:
queue.append((x, depth + 1, path))
else:
for x in node[:80]:
queue.append((x, depth + 1, path))
elif isinstance(node, dict):
for k, v in list(node.items())[:120]:
queue.append((v, depth + 1, path + [str(k)]))
node_str = ".".join(best_path) if best_path else "$"
return best or [], node_str
def _choose_examples(records, columns):
examples = {}
if not records:
return examples
for col in columns:
val = None
for r in records[:120]:
if isinstance(r, dict) and col in r and r[col] not in (None, ""):
val = r[col]
break
examples[col] = val
return examples
def _extract_header_fields(line: str, columns_set):
s = line.strip()
if not s:
return []
# Supports numbered headers like "1. id", "1.1 siteProfile", "8. tenant_id"
m = re.match(r"^\d+(?:\.\d+)*[\.)]?\s+(.+)$", s)
if m:
s = m.group(1).strip()
parts = re.split(r"\s*[/、,]\s*", s)
fields = [p.strip() for p in parts if p.strip() in columns_set]
if not fields and s in columns_set:
fields = [s]
if fields and len(line) <= 120:
return fields
return []
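# Header parsing sketch (assuming these names are in the column set):
#   _extract_header_fields("8. tenant_id", {"tenant_id"})         -> ["tenant_id"]
#   _extract_header_fields("1.1 id / site_id", {"id", "site_id"}) -> ["id", "site_id"]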
def _parse_field_purpose_from_block(block_lines):
lines = [l.rstrip() for l in block_lines]
def pick_after_label(labels):
for i, l in enumerate(lines):
for lab in labels:
if lab in l:
after = l.split(lab, 1)[1].strip()
if after:
return after
buf = []
j = i + 1
while j < len(lines) and not lines[j].strip():
j += 1
for k in range(j, len(lines)):
if not lines[k].strip():
break
if re.match(r"^[\w\u4e00-\u9fff]+[:]", lines[k].strip()):
break
buf.append(lines[k].strip())
if buf:
return " ".join(buf)
return ""
# Also accept variants such as 「含义(结合其它文件):」 / 「含义(推测):」
picked = pick_after_label(["含义:", "含义:"])
if not picked:
for i, l in enumerate(lines):
s = l.strip()
m = re.match(r"^含义.*[:]\s*(.*)$", s)
if m:
after = m.group(1).strip()
if after:
picked = after
else:
buf = []
j = i + 1
while j < len(lines) and not lines[j].strip():
j += 1
for k in range(j, len(lines)):
if not lines[k].strip():
break
if re.match(r"^[\w\u4e00-\u9fff]+[:]", lines[k].strip()):
break
buf.append(lines[k].strip())
if buf:
picked = " ".join(buf)
break
if not picked:
picked = pick_after_label(["作用:", "作用:"])
if not picked:
for i, l in enumerate(lines):
s = l.strip()
m = re.match(r"^作用.*[:]\s*(.*)$", s)
if m:
after = m.group(1).strip()
if after:
picked = after
break
if not picked:
# Fallback: skip descriptive lines such as "类型:" / "唯一值个数:"
for l in lines:
s = l.strip()
if not s:
continue
if any(
s.startswith(prefix)
for prefix in [
"类型:",
"非空:",
"唯一值",
"观测",
"特征",
"统计",
"分布",
"说明:",
"关联:",
"结构关系",
"和其它表",
"重复记录",
"全部为",
]
):
continue
picked = s
break
return _first_sentence(picked, 160)
def _is_poor_purpose(purpose: str) -> bool:
s = (purpose or "").strip()
if not s:
return True
if s.endswith(":") or s.endswith(":"):
return True
if s.startswith("全部为"):
return True
if s.startswith("含义") and ("" in s or ":" in s) and len(s) <= 12:
return True
return False
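# Filter sketch:
#   _is_poor_purpose("")            -> True   (empty)
#   _is_poor_purpose("全部为 0")     -> True   (constant-value note, not a meaning)
#   _is_poor_purpose("会员 ID,用于关联。") -> False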
def parse_analysis(analysis_text: str, columns):
columns_set = set(columns)
blocks = defaultdict(list)
current_fields = []
buf = []
for raw in analysis_text.splitlines():
fields = _extract_header_fields(raw, columns_set)
if fields:
if current_fields and buf:
for f in current_fields:
blocks[f].extend(buf)
current_fields = fields
buf = []
else:
if current_fields:
buf.append(raw)
if current_fields and buf:
for f in current_fields:
blocks[f].extend(buf)
purposes = {}
for col in columns:
if col in blocks and blocks[col]:
p = _parse_field_purpose_from_block(blocks[col])
if p:
purposes[col] = p
return purposes
def parse_columns_from_ddl(create_sql: str):
start = create_sql.find("(")
end = create_sql.rfind(")")
body = create_sql[start + 1 : end]
cols = []
for line in body.splitlines():
s = line.strip().rstrip(",")
if not s:
continue
if s.startswith(")"):
continue
if s.upper().startswith("CONSTRAINT "):
continue
m = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)\s+", s)
if not m:
continue
name = m.group(1)
if name.upper() in {"PRIMARY", "UNIQUE", "FOREIGN", "CHECK"}:
continue
cols.append(name)
return cols
def build_comment_block(table: str, columns, analysis_text: str, records):
# records_node is determined by the caller, so the JSON is not re-walked here
records, records_node = records
purposes = parse_analysis(analysis_text, columns)
examples = _choose_examples(records, columns)
table_cn = TABLE_CN.get(table, table)
table_comment = (
f"ODS 原始明细表:{table_cn}"
f"来源C:/dev/LLTQ/export/test-json-doc/{table}.json分析{table}-Analysis.md。"
f"字段以导出原样为主ETL 补充 source_file/source_endpoint/fetched_at并保留 payload 为原始记录快照。"
)
lines = []
lines.append(f"COMMENT ON TABLE billiards_ods.{table} IS '{_escape_sql(table_comment)}';")
for col in columns:
json_file = f"{table}.json"
if col in ETL_META_FIELDS:
json_field = f"{json_file} - ETL元数据 - 无"
elif col == "payload":
json_field = f"{json_file} - {records_node} - $"
else:
actual = None
for r in records[:50]:
if isinstance(r, dict):
actual = find_key_in_record(r, col)
if actual:
break
field_name = actual or col
json_field = f"{json_file} - {records_node} - {field_name}"
purpose = purposes.get(col) or _infer_purpose(table, col)
purpose = _first_sentence(purpose, 140) or _infer_purpose(table, col)
if _is_poor_purpose(purpose):
purpose = COMMON_FIELD_PURPOSE.get(col) or _infer_purpose(table, col)
if col in ETL_META_FIELDS:
if col == "source_file":
ex = f"{table}.json"
elif col == "source_endpoint":
ex = f"C:/dev/LLTQ/export/test-json-doc/{table}.json"
else:
ex = "2025-11-10T00:00:00+08:00"
elif col == "payload":
ex = "{...}"
else:
ex = _format_example(examples.get(col))
func = purpose
if "用于" not in func:
func = "用于" + func.rstrip("")
# ODS source is rendered as "<table> - <column>"; ETL-added metadata columns get an "(ETL补充)" marker
if col in ETL_META_FIELDS:
ods_src = f"{table} - {col}(ETL补充)"
else:
ods_src = f"{table} - {col}"
comment = (
f"【说明】{purpose}"
f" 【示例】{ex}{func})。"
f" 【ODS来源】{ods_src}"
f" 【JSON字段】{json_field}"
)
lines.append(
f"COMMENT ON COLUMN billiards_ods.{table}.{col} IS '{_escape_sql(comment)}';"
)
return "\n".join(lines)
text = SQL_PATH.read_text(encoding="utf-8")
newline = "\r\n" if "\r\n" in text else "\n"
kept = []
for raw_line in text.splitlines(True):
stripped = raw_line.lstrip()
if stripped.startswith("--"):
continue
if re.match(r"^\s*COMMENT ON\s+(TABLE|COLUMN)\s+", raw_line):
continue
kept.append(raw_line)
clean = "".join(kept)
create_re = re.compile(
r"(CREATE TABLE IF NOT EXISTS\s+billiards_ods\.(?P<table>[A-Za-z0-9_]+)\s*\([\s\S]*?\)\s*;)" ,
re.M,
)
out_parts = []
last = 0
count = 0
for m in create_re.finditer(clean):
out_parts.append(clean[last : m.end()])
table = m.group("table")
create_sql = m.group(1)
cols = parse_columns_from_ddl(create_sql)
analysis_text = (DOC_DIR / f"{table}-Analysis.md").read_text(encoding="utf-8")
data = json.loads((DOC_DIR / f"{table}.json").read_text(encoding="utf-8"))
record_list, record_node = _find_best_record_list_and_node(data, cols)
out_parts.append(newline + newline + build_comment_block(table, cols, analysis_text, (record_list, record_node)) + newline + newline)
last = m.end()
count += 1
out_parts.append(clean[last:])
result = "".join(out_parts)
result = re.sub(r"(?:\r?\n){4,}", newline * 3, result)
backup = SQL_PATH.with_suffix(SQL_PATH.suffix + ".rewrite2.bak")
backup.write_text(text, encoding="utf-8")
SQL_PATH.write_text(result, encoding="utf-8")
print(f"Rewrote comments for {count} tables. Backup: {backup}")

1886
tmp/schema_ODS_doc copy.sql Normal file

File diff suppressed because it is too large Load Diff

1907
tmp/schema_ODS_doc.sql Normal file

File diff suppressed because it is too large Load Diff

1801
tmp/schema_dwd.sql Normal file

File diff suppressed because it is too large Load Diff

1878
tmp/schema_dwd_doc.sql Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

1
tmp/temp_chinese.txt Normal file
View File

@@ -0,0 +1 @@
含义

29
tmp/tmp_debug_sql.py Normal file
View File

@@ -0,0 +1,29 @@
import os, psycopg2
from etl_billiards.tasks.dwd_load_task import DwdLoadTask
dwd_table="billiards_dwd.dwd_table_fee_log"
ods_table="billiards_ods.table_fee_transactions"
conn=psycopg2.connect(os.environ["PG_DSN"])
cur=conn.cursor()
task=DwdLoadTask(config={}, db_connection=None, api_client=None, logger=None)
cur.execute("SELECT column_name FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_dwd", "dwd_table_fee_log"))
dwd_cols=[r[0].lower() for r in cur.fetchall()]
cur.execute("SELECT column_name FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_ods", "table_fee_transactions"))
ods_cols=[r[0].lower() for r in cur.fetchall()]
cur.execute("SELECT column_name,data_type FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_dwd", "dwd_table_fee_log"))
dwd_types={r[0].lower(): r[1].lower() for r in cur.fetchall()}
cur.execute("SELECT column_name,data_type FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_ods", "table_fee_transactions"))
ods_types={r[0].lower(): r[1].lower() for r in cur.fetchall()}
mapping=task.FACT_MAPPINGS.get(dwd_table)
if mapping:
insert_cols=[d for d,o,_ in mapping if o in ods_cols]
select_exprs=[task._cast_expr(o,cast_type) for d,o,cast_type in mapping if o in ods_cols]
else:
insert_cols=[c for c in dwd_cols if c in ods_cols and c not in task.SCD_COLS]
select_exprs=task._build_fact_select_exprs(insert_cols,dwd_types,ods_types)
print('insert_cols', insert_cols)
print('select_exprs', select_exprs)
sql=f"INSERT INTO {task._format_table(dwd_table,'billiards_dwd')} ({', '.join(f'\"{c}\"' for c in insert_cols)}) SELECT {', '.join(select_exprs)} FROM {task._format_table(ods_table,'billiards_ods')}"
print(sql)
cur.close(); conn.close()

7
tmp/tmp_drop_dwd.py Normal file
View File

@@ -0,0 +1,7 @@
import os, psycopg2
conn=psycopg2.connect(os.environ["PG_DSN"])
conn.autocommit=True
cur=conn.cursor()
cur.execute('DROP SCHEMA IF EXISTS billiards_dwd CASCADE')
cur.close(); conn.close()
print('dropped billiards_dwd')

19
tmp/tmp_dwd_tasks.py Normal file
View File

@@ -0,0 +1,19 @@
import os
import psycopg2
DSN = os.environ.get('PG_DSN')
store_id = int(os.environ.get('STORE_ID','2790685415443269'))
conn = psycopg2.connect(DSN)
conn.autocommit = True
cur = conn.cursor()
rows = []
for code in ('INIT_DWD_SCHEMA','DWD_LOAD_FROM_ODS','DWD_QUALITY_CHECK'):
cur.execute("SELECT task_id FROM etl_admin.etl_task WHERE task_code=%s AND store_id=%s", (code, store_id))
if cur.fetchone():
cur.execute("UPDATE etl_admin.etl_task SET enabled=TRUE, updated_at=now() WHERE task_code=%s AND store_id=%s", (code, store_id))
rows.append((code, 'updated'))
else:
cur.execute("INSERT INTO etl_admin.etl_task(task_code,store_id,enabled,cursor_field,window_minutes_default,overlap_seconds,page_size,params) VALUES (%s,%s,TRUE,NULL,60,120,1000,'{}') RETURNING task_id", (code, store_id))
rows.append((code, 'inserted', cur.fetchone()[0]))
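# rows ends up like [('INIT_DWD_SCHEMA', 'updated'), ('DWD_LOAD_FROM_ODS', 'inserted', 42)]
# (the task_id 42 is illustrative).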
print(rows)
cur.close(); conn.close()

28
tmp/tmp_problems.py Normal file
View File

@@ -0,0 +1,28 @@
import os, psycopg2
from etl_billiards.tasks.dwd_load_task import DwdLoadTask
conn=psycopg2.connect(os.environ['PG_DSN'])
cur=conn.cursor()
problems=[]
for dwd_table, ods_table in DwdLoadTask.TABLE_MAP.items():
if dwd_table.split('.')[-1].startswith('dwd_'):
if '.' in dwd_table:
dschema, dtable = dwd_table.split('.')
else:
dschema, dtable = 'billiards_dwd', dwd_table
if '.' in ods_table:
oschema, otable = ods_table.split('.')
else:
oschema, otable = 'billiards_ods', ods_table
cur.execute("SELECT column_name,data_type FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", (dschema,dtable))
dcols={r[0].lower():r[1].lower() for r in cur.fetchall()}
cur.execute("SELECT column_name,data_type FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", (oschema,otable))
ocols={r[0].lower():r[1].lower() for r in cur.fetchall()}
common=set(dcols)&set(ocols)
missing_dwd=list(set(ocols)-set(dcols))
missing_ods=list(set(dcols)-set(ocols))
mismatches=[(c,dcols[c],ocols[c]) for c in sorted(common) if dcols[c]!=ocols[c]]
problems.append((dwd_table,missing_dwd,missing_ods,mismatches))
cur.close();conn.close()
for p in problems:
print(p)

26
tmp/tmp_run_sql.py Normal file
View File

@@ -0,0 +1,26 @@
import os, psycopg2
from etl_billiards.tasks.dwd_load_task import DwdLoadTask
dwd_table="billiards_dwd.dwd_table_fee_log"
ods_table="billiards_ods.table_fee_transactions"
conn=psycopg2.connect(os.environ["PG_DSN"])
cur=conn.cursor()
task=DwdLoadTask(config={}, db_connection=None, api_client=None, logger=None)
cur.execute("SELECT column_name FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_dwd", "dwd_table_fee_log"))
dwd_cols=[r[0].lower() for r in cur.fetchall()]
cur.execute("SELECT column_name FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_ods", "table_fee_transactions"))
ods_cols=[r[0].lower() for r in cur.fetchall()]
cur.execute("SELECT column_name,data_type FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_dwd", "dwd_table_fee_log"))
dwd_types={r[0].lower(): r[1].lower() for r in cur.fetchall()}
cur.execute("SELECT column_name,data_type FROM information_schema.columns WHERE table_schema=%s AND table_name=%s", ("billiards_ods", "table_fee_transactions"))
ods_types={r[0].lower(): r[1].lower() for r in cur.fetchall()}
mapping=task.FACT_MAPPINGS.get(dwd_table)
insert_cols=[d for d,o,_ in mapping if o in ods_cols]
select_exprs=[task._cast_expr(o,cast_type) for d,o,cast_type in mapping if o in ods_cols]
sql=f"INSERT INTO {task._format_table(dwd_table,'billiards_dwd')} ({', '.join(f'\"{c}\"' for c in insert_cols)}) SELECT {', '.join(select_exprs)} FROM {task._format_table(ods_table,'billiards_ods')} LIMIT 1"
print(sql)
cur.execute(sql)
conn.commit()
print('ok')
cur.close(); conn.close()