# Integration Pattern: Batch API-to-Warehouse ELT

## Pattern Name

**Batch API-to-Warehouse ELT**

_Also known as: SaaS-to-Warehouse Ingestion Pipeline, Scheduled Pull Integration_

---

## Intent

Periodically extract operational data from a SaaS platform via its REST API, land it in a cloud data warehouse in raw form, and progressively transform it into analytics-ready structures — decoupling source system load from analytical consumption.

---

## Motivation

Operational systems like ServiceNow are optimized for transactional workloads, not analytical queries. Reporting directly against them is slow, risky to platform stability, and limited by API rate constraints. This pattern moves data into an environment purpose-built for analytics, while preserving raw source fidelity for auditability and reprocessing.

---

## Applicability

Use this pattern when:

- The source system exposes a REST API with filtering and pagination support
- Data volumes and change frequency are compatible with batch (vs. streaming) cadence
- Analytics consumers need historical trends, cross-system joins, or aggregations
- The organization requires a durable raw record of what was extracted and when

---

## Participants

| Participant | Technology (Example Implementation) |
|---|---|
| Source System | ServiceNow (Table API) |
| Orchestrator | Apache Airflow |
| Extraction Layer | Python (requests / ServiceNow SDK) |
| Staging / Landing Zone | Snowflake internal stage or S3-backed external stage |
| Transformation Layer | Snowflake SQL tasks, dbt, or Airflow Python operators |
| Serving Layer | Snowflake views / materialized tables |
| Consumption Layer | Power BI (Snowflake native connector) |

---

## Structure

```
ServiceNow Table API
        │
        │ REST/JSON (paginated, watermark-filtered)
        ▼
Airflow DAG — Extract
        │
        │ Raw JSON / flattened Parquet
        ▼
Snowflake — Raw / Bronze Schema
(landing zone, source-faithful, append-only)
        │
        │ SQL transformation, reference resolution,
        │ type casting, business rule application
        ▼
Snowflake — Transformed / Silver Schema
(cleaned, typed, relational)
        │
        │ Aggregation, dimensional modeling
        ▼
Snowflake — Mart / Gold Schema
(analytics-ready, access-controlled)
        │
        ▼
Power BI Dashboards
```

---

## Flow

1. **Schedule trigger** — Airflow DAG fires on a recurring schedule (e.g., hourly, daily)
2. **Watermark check** — Orchestrator retrieves the last successful run timestamp from the metadata store
3. **Incremental extraction** — Python operator calls the ServiceNow Table API with `sysparm_query=sys_updated_on>{watermark}`, paginating via `sysparm_limit` / `sysparm_offset`
4. **Raw landing** — Records written to the Snowflake raw schema with extraction metadata (`_extracted_at`, `_source_table`, `_batch_id`); source payload preserved with minimal transformation
5. **Full reload cadence** — Reference and configuration tables (e.g., `sys_user`, `cmdb_ci`) refreshed on a slower full-reload schedule to ensure lookup accuracy
6. **Transformation pass** — Second DAG layer or dbt resolves `sys_id` reference fields to display values, decodes choice lists, casts types, and applies business rules
7. **Mart promotion** — Aggregated and dimensionally modeled tables promoted to the mart schema for dashboard consumption
8. **Access control** — Snowflake RBAC roles applied at the mart layer; Power BI connects via the Snowflake native connector with an appropriate service account

---

## Tradeoffs

### Advantages

- **Source system isolation** — Analytical load never touches the operational system at query time
- **Raw fidelity** — Landing zone preserves source data for auditability, debugging, and reprocessing
- **Scalability** — Snowflake scales compute independently of extraction cadence
- **Flexibility** — Transformation logic can evolve without re-extracting source data
- **Operational simplicity** — Airflow + Python is a well-understood, widely supported stack

### Limitations

- **Latency** — Batch cadence introduces data lag; not suitable for near-real-time use cases
- **API rate limits** — Large full reloads can exhaust ServiceNow API quotas; requires throttling and retry logic
- **Watermark fragility** — Incremental loads depend on `sys_updated_on` accuracy; soft-deleted or backdated records may be missed
- **Reference resolution cost** — Resolving `sys_id`s at transformation time adds complexity; display values can change, creating historical inconsistency
- **Schema drift** — ServiceNow table schema changes can silently break downstream transformations

---

## Related Patterns

- **Change Data Capture (CDC)** — Event-driven alternative for lower-latency requirements
- **Streaming Ingestion Pipeline** — Replaces batch extraction with real-time event streaming (e.g., Kafka)
- **Reverse ETL** — Inverse flow; pushes warehouse data back into operational systems
- **Medallion Architecture** — Structural organizing principle (Bronze / Silver / Gold) commonly applied within this pattern

---

## Implementation Notes

- Tag all raw records with `_extracted_at`, `_source_table`, and `_batch_id` for lineage traceability
- Store watermark state in a dedicated Airflow metadata table or Snowflake control table — not in DAG variables
- Implement idempotent loads (upsert on `sys_id` + `sys_updated_on`) to safely support re-runs
- Separate DAGs for high-churn transactional tables vs. slow-changing reference tables
- Apply column-level masking in Snowflake for PII fields before mart promotion
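As a minimal sketch, the watermark-filtered, paginated extraction described in Flow steps 2–4 might look like the following. The `fetch` callable is a hypothetical stand-in for the authenticated GET against `/api/now/table/{table}` (it would return the `result` array of the JSON response); function and parameter names here are illustrative assumptions, not the ServiceNow SDK's API.

```python
from datetime import datetime, timezone
from typing import Callable, Iterator


def extract_incremental(
    fetch: Callable[[dict], list[dict]],
    table: str,
    watermark: str,
    page_size: int = 1000,
) -> Iterator[dict]:
    """Yield records updated after `watermark`, paginating via limit/offset.

    `fetch` is an injected stand-in for the HTTP call to the Table API,
    so the pagination and tagging logic can be exercised without a network.
    """
    offset = 0
    # One batch id per run, reused on every record for lineage grouping
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    while True:
        params = {
            # Incremental filter, ordered so pages are stable across offsets
            "sysparm_query": f"sys_updated_on>{watermark}^ORDERBYsys_updated_on",
            "sysparm_limit": page_size,
            "sysparm_offset": offset,
        }
        page = fetch(params)
        if not page:
            break
        for rec in page:
            # Attach extraction metadata before landing in the raw schema
            yield {
                **rec,
                "_extracted_at": datetime.now(timezone.utc).isoformat(),
                "_source_table": table,
                "_batch_id": batch_id,
            }
        if len(page) < page_size:  # short page: no more data
            break
        offset += page_size
```

In a real DAG the yielded records would be written to the raw schema and the new watermark committed to the control table only after the load succeeds, so a failed run is safely retried from the old watermark.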
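The idempotent-load note above (upsert on `sys_id` + `sys_updated_on`) can be illustrated with a small in-memory sketch of the merge semantics; in production this logic would map to a Snowflake `MERGE` on the raw table, and the function name here is a hypothetical illustration.

```python
def upsert_batch(target: dict[str, dict], batch: list[dict]) -> dict[str, dict]:
    """Merge a batch into `target`, keyed on sys_id.

    A record replaces the stored row only when its sys_updated_on is newer,
    so replaying the same batch (e.g., on a DAG re-run) is a no-op and an
    out-of-order stale record never clobbers fresher data.
    """
    for rec in batch:
        current = target.get(rec["sys_id"])
        if current is None or rec["sys_updated_on"] > current["sys_updated_on"]:
            target[rec["sys_id"]] = rec
    return target
```

Keying the comparison on the source's own `sys_updated_on` rather than extraction time is what makes re-runs safe: the outcome depends only on record content, not on when the load happened.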