ANA-002 (Recommended): Data Pipeline
Analytics data flows to a data warehouse
Raw analytics data flows to a queryable warehouse (BigQuery, Snowflake, etc.) via an automated pipeline, so it can be queried directly with SQL.
Question to ask
"Can you write SQL against your raw analytics events?"
What to check
- ☐ Check for BigQuery client libraries and credentials
- ☐ Look for Dataform config and .sqlx transformation files
- ☐ Check for ETL tools (Fivetran, Airbyte, Segment warehouse)
- ☐ Verify pipeline is automated (not manual exports)
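The automation check can be approximated with a grep over CI configs: a scheduled workflow is a strong hint that syncs run without manual exports. A minimal heuristic, assuming GitHub Actions; adapt the path for other CI systems.

```shell
# Heuristic: a scheduled CI workflow (cron) suggests the pipeline is automated,
# not a manual export. Assumes GitHub Actions; adapt for GitLab CI, Cloud Scheduler, etc.
grep -rE "schedule:|cron:" .github/workflows/ 2>/dev/null | head -5
```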
Verification guide
Severity: Recommended
Raw analytics data should flow to a queryable warehouse for ad-hoc analysis, custom reports, and data science work.
Check automatically:
- Look for BigQuery integration:
# BigQuery client libraries
grep -riE "bigquery|@google-cloud/bigquery|BIGQUERY" --include="*.ts" --include="*.js" --include="*.py" --include="*.json" . 2>/dev/null | grep -v node_modules | head -10
# BigQuery service account or credentials
ls -la *bigquery*.json *-credentials.json service-account*.json 2>/dev/null
grep -iE "GOOGLE_APPLICATION_CREDENTIALS|bigquery.*key" .env* 2>/dev/null
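If a service-account key turns up, the GCP project it targets is worth recording as evidence. A sketch assuming the standard service-account JSON layout (which carries a `project_id` field); the filename globs are guesses, so match whatever the `ls` above found.

```shell
# Service-account keys are JSON with a "project_id" field; record it as evidence.
# Filename patterns are guesses -- match the files found by the earlier ls.
grep -ho '"project_id": *"[^"]*"' *bigquery*.json service-account*.json 2>/dev/null | head -1
```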
- Check for Dataform (transformation layer):
# Dataform config and SQLX files
ls -la dataform.json 2>/dev/null
find . -name "*.sqlx" -type f 2>/dev/null | head -5
grep -riE "dataform" package.json .github/workflows/*.yml 2>/dev/null
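When dataform.json exists, its target settings identify the warehouse and dataset directly. A sketch assuming the classic Dataform config keys (`warehouse`, `defaultDatabase`, `defaultSchema`); capture whichever are present as evidence.

```shell
# dataform.json names the warehouse and target dataset; capture both as evidence.
grep -E '"warehouse"|"defaultDatabase"|"defaultSchema"' dataform.json 2>/dev/null
```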
- Check for ETL/pipeline tools:
# Managed ETL services
grep -riE "fivetran|airbyte|stitch|meltano" . 2>/dev/null | grep -v node_modules | head -5
# Segment warehouse destinations
grep -riE "segment.*warehouse|warehouse.*destination" . 2>/dev/null | grep -v node_modules | head -5
- Check for alternative warehouses:
# Snowflake, Redshift, ClickHouse
grep -riE "snowflake|redshift|clickhouse" --include="*.ts" --include="*.js" --include="*.py" --include="*.json" --include="*.yml" . 2>/dev/null | grep -v node_modules | head -10
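The vendor checks above can be rolled into one pass that reports every warehouse a codebase mentions. A rough heuristic sketch; tune the patterns and file types per codebase, since a vendor name in a comment is not proof of an integration.

```shell
# Probe for each warehouse vendor in turn and report the hits.
# A rough heuristic; a mention is a lead, not proof of an integration.
for wh in bigquery snowflake redshift clickhouse; do
  if grep -riqE "$wh" --include="*.ts" --include="*.js" --include="*.py" \
       --include="*.json" --include="*.yml" --exclude-dir=node_modules . 2>/dev/null; then
    echo "warehouse candidate: $wh"
  fi
done
```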
If not found in code, ask the user:
- "Where does your analytics data end up for querying?"
- "Is there an automated pipeline to a data warehouse?"
- "Can data analysts run SQL queries against raw analytics events?"
Cross-reference with:
- ANA-001 (server-side tracking feeds the pipeline)
- ANA-004 (warehouse needs BI layer to be useful)
Pass criteria:
- Analytics data flows to a queryable warehouse (BigQuery, Snowflake, Redshift, ClickHouse)
- Pipeline is automated (not manual exports)
- Data is queryable via SQL
Fail criteria:
- No warehouse integration
- Data only accessible via SaaS dashboard (can't write SQL against it)
- Manual CSV exports as the "pipeline"
Partial (acceptable):
- Pipeline runs on a schedule (daily, hourly) rather than in real time - note the latency
- Warehouse exists but only contains a subset of events - document what's missing
Recommendations:
- Warehouse: BigQuery (recommended for GCP shops), Snowflake, Redshift
- Transformation: Dataform (recommended), dbt
- ETL: Fivetran, Airbyte (self-hosted option)
Evidence to capture:
- Warehouse name and region
- Pipeline tool/method
- Data freshness (real-time, hourly, daily)
- Evidence of automated sync (CI config, cron, managed service)
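The evidence fields above can be captured in a small record appended to the audit notes. The file name and placeholder values are illustrative, not a fixed convention.

```shell
# Append a structured evidence record to the audit notes.
# File name and field values are placeholders to fill in during the audit.
cat >> audit-evidence.md <<'EOF'
ANA-002 evidence
- Warehouse: <name / region>
- Pipeline: <tool or method>
- Freshness: <real-time | hourly | daily>
- Automation proof: <CI config / cron / managed service>
EOF
```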