ANA-002 (Recommended): Data Pipeline

Analytics data flows to a data warehouse

Raw analytics data flows to a queryable warehouse (BigQuery, Snowflake, etc.) via an automated pipeline, so the raw events can be queried directly with SQL.

Question to ask

"Can you write SQL against your raw analytics events?"

What to check

  • Check for BigQuery client libraries and credentials
  • Look for Dataform config and .sqlx transformation files
  • Check for ETL tools (Fivetran, Airbyte, Segment warehouse)
  • Verify pipeline is automated (not manual exports)

Verification guide

Severity: Recommended

Raw analytics data should flow to a queryable warehouse for ad-hoc analysis, custom reports, and data science work.

Check automatically:

  1. Look for BigQuery integration:
# BigQuery client libraries
grep -riE "bigquery|@google-cloud/bigquery|BIGQUERY" --include="*.ts" --include="*.js" --include="*.py" --include="*.json" . 2>/dev/null | grep -v node_modules | head -10

# BigQuery service account or credentials
ls -la *bigquery*.json *-credentials.json service-account*.json 2>/dev/null
grep -riE "GOOGLE_APPLICATION_CREDENTIALS|bigquery.*key" .env.example .env* 2>/dev/null
  2. Check for Dataform (transformation layer):
# Dataform config and SQLX files
ls -la dataform.json 2>/dev/null
find . -name "*.sqlx" -type f 2>/dev/null | head -5
grep -riE "dataform" package.json .github/workflows/*.yml 2>/dev/null
  3. Check for ETL/pipeline tools:
# Managed ETL services
grep -riE "fivetran|airbyte|stitch|meltano" . 2>/dev/null | grep -v node_modules | head -5

# Segment warehouse destinations
grep -riE "segment.*warehouse|warehouse.*destination" . 2>/dev/null | grep -v node_modules | head -5
  4. Check for alternative warehouses:
# Snowflake, Redshift, ClickHouse
grep -riE "snowflake|redshift|clickhouse" --include="*.ts" --include="*.js" --include="*.py" --include="*.json" --include="*.yml" . 2>/dev/null | grep -v node_modules | head -10

If not found in code, ask the user:

  • "Where does your analytics data end up for querying?"
  • "Is there an automated pipeline to a data warehouse?"
  • "Can data analysts run SQL queries against raw analytics events?"

Cross-reference with:

  • ANA-001 (server-side tracking feeds the pipeline)
  • ANA-004 (warehouse needs BI layer to be useful)

Pass criteria:

  • Analytics data flows to a queryable warehouse (BigQuery, Snowflake, Redshift, ClickHouse)
  • Pipeline is automated (not manual exports)
  • Data is queryable via SQL

Fail criteria:

  • No warehouse integration
  • Data only accessible via SaaS dashboard (can't write SQL against it)
  • Manual CSV exports as the "pipeline"

Partial (acceptable):

  • Pipeline runs on a schedule (daily, hourly) rather than in real time; note the latency
  • Warehouse exists but contains only a subset of events; document what's missing

Recommendations:

  • Warehouse: BigQuery (recommended for GCP shops), Snowflake, Redshift
  • Transformation: Dataform (recommended), dbt
  • ETL: Fivetran, Airbyte (self-hosted option)

Evidence to capture:

  • Warehouse name and region
  • Pipeline tool/method
  • Data freshness (real-time, hourly, daily)
  • Evidence of automated sync (CI config, cron, managed service)

Section

18. Analytics

Performance & Analytics