ANA-002 (Recommended): Data Pipeline
Analytics data flows to a data warehouse
Raw analytics data flows to a queryable warehouse (BigQuery, Snowflake, etc.) via an automated pipeline, so it can be queried directly with SQL.
Question to ask
"Can you write SQL against your raw analytics events?"
What to check
- ☐ Check for BigQuery client libraries and credentials
- ☐ Look for Dataform config and .sqlx transformation files
- ☐ Check for ETL tools (Fivetran, Airbyte, Segment warehouse)
- ☐ Verify pipeline is automated (not manual exports)
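The automation check can be approximated with a grep over CI configs: a scheduled workflow is a strong hint that syncs run without manual exports. A minimal heuristic, assuming GitHub Actions; adapt the path for other CI systems.

```shell
# Heuristic: a scheduled CI workflow (cron) suggests the pipeline is automated,
# not a manual export. Assumes GitHub Actions; adapt for GitLab CI, Cloud Scheduler, etc.
grep -rE "schedule:|cron:" .github/workflows/ 2>/dev/null | head -5
```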
Verification guide
Severity: Recommended
Raw analytics data should flow to a queryable warehouse for ad-hoc analysis, custom reports, and data science work.
Check automatically:
- Look for BigQuery integration:
# BigQuery client libraries
grep -riE "bigquery|@google-cloud/bigquery|BIGQUERY" --include="*.ts" --include="*.js" --include="*.py" --include="*.json" . 2>/dev/null | grep -v node_modules | head -10
# BigQuery service account or credentials
ls -la *bigquery*.json *-credentials.json service-account*.json 2>/dev/null
grep -iE "GOOGLE_APPLICATION_CREDENTIALS|bigquery.*key" .env* 2>/dev/null
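If a service-account key turns up, the GCP project it targets is worth recording as evidence. A sketch assuming the standard service-account JSON layout (which carries a `project_id` field); the filename globs are guesses, so match whatever the `ls` above found.

```shell
# Service-account keys are JSON with a "project_id" field; record it as evidence.
# Filename patterns are guesses -- match the files found by the earlier ls.
grep -ho '"project_id": *"[^"]*"' *bigquery*.json service-account*.json 2>/dev/null | head -1
```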
- Check for Dataform (transformation layer):
# Dataform config and SQLX files
ls -la dataform.json 2>/dev/null
find . -name "*.sqlx" -type f 2>/dev/null | head -5
grep -riE "dataform" package.json .github/workflows/*.yml 2>/dev/null
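When dataform.json exists, its target settings identify the warehouse and dataset directly. A sketch assuming the classic Dataform config keys (`warehouse`, `defaultDatabase`, `defaultSchema`); capture whichever are present as evidence.

```shell
# dataform.json names the warehouse and target dataset; capture both as evidence.
grep -E '"warehouse"|"defaultDatabase"|"defaultSchema"' dataform.json 2>/dev/null
```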
- Check for ETL/pipeline tools:
# Managed ETL services
grep -riE "fivetran|airbyte|stitch|meltano" . 2>/dev/null | grep -v node_modules | head -5
# Segment warehouse destinations
grep -riE "segment.*warehouse|warehouse.*destination" . 2>/dev/null | grep -v node_modules | head -5
- Check for alternative warehouses:
# Snowflake, Redshift, ClickHouse
grep -riE "snowflake|redshift|clickhouse" --include="*.ts" --include="*.js" --include="*.py" --include="*.json" --include="*.yml" . 2>/dev/null | grep -v node_modules | head -10
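The vendor checks above can be rolled into one pass that reports every warehouse a codebase mentions. A rough heuristic sketch; tune the patterns and file types per codebase, since a vendor name in a comment is not proof of an integration.

```shell
# Probe for each warehouse vendor in turn and report the hits.
# A rough heuristic; a mention is a lead, not proof of an integration.
for wh in bigquery snowflake redshift clickhouse; do
  if grep -riqE "$wh" --include="*.ts" --include="*.js" --include="*.py" \
       --include="*.json" --include="*.yml" --exclude-dir=node_modules . 2>/dev/null; then
    echo "warehouse candidate: $wh"
  fi
done
```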
If not found in code, ask the user:
- "Where does your analytics data end up for querying?"
- "Is there an automated pipeline to a data warehouse?"
- "Can data analysts run SQL queries against raw analytics events?"
Cross-reference with:
- ANA-001 (server-side tracking feeds the pipeline)
- ANA-004 (warehouse needs BI layer to be useful)
Pass criteria:
- Analytics data flows to a queryable warehouse (BigQuery, Snowflake, Redshift, ClickHouse)
- Pipeline is automated (not manual exports)
- Data is queryable via SQL
Fail criteria:
- No warehouse integration
- Data only accessible via SaaS dashboard (can't write SQL against it)
- Manual CSV exports as the "pipeline"
Partial (acceptable):
- Pipeline runs on a schedule (daily, hourly) rather than in real time - note the latency
- Warehouse exists but only contains a subset of events - document what's missing
Recommendations:
- Warehouse: BigQuery (recommended for GCP shops), Snowflake, Redshift
- Transformation: Dataform (recommended), dbt
- ETL: Fivetran, Airbyte (self-hosted option)
Evidence to capture:
- Warehouse name and region
- Pipeline tool/method
- Data freshness (real-time, hourly, daily)
- Evidence of automated sync (CI config, cron, managed service)
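The evidence fields above can be captured in a small record appended to the audit notes. The file name and placeholder values are illustrative, not a fixed convention.

```shell
# Append a structured evidence record to the audit notes.
# File name and field values are placeholders to fill in during the audit.
cat >> audit-evidence.md <<'EOF'
ANA-002 evidence
- Warehouse: <name / region>
- Pipeline: <tool or method>
- Freshness: <real-time | hourly | daily>
- Automation proof: <CI config / cron / managed service>
EOF
```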