Architecture
The technical picture of how tripl is built. If you want the what and the why in plain language first, read concepts.md; this page is the how for people working on the system.
For local setup, commands, and the source tree, see CONTRIBUTING.md.
System shape
tripl is three cooperating processes plus a database and a message broker:
                 ┌─────────────┐
browser ───────▶ │  frontend   │  React + Vite (static SPA)
                 └──────┬──────┘
                        │ HTTP /api/v1
                 ┌──────▼──────┐        ┌──────────────┐
                 │     api     │◀──────▶│  PostgreSQL  │  system of record
                 │  (FastAPI)  │        └──────▲───────┘
                 └──────┬──────┘               │
                        │ enqueue              │ read/write
                 ┌──────▼──────┐        ┌──────┴───────┐
                 │  RabbitMQ   │        │   workers    │
                 │  (broker)   │◀──────▶│   (Celery)   │
                 └─────────────┘        └──────┬───────┘
                                               │ read-only queries
                                        ┌──────▼───────┐
                                        │  warehouses  │  ClickHouse /
                                        │  (external)  │  BigQuery / Postgres
                                        └──────────────┘
- api — FastAPI service. Owns all HTTP, auth, and business logic; reads and writes PostgreSQL; enqueues background work onto RabbitMQ.
- celery-worker — runs scans, collects metrics, detects anomalies and drift, and dispatches alerts. It is the only process that connects to the external warehouses.
- celery-beat — the scheduler. Triggers due metric-collection checks — for both event counts and the metric catalog (a ~300 s due-check) — and the schema-drift retention cleanup.
- PostgreSQL — the system of record for the plan, metrics, anomalies, audit log, and alert deliveries.
- RabbitMQ — the broker between the api and the workers.
- The warehouses are external and are never started by tripl; the worker only ever issues read queries against them.
Locally, all of the above (except the warehouses) run under Docker Compose:
postgres, rabbitmq, api, celery-worker, celery-beat, and frontend.
Backend (backend/)
- FastAPI with fully async request paths.
- SQLAlchemy (async) over PostgreSQL, migrations via Alembic.
- Pydantic v2 schemas as the request/response contract.
- Routers live under src/tripl/api/v1 and stay thin; business rules live in src/tripl/services.
- Shared compute that both the request path and the worker need (warehouse adapters, analyzers for anomaly/drift/scan logic, and interval helpers) lives in a neutral src/tripl/core kernel that imports neither services nor worker. This keeps services from importing worker at module level; the request path reaches the worker only via lazy, runtime Celery dispatch.
- DB engine and pool configuration is centralized in src/tripl/db_config.py (an async pooled engine for the API, a sync pooled engine for Celery; the worker→async bridge uses a throwaway NullPool engine; see worker/search_reindex.py).
- Migrations are applied by the deployment entrypoint (the Compose api command runs alembic upgrade head) before the API starts serving requests, so the schema is current. The app process itself does not run migrations on startup; its lifespan only configures logging and asserts production readiness.
- Health check: GET /health.
Authentication & access
- Session auth via an HTTP-only cookie for interactive users. Emails are validated with email-validator (RFC 5321 / 6531).
- API keys (Bearer tokens, Authorization: Bearer tk_…) for scripts and agents. Only the SHA-256 hash of a key is stored. Keys carry a read or write scope, an optional project binding, and an optional expiry, and are revocable. See agent-api-guide.md.
- RBAC with three roles: owner / editor / viewer. Owner-only routes (security and instance administration) require an interactive owner session and are never reachable with an API key.
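A minimal sketch of the hash-only key storage and the checks listed above, using only the standard library. The StoredKey shape, the function names, and the rule that a write key also satisfies read requests are assumptions made for illustration; they are not taken from the tripl source.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass
class StoredKey:
    """Hypothetical persisted record: holds the hash, never the raw key."""
    key_hash: str                        # hex SHA-256 of the full key
    scope: str                           # "read" or "write"
    project_slug: Optional[str] = None   # optional project binding
    expires_at: Optional[datetime] = None
    revoked: bool = False


def hash_key(raw_key: str) -> str:
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def issue_key(scope: str, project_slug: Optional[str] = None,
              expires_at: Optional[datetime] = None) -> Tuple[str, StoredKey]:
    """Return the raw key (shown once) and the record to persist."""
    raw_key = "tk_" + secrets.token_urlsafe(32)
    return raw_key, StoredKey(hash_key(raw_key), scope, project_slug, expires_at)


def authorize(raw_key: str, stored: StoredKey, *, needs_write: bool,
              project_slug: str, now: Optional[datetime] = None) -> bool:
    now = now or datetime.now(timezone.utc)
    # Constant-time comparison of the presented key's hash with the stored hash.
    if not hmac.compare_digest(hash_key(raw_key), stored.key_hash):
        return False
    if stored.revoked:
        return False
    if stored.expires_at is not None and stored.expires_at <= now:
        return False
    if stored.project_slug is not None and stored.project_slug != project_slug:
        return False
    # Assumption: "write" implies "read"; a read key cannot write.
    if needs_write and stored.scope != "write":
        return False
    return True
```

Because only the digest is persisted, a leaked database row cannot be replayed as a Bearer token.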
Worker (backend/src/tripl/worker/)
- Celery app with a RabbitMQ broker.
- Warehouse adapters (core/adapters) provide a common interface over ClickHouse, BigQuery, and PostgreSQL source databases.
- Analyzers (core/analyzers) hold the scan, anomaly, and drift logic. (Both live in the shared core kernel, described in the Backend section, so the request path can reuse them without importing the worker package.)
- Tasks (worker/tasks) are the Celery entrypoints for scans, metrics, anomalies, and alert delivery.
Detection
- Anomaly detection runs at three scopes — project-total, event-type, and event. It combines z-score thresholds with seasonality decomposition (STL / MSTL) so it understands daily and weekly rhythms rather than just a flat baseline.
- Forecast — a next-bucket extrapolation, rendered as a dashed line on the metric chart.
- Schema drift — detects fields appearing, disappearing, or carrying new values; keeps sample values; and prunes old drift records on a retention schedule.
- Distribution drift — uses PSI (Population Stability Index) over event field values.
- Correlation-aware grouping collapses signals that share an underlying cause so one root problem yields one alert, not many.
- Metric anomalies run the same detector at a dedicated metric scope. Metrics are classified count-shaped (counts/sums) or fractional (ratios, averages, raw SQL): count-shaped series keep zero-fill and the min_expected_count gate, while fractional series drop both (a missing bucket means "no data", not zero) so sub-unit ratios don't false-fire. Per-project detect_metrics enables the scope; per-rule include_metrics opts metric anomalies into alerting (off by default).
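For reference, the Population Stability Index used for distribution drift can be written as a short function over two windows of categorical field values. This is the textbook formula, sum over values of (actual - expected) * ln(actual / expected). The eps smoothing for values missing from one window, and the function name, are assumptions of this sketch rather than details of tripl's analyzer.

```python
import math
from collections import Counter
from typing import Iterable


def psi(baseline: Iterable[str], current: Iterable[str], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of a categorical field."""
    base_counts = Counter(baseline)
    cur_counts = Counter(current)
    base_total = sum(base_counts.values())
    cur_total = sum(cur_counts.values())
    if base_total == 0 or cur_total == 0:
        raise ValueError("both windows need at least one observation")

    score = 0.0
    for value in set(base_counts) | set(cur_counts):
        # eps keeps the logarithm finite when a value is absent from one window.
        expected = max(base_counts[value] / base_total, eps)
        actual = max(cur_counts[value] / cur_total, eps)
        score += (actual - expected) * math.log(actual / expected)
    return score
```

Identical distributions score 0 and every term is non-negative, so the score only grows as the two windows diverge. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift; the threshold tripl applies is not stated here.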
Metrics
- Counts are collected into PostgreSQL on a configurable interval (15m / 1h / 6h / 1d / 1w), with replay-by-chunk support for backfills.
- Bulk metric upserts are chunked to stay under PostgreSQL's 65535 bind-parameter limit.
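The chunking arithmetic is simple enough to sketch. Assuming a multi-row INSERT that binds one parameter per column per row, the largest safe chunk is the limit divided by the number of columns. The function name and the dict-per-row shape are illustrative, not the names used in the codebase.

```python
from typing import Iterator, Sequence

PG_MAX_BIND_PARAMS = 65535  # the limit quoted above


def chunk_rows(rows: Sequence[dict],
               max_params: int = PG_MAX_BIND_PARAMS) -> Iterator[Sequence[dict]]:
    """Yield slices of rows whose total bind-parameter count fits the limit."""
    if not rows:
        return
    params_per_row = len(rows[0])  # assumes every row has the same columns
    if params_per_row == 0 or params_per_row > max_params:
        raise ValueError("rows must have between 1 and max_params columns")
    rows_per_chunk = max_params // params_per_row
    for start in range(0, len(rows), rows_per_chunk):
        yield rows[start:start + rows_per_chunk]
```

With six columns per row, for example, each statement carries at most 10922 rows (65532 parameters).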
Catalog metrics
MetricDefinition is a user-defined, project-scoped metric (the catalog), global rather than branched. Three kinds: sql (a user SELECT returning a per-bucket value against a data source on its own interval), fact_aggregation (count/sum/avg/min/max/count_distinct over a measure column of a table or base query, with an optional filter and breakdowns), and event_composition (a single event count, a ratio A/B, or an event per_distinct_user, derived from already-collected event_metrics).
- Scheduling. The check_metric_definitions_due beat task runs about every 300 s and dispatches collect_metric_definitions for each active metric whose interval is due. sql / fact_aggregation metrics query their own data source through the adapter; event_composition metrics read existing event series on the shared scan grid (no warehouse query).
- Aggregations. Adapter _aggregate_value_sql builds the per-kind SQL for ClickHouse / BigQuery / PostgreSQL; core/adapters/measure_validator checks the measure/distinct column against the source's real columns before it reaches a query.
- Storage. Values land in metric_values, with per-split rows in metric_value_breakdowns (platform / app-version / …, like event breakdowns). A divide-by-zero in a ratio bucket produces no value (a gap, not a 0), so the row is dropped rather than written as zero.
- Surface. Catalog CRUD lives at /projects/{slug}/metrics; a series read service feeds the frontend MetricsPage (list + kind-aware create/edit form) and the metric drilldown, which reuses the monitoring detail tabs.
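The gap-on-divide-by-zero rule for ratio metrics can be illustrated with two already-collected event series keyed by bucket start. The function name is hypothetical. Treating a missing numerator bucket as zero (event counts are zero-filled) and a missing denominator bucket as a gap are assumptions of this sketch.

```python
from datetime import datetime
from typing import Dict

Series = Dict[datetime, float]


def compose_ratio(numerator: Series, denominator: Series) -> Series:
    """Per-bucket A/B; buckets with a zero or missing denominator are omitted."""
    values: Series = {}
    for bucket, denom in denominator.items():
        if denom == 0:
            continue  # gap: no row is produced, rather than a 0
        values[bucket] = numerator.get(bucket, 0.0) / denom
    return values
```

Leaving the bucket out, instead of writing 0, keeps a quiet denominator from looking like a collapse in the ratio.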
Frontend (frontend/)
- React 19 + TypeScript + Vite.
- Tailwind CSS 4 with shadcn-style UI primitives.
- TanStack Query for server state, Recharts for charts, dnd-kit for drag-and-drop reordering.
- The information architecture is four job-based groups (Plan / Observe / Govern / Connect) defined once in src/lib/navigation.ts and consumed by both the sidebar and the breadcrumbs so they never drift apart.
- Serving. In development the Vite dev server serves the SPA with HMR and proxies /api to the backend. In production there are two options: (a) consolidated single container: FastAPI serves the built SPA itself via app.frontend() (FastAPI 0.138+) when SERVE_FRONTEND=true, so one image serves API + SPA (root Dockerfile + the default compose.yaml, no nginx; see RELEASE.md); or (b) standalone static tier: frontend/Dockerfile serves the build through nginx (frontend/nginx.conf) next to the API. Consolidated mode routes the SPA through the API's SecurityHeadersMiddleware / BrotliMiddleware, so it inherits the same CSP/headers and compression; because the app is then the network edge, rate_limit_trust_forwarded_for stays False (don't trust client-sent forwarded headers) unless a trusted proxy is added in front.
- Plan branch context travels as a ?branch= query parameter threaded through every plan API call and the React Query keys; the active branch is persisted in localStorage per project slug.
Data model (core objects)
| Object | What it is |
|---|---|
| Project | A tracking-plan namespace: one product/world. |
| EventType | A folder grouping related events. |
| Event | A concrete tracked event. |
| FieldDefinition | A typed field on an event type. |
| MetaFieldDefinition | Project-level metadata carried by every event. |
| Variable | A reusable value list. |
| Relation | A declared connection between events. |
| DataSource | A connection to an external warehouse. |
| ScanConfig | A saved scan query + extraction rules. |
| ScanJob | One async execution of a scan config. |
| EventMetric | Time-bucketed counts for an event. |
| MetricDefinition | A user-defined metric (the metrics catalog); project-scoped, not branched. |
| MetricValue | Time-bucketed values for a MetricDefinition. |
| MetricValueBreakdown | Per-breakdown metric values (platform / app-version / …). |
| MetricAnomaly | A persisted anomaly bucket. |
| AlertDestination | A delivery channel (Slack, Telegram, …). |
| AlertRule | Filtering + delivery configuration for signals. |
| AlertDelivery | A record of one alert that was sent. |
Plan branches deep-copy the relevant objects (event types, fields, events, variables, meta fields, relations, photos, comments) and merge back via a 3-way merge that preserves live IDs by natural key. Metrics are deliberately not branched — they are project-scoped and shared across every branch.
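To make the merge rule concrete, here is a small three-way merge over rows keyed by natural key. It is a sketch under assumptions: each side is a dict of natural key to row, every row carries an id, and a conflicting edit keeps the main side while being reported. The actual merge, its conflict policy, and its natural keys are not described in this document.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]        # holds "id" plus content fields
Snapshot = Dict[str, Row]      # natural key -> row


def _content(row: Optional[Row]) -> Optional[Row]:
    """Row without its id, so copies with different ids compare equal."""
    if row is None:
        return None
    return {k: v for k, v in row.items() if k != "id"}


def three_way_merge(base: Snapshot, main: Snapshot,
                    branch: Snapshot) -> Tuple[Snapshot, List[str]]:
    merged: Snapshot = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(main) | set(branch)):
        b = _content(base.get(key))
        m = _content(main.get(key))
        br = _content(branch.get(key))
        if br == b:
            chosen = m            # branch did not touch it: keep main
        elif m == b:
            chosen = br           # only the branch changed it
        elif m == br:
            chosen = m            # both sides made the same change
        else:
            conflicts.append(key)
            chosen = m            # assumption: main wins, conflict is reported
        if chosen is None:
            continue              # deleted on the winning side
        # Preserve the live (main) id when the key already exists there.
        live_id = main[key]["id"] if key in main else branch[key]["id"]
        merged[key] = {**chosen, "id": live_id}
    return merged, conflicts
```

Matching on the natural key rather than the id is what lets a deep-copied branch row, which has a fresh id, update the original live row instead of duplicating it.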
Operational flows
Scan flow
- The api creates or updates a ScanConfig.
- Running it creates a ScanJob.
- A Celery task executes the query against the warehouse via the adapter.
- Cardinality analysis decides whether observed values become event fields or variables.
- Events and variables are created or updated in PostgreSQL.
- ScanJob.result_summary is filled in for the UI.
Metrics flow
- Beat schedules due-checks.
- Due scans dispatch metrics collection.
- Counts are aggregated into event_metrics.
- Anomalies are recalculated into metric_anomalies.
- Matching alert rules enqueue deliveries.
Catalog metric flow
- Beat (check_metric_definitions_due, ~300 s) finds active, due metrics.
- collect_metric_definitions evaluates each, querying the warehouse (sql / fact_aggregation) or composing event series (event_composition).
- Values upsert into metric_values / metric_value_breakdowns.
- Metric-scope anomalies are recalculated into metric_anomalies.
- Alert rules with include_metrics enqueue deliveries.
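The first step of this flow, deciding which metrics are due, might look like the following. The MetricState fields and the interval table are hypothetical. The interval labels reuse the set listed for count collection, which may not be the set catalog metrics accept.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed interval labels, borrowed from the count-collection intervals.
INTERVALS = {
    "15m": timedelta(minutes=15),
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "1w": timedelta(weeks=1),
}


@dataclass
class MetricState:
    name: str
    interval: str
    is_active: bool
    last_collected_at: Optional[datetime]


def due_metrics(metrics: List[MetricState], now: datetime) -> List[str]:
    """Names of active metrics whose interval has elapsed since the last run."""
    due = []
    for metric in metrics:
        if not metric.is_active:
            continue
        never_run = metric.last_collected_at is None
        if never_run or now - metric.last_collected_at >= INTERVALS[metric.interval]:
            due.append(metric.name)
    return due
```

Since the check itself runs only about every 300 s, a metric is picked up on the first tick after its interval elapses, not at the exact boundary.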
Alert flow
- Anomaly items are matched against rule configuration.
- AlertDelivery and AlertDeliveryItem rows are written.
- A Celery task sends the formatted notification.
- Delivery status becomes pending, sent, or failed.
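The status lifecycle in the last two steps reduces to a small state change around the send call. In this sketch the Delivery dataclass, the error field, and the message format are illustrative stand-ins for the AlertDelivery row. The send callable stands in for whichever destination client is configured.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Delivery:
    items: List[str]          # one line per matched anomaly item
    status: str = "pending"   # pending -> sent | failed
    error: str = ""


def attempt_delivery(delivery: Delivery, send: Callable[[str], None]) -> Delivery:
    """Format the notification, try to send it, and record the outcome."""
    message = "\n".join(f"- {item}" for item in delivery.items)
    try:
        send(message)
    except Exception as exc:  # any destination error marks the delivery failed
        delivery.status = "failed"
        delivery.error = str(exc)
    else:
        delivery.status = "sent"
    return delivery
```

Writing the row as pending before the task runs means a crashed worker leaves a visible, unsent record instead of silently losing the alert.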
Storage & integrations
- PostgreSQL stores the plan, metrics, audit log, and alert deliveries.
- Photo / attachment storage is pluggable: local filesystem or GCS.
- Alert destinations: Slack, Telegram, generic webhook, email (SMTP), Jira (REST v3 with an ADF body), and Linear (GraphQL).
Observability (both opt-in)
- Prometheus: a /metrics endpoint, enabled with PROMETHEUS_METRICS_ENABLED, exposing scan, anomaly, alert-delivery, schema-drift, and Celery task counters and histograms.
- OpenTelemetry: tracing for FastAPI + SQLAlchemy + Celery, enabled with OTEL_EXPORTER_OTLP_ENDPOINT. It degrades to a logged no-op when the env var is blank or the opentelemetry-* packages aren't installed.
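The "logged no-op" behaviour for tracing can be approximated with a guard like the one below, which checks the environment variable and then whether the opentelemetry package is importable. The function name and log messages are assumptions; the real setup code would go on to configure the exporter and instrumentors when this returns True.

```python
import importlib.util
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("tracing")


def tracing_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when an OTLP endpoint is set and opentelemetry is importable."""
    env = os.environ if env is None else env
    endpoint = (env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or "").strip()
    if not endpoint:
        logger.info("tracing disabled: OTEL_EXPORTER_OTLP_ENDPOINT is blank")
        return False
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec("opentelemetry") is None:
        logger.info("tracing disabled: opentelemetry packages are not installed")
        return False
    return True
```

Checking for the package with find_spec avoids importing it at module load, so deployments without the optional dependency start normally.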
See also
- CONTRIBUTING.md — setup, commands, source tree, API surface.
- PLAN.md — product scope and roadmap.
- agent-api-guide.md — the API contract for agents and scripts.
- concepts.md — the same system in plain language.