Observability & Logging
Overview
OpenLift ships an enterprise-grade observability plugin (src/plugins/observability.plugin.ts) that wires Fastify into the logging, metrics, and event pipeline. The plugin:
- Injects correlation IDs on every request and propagates them to responses, logs, and emitted events.
- Decorates Fastify with helpers such as logBusinessEvent, logSecurityEvent, logAuditEvent, logPerformanceEvent, and logSystemEvent.
- Streams logs to the LoggingService (backed by Loki + Redis) with automatic batching, retries, adaptive buffering, and circuit breakers.
- Falls back to standard Fastify logging when observability is disabled, ensuring you always have audit trails.
- Exposes /metrics endpoints (including /metrics/memory) plus guarded admin routes for forcing flushes or resetting the Loki circuit breaker.
TL;DR: Enable the plugin, point it at Loki/Prometheus, and the rest of the stack (Progression Playbooks, Measurements, Workout History) automatically emits business/security/performance events with the same schema.
Runtime Flow
- Correlation hook – attaches/propagates x-correlation-id per request when ENABLE_CORRELATION_ID=true.
- Decorators – services call fastify.logBusinessEvent(...) etc. to write to Loki and emit internal events.
- Request logging – onRequest/onResponse hooks record HTTP traces unless paths are excluded.
- Metrics endpoints – GET /metrics (Prometheus scrape) and GET /metrics/memory (heap diagnostics) expose live stats when metrics are on.
- Admin endpoints – POST /admin/logging/flush and POST /admin/logging/reset-circuit-breaker (guarded by authenticate + authorizeAdmin) provide emergency controls.
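The correlation step can be pictured as a small pure function. This is a minimal sketch, not the plugin's actual hook: resolveCorrelationId is a hypothetical helper, and treating a blank header as "missing" is an assumption. Only the header name and the ENABLE_CORRELATION_ID switch come from the text above.

```typescript
import { randomUUID } from "node:crypto";

// Header name taken from the runtime flow above.
const CORRELATION_HEADER = "x-correlation-id";

type IncomingHeaders = Record<string, string | string[] | undefined>;

// Hypothetical helper: reuse the client's ID when present, otherwise
// generate a UUID. Returns undefined when the feature is switched off.
function resolveCorrelationId(
  headers: IncomingHeaders,
  enabled: boolean,
): string | undefined {
  if (!enabled) return undefined;
  const incoming = headers[CORRELATION_HEADER];
  const candidate = Array.isArray(incoming) ? incoming[0] : incoming;
  // Assumption: an empty or whitespace-only header counts as missing.
  if (candidate !== undefined && candidate.trim() !== "") {
    return candidate.trim();
  }
  return randomUUID();
}
```

A hook built on this would copy the returned value onto the response header and the request logger so that logs and emitted events share one ID.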
When OBSERVABILITY_ENABLED=false, the plugin stays loaded but every decorator falls back to writing to the base logger so instrumentation still works in dev or lightweight deployments.
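The fallback behaviour can be sketched as a decorator factory. The names createEventLogger, BaseLogger, and EventShipper are hypothetical and the real decorators live in the plugin; the sketch only illustrates that call sites stay the same whether or not observability is enabled.

```typescript
type EventPayload = Record<string, unknown>;

// Minimal logger shape for the sketch (assumption, not Fastify's full logger type).
interface BaseLogger {
  info(payload: EventPayload, message: string): void;
}

// Hypothetical stand-in for the Loki-backed transport.
interface EventShipper {
  ship(category: string, name: string, payload: EventPayload): void;
}

function createEventLogger(
  category: string,
  observabilityEnabled: boolean,
  baseLogger: BaseLogger,
  shipper?: EventShipper,
): (name: string, payload?: EventPayload) => void {
  return (name, payload = {}) => {
    if (observabilityEnabled && shipper) {
      shipper.ship(category, name, payload);
      return;
    }
    // Fallback path: write to the base logger so instrumentation still works.
    baseLogger.info({ category, event: name, ...payload }, `${category} event: ${name}`);
  };
}
```

With this shape, a service calls the returned function the same way in every environment and the factory decides where the event ends up.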
Environment Variables
All observability settings live in src/config/observability.config.ts. The schema enforces sensible defaults, but production deployments must explicitly provide Loki and Prometheus credentials.
Core Flags
| Variable | Default | Notes |
|---|---|---|
| OBSERVABILITY_ENABLED | false | Master switch; when false, decorators log via standard Fastify only. |
| METRICS_ENABLED | true | Enables /metrics + /metrics/memory. |
| LOGS_ENABLED | true | Allows event decorators to emit logs; disable for air-gapped testing. |
| ENABLE_CORRELATION_ID | true | Controls automatic request/response correlation IDs. |
| ENABLE_LOKI_SHIPPING | true | Turns on Loki transport; set false to force local fallback. |
| ENABLE_LOCAL_FALLBACK | true | Keeps file-based logging ready when Loki is unreachable. |
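To make the defaults concrete, here is a minimal sketch of reading these flags from an environment object. It is not the Zod schema from src/config/observability.config.ts: parseFlag and readCoreFlags are hypothetical, and accepting only the strings "true" and "false" is an assumption. The default values are copied from the table.

```typescript
type EnvSource = Record<string, string | undefined>;

// Defaults copied from the Core Flags table.
const CORE_FLAG_DEFAULTS = {
  OBSERVABILITY_ENABLED: false,
  METRICS_ENABLED: true,
  LOGS_ENABLED: true,
  ENABLE_CORRELATION_ID: true,
  ENABLE_LOKI_SHIPPING: true,
  ENABLE_LOCAL_FALLBACK: true,
} as const;

type CoreFlag = keyof typeof CORE_FLAG_DEFAULTS;
type CoreFlags = Record<CoreFlag, boolean>;

// Assumption: only "true" / "false" (any case) are valid; unset means default.
function parseFlag(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = raw.trim().toLowerCase();
  if (value === "true") return true;
  if (value === "false") return false;
  throw new Error(`Expected "true" or "false", got "${raw}"`);
}

function readCoreFlags(env: EnvSource): CoreFlags {
  const flags = {} as CoreFlags;
  for (const name of Object.keys(CORE_FLAG_DEFAULTS) as CoreFlag[]) {
    flags[name] = parseFlag(env[name], CORE_FLAG_DEFAULTS[name]);
  }
  return flags;
}
```

Rejecting unknown values rather than coercing them is a design choice of the sketch; it surfaces typos such as METRICS_ENABLED=yes at startup.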
External Targets
| Variable | Purpose |
|---|---|
| LOKI_ENDPOINT, LOKI_USERNAME, LOKI_PASSWORD | Required when observability is enabled in production. |
| PROMETHEUS_ENDPOINT, PROMETHEUS_USERNAME, PROMETHEUS_PASSWORD | Required for authenticated scrapes in production. |
Performance & Reliability
| Variable | Default | Description |
|---|---|---|
| LOKI_BATCH_SIZE / LOKI_BATCH_INTERVAL | 1000 / 5000 | Buffer size & dispatch interval (ms). |
| LOKI_MIN_BATCH_SIZE / LOKI_MAX_BATCH_SIZE | 100 / 5000 | Adaptive batching bounds. |
| LOKI_RETRY_ATTEMPTS, LOKI_RETRY_DELAY, LOKI_RETRY_MAX_DELAY | 3, 1000, 30000 | Retry policy for Loki writes. |
| CIRCUIT_BREAKER_ERROR_THRESHOLD, CIRCUIT_BREAKER_MIN_REQUESTS, CIRCUIT_BREAKER_TIMEOUT | 50, 20, 60000 | When exceeded, Loki transport opens the breaker and log writes fall back locally. |
| LOG_BUFFER_SIZE, LOG_BUFFER_MEMORY_LIMIT, LOG_BUFFER_HIGH_WATER_MARK, LOG_BUFFER_LOW_WATER_MARK | 100000, 50MB, 80, 60 | Protects memory usage when bursts occur. |
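The circuit breaker settings are easier to reason about with a small model. The class below is an illustrative sketch, not the LoggingService implementation. It assumes that CIRCUIT_BREAKER_ERROR_THRESHOLD is an error percentage, that CIRCUIT_BREAKER_TIMEOUT is a cool-down in milliseconds, and that the breaker simply closes again after the cool-down; none of these are confirmed by the table.

```typescript
interface BreakerOptions {
  errorThresholdPercent: number; // CIRCUIT_BREAKER_ERROR_THRESHOLD (assumed percentage)
  minRequests: number; // CIRCUIT_BREAKER_MIN_REQUESTS
  timeoutMs: number; // CIRCUIT_BREAKER_TIMEOUT (assumed cool-down in ms)
}

class SimpleCircuitBreaker {
  private readonly options: BreakerOptions;
  private successes = 0;
  private failures = 0;
  private openedAt: number | null = null;

  constructor(options: BreakerOptions) {
    this.options = options;
  }

  // True while writes should skip Loki and use the local fallback.
  isOpen(now: number = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.options.timeoutMs) {
      this.reset(); // cool-down elapsed: allow traffic again
      return false;
    }
    return true;
  }

  // Record the outcome of one write attempt.
  record(success: boolean, now: number = Date.now()): void {
    if (success) this.successes += 1;
    else this.failures += 1;
    const total = this.successes + this.failures;
    if (total < this.options.minRequests) return; // not enough data yet
    const errorPercent = (this.failures / total) * 100;
    if (errorPercent > this.options.errorThresholdPercent) this.openedAt = now;
  }

  // Manual close, similar in spirit to the admin reset endpoint.
  reset(): void {
    this.successes = 0;
    this.failures = 0;
    this.openedAt = null;
  }
}
```

Under these assumptions and the default values (50, 20, 60000), the breaker would open once at least 20 writes have been observed and more than half of them failed, then stay open for one minute unless it is reset by hand.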
Data Handling & Compliance
| Variable | Default | Description |
|---|---|---|
| ENABLE_DATA_SANITIZATION | true | Enables regex redaction of sensitive fields. |
| PII_DETECTION_ENABLED | true | Turns on PII detection heuristics. |
| SENSITIVE_FIELD_PATTERNS | email,password,token,secret,key,authorization | Comma-separated keywords to redact. |
| EVENT_EMISSION_MODE | infrastructure_only | Valid values: none, infrastructure_only, all, exceptions_only, custom. |
| EMIT_SECURITY_EVENTS, EMIT_BUSINESS_EVENTS, EMIT_OPERATIONAL_ERRORS | true | Fine-grained toggles for event types. |
| ENABLE_EVENT_RATE_LIMITING, MAX_EVENTS_PER_MINUTE | true, 500 | Guardrails for high-volume installations. |
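As an illustration of keyword-based redaction, the sketch below masks any field whose key contains one of the SENSITIVE_FIELD_PATTERNS keywords. It uses case-insensitive substring matching instead of the regex approach the table mentions, so treat it as a simplified stand-in. The function names and the "[REDACTED]" mask are hypothetical, and the sketch only handles plain JSON-like values.

```typescript
// Default keywords from SENSITIVE_FIELD_PATTERNS in the table above.
const DEFAULT_SENSITIVE_PATTERNS = "email,password,token,secret,key,authorization";

function parsePatterns(raw: string): string[] {
  return raw
    .split(",")
    .map((pattern) => pattern.trim().toLowerCase())
    .filter((pattern) => pattern.length > 0);
}

// Returns a deep copy in which the value of every matching key is masked.
// Assumption: a key matches when it contains a keyword, ignoring case.
function redact(value: unknown, patterns: string[], mask = "[REDACTED]"): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, patterns, mask));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      const sensitive = patterns.some((pattern) => key.toLowerCase().includes(pattern));
      out[key] = sensitive ? mask : redact(inner, patterns, mask);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Note that broad keywords such as key also match fields like apiKey or keyboardLayout, which is worth keeping in mind when extending the pattern list.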
Service Identity & Fallback
| Variable | Default | Description |
|---|---|---|
| SERVICE_NAME, SERVICE_VERSION | openlift-service, 1.0.0 | Included on every log/event. |
| DEPLOYMENT_ENVIRONMENT, CLUSTER_NAME, INSTANCE_ID | development, local-dev, generated | Use to distinguish multi-cluster installs. |
| FALLBACK_LOG_DIRECTORY, FALLBACK_MAX_FILE_SIZE, FALLBACK_MAX_FILES | ./logs, 100MB, 5 | Controls on-disk log rollover when Loki is down. |
📌 Production rule: When OBSERVABILITY_ENABLED=true and NODE_ENV=production, both Loki and Prometheus credentials must be present or the service will fail fast on boot.
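The production rule can be expressed as a small startup check. This is a sketch of the idea, not the schema's actual validation: findMissingCredentials and assertProductionCredentials are hypothetical names, and the error message is made up for illustration. The variable names come from the External Targets table.

```typescript
type EnvVars = Record<string, string | undefined>;

// Variable names from the External Targets table.
const REQUIRED_IN_PRODUCTION = [
  "LOKI_ENDPOINT",
  "LOKI_USERNAME",
  "LOKI_PASSWORD",
  "PROMETHEUS_ENDPOINT",
  "PROMETHEUS_USERNAME",
  "PROMETHEUS_PASSWORD",
] as const;

// Lists the credentials that are required but unset or blank.
function findMissingCredentials(env: EnvVars): string[] {
  const ruleApplies =
    env["OBSERVABILITY_ENABLED"] === "true" && env["NODE_ENV"] === "production";
  if (!ruleApplies) return [];
  return REQUIRED_IN_PRODUCTION.filter((name) => (env[name] ?? "").trim() === "");
}

// Fail fast: throw before the server starts listening.
function assertProductionCredentials(env: EnvVars): void {
  const missing = findMissingCredentials(env);
  if (missing.length > 0) {
    throw new Error(`Missing observability credentials: ${missing.join(", ")}`);
  }
}
```

Running a check like this against the deployment's environment in CI can catch a missing secret before the service refuses to boot.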
Configuration Profiles
configProfiles exports presets for development, staging, and production to keep batch sizes and sanitization aligned with each environment. You can merge them into your process env or .env file; e.g. in dev, smaller batches (LOKI_BATCH_SIZE=100) and ENABLE_EVENT_RATE_LIMITING=false make debugging easier.
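One way to picture merging a profile into the environment is shown below. The exampleProfiles object is a hypothetical stand-in for the exported configProfiles, whose real shape is not shown here; only the two development values come from the text above. Letting explicitly set variables win over the preset is also an assumption.

```typescript
type EnvMap = Record<string, string | undefined>;
type ProfileName = "development" | "staging" | "production";

// Hypothetical stand-in for configProfiles. Only the development values are
// taken from the text; the other profiles are intentionally left empty.
const exampleProfiles: Record<ProfileName, Record<string, string>> = {
  development: { LOKI_BATCH_SIZE: "100", ENABLE_EVENT_RATE_LIMITING: "false" },
  staging: {},
  production: {},
};

// Start from the preset, then let explicitly set variables override it.
function applyProfile(name: ProfileName, env: EnvMap): EnvMap {
  const merged: EnvMap = { ...exampleProfiles[name] };
  for (const [key, value] of Object.entries(env)) {
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}
```

For example, applyProfile("development", { LOKI_BATCH_SIZE: "250" }) keeps the explicit batch size of 250 while still picking up ENABLE_EVENT_RATE_LIMITING=false from the preset.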
Metrics & Admin Endpoints
| Endpoint | Method | Auth? | Description |
|---|---|---|---|
| /metrics | GET | optional | Prometheus scrape (requires METRICS_ENABLED=true). |
| /metrics/memory | GET | optional | Heap + buffer diagnostics pulled from LoggingService. |
| /admin/logging/flush | POST | yes (authenticate + authorizeAdmin) | Forces the Loki transport to flush buffered logs. |
| /admin/logging/reset-circuit-breaker | POST | yes | Manually closes the circuit breaker after Loki recovers. |
If you do not register fastify.authenticate and fastify.authorizeAdmin, the admin endpoints will not mount. Watch for the warning Observability admin endpoints not registered....
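The conditional mounting described above can be sketched without Fastify itself. AppLike below is a deliberately tiny, hypothetical interface rather than Fastify's API, and mountAdminRoutes is an illustrative name. The sketch only shows the guard logic: no auth decorators means no admin routes, plus a warning.

```typescript
type RouteHandler = () => unknown;

// Hypothetical, minimal app shape used only for this sketch.
interface AppLike {
  authenticate?: RouteHandler;
  authorizeAdmin?: RouteHandler;
  post(path: string, guards: RouteHandler[], handler: RouteHandler): void;
  warn(message: string): void;
}

interface AdminActions {
  flush: RouteHandler;
  resetBreaker: RouteHandler;
}

// Returns true when the admin routes were mounted, false when skipped.
function mountAdminRoutes(app: AppLike, actions: AdminActions): boolean {
  const { authenticate, authorizeAdmin } = app;
  if (!authenticate || !authorizeAdmin) {
    // Warning text quoted as it appears above (truncated in the source).
    app.warn("Observability admin endpoints not registered...");
    return false;
  }
  const guards = [authenticate, authorizeAdmin];
  app.post("/admin/logging/flush", guards, actions.flush);
  app.post("/admin/logging/reset-circuit-breaker", guards, actions.resetBreaker);
  return true;
}
```

Refusing to mount unguarded routes is the safer default here, since both endpoints change the logging pipeline's state.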
Integration Checklist
- Set the core environment variables and secrets for your deployment tier.
- Ensure LoggingService (Redis-backed) is available to the DI container. If not, the plugin instantiates a local fallback but you lose centralized shipping.
- Register your own event listeners via eventEmitter (observability.business_event, observability.security_event, etc.) if you want in-app reactions.
- Wire Grafana/Prometheus to Loki using the credentials you configured above.
- Verify correlation IDs propagate by hitting any API and checking response headers for x-correlation-id.
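For the event listener item, a minimal sketch using Node's built-in EventEmitter looks like this. The event names come from the checklist; the ObservabilityEvent payload shape, the locally created emitter, and the simulated emit call are assumptions for illustration. In the application you would attach listeners to the eventEmitter that the plugin provides.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical payload shape; the real event schema may differ.
interface ObservabilityEvent {
  name: string;
  correlationId?: string;
  payload: Record<string, unknown>;
}

// Stand-in emitter so the sketch runs on its own.
const eventEmitter = new EventEmitter();
const securityAlerts: ObservabilityEvent[] = [];

eventEmitter.on("observability.security_event", (event: ObservabilityEvent) => {
  // In-app reaction, e.g. queue an alert. Keep listeners fast and non-throwing.
  securityAlerts.push(event);
});

eventEmitter.on("observability.business_event", (event: ObservabilityEvent) => {
  console.log(`business event: ${event.name}`);
});

// Simulated emission, for illustration only.
eventEmitter.emit("observability.security_event", {
  name: "login.failed",
  correlationId: "abc-123",
  payload: { attempts: 3 },
});
```

Because EventEmitter calls listeners synchronously, slow work inside a listener delays the code that emitted the event, so heavier reactions are better handed off to a queue.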
Troubleshooting
- Loki unavailable
  - Call POST /admin/logging/reset-circuit-breaker once Loki has recovered.
  - Confirm ENABLE_LOCAL_FALLBACK=true so logs continue to write to disk.
  - Inspect logs/ for compressed fallback files and ship them manually if needed.
- Metrics missing
  - Verify METRICS_ENABLED=true and PROMETHEUS_* credentials if scraping remotely.
  - Ensure no reverse proxy is stripping the /metrics path.
- Correlation IDs
  - Confirm ENABLE_CORRELATION_ID=true.
  - Check incoming requests are not already sending malformed x-correlation-id; Fastify will reuse what the client provides.
Related Files
- src/plugins/observability.plugin.ts – main Fastify plugin implementing hooks, decorators, and admin routes.
- src/config/observability.config.ts – Zod schema + environment profiles.
- src/services/logging/logging.service.ts – Loki/Redis transport implementation referenced by the plugin.
Stay disciplined with these controls and OpenLift will emit consistent observability data no matter where you deploy it.