
Observability & Logging

Overview

OpenLift ships an enterprise-grade observability plugin (src/plugins/observability.plugin.ts) that wires Fastify into the logging, metrics, and event pipeline. The plugin:

  • Injects correlation IDs on every request and propagates them to responses, logs, and emitted events.
  • Decorates Fastify with helpers such as logBusinessEvent, logSecurityEvent, logAuditEvent, logPerformanceEvent, and logSystemEvent.
  • Streams logs to the LoggingService (backed by Loki + Redis) with automatic batching, retries, adaptive buffering, and circuit breakers.
  • Falls back to standard Fastify logging when observability is disabled, ensuring you always have audit trails.
  • Exposes /metrics endpoints (including /metrics/memory) plus guarded admin routes for forcing flushes or resetting the Loki circuit breaker.

TL;DR: Enable the plugin, point it at Loki/Prometheus, and the rest of the stack (Progression Playbooks, Measurements, Workout History) automatically emits business/security/performance events with the same schema.

Runtime Flow

  1. Correlation hook – attaches/propagates x-correlation-id per request when ENABLE_CORRELATION_ID=true.
  2. Decorators – services call fastify.logBusinessEvent(...) etc. to write to Loki and emit internal events.
  3. Request logging – onRequest/onResponse hooks record HTTP traces unless paths are excluded.
  4. Metrics endpoints – GET /metrics (Prometheus scrape) and GET /metrics/memory (heap diagnostics) expose live stats when metrics are on.
  5. Admin endpoints – POST /admin/logging/flush and POST /admin/logging/reset-circuit-breaker (guarded by authenticate + authorizeAdmin) provide emergency controls.
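
The correlation step (item 1) can be pictured with a small helper. This is a minimal sketch, assuming an incoming x-correlation-id header is reused and a new ID is generated otherwise; resolveCorrelationId and newId are hypothetical names, not code taken from the plugin.

```typescript
// Hypothetical helper for illustration; not taken from observability.plugin.ts.
type Headers = Record<string, string | string[] | undefined>;

const CORRELATION_HEADER = "x-correlation-id";

// Placeholder ID generator; a real service would more likely use a UUID.
function newId(): string {
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}

// Reuse the caller's ID when one is present, otherwise generate a fresh one.
function resolveCorrelationId(
  headers: Headers,
  generate: () => string = newId,
): string {
  const incoming = headers[CORRELATION_HEADER];
  // Node-style header maps may hold repeated headers as an array.
  const value = Array.isArray(incoming) ? incoming[0] : incoming;
  return value !== undefined && value.trim() !== "" ? value.trim() : generate();
}
```

The resolved value would then be written to the response header and attached to every log line and emitted event for that request.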

When OBSERVABILITY_ENABLED=false, the plugin stays loaded but every decorator falls back to writing to the base logger so instrumentation still works in dev or lightweight deployments.
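
The fallback behaviour can be sketched as a decorator factory that picks its destination once. This is an illustration under assumptions: the BaseLogger and EventShipper interfaces and the makeEventLogger name are hypothetical, since the real decorator signatures are not shown here.

```typescript
// Hypothetical shapes; the plugin's real signatures are not documented here.
interface BaseLogger {
  info(payload: object, msg: string): void;
}

interface EventShipper {
  ship(kind: string, event: string, data: object): void;
}

// Builds a logging function for one event kind ("business", "security", ...).
function makeEventLogger(
  kind: string,
  observabilityEnabled: boolean,
  shipper: EventShipper,
  base: BaseLogger,
) {
  return (event: string, data: Record<string, unknown> = {}): void => {
    if (observabilityEnabled) {
      // Full pipeline: hand the event to the shipping layer.
      shipper.ship(kind, event, data);
    } else {
      // Fallback: the call site is unchanged, but output goes to the base logger.
      base.info({ kind, event, ...data }, `${kind} event: ${event}`);
    }
  };
}
```

Because the switch lives inside the returned function, calling code does not need to know which mode is active.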

Environment Variables

All observability settings live in src/config/observability.config.ts. The schema enforces sane defaults, but production deployments must explicitly provide Loki and Prometheus credentials.

Core Flags

Variable | Default | Notes
OBSERVABILITY_ENABLED | false | Master switch; when false, decorators log via standard Fastify only.
METRICS_ENABLED | true | Enables /metrics + /metrics/memory.
LOGS_ENABLED | true | Allows event decorators to emit logs; disable for air-gapped testing.
ENABLE_CORRELATION_ID | true | Controls automatic request/response correlation IDs.
ENABLE_LOKI_SHIPPING | true | Turns on Loki transport; set false to force local fallback.
ENABLE_LOCAL_FALLBACK | true | Keeps file-based logging ready when Loki is unreachable.
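
To make the defaults concrete, here is a minimal sketch of reading the core flags from an environment map. The documented schema uses Zod; this stand-in uses plain TypeScript, and it assumes only the literal strings "true" and "false" are recognised. readFlag and readCoreFlags are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

// Defaults copied from the Core Flags table above.
const CORE_FLAG_DEFAULTS = {
  OBSERVABILITY_ENABLED: false,
  METRICS_ENABLED: true,
  LOGS_ENABLED: true,
  ENABLE_CORRELATION_ID: true,
  ENABLE_LOKI_SHIPPING: true,
  ENABLE_LOCAL_FALLBACK: true,
} as const;

type CoreFlag = keyof typeof CORE_FLAG_DEFAULTS;

// Assumption: anything other than "true"/"false" falls back to the default.
function readFlag(env: Env, name: CoreFlag): boolean {
  const raw = env[name]?.trim().toLowerCase();
  if (raw === "true") return true;
  if (raw === "false") return false;
  return CORE_FLAG_DEFAULTS[name];
}

// Resolves every core flag in one pass.
function readCoreFlags(env: Env): Record<CoreFlag, boolean> {
  const flags = {} as Record<CoreFlag, boolean>;
  for (const name of Object.keys(CORE_FLAG_DEFAULTS) as CoreFlag[]) {
    flags[name] = readFlag(env, name);
  }
  return flags;
}
```

In a Node service the env argument would typically be process.env.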

External Targets

Variable | Purpose
LOKI_ENDPOINT, LOKI_USERNAME, LOKI_PASSWORD | Required when observability is enabled in production.
PROMETHEUS_ENDPOINT, PROMETHEUS_USERNAME, PROMETHEUS_PASSWORD | Required for authenticated scrapes in production.

Performance & Reliability

Variable | Default | Description
LOKI_BATCH_SIZE / LOKI_BATCH_INTERVAL | 1000 / 5000 | Buffer size & dispatch interval (ms).
LOKI_MIN_BATCH_SIZE / LOKI_MAX_BATCH_SIZE | 100 / 5000 | Adaptive batching bounds.
LOKI_RETRY_ATTEMPTS, LOKI_RETRY_DELAY, LOKI_RETRY_MAX_DELAY | 3, 1000, 30000 | Retry policy for Loki writes.
CIRCUIT_BREAKER_ERROR_THRESHOLD, CIRCUIT_BREAKER_MIN_REQUESTS, CIRCUIT_BREAKER_TIMEOUT | 50, 20, 60000 | When exceeded, Loki transport opens the breaker and log writes fall back locally.
LOG_BUFFER_SIZE, LOG_BUFFER_MEMORY_LIMIT, LOG_BUFFER_HIGH_WATER_MARK, LOG_BUFFER_LOW_WATER_MARK | 100000, 50MB, 80, 60 | Protects memory usage when bursts occur.
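
The three circuit-breaker settings interact roughly as in the sketch below. It is an illustration only and makes assumptions the table does not confirm: that CIRCUIT_BREAKER_ERROR_THRESHOLD is a failure percentage, that CIRCUIT_BREAKER_TIMEOUT is in milliseconds, and that the breaker simply closes again once the timeout has passed. SimpleBreaker is a hypothetical class, not the transport's implementation.

```typescript
interface BreakerConfig {
  errorThresholdPercent: number; // CIRCUIT_BREAKER_ERROR_THRESHOLD (assumed percentage)
  minRequests: number;           // CIRCUIT_BREAKER_MIN_REQUESTS
  timeoutMs: number;             // CIRCUIT_BREAKER_TIMEOUT (assumed milliseconds)
}

class SimpleBreaker {
  private readonly config: BreakerConfig;
  private successes = 0;
  private failures = 0;
  private openedAt: number | null = null;

  constructor(config: BreakerConfig) {
    this.config = config;
  }

  // Record the outcome of one Loki write at time `now` (ms).
  record(ok: boolean, now: number): void {
    if (ok) this.successes += 1;
    else this.failures += 1;

    const total = this.successes + this.failures;
    const errorPercent = (this.failures / total) * 100;
    const enoughSamples = total >= this.config.minRequests;
    if (this.openedAt === null && enoughSamples && errorPercent > this.config.errorThresholdPercent) {
      this.openedAt = now; // open: writes should go to the local fallback
    }
  }

  // True while writes should bypass Loki.
  isOpen(now: number): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.config.timeoutMs) {
      this.reset(); // timeout elapsed, allow Loki writes again
      return false;
    }
    return true;
  }

  // Manual reset, comparable in spirit to the reset-circuit-breaker admin route.
  reset(): void {
    this.successes = 0;
    this.failures = 0;
    this.openedAt = null;
  }
}
```

With the defaults (50, 20, 60000), this sketch stays closed until at least 20 writes have been observed, then opens when more than half of them failed.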

Data Handling & Compliance

Variable | Default | Description
ENABLE_DATA_SANITIZATION | true | Enables regex redaction of sensitive fields.
PII_DETECTION_ENABLED | true | Turns on PII detection heuristics.
SENSITIVE_FIELD_PATTERNS | email,password,token,secret,key,authorization | Comma-separated keywords to redact.
EVENT_EMISSION_MODE | infrastructure_only | Valid values: none, infrastructure_only, all, exceptions_only, custom.
EMIT_SECURITY_EVENTS, EMIT_BUSINESS_EVENTS, EMIT_OPERATIONAL_ERRORS | true | Fine-grained toggles for event types.
ENABLE_EVENT_RATE_LIMITING, MAX_EVENTS_PER_MINUTE | true, 500 | Guardrails for high-volume installations.
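
As an illustration of what keyword-based redaction could look like, the sketch below walks a payload and masks any field whose name contains one of the SENSITIVE_FIELD_PATTERNS keywords. It uses case-insensitive substring matching rather than the regex redaction the service applies, and the "[REDACTED]" marker, parsePatterns, and redact are assumptions made for the example.

```typescript
// Default taken from the SENSITIVE_FIELD_PATTERNS row above.
const DEFAULT_PATTERNS = "email,password,token,secret,key,authorization";

// Splits the comma-separated setting into lowercase keywords.
function parsePatterns(raw: string = DEFAULT_PATTERNS): string[] {
  return raw
    .split(",")
    .map((p) => p.trim().toLowerCase())
    .filter((p) => p.length > 0);
}

// Assumption: a field is sensitive when its name contains any keyword.
function redact(value: unknown, patterns: string[]): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, patterns));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      const sensitive = patterns.some((p) => key.toLowerCase().includes(p));
      // Sensitive fields are masked whole; other fields are walked recursively.
      out[key] = sensitive ? "[REDACTED]" : redact(inner, patterns);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Note that a broad keyword such as key also matches names like apiKey, which is usually the intent but can over-redact.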

Service Identity & Fallback

Variable | Default | Description
SERVICE_NAME, SERVICE_VERSION | openlift-service, 1.0.0 | Included on every log/event.
DEPLOYMENT_ENVIRONMENT, CLUSTER_NAME, INSTANCE_ID | development, local-dev, generated | Use to distinguish multi-cluster installs.
FALLBACK_LOG_DIRECTORY, FALLBACK_MAX_FILE_SIZE, FALLBACK_MAX_FILES | ./logs, 100MB, 5 | Controls on-disk log rollover when Loki is down.

📌 Production rule: When OBSERVABILITY_ENABLED=true and NODE_ENV=production, both Loki and Prometheus credentials must be present or the service will fail fast on boot.
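
The production rule amounts to a boot-time check along these lines. This is a sketch, not the actual schema: the function names and the error message are hypothetical, and only the variable names come from the tables above.

```typescript
type Env = Record<string, string | undefined>;

// Credentials listed under External Targets.
const REQUIRED_IN_PRODUCTION = [
  "LOKI_ENDPOINT",
  "LOKI_USERNAME",
  "LOKI_PASSWORD",
  "PROMETHEUS_ENDPOINT",
  "PROMETHEUS_USERNAME",
  "PROMETHEUS_PASSWORD",
];

// Returns the names of required settings that are absent or blank.
function missingProductionSettings(env: Env): string[] {
  const enforced =
    env["OBSERVABILITY_ENABLED"] === "true" && env["NODE_ENV"] === "production";
  if (!enforced) return [];
  return REQUIRED_IN_PRODUCTION.filter((name) => (env[name] ?? "").trim() === "");
}

// Fail fast: throw before the server starts listening.
function assertObservabilityConfig(env: Env): void {
  const missing = missingProductionSettings(env);
  if (missing.length > 0) {
    throw new Error(`Missing observability settings: ${missing.join(", ")}`);
  }
}
```

Running a check like this in CI against the deployment's environment file can surface a missing credential before the service is rolled out.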

Configuration Profiles

configProfiles exports presets for development, staging, and production to keep batch sizes and sanitization aligned with each environment. You can merge them into your process env or .env file; e.g. in dev, smaller batches (LOKI_BATCH_SIZE=100) and ENABLE_EVENT_RATE_LIMITING=false make debugging easier.
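
One way to merge a profile into the environment is sketched below. The shape of configProfiles is not shown in this document, so exampleProfiles is a hypothetical stand-in that contains only the two development values quoted above, and the rule that explicitly set variables win over presets is an assumption.

```typescript
type EnvMap = Record<string, string>;

// Stand-in for configProfiles; only these two development values come from the docs.
const exampleProfiles: Record<string, EnvMap> = {
  development: {
    LOKI_BATCH_SIZE: "100",
    ENABLE_EVENT_RATE_LIMITING: "false",
  },
};

// Assumption: values already set in the environment override profile presets.
function applyProfile(
  profileName: string,
  env: Record<string, string | undefined>,
): EnvMap {
  const merged: EnvMap = { ...(exampleProfiles[profileName] ?? {}) };
  for (const [key, value] of Object.entries(env)) {
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}
```

The merged map can then be handed to the config schema in place of the raw environment.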

Metrics & Admin Endpoints

Endpoint | Method | Auth? | Description
/metrics | GET | optional | Prometheus scrape (requires METRICS_ENABLED=true).
/metrics/memory | GET | optional | Heap + buffer diagnostics pulled from LoggingService.
/admin/logging/flush | POST | yes (authenticate + authorizeAdmin) | Forces the Loki transport to flush buffered logs.
/admin/logging/reset-circuit-breaker | POST | yes | Manually closes the circuit breaker after Loki recovers.

If you do not register fastify.authenticate and fastify.authorizeAdmin, the admin endpoints will not mount. Watch the startup logs for the warning "Observability admin endpoints not registered...".
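
The conditional mounting can be pictured as follows. AppLike is a hypothetical, cut-down stand-in for the Fastify instance so the example stays self-contained; registerAdminRoutes and the actions argument are likewise assumptions. The preHandler route option mirrors how Fastify attaches guards to a route, and the warning text is reproduced as it appears above, including its elision.

```typescript
type Handler = (...args: unknown[]) => unknown;

// Minimal stand-in for the parts of a Fastify instance used here.
interface AppLike {
  authenticate?: Handler;
  authorizeAdmin?: Handler;
  log: { warn(msg: string): void };
  post(path: string, opts: { preHandler: Handler[] }, handler: Handler): void;
}

// Mounts the admin routes only when both guards are available.
// Returns true when the routes were registered.
function registerAdminRoutes(
  app: AppLike,
  actions: { flush: Handler; resetBreaker: Handler },
): boolean {
  const { authenticate, authorizeAdmin } = app;
  if (!authenticate || !authorizeAdmin) {
    app.log.warn("Observability admin endpoints not registered...");
    return false;
  }
  const guards = [authenticate, authorizeAdmin];
  app.post("/admin/logging/flush", { preHandler: guards }, actions.flush);
  app.post(
    "/admin/logging/reset-circuit-breaker",
    { preHandler: guards },
    actions.resetBreaker,
  );
  return true;
}
```

Skipping registration instead of mounting unguarded routes means a missing auth plugin cannot expose the emergency controls by accident.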

Integration Checklist

  1. Set the core environment variables and secrets for your deployment tier.
  2. Ensure LoggingService (Redis-backed) is available to the DI container. If not, the plugin instantiates a local fallback but you lose centralized shipping.
  3. Register your own event listeners via eventEmitter (observability.business_event, observability.security_event, etc.) if you want in-app reactions.
  4. Wire Grafana/Prometheus to Loki using the credentials you configured above.
  5. Verify correlation IDs propagate by hitting any API and checking response headers for x-correlation-id.
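
For step 3, an in-app reaction might look like the sketch below. The event name observability.security_event comes from the checklist; everything else is assumed for illustration, including the payload shape and the TinyEmitter class, which stands in for the application's eventEmitter so the example runs on its own.

```typescript
type Listener = (payload: Record<string, unknown>) => void;

// Tiny stand-in for the application's eventEmitter.
class TinyEmitter {
  private readonly listeners = new Map<string, Listener[]>();

  on(event: string, listener: Listener): void {
    const list = this.listeners.get(event) ?? [];
    list.push(listener);
    this.listeners.set(event, list);
  }

  emit(event: string, payload: Record<string, unknown>): void {
    for (const listener of this.listeners.get(event) ?? []) {
      listener(payload);
    }
  }
}

const eventEmitter = new TinyEmitter();
const securityAlerts: string[] = [];

// React to security events inside the app, e.g. to queue an alert.
// The `event` field on the payload is a hypothetical name.
eventEmitter.on("observability.security_event", (payload) => {
  securityAlerts.push(String(payload["event"] ?? "unknown"));
});
```

Listeners should stay cheap and non-throwing, since they run in the same process as the request that triggered the event.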

Troubleshooting

If Loki is unreachable or logs have stopped shipping:

  • Once Loki has recovered, call POST /admin/logging/reset-circuit-breaker to close the circuit breaker manually.
  • Confirm ENABLE_LOCAL_FALLBACK=true so logs continue to write to disk.
  • Inspect logs/ for compressed fallback files and ship them manually if needed.

Related source files:

  • src/plugins/observability.plugin.ts – main Fastify plugin implementing hooks, decorators, and admin routes.
  • src/config/observability.config.ts – Zod schema + environment profiles.
  • src/services/logging/logging.service.ts – Loki/Redis transport implementation referenced by the plugin.

Stay disciplined with these controls and OpenLift will emit consistent observability data no matter where you deploy it.