Distributed Tracing
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
Purpose
Track requests across distributed systems to understand latency, dependencies, and failure points.
When to Use
- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths
Distributed Tracing Concepts
Trace Structure
Trace (Request ID: abc123)
β
Span (frontend) [100ms]
β
Span (api-gateway) [80ms]
ββ Span (auth-service) [10ms]
ββ Span (user-service) [60ms]
ββ Span (database) [40ms]
Key Components
- Trace - End-to-end request journey
- Span - Single operation within a trace
- Context - Metadata propagated between services
- Tags - Key-value pairs for filtering
- Logs - Timestamped events within a span
Jaeger Setup
Kubernetes Deployment
# Deploy Jaeger Operator kubectl create namespace observability kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability # Deploy Jaeger instance kubectl apply -f - <<EOF apiVersion: jaegertracing.io/v1 kind: Jaeger metadata: name: jaeger namespace: observability spec: strategy: production storage: type: elasticsearch options: es: server-urls: http://elasticsearch:9200 ingress: enabled: true EOF
Docker Compose
version: "3.8" services: jaeger: image: jaegertracing/all-in-one:latest ports: - "5775:5775/udp" - "6831:6831/udp" - "6832:6832/udp" - "5778:5778" - "16686:16686" # UI - "14268:14268" # Collector - "14250:14250" # gRPC - "9411:9411" # Zipkin environment: - COLLECTOR_ZIPKIN_HOST_PORT=:9411
Reference: See references/jaeger-setup.md
Application Instrumentation
OpenTelemetry (Recommended)
Python (Flask)
from opentelemetry import trace from opentelemetry.exporter.jaeger.thrift import JaegerExporter from opentelemetry.sdk.resources import SERVICE_NAME, Resource from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import BatchSpanProcessor from opentelemetry.instrumentation.flask import FlaskInstrumentor from flask import Flask # Initialize tracer resource = Resource(attributes={SERVICE_NAME: "my-service"}) provider = TracerProvider(resource=resource) processor = BatchSpanProcessor(JaegerExporter( agent_host_name="jaeger", agent_port=6831, )) provider.add_span_processor(processor) trace.set_tracer_provider(provider) # Instrument Flask app = Flask(__name__) FlaskInstrumentor().instrument_app(app) @app.route('/api/users') def get_users(): tracer = trace.get_tracer(__name__) with tracer.start_as_current_span("get_users") as span: span.set_attribute("user.count", 100) # Business logic users = fetch_users_from_db() return {"users": users} def fetch_users_from_db(): tracer = trace.get_tracer(__name__) with tracer.start_as_current_span("database_query") as span: span.set_attribute("db.system", "postgresql") span.set_attribute("db.statement", "SELECT * FROM users") # Database query return query_database()
Node.js (Express)
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node"); const { JaegerExporter } = require("@opentelemetry/exporter-jaeger"); const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base"); const { registerInstrumentations } = require("@opentelemetry/instrumentation"); const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http"); const { ExpressInstrumentation, } = require("@opentelemetry/instrumentation-express"); // Initialize tracer const provider = new NodeTracerProvider({ resource: { attributes: { "service.name": "my-service" } }, }); const exporter = new JaegerExporter({ endpoint: "http://jaeger:14268/api/traces", }); provider.addSpanProcessor(new BatchSpanProcessor(exporter)); provider.register(); // Instrument libraries registerInstrumentations({ instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()], }); const express = require("express"); const app = express(); app.get("/api/users", async (req, res) => { const tracer = trace.getTracer("my-service"); const span = tracer.startSpan("get_users"); try { const users = await fetchUsers(); span.setAttributes({ "user.count": users.length }); res.json({ users }); } finally { span.end(); } });
Go
package main import ( "context" "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/exporters/jaeger" "go.opentelemetry.io/otel/sdk/resource" sdktrace "go.opentelemetry.io/otel/sdk/trace" semconv "go.opentelemetry.io/otel/semconv/v1.4.0" ) func initTracer() (*sdktrace.TracerProvider, error) { exporter, err := jaeger.New(jaeger.WithCollectorEndpoint( jaeger.WithEndpoint("http://jaeger:14268/api/traces"), )) if err != nil { return nil, err } tp := sdktrace.NewTracerProvider( sdktrace.WithBatcher(exporter), sdktrace.WithResource(resource.NewWithAttributes( semconv.SchemaURL, semconv.ServiceNameKey.String("my-service"), )), ) otel.SetTracerProvider(tp) return tp, nil } func getUsers(ctx context.Context) ([]User, error) { tracer := otel.Tracer("my-service") ctx, span := tracer.Start(ctx, "get_users") defer span.End() span.SetAttributes(attribute.String("user.filter", "active")) users, err := fetchUsersFromDB(ctx) if err != nil { span.RecordError(err) return nil, err } span.SetAttributes(attribute.Int("user.count", len(users))) return users, nil }
Reference: See references/instrumentation.md
Context Propagation
HTTP Headers
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
Propagation in HTTP Requests
Python
from opentelemetry.propagate import inject headers = {} inject(headers) # Injects trace context response = requests.get('http://downstream-service/api', headers=headers)
Node.js
const { propagation } = require("@opentelemetry/api"); const headers = {}; propagation.inject(context.active(), headers); axios.get("http://downstream-service/api", { headers });
Tempo Setup (Grafana)
Kubernetes Deployment
apiVersion: v1 kind: ConfigMap metadata: name: tempo-config data: tempo.yaml: | server: http_listen_port: 3200 distributor: receivers: jaeger: protocols: thrift_http: grpc: otlp: protocols: http: grpc: storage: trace: backend: s3 s3: bucket: tempo-traces endpoint: s3.amazonaws.com querier: frontend_worker: frontend_address: tempo-query-frontend:9095 --- apiVersion: apps/v1 kind: Deployment metadata: name: tempo spec: replicas: 1 template: spec: containers: - name: tempo image: grafana/tempo:latest args: - -config.file=/etc/tempo/tempo.yaml volumeMounts: - name: config mountPath: /etc/tempo volumes: - name: config configMap: name: tempo-config
Reference: See assets/jaeger-config.yaml.template
Sampling Strategies
Probabilistic Sampling
# Sample 1% of traces sampler: type: probabilistic param: 0.01
Rate Limiting Sampling
# Sample max 100 traces per second sampler: type: ratelimiting param: 100
Adaptive Sampling
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased # Sample based on trace ID (deterministic) sampler = ParentBased(root=TraceIdRatioBased(0.01))
Trace Analysis
Finding Slow Requests
Jaeger Query:
service=my-service
duration > 1s
Finding Errors
Jaeger Query:
service=my-service
error=true
tags.http.status_code >= 500
Service Dependency Graph
Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies
Best Practices
- Sample appropriately (1-10% in production)
- Add meaningful tags (user_id, request_id)
- Propagate context across all service boundaries
- Log exceptions in spans
- Use consistent naming for operations
- Monitor tracing overhead (<1% CPU impact)
- Set up alerts for trace errors
- Implement distributed context (baggage)
- Use span events for important milestones
- Document instrumentation standards
Integration with Logging
Correlated Logs
import logging from opentelemetry import trace logger = logging.getLogger(__name__) def process_request(): span = trace.get_current_span() trace_id = span.get_span_context().trace_id logger.info( "Processing request", extra={"trace_id": format(trace_id, '032x')} )
Troubleshooting
No traces appearing:
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs
High latency overhead:
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
Reference Files
references/jaeger-setup.md- Jaeger installationreferences/instrumentation.md- Instrumentation patternsassets/jaeger-config.yaml.template- Jaeger configuration
Related Skills
prometheus-configuration- For metricsgrafana-dashboards- For visualizationslo-implementation- For latency SLOs