Build an OpenTelemetry stack for Kubernetes apps

This post records how I built an OpenTelemetry stack for Kubernetes apps.

I started with Docker Compose instead of moving the whole observability backend into Kubernetes immediately. That gave me a smaller blast radius: application pods can export OTLP data to one host, while Prometheus, Loki, Tempo, and Grafana run as a separate backend.

The stack looks like this:

Kubernetes app -> OTLP HTTP/gRPC -> OpenTelemetry Collector -> Prometheus/Loki/Tempo -> Grafana

This post uses the same fake application set as the other examples. The OpenTelemetry snippets focus on example-api, while the same pattern can be repeated for example-worker and example-admin.

example-api: service.name=example-api, service.namespace=example
example-worker: service.name=example-worker, service.namespace=example
example-admin: service.name=example-admin, service.namespace=example

Series

This post is part of my home Kubernetes GitOps series.

Services

The compose stack has five services:

otel-collector: receives OTLP and reads pod logs
prometheus: scrapes collector-exported metrics
loki: stores logs
tempo: stores traces
grafana: browses metrics, logs, and traces

The collector exposes the two standard OTLP ports plus a Prometheus scrape endpoint:

4317: OTLP gRPC
4318: OTLP HTTP
9464: Prometheus scrape endpoint

Docker Compose shape

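The collector needs access to Kubernetes pod stdout and stderr logs on the host. Those files live under /var/log/pods on the Kubernetes node, so this mount only works when the Compose host is also a node or otherwise has that directory available.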

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.153.0
    restart: unless-stopped
    user: "0:0"
    command:
      - --config=/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "9464:9464"
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
      - /var/log/pods:/var/log/pods:ro

Running the collector as root is not elegant, but it was the practical way to read pod log files from /var/log/pods in this environment. If the host permissions are different, I would prefer a narrower user/group setup.

Prometheus scrapes the collector:

prometheus:
  image: prom/prometheus:v3.11.3
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.path=/prometheus
  ports:
    - "9090:9090"
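
The compose snippet points Prometheus at /etc/prometheus/prometheus.yml but does not show that file. A minimal sketch of what it could contain follows, assuming the file is mounted into the container at that path and that Prometheus shares a Docker network with the collector. The job name and scrape interval are my own placeholders, not values from the original setup.

```yaml
# prometheus.yml (sketch; job name and interval are assumptions)
global:
  scrape_interval: 15s

scrape_configs:
  # Scrape the metrics that the collector's prometheus exporter serves on 9464.
  - job_name: otel-collector
    static_configs:
      - targets:
          - otel-collector:9464
```

The target uses the compose service name, so it resolves only inside the Docker network.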

Loki and Tempo store logs and traces:

loki:
  image: grafana/loki:3.7.2
  ports:
    - "3100:3100"

tempo:
  image: grafana/tempo:3.0.0
  command:
    - -target=all
    - -config.file=/etc/tempo.yaml
  ports:
    - "3200:3200"
    - "4319:4317"

For Grafana, I do not rely on the default admin credentials. Anything persistent should get a strong admin password or a secret file.

grafana:
  image: grafana/grafana:13.0.1-security-01
  ports:
    - "3000:3000"
  environment:
    GF_SECURITY_ADMIN_USER: admin
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?set a password}
    GF_AUTH_ANONYMOUS_ENABLED: "false"
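
Grafana still needs data sources for the three backends. One way to avoid clicking through the UI is a provisioning file. The sketch below assumes it is mounted under /etc/grafana/provisioning/datasources/ in the container, and that the backends are reachable by their compose service names on the ports shown above. The data source names are placeholders.

```yaml
# datasources.yaml (sketch; data source names are assumptions)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```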

Collector receivers

The collector receives OTLP data from applications:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

It also reads Kubernetes pod log files:

  file_log/example_api:
    include:
      - /var/log/pods/example-api_example-api-*_*/app/*.log
    start_at: end
    include_file_path: true
    operators:
      - type: container
      - type: regex_parser
        parse_from: attributes["log.file.path"]
        regex: '^/var/log/pods/(?P<k8s_namespace_name>[^_]+)_(?P<k8s_pod_name>[^_]+)_[^/]+/(?P<k8s_container_name>[^/]+)/(?P<k8s_restart_count>\d+)\.log$'

The path pattern is important. A ready Loki does not mean logs exist. If the collector cannot read the host path or the include pattern is wrong, Grafana Explore will still look empty.

Resource attributes

I add the deployment.environment.name attribute in the collector so metrics, logs, and traces can be correlated by environment.

processors:
  resource:
    attributes:
      - key: deployment.environment.name
        value: production
        action: upsert

For pod logs, I transform parsed file path attributes into Kubernetes resource attributes:

  transform/example_logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["k8s.namespace.name"], log.attributes["k8s_namespace_name"])
          - set(resource.attributes["k8s.pod.name"], log.attributes["k8s_pod_name"])
          - set(resource.attributes["k8s.container.name"], log.attributes["k8s_container_name"])
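          # Stamp service identity only on logs parsed from example-api pod files,
          # so OTLP logs from other services in this pipeline keep their own service.name.
          - set(resource.attributes["service.namespace"], "example") where log.attributes["k8s_namespace_name"] == "example-api"
          - set(resource.attributes["service.name"], "example-api") where log.attributes["k8s_namespace_name"] == "example-api"
          - set(resource.attributes["deployment.environment.name"], "production") where log.attributes["k8s_namespace_name"] == "example-api"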

This makes Loki labels and Grafana queries more useful than raw file names.

Pipelines

The collector has separate pipelines for traces, metrics, and logs.

exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
    resource_to_telemetry_conversion:
      enabled: true
  otlp_grpc/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  otlp_http/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - resource
        - batch
      exporters:
        - otlp_grpc/tempo
    metrics:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - resource
        - batch
      exporters:
        - prometheus
    logs:
      receivers:
        - otlp
        - file_log/example_api
      processors:
        - memory_limiter
        - transform/example_logs
        - resource
        - batch
      exporters:
        - otlp_http/loki

The Tempo and Loki endpoints are inside the Docker network, so plain internal service names are enough for this compose stack.
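
The pipelines reference memory_limiter and batch, but neither appears in the processors snippets earlier in this post. Both have to be declared under processors, and memory_limiter needs explicit limits before the collector will start. Here is a minimal sketch. The numbers are assumptions for a small home setup, not tuned values from the original stack.

```yaml
processors:
  # Refuse new data when memory gets tight instead of letting the container be OOM-killed.
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # assumed hard limit; size it to the container's memory
    spike_limit_mib: 128  # assumed headroom for short bursts

  # Group telemetry into batches before export to cut down on outgoing requests.
  batch:
    timeout: 5s
    send_batch_size: 1024
```

Keeping memory_limiter first in each pipeline, as the pipelines above do, lets it push back before other processors spend work on the data.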

App environment

From inside a Kubernetes pod, localhost means the pod itself. The OTLP endpoint must be a host reachable from the cluster.

For OTLP HTTP:

OTEL_SERVICE_NAME=example-api
OTEL_RESOURCE_ATTRIBUTES=deployment.environment.name=production,service.namespace=example
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel.example.internal:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_LOGS_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp

For OTLP gRPC:

OTEL_SERVICE_NAME=example-api
OTEL_RESOURCE_ATTRIBUTES=deployment.environment.name=production,service.namespace=example
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel.example.internal:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_LOGS_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
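
If the variables are set directly on the workload rather than through a secret-backed .env, the OTLP HTTP variant could look roughly like this in a Deployment. The namespace example-api and container name app are taken from the log include path earlier in this post. The image reference, labels, and replica count are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
  namespace: example-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: app
          image: registry.example.internal/example-api:latest  # hypothetical image
          env:
            - name: OTEL_SERVICE_NAME
              value: example-api
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: deployment.environment.name=production,service.namespace=example
            # Must be reachable from the cluster; localhost would be the pod itself.
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel.example.internal:4318
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: http/protobuf
            - name: OTEL_LOGS_EXPORTER
              value: otlp
            - name: OTEL_METRICS_EXPORTER
              value: otlp
            - name: OTEL_TRACES_EXPORTER
              value: otlp
```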

If the app receives its .env from Vault and External Secrets, I add these values to Vault instead of committing them into Git.

The examples above use unencrypted OTLP (plain HTTP and plaintext gRPC) inside an internal network. If telemetry crosses an untrusted network, I would put TLS in front of the collector or use an OTLP endpoint that supports TLS directly.
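
For the second option, the collector's OTLP receiver can terminate TLS itself. This is a sketch only: the certificate and key paths are hypothetical, and the files would need to be mounted into the collector container.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol-contrib/tls/server.crt  # hypothetical path
          key_file: /etc/otelcol-contrib/tls/server.key   # hypothetical path
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /etc/otelcol-contrib/tls/server.crt  # hypothetical path
          key_file: /etc/otelcol-contrib/tls/server.key   # hypothetical path
```

With this in place, the app-side OTEL_EXPORTER_OTLP_ENDPOINT would switch to an https:// URL, and the app has to trust the issuing CA.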

Start and check

Start the stack:

cd opentelemetry
docker compose up -d

Check service readiness:

docker compose ps
curl http://localhost:9090/-/ready
curl http://localhost:3100/ready
curl http://localhost:3200/ready

Then verify data, not only service health.

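For Prometheus, I start with the up query. It only proves that the collector target is being scraped, so I follow it with a metric that example-api actually exports.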

For Loki, I use {service_name=~".+"} to confirm that any service-labelled log stream exists.

For Tempo, I search by service.name = example-api. In TraceQL that is { resource.service.name = "example-api" }.

If Loki is ready but {service_name=~".+"} returns nothing, I do not blame the Grafana UI first. I check whether the collector is reading pod logs and whether the app is exporting OTLP logs.

Common problems

localhost from a pod points to the pod, not the Docker host. Use a reachable host name for the OTLP endpoint.

Loki can be healthy and still have no streams. Check labels or a broad matcher before assuming Grafana is broken.

Tempo config can change between major versions. If Tempo crash-loops after an upgrade, check the config shape before debugging Docker networking.

High-cardinality labels inflate the number of Prometheus series and make dashboards noisy. Normalize route labels in the app or collector before they become Prometheus series.
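
On the collector side, one possible approach is a transform processor that rewrites the route attribute before export. The sketch below assumes the app puts the raw path in an http.route data point attribute. The processor name is hypothetical, and the pattern is deliberately rough: it collapses a slash followed by digits and nothing smarter.

```yaml
processors:
  # Hypothetical processor name; list it in the metrics pipeline before batch.
  transform/normalize_routes:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # Collapse numeric path segments such as /orders/12345 into /orders/{id}.
          - 'replace_pattern(datapoint.attributes["http.route"], "/[0-9]+", "/{id}")'
```

Fixing the route template in the app's instrumentation is usually cleaner, because the collector only sees the string and cannot tell an ID from a legitimate numeric path segment.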

Conclusion

This compose stack is a good middle step. The Kubernetes app gets real metrics, logs, and traces, but the observability backend stays outside the cluster while I iterate.

The important validation lesson is simple: a green backend is not the same thing as ingested data. I check Prometheus, Loki, and Tempo directly before declaring the pipeline healthy.