Next.js Observability with OpenTelemetry in Production

An Tran • May 9, 2026 • 11 min read

Engineering Lead

A practical setup for tracing API latency, server actions, and upstream dependencies in Next.js applications using OpenTelemetry.

Why teams miss production incidents

In a traditional Next.js application, monitoring is often limited to server console logs and generic frontend error boundaries. When a customer reports that a checkout page hangs, developers are left searching through unstructured cloud logs. Without distributed tracing, it is very hard to reconstruct the causal path from a user click, through Next.js Server Actions, to an internal API, and down to a slow database query or third-party payment gateway.

OpenTelemetry (OTel) provides a vendor-neutral standard to instrument your application. By integrating OTel into your Next.js application, you capture trace telemetry across the entire request lifecycle. This post outlines the minimal setup required to gain clear visibility into your Next.js application running in production.

Minimal instrumentation setup

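First, on Next.js 14 and earlier, enable the experimental instrumentation hook in your Next.js config file (for example `next.config.mjs`). Next.js 15 and later load `instrumentation.ts` by default, so the flag is no longer needed there. The Next.js 14 config looks like this: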

ts

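/** @type {import('next').NextConfig} */
const nextConfig = {
  experimental: {
    // Next.js 14 and earlier only; Next.js 15+ loads instrumentation.ts by default.
    instrumentationHook: true,
  },
};
export default nextConfig;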

Next, create a file named `instrumentation.ts` in the root of your application (or inside the `src` directory if you use one). This code configures the OpenTelemetry Node SDK to intercept outgoing network requests, database connections, and incoming HTTP requests, sending traces to your collector:

ts

export async function register() {
  // instrumentation.ts is also bundled for the Edge runtime, so load the
  // Node-only OpenTelemetry packages lazily, and only under Node.js.
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { NodeSDK } = await import("@opentelemetry/sdk-node");
    const { getNodeAutoInstrumentations } = await import("@opentelemetry/auto-instrumentations-node");
    const { OTLPTraceExporter } = await import("@opentelemetry/exporter-trace-otlp-http");

    const sdk = new NodeSDK({
      // With no explicit `url`, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT
      // (appending /v1/traces) and defaults to http://localhost:4318/v1/traces.
      traceExporter: new OTLPTraceExporter(),
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();
  }
}
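
Auto-instrumentation covers HTTP and database clients, but business steps such as a checkout Server Action often deserve a span of their own. The sketch below shows one way to wrap such a step. To stay dependency-free, `SpanLike` and `TracerLike` are hypothetical local interfaces modeled on the span methods in `@opentelemetry/api` (`setAttribute`, `recordException`, `setStatus`, `end`) and on `startActiveSpan`. The `withSpan` helper and the attribute names are illustrative choices of ours, not part of any library.

```typescript
// Minimal span/tracer shapes, modeled on @opentelemetry/api.
// Assumption: declared locally so this sketch has no dependencies.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Mirrors SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

// Runs `work` inside a named span, records failures, and always ends the span.
export async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  attributes: Record<string, string | number | boolean>,
  work: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    for (const [key, value] of Object.entries(attributes)) {
      span.setAttribute(key, value);
    }
    try {
      return await work();
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: STATUS_ERROR, message: error.message });
      throw err; // rethrow so the caller still sees the failure
    } finally {
      span.end(); // a span that is never ended is never exported
    }
  });
}
```

In a real app the tracer would typically come from `trace.getTracer("checkout")` in `@opentelemetry/api`, passed in directly or through a thin adapter. A Server Action could then call something like `withSpan(tracer, "checkout.submit", { "cart.items": 3 }, () => submitOrder(cart))`, where `submitOrder` is a placeholder for your own logic.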

Alerting thresholds we recommend

Observability is only as good as the alerts that prompt your team to act. We recommend monitoring three core signals on your observability platform (e.g., Grafana, Honeycomb, or Datadog):

p95 API latency: Trigger a warning if the p95 latency of key API endpoints exceeds 1200ms over a rolling 5-minute window.
Checkout failure rate: Trigger a critical incident if the error rate (5xx or rejected promises) on checkout or payment endpoints exceeds 1.5% in any 5-minute window.
Database connection pool exhaustion: Alert if the database connection pool utilization remains above 75% for more than 5 minutes.
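
In practice these rules live in your observability platform, not in application code, but writing them out helps pin down exactly what each threshold means. The following is a minimal sketch under several assumptions: the `WindowStats` shape, the signal names, and the nearest-rank percentile method are our own illustrative choices. The severity of the pool alert is also assumed, since the rule above only says to alert.

```typescript
// One 5-minute window of raw measurements (hypothetical shape).
interface WindowStats {
  latenciesMs: number[];     // latency of each key API request in the window
  checkoutRequests: number;  // total checkout/payment requests
  checkoutFailures: number;  // 5xx responses or rejected promises
  poolUtilization: number[]; // sampled pool utilization, 0..1, across the window
}

type Alert = { signal: string; severity: "warning" | "critical"; detail: string };

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
export function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

export function evaluateAlerts(w: WindowStats): Alert[] {
  const alerts: Alert[] = [];

  // p95 API latency above 1200 ms: warning.
  const p95 = percentile(w.latenciesMs, 95);
  if (p95 > 1200) {
    alerts.push({ signal: "p95_api_latency", severity: "warning", detail: `p95=${p95}ms` });
  }

  // Checkout failure rate above 1.5%: critical.
  const failureRate = w.checkoutRequests === 0 ? 0 : w.checkoutFailures / w.checkoutRequests;
  if (failureRate > 0.015) {
    alerts.push({
      signal: "checkout_failure_rate",
      severity: "critical",
      detail: `rate=${(failureRate * 100).toFixed(2)}%`,
    });
  }

  // Pool utilization above 75% for the whole window, so every sample must exceed it.
  // Assumption: treated as a warning; the rule itself does not name a severity.
  const sustained = w.poolUtilization.length > 0 && w.poolUtilization.every((u) => u > 0.75);
  if (sustained) {
    alerts.push({ signal: "db_pool_utilization", severity: "warning", detail: "above 75% for the full window" });
  }

  return alerts;
}
```

Note that the pool rule checks every sample and not the average: a single dip below 75% resets the condition, which matches "remains above" in the rule.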

Incident triage flow

When an alert triggers, the SRE team should follow a structured triage sequence to isolate the offending layer:

[Diagram: incident triage flow]

Our take

Do not build complex, decorative dashboards that try to track every possible metric. Good observability exists to resolve critical incidents quickly. Focus your initial instrumentation on business-critical user journeys, specifically checkout, search, and authentication. Once those paths are traced, expand coverage to background workers and caching layers. Start simple, prioritize trace context propagation, and keep alerting thresholds realistic.
