OpenTelemetry via instrumentation.ts, instrumentation-client.js for browser SDK boot, custom spans across server/edge/client, Sentry integration, useReportWebVitals for CWV shipping, and four production runbooks: TTFB regression, cache miss storm, hydration error, memory leak during rolling deploy.
A-15 — Production Observability and Runbooks
Who this is for: Architects responsible for a Next.js application in production — the ones who get paged at 3am. This module is about building the observability infrastructure that turns "the site is down" into "the database connection pool exhausted at 03:14 UTC, triggered by a deployment that removed the connection limit from the Prisma config, here's the fix." That level of precision comes from traces, metrics, logs, and runbooks built before the incident, not during it.
The Three Pillars of Observability
Observability is the ability to understand a system's internal state from its external outputs. The three pillars:
Traces — the journey of a single request through your system. A trace for a product page request shows: Middleware execution (2ms), Server Component render (8ms), database query for product (45ms), database query for reviews (120ms), response sent. Traces answer "why was this specific request slow?"
Metrics — aggregated measurements over time. Request rate, error rate, p50/p95/p99 response times, cache hit rate, database connection pool size. Metrics answer "is the system healthy overall, and are things getting worse?"
Logs — discrete events. "User 123 purchased product 456." "Cache miss for key products:page:1." "Database query failed: connection timeout." Logs answer "what happened?"
None of these is sufficient alone. A slow request shows up in metrics (rising p99), is diagnosed via traces (database query taking 2s), and confirmed by logs (connection pool exhausted). The triad is the diagnostic workflow.
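To make the metrics pillar concrete, here is a minimal sketch of how p50/p95/p99 can be derived from raw latency samples using the nearest-rank method. The `percentile` helper and the sample values are illustrative assumptions; a metrics backend would typically compute these from histograms rather than from raw samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
// `percentile` is a hypothetical helper, not part of any library.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Ten made-up response times in milliseconds, one of them a slow outlier.
const latenciesMs = [42, 45, 47, 51, 55, 58, 63, 70, 95, 2100];

const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
const p99 = percentile(latenciesMs, 99);
```

With these samples p50 is 55 ms while p95 and p99 are both 2100 ms: the median looks healthy, and only the tail percentiles expose the slow request. That is why the alerting thresholds later in this module are set on p99 rather than on the average.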
OpenTelemetry in Next.js
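OpenTelemetry (OTel) is the industry standard for trace and metric instrumentation. Next.js has supported it through the instrumentation.ts hook since version 13; the hook sat behind the experimental.instrumentationHook flag until it became stable in Next.js 15.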
```ts
// instrumentation.ts
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    const { NodeSDK } = await import('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = await import('@opentelemetry/exporter-trace-otlp-http');
    const { getNodeAutoInstrumentations } = await import('@opentelemetry/auto-instrumentations-node');
    const { Resource } = await import('@opentelemetry/resources');
    const { SemanticResourceAttributes } = await import('@opentelemetry/semantic-conventions');

    const sdk = new NodeSDK({
      resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'my-nextjs-app',
        [SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version,
        environment: process.env.NODE_ENV,
      }),
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
        headers: {
          Authorization: `Bearer ${process.env.OTEL_AUTH_TOKEN}`,
        },
      }),
      instrumentations: [
        getNodeAutoInstrumentations({
          '@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
        }),
      ],
    });

    sdk.start();
  }
}
```
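getNodeAutoInstrumentations automatically instruments, among others:
- HTTP requests (both incoming and outgoing)
- DNS lookups

Prisma queries are not part of this bundle. They are traced once PrismaInstrumentation from @prisma/instrumentation is added to the instrumentations array. Async work scheduled with setTimeout/setInterval keeps the active trace context, so spans created inside those callbacks stay attached to the originating trace.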
The result: every request automatically generates a trace showing all the work it triggered. No manual span creation required for the common cases.
Custom Spans for Business Logic
The auto-instrumentation covers infrastructure — database, HTTP. For business logic, add custom spans:
```ts
// lib/products.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('products-service');

// getProduct and getRecommendations are assumed to be defined elsewhere in this module.
export async function getProductWithRecommendations(productId: string) {
  return tracer.startActiveSpan('get-product-with-recommendations', async (span) => {
    try {
      span.setAttributes({
        'product.id': productId,
        'cache.strategy': 'use-cache',
      });

      // Both lookups run in parallel inside the same span.
      const [product, recommendations] = await Promise.all([
        getProduct(productId),
        getRecommendations(productId),
      ]);

      span.setAttributes({
        'product.found': !!product,
        'recommendations.count': recommendations.length,
      });

      return { product, recommendations };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end(); // always end the span, even on failure
    }
  });
}
```
Custom spans appear nested inside the auto-instrumented HTTP span in your trace UI. You can see exactly where within a request the business logic executed, how long it took, and whether it threw.
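The try/catch/finally pattern above repeats for every custom span, so it can be factored into a small wrapper. The sketch below is an assumption, not part of OpenTelemetry: `withSpan`, `SpanLike`, and `TracerLike` are hypothetical names, and the two interfaces only mirror the subset of the `@opentelemetry/api` span and tracer methods used in the example above, so that the snippet stands alone. In a real project you would import `Span`, `Tracer`, and `SpanStatusCode` from `@opentelemetry/api` instead.

```typescript
// Minimal structural stand-ins for the @opentelemetry/api types used above.
interface SpanLike {
  setAttributes(attrs: Record<string, string | number | boolean>): unknown;
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

// Runs `fn` inside a span, records any thrown error on the span,
// rethrows it, and always ends the span.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  fn: (span: SpanLike) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn(span);
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: STATUS_ERROR, message: err.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

With a helper like this, a traced function only has to supply the span name and the work itself, and the error and cleanup handling stays identical across all custom spans.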
Sentry for Error Tracking
OpenTelemetry traces tell you about slow requests. Sentry tells you about broken requests — the uncaught exceptions, the unhandled rejections, the React hydration mismatches.
```bash
npx @sentry/wizard@latest -i nextjs
```
The wizard configures Sentry automatically. What it sets up:
- instrumentation.ts with Sentry SDK initialisation
- Error boundary integration for React
- Next.js-specific configuration in next.config.ts
- Source map upload for production
```ts
// instrumentation.ts (generated by wizard)
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    await import('../sentry.server.config');
  }

  if (process.env.NEXT_RUNTIME === 'edge') {
    await import('../sentry.edge.config');
  }
}
```

```tsx
// app/global-error.tsx (generated by wizard)
'use client';

import * as Sentry from '@sentry/nextjs';
import NextError from 'next/error';
import { useEffect } from 'react';

export default function GlobalError({ error }: { error: Error }) {
  useEffect(() => {
    Sentry.captureException(error);
  }, [error]);

  return (
    <html>
      <body>
        <NextError statusCode={0} />
      </body>
    </html>
  );
}
```
The global-error.tsx catches errors that bubble past all error.tsx boundaries — the last resort error handler. Without it, uncaught root-level errors show a blank page.
Metrics and Alerting
The metrics that matter for a Next.js application:
Infrastructure metrics (from your host):
- CPU utilisation
- Memory usage
- Network I/O
- Disk I/O (for self-hosted)

Application metrics (from your monitoring service):
- Request rate (requests/second)
- Error rate (% of requests returning 5xx)
- Response time p50 / p95 / p99
- Cache hit rate (Full Route Cache, Data Cache)
- Cold start rate (serverless)

Business metrics:
- Successful checkouts/minute
- User registrations/hour
- Active connections (WebSocket, if applicable)
The alerting philosophy: alert on symptoms, not causes. "Error rate > 1%" is a symptom — it tells you users are experiencing failures. "Database CPU > 80%" is a cause — useful for investigation but not an emergency on its own. Symptom-based alerting reduces alert fatigue.
A minimal alert set for most applications:
- Error rate (5xx) > 1% for 5 minutes → P1 incident
- p99 response time > 5s for 10 minutes → P2 incident
- Health check endpoint returning non-200 → P1 incident
- Error rate > 0.1% for 30 minutes → P3 (monitor, not wake someone up)
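A rule such as "error rate > 1% for 5 minutes" has two parts: a threshold, and a duration for which the breach must persist. The following sketch shows one way to evaluate that against per-minute request counts. `MinuteBucket` and `shouldFire` are hypothetical names and the traffic numbers are made up; real alerting systems express this in their own rule languages, so the code only illustrates the logic.

```typescript
interface MinuteBucket {
  total: number;  // requests served in that minute
  errors: number; // how many of them returned 5xx
}

// Fires only if every one of the last `forMinutes` buckets breaches the
// threshold, so a single bad minute does not page anyone.
function shouldFire(
  buckets: MinuteBucket[],
  thresholdPct: number,
  forMinutes: number,
): boolean {
  if (buckets.length < forMinutes) return false; // not enough history yet
  return buckets.slice(-forMinutes).every((b) => {
    if (b.total === 0) return false; // no traffic, nothing to breach
    return (b.errors / b.total) * 100 > thresholdPct;
  });
}

// One bad minute (5% errors) surrounded by healthy ones.
const briefSpike: MinuteBucket[] = [
  { total: 1000, errors: 2 },
  { total: 1000, errors: 50 },
  { total: 1000, errors: 3 },
  { total: 1000, errors: 2 },
  { total: 1000, errors: 4 },
];

// Five consecutive minutes above 1% errors.
const sustained: MinuteBucket[] = [
  { total: 1000, errors: 15 },
  { total: 1000, errors: 22 },
  { total: 1000, errors: 18 },
  { total: 1000, errors: 30 },
  { total: 1000, errors: 25 },
];

const firesOnSpike = shouldFire(briefSpike, 1, 5);    // false
const firesOnSustained = shouldFire(sustained, 1, 5); // true
```

Requiring the breach to hold for the whole window is what keeps the P1 rule from firing on a transient blip, while the lower-threshold, longer-window P3 rule catches slow degradation.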
The Runbook Template
A runbook is a document that answers: "what do I do when alert X fires?" Writing runbooks before incidents means the on-call engineer isn't making decisions under pressure for the first time.
```markdown
# Runbook: High Error Rate (5xx)

## Alert

**Trigger:** Error rate > 1% for 5 minutes
**Severity:** P1
**On-call response time:** 15 minutes

## Immediate Triage

1. Check the Sentry dashboard for the most common errors in the last 30 minutes
   - Link: https://sentry.io/organizations/your-org/issues/
   - Look for: new error types, spike in existing errors
2. Check the deployment history
   - Link: https://vercel.com/your-project/deployments
   - Was there a deploy in the last 30 minutes? → likely deployment regression
3. Check the database connection pool
   - Link: https://your-monitoring/databases
   - Active connections > 95%? → see "Connection Pool Exhaustion" runbook

## Common Causes and Fixes

### Deployment Regression

- Rollback: `vercel rollback [previous-deployment-url]`
- Takes ~2 minutes to take effect
- Root cause analysis: compare the new deploy's diff

### Database Connection Pool Exhaustion

- Immediate: scale down non-critical background jobs to free connections
- Check for missing `await` on Prisma queries (creates abandoned connections)
- Increase `connection_limit` in `DATABASE_URL` if headroom exists

### Third-Party API Failure

- Check status page for dependent services (Stripe, SendGrid, etc.)
- Enable circuit breaker if available
- Graceful degradation: return cached data if possible

## Escalation

- 15min without progress → escalate to database team
- 30min without progress → escalate to engineering lead
```
The format matters less than the content. What every runbook needs: the alert trigger, immediate triage steps, common causes with specific fixes, and escalation paths.
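One way to enforce that checklist is a small lint that runs over every runbook file in CI. The sketch below assumes runbooks are markdown files whose second-level headings match the template's section names; `REQUIRED_SECTIONS` and `missingSections` are hypothetical names, not part of any tool.

```typescript
// Section names taken from the runbook template in this module.
const REQUIRED_SECTIONS = [
  'Alert',
  'Immediate Triage',
  'Common Causes and Fixes',
  'Escalation',
];

// Returns the required sections that have no matching "## " heading.
function missingSections(markdown: string): string[] {
  const headings = markdown
    .split('\n')
    .filter((line) => line.startsWith('## '))
    .map((line) => line.slice(3).trim());
  return REQUIRED_SECTIONS.filter((section) => !headings.includes(section));
}

// A draft runbook that forgot its escalation path.
const draftRunbook = [
  '# Runbook: Cache Miss Storm',
  '## Alert',
  '## Immediate Triage',
  '## Common Causes and Fixes',
  '### Cold Cache After Deploy',
].join('\n');

const missing = missingSections(draftRunbook); // ['Escalation']
```

A check like this only verifies that the sections exist. Whether the triage steps are actually usable at 3am still needs a human review or a practice drill.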
Structured Logging
Structured logs (JSON format) are parseable by log aggregation services (Datadog, Grafana Loki, CloudWatch Logs Insights). They're queryable: "show me all logs where userId is 123 and level is error."
```ts
// lib/logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  transport:
    process.env.NODE_ENV === 'development'
      ? { target: 'pino-pretty' } // human-readable in dev
      : undefined, // JSON in production
  base: {
    service: 'nextjs-app',
    version: process.env.npm_package_version,
    environment: process.env.NODE_ENV,
  },
});
```
```ts
// In a Server Action
'use server';

import { logger } from '@/lib/logger';
import { trace } from '@opentelemetry/api';

export async function createOrder(formData: FormData) {
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;

  // `session` and `db` are assumed to come from your auth and database layers (not shown).
  const log = logger.child({
    traceId, // correlate logs to traces
    userId: session.user.id,
    action: 'create-order',
  });

  log.info({ productId: formData.get('productId') }, 'Creating order');

  try {
    const order = await db.orders.create({ ... });
    log.info({ orderId: order.id }, 'Order created successfully');
    return order;
  } catch (error) {
    // Use the `err` key so pino's default error serializer keeps the message and stack.
    log.error({ err: error }, 'Order creation failed');
    throw error;
  }
}
```
The traceId in every log entry is the correlation key — in your observability platform, you can click from a log entry to the trace that generated it. This is the debugging superpower: see the error in Sentry, look up the trace in your tracing service, correlate the logs via traceId, understand exactly what happened.
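To illustrate the correlation step, the sketch below filters newline-delimited JSON logs by traceId, which is roughly what a log platform does when you click through from a trace. The log lines and trace IDs are made up for the example (they follow pino's default shape, where `level: 30` is info and `level: 50` is error), and `logsForTrace` is a hypothetical helper.

```typescript
interface LogEntry {
  level: number;
  msg: string;
  traceId?: string;
  [key: string]: unknown;
}

// Parses NDJSON output and keeps only the entries for one trace.
function logsForTrace(ndjson: string, traceId: string): LogEntry[] {
  return ndjson
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as LogEntry)
    .filter((entry) => entry.traceId === traceId);
}

// Two interleaved requests; only the first one fails.
const sampleLogs = [
  '{"level":30,"traceId":"4bf92f3577b34da6a3ce929d0e0e4736","action":"create-order","msg":"Creating order"}',
  '{"level":30,"traceId":"00f067aa0ba902b7aaaaaaaaaaaaaaaa","action":"create-order","msg":"Creating order"}',
  '{"level":50,"traceId":"4bf92f3577b34da6a3ce929d0e0e4736","action":"create-order","msg":"Order creation failed"}',
].join('\n');

const related = logsForTrace(sampleLogs, '4bf92f3577b34da6a3ce929d0e0e4736');
const messages = related.map((entry) => entry.msg);
// messages: ['Creating order', 'Order creation failed']
```

Without the traceId field, the two "Creating order" lines would be indistinguishable, and there would be no reliable way to tell which request the failure belonged to.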
Congratulations — You've Completed the Course
You've made it through all three phases:
Foundation — the mental model. Routing, rendering, data fetching, the core components. The vocabulary for everything that followed.
Practitioner — the production toolkit. Authentication, database integration, caching, middleware, SEO, testing, deployment. Everything you need to ship a real application.
Architect — the internals. React Flight, request lifecycle, caching mechanics, PPR, streaming SSR, Server Action security, advanced routing, state management, edge compute, build systems, performance engineering, security, infrastructure, observability. The layer that separates engineers who understand Next.js from engineers who just use it.
The last piece of advice: read the Next.js changelog with each release. The framework moves fast. The architectural understanding from this course is what lets you evaluate each new feature and know whether it changes anything for your specific application. The mechanics change; the mental model for reasoning about them doesn't.