Module A-15 · 28 min read

OpenTelemetry via instrumentation.ts, instrumentation-client.js for browser SDK boot, custom spans across server/edge/client, Sentry integration, useReportWebVitals for CWV shipping, and four production runbooks: TTFB regression, cache miss storm, hydration error, memory leak during rolling deploy.

A-15 — Production Observability and Runbooks

Who this is for: Architects responsible for a Next.js application in production — the ones who get paged at 3am. This module is about building the observability infrastructure that turns "the site is down" into "the database connection pool exhausted at 03:14 UTC, triggered by a deployment that removed the connection limit from the Prisma config, here's the fix." That level of precision comes from traces, metrics, logs, and runbooks built before the incident, not during it.

The Three Pillars of Observability

Observability is the ability to understand a system's internal state from its external outputs. The three pillars:

Traces — the journey of a single request through your system. A trace for a product page request shows: Middleware execution (2ms), Server Component render (8ms), database query for product (45ms), database query for reviews (120ms), response sent. Traces answer "why was this specific request slow?"

Metrics — aggregated measurements over time. Request rate, error rate, p50/p95/p99 response times, cache hit rate, database connection pool size. Metrics answer "is the system healthy overall, and are things getting worse?"

Logs — discrete events. "User 123 purchased product 456." "Cache miss for key products:page:1." "Database query failed: connection timeout." Logs answer "what happened?"

None of these is sufficient alone. A slow request shows up in metrics (rising p99), is diagnosed via traces (database query taking 2s), and confirmed by logs (connection pool exhausted). The triad is the diagnostic workflow.
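To make the p50/p95/p99 figures from the Metrics pillar concrete, here is a minimal sketch of a percentile computed from raw response times with the nearest-rank method. The function name `percentile` and the sample data are illustrative assumptions; a real metrics backend usually aggregates histograms rather than sorting raw samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical response times in milliseconds: nine fast requests, one slow.
const responseTimesMs = [42, 45, 47, 48, 50, 51, 53, 55, 60, 2000];

const p50 = percentile(responseTimesMs, 50); // 50
const p95 = percentile(responseTimesMs, 95); // 2000
const p99 = percentile(responseTimesMs, 99); // 2000
```

The median stays at 50 ms while the tail percentiles expose the single 2000 ms request, which is why a rising p99 is often the first visible sign of the slow request described above.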

OpenTelemetry in Next.js

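OpenTelemetry (OTel) is the industry standard for trace and metric instrumentation. Next.js supports it through the register() hook in instrumentation.ts. The hook was experimental in Next.js 13 and 14 (enabled with the experimental.instrumentationHook flag) and is stable from Next.js 15.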

ts
// instrumentation.ts
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    const { NodeSDK } = await import('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = await import('@opentelemetry/exporter-trace-otlp-http');
    const { getNodeAutoInstrumentations } = await import('@opentelemetry/auto-instrumentations-node');
    const { Resource } = await import('@opentelemetry/resources');
    const { SemanticResourceAttributes } = await import('@opentelemetry/semantic-conventions');

    const sdk = new NodeSDK({
      resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'my-nextjs-app',
        [SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version,
        environment: process.env.NODE_ENV,
      }),
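      traceExporter: new OTLPTraceExporter({
        // A url passed in code is used as-is, so the traces path is appended here
        // (assumes OTEL_EXPORTER_OTLP_ENDPOINT holds the collector's base URL).
        url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`,
        headers: {
          Authorization: `Bearer ${process.env.OTEL_AUTH_TOKEN}`,
        },
      }),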
      instrumentations: [
        getNodeAutoInstrumentations({
          '@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
        }),
      ],
    });

    sdk.start();
  }
}

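getNodeAutoInstrumentations instruments, among others:

HTTP requests (both incoming and outgoing)
DNS lookups
Common database and cache clients such as pg, mysql2, mongodb, and ioredis

Prisma is not part of that bundle. To trace Prisma queries, add PrismaInstrumentation from @prisma/instrumentation to the instrumentations array. Timers (setTimeout / setInterval) are not instrumented either; the SDK relies on Node's async context tracking to keep spans attached to the right request across await boundaries.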

The result: every request automatically generates a trace showing all the work it triggered. No manual span creation required for the common cases.

Custom Spans for Business Logic

The auto-instrumentation covers infrastructure — database, HTTP. For business logic, add custom spans:

ts
// lib/products.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('products-service');

export async function getProductWithRecommendations(productId: string) {
  return tracer.startActiveSpan('get-product-with-recommendations', async (span) => {
    try {
      span.setAttributes({
        'product.id': productId,
        'cache.strategy': 'use-cache',
      });
      
      const [product, recommendations] = await Promise.all([
        getProduct(productId),
        getRecommendations(productId),
      ]);
      
      span.setAttributes({
        'product.found': !!product,
        'recommendations.count': recommendations.length,
      });
      
      return { product, recommendations };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

Custom spans appear nested inside the auto-instrumented HTTP span in your trace UI. You can see exactly where within a request the business logic executed, how long it took, and whether it threw.
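The try/catch/finally pattern above tends to be repeated in every traced function, so it can be factored into a small helper. The sketch below is one possible shape, not part of the OpenTelemetry API: `withSpan`, `SpanLike`, and `TracerLike` are hypothetical names, and the two interfaces are local stand-ins so the example type-checks without the package installed. In a real project you would import Span, Tracer, and SpanStatusCode from @opentelemetry/api instead.

```typescript
// Minimal structural stand-ins for Span and Tracer from @opentelemetry/api.
type AttributeValue = string | number | boolean;

interface SpanLike {
  setAttributes(attributes: Record<string, AttributeValue>): unknown;
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const SPAN_STATUS_ERROR = 2;

// Hypothetical helper: runs `work` inside a span, records any failure on
// the span, and always ends the span.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  attributes: Record<string, AttributeValue>,
  work: (span: SpanLike) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    span.setAttributes(attributes);
    try {
      return await work(span);
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: SPAN_STATUS_ERROR, message: err.message });
      throw error; // rethrow so callers still see the failure
    } finally {
      span.end();
    }
  });
}
```

The design choice here is that the helper never swallows the error: the span records it, and the caller's own error handling (or an error boundary) still runs.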

Sentry for Error Tracking

OpenTelemetry traces tell you about slow requests. Sentry tells you about broken requests — the uncaught exceptions, the unhandled rejections, the React hydration mismatches.

bash
npx @sentry/wizard@latest -i nextjs

The wizard configures Sentry automatically. What it sets up:

instrumentation.ts with Sentry SDK initialisation
Error boundary integration for React
Next.js specific configuration in next.config.ts
Source map upload for production

ts
// src/instrumentation.ts (generated by wizard; use './sentry.*.config' paths if the file sits at the project root)
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    await import('../sentry.server.config');
  }
  if (process.env.NEXT_RUNTIME === 'edge') {
    await import('../sentry.edge.config');
  }
}

tsx
// app/global-error.tsx (generated by wizard)
'use client';

import * as Sentry from '@sentry/nextjs';
import NextError from 'next/error';
import { useEffect } from 'react';

export default function GlobalError({ error }: { error: Error }) {
  useEffect(() => {
    Sentry.captureException(error);
  }, [error]);
  
  return (
    <html>
      <body>
        <NextError statusCode={0} />
      </body>
    </html>
  );
}

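global-error.tsx catches errors that escape every error.tsx boundary, including errors thrown in the root layout, which is why it renders its own <html> and <body> tags. It is the last-resort error handler: without it, a root-level failure falls through to the framework's generic error screen and may never be reported to Sentry.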

Metrics and Alerting

The metrics that matter for a Next.js application:

Infrastructure metrics (from your host):
  CPU utilisation
  Memory usage
  Network I/O
  Disk I/O (for self-hosted)

Application metrics (from your monitoring service):
  Request rate (requests/second)
  Error rate (% of requests returning 5xx)
  Response time p50 / p95 / p99
  Cache hit rate (Full Route Cache, Data Cache)
  Cold start rate (serverless)

Business metrics:
  Successful checkouts/minute
  User registrations/hour
  Active connections (WebSocket, if applicable)

The alerting philosophy: alert on symptoms, not causes. "Error rate > 1%" is a symptom — it tells you users are experiencing failures. "Database CPU > 80%" is a cause — useful for investigation but not an emergency on its own. Symptom-based alerting reduces alert fatigue.

A minimal alert set for most applications:

Error rate (5xx) > 1% for 5 minutes → P1 incident
p99 response time > 5s for 10 minutes → P2 incident
Health check endpoint returning non-200 → P1 incident
Error rate > 0.1% for 30 minutes → P3 (monitor during working hours; do not page anyone)
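The "threshold for a duration" shape of these rules can be sketched as a pure function over per-minute request counts. `MinuteSample`, `errorRateAlertFiring`, and the one-minute bucket size are assumptions for illustration; in practice your monitoring service evaluates the rule for you.

```typescript
// One bucket of request counts per minute (hypothetical shape).
interface MinuteSample {
  total: number;        // all requests served in that minute
  serverErrors: number; // requests that returned a 5xx status
}

// Fires only when every one of the last `windowMinutes` buckets is above
// the threshold, so a single bad minute does not page anyone.
function errorRateAlertFiring(
  samples: MinuteSample[],
  thresholdPercent: number,
  windowMinutes: number,
): boolean {
  if (samples.length < windowMinutes) return false; // not enough history yet
  const recent = samples.slice(-windowMinutes);
  return recent.every(
    (s) => s.total > 0 && (s.serverErrors / s.total) * 100 > thresholdPercent,
  );
}

const healthy: MinuteSample = { total: 1000, serverErrors: 2 };  // 0.2%
const failing: MinuteSample = { total: 1000, serverErrors: 30 }; // 3%

// Two bad minutes: below the 5-minute duration, so the alert stays quiet.
const briefSpike = [healthy, healthy, healthy, failing, failing];
// Five consecutive bad minutes: the P1 rule above would fire.
const sustained = [healthy, failing, failing, failing, failing, failing];

const briefFires = errorRateAlertFiring(briefSpike, 1, 5);    // false
const sustainedFires = errorRateAlertFiring(sustained, 1, 5); // true
```

Requiring the whole window to breach the threshold is what keeps a symptom-based alert from firing on momentary blips.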

The Runbook Template

A runbook is a document that answers: "what do I do when alert X fires?" Writing runbooks before incidents means the on-call engineer isn't making decisions under pressure for the first time.

markdown
# Runbook: High Error Rate (5xx)

## Alert
**Trigger:** Error rate > 1% for 5 minutes  
**Severity:** P1  
**On-call response time:** 15 minutes

## Immediate Triage

1. Check the Sentry dashboard for the most common errors in the last 30 minutes
   - Link: https://sentry.io/organizations/your-org/issues/
   - Look for: new error types, spike in existing errors

2. Check the deployment history
   - Link: https://vercel.com/your-project/deployments
   - Was there a deploy in the last 30 minutes? → likely deployment regression

3. Check the database connection pool
   - Link: https://your-monitoring/databases
   - Active connections > 95%? → see "Connection Pool Exhaustion" runbook

## Common Causes and Fixes

### Deployment Regression
- Rollback: `vercel rollback [previous-deployment-url]`
- Takes ~2 minutes to take effect
- Root cause analysis: compare the new deploy's diff

### Database Connection Pool Exhaustion
- Immediate: scale down non-critical background jobs to free connections
- Check for multiple `PrismaClient` instances (each one opens its own pool) and for long-running interactive transactions that hold connections
- Increase `connection_limit` in `DATABASE_URL` if headroom exists

### Third-Party API Failure
- Check status page for dependent services (Stripe, SendGrid, etc.)
- Enable circuit breaker if available
- Graceful degradation: return cached data if possible

## Escalation
- 15min without progress → escalate to database team
- 30min without progress → escalate to engineering lead

The format matters less than the content. What every runbook needs: the alert trigger, immediate triage steps, common causes with specific fixes, and escalation paths.

Structured Logging

Structured logs (JSON format) are parseable by log aggregation services (Datadog, Grafana Loki, CloudWatch Logs Insights). They're queryable: "show me all logs where userId is 123 and level is error."

ts
// lib/logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' } // human-readable in dev
    : undefined, // JSON in production
  base: {
    service: 'nextjs-app',
    version: process.env.npm_package_version,
    environment: process.env.NODE_ENV,
  },
});

ts
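// In a Server Action
'use server';

import { logger } from '@/lib/logger';
import { trace } from '@opentelemetry/api';
import { auth } from '@/lib/auth'; // assumption: your auth helper that returns the session
import { db } from '@/lib/db';     // assumption: your database client

export async function createOrder(formData: FormData) {
  const session = await auth();
  if (!session?.user) throw new Error('Unauthorized');

  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;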
  
  const log = logger.child({
    traceId,   // correlate logs to traces
    userId: session.user.id,
    action: 'create-order',
  });
  
  log.info({ productId: formData.get('productId') }, 'Creating order');
  
  try {
    const order = await db.orders.create({ ... });
    log.info({ orderId: order.id }, 'Order created successfully');
    return order;
  } catch (error) {
    log.error({ err: error }, 'Order creation failed'); // pino's default error serializer applies to the `err` key
    throw error;
  }
}

The traceId in every log entry is the correlation key — in your observability platform, you can click from a log entry to the trace that generated it. This is the debugging superpower: see the error in Sentry, look up the trace in your tracing service, correlate the logs via traceId, understand exactly what happened.
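As a rough illustration of that correlation step, the sketch below filters newline-delimited JSON logs by traceId, the way a log aggregation query would. The sample log lines and the names `parseLogLines` and `logsForTrace` are made up for the example; the numeric levels follow pino's defaults (30 for info, 50 for error).

```typescript
interface LogEntry {
  level: number;
  msg: string;
  traceId?: string;
  [key: string]: unknown; // any additional structured fields
}

// Parses newline-delimited JSON, skipping blank lines.
function parseLogLines(ndjson: string): LogEntry[] {
  return ndjson
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as LogEntry);
}

// Returns the entries that belong to one trace, optionally at or above a level.
function logsForTrace(entries: LogEntry[], traceId: string, minLevel = 0): LogEntry[] {
  return entries.filter((e) => e.traceId === traceId && e.level >= minLevel);
}

// Hypothetical log output from two concurrent requests.
const sampleLogs = [
  '{"level":30,"traceId":"abc123","userId":"123","msg":"Creating order"}',
  '{"level":30,"traceId":"def456","userId":"999","msg":"Creating order"}',
  '{"level":50,"traceId":"abc123","userId":"123","msg":"Order creation failed"}',
].join('\n');

const entries = parseLogLines(sampleLogs);
const wholeRequest = logsForTrace(entries, 'abc123');   // both entries for that request
const errorsOnly = logsForTrace(entries, 'abc123', 50); // only the error entry
```

Without the shared traceId, the two "Creating order" lines would be indistinguishable once requests interleave.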

Congratulations — You've Completed the Course

You've made it through all three phases:

Foundation — the mental model. Routing, rendering, data fetching, the core components. The vocabulary for everything that followed.

Practitioner — the production toolkit. Authentication, database integration, caching, middleware, SEO, testing, deployment. Everything you need to ship a real application.

Architect — the internals. React Flight, request lifecycle, caching mechanics, PPR, streaming SSR, Server Action security, advanced routing, state management, edge compute, build systems, performance engineering, security, infrastructure, observability. The layer that separates engineers who understand Next.js from engineers who just use it.

The last piece of advice: read the Next.js changelog with each release. The framework moves fast. The architectural understanding from this course is what lets you evaluate each new feature and know whether it changes anything for your specific application. The mechanics change; the mental model for reasoning about them doesn't.
