Module P-17 · 26 min read

ReadableStream and TransformStream in Route Handlers, the Vercel AI SDK (streamText, useChat, useCompletion), token-by-token streaming to the browser, abort signal propagation for cancelled requests, rate limiting streaming endpoints, streaming error handling constraints, and cost control via token budgets.

P-17 — AI Integration and Streaming Route Handlers

Who this is for: Engineers building AI-powered features in Next.js who need more than a copy-paste tutorial. This module covers the full production picture: why streaming exists, how it works at the HTTP layer, abort signal propagation, token budgeting, error handling mid-stream, and rate limiting. AI streaming is one of the main use cases driving Next.js adoption right now, and the preceding curriculum had zero coverage of it. That changes here.

Why Streaming Matters for AI

Standard JSON responses require the entire LLM response to be generated before sending byte 1 to the client. A 500-token response takes 5–15 seconds depending on the model. That entire time the user stares at a spinner. The spinner is not a UX problem. It's an architectural choice you made by not streaming.

Streaming sends tokens as they are generated. The model starts producing output immediately. Users see the first word in roughly 300ms — the time for a round-trip plus the time for the model to generate the first token, not the time for the full response. The difference is between a product that feels alive and one that feels broken.

The browser mechanism is ReadableStream over HTTP. In HTTP/1.1, this is Transfer-Encoding: chunked — the server sends the response body in variable-length chunks without declaring a Content-Length up front. In HTTP/2, it's DATA frames that the server sends incrementally over a single multiplexed stream. Next.js Route Handlers support both; you get HTTP/2 when deployed to Vercel or any HTTP/2-capable host without any configuration.

The client side is simpler than people expect: fetch() returns a Response object whose body is already a ReadableStream. You can read it chunk by chunk with response.body.getReader(), or you can let the Vercel AI SDK's useChat hook handle all of it.
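To make the manual path concrete, here is a minimal sketch of reading a streamed response with getReader(). The function name readTextStream and its callback are illustrative, not part of any library; the sketch assumes the server sends UTF-8 text.

```typescript
// Read a streamed fetch() Response chunk by chunk, reporting each decoded piece.
// `readTextStream` and `onText` are hypothetical names used for illustration.
async function readTextStream(
  response: Response,
  onText: (text: string) => void,
): Promise<string> {
  if (!response.body) throw new Error('Response has no body to stream')

  const reader = response.body.getReader()
  // One decoder for the whole stream: with { stream: true } it keeps the
  // partial bytes of a multi-byte character until the next chunk arrives.
  const decoder = new TextDecoder()
  let full = ''

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    const text = decoder.decode(value, { stream: true })
    if (text) {
      full += text
      onText(text)
    }
  }

  // Flush anything the decoder was still holding back
  const tail = decoder.decode()
  if (tail) {
    full += tail
    onText(tail)
  }
  return full
}
```

In a component you would call it as readTextStream(await fetch('/api/chat', { method: 'POST', body }), appendToUi), where appendToUi is whatever state update your UI needs.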

The Minimal Streaming Route Handler

Before reaching for the AI SDK, understand what's happening underneath it. This is the primitive:

typescript
// app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages } = await req.json()

  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder()

      // Simulate token-by-token streaming
      const tokens = ['Hello', ' world', '!']
      for (const token of tokens) {
        controller.enqueue(encoder.encode(token))
        await new Promise(r => setTimeout(r, 100))
      }
      controller.close()
    }
  })

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      // No Transfer-Encoding header here: the runtime chooses the framing
      // itself (chunked on HTTP/1.1, DATA frames on HTTP/2)
      'X-Content-Type-Options': 'nosniff',
    }
  })
}

ReadableStream takes a start function that receives a controller. You push chunks into the stream with controller.enqueue() and signal completion with controller.close(). Each enqueued chunk is handed to the runtime as soon as it is queued, so the handler itself adds no buffering; proxies or compression layers in front of it still can.

TextEncoder converts JavaScript strings to Uint8Array, which is what a Response body stream carries. HTTP bodies are bytes, not strings.

The X-Content-Type-Options: nosniff header matters: without it, some browsers hold back the first bytes of a text/plain response while they try to "sniff" the real content type, which delays the first visible tokens.

The Vercel AI SDK — streamText and useChat

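The minimal example above works. In production you'll use the Vercel AI SDK, which handles the protocol details, error frames, tool calls, and the client-side React hooks. The examples in this module use the AI SDK 4.x API (maxTokens, toDataStreamResponse(), usage.promptTokens); AI SDK 5 renamed several of these, so check the migration guide against your installed version.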

Install:

bash
npm install ai @ai-sdk/openai

The route handler:

typescript
// app/api/chat/route.ts
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

export async function POST(req: Request) {
  const { messages } = await req.json()

  const result = streamText({
    model: openai('gpt-4o'),
    system: 'You are a helpful assistant.',
    messages,
    maxTokens: 1024,
  })

  return result.toDataStreamResponse()
}

toDataStreamResponse() returns a Response with the AI SDK's data stream protocol. This is not plain text — it's a structured format where each line is a typed frame:

0:"token" (text delta)
2:[{...}] (custom data)
9:{toolCallId, toolName, args} (tool call)
3:"error message" (error)
d:{finishReason, usage} (stream done, including token usage)
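As a rough illustration of what useChat does with these lines, the sketch below splits a frame into its type code and JSON payload. It only names the text (0), error (3), and finish (d) codes and labels everything else as other; parseFrame and collectText are hypothetical helpers, not AI SDK exports.

```typescript
// Hypothetical helpers illustrating the line format; not part of the AI SDK.
type Frame = { code: string; kind: 'text' | 'error' | 'finish' | 'other'; value: unknown }

function kindOf(code: string): Frame['kind'] {
  if (code === '0') return 'text'
  if (code === '3') return 'error'
  if (code === 'd') return 'finish'
  return 'other'
}

// Parse one line such as `0:"Hello"` into its type code and JSON payload.
function parseFrame(line: string): Frame {
  const sep = line.indexOf(':')
  if (sep === -1) throw new Error(`Malformed frame: ${line}`)
  const code = line.slice(0, sep)
  return { code, kind: kindOf(code), value: JSON.parse(line.slice(sep + 1)) }
}

// Join the text deltas of a complete stream body into the final message.
function collectText(body: string): string {
  return body
    .split('\n')
    .filter((line) => line.length > 0)
    .map(parseFrame)
    .filter((frame) => frame.kind === 'text')
    .map((frame) => String(frame.value))
    .join('')
}
```

Because every payload is JSON, newlines and quotes inside a token arrive escaped, which is what makes the one-frame-per-line framing safe.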

The client component:

typescript
// app/chat/page.tsx
'use client'
import { useChat } from 'ai/react'

export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: '/api/chat',
  })

  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Type a message..."
          disabled={isLoading}
        />
        <button type="submit" disabled={isLoading}>
          {isLoading ? 'Generating...' : 'Send'}
        </button>
      </form>
    </div>
  )
}

useChat manages the message list, streams incoming tokens into the last assistant message, and handles the data stream protocol. isLoading is true from request start until the stream closes.

The messages array maintains the full conversation history. Each call to handleSubmit sends the full history to the server. This is how LLMs maintain context — stateless server, stateful client.

Abort Signal Propagation — The Cancelled Request Problem

This is the expensive mistake nobody warns you about until you get the bill.

When a user closes the chat tab mid-stream, sends a new message before the current one finishes, or navigates away, the browser cancels the in-flight HTTP request and closes the connection. Inside a Route Handler that cancellation surfaces as an AbortSignal: req.signal.

Without abort signal handling, your LLM provider keeps generating tokens you're paying for. The streaming loop in your route handler runs to completion with no client to receive the output. You pay for every token.

The numbers, at the list price used throughout this module: GPT-4o costs $0.015 per 1,000 output tokens. A 2,000-token response cancelled at token 50 but not aborted at the LLM level burns 1,950 tokens, about $0.029 per cancelled request. At 10,000 requests/day with a 20% cancellation rate (a reasonable estimate for impatient users), that's 2,000 wasted requests/day × $0.029 = $58/day = $1,740/month in wasted spend. The figure scales linearly with both traffic and cancellation rate, and none of it buys anything a user ever sees.
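The arithmetic above can be written down as a small function so you can plug in your own traffic. This is a sketch; the type and function names are made up for illustration, and the inputs are the figures from the paragraph above.

```typescript
// Hypothetical helper: estimates spend on tokens generated after the client left.
type CancellationScenario = {
  outputPricePer1K: number   // dollars per 1,000 output tokens
  responseTokens: number     // tokens the model would generate in full
  tokensBeforeCancel: number // tokens already streamed when the user cancelled
  requestsPerDay: number
  cancellationRate: number   // fraction of requests cancelled, 0..1
}

function wastedSpend(s: CancellationScenario) {
  const wastedTokens = s.responseTokens - s.tokensBeforeCancel
  const perRequest = (wastedTokens * s.outputPricePer1K) / 1000
  const perDay = perRequest * s.requestsPerDay * s.cancellationRate
  return { wastedTokens, perRequest, perDay, perMonth: perDay * 30 }
}

const example = wastedSpend({
  outputPricePer1K: 0.015,
  responseTokens: 2000,
  tokensBeforeCancel: 50,
  requestsPerDay: 10_000,
  cancellationRate: 0.2,
})
console.log(example)
```

Without rounding the per-request cost first, the same inputs come to roughly $58.50/day and $1,755/month, slightly above the rounded figures in the text.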

The fix is one line:

typescript
// app/api/chat/route.ts
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

export async function POST(req: Request) {
  const { messages } = await req.json()

  const result = streamText({
    model: openai('gpt-4o'),
    messages,
    abortSignal: req.signal,  // propagate cancellation to the LLM provider
  })

  return result.toDataStreamResponse()
}

req.signal is the AbortSignal that fires when the client disconnects. Passing it to streamText propagates the cancellation downstream to the OpenAI API call. When the signal fires, the provider call is aborted, token generation stops, and you pay only for the tokens generated up to that point.

The AI SDK's streamText accepts abortSignal for exactly this reason. If you're using another SDK or calling the provider directly with fetch, pass signal in the fetch options:

typescript
const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  signal: req.signal,  // same propagation
  headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` },
  body: JSON.stringify({ model: 'gpt-4o', messages, stream: true }),
})

Always propagate req.signal. It is never wrong to do this and it is always expensive not to.
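If you relay tokens yourself in a hand-written loop, the same rule applies inside the loop. The sketch below is one way to do it, assuming your token source is an async iterable; relayUntilAborted is a hypothetical helper name.

```typescript
// Hypothetical helper: forwards chunks from `source` until `signal` aborts.
// Returning from the generator stops pulling from `source`, and for-await
// calls the source's return() so its own cleanup (finally blocks) can run.
async function* relayUntilAborted<T>(
  source: AsyncIterable<T>,
  signal: AbortSignal,
): AsyncGenerator<T> {
  for await (const chunk of source) {
    if (signal.aborted) return // client is gone: stop consuming upstream
    yield chunk
  }
}
```

In a Route Handler you would wrap the provider's stream as relayUntilAborted(providerChunks, req.signal) and enqueue what it yields. This only stops your loop; the provider call still needs the signal passed to it as shown above, otherwise generation continues on their side.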

TransformStream for Token Processing

Sometimes you need to inspect or modify tokens before they reach the client: stripping PII that the model accidentally included in its output, injecting metadata, enforcing content policies, or implementing per-token rate limiting.

TransformStream sits between the LLM output stream and the response stream. Each chunk that arrives from the LLM passes through your transform function before being enqueued to the client:

typescript
// app/api/chat/route.ts
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

const EMAIL = /[\w.-]+@[\w.-]+\.\w{2,}/g

export async function POST(req: Request) {
  const { messages } = await req.json()

  const result = streamText({
    model: openai('gpt-4o'),
    messages,
    abortSignal: req.signal,
  })

  // An email address is usually split across several tokens, so hold back
  // the trailing partial word and only sanitize text up to the last whitespace.
  let pending = ''
  const sanitize = new TransformStream<string, string>({
    transform(chunk, controller) {
      pending += chunk
      const cut = pending.search(/\S*$/) // start of the trailing partial word
      if (cut > 0) controller.enqueue(pending.slice(0, cut).replace(EMAIL, '[email]'))
      pending = pending.slice(cut)
    },
    flush(controller) {
      // Stream ended: sanitize and emit whatever is still buffered
      if (pending) controller.enqueue(pending.replace(EMAIL, '[email]'))
    },
  })

  // textStream yields strings; encode to bytes only at the very end
  const stream = result.textStream
    .pipeThrough(sanitize)
    .pipeThrough(new TextEncoderStream())

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' }
  })
}

pipeThrough connects the ReadableStream to the TransformStream and returns a new ReadableStream representing the transform's output. Because a match can straddle chunk boundaries, the transform buffers the trailing partial word until whitespace arrives; a per-chunk regex would almost never see a whole email address. You can chain multiple transforms: .pipeThrough(sanitize).pipeThrough(rateLimit).pipeThrough(audit).

One caveat: when you return the text stream instead of toDataStreamResponse(), you lose the AI SDK's structured data protocol. The client receives raw text, not data frames. That's appropriate for simple text streaming but incompatible with useChat's tool call handling. Use this approach only when you need complete control and you're writing custom client parsing.

Streaming Error Handling — The 500-After-Stream Problem

HTTP status codes live in response headers. Response headers are sent before the response body. Once you start streaming the body, headers are already committed to the client. You cannot return a 500 status code after streaming begins.

This creates a fundamental problem:

typescript
// WRONG — cannot change status after streaming starts
export async function POST(req: Request) {
  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of llmStream) {
          controller.enqueue(encoder.encode(chunk))
        }
      } catch (e) {
        // TOO LATE: headers already sent, status cannot change.
        // The return value of start() is discarded, so this Response goes nowhere
        return new Response('Error', { status: 500 })
      }
    }
  })
  return new Response(stream) // Headers committed here: status 200
}

The client receives status 200 and starts reading. If the LLM errors halfway through, the client has no way to know from the HTTP status. Some clients will interpret a truncated stream as a network error. Most won't.

The correct approach: encode errors as events in the stream itself.

typescript
const stream = new ReadableStream({
  async start(controller) {
    const encoder = new TextEncoder()
    try {
      for await (const chunk of llmStream) {
        controller.enqueue(encoder.encode(chunk))
      }
    } catch (error) {
      // Send error as a structured event the client can parse
      controller.enqueue(
        encoder.encode(
          `data: ${JSON.stringify({ error: 'Stream interrupted', code: 'STREAM_ERROR' })}\n\n`
        )
      )
    } finally {
      controller.close()
    }
  }
})

The client must be written to parse these error events and handle them — for example by showing an inline error message and a retry button, rather than displaying a partial response as if it were complete.
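A minimal sketch of that client-side check, assuming the server behaves exactly like the snippet above: plain text chunks, optionally followed by one trailing data: {...} event carrying error and code fields. The helper name splitStreamError is hypothetical.

```typescript
type StreamError = { error: string; code: string }

// Hypothetical helper: separates the model text received so far from a
// trailing error event of the form `data: {"error":"...","code":"..."}\n\n`.
function splitStreamError(received: string): { text: string; error: StreamError | null } {
  const marker = received.lastIndexOf('data: {')
  if (marker === -1 || !received.endsWith('\n\n')) {
    return { text: received, error: null }
  }
  try {
    const parsed = JSON.parse(received.slice(marker + 'data: '.length).trim())
    if (typeof parsed.error === 'string' && typeof parsed.code === 'string') {
      return { text: received.slice(0, marker), error: parsed }
    }
  } catch {
    // Not valid JSON, so treat it as ordinary model output
  }
  return { text: received, error: null }
}
```

This works only because the error frame is the last thing in the stream. Model output could itself contain the string "data: {", so a sturdier protocol frames every chunk, not just errors, which is the design the AI SDK's data stream takes.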

The Vercel AI SDK handles this automatically. toDataStreamResponse() uses the data protocol format where errors are sent as 3:"error message" frames. The useChat hook parses these frames and exposes the error through useChat's error state field. You get proper error handling in the UI without writing any of the parsing yourself:

typescript
// Client — AI SDK error handling
const { messages, error, reload } = useChat({ api: '/api/chat' })

if (error) {
  return (
    <div>
      <p>Error: {error.message}</p>
      <button onClick={reload}>Retry</button>
    </div>
  )
}

reload retries the last message. It's the streaming equivalent of reset() in an error boundary.

Rate Limiting Streaming Endpoints

Standard rate limiting middleware counts requests. A streaming chat request stays open for 10–60 seconds depending on response length and model speed. Request-count limiting alone creates two problems:

Request counts say nothing about duration or output size. A request that streams for 60 seconds and one that finishes in 2 seconds both count as one, even though they hold a connection and generate tokens for very different lengths of time.
There's no protection against adversarial prompts that elicit very long responses, burning your token budget while only counting as one request.

The production approach combines request-count limiting with a per-request token cap:

typescript
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '1 m'), // 10 requests per minute per IP
})

export async function POST(req: Request) {
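  // Rate limit by IP (use user ID if authenticated).
  // x-forwarded-for can hold a comma-separated proxy chain; the first entry is the client.
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() || 'anonymous'
  const { success, limit, remaining } = await ratelimit.limit(ip)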

  if (!success) {
    return new Response('Too many requests', {
      status: 429,
      headers: {
        'X-RateLimit-Limit': limit.toString(),
        'X-RateLimit-Remaining': remaining.toString(),
        'Retry-After': '60',
      }
    })
  }

  const { messages } = await req.json()

  // Per-request token cap — prevents runaway costs regardless of rate limiting
  const result = streamText({
    model: openai('gpt-4o'),
    messages,
    maxTokens: 500,          // hard output limit per request
    abortSignal: req.signal,
  })

  return result.toDataStreamResponse()
}

For authenticated applications, rate limit by user ID instead of IP. IP-based limiting is unfair to legitimate users who share an address (office NAT, mobile carriers) and easy to evade for adversaries who rotate IPs. Per-user limits avoid both problems.

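For high-scale applications, consider adding a token-bucket rate limiter on top that tracks token usage rather than request count. Upstash's Ratelimit.tokenBucket algorithm is a natural starting point; check that your version lets each call deduct a variable cost before relying on it for token accounting.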
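To show the idea independently of any particular service, here is an in-memory sketch of a token bucket that charges by LLM tokens instead of by request. The class name TokenBudget and its parameters are illustrative; a real deployment would keep the bucket state in Redis or similar so it survives across serverless instances.

```typescript
// Illustrative in-memory token bucket keyed by user; not production storage.
class TokenBudget {
  private buckets = new Map<string, { tokens: number; updatedAt: number }>()
  private capacity: number
  private refillPerSecond: number
  private now: () => number

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity
    this.refillPerSecond = refillPerSecond
    this.now = now
  }

  // Returns true and deducts `cost` tokens if the user's bucket can afford it.
  tryConsume(key: string, cost: number): boolean {
    const t = this.now()
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, updatedAt: t }

    // Refill in proportion to elapsed time, never beyond capacity
    const elapsedSeconds = (t - bucket.updatedAt) / 1000
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond)
    bucket.updatedAt = t

    const allowed = cost <= bucket.tokens
    if (allowed) bucket.tokens -= cost
    this.buckets.set(key, bucket)
    return allowed
  }
}
```

One way to use it: before calling streamText, reserve the request's maxTokens with tryConsume(userId, maxTokens) and reject with 429 when it returns false. Reserving the cap up front is conservative; you could refund the unused part in onFinish once actual usage is known.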

Server-Sent Events vs Streaming JSON

There are two distinct streaming patterns in production Next.js apps. They solve different problems.

AI SDK Data Protocol (Streaming JSON)

What the AI SDK uses. Line-oriented text with JSON payloads. Supports structured event types: text deltas, tool calls, custom data, errors, usage stats. Each frame is a line starting with a type code (0:, 2:, 3:, 9:, d:). The client side uses useChat or useCompletion hooks to parse these frames.

Use this when: building chat interfaces, AI completions, or any LLM-powered feature where you want tool calls, structured metadata, and the full AI SDK feature set.

Server-Sent Events (SSE)

The Web standard for server-to-client streaming. Text-based. Browser's native EventSource API. No library required on the client. Supports named event types (event: progress) and message IDs for automatic reconnection (id: 42). The format is simple: data: <content>\n\n for each event.

Use this when: streaming non-AI content to the client — build logs, export progress, live score updates, background job status. Anything where you need the browser to receive incremental updates without the complexity of the AI SDK.

typescript
// app/api/export/route.ts — SSE for non-AI streaming
export async function GET(req: Request) {
  const encoder = new TextEncoder()

  const stream = new ReadableStream({
    async start(controller) {
      // SSE format: "data: <content>\n\n"
      controller.enqueue(encoder.encode('data: {"status":"started"}\n\n'))

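      for (let i = 0; i <= 100; i += 10) {
        if (req.signal.aborted) return // client went away: stop doing work
        await processChunk(i) // processChunk stands in for your own unit of export work
        controller.enqueue(encoder.encode(`data: {"progress":${i}}\n\n`))
      }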

      controller.enqueue(encoder.encode('data: {"status":"complete"}\n\n'))
      controller.close()
    }
  })

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      // No Connection header: it is managed by the server and not allowed on HTTP/2
    }
  })
}

Client consumption without any library:

typescript
// Any Client Component
useEffect(() => {
  const source = new EventSource('/api/export')

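  source.onmessage = (event) => {
    const data = JSON.parse(event.data)
    if (data.status === 'complete') source.close()
    // The initial "started" event carries no progress field, so guard for it
    else if (typeof data.progress === 'number') setProgress(data.progress)
  }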

  source.onerror = () => source.close()

  return () => source.close()
}, [])

EventSource automatically reconnects on network interruption, unless you close it in onerror as the snippet above does. If you add id: <n> to your SSE frames, the browser sends the last ID it received in the Last-Event-ID header on reconnection so you can resume from where you left off.
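A small sketch of what resumable frames could look like on the server. formatSSE and nextIndexAfter are hypothetical helpers; the field names (id, event, data) and the Last-Event-ID header come from the SSE format itself.

```typescript
type SSEMessage = { data: unknown; event?: string; id?: string | number }

// Serialize one SSE frame. JSON.stringify never emits raw newlines,
// so a single `data:` line is always enough for a JSON payload.
function formatSSE({ data, event, id }: SSEMessage): string {
  const lines: string[] = []
  if (id !== undefined) lines.push(`id: ${id}`)
  if (event) lines.push(`event: ${event}`)
  lines.push(`data: ${JSON.stringify(data)}`)
  return lines.join('\n') + '\n\n'
}

// Decide where to resume from the Last-Event-ID request header, e.g.
// nextIndexAfter(req.headers.get('last-event-id')).
function nextIndexAfter(lastEventId: string | null): number {
  const n = Number.parseInt(lastEventId ?? '', 10)
  return Number.isNaN(n) ? 0 : n + 1
}
```

Note that frames with a custom event name are not delivered to onmessage; the client has to subscribe with source.addEventListener('progress', handler) for them.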

Key difference between the two: EventSource is GET-only. You cannot send a POST body with EventSource. For AI chat, where you need to send the message history in the request body, use fetch with POST and handle the ReadableStream yourself or use useChat. For unidirectional server-to-client pushes, EventSource is simpler.

Cost Control and Token Budgets

LLM cost is proportional to tokens. Streaming doesn't change the cost — it changes when you perceive that you're paying. The bill at the end of the month is identical whether you stream or not. Streaming changes user experience, not cost structure.

The places to control cost:

maxTokens — hard output limit. If the model hasn't finished by token N, it stops. This is the most important lever. Set it based on what your feature actually needs, not what feels generous.

System prompt length — every request includes your system prompt in the prompt tokens. A 500-token system prompt at 10K requests/day is 5M prompt tokens/day. Write tight system prompts.

Conversation history truncation: useChat sends the full conversation history on every turn, so by turn 20 each request re-sends all 19 earlier exchanges as prompt tokens, and cumulative prompt cost grows roughly quadratically with conversation length. Truncate old turns before sending:

typescript
const result = streamText({
  model: openai('gpt-4o'),
  messages: messages.slice(-10), // only the last 10 messages (about 5 turns)
  maxTokens: 1024,
  abortSignal: req.signal,
  experimental_telemetry: {
    isEnabled: true,
    functionId: 'chat',  // group by feature for cost attribution in your dashboard
  },
  onFinish: async ({ usage }) => {
    // Log actual usage for cost monitoring and anomaly detection
    await db.usageLog.create({
      data: {
        userId: session.user.id,
        promptTokens: usage.promptTokens,
        completionTokens: usage.completionTokens,
        model: 'gpt-4o',
        // Cost calculation: (prompt_tokens × price_in + completion_tokens × price_out) / 1000
        cost: (usage.promptTokens * 0.005 + usage.completionTokens * 0.015) / 1000,
      }
    })
  }
})
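A fixed message count ignores how long each message is. A hedged alternative is to truncate against an approximate token budget instead; the sketch below assumes roughly 4 characters per token, which is a crude heuristic for English text and not a real tokenizer. The names ChatMessage, estimateTokens, and truncateToBudget are illustrative.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Assumption: ~4 characters per token. Swap in a real tokenizer for accuracy.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// Keep the most recent messages whose estimated size fits the budget.
// The newest message is always kept, even if it alone exceeds the budget.
function truncateToBudget(messages: ChatMessage[], maxPromptTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = []
  let used = 0
  for (const message of [...messages].reverse()) {
    const cost = estimateTokens(message.content)
    if (kept.length > 0 && used + cost > maxPromptTokens) break
    kept.unshift(message) // restore chronological order
    used += cost
  }
  return kept
}
```

You would then pass truncateToBudget(messages, 2000) to streamText in place of messages.slice(-10); the 2000 figure is only an example budget.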

Model Cost Comparison

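For a typical chat feature: 500 prompt tokens, 300 completion tokens per request. Prices are the list prices assumed throughout this module; check each provider's current pricing page before budgeting.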

Model	Input $/1K	Output $/1K	Cost/request	At 10K req/day	At 100K req/day	At 1M req/day
GPT-4o	$0.005	$0.015	$0.007	$70/day	$700/day	$7,000/day
GPT-4o mini	$0.00015	$0.0006	$0.00026	$2.55/day	$25.50/day	$255/day
Claude 3.5 Haiku	$0.0008	$0.004	$0.0016	$16/day	$160/day	$1,600/day
Claude 3.5 Sonnet	$0.003	$0.015	$0.006	$60/day	$600/day	$6,000/day

The 27x cost difference between GPT-4o and GPT-4o mini is real. For most conversational chat features, GPT-4o mini is often good enough that users will not notice the difference. Use the larger models for reasoning-heavy tasks (code generation, complex analysis) and the smaller, faster, cheaper models for simple conversation and Q&A.
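The table values follow from one formula, sketched here so you can re-run it with current prices. The prices in the example are the ones listed in the table above; the function and type names are illustrative.

```typescript
type Pricing = { inputPer1K: number; outputPer1K: number }

// Cost of one request in dollars, given per-1K-token prices.
function costPerRequest(p: Pricing, promptTokens: number, completionTokens: number): number {
  return (promptTokens * p.inputPer1K + completionTokens * p.outputPer1K) / 1000
}

const gpt4o = costPerRequest({ inputPer1K: 0.005, outputPer1K: 0.015 }, 500, 300)
const gpt4oMini = costPerRequest({ inputPer1K: 0.00015, outputPer1K: 0.0006 }, 500, 300)

console.log(gpt4o.toFixed(4))                 // cost per request, GPT-4o
console.log(gpt4oMini.toFixed(6))             // cost per request, GPT-4o mini
console.log((gpt4o / gpt4oMini).toFixed(1))   // cost ratio between the two
```

Multiply the per-request figure by your daily request count to get the daily columns.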

Route different request types to different models:

typescript
function selectModel(intent: 'chat' | 'code' | 'analysis') {
  switch (intent) {
    case 'chat': return openai('gpt-4o-mini')    // cheap, fast
    case 'code': return openai('gpt-4o')          // worth the cost
    case 'analysis': return openai('gpt-4o')      // ditto
  }
}

Putting It Together — A Production Chat Route

Here is a single route handler that applies everything from this module: abort signal propagation, rate limiting, token capping, usage logging, and the correct response format:

typescript
// app/api/chat/route.ts
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'
import { auth } from '@/lib/auth'
import { db } from '@/lib/db'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, '1 m'),
})

export async function POST(req: Request) {
  // Auth check — streaming endpoints are expensive, don't leave them open
  const session = await auth()
  if (!session?.user) {
    return new Response('Unauthorized', { status: 401 })
  }

  // Rate limit by user ID
  const { success } = await ratelimit.limit(session.user.id)
  if (!success) {
    return new Response('Rate limit exceeded', {
      status: 429,
      headers: { 'Retry-After': '60' }
    })
  }

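  // Validate the body before spending tokens on it
  const body = await req.json().catch(() => null)
  if (!Array.isArray(body?.messages)) {
    return new Response('Invalid request body', { status: 400 })
  }
  const { messages } = body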

  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: 'You are a helpful assistant. Be concise.',
    messages: messages.slice(-10),  // truncate history to the last 10 messages
    maxTokens: 500,
    abortSignal: req.signal,        // propagate cancellation
    onFinish: async ({ usage }) => {
      await db.usageLog.create({
        data: {
          userId: session.user.id,
          promptTokens: usage.promptTokens,
          completionTokens: usage.completionTokens,
          cost: (usage.promptTokens * 0.00015 + usage.completionTokens * 0.0006) / 1000,
        }
      })
    }
  })

  return result.toDataStreamResponse()
}

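This is not a toy example. Each piece of it addresses a failure mode that shows up under production traffic: unauthenticated access, request floods, runaway output, abandoned streams, and unattributed spend.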

What We Did Not Cover

Tool calls and function calling — streamText supports tools and toolChoice. The AI SDK streams tool call events through the data protocol. This is its own module when the curriculum covers AI agents.

Multi-modal inputs — image attachments to GPT-4o via the messages format. Works with the same streaming architecture.

Streaming to React Server Components — the streamUI function from ai/rsc. A different pattern using React Suspense rather than useChat. Covered in the RSC-advanced module.

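Provider failover: switching from OpenAI to Anthropic when the primary provider has an outage. Note that streamText returns immediately and reports provider errors through the stream rather than by throwing, so a plain try/catch around the call will not see an outage. The fallback has to be triggered from the stream's error path, before any tokens have been forwarded to the client.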

Where We Go From Here

P-18 covers deployment and edge runtime configuration — where streaming route handlers run (Node.js vs Edge runtime), the constraints of each, and how streaming interacts with Vercel's Edge Network and serverless cold starts.
