How BullMQ maps job lifecycle to Sorted Sets, Lists, and Hashes. Worker polling, delayed job scheduling, stalled job detection via heartbeat, the rate limiter internals, and choosing BullMQ vs raw Streams.
P-6 — BullMQ Internals: The Redis Data Structures Behind the Job Queue
Who this module is for: You use BullMQ (or Bull) for job queues and have run into issues — jobs that get stuck, queues that slow down under load, stalled job detection that is too aggressive or not aggressive enough. This module explains the Redis data structures BullMQ uses for every queue state, so you can reason about its behaviour, tune it correctly, and debug it at the Redis level.
Why Understanding BullMQ Internals Matters
BullMQ is a job queue built on Redis. Most engineers treat it as a black box — they add jobs with queue.add() and process them in a worker.process() function. But when queues misbehave (jobs stay in "active" forever, delayed jobs fire late, rate limits fail), you cannot diagnose or fix the problem without understanding the Redis layer.
Every BullMQ behaviour maps to specific Redis operations. Knowing this lets you:
- Query queue state directly with
redis-cliwithout going through BullMQ's API - Understand why a job is "stuck" and fix it
- Tune TTL, stall checks, and rate limiter settings appropriately
- Identify Redis memory usage caused by large queues
The Key Schema
BullMQ uses a namespaced key prefix. For a queue named emails:
bull:emails:id → String: auto-incrementing job ID counter
bull:emails:wait → List: jobs waiting to be picked up (FIFO)
bull:emails:active → List: jobs currently being processed
bull:emails:completed → Sorted Set: completed jobs (score = completion timestamp)
bull:emails:failed → Sorted Set: failed jobs (score = failure timestamp)
bull:emails:delayed → Sorted Set: delayed jobs (score = run-at timestamp)
bull:emails:prioritized → Sorted Set: priority jobs (score = priority × time)
bull:emails:paused → List: queue is paused, jobs go here instead of wait
bull:emails:meta → Hash: queue metadata (paused, maxLen, etc.)
bull:emails:{jobId} → Hash: job data (id, data, opts, timestamp, etc.)
bull:emails:events → Stream: BullMQ events (completed, failed, stalled, etc.)
bull:emails:rate-limiter → String or Hash: rate limiter state
bull:emails:stalled-check:{lockKey} → key used for stall detection
Job Lifecycle in Redis
Adding a Job (queue.add)
javascriptawait emailQueue.add('send-welcome', { userId: '1001', email: 'j@example.com' });
What happens in Redis:
INCR bull:emails:id→ generates job ID, e.g.,42HSET bull:emails:42with all job fields:id: "42"name: "send-welcome"data: '{"userId":"1001","email":"j@example.com"}'opts: '{"attempts":1,"delay":0,...}'timestamp: "1717000000000"delay: "0"priority: "0"
RPUSH bull:emails:wait 42→ add job ID to the wait listXADD bull:emails:events * event added jobId 42→ emit event to the events stream
The job data (step 2) is stored in a Hash for O(1) field access. The queue lists and sorted sets store only the job ID — the actual data is always in the Hash.
Adding a Delayed Job
javascriptawait emailQueue.add('send-followup', { userId: '1001' }, { delay: 3600000 }); // 1 hour
Instead of RPUSH bull:emails:wait, BullMQ uses:
ZADD bull:emails:delayed {runAt_timestamp_ms} {jobId}
A scheduler process (the QueueScheduler in Bull v3, built into BullMQ workers) polls the delayed sorted set with:
ZRANGEBYSCORE bull:emails:delayed 0 {now_ms} COUNT 100
When jobs become ready (their score ≤ current timestamp), the scheduler moves them to bull:emails:wait via LPUSH and ZREM.
Adding a Priority Job
javascriptawait emailQueue.add('vip-email', { userId: '99' }, { priority: 1 }); // lower = higher priority
ZADD bull:emails:prioritized {priority_score} {jobId}
Workers preferentially consume from prioritized before wait.
Processing a Job (worker picks up)
The worker calls:
LMOVE bull:emails:wait bull:emails:active RIGHT LEFT
This atomically moves the job ID from the tail of wait to the head of active. If no jobs are waiting, the worker calls:
BLMOVE bull:emails:wait bull:emails:active RIGHT LEFT 5
Blocking for up to 5 seconds. When a job arrives, the BLMOVE completes and the job ID is in active.
The worker then reads the job data:
HGETALL bull:emails:{jobId}
And acquires a "lock" on the job:
SET bull:emails:{jobId}:lock {worker_token} PX 30000 NX
This lock prevents another worker from claiming the same job. The lock expires in 30 seconds (configurable with lockDuration).
Job Completion
javascript// Worker signals success await job.moveToCompleted('email sent', workerToken);
BullMQ executes a Lua script that atomically:
- Verifies the worker still holds the lock (
GET bull:emails:{jobId}:lock) LREM bull:emails:active 0 {jobId}— removes from active listZADD bull:emails:completed {timestamp} {jobId}— adds to completed set- Optionally trims completed set if
removeOnCompleteis configured DEL bull:emails:{jobId}:lock— releases the lockXADD bull:emails:events * event completed jobId {jobId}— emits event
Job Failure
javascript// Worker signals failure (after all retries exhausted) await job.moveToFailed(error, workerToken);
Similar Lua script:
- Verify lock
LREM bull:emails:active 0 {jobId}- If retries remain:
RPUSH bull:emails:wait {jobId}(or with backoff delay:ZADD bull:emails:delayed ...) - If no retries remain:
ZADD bull:emails:failed {timestamp} {jobId} - Update job Hash with
failedReason,stacktrace,attemptsMade - Release lock, emit event
Stalled Job Detection
A job becomes "stalled" when the worker crashes (SIGKILL, OOM) after moving the job to active but before completing or failing it. The lock expires but no worker claims the job — it is stuck in active indefinitely.
The stall check runs periodically (configurable with stalledInterval, default 30 seconds):
javascript// Worker's internal stall check (runs in QueueEvents or Worker itself) // Checks all jobs in 'active' that have an expired lock
The Lua-based stall check:
- Scans
bull:emails:activefor job IDs - For each: checks if
bull:emails:{jobId}:lockexists - If the lock does not exist (expired): the job is stalled
- If
attemptsMade < maxAttempts: moves back towait(retry) - If exhausted retries: moves to
failed
javascript// Configure stall detection const worker = new Worker('emails', processor, { stalledInterval: 30000, // check every 30 seconds maxStalledCount: 1, // mark as failed after 1 stall lockDuration: 30000, // lock expires in 30 seconds lockRenewTime: 15000, // renew lock every 15 seconds });
Tuning stall detection:
lockDurationshould be longer than the maximum expected job processing timelockRenewTimeis automatically set tolockDuration / 2— the worker renews its lock halfway through the duration- If a job legitimately takes 5 minutes: set
lockDuration: 360000(6 minutes) maxStalledCount: 0means stalled jobs are retried indefinitely (dangerous for infinite loops)
Rate Limiter Internals
javascriptconst worker = new Worker('emails', processor, { limiter: { max: 100, duration: 1000, // 100 jobs per second }, });
BullMQ's rate limiter uses a sliding window implemented with a Sorted Set:
bull:emails:rate-limiter → Sorted Set: {jobId} with score = timestamp
Before processing each job, the worker:
- Removes entries older than
durationms from the rate limiter key - Counts remaining entries
- If count ≥
max: delays the current job by inserting it back intodelayedfor the next window - Otherwise: increments the window counter and proceeds
Querying Queue State Directly
With this knowledge, you can inspect BullMQ queues using raw Redis commands:
bash# How many jobs are waiting? redis-cli LLEN bull:emails:wait # How many jobs are active? redis-cli LLEN bull:emails:active # What jobs are active? (get their IDs) redis-cli LRANGE bull:emails:active 0 -1 # Get details of a specific job redis-cli HGETALL bull:emails:42 # What delayed jobs are coming up in the next 60 seconds? redis-cli ZRANGEBYSCORE bull:emails:delayed 0 $(($(date +%s%3N) + 60000)) WITHSCORES # How many failed jobs? redis-cli ZCARD bull:emails:failed # View the events stream redis-cli XREVRANGE bull:emails:events + - COUNT 10
Memory Considerations
For high-throughput queues, BullMQ keys accumulate:
- Completed jobs:
bull:emails:{jobId}Hashes persist after completion unlessremoveOnCompleteis set - Failed jobs: Same — persist forever unless
removeOnFail
javascript// Recommended: auto-remove jobs after a count or age const worker = new Worker('emails', processor, { removeOnComplete: { count: 1000 }, // keep last 1000 completed removeOnFail: { count: 500 }, // keep last 500 failed });
Without this, a queue processing 1,000 jobs/hour generates 24,000 job Hashes per day. Each Hash is ~300–500 bytes. At 1M jobs total: ~300–500MB just for the job Hashes.
The completed and failed Sorted Sets also grow unboundedly. removeOnComplete.count limits the Sorted Set size by trimming (ZREMRANGEBYRANK) after each completion.
Summary
- BullMQ uses
wait(List) for FIFO queuing,active(List) for in-progress jobs,completed/failed(Sorted Sets) for history,delayed(Sorted Set with timestamp score) for scheduling - Job data lives in a Hash
bull:{queue}:{jobId}; queues store only the ID - Workers use
LMOVE wait active(atomic) to claim jobs; a Lua-based lock prevents double-processing - Stalled jobs (lock expired, still in
active) are detected and retried or failed by the stall checker - Tune
lockDurationto exceed max job processing time;lockRenewTimedefaults to halflockDuration - Rate limiting uses a sliding window Sorted Set — delayed jobs are re-queued when the window is full
- Enable
removeOnCompleteandremoveOnFailto prevent unbounded memory growth - Query queue state directly with Redis commands for debugging without the BullMQ API overhead
Next: P-7 — Cache Stampede, Avalanche, and Penetration — three cache failure modes that look similar in monitoring but require different solutions.