The SQS FIFO queue that lets a Lambda run 50 overnight jobs without timing out

Lambda has a hard 15-minute execution limit. An overnight agent that works through 50 job applications at 2–4 minutes each needs roughly 2 to 3 hours of compute. These facts are incompatible unless you decompose the work.

The coordinator/worker/SQS FIFO pattern solves this. It’s what runs Cass’s job application pipeline.

The problem

A naive overnight Lambda:

def handler(event, context):
    listings = fetch_qualifying_listings()  # 50+ items
    for listing in listings:
        process_listing(listing)  # each takes 2-4 min

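This hits the 15-minute timeout after item 5 or 6, and the remaining 44 or so items never get processed. Your code never sees an exception: the runtime kills the invocation mid-item, and the only trace is a "Task timed out" line in the function’s logs.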

The solution

Split the work into two Lambdas:

Coordinator: runs on a cron, makes a fast pass over the available work, pushes each item onto an SQS FIFO queue, and returns immediately
Worker: triggered by SQS, processes one item per invocation, and gets the full 15-minute budget for that item alone

EventBridge (nightly cron)
    │
    ▼
Coordinator Lambda (30 sec)
    │ pushes 50 messages
    ▼
SQS FIFO Queue
    │ 50 invocations
    ▼
Worker Lambda × 50 (2-4 min each, up to 10 at a time)

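The coordinator finishes in about 30 seconds. Each worker invocation gets a fresh 15-minute budget. 50 items × 4 minutes = 200 minutes of work, but it runs concurrently: with one message group per listing and worker concurrency capped at 10 (as in the SAM template below), that is 5 waves of roughly 4 minutes, or about 20 minutes of wall-clock time.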
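The wall-clock time depends on how many workers are allowed to run at once. A rough estimate, assuming every item takes the same time and workers run in full waves (the function name and parameters are illustrative, not part of the pipeline):

```python
import math

def estimate_wall_clock_minutes(items, minutes_per_item, max_concurrency):
    """Rough estimate: items are processed in waves of max_concurrency."""
    if items <= 0:
        return 0.0
    waves = math.ceil(items / max_concurrency)
    return waves * float(minutes_per_item)

print(estimate_wall_clock_minutes(50, 4, 50))  # 4.0   (no concurrency cap)
print(estimate_wall_clock_minutes(50, 4, 10))  # 20.0  (cap of 10 workers)
print(estimate_wall_clock_minutes(50, 4, 1))   # 200.0 (fully serialized)
```

The last case is what a single Lambda would need, which is why it cannot fit in one 15-minute invocation.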

The coordinator

import hashlib
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["SQS_QUEUE_URL"]

def handler(event, context):
    listings = fetch_qualifying_listings()

    pushed = 0
    for listing in listings:
        dedup_id = hashlib.md5(
            f"{listing['id']}-{listing['updated_at']}".encode()
        ).hexdigest()

        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(listing),
            MessageGroupId=str(listing["id"]),  # one group per listing so workers can run in parallel
            MessageDeduplicationId=dedup_id
        )
        pushed += 1

    print(f"Coordinator: pushed {pushed} listings to SQS")
    return {"pushed": pushed}

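The MessageDeduplicationId is critical. If the coordinator crashes halfway and the invocation is retried, the second run re-sends messages that are already on the queue. Because the ID is derived from the listing ID and its updated_at value, SQS FIFO recognizes the repeats and drops any that arrive within its 5-minute deduplication window. Without this, workers process the same listing twice and submit the same application twice. The window is only 5 minutes, so it guards against quick retries, not against a rerun hours later.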
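To see why a retried run produces duplicates that SQS can recognize, here is the same hashing step pulled into a small helper. The helper name and the sample listings are made up for illustration:

```python
import hashlib

def dedup_id(listing):
    """The same id and updated_at always yield the same 32-character hex ID."""
    raw = f"{listing['id']}-{listing['updated_at']}"
    return hashlib.md5(raw.encode()).hexdigest()

first = dedup_id({"id": "abc-123", "updated_at": "2024-05-01T00:00:00Z"})
retry = dedup_id({"id": "abc-123", "updated_at": "2024-05-01T00:00:00Z"})
edited = dedup_id({"id": "abc-123", "updated_at": "2024-05-02T09:30:00Z"})

print(first == retry)   # True: a retried coordinator run sends the same ID
print(first == edited)  # False: an updated listing counts as new work
```

Including updated_at in the hash is a design choice: a listing that changes is queued again, while an unchanged one is dropped if it is re-sent inside the window.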

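The MessageGroupId controls concurrency. Within a group, SQS delivers messages in order and holds back later messages while earlier ones are in flight, so a single shared group ID such as "job-applications" serializes the whole queue. For parallelism, use listing['id'] as the group ID: each item is its own group, and SQS can deliver many items concurrently.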
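The effect of the group ID can be illustrated with a toy model. This is not how SQS is implemented; it only assumes BatchSize 1 and that at most one message per group is in flight at a time. The function name is hypothetical:

```python
from collections import OrderedDict, deque

def delivery_rounds(group_ids):
    """Toy model: in each round, at most one message per group is delivered."""
    queues = OrderedDict()
    for position, group_id in enumerate(group_ids):
        queues.setdefault(group_id, deque()).append(position)

    rounds = []
    while any(queues.values()):
        # Take the head of every group that still has messages waiting
        rounds.append([q.popleft() for q in queues.values() if q])
    return rounds

shared_group = ["job-applications"] * 5
per_listing = ["id-1", "id-2", "id-3", "id-4", "id-5"]

print(len(delivery_rounds(shared_group)))  # 5: one message at a time
print(len(delivery_rounds(per_listing)))   # 1: all five can be in flight together
```

With one group per listing, the practical limit on parallelism becomes the worker's concurrency setting rather than the queue.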

The worker

import json

def handler(event, context):
    failures = []

    for record in event["Records"]:
        listing = json.loads(record["body"])

        try:
            process_listing(listing)
        except Exception as e:
            # Report only this message as failed so SQS retries it alone
            print(f"Worker error for {listing['id']}: {e}")
            failures.append({"itemIdentifier": record["messageId"]})

    # An empty list tells Lambda that every message in the batch succeeded
    return {"batchItemFailures": failures}

ReportBatchItemFailures is the other critical piece. Without it, when a batch has 5 messages and 1 raises, the whole batch fails and all 5 go back to the queue. With ReportBatchItemFailures, only the failed message is retried and the other 4 are deleted. Without this, a single flaky API call re-runs work that already succeeded, which here means resubmitting applications.
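The partial-failure response can be checked locally without AWS. The sketch below restates the handler in compact form, swaps in a stand-in process_listing that fails for one made-up listing ID, and feeds it a hand-built event containing only the two record fields the handler reads:

```python
import json

def process_listing(listing):
    # Stand-in for the real work: fail for one specific listing
    if listing["id"] == "bad":
        raise RuntimeError("simulated flaky API call")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        listing = json.loads(record["body"])
        try:
            process_listing(listing)
        except Exception as e:
            print(f"Worker error for {listing['id']}: {e}")
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

fake_event = {"Records": [
    {"messageId": "m-1", "body": json.dumps({"id": "ok-1"})},
    {"messageId": "m-2", "body": json.dumps({"id": "bad"})},
    {"messageId": "m-3", "body": json.dumps({"id": "ok-2"})},
]}

response = handler(fake_event, None)
print(response)  # {'batchItemFailures': [{'itemIdentifier': 'm-2'}]}
```

Only m-2 is reported, so only m-2 returns to the queue.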

The SAM template

Resources:
  JobQueue:
    Type: AWS::SQS::Queue
    Properties:
      FifoQueue: true
      ContentBasedDeduplication: false
      VisibilityTimeout: 900  # must be >= worker timeout; AWS recommends 6x the function timeout
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt JobDLQ.Arn
        maxReceiveCount: 3

  JobDLQ:
    Type: AWS::SQS::Queue
    Properties:
      FifoQueue: true

  CoordinatorFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: coordinator.handler
      Timeout: 60
      Environment:
        Variables:
          SQS_QUEUE_URL: !Ref JobQueue
      Events:
        Nightly:
          Type: Schedule
          Properties:
            Schedule: cron(0 5 * * ? *)
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt JobQueue.QueueName

  WorkerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: worker.handler
      Timeout: 900
      ReservedConcurrentExecutions: 10
      Events:
        SQSTrigger:
          Type: SQS
          Properties:
            Queue: !GetAtt JobQueue.Arn
            BatchSize: 1
            FunctionResponseTypes:
              - ReportBatchItemFailures

BatchSize: 1 means each Lambda invocation processes one message. For anything running 2+ minutes per item, this is correct. ReservedConcurrentExecutions: 10 caps parallel workers so the queue doesn’t exhaust your account’s Lambda concurrency limit.

The dead-letter queue

Items that fail 3 times land in the DLQ. In the morning, check the depth:

import os

import boto3

sqs = boto3.client("sqs")
DLQ_URL = os.environ["DLQ_URL"]  # assumed environment variable name

resp = sqs.get_queue_attributes(
    QueueUrl=DLQ_URL,
    AttributeNames=["ApproximateNumberOfMessages"]
)
dlq_depth = int(resp["Attributes"]["ApproximateNumberOfMessages"])
if dlq_depth > 0:
    print(f"WARNING: {dlq_depth} failed jobs in DLQ, investigate")

The DLQ is the signal that something is consistently broken. Items that hit a flaky retry and recover never reach it.
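Once the depth check fires, the next question is which listings failed. A minimal sketch, assuming each message body is the JSON listing the coordinator sent; peek_dlq is a hypothetical helper, and the client is passed in so the function can be exercised without AWS:

```python
import json

def peek_dlq(sqs_client, queue_url, max_messages=10):
    """Return listing IDs for up to max_messages DLQ messages without deleting them."""
    resp = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        VisibilityTimeout=0,               # leave messages visible for later inspection
        WaitTimeSeconds=1,
    )
    # "Messages" is absent from the response when nothing is available
    return [json.loads(m["Body"])["id"] for m in resp.get("Messages", [])]

# Usage against the real queue (boto3 client and DLQ_URL assumed to exist):
# failed_ids = peek_dlq(boto3.client("sqs"), DLQ_URL)
```

Note that messages received through the SQS API carry the payload under "Body", whereas records in a Lambda event use "body".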

Cost

For an overnight batch of 50 items:

SQS FIFO: first million requests/month free. 50 messages = essentially zero.
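Lambda coordinator: 60 seconds at 512 MB = 30 GB-seconds, about $0.0005
Lambda workers: 50 × 4 minutes at 512 MB = 6,000 GB-seconds, about $0.10/night or $3/month at $0.0000166667 per GB-second (the x86 on-demand rate in us-east-1 at the time of writing)

A 1,000-item nightly batch scales linearly to about $2/night, or $60/month. At 50 items, the whole pipeline fits inside Lambda’s monthly free tier of 400,000 GB-seconds and is far cheaper than keeping an EC2 instance running overnight.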
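The worker figure is plain arithmetic: invocations × seconds × memory in GB × price per GB-second. A small sketch, assuming the x86 on-demand rate of $0.0000166667 per GB-second and ignoring request charges and the free tier (check current pricing for your region):

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 on-demand rate; verify before relying on it

def lambda_compute_cost(invocations, seconds_each, memory_mb, price=PRICE_PER_GB_SECOND):
    """Compute-only Lambda cost in dollars for a batch of invocations."""
    gb_seconds = invocations * seconds_each * (memory_mb / 1024)
    return gb_seconds * price

small = lambda_compute_cost(50, 4 * 60, 512)
large = lambda_compute_cost(1000, 4 * 60, 512)

print(f"50 items:    ${small:.2f}/night, ${small * 30:.2f}/month")   # $0.10/night, $3.00/month
print(f"1,000 items: ${large:.2f}/night, ${large * 30:.2f}/month")   # $2.00/night, $60.00/month
```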

Module 8 of Build Your Own Cass walks through the full coordinator/worker implementation for the job application pipeline: the scoring agent loop, the application agent loop, the DLQ monitoring cron, and the complete SAM template.