The SQS FIFO queue that lets a Lambda run 50 overnight jobs without timing out
Lambda has a hard 15-minute execution limit. An overnight agent that runs 50 job applications at 2-4 minutes each needs roughly 2 to 3.5 hours of compute. These facts are incompatible unless you decompose the work.
The coordinator/worker/SQS FIFO pattern solves this. It’s what runs Cass’s job application pipeline.
The problem
A naive overnight Lambda:
def handler(event, context):
    listings = fetch_qualifying_listings()  # 50+ items
    for listing in listings:
        process_listing(listing)  # each takes 2-4 min
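This hits the 15-minute timeout after item 5 or 6, and the remaining 44 or so items never get processed. Your code doesn’t get a chance to handle it, either: Lambda kills the invocation mid-item, and the only trace is a "Task timed out" line in the logs.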
The solution
Split the work into two Lambdas:
- Coordinator: runs on a cron, makes a fast pass over the available work, pushes each item onto an SQS FIFO queue, and returns immediately
- Worker: triggered by SQS, processes one item per invocation, with a fresh 15-minute budget for every item
EventBridge (nightly cron)
│
▼
Coordinator Lambda (30 sec)
│ pushes 50 messages
▼
SQS FIFO Queue
│ 50 invocations
▼
Worker Lambda × 50 (2-4 min each, in parallel)
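The coordinator finishes in about 30 seconds. Each worker invocation gets a fresh 15-minute budget. 50 items × 4 minutes = 200 minutes of work, processed concurrently. With no concurrency cap, the batch finishes in the time the slowest worker takes; with workers capped at 10, as in the template below, it runs in five waves and finishes in roughly 20 minutes.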
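The "slowest worker" figure only holds when every item runs at once. As a rough sketch of the wall-clock time under a concurrency cap, the estimate below assumes items run in full waves and all take the same time; the function name and parameters are illustrative, not part of the pipeline.

```python
import math


def estimate_wall_clock_minutes(items: int, minutes_per_item: float, max_concurrency: int) -> float:
    """Rough estimate: items run in waves of at most max_concurrency workers."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    waves = math.ceil(items / max_concurrency)
    return waves * float(minutes_per_item)


print(estimate_wall_clock_minutes(50, 4, 50))  # 4.0   (every item in parallel)
print(estimate_wall_clock_minutes(50, 4, 10))  # 20.0  (five waves of ten)
print(estimate_wall_clock_minutes(50, 4, 1))   # 200.0 (fully serial)
```

Real runs will drift from this because item durations vary and SQS polling adds some latency, so treat it as a planning number.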
The coordinator
import hashlib
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["SQS_QUEUE_URL"]


def handler(event, context):
    listings = fetch_qualifying_listings()
    pushed = 0
    for listing in listings:
        # Same listing + same updated_at -> same ID, so a retried run is deduplicated
        dedup_id = hashlib.md5(
            f"{listing['id']}-{listing['updated_at']}".encode()
        ).hexdigest()
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(listing),
            # One group per listing, so SQS can deliver listings in parallel
            MessageGroupId=str(listing["id"]),
            MessageDeduplicationId=dedup_id,
        )
        pushed += 1
    print(f"Coordinator: pushed {pushed} listings to SQS")
    return {"pushed": pushed}
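The MessageDeduplicationId is critical. EventBridge invokes the coordinator asynchronously, so if it crashes halfway, Lambda retries it and the second run pushes the first half of the listings again. SQS FIFO deduplication drops any message whose deduplication ID was already accepted in the previous 5 minutes, so those repeats never reach a worker. Without this, workers process the same listing twice and submit the same application twice. The window really is only 5 minutes: a rerun that happens later (a manual re-trigger the next morning, say) is not deduplicated by SQS.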
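Deduplication only works if the ID is deterministic. A minimal sketch of that property, using the same hashing recipe as the coordinator; the helper name `make_dedup_id` and the sample listings are made up for illustration.

```python
import hashlib


def make_dedup_id(listing: dict) -> str:
    """Hash of id + updated_at: stable across retries, changes when the listing changes."""
    return hashlib.md5(
        f"{listing['id']}-{listing['updated_at']}".encode()
    ).hexdigest()


# Hypothetical listing, as the first run and a retry would both see it
first_run = make_dedup_id({"id": "abc-123", "updated_at": "2024-05-01T02:00:00Z"})
retry = make_dedup_id({"id": "abc-123", "updated_at": "2024-05-01T02:00:00Z"})
# The same listing after the employer edits it
edited = make_dedup_id({"id": "abc-123", "updated_at": "2024-05-02T09:30:00Z"})

print(first_run == retry)   # True: SQS would drop the retry's copy
print(first_run == edited)  # False: an updated listing counts as new work
print(len(first_run))       # 32, well inside the 128-character limit for the ID
```

Including updated_at in the hash is a design choice: an edited listing gets a new ID and is queued again. Hash listing['id'] alone if a listing should only ever be processed once per window.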
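The MessageGroupId controls concurrency. Within a group, SQS delivers messages in order and holds back the rest of the group while earlier messages are in flight, so one shared group ID (for example "job-applications") would process all 50 items one after another. For parallelism, use listing['id'] as the group ID: each item is its own group and SQS can deliver multiple items concurrently.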
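To make the difference concrete, here is a small sketch that builds the send_message arguments both ways and counts the resulting groups. `build_message` and its `parallel` flag are hypothetical helpers, and the listings are dummy data.

```python
import hashlib
import json


def build_message(listing: dict, parallel: bool = True) -> dict:
    """Build the per-message keyword arguments for sqs.send_message (QueueUrl added by the caller)."""
    dedup_id = hashlib.md5(
        f"{listing['id']}-{listing['updated_at']}".encode()
    ).hexdigest()
    return {
        "MessageBody": json.dumps(listing),
        # Per-listing group -> independent delivery; shared group -> strict one-at-a-time order
        "MessageGroupId": str(listing["id"]) if parallel else "job-applications",
        "MessageDeduplicationId": dedup_id,
    }


# Dummy listings for illustration only
listings = [{"id": n, "updated_at": "2024-05-01T02:00:00Z"} for n in range(50)]

parallel_groups = {build_message(item)["MessageGroupId"] for item in listings}
serial_groups = {build_message(item, parallel=False)["MessageGroupId"] for item in listings}

print(len(parallel_groups))  # 50 groups: up to 50 messages can be in flight at once
print(len(serial_groups))    # 1 group: messages are handed out one at a time
```

In the coordinator this would be used as sqs.send_message(QueueUrl=QUEUE_URL, **build_message(listing)). Per-listing groups give up ordering between listings, which is fine here because the applications are independent of each other.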
The worker
import json


def handler(event, context):
    failures = []
    for record in event["Records"]:
        listing = json.loads(record["body"])
        try:
            process_listing(listing)
        except Exception as e:
            print(f"Worker error for {listing['id']}: {e}")
            # Report only this message as failed so SQS retries it alone
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch succeeded
    return {"batchItemFailures": failures}
ReportBatchItemFailures is the other critical piece. When a batch has 5 messages and 1 fails, the default behavior is to fail the whole batch: all 5 go back to the queue. With ReportBatchItemFailures, only the failed message is retried and the other 4 are deleted. Without this, a single flaky API call re-runs messages that already succeeded, which for this pipeline means duplicate applications.
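The 5-messages-1-failure case is easy to check locally. The sketch below mirrors the worker's loop but takes the processing function as a parameter so it runs without AWS; `handle_batch`, `flaky_process`, and the fake event are stand-ins invented for the example.

```python
import json


def handle_batch(event: dict, process) -> dict:
    """Process each SQS record and report only the failed message IDs."""
    failures = []
    for record in event["Records"]:
        listing = json.loads(record["body"])
        try:
            process(listing)
        except Exception as exc:
            print(f"Worker error for {listing['id']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def flaky_process(listing: dict) -> None:
    """Stand-in for process_listing that fails on one specific item."""
    if listing["id"] == 3:
        raise RuntimeError("simulated API failure")


# Minimal fake of the event Lambda passes for an SQS batch of 5 messages
fake_event = {
    "Records": [
        {"messageId": f"msg-{n}", "body": json.dumps({"id": n})}
        for n in range(1, 6)
    ]
}

response = handle_batch(fake_event, flaky_process)
print(response)  # {'batchItemFailures': [{'itemIdentifier': 'msg-3'}]}
```

Only msg-3 is reported, so SQS would delete the other four and redeliver just that one.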
The SAM template
# Runtime and CodeUri are omitted here; set them per function or under Globals
Resources:
  JobQueue:
    Type: AWS::SQS::Queue
    Properties:
      FifoQueue: true
      ContentBasedDeduplication: false
      VisibilityTimeout: 900  # must be >= Lambda timeout; AWS recommends 6x
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt JobDLQ.Arn
        maxReceiveCount: 3
  JobDLQ:
    Type: AWS::SQS::Queue
    Properties:
      FifoQueue: true  # the DLQ of a FIFO queue must also be FIFO
  CoordinatorFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: coordinator.handler
      Timeout: 60
      Environment:
        Variables:
          SQS_QUEUE_URL: !Ref JobQueue
      Events:
        Nightly:
          Type: Schedule
          Properties:
            Schedule: cron(0 5 * * ? *)  # 05:00 UTC every day
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt JobQueue.QueueName
  WorkerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: worker.handler
      Timeout: 900
      ReservedConcurrentExecutions: 10
      Events:
        SQSTrigger:
          Type: SQS
          Properties:
            Queue: !GetAtt JobQueue.Arn
            BatchSize: 1
            FunctionResponseTypes:
              - ReportBatchItemFailures
BatchSize: 1 means each Lambda invocation processes one message. For anything running 2+ minutes per item, this is correct: every message in a batch shares the same 15-minute invocation, so a batch of five 4-minute items would time out. ReservedConcurrentExecutions: 10 caps parallel workers so the queue doesn’t exhaust your account’s Lambda concurrency limit.
The dead-letter queue
Items that fail 3 times land in the DLQ. In the morning, check the depth:
# DLQ_URL is the dead-letter queue's URL
resp = sqs.get_queue_attributes(
    QueueUrl=DLQ_URL,
    AttributeNames=["ApproximateNumberOfMessages"]
)
dlq_depth = int(resp["Attributes"]["ApproximateNumberOfMessages"])
if dlq_depth > 0:
    print(f"WARNING: {dlq_depth} failed jobs in DLQ — investigate")
The DLQ is the signal that something is consistently broken. Items that hit a flaky retry and recover never reach it.
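Once the depth is non-zero, the next question is which listings failed. One possible way to peek, sketched under the assumption that each message body is the JSON listing the coordinator sent; `peek_dlq` is a hypothetical helper, and the client is passed in so it can be exercised with a stub.

```python
import json


def peek_dlq(sqs_client, dlq_url: str, max_messages: int = 10) -> list:
    """Return the listing IDs of up to max_messages failed jobs without deleting them."""
    resp = sqs_client.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        VisibilityTimeout=0,               # leave the messages visible to later readers
        WaitTimeSeconds=1,
    )
    # "Messages" is absent from the response when the queue is empty
    return [json.loads(message["Body"])["id"] for message in resp.get("Messages", [])]
```

With a real client this would be called as peek_dlq(boto3.client("sqs"), DLQ_URL). A single call may return fewer messages than are actually in the queue, so treat the result as a sample, not a full inventory.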
Cost
For an overnight batch of 50 items:
- SQS FIFO: first million requests/month free. 50 messages = essentially zero.
- Lambda coordinator: 60 seconds at 512 MB = ~$0.00
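- Lambda workers: 50 × 4 minutes at 512 MB = 6,000 GB-seconds, or ~$0.10/night and ~$3/month at on-demand x86 rates (about $0.0000167 per GB-second), before any free tier
A 1,000-item nightly batch at the same settings is about $2/night, or roughly $60/month. At 50 items a night this is far cheaper than running an EC2 instance overnight; at 1,000 items it is worth comparing the two.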
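The worker cost is simple enough to recompute for your own numbers. This sketch assumes an on-demand x86 rate of $0.0000166667 per GB-second, ignores the per-request charge and the free tier, and uses a made-up function name; check the current rate for your region before relying on it.

```python
GB_SECOND_PRICE_USD = 0.0000166667  # assumed on-demand x86 rate; varies by region and architecture


def nightly_worker_cost(items: int, minutes_per_item: float, memory_mb: int,
                        price_per_gb_second: float = GB_SECOND_PRICE_USD) -> float:
    """Lambda duration cost in USD for one night's batch (request charges not included)."""
    gb_seconds = items * minutes_per_item * 60 * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


print(round(nightly_worker_cost(50, 4, 512), 2))         # 0.1  per night for 50 items
print(round(30 * nightly_worker_cost(50, 4, 512), 2))    # 3.0  per 30-night month
print(round(30 * nightly_worker_cost(1000, 4, 512), 2))  # 60.0 per month for 1,000 items a night
```

Cost scales linearly with memory and duration, so halving the memory halves the bill if the job is network-bound and does not slow down.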
Module 8 of Build Your Own Cass walks through the full coordinator/worker implementation for the job application pipeline: the scoring agent loop, the application agent loop, the DLQ monitoring cron, and the complete SAM template.