Updating a DynamoDB row by the wrong key shape, for a year
A pre-release endpoint sweep on a backend I run turned up two
500s on the check-in respond and snooze endpoints. The cause was
old: a composite-key table was being addressed by a single-field
key in several places. It had been quietly throwing
ValidationException for so long that those errors had become
background noise nobody read.
What was happening
The table checkonmine_checkin_requests has a composite primary
key:
- HASH: user_id (String)
- RANGE: requested_at (String)
There is also a GSI, id-index, on the synthetic id field for
lookups by id.
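
For reference, the schema described above can be written as the parameter array that the AWS SDK for PHP's DynamoDbClient::createTable() accepts. This is a sketch rather than the real table definition: the billing mode and the GSI projection type are assumptions, and only the key attributes come from the description above.

```php
<?php
// Sketch of the table definition in createTable() parameter form.
// ASSUMPTIONS: BillingMode and ProjectionType are illustrative guesses;
// the key attributes and index name are the ones described in the post.
$checkinRequestsTable = [
    'TableName' => 'checkonmine_checkin_requests',
    'AttributeDefinitions' => [
        ['AttributeName' => 'user_id',      'AttributeType' => 'S'],
        ['AttributeName' => 'requested_at', 'AttributeType' => 'S'],
        ['AttributeName' => 'id',           'AttributeType' => 'S'],
    ],
    // The primary key: GetItem and UpdateItem need both of these attributes.
    'KeySchema' => [
        ['AttributeName' => 'user_id',      'KeyType' => 'HASH'],
        ['AttributeName' => 'requested_at', 'KeyType' => 'RANGE'],
    ],
    // The GSI: usable with Query, but not as a key for GetItem or UpdateItem.
    'GlobalSecondaryIndexes' => [[
        'IndexName'  => 'id-index',
        'KeySchema'  => [['AttributeName' => 'id', 'KeyType' => 'HASH']],
        'Projection' => ['ProjectionType' => 'ALL'],
    ]],
    'BillingMode' => 'PAY_PER_REQUEST',
];
```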
Several places in the code looked like this:
    $req = $db->getItem('checkin_requests', ['id' => $requestId]);
    // ...
    $db->updateItem('checkin_requests',
        ['id' => $requestId],
        ['status' => 'resolved']
    );
DynamoDB rejects that: getItem and updateItem address the base
table and require its full primary key (here both user_id and
requested_at), not a GSI key. The call fails with
ValidationException: The provided key element does not match the schema.
But respondToCheckin, triggerEscalation, and the cron's
expired-mark all used that pattern. So all three were quietly
500ing whenever they were exercised, and the cron's "tried to
update a doomed row" log line was being swallowed by sheer
volume.
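
One cheap way to surface this class of mistake before it reaches DynamoDB is a local key-shape check. The sketch below is hypothetical (keyMatchesSchema is not part of the codebase or of any SDK) and compares attribute names only, not types, so it is a rough approximation of the service-side validation.

```php
<?php
// Hypothetical guard: true only when $key names exactly the attributes of
// the table's primary key, which is what GetItem and UpdateItem require.
// It checks attribute names only; DynamoDB also validates their types.
function keyMatchesSchema(array $primaryKeyAttributes, array $key): bool
{
    $expected = $primaryKeyAttributes;
    $given = array_keys($key);
    sort($expected);
    sort($given);
    return $expected === $given;
}

$primaryKey = ['user_id', 'requested_at'];

// The broken call sites passed a GSI key:
var_dump(keyMatchesSchema($primaryKey, ['id' => 'req-123']));  // bool(false)

// The full composite key passes:
var_dump(keyMatchesSchema($primaryKey, [
    'user_id'      => 'u-1',
    'requested_at' => '2024-01-01T00:00:00Z',
]));  // bool(true)
```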
What I found
The right pattern is a small two-step helper: look up by the GSI to get the composite key, then mutate by the composite key.
The fix
    function getCheckinRequestById($db, $id) {
        // Query the GSI. The base table's key attributes (user_id,
        // requested_at) are always projected into a GSI, so the result
        // carries everything needed to address the row afterwards.
        $result = $db->query('checkin_requests', [
            'IndexName' => 'id-index',
            'KeyConditionExpression' => 'id = :id',
            'ExpressionAttributeValues' => [':id' => ['S' => $id]],
            'Limit' => 1,
        ]);
        return $result[0] ?? null;
    }

    // then everywhere we previously did getItem('checkin_requests', ['id' => $id])
    $req = getCheckinRequestById($db, $requestId);
    if (!$req) {
        return ['error' => 'not_found'];
    }
    // Mutate by the full composite key recovered from the GSI lookup.
    $db->updateItem('checkin_requests',
        ['user_id' => $req['user_id'], 'requested_at' => $req['requested_at']],
        ['status' => 'resolved']
    );
Applied at three call sites: respondToCheckin,
triggerEscalation, and the cron's
processExpiredCheckins mark step.
Verified post-deploy: respond and snooze return 200, the cron runs
clean, and the ValidationException log lines have stopped.
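
Since the same lookup-then-update pair recurs at each call site, one option is to fold both steps into a single function so a caller never handles the key shape at all. The sketch below is illustrative only: updateCheckinRequestById and FakeCheckinDb are hypothetical names, and the fake is an in-memory stand-in that mimics just the two wrapper calls used here, on the assumption that query() returns a plain list of items.

```php
<?php
// In-memory stand-in for the $db wrapper; purely illustrative.
class FakeCheckinDb
{
    public array $rows = [];

    // Mimics a query on id-index: returns a list of matching items.
    public function query(string $table, array $params): array
    {
        $id = $params['ExpressionAttributeValues'][':id']['S'];
        return array_values(array_filter(
            $this->rows,
            fn (array $row): bool => $row['id'] === $id
        ));
    }

    // Mimics updateItem: rejects anything but the composite primary key
    // (given in user_id, requested_at order).
    public function updateItem(string $table, array $key, array $changes): void
    {
        if (array_keys($key) !== ['user_id', 'requested_at']) {
            throw new InvalidArgumentException(
                'The provided key element does not match the schema'
            );
        }
        foreach ($this->rows as $i => $row) {
            if ($row['user_id'] === $key['user_id']
                && $row['requested_at'] === $key['requested_at']) {
                $this->rows[$i] = array_merge($row, $changes);
            }
        }
    }
}

// Resolve the composite key through the GSI, then update by that key.
// Returns false when no request with this id exists.
function updateCheckinRequestById($db, string $id, array $changes): bool
{
    $matches = $db->query('checkin_requests', [
        'IndexName' => 'id-index',
        'KeyConditionExpression' => 'id = :id',
        'ExpressionAttributeValues' => [':id' => ['S' => $id]],
        'Limit' => 1,
    ]);
    $req = $matches[0] ?? null;
    if ($req === null) {
        return false;
    }
    $db->updateItem(
        'checkin_requests',
        ['user_id' => $req['user_id'], 'requested_at' => $req['requested_at']],
        $changes
    );
    return true;
}

// Usage against the fake:
$db = new FakeCheckinDb();
$db->rows[] = [
    'id' => 'req-1',
    'user_id' => 'u-1',
    'requested_at' => '2024-01-01T00:00:00Z',
    'status' => 'pending',
];
updateCheckinRequestById($db, 'req-1', ['status' => 'resolved']);
```

The trade-off is an extra read per update. A call site that already has user_id and requested_at in hand can skip the lookup and update directly.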
What I'd do differently
The "tried to update by the wrong key" pattern is invisible if your monitoring isn't watching for that exact exception class. The cron in question wakes up every five minutes, exits with "status: completed" because the broken update is wrapped in a try/catch, and nobody knows anything is wrong.
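
A catch block does not have to hide the failure. Here is a minimal sketch of a cron step that still survives a bad row but counts failures and reports them in its final status. The function name runExpirySweep, the status strings, and the shape of the summary are all assumptions for illustration, not the real cron's code.

```php
<?php
// Hypothetical cron step. $markExpired is whatever callable performs the
// per-row update; any exception it throws is logged and counted, not lost.
function runExpirySweep(array $requests, callable $markExpired): array
{
    $failed = 0;
    foreach ($requests as $req) {
        try {
            $markExpired($req);
        } catch (Throwable $e) {
            $failed++;
            // Log the exception class so monitoring can match on it.
            error_log(sprintf(
                'checkin expiry failed: %s: %s',
                get_class($e),
                $e->getMessage()
            ));
        }
    }
    return [
        // Only report "completed" when every row actually went through.
        'status'    => $failed === 0 ? 'completed' : 'completed_with_errors',
        'processed' => count($requests),
        'failed'    => $failed,
    ];
}
```

With this shape, a run in which every update fails can no longer report the same status as a clean run.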
Two changes that would have caught this long ago: (a) alarm on
any ValidationException in CloudWatch, period, because it almost
always means the code's mental model of the schema is wrong;
(b) an integration test that exercises every state transition on
every persistent entity. The respond/snooze paths got tested by
hand during initial build and never again.
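
For (b), the harness can be very small. The sketch below runs each named transition against a fresh fixture and collects failures instead of stopping at the first. Everything in it is hypothetical: checkTransitions, the status values, and the stand-in closures. In a real integration test each closure would call the actual handler against a test table.

```php
<?php
// Runs every transition against a fresh fixture. Each entry maps a name to
// [callable, expected status]. Returns a map of failed names to reasons.
function checkTransitions(array $transitions, callable $makeFixture): array
{
    $failures = [];
    foreach ($transitions as $name => [$apply, $expectedStatus]) {
        try {
            $after = $apply($makeFixture());
            if (($after['status'] ?? null) !== $expectedStatus) {
                $failures[$name] = 'unexpected status';
            }
        } catch (Throwable $e) {
            $failures[$name] = get_class($e);
        }
    }
    return $failures;
}

// Stand-in transitions: one that works, one that throws the way the broken
// call sites did.
$transitions = [
    'respond' => [
        fn (array $r): array => array_merge($r, ['status' => 'resolved']),
        'resolved',
    ],
    'snooze' => [
        function (array $r): array {
            throw new RuntimeException('ValidationException');
        },
        'snoozed',
    ],
];

$failures = checkTransitions(
    $transitions,
    fn (): array => ['id' => 'req-1', 'status' => 'pending']
);
// $failures is ['snooze' => 'RuntimeException']
```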
The audit script that found this also found 22 other GSI references on tables where the GSI didn't exist. Those are worth writing about separately — same shape of bug, different direction.