Updating a DynamoDB row by the wrong key shape, for a year

April 24, 2026 · aws, dynamodb, php, debugging

A pre-release endpoint sweep on a backend I run turned up two 500s on the check-in respond and snooze endpoints. The cause was old: a composite-key table was being addressed by a single-field key in several places. It had been quietly throwing ValidationException for so long that those errors had become background noise nobody read.

What was happening

The table checkonmine_checkin_requests has a composite primary key:

HASH: user_id (String)
RANGE: requested_at (String)

It also has a GSI, id-index, on the synthetic id field, used for lookups by id.

Several places in the code looked like this:

$req = $db->getItem('checkin_requests', ['id' => $requestId]);
// ...
$db->updateItem('checkin_requests',
    ['id' => $requestId],
    ['status' => 'resolved']
);

DynamoDB rejects both calls: GetItem and UpdateItem require the table's full primary key (here user_id plus requested_at) and cannot be addressed by a GSI key. Each throws ValidationException: The provided key element does not match the schema.
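One cheap guard against this class of mistake is to check the key shape in the application before the request is sent. The following is a minimal sketch, not code from the backend described here: `TABLE_KEYS` and `assertFullPrimaryKey` are hypothetical names, and the schema map is assumed to be maintained by hand.

```php
<?php
// Hypothetical schema map: table name => attribute names of the full primary key.
// Maintained by hand in this sketch.
const TABLE_KEYS = [
    'checkin_requests' => ['user_id', 'requested_at'],
];

/**
 * Throws if $key is not exactly the table's primary key attributes.
 * Hypothetical helper, meant to run before getItem/updateItem.
 */
function assertFullPrimaryKey(string $table, array $key): void
{
    if (!array_key_exists($table, TABLE_KEYS)) {
        throw new InvalidArgumentException("Unknown table: $table");
    }
    $expected = TABLE_KEYS[$table];
    $given = array_keys($key);
    // Compare as sets: attribute order in the key array does not matter.
    sort($expected);
    sort($given);
    if ($expected !== $given) {
        throw new InvalidArgumentException(sprintf(
            'Key for %s must be exactly [%s], got [%s]',
            $table,
            implode(', ', $expected),
            implode(', ', $given)
        ));
    }
}
```

Calling `assertFullPrimaryKey('checkin_requests', ['id' => $requestId])` would fail loudly in development instead of surfacing later as a ValidationException in a log.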

But respondToCheckin, triggerEscalation, and the cron's expired-mark all used that pattern. So all three failed whenever they were exercised: the endpoints with a 500, the cron with a caught exception whose "tried to update a doomed row" log line was being swallowed by sheer volume.

What I found

The right pattern is a small two-step helper: look up by the GSI to get the composite key, then mutate by the composite key.

The fix

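// Resolve a synthetic id to its item via the id-index GSI.
// A GSI always projects the table's key attributes, so user_id and
// requested_at are present in the result whatever the index projection.
function getCheckinRequestById($db, $id) {
    $result = $db->query('checkin_requests', [
        'IndexName' => 'id-index',
        'KeyConditionExpression' => 'id = :id',
        'ExpressionAttributeValues' => [':id' => ['S' => $id]],
        'Limit' => 1, // assumes id is unique; a GSI does not enforce that
    ]);
    return $result[0] ?? null;
}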

// then everywhere we previously did getItem('checkin_requests', ['id' => $id])
$req = getCheckinRequestById($db, $requestId);
if (!$req) {
    return ['error' => 'not_found'];
}

$db->updateItem('checkin_requests',
    ['user_id' => $req['user_id'], 'requested_at' => $req['requested_at']],
    ['status' => 'resolved']
);

Applied at three call sites: respondToCheckin, triggerEscalation, and the cron's processExpiredCheckins mark step.
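The two steps can also be folded into one function so that a call site cannot skip the lookup. This is a sketch rather than the code that was deployed: `updateCheckinRequestById` is a hypothetical name, and it assumes the `getCheckinRequestById` helper above plus a `$db` wrapper whose `updateItem($table, $key, $attributes)` behaves as in the earlier snippets.

```php
<?php
/**
 * Hypothetical wrapper: resolve the synthetic id through the GSI, then
 * update by the table's real composite key.
 * Returns false when no row matches the id.
 */
function updateCheckinRequestById($db, string $id, array $attributes): bool
{
    $req = getCheckinRequestById($db, $id);
    if (!$req) {
        return false; // the caller decides how to report not_found
    }
    $db->updateItem(
        'checkin_requests',
        ['user_id' => $req['user_id'], 'requested_at' => $req['requested_at']],
        $attributes
    );
    return true;
}
```

One caveat with any lookup-then-update pair: UpdateItem creates the item if the key does not exist, so a row deleted between the two calls would come back as a stub. If the wrapper exposes condition expressions, a condition such as attribute_exists(user_id) closes that gap.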

Verified post-deploy: respond/snooze are 200, the cron runs clean, the ValidationException log lines stopped.

What I'd do differently

The "tried to update by the wrong key" pattern is invisible if your monitoring isn't watching for that exact exception class. The cron in question wakes up every five minutes, exits with "status: completed" because the broken update is wrapped in a try/catch, and nobody knows anything is wrong.
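A try/catch does not have to hide the failure. Here is a minimal sketch of a cron step that still survives a bad row but counts what it caught and reports it in the final status. `processRows` and the `completed_with_errors` status are hypothetical, not what the real cron does.

```php
<?php
/**
 * Hypothetical cron step: apply $markExpired to each row, count failures
 * instead of discarding them, and return a status that reflects them.
 */
function processRows(array $rows, callable $markExpired): array
{
    $failed = 0;
    foreach ($rows as $row) {
        try {
            $markExpired($row);
        } catch (Throwable $e) {
            $failed++;
            // Log the exception class so a schema error is distinguishable
            // from a transient network failure.
            error_log(sprintf('mark_expired failed: %s: %s', get_class($e), $e->getMessage()));
        }
    }
    return [
        'status' => $failed === 0 ? 'completed' : 'completed_with_errors',
        'processed' => count($rows),
        'failed' => $failed,
    ];
}
```

The caller can then exit non-zero whenever `failed` is greater than zero, which gives whatever supervises the cron something to notice.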

Two changes that would have caught this a year ago: (a) alarm on any ValidationException in CloudWatch, period, because it almost always means the code's mental model of the schema is wrong; (b) an integration test that exercises every state transition on every persistent entity. The respond/snooze paths were tested by hand during the initial build and never again.
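For (a), one way to get there is a CloudWatch Logs metric filter that turns every matching log line into a metric an alarm can watch. The sketch below only builds the request parameters. It assumes the application's logs already land in a CloudWatch Logs group, and the function, filter, metric, and namespace names are all placeholders.

```php
<?php
/**
 * Builds parameters for CloudWatch Logs PutMetricFilter so that every log
 * event containing "ValidationException" increments a custom metric.
 * All names below are placeholders chosen for this sketch.
 */
function validationExceptionFilterParams(string $logGroup): array
{
    return [
        'logGroupName' => $logGroup,
        'filterName' => 'dynamodb-validation-exception',
        // A double-quoted term matches log events containing that exact string.
        'filterPattern' => '"ValidationException"',
        'metricTransformations' => [[
            'metricName' => 'DynamoValidationExceptions',
            'metricNamespace' => 'App/Errors',
            'metricValue' => '1',
        ]],
    ];
}

// With the AWS SDK for PHP installed, the array would be passed as:
//   $logs = new Aws\CloudWatchLogs\CloudWatchLogsClient([
//       'region' => 'us-east-1', // placeholder region
//       'version' => 'latest',
//   ]);
//   $logs->putMetricFilter(validationExceptionFilterParams('/app/cron'));
```

An alarm on that metric with a threshold of one or more per period then covers the "period" part: any ValidationException at all pages someone.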

The audit script that found this also found 22 other GSI references on tables where the GSI didn't exist. Those are worth writing about separately — same shape of bug, different direction.