Auditing 22 references to GSIs that don't exist
A pre-release audit pass on a backend I run turned up 22 places
where the code passed IndexName: foo-index to a DynamoDB query
on a table where foo-index didn't exist. Most of them had been
silently dormant because the offending code paths weren't called
in normal usage. One of them was crashing a cron job that hadn't
been invoked since being deployed.
What was happening
The pattern was always the same:
$result = $db->query('checkonmine_pings', [
'IndexName' => 'user_id-index',
'KeyConditionExpression' => 'user_id = :u',
'ExpressionAttributeValues' => [':u' => ['S' => $userId]],
]);
The trap: many of those tables already had user_id as the
primary HASH key. The query would have worked perfectly without
the IndexName line — but with it, DynamoDB returns a
ResourceNotFoundException because the named GSI doesn't exist.
Most of these were copy-pasted from one query that needed a real GSI on a different table. Nobody noticed because nobody ever exercised the code paths.
What I found
I wrote a tiny audit script that lists every IndexName reference
in the PHP source and compares against the live GSIs on each
table:
python audit_indexes.py
# rough sketch
import boto3, re, os
ddb = boto3.client('dynamodb', region_name='us-east-1')
live = {}
for t in ddb.list_tables()['TableNames']:
desc = ddb.describe_table(TableName=t)['Table']
live[t] = {g['IndexName'] for g in desc.get('GlobalSecondaryIndexes', [])}
usages = [] # (file, line, table, index_name)
for root, _, files in os.walk('public'):
for f in files:
if not f.endswith('.php'): continue
# naive: walk source, find IndexName => '...' near nearest 'TableName' or table arg
# in practice: regex per query() call site, follow the table arg
for file, line, table, idx in usages:
if idx not in live.get(table, set()):
print(f'{file}:{line} {table} -> {idx} MISSING')
Results:
- 18 cases where the code referenced
user_id-indexon tables that already haveuser_idas the primary HASH key (no GSI needed at all, remove the line). - 2 cases on a
smart_device_eventstable referencing two indexes that don't exist (table only hadid-index). Rewrote one to query the base table with aFilterExpressionondevice_idsince the access pattern was low-traffic. - 2 cases on an
error_logstable referencinguser_id-indexanderror_type-index. Same approach — admin table, low traffic,scan + FilterExpressionis fine.
Plus two missing tables that the code happily referenced as if they existed: created them via the CLI with the schemas the existing service module already expected.
The fix
Mostly mechanical. For each "GSI doesn't exist but table has the
right HASH key" case, just delete the IndexName line:
// before
$result = $db->query('checkonmine_pings', [
'IndexName' => 'user_id-index',
'KeyConditionExpression' => 'user_id = :u',
...
]);
// after
$result = $db->query('checkonmine_pings', [
'KeyConditionExpression' => 'user_id = :u',
...
]);
For the "GSI doesn't exist and access pattern is rare" cases,
swap Query for Scan + FilterExpression. Not the right answer
on a hot table, but for an admin error-log lookup that runs
maybe once a week it's fine.
For the two missing tables, aws dynamodb create-table with
BillingMode=PAY_PER_REQUEST and SSE on. No CloudFormation —
this codebase doesn't manage tables that way and I wasn't going
to introduce it for two tables.
What I'd do differently
The script that found these is now part of the repo as
audit_indexes.py. Re-running it after any major code change
takes about three seconds and catches the next bug like this
before deploy.
The pattern of "I copy-pasted a working query and didn't
question the IndexName line" is going to happen forever, on
every team. The defense is automation, not vigilance. Anything
you've ever found by reading the code, you should be able to
re-find by running a script.