FILE 0x4F·FROGGER

Frogger

A multi-tenant AI assistant on the Claude Agent SDK that auto-resolves thousands of IT tickets every month for a managed-service provider.

Frogger was built and is operated as part of my current employment. The architecture details and metrics on this page are non-confidential and reflect what I personally designed and shipped. The employer is referred to generically as “the MSP” throughout, and no client names or specific tenants are disclosed.

What it does

Frogger lives in Microsoft Teams as a chat-and-mention bot. Anyone on the MSP team, or at one of the 50+ client tenants the MSP manages, can ping it. It triages, investigates, and, where allowed, fixes IT tickets autonomously: routine cases are handled end to end with no human in the loop. The bot is also embedded in approval-gated workflows for higher-risk actions like mass deploys or fleet-wide password resets.

It runs on the Claude Agent SDK with a custom tool layer that bridges the MSP's actual operational systems — ConnectWise Manage, NinjaOne, ConnectSecure, Microsoft Graph, Breach Secure Now, and several dozen smaller integrations. It reads tickets, queries device state, runs scripts, opens / updates / closes tickets, logs time entries, and surfaces summaries back into the Teams threads where the work originated.

~4,500 tickets auto-resolved per 30 days
50+ client tenants spanned
90+ managed endpoints reachable
124+ M365 tenants reported on monthly

Architecture

Backend

Python on AWS Lambda. DynamoDB for state, API Gateway for the bot webhook, EventBridge Scheduler for everything that runs on a clock (monthly license reports, nightly phishing-result syncs, scheduled scans). Bot Framework + Microsoft Graph on the Teams side; the bot is registered as an Azure AD app with channel-specific RSC permissions granted at install time per chat, so it never has tenant-wide standing access it doesn't actively need.
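To make the webhook path concrete, here is a minimal sketch of a Lambda entry point behind API Gateway, assuming the proxy integration (where the request body arrives as a JSON string under "body"). The names parse_activity and handler are hypothetical, the field names follow the Bot Framework activity schema, and token verification is deliberately left out. It is an illustration, not the production code.

```python
import json
from typing import Any


def parse_activity(event: dict[str, Any]) -> dict[str, Any]:
    """Pull the fields the agent needs out of an API Gateway proxy event."""
    activity = json.loads(event.get("body") or "{}")
    return {
        "type": activity.get("type"),
        "text": (activity.get("text") or "").strip(),
        "conversation_id": activity.get("conversation", {}).get("id"),
        "sender": activity.get("from", {}).get("name"),
    }


def handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    # Hypothetical entry point. A real bot must verify the Bot Framework
    # bearer token before trusting anything in the body.
    msg = parse_activity(event)
    if msg["type"] != "message" or not msg["text"]:
        # Acknowledge non-message activities (typing, installs) and do nothing.
        return {"statusCode": 200, "body": ""}
    # Assumption: the message is handed to the agent asynchronously here,
    # so the webhook can return quickly.
    return {
        "statusCode": 202,
        "body": json.dumps({"accepted": msg["conversation_id"]}),
    }
```

Returning early for non-message activities keeps the expensive agent path off the noisy events Teams sends.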

Python · AWS Lambda · DynamoDB · API Gateway · EventBridge · Bot Framework · Microsoft Graph · SES · Bedrock

Agent layer

Claude Agent SDK driving a tool catalog of ~40 tools across read, mutate, and "needs approval" tiers. Tools are grouped by integration (ConnectWise, NinjaOne, ConnectSecure, Graph, Breach Secure Now, MSP-internal SharePoint) so I can swap one out without touching the others. Each tool has hard-coded input validation plus a permission gate that knows which Teams channel made the call.

The agent itself is a chain: classify the incoming Teams message → route to the right system prompt → execute tools → post a markdown summary back into the originating chat. Long-running jobs (multi-tenant scans, fleet queries) detach into background tasks and post completion messages when they finish.
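The chain described above can be sketched as four small steps. This is an illustration under stated assumptions: the categories, the SYSTEM_PROMPTS table, and the keyword-based classify function are invented stand-ins (the real classifier is presumably a model call), and run_agent and post are injected callables, not SDK functions.

```python
from typing import Callable

# Hypothetical prompt table; the real categories and prompts are not given.
SYSTEM_PROMPTS = {
    "password_reset": "You handle account lockouts and resets.",
    "device_issue": "You investigate endpoint health.",
    "general": "You triage IT requests.",
}


def classify(message: str) -> str:
    """Keyword stand-in for the classification step."""
    text = message.lower()
    if "password" in text or "locked out" in text:
        return "password_reset"
    if "laptop" in text or "device" in text:
        return "device_issue"
    return "general"


def handle(
    message: str,
    run_agent: Callable[[str, str], str],
    post: Callable[[str], None],
) -> str:
    """Classify, route to a system prompt, execute, and post a summary."""
    category = classify(message)
    prompt = SYSTEM_PROMPTS[category]
    result = run_agent(prompt, message)   # tool execution happens in here
    post(f"**{category}**\n\n{result}")   # markdown reply to the source chat
    return category
```

Passing run_agent and post in as parameters keeps the chain testable without Teams or a model behind it.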

Approval flow for high-blast-radius actions

Anything that touches more than one device, runs an arbitrary script, or modifies billing-impacting state hits the approval flow first. The bot posts an Adaptive Card to a dedicated SecOps approver group chat with the proposed action, the blast radius, and a pair of "Approve" / "Deny" buttons. The action is gated on a quorum of peer approvals (configurable per action class). This quorum replaces what used to be hard-coded numeric limits with a human-in-the-loop check that scales with risk.
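The quorum gate can be modelled as a small state machine driven by button clicks. The sketch below is hypothetical throughout: the ApprovalRequest class is not from the source, and two of its rules are assumptions, namely that requesters cannot approve their own action and that a single "Deny" vetoes the request.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    action: str
    requester: str
    quorum: int                      # approvals required for this action class
    approvals: set[str] = field(default_factory=set)
    denied: bool = False

    def record(self, approver: str, approve: bool) -> str:
        """Apply one button click and return the resulting state."""
        if approver == self.requester:
            # Assumption: "peer approval" excludes self-approval.
            raise PermissionError("requesters cannot approve their own action")
        if not approve:
            self.denied = True       # assumption: one Deny vetoes the request
        elif not self.denied:
            self.approvals.add(approver)  # a set ignores repeat clicks
        return self.state

    @property
    def state(self) -> str:
        if self.denied:
            return "denied"
        if len(self.approvals) >= self.quorum:
            return "approved"
        return "pending"
```

Storing approvers in a set means a second click from the same person never counts toward the quorum.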

Run-command threshold tuning

I added per-action thresholds so the friction lands where it matters. Single-device runs go straight through. Anything targeting more than 100 endpoints requires peer approval. Mass deploys to all clients require two approvers from different peer groups. The threshold logic lives in frogger_approvals.needs_approval and can be re-tuned without redeploying the agent.
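A threshold function along these lines would capture the three rules. Only the name needs_approval comes from the text; its signature, the ApprovalRule return type, and the thresholds dictionary are assumptions. The sketch also assumes that "peer approval" means a single approver, and that runs of 2 to 100 endpoints need none, which the text does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApprovalRule:
    approvers: int                  # 0 means the action runs immediately
    distinct_groups: bool = False   # approvers must be from different groups


# Hypothetical default; a re-tunable setup would load this from config
# on each call instead of baking it into the deployed code.
DEFAULT_THRESHOLDS = {"peer_approval_above": 100}


def needs_approval(
    endpoint_count: int,
    all_clients: bool = False,
    thresholds: Optional[dict[str, int]] = None,
) -> ApprovalRule:
    limits = thresholds if thresholds is not None else DEFAULT_THRESHOLDS
    if all_clients:
        # Mass deploy to every client: two approvers from different groups.
        return ApprovalRule(approvers=2, distinct_groups=True)
    if endpoint_count > limits["peer_approval_above"]:
        # Assumption: "peer approval" means one approver.
        return ApprovalRule(approvers=1)
    # Assumption: the text does not say how 2..100 endpoints are treated.
    return ApprovalRule(approvers=0)
```

Passing the thresholds in as data is what makes re-tuning possible without a redeploy: for example, needs_approval(60, thresholds={"peer_approval_above": 50}) asks for one approver even though the default would not.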

Monthly M365 reporting

The same agent layer drives an unattended monthly job that pulls Microsoft 365 license usage, SKU mix, and cost across the partner tenant plus every customer tenant under GDAP. Bedrock summarizes the per-tenant deltas; SES delivers the report at the start of each month. Because the job runs without supervision, I designed it to fail loudly via a separate alert path rather than emit a partial report.
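The "fail loudly, never send a partial report" rule can be sketched as an all-or-nothing collection step. The names build_report and ReportIncomplete and the fetch and alert callables are hypothetical; in the real job they would wrap Graph queries and the separate alert path, neither of which is shown here.

```python
from typing import Callable, Iterable


class ReportIncomplete(RuntimeError):
    """Raised when at least one tenant could not be collected."""


def build_report(
    tenants: Iterable[str],
    fetch: Callable[[str], dict],
    alert: Callable[[str], None],
) -> dict[str, dict]:
    results: dict[str, dict] = {}
    failures: dict[str, str] = {}
    for tenant in tenants:
        try:
            results[tenant] = fetch(tenant)
        except Exception as exc:
            # Keep going so the alert lists every failing tenant at once.
            failures[tenant] = repr(exc)
    if failures:
        failed = sorted(failures)
        alert(f"monthly report aborted; {len(failed)} tenant(s) failed: {failed}")
        # Raise instead of returning the partial results.
        raise ReportIncomplete(", ".join(failed))
    return results
```

Collecting all failures before raising means one alert names every broken tenant, while the caller still never receives a partial result set.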

Selected technical decisions

What I'd build differently if I started today

Stack summary

Python · Claude Agent SDK · AWS Lambda · DynamoDB · API Gateway · EventBridge Scheduler · Bot Framework · Microsoft Graph (GDAP) · ConnectWise Manage REST · NinjaOne v2 · ConnectSecure v4 · Breach Secure Now · Adaptive Cards · AWS Bedrock
