Frogger
A multi-tenant AI assistant on the Claude Agent SDK that auto-resolves thousands of IT tickets every month for a managed-service provider.
What it does
Frogger lives in Microsoft Teams as a chat-and-mention bot. Anyone on the MSP team, or at one of the 50+ client tenants the MSP manages, can ping it. It triages, investigates, and, where allowed, fixes IT tickets autonomously and end to end, with no human in the loop on routine cases. The bot is also embedded in approval-gated workflows for higher-risk actions like mass deploys or fleet-wide password resets.
It runs on the Claude Agent SDK with a custom tool layer that bridges the MSP's actual operational systems — ConnectWise Manage, NinjaOne, ConnectSecure, Microsoft Graph, Breach Secure Now, and several dozen smaller integrations. It reads tickets, queries device state, runs scripts, opens / updates / closes tickets, logs time entries, and surfaces summaries back into the Teams threads where the work originated.
Architecture
Backend
Python on AWS Lambda. DynamoDB for state, API Gateway for the bot webhook, EventBridge Scheduler for everything that runs on a clock (monthly license reports, nightly phishing-result syncs, scheduled scans). Bot Framework + Microsoft Graph on the Teams side; the bot is registered as an Azure AD app with channel-specific RSC permissions granted at install time per chat, so it never has tenant-wide standing access it doesn't actively need.
Agent layer
Claude Agent SDK driving a tool catalog of ~40 tools across read, mutate, and "needs approval" tiers. Tools are grouped by integration (ConnectWise, NinjaOne, ConnectSecure, Graph, Breach Secure Now, AGJ-internal SharePoint) so I can swap one out without touching the others. Each tool has hard-coded input validation plus a permission gate that knows which Teams channel made the call.
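A minimal sketch of how a tool wrapper with input validation and a channel-aware permission gate could look. Everything here (the Tool dataclass, CHANNEL_TIERS, call_tool, the channel names, and the close_ticket example) is hypothetical and only illustrates the pattern; it is not the actual tool layer and does not use the Claude Agent SDK's own tool API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tiers mirroring the read / mutate / "needs approval" split.
READ, MUTATE, NEEDS_APPROVAL = "read", "mutate", "needs_approval"


class ToolDenied(Exception):
    """Raised when the calling channel may not use a tool."""


@dataclass(frozen=True)
class Tool:
    name: str
    tier: str
    validate: Callable[[dict], None]  # raises ValueError on bad input
    run: Callable[[dict], dict]


# Hypothetical policy: which tiers each Teams channel may invoke.
CHANNEL_TIERS = {
    "msp-ops": {READ, MUTATE, NEEDS_APPROVAL},
    "client-helpdesk": {READ},
}


def call_tool(tool: Tool, args: dict, channel_id: str) -> dict:
    """Check the channel's permission, then the input, then run the tool."""
    if tool.tier not in CHANNEL_TIERS.get(channel_id, set()):
        raise ToolDenied(f"{channel_id} may not call {tool.name}")
    tool.validate(args)
    return tool.run(args)


def _validate_ticket_id(args: dict) -> None:
    ticket_id = args.get("ticket_id")
    if not isinstance(ticket_id, int) or ticket_id <= 0:
        raise ValueError("ticket_id must be a positive integer")


# Illustrative tool; the real run() would call the ticketing system.
close_ticket = Tool(
    name="close_ticket",
    tier=MUTATE,
    validate=_validate_ticket_id,
    run=lambda args: {"ticket_id": args["ticket_id"], "status": "closed"},
)
```

Running the gate before validation means an unauthorized channel learns nothing about a tool's input shape, and unknown channels fall through to an empty permission set.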
The agent itself is a chain: classify the incoming Teams message → route to the right system prompt → execute tools → post a markdown summary back into the originating chat. Long-running jobs (multi-tenant scans, fleet queries) detach into background tasks and post completion messages when they finish.
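The chain can be sketched roughly as follows. The categories, prompt text, and keyword classifier are placeholders I made up for illustration, and run_agent and post_to_chat are injected callables standing in for the SDK call and the Teams reply, neither of which is shown here.

```python
from typing import Callable

# Hypothetical prompt table; the real categories and prompts are not shown.
SYSTEM_PROMPTS = {
    "ticket": "You resolve IT tickets.",
    "report": "You build reports.",
    "general": "You answer general questions.",
}


def classify(message: str) -> str:
    """Toy keyword classifier standing in for the real classification step."""
    text = message.lower()
    if "ticket" in text:
        return "ticket"
    if "report" in text:
        return "report"
    return "general"


def handle_message(
    message: str,
    run_agent: Callable[[str, str], str],
    post_to_chat: Callable[[str], None],
) -> str:
    """Classify, pick the system prompt, run the agent, post the summary."""
    category = classify(message)
    summary = run_agent(SYSTEM_PROMPTS[category], message)
    post_to_chat(summary)
    return category
```

Passing the agent runner and the chat poster in as arguments keeps the routing logic testable without a live Teams connection.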
Approval flow for high-blast-radius actions
Anything that touches more than one device, runs an arbitrary script, or modifies billing-impacting state goes through the approval flow first. The bot posts an Adaptive Card to a dedicated SecOps approver group chat showing the proposed action, its blast radius, and "Approve" / "Deny" buttons. The action is gated on a quorum of peer approvals (configurable per action class). This flow replaces what used to be hard-coded numeric limits with a human-in-the-loop check that scales with risk.
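One way the quorum gate behind the card buttons could be modeled is sketched below. The ApprovalRequest class is hypothetical, and two of its rules are my assumptions rather than documented behavior: a requester cannot approve their own action, and a single "Deny" vetoes the request.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    action: str
    requester: str
    quorum: int  # approvals required for this action class
    approvers: set = field(default_factory=set)
    denied: bool = False

    def record(self, user: str, approve: bool) -> str:
        """Record one button press and return the resulting state."""
        if user == self.requester:
            # Assumption: "peer" approval excludes the requester.
            raise ValueError("requester cannot approve their own action")
        if not approve:
            self.denied = True  # assumption: one denial is a veto
        elif not self.denied:
            self.approvers.add(user)  # a set ignores duplicate clicks
        return self.state

    @property
    def state(self) -> str:
        if self.denied:
            return "denied"
        if len(self.approvers) >= self.quorum:
            return "approved"
        return "pending"
```

Storing approvers in a set means a second click from the same person does not count toward the quorum.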
Run-command threshold tuning
I added per-action thresholds so the friction lands where it matters. Single-device runs go straight through. Anything targeting more than 100 endpoints requires peer approval. Mass deploys to all clients require two approvers from different peer groups. The threshold logic lives in frogger_approvals.needs_approval and can be re-tuned without redeploying the agent.
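The tiers above could be expressed along these lines. This is a sketch, not the real frogger_approvals.needs_approval: the signature, the ApprovalRequirement type, and the thresholds dict are assumptions, as is treating "peer approval" as a single approver. Holding the thresholds as data (for example in a DynamoDB item) is one plausible way to re-tune without a redeploy.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical config shape; the real storage and keys are not described.
DEFAULT_THRESHOLDS = {"peer_approval_above": 100}


@dataclass(frozen=True)
class ApprovalRequirement:
    approvers: int
    distinct_groups: bool = False


def needs_approval(
    endpoint_count: int,
    all_clients: bool = False,
    thresholds: Optional[dict] = None,
) -> ApprovalRequirement:
    """Map a proposed run to the approvals it needs."""
    limits = thresholds if thresholds is not None else DEFAULT_THRESHOLDS
    if all_clients:
        # Mass deploy to every client: two approvers from different peer groups.
        return ApprovalRequirement(approvers=2, distinct_groups=True)
    if endpoint_count > limits["peer_approval_above"]:
        # Assumption: "peer approval" means one approver here.
        return ApprovalRequirement(approvers=1)
    return ApprovalRequirement(approvers=0)
```

For example, needs_approval(101) asks for one approver under the default limit, while passing thresholds={"peer_approval_above": 10} makes a 50-endpoint run require approval too.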
Monthly M365 reporting
The same agent layer drives an unattended monthly job that pulls Microsoft 365 license usage, SKU mix, and cost across the partner tenant plus every customer tenant under GDAP. Bedrock summarizes the per-tenant deltas; SES delivers the report at the start of each month. The report runs without supervision — I designed it to fail loudly via a separate alert path rather than emit a partial report.
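The "fail loudly rather than emit a partial report" rule can be sketched as an all-or-nothing collection step. The names build_report, fetch_usage, send_alert, and ReportFailed are hypothetical; the real job's Graph, Bedrock, and SES calls are not shown.

```python
from typing import Callable, Iterable


class ReportFailed(Exception):
    """Raised when any tenant fails, so no partial report is delivered."""


def build_report(
    tenants: Iterable[str],
    fetch_usage: Callable[[str], dict],
    send_alert: Callable[[str], None],
) -> dict:
    """Collect usage for every tenant, or alert and raise if any fetch fails."""
    results, failures = {}, {}
    for tenant in tenants:
        try:
            results[tenant] = fetch_usage(tenant)
        except Exception as exc:  # gather every failure before alerting
            failures[tenant] = repr(exc)
    if failures:
        send_alert(f"M365 report aborted; failed tenants: {sorted(failures)}")
        raise ReportFailed(failures)
    return results
```

Collecting all failures before alerting gives the on-call person the full list in one message instead of one alert per tenant.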
Selected technical decisions
- Claude Agent SDK over building from scratch. The bot's strength is its tool surface and the integration depth, not the reasoning loop. The SDK gave me a battle-tested loop on day one so I could spend time on the parts that are actually differentiated.
- Per-tenant context isolation. Tools that span tenants take an explicit tenant_id parameter; the agent can never accidentally cross client boundaries because the tool layer enforces the scope before the API call goes out.
- Approval-as-a-card, not approval-as-a-form. Adaptive Cards live inside Teams, no context switch, no separate dashboard. Approvers approve in the same chat where they would normally have hand-coordinated the action anyway.
- Auto-resolve audit trail. Every ticket Frogger closes writes a structured audit row to DynamoDB plus a ConnectWise time entry with the agent transcript attached. Anything Frogger does can be reconstructed after the fact, including the exact tool calls and the agent's reasoning summary.
- RDP signing chain. I designed and shipped a per-server code-signing trust chain that suppresses both of Microsoft's RDP warning dialogs (unknown publisher + identity verification) for the MSP fleet without provisioning per-user certificates. It's adjacent to the bot but came out of the same body of work.
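To illustrate the per-tenant isolation decision in the list above, here is a minimal sketch of a decorator that forces every tool call to carry an explicit tenant_id and checks it before the tool body runs. The decorator, the allowed_tenants argument, and list_devices are hypothetical names, not the actual tool layer.

```python
import functools


class TenantScopeError(Exception):
    """Raised when a tool call targets a tenant outside the caller's scope."""


def tenant_scoped(func):
    """Require an explicit tenant_id and check it before the tool body runs."""
    @functools.wraps(func)
    def wrapper(*, tenant_id: str, allowed_tenants: frozenset, **kwargs):
        if tenant_id not in allowed_tenants:
            raise TenantScopeError(f"tenant {tenant_id!r} is out of scope")
        return func(tenant_id=tenant_id, **kwargs)
    return wrapper


@tenant_scoped
def list_devices(*, tenant_id: str) -> list:
    # Placeholder for the real API call, which would be scoped to tenant_id.
    return [f"{tenant_id}-laptop-01"]
```

Because tenant_id is keyword-only and has no default, a call that omits it fails with a TypeError instead of silently falling back to some ambient tenant.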
What I'd build differently if I started today
- A first-class “dry run” mode for every mutating tool. Right now, dry runs are per-tool conventions; they should be a tool-layer primitive.
- Move the approval cards to a webhook-driven serverless adapter rather than the bot itself, so approvals survive a bot restart and can be reviewed asynchronously.
- Push more of the tenant context into structured agent memory instead of recomputing it per turn. Caching wins would be material.
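As a rough idea of what the dry-run primitive from the first item could look like, the sketch below adds a uniform dry_run flag through a decorator so individual tools do not each invent their own convention. The mutating decorator and reboot_device are hypothetical and do not exist in the current system.

```python
import functools


def mutating(func):
    """Give a mutating tool a uniform dry_run flag handled by the tool layer."""
    @functools.wraps(func)
    def wrapper(*args, dry_run: bool = False, **kwargs):
        if dry_run:
            # Describe the call instead of performing it.
            return {
                "dry_run": True,
                "tool": func.__name__,
                "args": args,
                "kwargs": kwargs,
            }
        return func(*args, **kwargs)
    return wrapper


@mutating
def reboot_device(device_id: str) -> dict:
    # Placeholder for the real side effect.
    return {"device_id": device_id, "rebooted": True}
```

Calling reboot_device("pc-1", dry_run=True) returns a description of the intended call without running the tool body, which is the kind of payload an approval card could display.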