What Does $33 of AI Code Generation Buy You?
4,586 lines of Go, 59 minutes, and a cost breakdown nobody publishes
AI code generation is a SaaS with pay-per-use economics. You would never run an AWS workload without Cost Explorer. But most engineers using AI tools have no idea where their tokens go. I instrumented my pipeline, broke down a $33.29 coding session line by line, and found that without caching, the same session would have cost $180 — context loading was consuming most of the budget, not intelligence. The fix was the same one you would apply to any cloud bill: observe, identify, optimize.
$33 for 4,586 Lines of Go
Your cell phone bill is $80 a month. You think you are paying for calls and data. The actual network usage is closer to $15. The rest is spectrum licensing, tower leases, billing systems, regulatory compliance. The overhead is 80% of the bill and you never look at it because the carrier does not break it down for you.
AI code generation has the same cost structure and the same transparency problem. The pitch ends at “10x productivity.” No mention of what the API call cost. No analysis of where the tokens went. No comparison to a developer doing the same work.
An Anthropic researcher recently used 16 Claude agents to build a C compiler from scratch — 100,000 lines of Rust, $20,000 in API tokens, 2 billion input tokens consumed over two weeks [1]. An impressive result that consumed tokens the way Cleopatra consumed milk baths. When you are an AI company demonstrating what your model can do, you do not care what it costs. Your employees use the model internally at zero marginal cost. The $20,000 is a marketing expense, not an engineering budget.
They will start caring about token efficiency when flat-rate subscription users like the ones on Anthropic’s MAX plan consume more than the plan recovers. Until then, the incentive is to show capability, not to optimize cost. That is your problem, not theirs.
If you are paying for the tokens yourself, you should care. Because when you buy $33 of AI code generation, you think you are buying code. What you actually get is the fully loaded model — except the features are a slow engine that stalls on context loading, high-octane fuel burned on timeout loops, and a navigation system that keeps proposing the same wrong turn. You did not ask for any of that. It came with the car.
I am an applied mathematician with an optimization background. I cannot look at a cost structure without wondering if it can be reduced. So I set up an experiment to find out. I am, after all, trained to do exactly that.
The Setup
The project is go-unix-utils: a specification-driven reimplementation of POSIX/GNU utilities in Go. The methodology: extract formal specifications from C source code, write SRDs that capture the exact behavior contracts, then run an autonomous pipeline — cobbler-scaffold — that handles the coding without a human in the loop.
The pipeline has two phases. The planning phase reads the project documentation and proposes implementation tasks — what to build next, in what order, and at what scope. The coding phase takes each task and writes the code. Every invocation is instrumented: tokens consumed, cache hits, cost, wall-clock time, lines produced, turns taken.
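A minimal sketch of what one instrumented invocation record could look like, assuming a hypothetical `Invocation` type. The type, its field names, and the sample numbers are illustrative and are not taken from cobbler-scaffold. It captures the measurements listed above and derives the cache hit rate from them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Invocation is a hypothetical record for one model call.
// Field names are illustrative; they are not cobbler-scaffold's schema.
type Invocation struct {
	Phase             string        `json:"phase"` // "planning" or "coding"
	FreshInputTokens  int           `json:"fresh_input_tokens"`
	CachedInputTokens int           `json:"cached_input_tokens"`
	OutputTokens      int           `json:"output_tokens"`
	CostUSD           float64       `json:"cost_usd"`
	WallClock         time.Duration `json:"wall_clock_ns"`
	LinesProduced     int           `json:"lines_produced"`
	Turns             int           `json:"turns"`
}

// CacheHitRate returns the fraction of input tokens served from cache.
func (inv Invocation) CacheHitRate() float64 {
	total := inv.FreshInputTokens + inv.CachedInputTokens
	if total == 0 {
		return 0
	}
	return float64(inv.CachedInputTokens) / float64(total)
}

func main() {
	// Made-up numbers, for illustration only.
	inv := Invocation{
		Phase:             "coding",
		FreshInputTokens:  7_000,
		CachedInputTokens: 93_000,
		OutputTokens:      4_000,
		CostUSD:           0.42,
		WallClock:         3 * time.Minute,
		LinesProduced:     250,
		Turns:             12,
	}
	line, err := json.Marshal(inv)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(line)) // one JSON line per invocation
	fmt.Printf("cache hit rate: %.2f\n", inv.CacheHitRate())
}
```

Writing one JSON line per invocation keeps the log append-only and easy to aggregate later by phase, task, or session.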
One full session across three batches produced pkg/testutils, cmd/ts, cmd/cat, cmd/wc, cmd/sponge, pkg/sys, and pkg/format. Sixteen cycles total.
The Breakdown
Total cost: $33.29.
Wall-clock time: 59 minutes, unattended.
Lines produced: 4,586 — 2,842 production code, 1,744 tests.
Cost per thousand lines (KLOC): $7.26.
For comparison: Steve McConnell’s Code Complete documents developer productivity on small projects at 50 to 125 lines of delivered, tested code per day [2]. That includes all the overhead — meetings, debugging, code review, documentation, Jira tickets. At a senior engineer’s fully-loaded daily cost of $800 to $1,000, that works out to $6,400 to $20,000 per KLOC. The AI does not go to meetings. Its output is compiled, tested, and committed — in this project, differentially tested against GNU coreutils. The comparison is roughly apples to apples on the output side — and the pipeline is about 1,000x cheaper per delivered line.
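The arithmetic behind these figures can be sketched in a few lines. The function names are hypothetical; the inputs are the numbers quoted above.

```go
package main

import "fmt"

// CostPerKLOC returns dollars per thousand lines of code.
func CostPerKLOC(totalUSD float64, lines int) float64 {
	return totalUSD / (float64(lines) / 1000)
}

// HumanCostPerKLOC converts a fully loaded daily cost and a
// lines-per-day rate into dollars per thousand delivered lines.
func HumanCostPerKLOC(dailyUSD, linesPerDay float64) float64 {
	return dailyUSD * (1000 / linesPerDay)
}

func main() {
	pipeline := CostPerKLOC(33.29, 4586)
	low := HumanCostPerKLOC(800, 125)  // cheapest case: lower rate, higher output
	high := HumanCostPerKLOC(1000, 50) // costliest case: higher rate, lower output

	fmt.Printf("pipeline: $%.2f/KLOC\n", pipeline)
	fmt.Printf("human:    $%.0f to $%.0f/KLOC\n", low, high)
	fmt.Printf("ratio:    %.0fx to %.0fx\n", low/pipeline, high/pipeline)
}
```

Run against these inputs, the ratio lands between roughly 880x and 2,750x, which is where the "about 1,000x" shorthand comes from.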
I know from experience that McConnell’s 50-125 lines/day is conservative. Over a 20-year career I was consistently the fastest developer on every team I worked on — routinely 500-1,000 lines of delivered code per day. I tracked this because I am an optimizer by nature and I wanted to get faster. Nobody else cared about the number. No manager ever asked. No performance review ever mentioned it. Because there is more to being a developer than the lines of code you produce — being a good teammate, mentoring junior engineers, communicating direction to the people around you. That is where the time actually goes, and that is what McConnell’s 50-125 lines/day reflects. Not slow typing. The whole job.
At a fully loaded cost of $800/day, my personal rate was $800-$1,600/KLOC — roughly 10x better than McConnell’s average. But I also spent 40% of my time not coding — meetings, reviews, specifications, coordination — which means my effective coding rate was really $1,300-$2,700/KLOC when you account for the full day. Even benchmarked against a top performer, the pipeline is 200-400x cheaper.
The other objections are predictable. Lines of code is a garbage metric — but McConnell uses it, so the comparison is on his terms, not mine. AI-generated code quality varies — but this code passes the same test suite as the reference implementation. And the comparison does not include the cost of writing specifications, which is real human labor. The $33 buys the coding. It does not buy the thinking that makes the coding possible. Even with all those caveats, the cost difference is not 2x or 5x. It is hundreds of times. That gap is large enough to survive any reasonable adjustment.
What changes the calculation is not the cost per line. It is what you were doing during those 59 minutes. I was reviewing SRDs for the next batch. The pipeline ran unattended. About 40% of my time is not coding — it is specifications, reviews, decomposition, the work that makes the coding possible. The pipeline doesn’t eliminate that 40%. It runs in parallel with it. The cost comparison assumes a developer doing nothing else — but the developer using the pipeline is not idle.
Where the Money Goes
The cost split: planning consumed 26% of the total, coding consumed 74%.
That ratio tells you something. Planning reads documentation and proposes tasks — relatively lightweight. Coding reads the full project context, writes the implementation, and iterates across multiple turns per task. The input token count for coding is dominated by context loading: the architecture documents, the requirements, the existing source files, the test framework, and the accumulated multi-turn conversation history.
Ninety-three percent of input tokens were served from cache. This matters because cached tokens cost one-tenth of fresh input tokens at current Claude API pricing [3]. Without the cache hit rate, the same session would have cost roughly $180 instead of $33.
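A rough sketch of the input-side cost model, assuming only two prices: full price for fresh input tokens and one-tenth of that for cached reads. The function name is hypothetical, and the model deliberately ignores output tokens and any surcharge for writing to the cache.

```go
package main

import "fmt"

// EffectiveInputMultiplier returns the average price of an input token
// relative to the uncached price, given a cache hit rate and the
// relative price of a cached read (0.1 means one-tenth).
func EffectiveInputMultiplier(hitRate, cachedPrice float64) float64 {
	return (1-hitRate)*1.0 + hitRate*cachedPrice
}

func main() {
	m := EffectiveInputMultiplier(0.93, 0.1)
	fmt.Printf("effective input price: %.3f of list price\n", m)
	fmt.Printf("input-side savings:    %.1fx\n", 1/m)
}
```

On input tokens alone this gives a saving of about 6x. The whole-session figure is smaller, $180 against $33 or about 5.4x, presumably because output tokens and other charges do not benefit from the cached-read discount.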
The cost model is essentially a fixed charge for the first read of each document, then near-zero for subsequent reads within the same session. Planning is cheap because it is a single-message interaction: one prompt containing the project state, one structured response containing the task list. The planner does not read files across multiple turns. It gets everything it needs in one shot, and the cache makes the second and third planning cycles nearly free. Coding is expensive because each task is a multi-turn conversation. The agent reads a file, edits it, reads the compiler output, and edits again, with each turn reloading the full project context. Each new task forces a fresh context load. As the project grows (more packages, more SRDs, more existing code), that per-task context load grows with it. The cost curve is not linear. It bends upward: every added task pays for a larger context than the one before it.
The 26/74 split also reflects a difference in failure rates. Planning produces task descriptions: structured text where errors are hard to detect unless a bug surfaces later. Coding produces code that has to compile and pass tests. Planning almost always succeeds. Coding fails often enough that the retries, timeouts, and wasted context loads dominate the cost. The SWE-bench+ research found the same pattern from the other direction: cost-per-resolved-issue ranged from $12 to $655 depending on the agent, not because of price differences but because of resolution rate differences [4]. The arithmetic is simple. An agent resolving 50% of tasks at $12 per attempt costs $24 per resolved issue. An agent resolving 2% at the same per-attempt cost pays $600, the same order of magnitude as the top of that range. The cost is in the failure rate, not the API call.
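The relationship between failure rate and cost is a single division: expected cost per resolved issue equals cost per attempt divided by resolution rate. A small sketch, with a hypothetical function name and the illustrative $12 per attempt used above:

```go
package main

import (
	"errors"
	"fmt"
)

// CostPerResolved returns the expected spend per successfully resolved
// task, given the cost of one attempt and the fraction of attempts
// that succeed.
func CostPerResolved(costPerAttempt, resolutionRate float64) (float64, error) {
	if resolutionRate <= 0 || resolutionRate > 1 {
		return 0, errors.New("resolution rate must be in (0, 1]")
	}
	return costPerAttempt / resolutionRate, nil
}

func main() {
	for _, rate := range []float64{0.50, 0.02} {
		c, err := CostPerResolved(12, rate)
		if err != nil {
			panic(err)
		}
		fmt.Printf("rate %.0f%% -> $%.0f per resolved issue\n", rate*100, c)
	}
}
```

Halving the failure rate does more for this number than any plausible discount on the per-token price.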
What Broke
Three failure modes appeared across sessions.
Output token limits. A hard ceiling of 32,000 output tokens was hit during a later run. A task producing roughly 300 lines of Go with tests crossed the limit mid-session. The agent stopped. The partial output was unusable. Raising the ceiling to 128,000 tokens resolves it, but only by moving the wall, not removing it.
Timeout loops. The coding phase had a 15-minute budget per task. On a run with 295KB of context loaded into the first turn, the agent consumed 10 minutes and 46 seconds processing context before writing a single line. When the task timed out, it restarted with the same context. The loop repeated until the run was killed manually.
Planning re-proposals. A configuration intended to limit task size caused the planning agent to propose the same task repeatedly with minor modifications instead of splitting it into smaller pieces. Nine Claude invocations, three iteration cycles, zero production lines. The planning prompt requested exactly one task per iteration; the validation loop had no mechanism to force decomposition. Cost: $3.92 for nothing.
None of these are capability failures. The model was not confused about Go syntax or the utility specification. Each failure can be fixed with better pipeline design and understanding — a budget that needs adjusting, a context that needs trimming, a prompt that needs tighter constraints. The AI wrote correct code. The orchestration around it needed to catch up.
All three problems have since been resolved. The token ceiling was raised. The timeout loops were fixed by right-sizing tasks so context loading fits within the budget. The re-proposal problem was solved by adding requirement-level state tracking so the planning agent knows what has already been completed. The details of how — and the cost impact of each fix — will be the subject of a follow-up article.
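Until that follow-up, here is a minimal sketch of what requirement-level state tracking could look like. The `Tracker` type and the requirement IDs are hypothetical and do not describe the actual fix. The idea is only that a proposal covering nothing new can be rejected before any coding tokens are spent.

```go
package main

import "fmt"

// Tracker is a hypothetical record of requirement IDs already completed.
type Tracker struct {
	done map[string]bool
}

// NewTracker returns an empty tracker.
func NewTracker() *Tracker {
	return &Tracker{done: make(map[string]bool)}
}

// Complete marks requirement IDs as finished.
func (t *Tracker) Complete(ids ...string) {
	for _, id := range ids {
		t.done[id] = true
	}
}

// Remaining returns the IDs in a proposed task that are not yet done,
// preserving their order.
func (t *Tracker) Remaining(proposed []string) []string {
	var out []string
	for _, id := range proposed {
		if !t.done[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	t := NewTracker()
	t.Complete("cat-R1", "cat-R2") // made-up requirement IDs

	proposal := []string{"cat-R1", "cat-R2", "cat-R3"}
	left := t.Remaining(proposal)
	if len(left) == 0 {
		fmt.Println("reject: proposal covers only completed requirements")
		return
	}
	fmt.Println("still open:", left)
}
```

Feeding the remaining IDs back into the planning prompt gives the planner the same information the validator has, which is what prevents the same task from coming back with minor edits.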
Observe, Identify, Optimize
This is the same exercise you would do with an AWS bill. You open Cost Explorer, find the service consuming 70% of your spend, and ask whether you are using it efficiently. Maybe you migrate from RDS to DynamoDB. Maybe you right-size your instances. The fix is never “use less cloud.” The fix is to understand where the money goes and restructure.
AI code generation is a SaaS with the same pay-per-use economics — except there is no Cost Explorer. AWS gives you dashboards, breakdowns by service, alerts when spending spikes. AI providers give you a total token count and a bill. No breakdown by phase. No visibility into cache hit rates. No way to see which part of your workflow is consuming 74% of your budget unless you build the instrumentation yourself. I built the instrumentation. Here is what it revealed.
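The per-phase breakdown is a small aggregation once each invocation is logged with its phase and cost. A sketch, assuming a hypothetical `Record` type; the sample values are invented and chosen only so the output reproduces a 26/74 split.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a hypothetical per-invocation cost entry.
type Record struct {
	Phase   string
	CostUSD float64
}

// ShareByPhase returns each phase's fraction of total cost.
func ShareByPhase(records []Record) map[string]float64 {
	totals := make(map[string]float64)
	var sum float64
	for _, r := range records {
		totals[r.Phase] += r.CostUSD
		sum += r.CostUSD
	}
	shares := make(map[string]float64, len(totals))
	if sum == 0 {
		return shares
	}
	for phase, c := range totals {
		shares[phase] = c / sum
	}
	return shares
}

func main() {
	// Illustrative records, not the session's real log.
	records := []Record{{"planning", 1.30}, {"coding", 2.20}, {"coding", 1.50}}
	shares := ShareByPhase(records)

	// Sort phase names so the report is stable across runs.
	phases := make([]string, 0, len(shares))
	for p := range shares {
		phases = append(phases, p)
	}
	sort.Strings(phases)
	for _, p := range phases {
		fmt.Printf("%-9s %4.0f%%\n", p, shares[p]*100)
	}
}
```

The same grouping works for any other key in the log, such as task, package, or retry count, which is how a failure mode like the re-proposal loop becomes visible as its own line item.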
Step 1: Observe. The context loading problem explains why the planning/coding split matters. In a session where context loading consumes most of the time budget, the agent cannot produce much. In a session where the task fits the budget cleanly, the agent produces correct output and closes the issue. The same model, the same temperature, the same project. The difference is task sizing.
Step 2: Identify. The 93% cache hit rate is why the cost is manageable. A pipeline that does not cache project documentation is reading the same files at full price every session. At the context sizes this project requires, that is the difference between $33 and $180. Cache efficiency is not a nice-to-have in this cost model. It is most of the budget. This is your RDS-to-DynamoDB moment — same workload, different cost structure, 5x savings.
Step 3: Optimize. Task sizing controls cost more than model selection or prompt engineering. A task that fits within the context budget completes cleanly. A task that exceeds it burns tokens on restarts, timeouts, and re-proposals — the $3.92-for-nothing failure mode. Right-sizing tasks is the AI equivalent of right-sizing EC2 instances.
Step 4: Reduce. The specification quality observation is harder to quantify but shows up clearly in the output. The pipeline cannot improve on what the SRDs say. Where an SRD was precise about error handling or exit codes, the generated code handled them correctly. Where it was vague, the generated code reflected that vagueness. Better specifications reduce rework tokens the way better schema design reduces database queries. The $33.29 includes no cost for requirements, because the requirements were written before the session started. That prior work is not in the $33. It is what makes the $33 possible.
The models are getting more capable but they are not getting proportionally cheaper to run. The context windows are growing, which means you can feed them more, but feeding them more costs more. The AI companies are not breaking this down for you any more than your cell phone carrier breaks down your bill. The only way to understand your cost structure is to instrument your own pipeline, read your own invoices, and find the overhead yourself.
Since this early session, I have reduced costs significantly. The most recent session, run 38, produced 73,648 lines of code (45,789 production, 27,859 tests) for $432, covering 105 commands and 1,395 requirements. Cost per KLOC dropped from $7.26 to $5.87. Planning overhead dropped from 26% to 10%. Cost per requirement dropped from $0.41 to $0.29 between the last two sessions, a 29% reduction achieved through better task sizing, requirement weighting, and shared package architecture.
At $5.87/KLOC vs McConnell’s $6,400-$20,000/KLOC for human developers [2], the gap is three orders of magnitude. I used to be the fastest developer on every team I worked on. The pipeline is 200-400x cheaper than I was.
The generation pipeline — cobbler-scaffold — is open source. The instrumentation data is in the go-unix-utils engineering documentation. For background on the methodology: The Architecture-First Approach, Three Commands to a Crude Orchestrator, and What Level of Autonomy Is Your AI Development Workflow?.
REFERENCES
[1] Carlini, N. (2026). “Building a C Compiler with Claude.” https://nicholas.carlini.com/writing/2025/building-a-c-compiler-with-claude.html
[2] McConnell, S. (2004). Code Complete, 2nd edition. Microsoft Press. Productivity data discussed in Part IV. https://www.oreilly.com/library/view/code-complete-2nd/0735619670/
[3] Anthropic (2026). Claude API Pricing. https://www.anthropic.com/pricing
[4] Aleithan, R., et al. (2024). “SWE-bench+: Enhanced Coding Benchmark for LLMs.” https://arxiv.org/abs/2410.06992

