CodeCaulk: Autonomous Code Improvement Through Local AI Models

Background

Sunbowl is a Shopify development agency managing 500+ production stores. Many of these stores are inherited from store owners or other agencies, and naturally carry technical debt — inconsistent error handling, duplicated logic, and performance inefficiencies that no single sprint could justify fixing. Our own codebases, spanning TypeScript, React, Node.js, Firebase, and Shopify Liquid, had also grown organically over years of client work. Manual code review caught new problems but couldn’t systematically address the existing debt across hundreds of thousands of lines.

The challenge was to see whether we could use AI to continuously improve this code safely, without committing significant resources and without pulling developers off client work.

Results

1,422

Functions scanned across all passes

85

Improvements applied in first week

70%

Raw suggestions filtered by validation gates

Applied improvements included adding error boundaries to async operations, consolidating duplicated fetch-and-retry patterns across hooks, and replacing redundant array scans with indexed lookups in hot paths. As proof that the system was working as intended, 19 suggestions that passed review but failed to apply were stale: the source code had already been modified by earlier successful applies in the same run, or had been removed entirely by a human developer.

Architecture

The pipeline operates in four stages across each monitored repository:

Stage 1 — Scan and Suggest

A local Ollama model (coderscaulk, fine-tuned for this codebase) scans every function and component, making three passes:

  • Resilience — Missing error handling, unguarded async operations, absent null checks
  • Efficiency — Redundant computations, unnecessary re-renders, suboptimal data access patterns
  • DRY — Duplicated logic that could be consolidated

The model operates on individual function chunks with 30 lines of surrounding context, producing structured suggestions with the original code, improved code, and an explanation. Running locally means zero API cost for the most token-intensive stage of the pipeline.
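The structure described above can be sketched in TypeScript. The interface fields and the chunking helper are illustrative assumptions, not the actual CodeCaulk schema:

```typescript
// Hypothetical shape of a structured suggestion; field names are
// illustrative, not the pipeline's real schema.
interface Suggestion {
  file: string;
  pass: "resilience" | "efficiency" | "dry";
  originalCode: string;
  improvedCode: string;
  explanation: string;
}

// Extract a function chunk plus N lines of surrounding context,
// mirroring the 30-line context window described above.
function chunkWithContext(
  lines: string[],
  start: number, // first line of the function (0-based, inclusive)
  end: number,   // last line of the function (exclusive)
  context = 30,
): string {
  const from = Math.max(0, start - context);
  const to = Math.min(lines.length, end + context);
  return lines.slice(from, to).join("\n");
}
```

Clamping at the file boundaries means functions near the top or bottom of a file simply get less context rather than failing.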

Scanning is incremental. Each run tracks the last processed git SHA per repository. On subsequent nights, only files changed since the last scan are reprocessed, keeping cycle times short for active repos.
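A minimal sketch of that incremental step, assuming the last SHA is tracked somewhere per repository (function names and file extensions here are assumptions, not the pipeline's actual API):

```typescript
import { execSync } from "node:child_process";

// Pure helper: reduce `git diff --name-only` output to the source
// files the scanner cares about (the extension list is illustrative).
function scannableFiles(diffOutput: string): string[] {
  return diffOutput
    .split("\n")
    .filter((f) => /\.(ts|tsx|js|jsx|liquid)$/.test(f));
}

// Incremental scan step: diff against the last processed SHA, or
// list every tracked file on a first run.
function filesToScan(repoPath: string, lastSha: string | null): string[] {
  const cmd = lastSha === null
    ? "git ls-files"
    : `git diff --name-only ${lastSha} HEAD`;
  return scannableFiles(execSync(cmd, { cwd: repoPath, encoding: "utf8" }));
}
```

Storing only one SHA per repository keeps the state trivial: after a successful run, record `HEAD`, and the next night's diff picks up exactly the developer activity in between.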

Stage 2 — Validate

Every suggestion passes through a five-gate validation pipeline before any human or AI reviewer sees it:

  1. Syntax — Response structure and format verification
  2. TypeScript — tsc --noEmit compilation against the full project
  3. ESLint — Linting rules enforced
  4. Tests — Existing test suite via Vitest must still pass
  5. Behavioral Equivalence — Automated equivalence test confirms the change doesn’t alter observable behavior

Each gate is a hard pass/fail. A suggestion that introduces a type error, breaks a test, or changes behavior is rejected regardless of how reasonable it looks. This is what we believe makes autonomous application safe — the pipeline is opinionated about correctness, not about style.
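The hard pass/fail chain can be sketched as a short-circuiting loop; the gate runner and its types are hypothetical, standing in for the real tsc/ESLint/Vitest invocations:

```typescript
// Hypothetical gate abstraction: each gate inspects a suggestion and
// returns pass/fail plus an optional detail message.
type GateResult = { gate: string; ok: boolean; detail?: string };
type Gate = (s: { improvedCode: string }) => GateResult;

// Run gates in order; the first failure rejects the suggestion
// immediately — no later gate can rescue it.
function runGates(
  suggestion: { improvedCode: string },
  gates: Gate[],
): { passed: boolean; results: GateResult[] } {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const r = gate(suggestion);
    results.push(r);
    if (!r.ok) return { passed: false, results }; // hard fail: stop here
  }
  return { passed: true, results };
}
```

Short-circuiting also keeps the expensive gates (test suite, behavioral equivalence) from running on suggestions that already failed cheap checks like syntax.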

Stage 3 — Dual-Model Review

Suggestions that survive all five gates are sent to two Claude models independently:

  • Claude Haiku (via API) — Fast, cost-efficient first pass
  • Claude Opus (via CLI, MAX subscription) — Deep reasoning second pass

Both models must independently approve a suggestion for it to proceed. Each reviewer can APPROVE, REJECT, or QUARANTINE, and provides reasoning and a confidence score. This dual-approval strategy catches edge cases that either model alone might miss.
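The approval rule itself is simple to state in code. A minimal sketch, with illustrative type names — only a pair of unanimous APPROVE verdicts lets a suggestion through:

```typescript
// Hypothetical review record; field names are illustrative.
type Verdict = "APPROVE" | "REJECT" | "QUARANTINE";

interface Review {
  model: string;
  verdict: Verdict;
  confidence: number; // 0–1
  reasoning: string;
}

// Dual-approval rule: both reviewers must independently approve.
// A REJECT or QUARANTINE from either model blocks the suggestion.
function bothApprove(first: Review, second: Review): boolean {
  return first.verdict === "APPROVE" && second.verdict === "APPROVE";
}
```

Treating QUARANTINE the same as REJECT at this stage keeps the apply step conservative; quarantined suggestions can still be surfaced for human inspection separately.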

Stage 4 — Apply and Deploy

Approved changes are patched into the source files using a two-level matching strategy (exact match, then whitespace-normalized match). Each change is committed individually to a date-stamped branch (optimized/YYYY-MM-DD) with a descriptive commit message, then pushed to the remote. Slack and email notifications summarize what changed and why.
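The two-level matching strategy can be sketched as follows — a simplified stand-in for the real patcher, returning a character offset or `-1` when the original code is stale:

```typescript
// Collapse all whitespace runs so indentation and line-wrap changes
// don't defeat the match.
function normalizeWs(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// Level 1: exact match. Level 2: whitespace-normalized match over
// line windows of the same height as the original snippet.
function findPatchTarget(fileText: string, originalCode: string): number {
  const exact = fileText.indexOf(originalCode);
  if (exact !== -1) return exact;

  const target = normalizeWs(originalCode);
  const lines = fileText.split("\n");
  const span = originalCode.split("\n").length;
  for (let i = 0; i + span <= lines.length; i++) {
    const window = lines.slice(i, i + span).join("\n");
    if (normalizeWs(window) === target) {
      // Character offset where the matching window's first line starts.
      return lines.slice(0, i).join("\n").length + (i > 0 ? 1 : 0);
    }
  }
  return -1; // stale: the original code no longer exists in the file
}
```

A `-1` here is exactly the "stale suggestion" case discussed under Operational Lessons: the patcher skips the change instead of forcing it onto code that has moved on.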

The optimized branch is never auto-merged. A developer reviews the branch and merges into staging or main when ready. This keeps a human in the loop for the final deployment decision while automating everything before it.

Why Local Models

Volume. Scanning every function across multiple repositories generates thousands of LLM calls per night. At cloud API pricing, exhaustive scanning would cost hundreds of dollars per run. A local model on a Mac Studio runs the same workload at zero marginal cost, making it economically viable to scan aggressively — every function, every file, every night.

Specialization. The local model was fine-tuned on this specific codebase’s patterns, frameworks, and conventions. When it suggests improvements to a React hook that uses Firebase, it understands the project’s actual patterns rather than suggesting generic alternatives.

Privacy. Client code from 500+ Shopify stores never leaves the local machine during the suggestion phase. Only validated, anonymized suggestions reach cloud models during the review stage.

Operational Lessons

The validation pipeline is the product. The local model generates plenty of suggestions that look reasonable but would break compilation, fail tests, or change behavior. The five-gate pipeline rejects roughly 70% of raw suggestions. Without it, autonomous application would be reckless. With it, the model’s imperfect suggestions become a feature — cast a wide net, filter aggressively.

Dual-model review catches what single review misses. Early runs with single-model review (Haiku only) approved changes that were technically correct but contextually wrong — like adding error handling that swallowed errors the UI needed to display. Adding Opus as a second reviewer with independent judgment reduced false approvals significantly.

Incremental scanning is essential. Full scans of a 151-file repo take hours through the local model. SHA-based incremental scanning reduces nightly runs to minutes when only a few files changed, making the pipeline practical as a daily job rather than a weekly batch.

Stale suggestions are a signal, not a bug. When the patcher can’t find the original code to replace, it means the codebase is actively improving — either through earlier applies in the same run or through developer work during the day. The system correctly skips these rather than forcing patches onto changed code.

Stack

  • Suggestion generation — Ollama (local, fine-tuned coderscaulk model)
  • Validation gates — TypeScript compiler, ESLint, Vitest
  • Code review — Claude Haiku (API) + Claude Opus (CLI)
  • Orchestration — TypeScript, Node.js
  • Scheduling — macOS LaunchAgent (nightly at 22:00)
  • Notifications — Slack webhooks, Mailjet email
  • Target codebases — TypeScript, React, Node.js, Firebase, Shopify Liquid

What’s Next

The pipeline currently operates on a fixed set of repositories added through a management dashboard. Expanding to the full portfolio of Shopify store themes — primarily Liquid and JavaScript — is the next target, alongside exploring whether the local model can be fine-tuned on the pipeline’s own approval/rejection data to improve suggestion quality over time.
