Battle for Context: How We Implemented AI Coding in an Enterprise Project

Published: December 30, 2025 at 03:57 PM EST
8 min read
Source: Dev.to

Introduction: A Task Nobody Had Solved

Imagine this: you need to give an analyst the ability to code. Not “write a prompt to ChatGPT,” but actually make changes to an enterprise product with a three‑year history and a million lines of code.

The developer isn’t sitting next to them dictating every line. They set up the environment, control quality, and only intervene when something goes wrong.

Sounds like science fiction? We thought so too—until we tried.

Why This Is Harder Than It Seems

When a programmer uses an AI assistant, they control every step. They see what’s happening under the hood and notice oddities in the code immediately.

With an analyst, everything is different. They see only the result: “the form appeared” or “the form doesn’t work.” Code quality, architectural decisions, potential bugs—all of this remains behind the scenes.

We decided to create a system that compensates for this blindness—a system where AI can’t “cause trouble” even if it really wants to.

Tool Selection: Why Cursor

We tried several options (GitHub Copilot, Claude Code, various API wrappers) and settled on Cursor for several reasons.

Multi‑model Support

Cursor lets us use different models for different tasks:

flowchart LR
    A[Task] --> B{Type?}
    B -->|Planning| C[Claude Opus 4.5]
    B -->|Implementation| D[Gemini 3 Flash]
    C --> E[200K tokens]
    D --> F[1M tokens]
  • Claude Opus 4.5 – architectural planning (smart but “expensive” in tokens)
  • Gemini 3 Flash – implementation (fast, cheap, and most importantly — 1 million tokens of context)

MCP Integration

Model Context Protocol (MCP) is a way to connect external tools to AI.

| Component | Purpose |
| --- | --- |
| Jira | Task management |
| Context7 | Library documentation |
| Memory Bank | Context preservation between sessions |
| Beads | Atomic task tracking |
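
For reference, wiring these servers into Cursor comes down to a small JSON config in `.cursor/mcp.json`. The sketch below is illustrative only: `@upstash/context7-mcp` is the published Context7 server, while the Jira entry is a placeholder you would swap for whichever Jira MCP server you actually use.

```json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    },
    "jira": {
      "command": "npx",
      "args": ["-y", "<your-jira-mcp-server>"]
    }
  }
}
```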

Flexible Rules

Cursor allows creating .mdc files with rules that automatically load depending on context:

  • Working on a React component → the React rules load.
  • Writing a script → the Node.js rules load.
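
For example, a rule file for React components might look roughly like this. The frontmatter keys (`description`, `globs`, `alwaysApply`) follow Cursor’s project-rules format as we use it; the glob and the rule text itself are illustrative, not a prescription.

```md
---
description: Conventions for React components
globs: src/components/**/*.tsx
alwaysApply: false
---

- Use function components and hooks; no class components.
- Touch only the files named in the current task; do not refactor or reformat unrelated code.
- Follow the import order enforced by ESLint.
```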

Security Requirements: Working Locally

Our security team imposed strict requirements: no access to the corporate network during development and no cloning of the production database.
We built a full mocking system on MSW (Mock Service Worker):

flowchart TD
    A[Frontend App] --> B[MSW Interceptor]
    B --> C{Request Type}
    C -->|API Call| D[Mock Handlers]
    C -->|Static| E[Pass‑Through]
    D --> F[Fake Data Generators]
    F --> G[@faker-js]
    D --> H[Response]
  • 50+ handlers for all API endpoints
  • Realistic data generators using @faker-js
  • Full business‑logic emulation
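
As a concrete illustration, here is what one handler looks like. This is a minimal sketch assuming MSW 2.x and @faker-js/faker 8+; the `/api/users` endpoint and the response shape are made up for the example, not our real API.

```ts
// src/mocks/handlers.ts
import { http, HttpResponse } from 'msw';
import { faker } from '@faker-js/faker';

export const handlers = [
  // Intercept GET /api/users and return realistic fake data
  // instead of touching the corporate network or a production DB.
  http.get('/api/users', () => {
    const users = Array.from({ length: 20 }, () => ({
      id: faker.string.uuid(),
      name: faker.person.fullName(),
      email: faker.internet.email(),
      createdAt: faker.date.past().toISOString(),
    }));
    return HttpResponse.json(users);
  }),
];
```

The handlers are registered once at startup with `setupWorker(...handlers)` from `msw/browser`, so the frontend runs entirely against generated data.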

Quality Gates: The Stricter, The Better

Key insight: AI needs strict constraints. Without them, it starts to “create”: sees outdated code → refactors; notices a potential vulnerability → “fixes” it; finds a style mismatch → reformats.
In practice, a simple task like “add a field to a form” can turn into a PR with 100 000 lines.

Our Quality‑Gates Pipeline

flowchart TD
    A[Commit] --> B[commitlint]
    B -->|Pass| C[ESLint]
    B -->|Fail| X[❌ Rejected]
    C -->|Pass| D[TypeScript]
    C -->|Fail| X
    D -->|Pass| E[Vitest]
    D -->|Fail| X
    E -->|Pass| F[Secretlint]
    E -->|Fail| X
    F -->|Pass| G[✅ Push]
    F -->|Fail| X

| Tool | Role |
| --- | --- |
| commitlint | Checks commit‑message format |
| ESLint | Strict TypeScript rules, import order |
| TypeScript | Strict mode, no `any` |
| Vitest | Unit tests must pass |
| Secretlint | Detects accidentally committed secrets |

If the code doesn’t pass these checks, the commit never happens.
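
As an example of the very first gate, the commitlint config stays tiny. This is a sketch: `@commitlint/config-conventional` is the real published preset, and the extra rule is just an illustrative tightening on top of it.

```ts
// commitlint.config.ts
import type { UserConfig } from '@commitlint/types';

const config: UserConfig = {
  extends: ['@commitlint/config-conventional'],
  rules: {
    // Keep subjects short enough to scan in `git log --oneline`
    'header-max-length': [2, 'always', 72],
  },
};

export default config;
```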

The Context Problem: The Main Pain Point

Context is the thing that almost killed the entire project.

  • With a simple 10‑file app, AI sees the whole project and works perfectly.
  • With a million‑line, three‑year‑old codebase, AI only sees a fragment—the tip of the iceberg.

| Project Size | Tokens | AI Effective Work Time |
| --- | --- | --- |
| Tutorial project | 100 K | Unlimited |
| Medium product | 500 K | 2‑3 hours |
| Enterprise (3+ years) | 1 M+ | 20‑30 minutes |

After ~30 minutes, AI starts to “forget,” repeats mistakes, proposes already‑rejected solutions, and breaks things that were just working.
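
To see where your own repository lands in that table, a rough rule of thumb is about four characters per token. The sketch below applies this heuristic to a source tree; the extension filter and the 4‑characters‑per‑token ratio are assumptions, not an exact tokenizer.

```ts
// estimate-tokens.ts — run with: npx tsx estimate-tokens.ts
import { readdirSync, readFileSync, statSync } from 'node:fs';
import { extname, join } from 'node:path';

const SOURCE_EXTENSIONS = new Set(['.ts', '.tsx', '.js', '.jsx', '.css', '.md']);
const SKIP_DIRS = new Set(['node_modules', '.git', 'dist', 'build']);

function countChars(dir: string): number {
  let chars = 0;
  for (const entry of readdirSync(dir)) {
    if (SKIP_DIRS.has(entry)) continue;
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      chars += countChars(path);
    } else if (SOURCE_EXTENSIONS.has(extname(entry))) {
      chars += readFileSync(path, 'utf8').length;
    }
  }
  return chars;
}

// ~4 characters per token is a rough heuristic, not a real tokenizer.
const estimatedTokens = Math.round(countChars(process.cwd()) / 4);
console.log(`≈${estimatedTokens.toLocaleString()} tokens of source code`);
```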

Four Rakes We Stepped On

Rake #1: “It Worked on a Simple Example”

Experiment: Ask an analyst to create a registration form on a clean boilerplate (minimal React project, reference rules, 10 files).

Result: 15 minutes, everything works perfectly.

Same task on the real project: nothing works. AI gets confused by dependencies, uses outdated patterns, and conflicts with existing code.

Lesson: It’s not that AI is “dumb”; it’s the lack of context.

Rake #2: AI “Fixed” the Entire Project

Task: add one feature. AI completed it and:

  • Replaced all any with specific types
  • “Fixed” potential vulnerabilities
  • Reformatted half the project
  • Updated outdated dependencies

Result: PR with 100 000+ lines; GitLab couldn’t even display the diff. We spent two weeks untangling it, and the product was broken.

Lesson: Explicitly limit the scope of AI work with rules.

Rake #3: Token Limitation

We didn’t initially realize how tight the limits are: most models give you at most 100‑200 K tokens of context, and when the codebase far exceeds that, the model can’t see the whole picture, which leads to exactly the problems described above.

200K Tokens – Not Enough for Enterprise

  • For an Enterprise project, 200 K tokens are only enough for 3‑5 iterations.
  • After that the AI starts “forgetting” the beginning of the conversation, proposes solutions you’ve already rejected, and repeats mistakes.

Lesson: For Enterprise work you need models with at least 1 million tokens of context.

Rake #4: Auto‑Mode Is a Trap

Cursor can automatically select a model. It sounds convenient, but in practice it often picks a cheap model with a small context window.

We wasted a lot of time before realizing that for serious work you have to pick the model manually.

Lesson: Use Claude Opus for planning and Gemini Flash for implementation. Never rely on auto‑mode.

How We Solved the Context Problem

After all the “rakes” we built a system that isn’t perfect but works.

Two‑Level Task Tracking

flowchart TD
    subgraph Jira [Jira – Team Level]
        J1[VP‑385: Add Registration Form]
    end

    subgraph Beads [Beads – Atomic Level]
        B1[bd‑1: Review UserForm.tsx]
        B2[bd‑2: Add email field]
        B3[bd‑3: Write test]
    end

    J1 --> B1
    B1 --> B2
    B2 --> B3
  • Jira – top‑level tasks for the team (e.g., VP‑385: Add registration form).
  • Beads – atomic tasks for the AI:
    • bd‑1: Review file UserForm.tsx
    • bd‑2: Add email field
    • bd‑3: Write test

Beads are stored locally and synced with Git, so the AI always knows which step it stopped at.

Memory Bank

An “external memory” for the AI that stores:

  • Current context – what we’re working on now
  • Progress – what’s already done
  • Research – findings and references
  • Archives – completed tasks

When the AI “forgets” something, it can pull the needed information from the Memory Bank and restore its understanding.
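
In practice a Memory Bank is usually just a set of Markdown files the model reads and updates between sessions. One possible layout (illustrative file names, not a fixed standard) looks like this:

```text
memory-bank/
  active-context.md   # what we are working on right now
  progress.md         # what is already done and what is left
  research.md         # findings, links, rejected approaches
  archive/            # one file per completed task
```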

Model Combination

We split the work between two models:

| Model | Role | Reason |
| --- | --- | --- |
| Claude Opus 4.5 | Architect – creates plans, writes specs, conducts reviews | 200 K token window is enough for planning |
| Gemini 3 Flash | Executor – implements code according to the plan | 1 M token window lets it work for hours without losing the thread |

flowchart LR
    A[Task] --> B[Opus: Planning]
    B --> C[Opus: Technical Spec]
    C --> D[Gemini: Implementation]
    D --> E[Opus: Review]
    E -->|Issues| D
    E -->|OK| F[Done]

Cycle: Opus plans → Gemini implements → Opus reviews.

Project Statistics (1.5 weeks on feature/msw-mocks)

| Metric | Value |
| --- | --- |
| Commits | 425 |
| Files changed | 672 |
| Lines added | +85 000 |
| Lines removed | –11 000 |
| Tests added | ~200 |
| Tokens spent | 1.5 billion |

Implemented features

  • ✅ Full MSW mocking system (50+ handlers)
  • ✅ Schedule timeline with Gantt chart
  • ✅ Quality Gates (ESLint, TypeScript, Husky)
  • ✅ Beads integration
  • ✅ 200+ unit tests

Comparison: Traditional Development vs. AI‑Assisted Development

| Parameter | Traditional | With AI |
| --- | --- | --- |
| Time per feature | 2‑3 days | 1.5 weeks* |
| Code quality | Depends on developer | High (Quality Gates) |
| Tests | Often skipped | 200+ automatically |
| Documentation | Often none | Generated |

*Includes infrastructure setup, learning curve, and all the “rakes.”

Important nuance: The first feature is expensive. We spent 1.5 weeks learning the workflow, setting up rules, and stepping on rakes. Subsequent features take ≈10× less time.

Role Evolution

  • Analyst – no longer just “writes specs.” Becomes a junior developer who must understand SQL, work with Git, and read code at a basic level.
  • Developer – no longer just “writes code.” Becomes an architect who focuses on design patterns rather than language syntax. AI can write in Java, Node.js, Python, Go, etc., so developers become universal specialists.

Conclusions & Recommendations

What Works

  • Opus + Gemini combination – smart architect + fast executor
  • Quality Gates – stricter constraints → better results
  • Two‑level tracking – Jira for the team, Beads for the AI
  • Memory Bank – external memory to avoid losing context
  • Data mocking – complete development autonomy

What Doesn’t Work

  • Auto‑mode for model selection
  • AI without constraints (it will “fix” the entire project)
  • Models with context windows < 1 M tokens for Enterprise work

Checklist for Getting Started

  • Set up local development environment
  • Implement Quality Gates (ESLint, strict TypeScript)
  • Create a data‑mocking system
  • Connect MCP (Jira, Context7, Memory Bank)
  • Train analysts on Git and SQL
  • Choose the right models (Opus + Gemini)

Final Thoughts

The battle for context isn’t won yet. Context windows keep growing, but Enterprise projects remain too large for an AI to see them in full. We therefore need systems that help AI maintain focus: task trackers, Memory Bank, and Quality Gates.

We spent 1.5 billion tokens to reach these insights. Hopefully our experience helps you spend far fewer.

What’s your experience with AI coding in large projects? Share in the comments!

About the Author

Working on UI with React. Tools: Cursor IDE, Claude Opus 4.5, Gemini 3 Flash.

Tags: ai cursor enterprise programming devjournal
