Context Engineering Part 2: Advanced Techniques for Production AI

Mastering COMPRESS and ISOLATE, plus real-world production challenges
Recap: Where We Left Off
In Part 1, we covered the four ways AI agents can fail when context isn't managed well, and we learnt about the first two pillars of context engineering:
WRITE - Keep notes and other information outside the context window.
SELECT - Retrieve only the information you need for the current task.
Now let's get into the more advanced methods that separate a promising prototype from an AI system that is ready for production.
Pillar 3: COMPRESS (Reduce Token Usage)
The main idea is to keep the most important information and get rid of or summarise the rest.
Technique 3.1: Hierarchical Summarization
Make summaries with different levels of detail:
The Zoom Lens Approach:
Imagine describing your summer vacation:
Zoom Level 1 - Ultra Wide (5 words):
- "Family beach trip was fun"
Zoom Level 2 - Wide (50 words):
- "Spent two weeks at the beach with my family. We swam every day, built sandcastles, ate ice cream, surfed, and saw dolphins. The best vacation ever!"
Zoom Level 3 - Medium (500 words):
- This is the whole story, including what you learnt while surfing, the funny sandcastle competition, and when you saw dolphins.
Zoom Level 4 - Full Detail (5000 words):
- Everything! Every second, every talk, every little thing, every caption on a photo...
The Smart Part:
Want a quick summary? Use Level 1
Are you sending grandma an email? Use Level 2
Keeping a diary? Level 3 is what you should use.
Making a book of pictures? Level 4
AI does the same thing!
Full Technical Specification (5000 words):
"Our company was founded in 2010 with the mission to revolutionize
cloud testing. Over the years, we've grown from a team of 5 to 500+
employees across 12 countries..."
Medium Summary (500 words):
"Testing platform founded 2010. Team of 500+ across 12 countries.
Processes 10M+ tests daily for 10K+ customers..."
Short Summary (50 words):
"Cloud testing platform. 500+ employees, 10K+ customers, 10M+ daily tests."
Ultra Short (5 words):
"Cloud testing platform, global scale"
Load the amount of detail you need for each job!
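One way that choice might be automated is sketched here in Python. It assumes the summaries were written ahead of time and approximates tokens as whitespace-separated words (a real system would use the model's tokenizer). The names `SUMMARY_LEVELS`, `estimate_tokens`, and `pick_summary` are illustrative and do not come from any library.

```python
# Pre-written summaries of the same document, ordered from most to least detailed.
# The texts are taken from the examples above.
SUMMARY_LEVELS = [
    "Testing platform founded 2010. Team of 500+ across 12 countries. "
    "Processes 10M+ tests daily for 10K+ customers.",
    "Cloud testing platform. 500+ employees, 10K+ customers, 10M+ daily tests.",
    "Cloud testing platform, global scale",
]


def estimate_tokens(text: str) -> int:
    """Rough estimate: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def pick_summary(levels: list[str], budget: int) -> str:
    """Return the most detailed summary that fits the budget.

    Falls back to the shortest summary if nothing fits.
    """
    for text in levels:  # most detailed first
        if estimate_tokens(text) <= budget:
            return text
    return levels[-1]
```

With a generous budget the caller gets the medium summary; with a budget of five words it gets only "Cloud testing platform, global scale".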
Technique 3.2: Sliding Window with Summarization
For long conversations, keep track of the details of recent messages and summarise the older ones.
The Conversation Memory Trick:
Imagine having a 2-hour phone call with your friend:
What You Remember:
Minutes 110-120 (Just Now) - Crystal Clear:
Friend: "So should I get the blue or red shoes?"
You: "Get the blue ones, they match your jacket!"
Friend: "Good point! I'll order them tonight."
Minutes 1-109 (Earlier) - Fuzzy Summary:
"We talked about school, weekend plans, and shopping"
"Friend needs new shoes for the party"
"Budget is around $50"
You DON'T Remember:
Every single word from the first 109 minutes
Exact phrasing of everything
The tangent about weather
The Magic:
Recent stuff (last 10 minutes): Remember everything!
Older stuff (first 109 minutes): Just the important summary
Your brain didn't explode!
Claude Code implements this brilliantly with its "auto-compact" feature, which triggers when the context window reaches 95% capacity.
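The sliding-window idea can be sketched in a few lines of Python. This is a minimal illustration, not how Claude Code's auto-compact is implemented: `compact_history` and `naive_summarize` are hypothetical names, and the summariser is a placeholder where a real system would call a language model.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the last `keep_recent` messages verbatim; replace older ones with one summary."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = f"[Summary of {len(older)} earlier messages] {summarize(older)}"
    return [summary] + recent


def naive_summarize(older: list[str]) -> str:
    """Placeholder summariser: keeps the first five words of each message.

    A production system would call a language model here instead.
    """
    return " / ".join(" ".join(m.split()[:5]) for m in older)
```

The recent messages stay word for word, while everything older collapses into a single line at the top of the history.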
Technique 3.3: Tool Output Compression
Some tools return HUGE outputs. Compress them before adding them to the context:
The "Report Card Summary" Approach:
Think about how your teacher grades 10,000 students on a spreadsheet:
Without Compression (The Overwhelming Way):
Show me all 10,000 students:
Row 1: John Smith, Math: 92, English: 88, Science: 91...
Row 2: Sarah Jones, Math: 85, English: 93, Science: 87...
Row 3: Mike Brown, Math: 78, English: 82, Science: 85...
[... 9,997 more rows ...]
AI Context: EXPLODED! Can't fit!
With Compression (The Smart Summary):
Query returned 10,000 student records.
Key Statistics:
- Average Math score: 84.5
- Average English score: 86.2
- Top 5 students: Sarah (94.3 avg), Mike (93.1 avg)...
- Bottom 5 students: Need tutoring support
- Grade distribution: 15% A's, 35% B's, 40% C's, 10% D's
Sample records:
Row 1: John Smith (90.3 avg) - Excellent
Row 2: Sarah Jones (88.3 avg) - Very Good
Full data saved to: student_grades.xlsx
Result: AI gets the important insights (200 tokens) instead of a lot of raw data (20,000 tokens)!
Compression by Tool Type:
Code Search Results:
Raw: 50 files, 10,000 lines
Compressed: "Found in 5 key files: auth.py (lines 45-120), middleware.py (lines 23-67)..."
Database Query:
Raw: 10,000 rows
Compressed: "10,000 records. Stats: 8,500 active users, 1,500 inactive. Sample: [Row 1, Row 2]"
Log Files:
Raw: 50,000 log entries
Compressed: "23 ERROR logs (15 database timeouts, 5 API limits, 3 memory issues). First: 10:23 AM, Last: 11:42 AM"
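As a rough sketch, the log-file case could look like the following. It assumes each log entry is a dict with "time", "level", and "category" keys, and that times are zero-padded 24-hour strings so they sort correctly as text. The function name `compress_logs` and the entry format are assumptions made for illustration.

```python
from collections import Counter


def compress_logs(entries: list[dict], level: str = "ERROR") -> str:
    """Summarise entries of one level as counts per category plus first and last time.

    Each entry is assumed to be a dict with "time", "level" and "category" keys.
    """
    matching = [e for e in entries if e["level"] == level]
    if not matching:
        return f"0 {level} logs out of {len(entries)} entries."
    counts = Counter(e["category"] for e in matching)
    breakdown = ", ".join(f"{n} {cat}" for cat, n in counts.most_common())
    times = sorted(e["time"] for e in matching)
    return (
        f"{len(matching)} {level} logs out of {len(entries)} entries "
        f"({breakdown}). First: {times[0]}, Last: {times[-1]}"
    )
```

Only the returned string goes into the context; the raw entries can be written to a file and referenced by path, as in the spreadsheet example.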
Technique 3.4: Lossy vs Lossless Compression
Lossless Compression: Get rid of extra data without losing any information
Original: "The user wants to book a flight. The user prefers direct flights.
The user's budget is $500. The user is traveling next week."
Lossless: "User wants direct flight, $500 budget, traveling next week."
Information preserved: 100%
Token reduction: 40%
Lossy Compression: Accept some loss of information to get a big reduction
Original: 50-page technical specification with exact implementation details
Lossy: "System processes payments via Stripe. Supports credit cards,
PayPal, and Apple Pay. Handles refunds within 30 days."
Information preserved: ~60%
Token reduction: 98%
When to use each:
Lossless: Important policies, legal documents, code, and exact requirements
Lossy: General knowledge, background information, examples, and historical context
Pillar 4: ISOLATE (Focused Context per Task)
Main Idea: Split concerns into focused units so that unrelated context doesn't get in the way.
Technique 4.1: Multi-Agent Architecture
Anthropic's multi-agent research system suggests that specialised agents with separate contexts can substantially outperform a single agent on research tasks. In their internal research eval, a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 as subagents outperformed single-agent Claude Opus 4 by 90.2%.
The main point is that "subagents make compression easier by working in parallel with their own context windows and looking at different parts of the question at the same time." You can assign a narrow sub-task to each subagent's context without having to worry about unrelated information getting in the way.
Architecture Pattern:
Think of it like a group project at school:
The Teacher (Orchestrator Agent):
Reads the assignment: "Create a science fair project about volcanoes"
Makes a plan and assigns tasks to different students
The Students (Specialist Agents):
Research Student: Goes to library, finds books about volcanoes
Only carries: Library card, notebook for notes
Doesn't need: Art supplies, poster board (not their job!)
Art Student: Creates the volcano model and poster
Only carries: Paint, clay, poster board
Doesn't need: Library books (already researched!)
Data Student: Analyzes volcano eruption statistics
Only carries: Calculator, graph paper, the research notes
Doesn't need: Art supplies, library books
Quality Check Student: Reviews everything for accuracy
Only carries: the checklist, the completed work
Doesn't need: Any of the original materials
Every student has their small, focused backpack!
The teacher gathers everyone's work at the end and puts it all together to make the final project. Each student only had to remember what they were supposed to do, not the whole project!
Real-World Diagram:
Task: "Write a comprehensive market analysis report"
+-----------------------------------------------------------------+
|                       Orchestrator Agent                        |
|          Context: Task description, plan, coordination          |
+------+----------------+----------------+----------------+-------+
       |                |                |                |
       v                v                v                v
+--------------+ +--------------+ +--------------+ +--------------+
| Research     | | Financial    | | Competitor   | | Synthesis    |
| Agent        | | Agent        | | Agent        | | Agent        |
|              | |              | |              | |              |
| Context:     | | Context:     | | Context:     | | Context:     |
| - Search     | | - Finance    | | - Competitor | | - All        |
|   tools      | |   data       | |   data       | |   summaries  |
| - Market     | | - Metrics    | | - Analysis   | | - Report     |
|   sources    | |   formulas   | |   frameworks | |   template   |
+--------------+ +--------------+ +--------------+ +--------------+
Each agent has isolated, focused context: no interference, no confusion!
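The orchestrator pattern can be sketched in Python as follows. This is an illustration under simple assumptions, not Anthropic's implementation: each subagent is modelled as a plain function that receives only its own private context, and `run_orchestrator` is a hypothetical name. In a real system each function would wrap a model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A subagent maps its own private context to a short summary string.
Subagent = Callable[[dict], str]


def run_orchestrator(
    task: str,
    subagents: dict[str, Subagent],
    contexts: dict[str, dict],
) -> dict[str, str]:
    """Run each subagent in parallel with only the task plus its own context."""

    def run_one(name: str) -> str:
        # Build a fresh context per agent; nothing is shared between agents.
        private_context = {"task": task, **contexts.get(name, {})}
        return subagents[name](private_context)

    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_one, name) for name in subagents}
        return {name: future.result() for name, future in futures.items()}
```

The orchestrator gets back one short summary per agent, which is all a synthesis step needs to see.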
Trade-offs of Multi-Agent Systems:
As Anthropic's research reveals, multi-agent systems have significant benefits and costs:
Benefits:
Dramatic performance improvements (90.2% improvement in Anthropic's research eval)
Parallel execution of independent tasks
Separation of concerns and cleaner context per agent
Can handle tasks exceeding single context windows
Excel at "breadth-first queries that involve pursuing multiple independent directions simultaneously"
Costs:
"Agents usually use about four times as many tokens as chat interactions, and multi-agent systems use about fifteen times as many tokens as chats."
Requires complicated coordination logic
Harder to build and fix
"Compound nature of errors": "One step failing can make agents go down completely different paths."
Without proper prompt engineering, there is a risk of "spawning 50 subagents for simple queries."
When to use multi-agent systems:
Anthropic found that multi-agent systems excel at:
"Valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools"
Open-ended research and exploration tasks
Tasks where multiple independent directions need exploration simultaneously
When NOT to use multi-agent systems:
"Domains that require all agents to share the same context"
Tasks "involving many dependencies between agents"
"Most coding tasks involve fewer truly parallelizable tasks than research"
Simple queries where single-agent is sufficient
Key finding: In Anthropic's BrowseComp evaluation, they found that token usage by itself explains 80% of performance variance. Multi-agent systems work primarily because they "help spend enough tokens to solve the problem" through parallel context windows.
Technique 4.2: Sandboxed Code Execution
HuggingFace's CodeAgent approach shows how to isolate data-heavy operations.
The Sandbox is Like a Workshop:
Imagine you're building a huge LEGO castle:
Without Sandbox (Everything in Your Bedroom):
10,000 LEGO pieces scattered on your bed
Instructions spread across your desk
Half-built towers blocking your closet
Photos of your progress everywhere
Can't even find your homework!
Your bedroom is a disaster!
With Sandbox (Using a Separate Workshop):
Build the entire castle in the garage (workshop/sandbox)
Keep all 10,000 LEGO pieces there
All the mess stays in the garage
When you're done, bring ONE THING to your bedroom:
A photo of the finished castle
A note: "Built awesome castle, used 10,000 pieces, stored in garage"
Your bedroom (AI's context) only sees:
- Small photo (100 KB)
- Short note (50 words)
The garage (sandbox) holds:
Entire castle
All the pieces
All the instructions
Progress photos
Benefits:
Your bedroom stays clean (AI context stays manageable)
You can build huge things (work with massive datasets)
Everything is saved in the garage (data persists)
You can show others just the photo (not the whole castle)
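A minimal sketch of the workshop idea, using only the Python standard library: the heavy work runs in a separate process and only its last line of output comes back. This is not HuggingFace's CodeAgent API, and a child process is not a security boundary; a production sandbox would add a container or similar isolation. The name `run_isolated` is hypothetical.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 10.0) -> str:
    """Run `code` in a separate Python process and return only its last output line."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On failure, surface just the final traceback line, not the whole dump.
    stream = proc.stdout if proc.returncode == 0 else proc.stderr
    lines = stream.strip().splitlines()
    last = lines[-1] if lines else ""
    return last if proc.returncode == 0 else f"error: {last}"


# The 10,000-element list lives and dies in the child process;
# the caller (and the model's context) sees one short line.
heavy_job = "data = list(range(10_000))\nprint('rows', len(data), 'sum', sum(data))"
summary = run_isolated(heavy_job)
```

Here `summary` is the "photo and note": a single line describing the result, while the data itself never enters the context.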
Technique 4.3: State-Based Context Isolation
The Three-Drawer System:
Imagine your desk has three drawers with different rules:
Top Drawer (ALWAYS Open):
Current homework assignment
Today's schedule
What you did in the last 5 minutes
This drawer is always visible. The AI sees this every time.
Middle Drawer (Open ONLY When Needed):
Full conversation history from last week
Research notes from previous projects
Detailed data and analysis
This drawer opens only when specifically asked. Most of the time it stays closed to keep your desk uncluttered.
Bottom Drawer (NEVER Show to AI):
System secrets and passwords
Technical performance stats
Internal tracking numbers
This drawer is locked. The AI never sees what's inside.
Why this works:
AI's "desk" (context) only shows the top drawer (clean and focused!)
Need more info? Open middle drawer temporarily
Never clutter the workspace with locked drawer stuff
Everything is organized and easy to find
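The three drawers map naturally onto a state object whose fields have different visibility rules. The sketch below is one possible shape, with hypothetical names (`AgentState`, `render_context`); frameworks that offer state objects will differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    # Top drawer: included in every prompt.
    current_task: str
    recent_steps: list[str] = field(default_factory=list)
    # Middle drawer: included only when explicitly requested by key.
    archive: dict[str, str] = field(default_factory=dict)
    # Bottom drawer: never rendered into a prompt.
    internal: dict[str, str] = field(default_factory=dict)

    def render_context(self, include_archive: tuple[str, ...] = ()) -> str:
        """Build the text the model sees; `internal` is deliberately never read here."""
        lines = [f"Task: {self.current_task}"]
        lines += [f"Recent: {step}" for step in self.recent_steps]
        for key in include_archive:
            if key in self.archive:
                lines.append(f"Archive[{key}]: {self.archive[key]}")
        return "\n".join(lines)
```

Because only `render_context` produces model-visible text, keeping secrets out of the context becomes a property of one small function rather than a convention spread across the codebase.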
Advanced Context Engineering Patterns
Now that you know what the four pillars are, let's look at some more advanced patterns that are used in production systems:
Pattern 1: Context Tiering
Following industry best practices, organise information according to levels of importance:
The Five-Level Information Tower:
Think of information like floors in a building, where the lower floors are the most essential:
Tier 0 - The Foundation (NEVER expires):
"Who am I?" (The AI's identity)
"What am I allowed to do?" (Safety rules)
"What can I do?" (Core abilities)
Must ALWAYS load - This is like wearing clothes; you never skip it!
Tier 1 - The Ground Floor (Lasts 30 days):
Company policies
Product documentation
How things work
Must ALWAYS load - Like bringing your student ID to school
Tier 2 - Second Floor (Lasts 7 days):
This week's special offers
Temporary promotions
Current A/B tests
Load if backpack has room - Nice to have, not critical
Tier 3 - Third Floor (Lasts 24 hours):
Today's conversation with this user
What we're working on right now
User's preferences for this session
Load if backpack has room - Useful but optional
Tier 4 - The Rooftop (Lasts 5 minutes):
Quick calculations
Temporary results from just now
Things you'll throw away soon
Load if backpack has room - Very temporary
How it works:
Start at the foundation (Tier 0) - must pack this!
Add Ground Floor (Tier 1) - must pack this too!
Got room? Add Second Floor (Tier 2)
Still got room? Add Third Floor (Tier 3)
Any space left? Add Rooftop (Tier 4)
The AI packs its backpack from most important to least important, stopping when the backpack is full!
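The packing order above can be sketched as a small function. It assumes every item arrives with a tier number and a precomputed token count; `ContextItem` and `pack_context` are illustrative names, and expiry times are left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    tier: int    # 0 = foundation ... 4 = rooftop
    text: str
    tokens: int  # token count, assumed to be supplied by the caller


def pack_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Pack items tier by tier; tiers 0 and 1 are mandatory, the rest optional."""
    packed: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.tier):
        if item.tier <= 1:
            packed.append(item)  # mandatory: always packed
            used += item.tokens
        elif used + item.tokens <= budget:
            packed.append(item)  # optional: packed while there is room
            used += item.tokens
        else:
            break  # backpack is full: stop packing
    if used > budget:
        raise ValueError("mandatory tiers alone exceed the token budget")
    return packed
```

Stopping at the first optional item that doesn't fit mirrors the "stop when the backpack is full" rule; a variant could keep scanning for smaller items that still fit.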
Pattern 2: Long-Horizon Conversation Management
Anthropic's production experience provides critical insights for managing extended conversations:
"Production agents often engage in conversations spanning hundreds of turns, requiring careful context management strategies. As conversations extend, standard context windows become insufficient, necessitating intelligent compression and memory mechanisms."
The Relay Race Strategy for Super Long Conversations:
Imagine you're running a marathon (26 miles), but you can only run 5 miles before getting tired:
The Old Way (Doomed to Fail):
Try to run all 26 miles yourself
Get exhausted at mile 5
Collapse! Can't finish
The Smart Way (Relay Race):
Runner 1 (Miles 1-5):
Runs fresh and energetic!
At mile 5: Writes summary note
"Passed 3 water stations"
"Route goes through park, then downtown"
"Current pace: 8 min/mile"
Saves note to locker
Passes baton to Runner 2
Runner 2 (Miles 6-10):
Starts fresh!
Carries: Just the summary note (light!)
Doesn't carry: Every detail from miles 1-5 (too heavy!)
At mile 10: Adds to the note, saves to locker
Passes baton to Runner 3
Runners 3, 4, 5... Continue the pattern
The Magic:
Each runner only remembers their 5-mile section (small backpack!)
Important info saved in locker (external memory)
If needed, any runner can check the locker
The marathon gets finished!
Anthropic's Three-Part Strategy:
Phase Summarisation: "Finished Phase 1: Found 10 sources on topic X" (store summary, forget details)
Fresh Context Spawning: When the backpack is full, a new AI is spawned with a clean backpack and a summary note.
Memory Retrieval: Need more information from Phase 1? Look in the locker! Don't always carry it around.
This is how an AI can sustain a conversation of hundreds of messages without losing track of what has been said!
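Here is a rough Python sketch of the relay, assuming context size is measured in number of messages and the summariser is passed in by the caller. `RelaySession` is a hypothetical class written for this article, not Anthropic's implementation.

```python
from typing import Callable


class RelaySession:
    """A size-limited working context plus an external 'locker' of full phase logs."""

    def __init__(self, max_items: int, summarize: Callable[[list[str]], str]) -> None:
        if max_items < 2:
            raise ValueError("max_items must leave room for a summary and one message")
        self.max_items = max_items
        self.summarize = summarize
        self.context: list[str] = []       # what the current "runner" carries
        self.locker: list[list[str]] = []  # full logs of finished phases

    def add(self, message: str) -> None:
        if len(self.context) >= self.max_items:
            self._hand_off()
        self.context.append(message)

    def _hand_off(self) -> None:
        """Save the full phase to the locker and start fresh with only a summary."""
        note = self.summarize(self.context)
        self.locker.append(self.context)
        self.context = [f"Summary of phase {len(self.locker)}: {note}"]

    def recall(self, phase: int) -> list[str]:
        """Fetch the full log of a finished phase (1-based) from the locker."""
        return self.locker[phase - 1]
```

Each hand-off covers all three parts of the strategy: the phase is summarised, a fresh context starts from the summary note, and the details remain retrievable through `recall`.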
Pattern 3: Context Chunking Strategies
Breaking big documents into smart pieces. Three different strategies:
Strategy 1: Fixed-Size Chunking (Like Pizza Slices)
Imagine cutting a pizza:
Cut into equal slices (8 slices, each the same size)
Simple and predictable
But sometimes you cut through the middle of a pepperoni! (loses meaning)
Strategy 2: Semantic Chunking (Like Chapters in a Book)
Imagine organising a story:
Chapter 1: "The Beginning" (complete thought)
Chapter 2: "The Adventure" (complete thought)
Chapter 3: "The Ending" (complete thought)
Don't cut in the middle of a sentence! Cut where ideas naturally end (like between paragraphs).
Strategy 3: Structure-Aware Chunking (Like Sorting by Type)
Imagine organising your toy room:
For LEGO Sets:
Keep each castle set together (don't mix pieces!)
Keep each car set together
Label each: "Castle Set #42, pieces 1-200"
For Books:
Organize by chapter
Each chapter is one chunk
Label: "Harry Potter, Chapter 5: Diagon Alley"
The Smart Part:
LEGO gets organized by "sets"
Books get organized by "chapters"
Each type gets chunked in the way that makes sense for that type!
This way, when AI searches for "castle", it finds the whole castle set, not random LEGO pieces mixed with car parts!
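The first two strategies can be compared in a short sketch. It measures size in characters and treats blank lines as the natural boundaries between ideas, which is a simplification of true semantic chunking. Both function names are illustrative.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut the text every `size` characters, regardless of meaning."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    """Group whole paragraphs into chunks of at most `max_chars` characters.

    A paragraph longer than `max_chars` becomes its own oversized chunk
    rather than being cut in the middle.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

The fixed-size version will happily cut through the pepperoni; the paragraph version never splits inside a paragraph, at the cost of uneven chunk sizes.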
Pattern 4: Context Caching
The Homework Answer Sheet Strategy:
Imagine your maths homework is really hard. Problem 1 takes you 30 minutes to solve!
Without Caching (The Slow Way):
Monday: Solve Problem 1 (30 minutes)
Tuesday: Teacher asks same question again → Solve Problem 1 again (30 minutes)
Wednesday: Same question AGAIN → Solve Problem 1 again (30 minutes)
Total time: 90 minutes for the same answer!
With Caching (The Smart Way):
Monday: Solve Problem 1 (30 minutes) → Write answer in your notebook
Tuesday: Teacher asks same question → Check notebook (5 seconds!)
Wednesday: Same question → Check notebook (5 seconds!)
Total time: 30 minutes and 10 seconds!
The Expiration Rule:
Fresh answers (from today) → Use from notebook
Old answers (from last month) → Might be wrong now; solve again
When to Throw Away Old Answers:
Teacher changes the problem → Delete old answer and compute new one
New formula introduced → Delete related answers
The answer is more than 1 hour old → Might be outdated; check again
The AI does this with information:
Expensive task (takes 10 seconds) → Save result
Same task again (takes 0.01 seconds) → Use saved result!
1000× faster!
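The notebook with an expiration rule is a time-to-live (TTL) cache. A minimal sketch follows, with a hypothetical `TTLCache` class and an injectable clock so that expiry can be tested without waiting.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache results for `ttl_seconds`; stale or invalidated entries are recomputed."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # fresh answer: read it from the notebook
        value = compute()    # missing or stale: solve the problem again
        self._store[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an answer early, e.g. because the underlying problem changed."""
        self._store.pop(key, None)
```

A cache built with `TTLCache(3600)` corresponds to the one-hour rule above, and `invalidate` covers the "teacher changes the problem" case.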
Production Challenges for Multi-Agent Systems
Building multi-agent systems that work in production requires solving challenges beyond basic context engineering. Anthropic's engineering team shares critical lessons from deploying their research system:
Challenge 1: Stateful Errors Compound
The Problem:
"Agents can run for long periods of time, maintaining state across many tool calls. This means we need to durably execute code and handle errors along the way. Without effective mitigations, minor system failures can be catastrophic for agents."
Unlike traditional software, where you can simply restart on error, agents can't restart from the beginning: it's "expensive and frustrating for users".
The Solution Explained:
The Video Game Save Point Strategy:
Imagine playing a video game with 20 levels:
Without Checkpoints (The Nightmare):
Play from Level 1 to Level 18
Game crashes at Level 18
Start over from Level 1
Takes 2 hours to get back to where you were!
With Checkpoints (The Smart Way):
Level 5 completed → Auto-save!
Level 10 completed → Auto-save!
Level 15 completed → Auto-save!
Game crashes at Level 18
Restart from Level 15 save point!
Only replay 3 levels (10 minutes)
When Things Go Wrong:
Scenario 1 - Tool Breaks:
AI tries to use a hammer
The hammer is broken!
AI says: "Okay, I'll use a screwdriver instead."
Adapts and continues!
Scenario 2 - System Crashes:
Working on Step 18 of 20
System crashes
Load last save (Step 15)
Resume from there, not from Step 1!
Key insight from Anthropic: "Letting the agent know when a tool is failing and letting it adapt works surprisingly well." The AI is smart enough to find another way; just tell it what's broken!
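The save-point strategy can be sketched as a loop that persists its state after every step. The sketch assumes each step is a function from a JSON-serialisable state dict to a new state dict, and it saves after every step rather than every five levels. `run_with_checkpoints` is a hypothetical helper, not part of any agent framework.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    steps: list[Callable[[dict], dict]],
    checkpoint_path: Path,
) -> dict:
    """Run steps in order, saving state after each; resume from the last save."""
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
        next_step, state = saved["next_step"], saved["state"]
    else:
        next_step, state = 0, {}
    for index in range(next_step, len(steps)):
        state = steps[index](state)  # may raise; the previous save stays intact
        checkpoint_path.write_text(
            json.dumps({"next_step": index + 1, "state": state})
        )
    return state
```

If a step raises, calling the function again with the same path skips the steps that already completed and retries only from the failed one.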
Challenge 2: Non-Deterministic Debugging
The Problem:
"Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder."
Users say, "The AI didn't find obvious information" but when you try, it works fine. What happened?
The Solution: The Detective's Notebook (Without Reading Private Diaries)
The problem is like:
Imagine your robot toy sometimes goes left and sometimes goes right, even with the same button press. How do you fix it if you can't predict what it'll do?
The Solution - Track Patterns, Not Content:
Instead of reading every private conversation (creepy!), track the patterns:
What We Track:
Decisions Made:
"Used Google 73% of the time, Wikipedia 20%, ignored other tools 7%"
"Created 3 helper robots on average for complex tasks"
"Chose Strategy A vs Strategy B split: 60/40"
Interaction Patterns:
"Main robot → Helper robot handoff took 2 seconds on average"
"Used Tool 1, then Tool 2, then back to Tool 1 (inefficient!)"
"Context grew from 1000 words → 5000 words → 20,000 words"
Performance Stats:
"Each search took 1.5 seconds"
"Tool X failed 5% of the time"
"Average task: 15 steps, 3 minutes"
Privacy Protected:
We see: "User asked about topic category: Travel"
We DON'T see: "User asked about honeymoon in Paris"
Anthropic emphasizes: "We monitor agent decision patterns and interaction structures, all without monitoring the contents of individual conversations, to maintain user privacy."
The Detective Work:
The pattern shows: When the context is more than 100k words, AI starts repeating old actions.
Fix: Add checkpoint to summarize when reaching 100k
Problem solved! No need to read private conversations.
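A pattern tracker of this kind only needs tool names and outcomes, never the text of a request. The sketch below uses hypothetical names (`PatternTracker`, `usage_share`, `failure_rate`) and is not Anthropic's monitoring stack.

```python
from collections import Counter


class PatternTracker:
    """Record which tools were called and whether they failed, never the call contents."""

    def __init__(self) -> None:
        self.calls = Counter()
        self.failures = Counter()

    def record(self, tool: str, ok: bool) -> None:
        self.calls[tool] += 1
        if not ok:
            self.failures[tool] += 1

    def usage_share(self) -> dict[str, float]:
        """Fraction of all recorded calls that went to each tool."""
        total = sum(self.calls.values())
        return {tool: n / total for tool, n in self.calls.items()} if total else {}

    def failure_rate(self, tool: str) -> float:
        """Fraction of calls to `tool` that failed (0.0 if it was never called)."""
        return self.failures[tool] / self.calls[tool] if self.calls[tool] else 0.0
```

Statistics such as "Tool X failed 5% of the time" fall straight out of these counters, and no conversation content is ever stored.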
Challenge 3: Deployment Coordination
The Problem:
"Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously. This means that whenever we deploy updates, agents might be anywhere in their process."
You can't update all agents simultaneously without breaking running tasks.
The Solution: The Two-Playground Strategy
The Problem Explained:
Imagine a theme park where 100 people are on different rides:
Person 1: Halfway through the rollercoaster
Person 2: Just started the carousel
Person 3: Almost done with the ferris wheel
Now you want to upgrade all the rides with new features. But you can't:
Stop everyone mid-ride (they'd be angry!)
Swap rides while people are on them (dangerous!)
Make everyone start over (frustrating!)
Rainbow Deployment (The Smart Way):
Step 1: Build a second, upgraded theme park next door
Step 2: Make a simple rule:
Anyone CURRENTLY on a ride? → Finish on OLD theme park
Anyone NEW arriving? → Send to NEW theme park
Step 3: Wait patiently.
Old park: People gradually finish and leave
New park: New visitors are having fun with upgrades!
Step 4: When the old park is empty:
Close it down
Everyone's now in the new park!
Nobody's ride was interrupted!
This is exactly how Anthropic deploys updates: "Gradually shifting traffic from old to new versions while keeping both running simultaneously" so no one's work gets interrupted.
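The routing rule from Step 2 can be sketched as a small router that pins each session to the version that was live when it started. This is an illustration of the idea only; `VersionRouter` is a hypothetical class, and real deployments handle this at the infrastructure level.

```python
class VersionRouter:
    """Pin each session to the version that was live when the session started."""

    def __init__(self, live_version: str) -> None:
        self.live_version = live_version
        self._sessions: dict[str, str] = {}

    def deploy(self, new_version: str) -> None:
        # Affects only sessions that start after this call.
        self.live_version = new_version

    def route(self, session_id: str) -> str:
        """Return the version serving this session, assigning one on first contact."""
        return self._sessions.setdefault(session_id, self.live_version)

    def finish(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)

    def retirable(self, version: str) -> bool:
        """True when a non-live version has no sessions left and can be shut down."""
        return version != self.live_version and version not in self._sessions.values()
```

Sessions in flight keep returning their original version from `route`, and `retirable` tells you when the old park is empty.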
Challenge 4: Synchronous Bottlenecks
The Current State:
Anthropic notes that currently their "lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding."
The Problem:
The lead agent can't steer subagents mid-execution
Subagents can't coordinate with each other
The entire system is blocked by the slowest subagent
Missed opportunities for dynamic parallelism
The Future:
Asynchronous execution enabling concurrent work
Agents creating new subagents on-demand
Dynamic coordination during execution
But adds complexity: "result coordination, state consistency, and error propagation"
Lessons from Anthropic's Multi-Agent System
1. Think Like Your Agents
Build simulations with exact prompts and tools, and watch agents work step-by-step. This "immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools."
2. Teach the Orchestrator How to Delegate
Vague instructions like "research the semiconductor shortage" led to duplicated work and gaps. Instead, each subagent needs:
Clear objective
Output format specification
Tool and source guidance
Explicit task boundaries
3. Scale Effort to Query Complexity
Embed scaling rules in prompts:
Simple fact-finding: 1 agent, 3-10 tool calls
Direct comparisons: 2-4 subagents, 10-15 calls each
Complex research: 10+ subagents with divided responsibilities
4. Tool Design is Critical
"Agent-tool interfaces are as critical as human-computer interfaces." The right tool makes tasks efficient; often it's strictly necessary.
5. The Last Mile is Most of the Journey
"Codebases that work on developer machines require significant engineering to become reliable production systems... For all the reasons described in this post, the gap between prototype and production is often wider than anticipated."
Common Mistakes and How to Fix Them
Mistake 1: Treating All Context Equally
Wrong: Load everything with equal priority
Right: Prioritize critical info; load optional info only if space permits
The Backpack Analogy:
Don't pack your winter coat and beach toys equally for a summer trip
Pack summer essentials first; add extras if there's room
Mistake 2: Static Context Management
Wrong: Use the same context for every task
Right: Adapt context to each task's needs
The Analogy:
Don't bring your entire closet to school
Gym class? Bring gym clothes
Art class? Bring art supplies
Math class? Bring a calculator.
Mistake 3: No Context Lifecycle Management
Wrong: Keep adding context forever, never removing
Right: Regularly clean up old, irrelevant context
The Analogy:
Don't keep last week's lunch leftovers in your backpack
Remove old items, add fresh ones
Mistake 4: Ignoring Context Versioning
Wrong: Overwrite information without tracking changes
Right: Keep version history so you can roll back
The Analogy:
Like having "Track Changes" in Word documents
Can see what changed and when
Can undo if something breaks
Mistake 5: No Context Observability
Wrong: Treat context as a black box
Right: Monitor what's in context, measure effectiveness
The Analogy:
Like checking your backpack weight before hiking
Too heavy? Remove something
Missing essentials? Add them
Measuring Success: Is Your Context Engineering Working?
Track these metrics to know if you're on the right track:
Efficiency Metrics
Context Utilisation:
How much of the available context window are you using?
Target: 70-90% (not too empty, not overflowing)
Information Density:
How many unique facts per 1000 tokens?
Higher density = better packing
Retrieval Precision:
How many retrieved chunks were actually used?
Target: >80% precision (don't retrieve junk)
Context Freshness:
Average age of context items
Fewer stale items = better
Redundancy Rate:
How much duplicate information?
Lower redundancy = more efficient
Quality Metrics
Relevance Score:
How much loaded context was actually referenced in the response?
Target: >70% relevance
Sufficiency Score:
Did the AI have enough information to answer properly?
Check for incomplete or uncertain answers
Consistency Score:
Any contradictions in the context?
Detect conflicting information automatically
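Several of these metrics reduce to simple ratios. The sketch below shows three of them under stated assumptions: chunks are identified by string ids, "used" chunks are known from logging, and redundancy only counts exact duplicates after normalising case and whitespace. The function names are illustrative.

```python
def context_utilisation(tokens_used: int, window_size: int) -> float:
    """Fraction of the context window currently in use."""
    return tokens_used / window_size


def retrieval_precision(retrieved: list[str], used: set[str]) -> float:
    """Share of retrieved chunk ids that the response actually used."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in used) / len(retrieved)


def redundancy_rate(chunks: list[str]) -> float:
    """Share of chunks that duplicate an earlier chunk.

    Only exact duplicates are counted, after lowercasing and collapsing whitespace.
    """
    if not chunks:
        return 0.0
    normalised = [" ".join(chunk.lower().split()) for chunk in chunks]
    return 1 - len(set(normalised)) / len(normalised)
```

For example, a context of 80,000 tokens in a 100,000-token window gives a utilisation of 0.8, inside the 70-90% target band.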
How LambdaTest is Applying All Four Pillars
At LambdaTest, we've embraced context engineering as a core principle across our AI-powered systems. Here's our high-level approach:
WRITE:
Critical information is stored in structured formats that enable fast retrieval, efficient filtering, and version tracking.
SELECT:
We implement smart context selection that loads only relevant information per task, uses semantic search for large knowledge bases, and applies metadata filtering.
COMPRESS:
We break complex workflows into focused stages, each with minimal, targeted context, preventing context overflow and improving output quality.
ISOLATE:
We use separation of concerns where different components handle different aspects of workflows, each with clean, focused context boundaries.
The Results:
While we can't share exact numbers, we're seeing:
Dramatically improved accuracy
Significant reduction in processing time
Better cost efficiency
More consistent outputs
Higher user satisfaction
Conclusion: The Art Meets Science
Context engineering is where the art of AI system design meets the science of optimisation.
The Art:
Understanding user needs and workflows
Designing intuitive information architectures
Balancing competing priorities (speed vs accuracy)
Creating elegant solutions to complex problems
The Science:
Measuring token usage and costs
Optimizing retrieval algorithms
Testing different strategies empirically
Analyzing performance data
The evidence is clear: as Drew Breunig's research compilation shows, even frontier models with million-token context windows suffer from context poisoning, distraction, confusion, and clash. Simply having a large context window doesn't solve the problem; you need thoughtful context engineering.
Key Takeaways from Part 2:
COMPRESS saves tokens while preserving meaning
ISOLATE prevents interference between different concerns
Production is hard: prototype success doesn't guarantee production reliability
Measure everything: you can't optimize what you don't track
Learn from failures: track patterns to identify and fix issues
The Four Pillars Together:
WRITE - Organize and save information
SELECT - Retrieve only what's relevant
COMPRESS - Make it smaller without losing meaning
ISOLATE - Separate concerns to prevent interference
Remember: An AI's context window is like a backpack. Pack smart, not heavy.
At LambdaTest, we're committed to applying these principles across our AI-Native systems, continuously pushing the boundaries of what's possible when context is engineered thoughtfully.
Further Reading & References
Essential Resources
How Long Contexts Fail (and How to Fix Them) - Drew Breunig's essential guide covering the four context failure modes with extensive research citations, including DeepMind, Berkeley, Microsoft/Salesforce, and Databricks studies
Context Engineering for AI Agents - A comprehensive guide covering the four pillars (WRITE, SELECT, COMPRESS, ISOLATE) with implementation patterns
Anthropic's Multi-Agent Research System - Deep dive into building production multi-agent systems (90.2% performance improvement)
Anthropic's Effective Context Engineering - Additional strategies from the Claude team
Microsoft's AI Context Engineering Guide - Beginner-friendly introduction
Daffodil Software: Context Engineering Best Practices - Industry best practices
Research Papers & Studies
DeepMind Gemini 2.5 Technical Report: Context poisoning in game-playing agents
Anthropic Multi-Agent Eval: 90.2% performance improvement over single-agent
Berkeley Function-Calling Leaderboard: Every model performs worse with more tools
Microsoft/Salesforce Sharded Prompts Study: 39% performance drop from context clash
Anthropic BrowseComp Evaluation: Token usage explains 80% of performance variance
Hugging Face CodeAgent Paper: Sandboxed execution for context isolation



