Context Engineering Part 2: Advanced Techniques for Production AI

I assist businesses in developing and shipping high-quality software. 🚀

Mastering COMPRESS and ISOLATE, plus real-world production challenges


Recap: Where We Left Off

In Part 1, we talked about the four ways AI agents can fail when context isn't managed well, and we learnt about the first two pillars of context engineering:

  • WRITE – Keep notes with information outside of the context window.

  • SELECT – Get only the information you need for the current task.

Now let's get into the more advanced methods that separate a working prototype from an AI system that is ready for production.


Pillar 3: COMPRESS (Reduce Token Usage)

The main idea is to keep the most important information and get rid of or summarise the rest.

Technique 3.1: Hierarchical Summarization

Make summaries with different levels of detail:

The Zoom Lens Approach:

Imagine describing your summer vacation:

Zoom Level 1 - Ultra Wide (5 words):

  • "Family trip to the beach was fun"

Zoom Level 2 - Wide (50 words):

  • "Spent two weeks at the beach with my family. We swam every day, built sandcastles, ate ice cream, surfed, and saw dolphins. The best vacation ever!"

Zoom Level 3 - Medium (500 words):

  • This is the whole story, including what you learnt while surfing, the funny sandcastle competition, and when you saw dolphins.

Zoom Level 4 - Full Detail (5000 words):

  • Everything! Every second, every talk, every little thing, every caption on a photo...

The Smart Part:

  • Want a quick summary? Use Level 1

  • Are you sending grandma an email? Use Level 2

  • Keeping a diary? Level 3 is what you should use.

  • Making a book of pictures? Level 4

AI does the same thing!

Full Technical Specification (5000 words):
"Our company was founded in 2010 with the mission to revolutionize
cloud testing. Over the years, we've grown from a team of 5 to 500+
employees across 12 countries..."

Medium Summary (500 words):
"Testing platform founded 2010. Team of 500+ across 12 countries.
Processes 10M+ tests daily for 10K+ customers..."

Short Summary (50 words):
"Cloud testing platform. 500+ employees, 10K+ customers, 10M+ daily tests."

Ultra Short (5 words):
"Cloud testing platform, global scale"

Load the amount of detail you need for each job!
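As a minimal sketch of that idea, the snippet below picks the most detailed summary that fits a token budget. The level names, the token counts, and the `pick_summary_level` helper are all illustrative assumptions, not part of any real library.

```python
from typing import List, Tuple

# Hypothetical summaries of one document, ordered from most to least detailed.
# The token counts are rough illustrative guesses, not measurements.
SUMMARY_LEVELS: List[Tuple[str, int]] = [
    ("full", 6500),
    ("medium", 650),
    ("short", 65),
    ("ultra_short", 7),
]


def pick_summary_level(token_budget: int) -> str:
    """Return the most detailed level that fits within the token budget."""
    for name, tokens in SUMMARY_LEVELS:
        if tokens <= token_budget:
            return name
    # Nothing fits: fall back to the smallest level instead of loading nothing.
    return SUMMARY_LEVELS[-1][0]


print(pick_summary_level(1000))  # prints "medium"
```

In practice the summaries would be generated once and stored, so choosing a level at request time costs almost nothing.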

Technique 3.2: Sliding Window with Summarization

For long conversations, keep track of the details of recent messages and summarise the older ones.

The Conversation Memory Trick:

Imagine having a 2-hour phone call with your friend:

What You Remember:

Minutes 110-120 (Just Now) - Crystal Clear:

  • Friend: "So should I get the blue or red shoes?"

  • You: "Get the blue ones, they match your jacket!"

  • Friend: "Good point! I'll order them tonight."

Minutes 1-109 (Earlier) - Fuzzy Summary:

  • "We talked about school, weekend plans, and shopping"

  • "Friend needs new shoes for the party"

  • "Budget is around $50"

You DON'T Remember:

  • Every single word from the first 109 minutes

  • Exact phrasing of everything

  • The tangent about weather

The Magic:

  • Recent stuff (last 10 minutes): Remember everything!

  • Older stuff (first 109 minutes): Just the important summary

  • Your brain didn't explode!

Claude Code implements this with its "auto-compact" feature, which summarises the conversation once usage reaches about 95% of the context window.
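Here is a rough sketch of the sliding-window idea, assuming messages are plain strings and that `summarize` stands in for an LLM call. `compact_history` and `naive_summarize` are hypothetical names, and this is not how any particular product implements it.

```python
from typing import Callable, List


def compact_history(
    messages: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Keep the last `keep_recent` messages verbatim and summarise the rest."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return ["[Summary of earlier conversation] " + summarize(older)] + recent


def naive_summarize(older: List[str]) -> str:
    # Placeholder only: a real system would ask a model for the summary.
    return f"{len(older)} earlier messages omitted"


history = [f"message {i}" for i in range(1, 8)]
compacted = compact_history(history, keep_recent=3, summarize=naive_summarize)
print(compacted)
```

The recent window stays word-for-word, so the model can still quote the last few turns exactly, while everything older costs only one summary line.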

Technique 3.3: Tool Output Compression

Some tools give back HUGE answers. Before adding to the context, compress:

The "Report Card Summary" Approach:

Think about a teacher's spreadsheet with grades for 10,000 students:

Without Compression (The Overwhelming Way):

Show me all 10,000 students:
Row 1: John Smith, Math: 92, English: 88, Science: 91...
Row 2: Sarah Jones, Math: 85, English: 93, Science: 87...
Row 3: Mike Brown, Math: 78, English: 82, Science: 85...
[... 9,997 more rows ...]

AI Context: EXPLODED! Can't fit!

With Compression (The Smart Summary):

Query returned 10,000 student records.

Key Statistics:
- Average Math score: 84.5
- Average English score: 86.2
- Top 5 students: Sarah (94.3 avg), Mike (93.1 avg)...
- Bottom 5 students: Need tutoring support
- Grade distribution: 15% A's, 35% B's, 40% C's, 10% D's

Sample records:
Row 1: John Smith (90.3 avg) - Excellent
Row 2: Sarah Jones (88.3 avg) - Very Good

Full data saved to: student_grades.xlsx

Result: the AI gets the important insights in about 200 tokens instead of the raw data, which at 10,000 rows would run to well over 100,000 tokens!
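A small sketch of this kind of compression, assuming the tool returns a list of dictionaries with a `name` key and numeric score fields. The `compress_rows` helper and the field names are made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def compress_rows(rows: List[Dict], score_fields: List[str], sample_size: int = 2) -> str:
    """Summarise a result set as a row count, per-field averages, and a few samples."""
    if not rows:
        return "Query returned 0 records."
    lines = [f"Query returned {len(rows)} records."]
    for field in score_fields:
        average = mean(row[field] for row in rows)
        lines.append(f"Average {field}: {average:.1f}")
    lines.append("Sample records:")
    for row in rows[:sample_size]:
        lines.append(f"  {row['name']}")
    return "\n".join(lines)


rows = [
    {"name": "John Smith", "math": 92, "english": 88},
    {"name": "Sarah Jones", "math": 85, "english": 93},
    {"name": "Mike Brown", "math": 78, "english": 82},
]
summary = compress_rows(rows, ["math", "english"])
print(summary)
```

Only `summary` goes into the context. The full rows would be written somewhere the agent can reach later if it needs them.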

Compression by Tool Type:

Code Search Results:

  • Raw: 50 files, 10,000 lines

  • Compressed: "Found in 5 key files: auth.py (lines 45-120), middleware.py (lines 23-67)..."

Database Query:

  • Raw: 10,000 rows

  • Compressed: "10,000 records. Stats: 8,500 active users, 1,500 inactive. Sample: [Row 1, Row 2]"

Log Files:

  • Raw: 50,000 log entries

  • Compressed: "23 ERROR logs (15 database timeouts, 5 API limits, 3 memory issues). First: 10:23 AM, Last: 11:42 AM"
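The log case can be sketched the same way. This example assumes each log entry has already been parsed into a `(timestamp, level, category)` tuple and that entries are in time order; the parsing itself and the `compress_error_logs` name are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def compress_error_logs(entries: List[Tuple[str, str, str]]) -> str:
    """Collapse parsed log entries into one line describing the errors."""
    errors = [entry for entry in entries if entry[1] == "ERROR"]
    if not errors:
        return "0 ERROR logs"
    by_category = Counter(category for _, _, category in errors)
    breakdown = ", ".join(f"{count} {category}" for category, count in by_category.most_common())
    first, last = errors[0][0], errors[-1][0]
    return f"{len(errors)} ERROR logs ({breakdown}). First: {first}, Last: {last}"


entries = [
    ("10:23", "ERROR", "database timeout"),
    ("10:30", "INFO", "request ok"),
    ("10:41", "ERROR", "database timeout"),
    ("11:42", "ERROR", "API limit"),
]
print(compress_error_logs(entries))
```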

Technique 3.4: Lossy vs Lossless Compression

Lossless Compression: Get rid of extra data without losing any information

Original: "The user wants to book a flight. The user prefers direct flights.
           The user's budget is $500. The user is traveling next week."

Lossless: "User wants direct flight, $500 budget, traveling next week."

Information preserved: 100%
Token reduction: ~60% (23 words down to 9)

Lossy Compression: Accept some loss of information to get a big reduction

Original: 50-page technical specification with exact implementation details

Lossy: "System processes payments via Stripe. Supports credit cards,
        PayPal, and Apple Pay. Handles refunds within 30 days."

Information preserved: ~60%
Token reduction: 98%

When to use each:

  • Lossless: Important policies, legal documents, code, and exact requirements

  • Lossy: General knowledge, background information, examples, and historical context


Pillar 4: ISOLATE (Focused Context per Task)

Main Idea: To keep unrelated context from interfering with a task, split concerns into focused units.

Technique 4.1: Multi-Agent Architecture

Anthropic's multi-agent research system shows that specialised agents with separate contexts work much better than single-agent systems. Their internal tests showed that "a multi-agent system with Claude Opus 4 as the main agent and Claude Sonnet 4 as subagents did 90.2% better than a single-agent Claude Opus 4."

The main point is that "subagents make compression easier by working in parallel with their own context windows and looking at different parts of the question at the same time." You can assign a narrow sub-task to each subagent's context without having to worry about unrelated information getting in the way.

Architecture Pattern:

Think of it like a group project at school:

The Teacher (Orchestrator Agent):

  • Reads the assignment: "Create a science fair project about volcanoes"

  • Makes a plan and assigns tasks to different students

The Students (Specialist Agents):

  • Research Student: Goes to library, finds books about volcanoes

    • Only carries: Library card, notebook for notes

    • Doesn't need: Art supplies, poster board (not their job!)

  • Art Student: Creates the volcano model and poster

    • Only carries: Paint, clay, poster board

    • Doesn't need: Library books (already researched!)

  • Data Student: Analyzes volcano eruption statistics

    • Only carries: Calculator, graph paper, the research notes

    • Doesn't need: Art supplies, library books

  • Quality Check Student: Reviews everything for accuracy

    • Only carries: the checklist, the completed work

    • Doesn't need: Any of the original materials

Every student has their small, focused backpack!

The teacher gathers everyone's work at the end and puts it all together to make the final project. Each student only had to remember what they were supposed to do, not the whole project!

Real-World Diagram:

Task: "Write a comprehensive market analysis report"

┌──────────────────────────────────────────────────┐
│ Orchestrator Agent                               │
│ Context: Task description, plan, coordination    │
└────┬────────────┬─────────────┬────────────┬─────┘
     │            │             │            │
     ▼            ▼             ▼            ▼
┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐
│Research  │ │Financial │ │Competitor │ │Synthesis │
│Agent     │ │Agent     │ │Agent      │ │Agent     │
│          │ │          │ │           │ │          │
│Context:  │ │Context:  │ │Context:   │ │Context:  │
│-Search   │ │-Finance  │ │-Competitor│ │-All      │
│ tools    │ │ data     │ │ data      │ │ summaries│
│-Market   │ │-Metrics  │ │-Analysis  │ │-Report   │
│ sources  │ │ formulas │ │ frameworks│ │ template │
└──────────┘ └──────────┘ └───────────┘ └──────────┘

Each agent has isolated, focused context – no interference, no confusion!
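To make the isolation concrete, here is a minimal sketch in which each sub-agent owns a private context list and the orchestrator only ever sees short summaries. `SubAgent`, `orchestrate`, and `stub_model` are hypothetical names, and the stub stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubAgent:
    name: str
    run: Callable[[str, List[str]], str]  # (task, private context) -> short summary
    context: List[str] = field(default_factory=list)  # never shared with other agents


def orchestrate(task: str, agents: List[SubAgent]) -> Dict[str, str]:
    """Run each sub-agent on its own context and collect only the summaries."""
    return {agent.name: agent.run(task, agent.context) for agent in agents}


def stub_model(task: str, context: List[str]) -> str:
    # Stand-in for a real LLM call.
    return f"summary built from {len(context)} private context items"


agents = [
    SubAgent("research", stub_model, ["search tools", "market sources"]),
    SubAgent("financial", stub_model, ["finance data", "metrics formulas"]),
    SubAgent("competitor", stub_model, ["competitor data"]),
]
summaries = orchestrate("Write a market analysis report", agents)
print(summaries)
```

This sketch runs the agents one after another for simplicity. A production system would typically run them in parallel and pass the collected summaries to a synthesis step.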

Trade-offs of Multi-Agent Systems:

As Anthropic's research reveals, multi-agent systems have significant benefits and costs:

Benefits:

  • Dramatic performance improvements (90.2% improvement in Anthropic's research eval)

  • Parallel execution of independent tasks

  • Separation of concerns and cleaner context per agent

  • Can handle tasks exceeding single context windows

  • Excel at "breadth-first queries that involve pursuing multiple independent directions simultaneously"

Costs:

  • "Agents usually use about four times as many tokens as chat interactions, and multi-agent systems use about fifteen times as many tokens as chats."

  • Requires complicated coordination logic

  • Harder to build and debug

  • "Compound nature of errors": "One step failing can make agents go down completely different paths."

  • Without proper prompt engineering, there is a risk of "spawning 50 subagents for simple queries."

When to use multi-agent systems:

Anthropic found that multi-agent systems excel at:

  • "Valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools"

  • Open-ended research and exploration tasks

  • Tasks where multiple independent directions need exploration simultaneously

When NOT to use multi-agent systems:

  • "Domains that require all agents to share the same context"

  • Tasks "involving many dependencies between agents"

  • "Most coding tasks involve fewer truly parallelizable tasks than research"

  • Simple queries where single-agent is sufficient

Key finding: In Anthropic's BrowseComp evaluation, they found that token usage by itself explains 80% of performance variance. Multi-agent systems work primarily because they "help spend enough tokens to solve the problem" through parallel context windows.

Technique 4.2: Sandboxed Code Execution

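Hugging Face's CodeAgent approach shows how to isolate data-heavy operations: the agent writes code that runs in a sandbox, and only selected results come back into the context.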

The Sandbox is Like a Workshop:

Imagine you're building a huge LEGO castle:

Without Sandbox (Everything in Your Bedroom):

  • 10,000 LEGO pieces scattered on your bed

  • Instructions spread across your desk

  • Half-built towers blocking your closet

  • Photos of your progress everywhere

  • Can't even find your homework!

  • Your bedroom is a disaster!

With Sandbox (Using a Separate Workshop):

  • Build the entire castle in the garage (workshop/sandbox)

  • Keep all 10,000 LEGO pieces there

  • All the mess stays in the garage

  • When you're done, bring ONE THING to your bedroom:

    • A photo of the finished castle

    • A note: "Built awesome castle, used 10,000 pieces, stored in garage"

Your bedroom (AI's context) only sees:

  • ✅ Small photo (100 KB)

  • ✅ Short note (50 words)

The garage (sandbox) holds:

  • Entire castle

  • All the pieces

  • All the instructions

  • Progress photos

Benefits:

  • Your bedroom stays clean (AI context stays manageable)

  • You can build huge things (work with massive datasets)

  • Everything is saved in the garage (data persists)

  • You can show others just the photo (not the whole castle)
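The data flow can be sketched as follows. Note the assumption: this only illustrates the "leave the bulk in the garage, bring back a note" pattern using a temporary directory. It is not a security sandbox and not Hugging Face's implementation, and `run_in_sandbox` is a made-up helper.

```python
import json
import os
import tempfile
from typing import Dict, List


def run_in_sandbox(rows: List[Dict], workdir: str) -> Dict:
    """Store the bulky result on disk and return only a small handle."""
    path = os.path.join(workdir, "result.json")
    with open(path, "w", encoding="utf-8") as handle_file:
        json.dump(rows, handle_file)
    columns = sorted(rows[0]) if rows else []
    return {"path": path, "row_count": len(rows), "columns": columns}


with tempfile.TemporaryDirectory() as workdir:
    big_result = [{"id": i, "value": i * i} for i in range(10_000)]
    handle = run_in_sandbox(big_result, workdir)
    # Only this short note would be added to the model's context.
    context_note = f"Stored {handle['row_count']} rows with columns {handle['columns']}"

print(context_note)
```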

Technique 4.3: State-Based Context Isolation

The Three-Drawer System:

Imagine your desk has three drawers with different rules:

Top Drawer (ALWAYS Open):

  • Current homework assignment

  • Today's schedule

  • What you did in the last 5 minutes

This drawer is always visible. The AI sees this every time.

Middle Drawer (Open ONLY When Needed):

  • Full conversation history from last week

  • Research notes from previous projects

  • Detailed data and analysis

This drawer opens only when specifically asked. Most of the time it stays closed to keep your desk uncluttered.

Bottom Drawer (NEVER Show to AI):

  • System secrets and passwords

  • Technical performance stats

  • Internal tracking numbers

This drawer is locked. The AI never sees what's inside.

Why this works:

  • AI's "desk" (context) only shows the top drawer (clean and focused!)

  • Need more info? Open middle drawer temporarily

  • Never clutter the workspace with locked drawer stuff

  • Everything is organized and easy to find
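One way to sketch the three drawers is a state object whose fields have different exposure rules. The `AgentState` class and its field names are illustrative assumptions rather than any framework's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


@dataclass
class AgentState:
    always_visible: Dict[str, str] = field(default_factory=dict)  # top drawer
    on_demand: Dict[str, str] = field(default_factory=dict)       # middle drawer
    hidden: Dict[str, str] = field(default_factory=dict)          # bottom drawer

    def build_context(self, requested: Sequence[str] = ()) -> Dict[str, str]:
        """Expose the top drawer plus any explicitly requested middle-drawer keys."""
        context = dict(self.always_visible)
        for key in requested:
            if key in self.on_demand:  # hidden keys are never looked up
                context[key] = self.on_demand[key]
        return context


state = AgentState(
    always_visible={"current_task": "fix login bug"},
    on_demand={"last_week_history": "long transcript"},
    hidden={"api_key": "secret"},
)
print(state.build_context(["last_week_history", "api_key"]))
```

Because `build_context` never reads `hidden`, asking for `api_key` simply returns nothing for that key.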


Advanced Context Engineering Patterns

Now that you know what the four pillars are, let's look at some more advanced patterns that are used in production systems:

Pattern 1: Context Tiering

Following industry best practices, organise information according to levels of importance:

The Five-Level Information Tower:

Think of information like floors in a building – the lower the floor, the more essential it is:

Tier 0 - The Foundation (NEVER expires):

  • "Who am I?" (The AI's identity)

  • "What am I allowed to do?" (Safety rules)

  • "What can I do?" (Core abilities)

  • Must ALWAYS load – This is like wearing clothes; you never skip it!

Tier 1 - The Ground Floor (Lasts 30 days):

  • Company policies

  • Product documentation

  • How things work

  • Must ALWAYS load - Like bringing your student ID to school

Tier 2 - Second Floor (Lasts 7 days):

  • This week's special offers

  • Temporary promotions

  • Current A/B tests

  • Load if backpack has room - Nice to have, not critical

Tier 3 - Third Floor (Lasts 24 hours):

  • Today's conversation with this user

  • What we're working on right now

  • User's preferences for this session

  • Load if backpack has room – useful but optional

Tier 4 - The Rooftop (lasts 5 minutes):

  • Quick calculations

  • Temporary results from just now

  • Things you'll throw away soon

  • Load if backpack has room - Very temporary

How it works:

  1. Start at the foundation (Tier 0) – must pack this!

  2. Add Ground Floor (Tier 1) – must pack this too!

  3. Got room? Add Second Floor (Tier 2)

  4. Still got room? Add Third Floor (Tier 3)

  5. Any space left? Add Rooftop (Tier 4)

The AI packs its backpack from most important to least important, stopping when the backpack is full!
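The packing steps above can be sketched like this, assuming every context item carries a tier number and a token estimate. `ContextItem` and `pack_context` are hypothetical names, and the token figures are invented.

```python
from typing import List, NamedTuple


class ContextItem(NamedTuple):
    tier: int
    text: str
    tokens: int


def pack_context(items: List[ContextItem], budget: int) -> List[ContextItem]:
    """Always pack Tier 0 and Tier 1, then add later tiers until the budget is hit."""
    packed: List[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda entry: entry.tier):
        if item.tier > 1 and used + item.tokens > budget:
            break  # backpack is full: stop at the first optional item that does not fit
        packed.append(item)
        used += item.tokens
    return packed


items = [
    ContextItem(3, "today's conversation", 500),
    ContextItem(0, "identity and safety rules", 100),
    ContextItem(2, "this week's promotions", 300),
    ContextItem(1, "product documentation", 200),
    ContextItem(4, "scratch calculations", 50),
]
packed = pack_context(items, budget=700)
print([item.tier for item in packed])  # prints [0, 1, 2]
```

One design choice to be aware of: the sketch stops at the first optional item that does not fit, matching the "stop when full" rule, even though a smaller lower-priority item might still have fitted.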

Pattern 2: Long-Horizon Conversation Management

Anthropic's production experience provides critical insights for managing extended conversations:

"Production agents often engage in conversations spanning hundreds of turns, requiring careful context management strategies. As conversations extend, standard context windows become insufficient, necessitating intelligent compression and memory mechanisms."

The Relay Race Strategy for Super Long Conversations:

Imagine you're running a marathon (26 miles), but you can only run 5 miles before getting tired:

The Old Way (Doomed to Fail):

  • Try to run all 26 miles yourself

  • Get exhausted at mile 5

  • Collapse! Can't finish

The Smart Way (Relay Race):

Runner 1 (Miles 1-5):

  • Runs fresh and energetic!

  • At mile 5: Writes summary note

    • "Passed 3 water stations"

    • "Route goes through park, then downtown"

    • "Current pace: 8 min/mile"

  • Saves note to locker

  • Passes baton to Runner 2

Runner 2 (Miles 6-10):

  • Starts fresh!

  • Carries: Just the summary note (light!)

  • Doesn't carry: Every detail from miles 1-5 (too heavy!)

  • At mile 10: Adds to the note, saves to locker

  • Passes baton to Runner 3

Runners 3, 4, 5... Continue the pattern

The Magic:

  • Each runner only remembers their 5-mile section (small backpack!)

  • Important info saved in locker (external memory)

  • If needed, any runner can check the locker

  • The marathon gets finished!

Anthropic's Three-Part Strategy:

  1. Phase Summarisation: "Finished Phase 1: Found 10 sources on topic X" (store summary, forget details)

  2. Fresh Context Spawning: When the backpack is full, a new AI is spawned with a clean backpack and a summary note.

  3. Memory Retrieval: Need more information from Phase 1? Look in the locker! Don't always carry it around.

This is how an AI can keep a conversation going for hundreds of messages without losing track of what was said!
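A rough sketch of the relay pattern, under the assumption that context size is counted in items rather than tokens and that `summarize` stands in for a model call. The `run_with_handoffs` function and the `locker` dictionary are illustrative names only.

```python
from typing import Callable, Dict, List


def run_with_handoffs(
    turns: List[str],
    max_items: int,
    summarize: Callable[[List[str]], str],
    locker: Dict[str, List[str]],
) -> List[str]:
    """Process turns in segments; each new segment starts from a handoff note."""
    context: List[str] = []
    for turn in turns:
        if len(context) >= max_items:
            segment_id = f"segment-{len(locker)}"
            locker[segment_id] = list(context)  # full detail goes to external memory
            context = [f"[Handoff note] {summarize(locker[segment_id])}"]
        context.append(turn)
    return context


locker: Dict[str, List[str]] = {}
turns = [f"t{i}" for i in range(1, 8)]
final_context = run_with_handoffs(
    turns, 3, lambda items: f"{len(items)} items summarised", locker
)
print(final_context)
```

The live context never grows past `max_items`, while the locker keeps every segment in full for later retrieval.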

Pattern 3: Context Chunking Strategies

Breaking big documents into smart pieces. Three different strategies:

Strategy 1: Fixed-Size Chunking (Like Pizza Slices)

Imagine cutting a pizza:

  • Cut into equal slices (8 slices, each the same size)

  • Simple and predictable

  • But sometimes you cut through the middle of a pepperoni! (loses meaning)

Strategy 2: Semantic Chunking (Like Chapters in a Book)

Imagine organising a story:

  • Chapter 1: "The Beginning" (complete thought)

  • Chapter 2: "The Adventure" (complete thought)

  • Chapter 3: "The Ending" (complete thought)

Don't cut in the middle of a sentence! Cut where ideas naturally end (like between paragraphs).

Strategy 3: Structure-Aware Chunking (Like Sorting by Type)

Imagine organising your toy room:

For LEGO Sets:

  • Keep each castle set together (don't mix pieces!)

  • Keep each car set together

  • Label each: "Castle Set #42, pieces 1-200"

For Books:

  • Organize by chapter

  • Each chapter is one chunk

  • Label: "Harry Potter, Chapter 5: Diagon Alley"

The Smart Part:

  • LEGO gets organized by "sets"

  • Books get organized by "chapters"

  • Each type gets chunked in the way that makes sense for that type!

This way, when AI searches for "castle", it finds the whole castle set, not random LEGO pieces mixed with car parts!
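To compare the first two strategies in code, the sketch below cuts the same text by fixed size and by paragraph. Splitting on blank lines is only a simple stand-in for semantic chunking, and both function names are assumptions.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Cut every `size` characters, regardless of meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> List[str]:
    """Group whole paragraphs (split on blank lines) up to `max_chars` per chunk."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Volcanoes form at plate boundaries.\n\n"
    "Magma rises through the crust.\n\n"
    "Eruptions release ash and lava."
)
print(fixed_size_chunks(doc, 20)[0])  # prints "Volcanoes form at pl"
print(paragraph_chunks(doc, 80))
```

The fixed-size cut ends mid-word, while the paragraph version always ends on a complete sentence. A paragraph longer than `max_chars` is kept whole in this sketch rather than being split.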

Pattern 4: Context Caching

The Homework Answer Sheet Strategy:

Imagine your maths homework is really hard. Problem 1 takes you 30 minutes to solve!

Without Caching (The Slow Way):

  • Monday: Solve Problem 1 (30 minutes)

  • Tuesday: Teacher asks same question again → Solve Problem 1 again (30 minutes)

  • Wednesday: Same question AGAIN → Solve Problem 1 again (30 minutes)

  • Total time: 90 minutes for the same answer!

With Caching (The Smart Way):

  • Monday: Solve Problem 1 (30 minutes) → Write answer in your notebook

  • Tuesday: Teacher asks same question → Check notebook (5 seconds!)

  • Wednesday: Same question → Check notebook (5 seconds!)

  • Total time: 30 minutes and 10 seconds!

The Expiration Rule:

  • Fresh answers (from today) → Use from notebook

  • Old answers (from last month) → Might be wrong now; solve again

When to Throw Away Old Answers:

  • Teacher changes the problem → Delete old answer and compute new one

  • New formula introduced → Delete related answers

  • The answer is more than 1 hour old → Might be outdated; check again

The AI does this with information:

  • Expensive task (takes 10 seconds) → Save result

  • Same task again (takes 0.01 seconds) → Use saved result!

  • 1000× faster!
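A minimal sketch of a cache with an expiry rule and explicit invalidation. The `TTLCache` class is a hypothetical helper written for this example, and the clock is injectable so the expiry can be shown without waiting.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache results for `ttl_seconds`, recomputing once they are too old."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh answer: reuse it
        value = compute()  # missing or stale: solve again
        self._store[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an answer when the underlying problem changes."""
        self._store.pop(key, None)


fake_now = [0.0]
solve_calls = []


def solve_problem() -> int:
    solve_calls.append(1)
    return 42


cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
cache.get_or_compute("problem-1", solve_problem)
cache.get_or_compute("problem-1", solve_problem)  # served from the notebook
fake_now[0] = 3601.0                              # more than one hour later
cache.get_or_compute("problem-1", solve_problem)  # stale, so solved again
print(len(solve_calls))  # prints 2
```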


Production Challenges for Multi-Agent Systems

Building multi-agent systems that work in production requires solving challenges beyond basic context engineering. Anthropic's engineering team shares critical lessons from deploying their research system:

Challenge 1: Stateful Errors Compound

The Problem:

"Agents can run for long periods of time, maintaining state across many tool calls. This means we need to durably execute code and handle errors along the way. Without effective mitigations, minor system failures can be catastrophic for agents."

Unlike traditional software, where you can simply restart after an error, an agent can't afford to restart from the beginning – that is "expensive and frustrating for users".

The Solution Explained:

The Video Game Save Point Strategy:

Imagine playing a video game with 20 levels:

Without Checkpoints (The Nightmare):

  • Play from Level 1 to Level 18

  • Game crashes at Level 18

  • Start over from Level 1

  • Takes 2 hours to get back to where you were!

With Checkpoints (The Smart Way):

  • ✅ Level 5 completed → Auto-save!

  • ✅ Level 10 completed → Auto-save!

  • ✅ Level 15 completed → Auto-save!

  • Game crashes at Level 18

  • Restart from Level 15 save point!

  • Only replay 3 levels (10 minutes)

When Things Go Wrong:

Scenario 1 - Tool Breaks:

  • AI tries to use a hammer

  • The hammer is broken!

  • AI says: "Okay, I'll use a screwdriver instead."

  • Adapts and continues!

Scenario 2 - System Crashes:

  • Working on Step 18 of 20

  • System crashes

  • Load last save (Step 15)

  • Resume from there, not from Step 1!

Key insight from Anthropic: "Letting the agent know when a tool is failing and letting it adapt works surprisingly well." The AI is smart enough to find another way – just tell it what's broken!
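The save-point idea can be sketched with a small checkpoint file. This assumes steps are plain callables that are safe to re-run, and `run_steps` is a made-up helper, not Anthropic's implementation.

```python
import json
import os
import tempfile
from typing import Callable, List


def run_steps(steps: List[Callable[[], None]], checkpoint_path: str, every: int = 5) -> int:
    """Run steps in order, saving progress every `every` steps.

    Returns the index the run started from, so callers can see where it resumed.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as saved:
            start = json.load(saved)["completed"]
    for index in range(start, len(steps)):
        steps[index]()
        done = index + 1
        if done % every == 0 or done == len(steps):
            with open(checkpoint_path, "w", encoding="utf-8") as saved:
                json.dump({"completed": done}, saved)
    return start


executed: List[int] = []
crash = {"armed": True}


def make_step(i: int) -> Callable[[], None]:
    def step() -> None:
        if i == 17 and crash["armed"]:  # step 18 fails once
            crash["armed"] = False
            raise RuntimeError("simulated crash")
        executed.append(i)
    return step


steps = [make_step(i) for i in range(20)]
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    try:
        run_steps(steps, path)
    except RuntimeError:
        pass
    first_run = list(executed)
    executed.clear()
    resumed_from = run_steps(steps, path)

print(resumed_from)  # prints 15
```

Steps 16 and 17 run twice here because the last save was at step 15, which is why the steps need to be safe to repeat.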

Challenge 2: Non-Deterministic Debugging

The Problem:

"Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder."

Users say, "The AI didn't find obvious information", but when you try the same query, it works fine. What happened?

The Solution: The Detective's Notebook (Without Reading Private Diaries)

Here's an analogy:

Imagine your robot toy sometimes goes left and sometimes goes right, even with the same button press. How do you fix it if you can't predict what it'll do?

The Solution – Track Patterns, Not Content:

Instead of reading every private conversation (creepy!), track the patterns:

What We Track:

Decisions Made:

  • "Used Google 73% of the time, Wikipedia 20%, other tools 7%"

  • "Created 3 helper robots on average for complex tasks"

  • "Chose Strategy A vs Strategy B split: 60/40"

Interaction Patterns:

  • "Main robot → Helper robot handoff took 2 seconds on average"

  • "Used Tool 1, then Tool 2, then back to Tool 1 (inefficient!)"

  • "Context grew from 1000 words → 5000 words → 20,000 words"

Performance Stats:

  • "Each search took 1.5 seconds"

  • "Tool X failed 5% of the time"

  • "Average task: 15 steps, 3 minutes"

Privacy Protected:

  • We see: "User asked about topic category: Travel"

  • We DON'T see: "User asked about honeymoon in Paris"

Anthropic emphasizes: "We monitor agent decision patterns and interaction structures – all without monitoring the contents of individual conversations, to maintain user privacy."

The Detective Work:

  • The pattern shows: When the context is more than 100k words, AI starts repeating old actions.

  • Fix: Add checkpoint to summarize when reaching 100k

  • Problem solved! No need to read private conversations.
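As a sketch of pattern-only tracking, the snippet below aggregates which tools were called without ever storing or reading what was asked. The event shape and the `summarize_tool_usage` helper are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def summarize_tool_usage(events: List[Dict]) -> Dict[str, float]:
    """Return each tool's share of calls as a percentage."""
    counts = Counter(event["tool"] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: round(100 * count / total, 1) for tool, count in counts.items()}


# Events carry only metadata (tool name, duration), never the conversation text.
events = [
    {"tool": "search", "duration_s": 1.5},
    {"tool": "search", "duration_s": 1.2},
    {"tool": "wiki", "duration_s": 0.8},
    {"tool": "search", "duration_s": 1.9},
]
print(summarize_tool_usage(events))  # prints {'search': 75.0, 'wiki': 25.0}
```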

Challenge 3: Deployment Coordination

The Problem:

"Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously. This means that whenever we deploy updates, agents might be anywhere in their process."

You can't update all agents simultaneously without breaking running tasks.

The Solution: The Two-Playground Strategy

The Problem Explained:

Imagine a theme park where 100 people are on different rides:

  • Person 1: Halfway through the rollercoaster

  • Person 2: Just started the carousel

  • Person 3: Almost done with the ferris wheel

Now you want to upgrade all the rides with new features. But you can't:

  • Stop everyone mid-ride (they'd be angry!)

  • Swap rides while people are on them (dangerous!)

  • Make everyone start over (frustrating!)

Rainbow Deployment (The Smart Way):

Step 1: Build a second, upgraded theme park next door

Step 2: Make a simple rule:

  • Anyone CURRENTLY on a ride? → Finish in the OLD theme park

  • Anyone NEW arriving? → Send to the NEW theme park

Step 3: Wait patiently.

  • Old park: People gradually finish and leave

  • New park: New visitors are having fun with upgrades!

Step 4: When the old park is empty:

  • Close it down

  • Everyone's now in the new park!

Nobody's ride was interrupted!

This is exactly how Anthropic deploys updates: "Gradually shifting traffic from old to new versions while keeping both running simultaneously" so no one's work gets interrupted.
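The routing rule from Step 2 can be sketched as a router that pins each running session to the version it started on. `VersionRouter` is a hypothetical class for illustration, not a description of Anthropic's deployment tooling.

```python
from typing import Dict


class VersionRouter:
    """Pin running sessions to their starting version; send new ones to the latest."""

    def __init__(self, current_version: str):
        self.current_version = current_version
        self._pinned: Dict[str, str] = {}

    def deploy(self, new_version: str) -> None:
        self.current_version = new_version

    def route(self, session_id: str) -> str:
        # Existing sessions keep their version; new sessions get the current one.
        return self._pinned.setdefault(session_id, self.current_version)

    def finish(self, session_id: str) -> None:
        self._pinned.pop(session_id, None)

    def can_retire(self, version: str) -> bool:
        """An old version can be shut down once no session is pinned to it."""
        return version != self.current_version and version not in self._pinned.values()


router = VersionRouter("v1")
router.route("session-a")          # starts on v1
router.deploy("v2")
print(router.route("session-a"))   # prints "v1"
print(router.route("session-b"))   # prints "v2"
```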

Challenge 4: Synchronous Bottlenecks

The Current State:

Anthropic notes that currently their "lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding."

The Problem:

  • The lead agent can't steer subagents mid-execution

  • Subagents can't coordinate with each other

  • The entire system is blocked by the slowest subagent

  • Missed opportunities for dynamic parallelism

The Future:

  • Asynchronous execution enabling concurrent work

  • Agents creating new subagents on-demand

  • Dynamic coordination during execution

  • But adds complexity: "result coordination, state consistency, and error propagation"

Lessons from Anthropic's Multi-Agent System

1. Think Like Your Agents

Build simulations with exact prompts and tools, and watch agents work step-by-step. This "immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools."

2. Teach the Orchestrator How to Delegate

Vague instructions like "research the semiconductor shortage" led to duplicated work and gaps. Instead, each subagent needs:

  • Clear objective

  • Output format specification

  • Tool and source guidance

  • Explicit task boundaries
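One way to make sure every delegation carries those four elements is to build the subagent prompt from a small structure. The `SubagentBrief` class and the example values below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubagentBrief:
    objective: str
    output_format: str
    tools_and_sources: List[str]
    boundaries: str

    def to_prompt(self) -> str:
        """Render the brief as the instruction block given to one subagent."""
        return "\n".join([
            f"Objective: {self.objective}",
            f"Output format: {self.output_format}",
            f"Tools and sources: {', '.join(self.tools_and_sources)}",
            f"Task boundaries: {self.boundaries}",
        ])


brief = SubagentBrief(
    objective="List the main causes of the semiconductor shortage",
    output_format="Five bullet points, each with one source",
    tools_and_sources=["web search", "industry reports"],
    boundaries="Causes only; another subagent covers the economic impact",
)
print(brief.to_prompt())
```

Since all four fields are required, a vague one-line delegation fails at construction time instead of producing duplicated work later.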

3. Scale Effort to Query Complexity

Embed scaling rules in prompts:

  • Simple fact-finding: 1 agent, 3-10 tool calls

  • Direct comparisons: 2-4 subagents, 10-15 calls each

  • Complex research: 10+ subagents with divided responsibilities

4. Tool Design is Critical

"Agent-tool interfaces are as critical as human-computer interfaces." The right tool makes tasks efficient; often it's strictly necessary.

5. The Last Mile is Most of the Journey

"Codebases that work on developer machines require significant engineering to become reliable production systems... For all the reasons described in this post, the gap between prototype and production is often wider than anticipated."


Common Mistakes and How to Fix Them

Mistake 1: Treating All Context Equally

❌ Wrong: Load everything with equal priority

✅ Right: Prioritize critical info; load optional info only if space permits

The Backpack Analogy:

  • Don't pack your winter coat and beach toys equally for a summer trip

  • Pack summer essentials first; add extras if there's room

Mistake 2: Static Context Management

❌ Wrong: Use the same context for every task

✅ Right: Adapt context to each task's needs

The Analogy:

  • Don't bring your entire closet to school

  • Gym class? Bring gym clothes

  • Art class? Bring art supplies

  • Math class? Bring a calculator.

Mistake 3: No Context Lifecycle Management

❌ Wrong: Keep adding context forever, never removing

✅ Right: Regularly clean up old, irrelevant context

The Analogy:

  • Don't keep last week's lunch leftovers in your backpack

  • Remove old items, add fresh ones

Mistake 4: Ignoring Context Versioning

❌ Wrong: Overwrite information without tracking changes

✅ Right: Keep version history so you can roll back

The Analogy:

  • Like having "Track Changes" in Word documents

  • Can see what changed and when

  • Can undo if something breaks

Mistake 5: No Context Observability

❌ Wrong: Treat context as a black box

✅ Right: Monitor what's in context, measure effectiveness

The Analogy:

  • Like checking your backpack weight before hiking

  • Too heavy? Remove something

  • Missing essentials? Add them


Measuring Success: Is Your Context Engineering Working?

Track these metrics to know if you're on the right track:

Efficiency Metrics

Context Utilisation:

  • How much of the available context window are you using?

  • Target: 70-90% (not too empty, not overflowing)

Information Density:

  • How many unique facts per 1000 tokens?

  • Higher density = better packing

Retrieval Precision:

  • How many retrieved chunks were actually used?

  • Target: >80% precision (don't retrieve junk)

Context Freshness:

  • Average age of context items

  • Fewer stale items = better

Redundancy Rate:

  • How much duplicate information?

  • Lower redundancy = more efficient
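Three of these metrics are simple ratios, sketched below. The function names are illustrative, how you decide that a chunk was "used" is left as an assumption, and the redundancy check only catches exact duplicates.

```python
from typing import List, Set


def context_utilisation(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window that is filled."""
    return used_tokens / window_tokens


def retrieval_precision(retrieved_ids: List[str], used_ids: Set[str]) -> float:
    """Fraction of retrieved chunks that were actually used in the answer."""
    if not retrieved_ids:
        return 0.0
    used = sum(1 for chunk_id in retrieved_ids if chunk_id in used_ids)
    return used / len(retrieved_ids)


def redundancy_rate(chunks: List[str]) -> float:
    """Fraction of chunks that are exact duplicates of another chunk."""
    if not chunks:
        return 0.0
    return 1 - len(set(chunks)) / len(chunks)


print(context_utilisation(160_000, 200_000))                            # prints 0.8
print(retrieval_precision(["a", "b", "c", "d", "e"], {"a", "b", "c", "d"}))  # prints 0.8
print(redundancy_rate(["x", "y", "x", "z"]))                            # prints 0.25
```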

Quality Metrics

Relevance Score:

  • How much loaded context was actually referenced in the response?

  • Target: >70% relevance

Sufficiency Score:

  • Did the AI have enough information to answer properly?

  • Check for incomplete or uncertain answers

Consistency Score:

  • Any contradictions in the context?

  • Detect conflicting information automatically


How LambdaTest is Applying All Four Pillars

At LambdaTest, we've embraced context engineering as a core principle across our AI-powered systems. Here's our high-level approach:

WRITE:

Critical information is stored in structured formats that enable fast retrieval, efficient filtering, and version tracking.

SELECT:

We implement smart context selection that loads only relevant information per task, uses semantic search for large knowledge bases, and applies metadata filtering.

COMPRESS:

We break complex workflows into focused stages, each with minimal, targeted context, preventing context overflow and improving output quality.

ISOLATE:

We use separation of concerns where different components handle different aspects of workflows, each with clean, focused context boundaries.

The Results:

While we can't share exact numbers, we're seeing:

  • Dramatically improved accuracy

  • Significant reduction in processing time

  • Better cost efficiency

  • More consistent outputs

  • Higher user satisfaction


Conclusion: The Art Meets Science

Context engineering is where the art of AI system design meets the science of optimisation.

The Art:

  • Understanding user needs and workflows

  • Designing intuitive information architectures

  • Balancing competing priorities (speed vs accuracy)

  • Creating elegant solutions to complex problems

The Science:

  • Measuring token usage and costs

  • Optimizing retrieval algorithms

  • Testing different strategies empirically

  • Analyzing performance data

The evidence is clear: as Drew Breunig's research compilation shows, even frontier models with million-token context windows suffer from context poisoning, distraction, confusion, and clash. Simply having a large context window doesn't solve the problem – you need thoughtful context engineering.

Key Takeaways from Part 2:

  1. COMPRESS saves tokens while preserving meaning

  2. ISOLATE prevents interference between different concerns

  3. Production is hard – prototype success doesn't guarantee production reliability

  4. Measure everything – you can't optimize what you don't track

  5. Learn from failures – track patterns to identify and fix issues

The Four Pillars Together:

  • WRITE – Organize and save information

  • SELECT – Retrieve only what's relevant

  • COMPRESS – Make it smaller without losing meaning

  • ISOLATE – Separate concerns to prevent interference

Remember: An AI's context window is like a backpack. Pack smart, not heavy.

At LambdaTest, we're committed to applying these principles across our AI-Native systems, continuously pushing the boundaries of what's possible when context is engineered thoughtfully.


Further Reading & References

Essential Resources

  1. How Long Contexts Fail (and How to Fix Them) – Drew Breunig's essential guide covering the four context failure modes with extensive research citations, including DeepMind, Berkeley, Microsoft/Salesforce, and Databricks studies

  2. Context Engineering for AI Agents – A comprehensive guide covering the four pillars (WRITE, SELECT, COMPRESS, ISOLATE) with implementation patterns

  3. Anthropic's Multi-Agent Research System – Deep dive into building production multi-agent systems (90.2% performance improvement)

  4. Anthropic's Effective Context Engineering – Additional strategies from the Claude team

  5. Microsoft's AI Context Engineering Guide – Beginner-friendly introduction

  6. Daffodil Software: Context Engineering Best Practices – Industry best practices

Research Papers & Studies

  • DeepMind Gemini 2.5 Technical Report: Context poisoning in game-playing agents

  • Anthropic Multi-Agent Eval: 90.2% performance improvement over single-agent

  • Berkeley Function-Calling Leaderboard: Every model performs worse with more tools

  • Microsoft/Salesforce Sharded Prompts Study: 39% performance drop from context clash

  • Anthropic BrowseComp Evaluation: Token usage explains 80% of performance variance

  • Hugging Face CodeAgent Paper: Sandboxed execution for context isolation
