Everyone is building AI agents. Most of them don't work. Not because the technology is bad, but because the engineering practices around agents are still immature. After deploying agents for procurement, research, customer support, and operations, here's what we've learned about making them reliable.
What Makes an Agent Different
An AI agent is not a chatbot. A chatbot responds to a single input with a single output. An agent:
- Reasons about how to accomplish a goal
- Plans a sequence of steps
- Uses tools (APIs, databases, browsers, code execution)
- Observes results and adjusts its approach
- Persists across multiple interactions
This autonomy is what makes agents powerful - and dangerous. An agent that can send emails, query databases, and make API calls can do a lot of damage if it reasons incorrectly.
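The reason/plan/act/observe loop above can be sketched in a few lines of Python. Everything here is illustrative: `fake_model` stands in for a real LLM call, and the tool registry doubles as the explicit allowlist discussed later.

```python
# Minimal sketch of an agent loop (all names hypothetical): the model
# decides the next action, the runtime executes a tool, the result is
# fed back as an observation, and a hard step cap bounds the damage.
from typing import Callable

# Tool registry: the agent may only invoke tools listed here.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_price": lambda sku: f"price for {sku}: 12.50",
}

def fake_model(goal: str, observations: list[str]) -> dict:
    """Stand-in for a real LLM call: choose the next action."""
    if not observations:
        return {"action": "lookup_price", "input": "SKU-42"}
    return {"action": "finish", "input": observations[-1]}

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):          # step cap: a basic safety boundary
        decision = fake_model(goal, observations)
        if decision["action"] == "finish":
            return decision["input"]
        tool = TOOLS[decision["action"]]        # KeyError = disallowed action
        observations.append(tool(decision["input"]))  # observe, then adjust
    return "escalated: step limit reached"
```

The step cap and the tool allowlist are the two cheapest guardrails available, and every real loop should have both.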
Lesson 1: Define the Boundaries First
Before writing any agent code, answer these questions:
- What actions can the agent take? (Explicit allowlist, not blocklist)
- What data can it access?
- What's the maximum cost per execution?
- What decisions require human approval?
- What happens when it fails? (Graceful degradation, not silent failure)
We define these boundaries in an "agent constitution" document before development begins. It's reviewed by both the technical team and the business stakeholder.
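One way to keep a constitution enforceable rather than aspirational is to encode it as configuration the runtime checks on every action. A minimal sketch, with illustrative field names (nothing here is a standard schema):

```python
# Hypothetical encoding of an "agent constitution" as frozen config.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConstitution:
    allowed_actions: frozenset[str]           # explicit allowlist, not blocklist
    allowed_data_sources: frozenset[str]
    max_cost_per_run_gbp: float               # hard spend ceiling per execution
    actions_requiring_approval: frozenset[str]
    on_failure: str = "escalate_to_human"     # graceful degradation, never silent

    def permits(self, action: str) -> bool:
        return action in self.allowed_actions

    def needs_approval(self, action: str) -> bool:
        return action in self.actions_requiring_approval

# Example constitution for a hypothetical procurement agent.
PROCUREMENT = AgentConstitution(
    allowed_actions=frozenset({"search_catalogue", "draft_purchase_order"}),
    allowed_data_sources=frozenset({"supplier_db"}),
    max_cost_per_run_gbp=2.0,
    actions_requiring_approval=frozenset({"draft_purchase_order"}),
)
```

Because the dataclass is frozen, the agent cannot widen its own permissions at runtime; changing the boundaries means changing reviewed config.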
Lesson 2: Start with Deterministic Steps
The biggest mistake in agent development is making everything AI-driven. Most workflows have steps that should be deterministic:
- Data validation? Use code, not AI.
- API calls with known parameters? Use code, not AI.
- Formatting output? Use templates, not AI.
Use AI only for the steps that genuinely require reasoning: understanding unstructured input, making judgement calls, synthesising information, and generating natural language. This "AI where needed, code where possible" approach dramatically improves reliability.
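The split looks like this in practice. In this sketch, validation and formatting are plain code, and the single judgement step is a stub where a model call would go (the stub and all names are hypothetical):

```python
# "AI where needed, code where possible": only classify_with_ai would
# actually call a model in production.
import re

def validate_invoice(record: dict) -> bool:
    """Deterministic: schema and format checks need no model."""
    return (bool(re.fullmatch(r"INV-\d{6}", record.get("id", "")))
            and isinstance(record.get("amount"), (int, float))
            and record["amount"] > 0)

def classify_with_ai(description: str) -> str:
    """Stand-in for the one step that genuinely requires reasoning."""
    return "office_supplies" if "paper" in description else "other"

def format_summary(record: dict, category: str) -> str:
    """Deterministic: a template, not a model."""
    return f"{record['id']}: £{record['amount']:.2f} ({category})"

record = {"id": "INV-000123", "amount": 49.9, "description": "printer paper"}
summary = ""
if validate_invoice(record):
    summary = format_summary(record, classify_with_ai(record["description"]))
```

Two of the three steps are now fully testable with ordinary unit tests, and the model can only fail in one place.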
Lesson 3: Human-in-the-Loop Is Not Optional
Every production agent we've deployed has human checkpoints. The question is where to place them. Our framework:
- High-frequency, low-risk: Automated with monitoring. Human reviews exceptions. (e.g., email classification)
- Medium-frequency, medium-risk: Agent drafts, human approves. (e.g., purchase orders under £5K)
- Low-frequency, high-risk: Agent recommends, human decides. (e.g., supplier contracts, hiring decisions)
The goal is not to remove humans from the workflow. It's to remove the tedious parts of their work so they can focus on judgement and relationships.
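The three tiers above can be implemented as a simple routing table that the runtime consults before executing any action. A sketch, with the tier examples taken from the text and the function names invented for illustration:

```python
# Hypothetical routing of actions to human-in-the-loop tiers.
from enum import Enum

class RiskTier(Enum):
    LOW = "automated_with_monitoring"      # human reviews exceptions only
    MEDIUM = "agent_drafts_human_approves"
    HIGH = "agent_recommends_human_decides"

def tier_for(action: str, value_gbp: float = 0.0) -> RiskTier:
    if action == "classify_email":
        return RiskTier.LOW
    if action == "purchase_order" and value_gbp < 5_000:
        return RiskTier.MEDIUM
    # Unknown or high-stakes actions default to the most cautious tier.
    return RiskTier.HIGH

def requires_human_before_execution(tier: RiskTier) -> bool:
    return tier is not RiskTier.LOW
```

The important design choice is the default: anything not explicitly classified falls into the highest-oversight tier rather than the lowest.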
Lesson 4: Observability Is Everything
When an agent makes a mistake (and it will), you need to understand exactly what happened. Every agent we deploy includes:
- Execution traces: Every step the agent took, with inputs and outputs
- Decision logs: Why the agent chose action A over action B
- Tool call logs: Every API call, database query, and external interaction
- Cost tracking: Token usage and API costs per execution
- Error classification: Was this a model error, tool error, or data error?
This observability isn't just for debugging. It's how you build trust. When stakeholders can see exactly what the agent did and why, they trust it more - and they can give better feedback for improvement.
Lesson 5: Test Like It's Software (Because It Is)
Agent development is software engineering, not prompt engineering. Our testing approach:
- Unit tests: Test individual tools and functions in isolation
- Integration tests: Test tool combinations with mock external services
- Scenario tests: Run the full agent against 20-50 realistic scenarios
- Adversarial tests: Try to break the agent with edge cases, ambiguous inputs, and conflicting instructions
- Regression tests: Re-run the full test suite when changing prompts or tools
We version-control our prompts and treat prompt changes like code changes - they go through review and testing before deployment.
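Scenario and adversarial tests look much like ordinary software tests. In this sketch, the agent under test is a deterministic stub (a real suite would call the deployed agent); the scenarios and the graceful-degradation behaviour are illustrative:

```python
# Pytest-style scenario and adversarial tests against a stub agent.
def classify_ticket(text: str) -> str:
    """Stand-in for the real agent call."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if not lowered.strip():
        return "needs_human_review"   # degrade gracefully, don't guess
    return "general"

# Realistic scenarios: expected behaviour on typical inputs.
SCENARIOS = [
    ("I want a refund for my last order", "billing"),
    ("How do I reset my password?", "general"),
]

# Adversarial cases: edge cases and ambiguous inputs.
ADVERSARIAL = [
    ("", "needs_human_review"),
    ("   \n\t ", "needs_human_review"),
]

def test_scenarios():
    for text, expected in SCENARIOS + ADVERSARIAL:
        assert classify_ticket(text) == expected, text
```

Because the full suite is cheap to run, it doubles as the regression gate: any prompt or tool change must pass it before deployment.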
The Tech Stack That Works
After trying most agent frameworks, here's what we use in production:
- Model: Claude 3.5+ for complex reasoning, GPT-4o for speed-sensitive tasks
- Framework: LangGraph for complex state machines, Claude tool-use for simpler agents
- Orchestration: Custom Python with clear step definitions
- State management: PostgreSQL for durable state, Redis for ephemeral state
- Monitoring: LangSmith or custom dashboards with structured logging
We've found that simpler is better. A well-structured agent with 5-10 tools will outperform a complex multi-agent system with 50 tools almost every time.
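"Custom Python with clear step definitions" can be as plain as a list of named steps run in order, each tagged as code or AI so the boundary from Lesson 2 stays visible. A sketch with invented names:

```python
# Hypothetical orchestration: each step is a named, typed unit; the
# pipeline threads a state dict through them in order.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    kind: str                      # "code" (deterministic) or "ai" (model call)
    run: Callable[[dict], dict]

def pipeline(steps: list[Step], state: dict) -> dict:
    for step in steps:
        state = step.run(state)    # each step reads state, returns new state
    return state

steps = [
    Step("validate", "code", lambda s: {**s, "valid": bool(s.get("text"))}),
    Step("summarise", "ai", lambda s: {**s, "summary": s["text"][:20]}),
]
result = pipeline(steps, {"text": "quarterly supplier review notes"})
```

This is deliberately boring: a flat, inspectable list of steps is easier to trace, test, and reason about than a graph of fifty interacting tools.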
The Bottom Line
AI agents are genuinely transformative - when built correctly. The companies seeing real value from agents share three traits:
- They start with a clearly defined, high-value workflow
- They invest in engineering discipline (testing, monitoring, guardrails)
- They keep humans in the loop for high-stakes decisions
The agent revolution is real. But it's an engineering revolution, not a magic revolution. Treat agents like production software, and they'll deliver production-grade results.
Building AI agents for your business? Book a strategy call and we'll help you design agents that are reliable, safe, and genuinely useful.