Everyone is building AI agents. Most of them don't work. Not because the technology is bad, but because the engineering practices around agents are still immature. After deploying agents for procurement, research, customer support, and operations, here's what we've learned about making them reliable.
What Makes an Agent Different
An AI agent is not a chatbot. A chatbot responds to a single input with a single output. An agent:
- Reasons about how to accomplish a goal
- Plans a sequence of steps
- Uses tools (APIs, databases, browsers, code execution)
- Observes results and adjusts its approach
- Persists across multiple interactions
This autonomy is what makes agents powerful - and dangerous. An agent that can send emails, query databases, and make API calls can do a lot of damage if it reasons incorrectly.
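The reason/plan/act/observe loop above can be sketched in a few lines of Python. Everything here is illustrative: `fake_model` stands in for a real LLM call, and the tool registry doubles as the explicit allowlist discussed later.

```python
# Minimal sketch of an agent loop (all names hypothetical): the model
# decides the next action, the runtime executes a tool, the result is
# fed back as an observation, and a hard step cap bounds the damage.
from typing import Callable

# Tool registry: the agent may only invoke tools listed here.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_price": lambda sku: f"price for {sku}: 12.50",
}

def fake_model(goal: str, observations: list[str]) -> dict:
    """Stand-in for a real LLM call: choose the next action."""
    if not observations:
        return {"action": "lookup_price", "input": "SKU-42"}
    return {"action": "finish", "input": observations[-1]}

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):          # step cap: a basic safety boundary
        decision = fake_model(goal, observations)
        if decision["action"] == "finish":
            return decision["input"]
        tool = TOOLS[decision["action"]]        # KeyError = disallowed action
        observations.append(tool(decision["input"]))  # observe, then adjust
    return "escalated: step limit reached"
```

The step cap and the tool allowlist are the two cheapest guardrails available, and every real loop should have both.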
Lesson 1: Define the Boundaries First
Before writing any agent code, answer these questions:
- What actions can the agent take? (Explicit allowlist, not blocklist)
- What data can it access?
- What's the maximum cost per execution?
- What decisions require human approval?
- What happens when it fails? (Graceful degradation, not silent failure)
We define these boundaries in an "agent constitution" document before development begins. It's reviewed by both the technical team and the business stakeholder.
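One way to keep a constitution enforceable rather than aspirational is to encode it as configuration the runtime checks on every action. A minimal sketch, with illustrative field names (nothing here is a standard schema):

```python
# Hypothetical encoding of an "agent constitution" as frozen config.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConstitution:
    allowed_actions: frozenset[str]           # explicit allowlist, not blocklist
    allowed_data_sources: frozenset[str]
    max_cost_per_run_gbp: float               # hard spend ceiling per execution
    actions_requiring_approval: frozenset[str]
    on_failure: str = "escalate_to_human"     # graceful degradation, never silent

    def permits(self, action: str) -> bool:
        return action in self.allowed_actions

    def needs_approval(self, action: str) -> bool:
        return action in self.actions_requiring_approval

# Example constitution for a hypothetical procurement agent.
PROCUREMENT = AgentConstitution(
    allowed_actions=frozenset({"search_catalogue", "draft_purchase_order"}),
    allowed_data_sources=frozenset({"supplier_db"}),
    max_cost_per_run_gbp=2.0,
    actions_requiring_approval=frozenset({"draft_purchase_order"}),
)
```

Because the dataclass is frozen, the agent cannot widen its own permissions at runtime; changing the boundaries means changing reviewed config.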
Lesson 2: Start with Deterministic Steps
The biggest mistake in agent development is making everything AI-driven. Most workflows have steps that should be deterministic:
- Data validation? Use code, not AI.
- API calls with known parameters? Use code, not AI.
- Formatting output? Use templates, not AI.
Use AI only for the steps that genuinely require reasoning: understanding unstructured input, making judgement calls, synthesising information, and generating natural language. This "AI where needed, code where possible" approach dramatically improves reliability.
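The split looks like this in practice. In this sketch, validation and formatting are plain code, and the single judgement step is a stub where a model call would go (the stub and all names are hypothetical):

```python
# "AI where needed, code where possible": only classify_with_ai would
# actually call a model in production.
import re

def validate_invoice(record: dict) -> bool:
    """Deterministic: schema and format checks need no model."""
    return (bool(re.fullmatch(r"INV-\d{6}", record.get("id", "")))
            and isinstance(record.get("amount"), (int, float))
            and record["amount"] > 0)

def classify_with_ai(description: str) -> str:
    """Stand-in for the one step that genuinely requires reasoning."""
    return "office_supplies" if "paper" in description else "other"

def format_summary(record: dict, category: str) -> str:
    """Deterministic: a template, not a model."""
    return f"{record['id']}: £{record['amount']:.2f} ({category})"

record = {"id": "INV-000123", "amount": 49.9, "description": "printer paper"}
summary = ""
if validate_invoice(record):
    summary = format_summary(record, classify_with_ai(record["description"]))
```

Two of the three steps are now fully testable with ordinary unit tests, and the model can only fail in one place.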
Lesson 3: Human-in-the-Loop Is Not Optional
Every production agent we've deployed has human checkpoints. The question is where to place them. Our framework:
- High-frequency, low-risk: Automated with monitoring. Human reviews exceptions. (e.g., email classification)
- Medium-frequency, medium-risk: Agent drafts, human approves. (e.g., purchase orders under £5K)
- Low-frequency, high-risk: Agent recommends, human decides. (e.g., supplier contracts, hiring decisions)
The goal is not to remove humans from the workflow. It's to remove the tedious parts of their work so they can focus on judgement and relationships.
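The three tiers above can be implemented as a simple routing table that the runtime consults before executing any action. A sketch, with the tier examples taken from the text and the function names invented for illustration:

```python
# Hypothetical routing of actions to human-in-the-loop tiers.
from enum import Enum

class RiskTier(Enum):
    LOW = "automated_with_monitoring"      # human reviews exceptions only
    MEDIUM = "agent_drafts_human_approves"
    HIGH = "agent_recommends_human_decides"

def tier_for(action: str, value_gbp: float = 0.0) -> RiskTier:
    if action == "classify_email":
        return RiskTier.LOW
    if action == "purchase_order" and value_gbp < 5_000:
        return RiskTier.MEDIUM
    # Unknown or high-stakes actions default to the most cautious tier.
    return RiskTier.HIGH

def requires_human_before_execution(tier: RiskTier) -> bool:
    return tier is not RiskTier.LOW
```

The important design choice is the default: anything not explicitly classified falls into the highest-oversight tier rather than the lowest.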
Lesson 4: Observability Is Everything
When an agent makes a mistake (and it will), you need to understand exactly what happened. Every agent we deploy includes:
- Execution traces: Every step the agent took, with inputs and outputs
- Decision logs: Why the agent chose action A over action B
- Tool call logs: Every API call, database query, and external interaction
- Cost tracking: Token usage and API costs per execution
- Error classification: Was this a model error, tool error, or data error?
This observability isn't just for debugging. It's how you build trust. When stakeholders can see exactly what the agent did and why, they trust it more - and they can give better feedback for improvement.
Lesson 5: Test Like It's Software (Because It Is)
Agent development is software engineering, not prompt engineering. Our testing approach:
- Unit tests: Test individual tools and functions in isolation
- Integration tests: Test tool combinations with mock external services
- Scenario tests: Run the full agent against 20-50 realistic scenarios
- Adversarial tests: Try to break the agent with edge cases, ambiguous inputs, and conflicting instructions
- Regression tests: Re-run the full test suite when changing prompts or tools
We version-control our prompts and treat prompt changes like code changes - they go through review and testing before deployment.
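Scenario and adversarial tests look much like ordinary software tests. In this sketch, the agent under test is a deterministic stub (a real suite would call the deployed agent); the scenarios and the graceful-degradation behaviour are illustrative:

```python
# Pytest-style scenario and adversarial tests against a stub agent.
def classify_ticket(text: str) -> str:
    """Stand-in for the real agent call."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if not lowered.strip():
        return "needs_human_review"   # degrade gracefully, don't guess
    return "general"

# Realistic scenarios: expected behaviour on typical inputs.
SCENARIOS = [
    ("I want a refund for my last order", "billing"),
    ("How do I reset my password?", "general"),
]

# Adversarial cases: edge cases and ambiguous inputs.
ADVERSARIAL = [
    ("", "needs_human_review"),
    ("   \n\t ", "needs_human_review"),
]

def test_scenarios():
    for text, expected in SCENARIOS + ADVERSARIAL:
        assert classify_ticket(text) == expected, text
```

Because the full suite is cheap to run, it doubles as the regression gate: any prompt or tool change must pass it before deployment.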
The Tech Stack That Works
After trying most agent frameworks, here's what we use in production:
- Model: Claude 3.5+ for complex reasoning, GPT-4o for speed-sensitive tasks
- Framework: LangGraph for complex state machines, Claude tool-use for simpler agents
- Orchestration: Custom Python with clear step definitions
- State management: PostgreSQL for durable state, Redis for ephemeral state
- Monitoring: LangSmith or custom dashboards with structured logging
We've found that simpler is better. A well-structured agent with 5-10 tools will outperform a complex multi-agent system with 50 tools almost every time.
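"Custom Python with clear step definitions" can be as plain as a list of named steps run in order, each tagged as code or AI so the boundary from Lesson 2 stays visible. A sketch with invented names:

```python
# Hypothetical orchestration: each step is a named, typed unit; the
# pipeline threads a state dict through them in order.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    kind: str                      # "code" (deterministic) or "ai" (model call)
    run: Callable[[dict], dict]

def pipeline(steps: list[Step], state: dict) -> dict:
    for step in steps:
        state = step.run(state)    # each step reads state, returns new state
    return state

steps = [
    Step("validate", "code", lambda s: {**s, "valid": bool(s.get("text"))}),
    Step("summarise", "ai", lambda s: {**s, "summary": s["text"][:20]}),
]
result = pipeline(steps, {"text": "quarterly supplier review notes"})
```

This is deliberately boring: a flat, inspectable list of steps is easier to trace, test, and reason about than a graph of fifty interacting tools.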
The Bottom Line
AI agents are genuinely transformative - when built correctly. The companies seeing real value from agents share three traits:
- They start with a clearly defined, high-value workflow
- They invest in engineering discipline (testing, monitoring, guardrails)
- They keep humans in the loop for high-stakes decisions
The agent revolution is real. But it's an engineering revolution, not a magic revolution. Treat agents like production software, and they'll deliver production-grade results.
Building AI agents for your business? Book a strategy call and we'll help you design agents that are reliable, safe, and genuinely useful.