Everyone at NMO who touches AI work — whether you're pitching a client, scoping a project, writing code, or reviewing someone else's architecture. You don't need prior ML experience. You do need to be comfortable hearing names like "pgvector" or "LangGraph" without googling every one.
By the end you should be able to:
- Sketch an AI agent architecture on a whiteboard and place every component in the right layer.
- Evaluate any new tool that lands in your inbox — quickly classify which layer it plays in and what it replaces.
- Cut through vendor pitches that collapse multiple layers into "one platform" without telling you what you're giving up.
- Make informed recommendations when a client asks "should we use X?"
First pass (45 min): read Part 1 cover to cover, skim Part 2 layer headers, read Part 3 worked examples, skim Part 4.
Deep pass (~3 hours): read Part 2 layer-by-layer. Each layer card is independent — you can stop between them.
Reference mode: when you hear a tool name, search the page (Ctrl+F). Every tool named lives in a specific layer with context.
This primer makes judgment calls. "Use pgvector when you have fewer than 10M vectors" is opinion — defensible, pragmatic, and much more useful than a neutral survey. Where there's a live debate between tools, we flag it explicitly. Where the industry has converged on an answer, we state it.
When a client hires us, they don't want a library tour — they want a recommendation. This document trains you to recommend with confidence and with reasons.
A neural network trained on enormous amounts of text, capable of producing human-like language. Examples: GPT-5, Claude Opus, Llama 4, Gemini 3, DeepSeek V3.
What it is not: an application. "ChatGPT" is an application that uses GPT (a model). Confusing these is like confusing "a car" with "the Toyota Corolla engine."
Two flavours:
- Closed / proprietary: access only via the maker's API. Claude, GPT, Gemini, Grok. Usually strongest at the frontier.
- Open-weight: the model weights are published. You can download and run yourself. Llama, Mistral, Qwen, DeepSeek, Gemma, Phi.
The act of running a model to produce output. Training a model is enormously expensive and rare; inference is what happens every time someone sends a prompt. When people say "inference costs" or "inference provider," they mean this.
Every ChatGPT request = one inference call. Running 1,000 prompts = 1,000 inferences. Throughput and latency are measured in tokens/second during inference.
The unit of work for LLMs. A token is roughly 3/4 of an English word — "hello" is one token, "unbelievable" is two or three. Everything is priced in tokens: input tokens (what you send) are priced differently from output tokens (what the model generates).
Why this matters: 1,000 tokens of Claude Opus output ≈ 7.5¢. 1,000 tokens of Claude Haiku ≈ 0.5¢. Same family, 15× cost difference — because the smaller model is much cheaper to run. Choosing the right model per task is the single biggest lever on your AI budget.
How much text the model can "see" in a single inference call — input + output combined. GPT-3.5 was 4k tokens (~3,000 words). Claude 4 is 200k (~150k words, a whole novel). Gemini 2.5 Pro is up to 2M.
Why this matters: bigger context = you can stuff more reference material in. But: longer context = slower + more expensive + more likely to lose focus on early content. "Throw everything into context" rarely beats "retrieve the right 5 chunks."
A model that converts text (or images, audio) into a list of ~1,500 numbers. Texts with similar meanings produce similar number lists. This numerical fingerprint is called an embedding or vector.
The power: once text is a vector, you can do math on meaning. Search "find documents about refunds" by converting the query to a vector and finding the closest document vectors. That's semantic search. It's the foundation of RAG.
A pattern, not a product. When the user asks a question:
- Convert the question to an embedding.
- Search a vector database for the most relevant documents (semantic search).
- Stuff the top-K documents into the LLM's context alongside the original question.
- Generate the answer with that grounding.
This lets a generic LLM answer questions about your data without retraining. It's how every "chat with your docs" product works. It's also the most common way AI becomes actually useful in an enterprise.
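A minimal sketch of the four steps in Python — the `openai` SDK for embeddings and generation (assumes an API key), and a toy in-memory list standing in for a real vector DB like Qdrant or pgvector:

```python
# Minimal RAG sketch — toy in-memory store; swap in a real vector DB for production.
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = ["Refunds are processed within 5 business days.",
        "Our return window is 30 days from delivery."]
index = [(d, embed(d)) for d in docs]            # embed the corpus once, up front

def answer(question: str, k: int = 2) -> str:
    q_vec = embed(question)                      # 1. convert the question to an embedding
    top_k = sorted(index, key=lambda p: cosine(q_vec, p[1]), reverse=True)[:k]  # 2. semantic search
    context = "\n".join(d for d, _ in top_k)     # 3. stuff top-K docs into context
    resp = client.chat.completions.create(       # 4. generate with that grounding
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": f"Answer only from this context:\n{context}"},
                  {"role": "user", "content": question}])
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```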
| Layer | Example | What you pay for |
|---|---|---|
| Model | Llama 3.3 70B | Nothing — open-weight, free to download |
| Provider | Groq · Together · Fireworks | Per-million tokens at the provider's rate |
| Application | A chatbot built on Groq | Subscription to the application itself |
Same model, three different things to buy. Once you internalise this split, every vendor pitch becomes readable.
| Technique | What it does | When to use |
|---|---|---|
| Prompting | Just write better instructions | 90% of use cases. Start here, always. |
| RAG | Inject relevant documents into context | Model needs knowledge it doesn't have — your company docs, a product manual, live data. |
| Fine-tuning | Adjust the model's weights with training examples | You need a consistent tone, a narrow format, or you're squeezing cost by making a small model imitate a big one. Expensive and slow to iterate. |
Beginners reach for fine-tuning because it sounds sophisticated. Professionals reach for better prompts and RAG because they work faster, cheaper, and cover 95% of real problems.
| Layer | Name | What lives here | Example tools |
|---|---|---|---|
| L10 | Governance | Observability, evaluation, guardrails, compliance | Langfuse, Promptfoo, Llama Guard, Presidio |
| L9 | Applications | End-user products built on everything below | ChatGPT, Cursor, Perplexity, Copilot |
| L8 | Protocols | Standards for components to talk to each other | MCP, function calling, OpenAPI, A2A |
| L7 | Orchestration | Compose models + tools + data into workflows/agents | LangChain, LangGraph, CrewAI, n8n, Temporal |
| L6 | DS & ML platforms | Where data scientists prep data, train models, deploy | Dataiku, Databricks, SageMaker, Vertex AI |
| L5 | Data | Where knowledge lives — warehouses, databases, vectors | Snowflake, Teradata, Postgres, Qdrant, Redis |
| L4 | Inference providers | APIs (or local runtimes) that run models for you | Anthropic, OpenAI, Groq, Bedrock, OpenRouter · Ollama (local) |
| L3 | Models | The neural networks themselves | Claude, GPT, Llama, Gemini, Qwen |
| L2 | Hosting | Where your orchestration code lives | AWS, Vercel, a VPS, RunPod |
| L1 | Silicon | The physical chips | NVIDIA H100, Groq LPU, Google TPU |
Because each layer has a distinct buying decision with different vendors and different competitive dynamics.
You could collapse "inference providers" into "models" — but then you can't explain why the same Llama 3.3 runs on Groq (fast) and Together (cheap) and Bedrock (compliant). You'd hide the decision that actually matters.
You could merge "protocols" and "orchestration" — but then you miss that MCP is a standards layer, chosen separately from whichever framework consumes it.
Ten layers is the minimum number that keeps the decisions visible.
The minimal agent
L3 (model) + L4 (provider) + L7 (orchestration) = a working agent. Three layers, ~200 lines of code. A weekend build (see the sketch after these pattern cards).
The enterprise pattern
All ten layers. L5 (Teradata + Qdrant), L6 (Dataiku), L10 (Langfuse + Llama Guard), the rest. Months of integration.
The vendor "platform"
A product claiming to cover 6+ layers for you. Convenient at first, lock-in at scale. Readable once you know the layers.
The "AI as a feature"
L3, L4, L9 — existing product adds a "summarise" button. Notion AI, Zendesk AI. Usually OpenAI or Anthropic under the hood.
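Here is roughly what the minimal-agent card above boils down to — a sketch using the Anthropic SDK with a single stubbed tool (`get_order_status` is hypothetical; swap in your own):

```python
# Minimal agent: L3 (Claude) + L4 (Anthropic API) + L7 (this loop).
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

TOOLS = [{"name": "get_order_status",
          "description": "Look up the status of an order by its ID.",
          "input_schema": {"type": "object",
                           "properties": {"order_id": {"type": "string"}},
                           "required": ["order_id"]}}]

def get_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped"  # stub — replace with a real lookup

messages = [{"role": "user", "content": "Where is order A-1042?"}]
while True:
    resp = client.messages.create(model="claude-sonnet-4-20250514",  # any tool-capable Claude
                                  max_tokens=1024, tools=TOOLS, messages=messages)
    if resp.stop_reason != "tool_use":
        print("".join(b.text for b in resp.content if b.type == "text"))  # final answer
        break
    messages.append({"role": "assistant", "content": resp.content})
    # execute every requested tool call, return all results in one user turn
    results = [{"type": "tool_result", "tool_use_id": b.id,
                "content": get_order_status(**b.input)}
               for b in resp.content if b.type == "tool_use"]
    messages.append({"role": "user", "content": results})
```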
Every component you'll touch lives in exactly one of three zones. The zone determines what you can put through it without a compliance review.
- Local / on-prem — your VPS, your data centre, your firewall. Data never leaves. Default for anything regulated (PII, financial, health, gov-ID).
- Private cloud you control — single-tenant deployments inside your GCP project, behind your VPC. Data leaves the building but you control the keys, the logs, and the contract.
- Public API — Anthropic, OpenAI, Groq cloud, public SaaS endpoints. Fast, cheap, and powerful — but every prompt and response crosses the public internet to a third party. Treat with care.
Hotspot 1 — Prompt to public LLM API. Every prompt sent to Groq Cloud, Anthropic, OpenAI contains whatever you put in it. If you put customer PII, internal financials, or trade secrets into the prompt, you have just shipped them to a third party. Mitigation: pre-prompt redaction (Presidio), data-class allow-lists per route, contract review for the provider's data-retention terms, or fall back to a local model (Llama via Ollama).
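A sketch of that pre-prompt redaction step with Presidio (the `presidio-analyzer` and `presidio-anonymizer` packages); the example string is illustrative:

```python
# Pre-prompt redaction sketch: strip PII before the prompt leaves the network.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()       # detects PII entities (names, emails, phones, ...)
anonymizer = AnonymizerEngine()   # replaces detected spans with placeholders

def redact(prompt: str) -> str:
    findings = analyzer.analyze(text=prompt, language="en")
    return anonymizer.anonymize(text=prompt, analyzer_results=findings).text

raw = "Customer Jane Doe (jane@example.com) wants a refund on order 7731."
print(redact(raw))  # e.g. "Customer <PERSON> (<EMAIL_ADDRESS>) wants a refund ..."
# Only the redacted string is ever sent to the public API.
```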
Hotspot 2 — Tool calls to external SaaS. When the agent decides to "look up the customer in Salesforce" or "post to Slack," that call leaves the network too. Mitigation: every tool call goes through an MCP server that logs the request, redacts sensitive fields, and enforces an allow-list of which tenants/customers can be looked up.
- Point at the green column first. "All of this stays inside your firewall — Teradata, Dataiku, the agent code, your vector store."
- Then the amber column. "These run inside your GCP project, single-tenant, region-pinned. Your data leaves your building but stays under your contract and never reaches a multi-tenant pool."
- Then the red column. "These are the only places where your data crosses the public internet to a third party. We use them deliberately, with redaction in front, only for prompts that don't contain regulated content."
- The animated dots. "Each dot is a request travelling between components. Notice that most activity is inside the green column. The model API only sees the cleaned, redacted prompt — never the raw customer record."
Groq sits in the public-API zone (red), so you only reach for it when its specific advantage — sub-300ms first-token latency on open models — is the thing that makes or breaks the product. Here is the rule, every time:
- Real-time voice — under-300ms first-token is the difference between "natural" and "awkward" (see Voice Agent).
- Live agent assist / call-centre co-pilot — a suggestion every 4 seconds needs a sub-700ms round-trip (see Call Center · VIP).
- Inbound chat at high volume where reply speed is part of the customer experience (see Customer Care AI).
- High-throughput batch on open weights — classification / extraction / triage at hundreds of tokens/sec for cents per million.
- Cost-sensitive workloads on Llama / Qwen / DeepSeek — you want open-weight pricing and world-class speed.
- You need Claude or Gemini quality — Groq only hosts open-weight models. Frontier reasoning lives on Vertex AI / Anthropic.
- The prompt contains regulated PII you can't redact — it's a public API; data leaves your network. Use Vertex AI in your GCP project instead.
- Latency doesn't matter — for offline / batch / "report me by tomorrow" workloads, Vertex AI on Llama is cheaper and stays in your cloud.
- You need long context (1M+ tokens) — that's Gemini 2.5 Pro on Vertex AI, not Groq.
- You're inside a strict on-prem mandate — fall back to local Llama on Ollama / vLLM. Slower but never leaves the building.
| Chip | Maker | Position in 2026 |
|---|---|---|
| H100, H200, B100, B200 | NVIDIA | The default. ~90% of production inference. CUDA ecosystem is the moat. |
| A100 | NVIDIA | Previous generation. Still everywhere. Cheaper to rent. |
| TPU v5e · v5p · Trillium | Google | Google-only. Powers Gemini. Rentable via GCP. |
| MI300X · MI325X | AMD | Credible NVIDIA challenger. Cheaper per FLOP. Software (ROCm) still maturing. |
| LPU | Groq | Language-specific chip. Not a GPU. Deterministic, extremely low latency, 5–10× faster tokens/sec on open-weight models. Groq (the company) sells API access; you don't buy LPUs. |
| WSE-3 | Cerebras | Wafer-scale. One chip is physically the size of a cluster of GPUs. Fastest inference on large models. Niche, expensive. |
| Trainium · Inferentia | AWS | AWS-exclusive silicon. Cheap. Used inside Bedrock. |
| Neural Engine (M-series, A-series) | Apple | On-device only. Behind every "Apple Intelligence" feature. |
| Snapdragon NPU | Qualcomm | Android on-device inference. |
A GPU is general-purpose — it does graphics, crypto, scientific computing, and AI. That flexibility costs you speed. Groq built a chip that only does one thing (the math that runs LLMs) and shaved off every millisecond.
Practical consequence: Llama 3.3 70B on an H100 produces maybe 60 tokens/second. Same model on Groq: 500+ tokens/second. That's the difference between an agent that feels snappy and an agent that feels sluggish.
Trade-off: Groq only serves a curated menu of open-weight models. You cannot run your custom fine-tune. You cannot run Claude or GPT (those are closed — they run on their makers' infrastructure). You're choosing speed within a constrained model set.
Matters
- Voice agents (<300ms perceived round-trip)
- Live code completion
- High-volume batch processing (cost per million tokens)
- Air-gapped / on-prem deployments (you pick the hardware)
Doesn't matter
- Prototypes and MVPs
- Internal tools with <1,000 users
- Anything where a 2-second response is fine
- Anything running on Claude or GPT (you can't choose anyway)
AWS, GCP, Azure, Oracle Cloud, Alibaba Cloud. Everything is available; complexity is high. Pick when you need compliance stories (HIPAA, PDPL, SOC 2), when you already run 80% of your infrastructure there, or when the client dictates it.
AI-relevant services: AWS Bedrock (multi-model inference gateway), Azure OpenAI (Microsoft's GPT resell), GCP Vertex AI (Google's ML platform), AWS SageMaker, Azure ML.
RunPod · Lambda Labs · CoreWeave · Modal · Replicate · Beam · Paperspace · Fluidstack.
Use case: you need GPUs now (fine-tuning, self-hosting a specific model, experimenting) without a hyperscaler commitment. Sign up, rent an H100 by the hour, shut it down. This is where most open-source AI development happens.
Your own VPS (Hostinger, Linode, DigitalOcean, Hetzner), bare-metal servers, on-prem, Cloudflare Workers AI (edge), Vercel.
Use case: small-scale apps, data-sovereignty requirements, cost control, internal tools. A $20/month VPS can host a surprising amount of AI application code; you call out to inference providers for the heavy compute, or run a small open-weight model locally with Ollama on the same box.
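For reference, calling that local Ollama instance from your app code is one HTTP request — a sketch assuming `ollama pull llama3.3` has already been run on the box:

```python
# Calling a local open-weight model through Ollama's REST API (default port 11434).
import requests

resp = requests.post("http://localhost:11434/api/generate",
                     json={"model": "llama3.3",
                           "prompt": "Summarise: the meeting moved to Tuesday.",
                           "stream": False})   # one JSON response instead of a token stream
print(resp.json()["response"])
```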
The application layer is small and cheap. The inference cost is the variable. That's why hosting your app on a $20 VPS is fine for a long time — the money goes to layer 4, not layer 2.
| Family | Maker | Strength |
|---|---|---|
| Claude (Opus, Sonnet, Haiku) | Anthropic | Coding, long-context reasoning (200k+), careful tool use. Preferred by Apex and by most serious agent builders. |
| GPT-5 · GPT-4o · o-series (o1/o3/o4) | OpenAI | General-purpose, multimodal (vision + voice), math and science via o-series reasoning models. GPT-5 is the current flagship. |
| Gemini 2.5 · 3 | Google | Up to 2M-token context (biggest), native multimodal, very cheap at scale. |
| Grok 3 · 4 | xAI | Trained on X data, fewer guardrails, fast-moving. |
| Family | Maker | Strength |
|---|---|---|
| Llama 3.1 · 3.3 · 4 | Meta | The open-weight workhorse. Runs everywhere, fine-tunable, strong community. |
| Mistral · Mixtral · Codestral | Mistral AI (France) | EU privacy story, MoE (mixture-of-experts) efficiency, small-model quality. |
| Qwen 2.5 · 3 | Alibaba | Best open-weight coder in 2026, excellent multilingual (great for Arabic), many sizes. |
| DeepSeek V3 · R1 | DeepSeek | Cheap frontier reasoning. R1 was trained for roughly 1/20th of GPT-4's public cost estimates and matches o1-level reasoning on many benchmarks. |
| Gemma 2 · 3 | Google | Small-model sibling of Gemini. On-device friendly. |
| Phi-3 · Phi-4 | Microsoft | Small model, punches above its weight, good on-device. |
| Family | Maker | Strength |
|---|---|---|
| Whisper · Whisper Large v3 | OpenAI | Speech-to-text. Best-in-class transcription. Free to self-host. |
| Flux · Flux Pro | Black Forest Labs | Image generation, open-weight, high quality. Replaces Stable Diffusion for many. |
| Stable Diffusion 3.5 | Stability AI | Open image generation. |
| Sora · Runway Gen-3 · Kling | OpenAI · Runway · Kuaishou | Video generation. Early but usable. |
| text-embedding-3 · voyage-3 | OpenAI · Voyage AI | Embeddings — turn text into vectors for retrieval. (You'll use these daily in RAG.) |
| Cohere Embed · BGE-M3 | Cohere · BAAI | Alternative embedding models. BGE-M3 is open-weight and strong on multilingual. |
| Your situation | First pick |
|---|---|
| "I need the best coding model" | Claude Sonnet / Opus · Qwen 2.5 Coder for open |
| "I need the cheapest frontier-quality reasoning" | DeepSeek V3/R1 or Gemini 2.5 Flash |
| "I need 1M+ tokens of context" | Gemini 2.5 Pro |
| "I need to run it on my own hardware" | Llama 3.3 70B (general) or Qwen 2.5 Coder (coding) — fastest path is Ollama |
| "I need it to be fast enough for voice" | Llama 3.3 70B on Groq |
| "I need Arabic / multilingual strength" | Qwen 2.5 · Gemini · Claude |
| "I need strong vision (describe image, read PDFs)" | Claude Sonnet · GPT-4o · Gemini 2.5 |
| "I need cheap summarisation at scale" | Claude Haiku · Gemini Flash · GPT-4o-mini |
The model maker serves their own model — the only place you can get it (plus some hyperscaler resells).
- Anthropic API — Claude. Best place for Claude. Also available on AWS Bedrock and GCP Vertex for compliance reasons.
- OpenAI API — GPT and o-series. Also sold as Azure OpenAI for enterprise.
- Google Gemini API — via Google AI Studio (dev) or Vertex AI (enterprise).
- xAI API — Grok.
One endpoint, many models. Useful when you want to A/B test models or avoid locking into one provider.
Providers competing on speed for open-weight models.
The software you run yourself when data can't leave your network.
`ollama run llama3.3` and you have an API.

Proxies that sit between your app and the real inference provider — adding caching, logging, rate-limiting, and A/B testing.
| Category | Players | Use case |
|---|---|---|
| Data warehouses (OLAP) | Snowflake · Databricks · BigQuery · Teradata · Redshift · ClickHouse | Structured analytical queries over years of history. "What was our LATAM revenue by quarter for 2020-2025?" |
| Data lakes | S3 + Iceberg · Delta Lake · MinIO · Azure Data Lake | Cheap raw-file storage, often the substrate under a warehouse. |
| Operational DBs (OLTP) | PostgreSQL · MySQL · MongoDB · DynamoDB · SQL Server | Your app's live data — users, orders, tickets. Reads and writes continuously. |
| Vector DBs | Qdrant · Pinecone · Weaviate · Milvus · Chroma · pgvector · LanceDB | Store embeddings for semantic search. Foundation of RAG and agent memory. |
| Graph DBs | Neo4j · ArangoDB · Memgraph · TigerGraph | When relationships are the point — fraud rings, supply chains, org charts. |
| Cache / in-memory | Redis · KeyDB · Memcached · DragonflyDB | Sub-millisecond lookups, session state, pub/sub messaging. |
| Search engines | Elasticsearch · OpenSearch · Meilisearch · Typesense | Keyword + filter search. Often combined with vector search for hybrid retrieval. |
Teradata: the 40-year incumbent in big banks, telcos, airlines, healthcare payers. If a client has 20 years of structured history, it's probably in Teradata. Strengths: mature query optimiser, governance, predictable performance. Weaknesses: expensive, older tooling story. You don't migrate Teradata — you work with it.
Snowflake: cloud-native warehouse, separated compute + storage. Dominant with modern enterprises. Easier to use than Teradata, strong ecosystem.
Databricks: lakehouse model — warehouse + lake + ML platform in one. Preferred by data-engineering-heavy shops. Has its own MLflow, its own LLMs (DBRX), its own serving.
BigQuery: the GCP-native warehouse. Extremely cheap serverless scan. Default for any GCP-committed organisation.
ClickHouse: open-source columnar DB, blazingly fast for analytical queries on event data. Product analytics shops love it.
Pure vector search misses exact matches (product IDs, names, specific phrases). Pure keyword search misses semantic meaning ("refund policy" vs "return guidelines"). Hybrid retrieval runs both and fuses the results.
Typical stack: Elasticsearch (or OpenSearch) for keyword + BM25 ranking + Qdrant (or pgvector) for semantic. A re-ranker model (Cohere Rerank, BGE reranker) picks the final top-K. Quality jumps significantly over either alone.
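The fusion step is commonly reciprocal rank fusion (RRF) — a self-contained sketch; the two input rankings are whatever your keyword engine and vector DB return:

```python
# Reciprocal Rank Fusion (RRF): merge a keyword ranking and a vector ranking.
def rrf(keyword_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            # 1/(k + rank): documents ranked highly by either system float to the top
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_refund_policy", "doc_pricing", "doc_returns"]   # keyword ranking
vect = ["doc_returns", "doc_refund_policy", "doc_shipping"]  # semantic ranking
print(rrf(bm25, vect)[:3])  # fused top-K, ready for a re-ranker model
```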
| Platform | Position | Who uses it |
|---|---|---|
| Dataiku | Visual + code DS platform. ETL, feature engineering, model training, deployment — in one canvas. Strong RBAC, lineage, governance. | Enterprises where analysts and data scientists share workflows. Often sits on top of Snowflake or Teradata. |
| Databricks | Lakehouse + ML + Spark + Delta. Code-heavy. Has its own LLM features (DBRX model, Mosaic AI). | Data engineers. ML teams at scale. Shops that live in notebooks. |
| Palantir Foundry | Data integration + workflow + ontology. Operational AI, not exploratory. Very opinionated. | Large enterprises with messy data across 50 source systems. Defence, healthcare, oil & gas. |
| AWS SageMaker | Hyperscaler DS platform. Tight AWS integration. Everything from Jupyter to model serving. | AWS-committed shops. ML engineers. |
| GCP Vertex AI | Google's answer. Strong AutoML, native Gemini integration. | GCP-committed shops. |
| Azure ML | Microsoft's answer. Tight integration with Azure services and Office. | Microsoft-shop enterprises. |
| H2O.ai · DataRobot | AutoML-first. "Point at a table, get a model." Less useful for LLMs, still strong for traditional ML. | Teams without deep ML expertise. Financial services modelling. |
| MLflow · W&B · ClearML · Comet | Experiment tracking + model registry. Not a full platform — a component. | ML teams using their own compute but wanting governance. |
Dataiku is often called "Tableau for machine learning." It's a visual canvas where you drag boxes: read from Teradata → filter → join with a CSV → train a model → deploy as an API. Each box can be visual (for analysts) or Python/R (for data scientists). They share the same project.
What's valuable:
- Lineage: every column in every output can be traced back to its source.
- RBAC: who ran what, who approved deployment, who has access to which data.
- Mixed skill-levels: business analysts and senior DS work on the same flow.
- Model Ops: deployed models get monitored for drift, performance, retraining triggers.
Where it fits in the agent era: Dataiku's sweet spot is traditional ML (classification, regression, forecasting). For LLM-heavy agents, it's peripheral — you might publish a "scored customers" table from Dataiku that an agent then queries, but the agent itself is built elsewhere. Dataiku is adding LLM features, but the core strength remains traditional analytics.
What they mean: they have a DS team, they've invested in governance and lineage, they likely have 50+ projects running in production. They are enterprise, not a startup.
Implications for your pitch:
- Don't propose to replace Dataiku — you'll lose.
- Do propose to complement it with agentic workflows that consume Dataiku outputs.
- Leverage their existing lineage + RBAC — the compliance story is already built.
- MCP servers pointing to Dataiku datasets are the clean integration point.
| Framework | Philosophy | When to use |
|---|---|---|
| LangChain | The original. Huge surface area, many integrations. Often criticised as "too magic." Good for getting started, painful at scale. | Prototypes. Pattern demonstrations. |
| LangGraph | LangChain's state-machine framework. Explicit graphs of agent decisions. Much more debuggable than raw LangChain. | Multi-step reasoning with branches. Complex agent logic. |
| LlamaIndex | RAG-first. Rich tooling for document loaders, chunking, retrieval pipelines. | Data-heavy agents, "chat with your docs." |
| AutoGen (Microsoft) | Multi-agent conversations. Agents talk to each other to solve tasks. | Research, experimentation. Production less common. |
| CrewAI | Role-based multi-agent ("researcher", "writer", "editor"). Higher-level than AutoGen. | Content pipelines, structured multi-agent work. |
| Pydantic AI | Typed, minimal, Python-idiomatic. Strong structured-output support. | Production systems where schema matters. Rising fast in 2026. |
| Claude Agent SDK | Anthropic-native (formerly the "Claude Code SDK"). Closest to the metal. No framework overhead. | Claude-specific production agents where you want control. |
| OpenAI Swarm · OpenAI Agents SDK | OpenAI's own lightweight framework. | OpenAI-centric agents. |
| Semantic Kernel (Microsoft) | Enterprise-friendly, .NET + Python + Java. Plugin architecture. | .NET shops, enterprise Microsoft integrations. |
These aren't frameworks — they're end-user products that use all 10 layers internally. You use them; you rarely build with them.
Drag-and-drop boxes: trigger → action → action. LLM is one box among hundreds.
For pipelines measured in hours/days with retries, schedules, complex dependencies.
| Situation | Pick |
|---|---|
| Trigger from Gmail, enrich from HubSpot, post to Slack, one LLM summary in the middle | n8n |
| Agent that calls 5 tools, decides which based on user input, loops if result is unclear | LangGraph / Pydantic AI |
| Client wants "visual AI pipelines they can edit" | n8n |
| You need custom data models, state transitions, complex reasoning | LangGraph or custom code |
| Team is non-technical | n8n / Make |
| Team is senior engineering | Custom code with Claude Agent SDK / LangGraph |
Who: Anthropic, now adopted by many. What: open standard for giving LLMs structured access to tools, data, and services.
An MCP server exposes tools ("query_database", "read_file", "send_email"). Any MCP-aware client (Claude Desktop, Cursor, Claude Code, your custom agent) can discover and use them. Think: "USB for agent tools" — plug and play across vendors.
Why it matters: before MCP, wiring a tool to an agent meant writing glue code for every agent framework. After MCP, you write one server, every client works. This is becoming the industry default. Expect every major platform to ship MCP support in 2026.
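A one-file MCP server sketch using the official `mcp` Python SDK's FastMCP helper; `kb_search` is a hypothetical tool standing in for your real retrieval code:

```python
# Minimal MCP server (official `mcp` Python SDK, FastMCP helper).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("support-tools")

@mcp.tool()
def kb_search(query: str) -> str:
    """Search the support knowledge base and return the best passage."""
    return "Refunds are processed within 5 business days."  # stub — wire to real retrieval

if __name__ == "__main__":
    mcp.run()  # any MCP-aware client (Claude Desktop, Cursor, ...) can now discover kb_search
```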
Who: OpenAI introduced it in 2023; every frontier model now supports it. What: the model returns a structured JSON object saying "call function X with these arguments" instead of free-text. Your code executes it, returns the result, the model continues.
This is the raw mechanism. MCP is the standard way to package and share functions for reuse.
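The raw round-trip looks like this (OpenAI-style sketch; `get_weather` is a made-up example tool):

```python
# The raw function-calling round-trip. The model doesn't execute anything —
# it returns JSON naming the function and arguments; your code runs it.
import json
from openai import OpenAI

client = OpenAI()
tools = [{"type": "function",
          "function": {"name": "get_weather",       # hypothetical example tool
                       "description": "Current weather for a city.",
                       "parameters": {"type": "object",
                                      "properties": {"city": {"type": "string"}},
                                      "required": ["city"]}}}]

resp = client.chat.completions.create(model="gpt-4o-mini",
                                      messages=[{"role": "user", "content": "Weather in Riyadh?"}],
                                      tools=tools)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))  # -> get_weather {'city': 'Riyadh'}
```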
Who: Google's proposal (2025), others experimenting. What: lets one agent discover and call another. Still early — MCP covers most use cases, A2A is for agent-fleet scenarios.
The fallback. When no native AI protocol exists, a well-documented REST API with an OpenAPI spec is still the common ground. Most MCP servers are wrappers around existing REST APIs.
ChatGPT · Claude.ai · Gemini · Copilot · Perplexity · Pi · You.com
Cursor · Claude Code · Windsurf · Copilot · Replit Agent · Cody · Codeium
Jasper · Copy.ai · Notion AI · Mem · Lex · Writer
Intercom Fin · Zendesk AI · Decagon · Sierra · Ada
Gong · Chorus · Clay · Apollo AI · HubSpot Breeze
Otter · Fireflies · Granola · Tactiq · Glean
Perplexity · Phind · Exa · Kagi Assistant · You.com
Midjourney · Runway · Pika · Kling · Sora · Ideogram · Flux Pro
ElevenLabs · Cartesia · Deepgram · Vapi · Bland · PlayHT
Hex · Julius · Rowy · Metabase AI
Harvey · Hebbia · Spellbook · Robin AI
Abridge · Nuance DAX · Suki · OpenEvidence
Answers: "What did the agent do yesterday, how much did it cost, and where did it fail?"
Answers: "Did my new prompt make things better or worse?"
Blocks jailbreaks, prompt injection, unsafe outputs before they reach the user.
Redacts sensitive data before it reaches any LLM. Critical for GDPR, PDPL, HIPAA.
me-central2 (Dammam, KSA — actually inside the Kingdom) or on-prem Llama. Presidio in front of your LLM calls turns a non-compliant design into a compliant one.

| Layer | Pick | Why |
|---|---|---|
| L1 Silicon | NVIDIA (invisible) | Whoever hosts Claude picks — not our choice. |
| L2 Hosting | Vercel (Next.js) + your GCP project (Qdrant on GKE / Cloud Run) | Vercel for the edge UI, GCP for the stateful vector DB. |
| L3 Model | Claude Haiku · text-embedding-3-small | Haiku is cheap + fast; the small embedding model keeps indexing cost per million docs low. |
| L4 Inference | Anthropic API · OpenAI API | Direct, simplest. |
| L5 Data | Qdrant (vectors) · Postgres (tickets) | Qdrant for scale; Postgres for the support ticket system. |
| L6 DS platform | None | No model training. No platform needed. |
| L7 Orchestration | LangChain retrieval chain OR 80 lines of Node | Either works. If it's a one-off, skip LangChain. |
| L8 Protocol | Direct API calls | Nothing to reuse — MCP overkill here. |
| L9 Application | Chat widget embedded in Zendesk | Meet users where they are. |
| L10 Governance | Presidio (PII) · Llama Guard (safety) · Langfuse (trace) | PDPL compliance + observability from day one. |
Demo
- Just Claude + a bunch of docs stuffed into context
- No PII handling
- No eval harness
- No observability — if it breaks, you have no idea why
- Works for 50 docs, falls over at 5,000
Production
- Vector DB with hybrid retrieval + reranking
- Presidio strips PII before any external API call
- Promptfoo runs in CI, catches prompt regressions
- Langfuse traces every turn; Helicone caches common questions
- Scales to 500k docs without a rearchitecture
Look at the steps: 6 out of 7 are just "call an API on an existing SaaS." n8n has all of them as drag-in nodes. The one AI step (draft email via GPT-4o-mini) is also a drag-in node.
If you wrote this in Python, you'd be writing:
- Salesforce webhook listener (20 lines)
- Apollo API client (30 lines)
- Tier-list lookup (10 lines)
- OpenAI client + prompt (20 lines)
- Slack approval Bolt app (60 lines)
- Gmail SMTP client (15 lines)
- Salesforce update call (15 lines)
- Error handling, retries, logging (100+ lines)
That's a week of work. In n8n: an afternoon. And non-technical team members can edit it without fear.
| Layer | Pick |
|---|---|
| L2 Hosting | Self-hosted n8n on a $10/month VPS, OR n8n Cloud |
| L3 Model | GPT-4o-mini (cheap, good enough for outreach emails) |
| L4 Inference | OpenAI (via n8n node) |
| L5 Data | Salesforce is the source of truth; n8n holds no state |
| L7 Orchestration | n8n — the whole story lives here |
| L9 Application | Salesforce + Slack + Gmail (existing tools) |
| L10 Governance | Human-in-loop approval is the guardrail |
If the logic gets more adaptive — "if lead responded to a previous email, personalise based on that thread" or "if the company website uses React, mention React-specific case studies" — you're past n8n's sweet spot. Rebuild in LangGraph or custom code.
The line: n8n is a decision tree. An agent loops and decides. When you start building decision trees inside n8n that are 50 boxes deep, switch tools.
me-central2 for KSA), and never reaches Google's multi-tenant model pool.

Teradata holds the history
20 years of finance data. ~$50M/year licence. You do not migrate this. You talk to it.
Dataiku already has a semantic layer
Column names, metric definitions, hierarchies, RBAC. If the agent queried raw Teradata it would hallucinate column names and bypass all the governance the enterprise spent millions building. Querying via Dataiku inherits all of it.
The agent queries Dataiku via MCP
An MCP server exposes Dataiku datasets as agent tools. The agent doesn't know or care that the underlying store is Teradata — just "query revenue_by_region_quarter for LATAM 2020-2025."
Claude Opus for the reasoning step
Planning which queries to run, narrating findings in business English, handling "drill deeper" follow-ups — this is where you want the frontier model. Cheaper models would produce shallower analysis.
| Layer | Pick |
|---|---|
| L2 Hosting | Client's existing infrastructure (on-prem or your GCP project) |
| L3 Model | Claude Opus or Gemini 2.5 Pro (reasoning) + Gemini Flash (chart captions) |
| L4 Inference | GCP Vertex AI (Claude or Gemini, single-tenant, region-pinned for compliance) |
| L5 Data | Teradata (history) via Dataiku |
| L6 DS Platform | Dataiku — semantic layer + RBAC |
| L7 Orchestration | LangGraph (structured multi-step reasoning) |
| L8 Protocol | MCP server for Dataiku — the clean integration point |
| L9 Application | Slack bot (CFO already lives there) |
| L10 Governance | Langfuse (trace) · Presidio (PII in logs) · Dataiku's own RBAC |
Mistake 1: "Let's migrate Teradata to Snowflake for the AI project." No. Never propose a multi-million-dollar data migration as part of an AI project. Build on top.
Mistake 2: "The agent will write raw SQL against Teradata." It will hallucinate table names, miss business definitions, and bypass RBAC. Always go through the semantic layer.
Mistake 3: "We'll use a smaller model to save cost." Financial analysis needs careful reasoning. Going cheap here destroys the client's trust on the first wrong answer. Use Opus or GPT-5-class.
Mistake 4: "Skip MCP, call Dataiku directly." Then every future AI product the client adopts will have to re-integrate. MCP server once, reused forever.
Voice feels natural under 300ms round-trip, stilted at 500ms, broken above 800ms. Here's the latency budget:
| Step | Typical latency | Notes |
|---|---|---|
| Network (caller → server) | ~40 ms | Geography-dependent |
| STT (streaming) | ~100 ms | Deepgram partial transcripts |
| LLM first-token (Llama on GPU) | 400–800 ms | The bottleneck |
| LLM first-token (Llama on Groq) | ~100 ms | The fix |
| TTS first-audio | ~150 ms | Cartesia is the fastest |
| Network (server → caller) | ~40 ms | |
On a regular GPU, total = ~870ms. Caller experience: awkward pauses. On Groq: ~430ms. Caller experience: natural conversation. That's the entire product difference.
Two reasons:
- Latency: closed frontier models have first-token latencies of 400–1000ms. Fine for chat, too slow for voice.
- Groq doesn't host them: Claude runs only on Anthropic/Bedrock/Vertex, GPT only on OpenAI/Azure. You can't put them on an LPU.
OpenAI's Realtime API (GPT-4o) is a credible alternative — it's designed for voice specifically. But you're locked into OpenAI and the pricing gets expensive fast. Groq + Llama is the open-weight path.
| Layer | Pick |
|---|---|
| L1 Silicon | Groq LPU — the reason this works |
| L3 Model | Llama 3.3 70B (LLM) · Whisper or Deepgram Nova-2 (STT) · Cartesia (TTS) |
| L4 Inference | Groq · Deepgram · Cartesia |
| L7 Orchestration | LiveKit Agents framework (voice-native) or Vapi (managed) |
| L9 Application | Phone via Twilio + dashboard for reviewing calls |
| L10 Governance | Call recording, transcription archive, Langfuse |
Nine agents with distinct roles: Project Manager, Productizer, VPS Admin, Backend Dev, Frontend Dev, Data Scientist, Security, HR, Marketing. Each has its own system prompt, toolset, and responsibilities. An orchestrator queues tasks, spawns short-lived worker containers, collects results, respects a capacity cap, escalates decisions to a human via Telegram.
| Layer | Pick | Rationale |
|---|---|---|
| L1 Silicon | Invisible | Not self-hosting inference. |
| L2 Hosting | Single $20/month VPS | Single operator. Small scale. Cheap. |
| L3 Model | Claude Opus + Sonnet | Best at coding + long-context reasoning. Consistent. |
| L4 Inference | Anthropic (via Claude Max subscription) | Session-based billing dodges per-token surprise. |
| L5 Data | PostgreSQL + pgvector + Redis | Single-user scale — pgvector is enough. No second DB. |
| L6 DS Platform | None | This is a software-shipping team, not a data-science team. |
| L7 Orchestration | FastAPI + Claude Agent SDK + Claude Code CLI | LangChain/CrewAI would add mass without buying much. |
| L8 Protocol | Direct API calls now; MCP later | MCP is industry-wide direction for tool integration. |
| L9 Application | Next.js dashboard + Telegram bot + daily email | Three channels matching three contexts (desk, mobile, morning). |
| L10 Governance | Built-in audit log, capacity governor, weekly team-health review | Designed in from day one, not bolted on. |
Teradata holds the customer history
Every prior order, ticket, payment, and product registration. Twenty years for some customers. You do not want the agent guessing — it answers from the truth.
Dataiku exposes it as a clean semantic layer
"customer_360" view: name, tier, lifetime value, open tickets, last interaction, churn risk. The agent gets one tidy object via an MCP tool — never sees raw Teradata tables, never hallucinates column names, inherits the existing RBAC.
LangChain orchestrates the loop
Classify intent → fetch context → check policy → draft → validate → respond → log. LangGraph models the state machine explicitly so you can replay any conversation deterministically. Open source, runs in your VPC.
Groq runs the language model — fast and cheap
Llama 3.3 70B at hundreds of tokens per second on Groq's LPU. Customer-care replies need to feel instant; Groq is 3–5× faster than the same model anywhere else. Trade-off: prompts go to Groq Cloud, so PII is redacted before they leave (Hotspot 1 from the Live Ecosystem Map).
A confidence gate before any action
Refunds, account changes, and escalations only execute if (a) the agent's self-rated confidence is ≥ 0.85, (b) the action is on the per-tier allow-list, and (c) it's within the per-customer rate limit. Else: hand to human with the draft attached.
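A sketch of that gate as plain code — the thresholds, allow-lists, and rate-limit numbers are illustrative, not prescriptive:

```python
# Confidence gate sketch — the three checks (a), (b), (c) from above.
ALLOW_LIST = {"gold": {"refund", "reschedule", "escalate"},
              "standard": {"escalate"}}
MAX_ACTIONS_PER_HOUR = 3

def gate(action: str, confidence: float, tier: str, actions_last_hour: int) -> str:
    if confidence < 0.85:                           # (a) self-rated confidence too low
        return "handoff: low confidence — send draft to human"
    if action not in ALLOW_LIST.get(tier, set()):   # (b) action not on the tier allow-list
        return "handoff: action not allowed for this tier"
    if actions_last_hour >= MAX_ACTIONS_PER_HOUR:   # (c) per-customer rate limit hit
        return "handoff: rate limit — possible loop"
    return "execute"

print(gate("refund", 0.91, "gold", 1))      # -> execute
print(gate("refund", 0.91, "standard", 0))  # -> handoff: action not allowed for this tier
```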
| Layer | Pick | Zone |
|---|---|---|
| L2 Hosting | Your GCP project (GKE / Cloud Run) or on-prem K8s | Local |
| L3 Model | Llama 3.3 70B (replies) · Llama 3 8B (intent classify) | OS |
| L4 Inference | Groq Cloud for speed · Ollama on-prem for fallback / regulated tenants | Public / Local |
| L5 Data | Teradata (history) · Postgres+pgvector (KB) | Local |
| L6 DS Platform | Dataiku — customer_360 semantic + churn-risk feature | Local |
| L7 Orchestration | LangChain + LangGraph (state machine + tool calling) | Local |
| L8 Protocol | MCP servers: customer_history, order_actions, kb_search, escalate | Local |
| L9 Application | Webchat widget · WhatsApp Business API · email gateway · Zendesk for human handoff | Mixed |
| L10 Governance | Presidio (PII redaction) · Langfuse (traces) · Llama Guard (output filter) · in-house audit_log | Local |
- Llama 3.3 70B (Meta, OS). You can run it on Groq Cloud today and switch to your own GPU box tomorrow without rewriting a line. No model lock-in.
- LangChain / LangGraph (OS). The orchestration is your code. You can audit it, fork it, or swap it if a vendor framework would be faster. Vendor lock-in for orchestration is the most expensive lock-in to escape later.
- Presidio (Microsoft, OS). PII detection runs on-prem before any prompt leaves. If you used a SaaS redactor, you'd be sending raw PII to that SaaS first — defeats the point.
- pgvector (Postgres extension, OS). Your KB embeddings live in your existing Postgres. No third vector DB to license / monitor / back up.
- Llama Guard (Meta, OS). Runs locally. Catches policy violations (hate speech, PII leak in output) without sending the response to a moderation API.
me-central2 for KSA / Dammam) so the data never leaves your jurisdiction.

Latency budget is brutal
A suggestion that lands 6 seconds after the customer's question is useless — the human already moved on. Groq's LPU + Llama 3 8B for the suggestion loop hits sub-700ms on a 4-second sliding transcript window. This is exactly why Groq is in the stack.
Speech recognition stays on-prem
VIP calls contain account numbers, PINs, deal terms. Use Whisper-large-v3 on a local GPU; never stream audio to a cloud STT. The transcript that goes to the LLM is already partially redacted by a regex filter (account numbers masked).
Dataiku's churn-risk & tier scores drive escalation
If the live sentiment turns negative and the customer is in the top decile of LTV, the AI silently pages the team lead — no waiting for the customer to ask.
Two model tiers
Llama 3 8B on Groq for the every-4-second loop (cheap, fast). Claude Opus or Gemini 2.5 Pro on Vertex AI in your own GCP project for the post-call wrap-up — better at writing summaries the relationship manager actually trusts. The mix matters.
| Layer | Pick | Zone |
|---|---|---|
| L2 Hosting | On-prem GPU box for Whisper · your GCP project (GKE) for the agent loop · Vertex AI region-pinned | Local / Priv |
| L3 Model | Whisper-large-v3 (ASR) · Llama 3 8B (live loop) · Claude Opus or Gemini 2.5 Pro (wrap-up) | OS + Frontier |
| L4 Inference | Local GPU (Whisper) · Groq (live Llama loop) · GCP Vertex AI (Claude / Gemini wrap-up) | Local + Public + Priv |
| L5 Data | Teradata (relationship history) · Postgres (live call state) | Local |
| L6 DS Platform | Dataiku — customer_360, churn_risk, lifetime_value features | Local |
| L7 Orchestration | LangGraph (4-second loop) + custom websocket runner for the agent screen | Local |
| L8 Protocol | MCP servers: customer_360, open_tickets, page_supervisor, book_followup | Local |
| L9 Application | Genesys / your existing telephony · agent-desktop side-panel (Vue) · supervisor pager (Telegram or Slack) | Mixed |
| L10 Governance | Hard rule: AI never speaks · transcripts retained per regulator · Langfuse self-hosted · per-rep accuracy dashboard | Local |
me-central2 for KSA / Dammam) · no PII in summary by template.

A long-form instruction defining the agent's role, scope, tone, constraints, escalation triggers, and output format. Usually 500–5,000 words. Rewritten many times during development.
"You are a senior customer support agent. You answer only from the provided context. If unsure, escalate to a human. Never promise refunds — offer to file a request."
Composed of: system prompt + conversation history + relevant retrieved documents (RAG) + tool descriptions + user's current message. All of it must fit inside the model's context window budget.
Functions the agent can call. Examples: search_database(query), create_ticket(data), send_email(to, subject, body). Each tool has a name, a description, a parameter schema, and a handler that executes it.
Key insight: the agent's capability is defined by its tools. A smart LLM with no tools is just a chatbot. A modest LLM with the right tools can run a business.
Two scopes:
- Short-term: the current conversation. Lives in the context window.
- Long-term: persistent across sessions. Usually vectors in Qdrant/pgvector, retrieved by relevance on each new conversation.
Agents don't just answer once. They loop: pick tool → call it → observe result → decide next action → repeat until done. The loop is where agent logic gets complex.
Guardrails: max iterations, timeout, budget cap. Without these, a buggy agent loops forever burning tokens.
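A skeleton of those guardrails; `call_model` and `run_tool` are hypothetical stand-ins for your inference call and tool dispatcher:

```python
# Agent-loop guardrails sketch. `call_model(history)` is assumed to return an
# object with .done, .answer, .tool, .args, .cost_usd — adapt to your SDK.
import time

def run_agent(task: str, call_model, run_tool,
              max_iters: int = 10, timeout_s: float = 60.0, budget_usd: float = 0.50):
    spent, start, history = 0.0, time.monotonic(), [task]
    for _ in range(max_iters):                          # guardrail 1: iteration cap
        if time.monotonic() - start > timeout_s:        # guardrail 2: wall-clock timeout
            return "escalate: timeout"
        if spent > budget_usd:                          # guardrail 3: budget cap
            return "escalate: budget exceeded"
        step = call_model(history)                      # pick tool — or finish
        spent += step.cost_usd
        if step.done:
            return step.answer
        history.append(run_tool(step.tool, step.args))  # observe result, loop again
    return "escalate: max iterations reached"
```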
Tests that run against the agent. Can be: exact-match against gold answers, LLM-as-judge scoring (another LLM rates the output), human review, business metrics (tickets resolved per hour). In production, usually all of these.
Every inference call logged, every tool call recorded, every decision traced. Without this, you cannot debug. With it (Langfuse, LangSmith, Helicone), you can replay any conversation and see exactly what the model saw and why it chose what it chose.
| Pattern | Shape | Best for |
|---|---|---|
| ReAct loop | Think → Act → Observe → Think | General tool-using agents. |
| Plan-then-Execute | Planner writes full plan; executor runs steps | Complex multi-step tasks. |
| Reflexion / self-critique | Agent reviews its own output before final answer | Quality-sensitive generation. |
| Multi-agent crew | Specialised agents collaborate | Complex domains with clear sub-roles (like Apex). |
| Graph / state machine | Explicit nodes and edges | Flows where you need deterministic control. |
| Situation | First-instinct pick |
|---|---|
| Best-in-class coding | Claude Sonnet / Opus · Qwen 2.5 Coder (open) |
| Cheapest frontier-quality reasoning | DeepSeek V3/R1 · Gemini 2.5 Flash |
| 1M+ tokens of context | Gemini 2.5 Pro |
| Run on your own hardware | Llama 3.3 70B · Qwen 2.5 (run via Ollama, vLLM, or llama.cpp) |
| Voice agent speed | Llama 3.3 70B on Groq |
| Arabic / multilingual | Qwen 2.5 · Gemini · Claude |
| Strong vision | Claude Sonnet · GPT-4o · Gemini 2.5 |
| Cheap high-volume classification | Claude Haiku · Gemini Flash · GPT-4o-mini |
| Situation | First-instinct pick |
|---|---|
| Self-host an LLM on a small VPS | Ollama + Llama 3.x 8B or Qwen 2.5 Coder 7B |
| Real-time voice (under 300ms first-token) | Groq + Llama 3.3 70B · OpenAI Realtime as paid alt |
| Live agent assist / call-centre co-pilot | Groq + Llama 3 8B (sub-700ms round-trip) |
| High-throughput batch on open weights | Groq for speed · Together / Vertex AI for cost |
| Frontier reasoning (Claude / Gemini quality) | GCP Vertex AI — not Groq, Groq doesn't host them |
| Sensitive prompts you can't redact | Vertex AI in your GCP project or local Llama — not Groq (it's a public API) |
| Unified gateway across providers | LiteLLM (open) · Portkey (managed) |
| Rent an H100 for a day | RunPod · Lambda Labs · Modal |
| Enterprise with GCP commitment (your default) | GCP Vertex AI · Claude or Gemini |
| Need 1M+ token context | Gemini 2.5 Pro on Vertex AI |
| KSA data residency required | Vertex AI me-central2 (Dammam) OR on-prem Llama |
| Situation | First-instinct pick |
|---|---|
| RAG on ≤10M vectors | pgvector (reuse existing Postgres) |
| RAG on 10M–1B vectors | Qdrant (self-host) · Pinecone (managed) |
| Modern analytical warehouse | Snowflake · BigQuery (GCP) · Databricks |
| Existing Teradata estate | Work with it, don't migrate |
| Event / product analytics | ClickHouse |
| Graph relationships matter | Neo4j |
| Hybrid (keyword + vector) search | Elasticsearch + pgvector/Qdrant + Cohere Rerank |
| Situation | First-instinct pick |
|---|---|
| Visual automation, LLM is one step | n8n (self-host) · Make · Zapier |
| Custom agent, Claude-centric | Claude Agent SDK |
| Complex multi-step reasoning | LangGraph · Pydantic AI |
| Multi-agent collaboration | CrewAI · AutoGen · custom |
| RAG-heavy agent | LlamaIndex |
| Durable agent workflows (hours/days) | Temporal |
| Multi-step coding tasks | Claude Code · Cursor · Windsurf |
| Situation | First-instinct pick |
|---|---|
| Production observability, open-source | Langfuse |
| Production observability, managed | Helicone · LangSmith |
| Prompt evaluation in CI | Promptfoo |
| Block prompt injection | Llama Guard 4 (self-host) · Lakera Guard (SaaS) |
| Redact PII before LLM | Presidio (open) · Private AI (managed) |
| RAG-specific evaluation | Ragas |
me-central2 (Dammam, inside KSA) or on-prem Llama are the compliant paths.

For every box you draw in the architecture, answer these three questions out loud:
- Where does the data physically sit when this component is using it? (your DC · your VPC · vendor's cloud)
- Who can read it there? (your team · your cloud provider · the vendor's employees · the public internet)
- What contract or law constrains them? (DPA · SLA · GDPR/PDPL/HIPAA · "trust me bro")
If the answers feel hand-wavy, you have a risk. If you can't answer at all, you have a problem.
me-central2 Dammam for KSA · me-central1 Doha for Qatar · europe-west1 Belgium for EU) · DPA in place · Cloud Audit Logs on · IAM scoped per service-account · model versions pinned via Model Garden · VPC Service Controls to forbid data egress.

Use open source when any of the following is true:
- The data is regulated or sensitive. If you can't send it to a public API, the model has to run where you can run it — that means open weights.
- The component is on the hot path of your business logic. Orchestration, agent loops, RAG retrieval — anything you'll want to fork, debug, and customise. Vendor lock-in here is the most expensive lock-in to undo.
- Cost will explode at scale. Per-token pricing makes sense when usage is small. At a million queries a day, an open model on your hardware is 5-20× cheaper than a frontier API.
- You need predictable behaviour. Open weights don't change overnight. A vendor "model improvement" can break your evals on a Tuesday.
Use proprietary / SaaS when:
- You need the absolute best reasoning available, and the prompts are not sensitive. (Claude Opus, GPT-5-class.)
- The component is undifferentiated infrastructure you would never build yourself. (CDN, email delivery, payment processing.)
- You're prototyping and time-to-first-demo matters more than long-term cost.
| Layer | Open-source default | Proprietary when |
|---|---|---|
| L3 Models | Llama 3.3 70B · Llama 3 8B · Qwen 2.5 · DeepSeek | Claude / GPT for top-tier reasoning when prompts are not sensitive |
| L4 Inference | Ollama (local) · vLLM · TGI | Groq / GCP Vertex AI / Anthropic API for speed or scale you can't host |
| L5 Vector DB | pgvector · Qdrant · Weaviate | Pinecone if you want zero ops and aren't worried about lock-in |
| L6 ML Platform | MLflow · Metaflow · Kubeflow | Dataiku when you need a visual semantic layer + RBAC for non-coders |
| L7 Orchestration | LangChain / LangGraph · n8n · CrewAI · Temporal | Vendor agent platforms only when you accept the lock-in |
| L8 Protocols | MCP · OpenAPI | (no proprietary alternative — open is the standard) |
| L10 Governance | Langfuse self-hosted · Promptfoo · Llama Guard · Presidio | SaaS observability only for non-sensitive workloads |
The structural layer
- Orchestration · LangChain / LangGraph — this is your code, never lock it in
- Vector store · pgvector — already in your Postgres
- Local model · Llama 3.3 70B on Ollama — the regulated-data fallback
- Speech-to-text · Whisper local — never stream audio to a cloud STT
- PII redaction · Presidio — must run before prompts leave
- Output safety · Llama Guard — local moderation
- Tracing · Langfuse self-hosted — full prompt visibility, kept private
- Eval · Promptfoo — your evals, your test data, in your repo
The intelligence layer (selectively)
- Frontier reasoning · Claude Opus or GPT-5-class — only for non-sensitive prompts, only where it earns its cost
- Fast inference · Groq Cloud — for latency-bound loops, with redaction in front
- Enterprise data · Teradata / Dataiku — already paid for, don't re-platform
- Single-tenant frontier · GCP Vertex AI — Claude or Gemini inside your own GCP project, region-pinned (your default for any prompt with sensitive content)
- Long-context reasoning · Gemini 2.5 Pro via Vertex AI — when you need 1M+ tokens (whole repos, long contracts, large case files)
- Telephony · Genesys / Twilio — unless on-prem PBX is required
- Channel APIs · WhatsApp Business / SMS gateway — unavoidable for the channel itself
Frame the user job — one sentence
"Who is doing what task, and what would 'much better' look like for them?"
Customer Care AI: "An L1 support agent at our retail client handles 80 tickets a day; a well-scoped AI should resolve 50 of them end-to-end and prepare drafts for the other 30."
Decide: replace, co-pilot, or augment?
- Replace — AI handles the whole task, human is exception path. Use for high-volume, low-stakes, well-defined work (L1 ticket triage, simple RAG Q&A).
- Co-pilot — AI sits next to the human in real time. Use for high-stakes, high-judgement work (VIP call centre, financial analysis, code review).
- Augment — AI runs offline, prepares work for humans to consume. Use for batch tasks (lead enrichment, document summarisation, daily briefings).
Map the data sources — every box, every owner
Use the Data layer (L5) and DS Platform layer (L6) as your starting checklist. Don't propose any data migration — work with what exists. If Teradata holds the truth, the agent talks to Teradata (via Dataiku).
Pick the model — frontier vs open, big vs small
- Reasoning steps (planning, complex Q&A, code) → frontier (Claude Opus / GPT-class / Gemini 2.5 Pro) or top-tier open (Llama 3.3 70B).
- Long-context tasks (whole codebases, long PDFs, multi-doc analysis) → Gemini 2.5 Pro on Vertex AI for 1M+ tokens.
- Classification, extraction, simple Q&A → small open model (Llama 3 8B, Qwen 2.5 7B) or Claude Haiku / Gemini Flash. Cheaper and often faster.
- Latency-bound loops (live voice, autocomplete, live agent assist) → small open model on Groq · sub-300ms first-token. Without Groq, voice and live-assist don't work.
- Sensitive prompts → must be open weights hosted by you, or Claude/Gemini via Vertex AI in your own GCP project (single-tenant, region-pinned). Not Groq — it's a public API.
Pick the inference home — three trust zones, one rule per zone
- Local on-prem (green zone) → Ollama / vLLM on your GPU · use for any prompt you can't let leave the building. Open weights only. Slowest, but most compliant.
- Your GCP project (amber zone) → GCP Vertex AI · Claude or Gemini, region-pinned (use me-central2 for KSA / Dammam). Use for sensitive prompts that need frontier quality, or any non-time-bound workload. Your default.
- Public API (red zone) → Groq for latency-bound open-weight work, Anthropic API for the fastest path to Claude when the prompt is non-sensitive, OpenAI API for GPT specifically. Always Presidio-redact before any call.
Pick the orchestration shape
- Single-shot LLM call — RAG chatbot, summariser. No agent. Don't over-engineer.
- Workflow (n8n) — fixed steps, mostly SaaS API calls, one or two LLM nodes. Sales-outreach archetype.
- Agent (LangGraph) — model decides which tool to call next, loops until done. Customer Care, Analytics archetype.
- Multi-agent — multiple specialised agents handing off via tasks/queue. Apex archetype. Don't reach for it unless the work genuinely splits across roles.
Define the tools (MCP servers, one per integration)
Naming convention: {system}_{verb} — e.g. customer_history_get, order_refund, kb_search, ticket_escalate. Each tool has a one-line description that the model sees, an explicit JSON schema, and a unit test that the orchestrator runs at startup.
Draw the trust map · mark every red box
If you have a red box, write the mitigation next to it (Presidio in front · contract terms · fallback to local model · etc.). If a red box has no mitigation, the architecture is not done.
Define the guardrails & the escalation gate
- Confidence threshold — below X, hand to human with the draft attached.
- Action allow-list per tier — refunds up to $X without approval, anything above goes to a human.
- Output filters — Llama Guard for unsafe content, custom regex for never-say-this-to-a-customer phrases, Presidio for PII in outputs.
- Per-conversation rate limits — kill switch if the agent loops on the same customer five times in an hour.
Build the eval set before writing the code
Source examples from real tickets / calls / emails (with PII scrubbed). Cover the happy path, the obvious failure modes, the legally-sensitive cases, and the cases where the right answer is "I don't know — escalating."
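A sketch of what "eval set before code" means in practice — cases as data, a trivial harness around them; `agent` is a stand-in for your system:

```python
# Eval-set sketch: the cases exist before the agent does.
EVAL_CASES = [
    {"input": "Where is order A-1042?",          "expect_contains": "shipped"},     # happy path
    {"input": "Refund my order to a new card",   "expect_contains": "escalat"},     # policy-sensitive
    {"input": "What's your CEO's home address?", "expect_contains": "can't help"},  # must refuse
    {"input": "Is the Q3 rebate still running?", "expect_contains": "don't know"},  # honest gap
]

def run_evals(agent) -> float:
    passed = sum(case["expect_contains"].lower() in agent(case["input"]).lower()
                 for case in EVAL_CASES)
    return passed / len(EVAL_CASES)  # track this number in CI from day one
```

Exact substring checks are the crudest tier; in production you layer LLM-as-judge and business metrics on top, as described above.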
Estimate the cost & the build size · then commit
- Per-query cost = (input tokens × input price) + (output tokens × output price), summed across every model call in a typical conversation. Multiply by expected daily volume × 30. If the number is offensive, switch a step to a smaller model and recompute (see the cost sketch after this list).
- Build size = MCP tools (1 day each) + orchestration loop (1 week) + frontend / channel integration (1–3 weeks) + evals (3 days) + observability (3 days) + hardening / load-test (1 week). Typical first MVP: 4–8 engineer-weeks.
- Then commit. Write a one-page architecture sketch with the layer table, the trust map, the eval plan, the cost-per-query, and the timeline. That document is the basis of the SOW.
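A sketch of that arithmetic as runnable code — the prices are illustrative placeholders, not a rate card:

```python
# Per-query cost sketch. Prices are illustrative $/M-token placeholders —
# substitute the current rate card for the models you actually pick.
PRICE = {"small":    {"in": 0.15, "out": 0.60},   # Haiku/Flash-class
         "frontier": {"in": 3.00, "out": 15.00}}  # Opus/GPT-class

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    p = PRICE[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

# A typical conversation: one cheap intent classify + two frontier reasoning turns.
per_query = call_cost("small", 800, 20) + 2 * call_cost("frontier", 3000, 500)
monthly = per_query * 5_000 * 30        # 5,000 conversations/day × 30 days
print(f"${per_query:.4f}/query  ≈ ${monthly:,.0f}/month")
```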
`customer_history_get`, `kb_search`, `order_refund`, `order_reschedule`, `ticket_escalate`.

`ollama run llama3.3`.