Part 1 · Foundations
AI Ecosystem Primer
A 10-layer map of the AI stack — every tool you hear about slots into one of them.
The Stack at a Glance
Every AI product is a choice across 10 layers · hover a row
Every tool you'll hear about slots into exactly one of these layers. This diagram is your decoder ring.
Who this is for

Everyone at NMO who touches AI work — whether you're pitching a client, scoping a project, writing code, or reviewing someone else's architecture. You don't need prior ML experience. You do need to be comfortable hearing names like "pgvector" or "LangGraph" without googling every one.

By the end you should be able to:

  • Sketch an AI agent architecture on a whiteboard and place every component in the right layer.
  • Evaluate any new tool that lands in your inbox — quickly classify which layer it plays in and what it replaces.
  • Cut through vendor pitches that collapse multiple layers into "one platform" without telling you what you're giving up.
  • Make informed recommendations when a client asks "should we use X?"
How to read this

First pass (45 min): read Part 1 cover to cover, skim Part 2 layer headers, read Part 3 worked examples, skim Part 4.

Deep pass (~3 hours): read Part 2 layer-by-layer. Each layer card is independent — you can stop between them.

Reference mode: when you hear a tool name, search the page (Ctrl+F). Every tool named lives in a specific layer with context.

Opinionated, on purpose

This primer makes judgment calls. "Use pgvector when you have fewer than 10M vectors" is opinion — defensible, pragmatic, and much more useful than a neutral survey. Where there's a live debate between tools, we flag it explicitly. Where the industry has converged on an answer, we state it.

When a client hires us, they don't want a library tour — they want a recommendation. This document trains you to recommend with confidence and with reasons.

Version promise: AI tooling moves fast. This primer is stamped April 2026. Treat any tool list as "accurate now, revisit quarterly." The layer model itself is stable — new tools always slot into an existing layer.
Part 1 · Foundations
Core Concepts & Vocabulary
Before the layers, the vocabulary. If someone on your team confuses "model" with "provider" or "RAG" with "fine-tuning," this is the page to send them.
The Six Terms That Trip Everyone Up
1 · LLM (Large Language Model)

A neural network trained on enormous amounts of text, capable of producing human-like language. Examples: GPT-5, Claude Opus, Llama 4, Gemini 3, DeepSeek V3.

What it is not: an application. "ChatGPT" is an application that uses GPT (a model). Confusing these is like confusing "a car" with "the Toyota Corolla engine."

Two flavours:

  • Closed / proprietary: access only via the maker's API. Claude, GPT, Gemini, Grok. Usually strongest at the frontier.
  • Open-weight: the model weights are published. You can download and run yourself. Llama, Mistral, Qwen, DeepSeek, Gemma, Phi.
2 · Inference

The act of running a model to produce output. Training a model is enormously expensive and rare; inference is what happens every time someone sends a prompt. When people say "inference costs" or "inference provider," they mean this.

Every ChatGPT request = one inference call. Running 1,000 prompts = 1,000 inferences. Throughput and latency are measured in tokens/second during inference.

3 · Tokens

The unit of work for LLMs. A token is roughly 3/4 of an English word — "hello" is one token, "unbelievable" is two or three. Everything is priced in tokens: input tokens (what you send) are priced differently from output tokens (what the model generates).

Why this matters: 1,000 tokens of Claude Opus output ≈ 7.5¢. 1,000 tokens of Claude Haiku ≈ 0.5¢. Same family, 15× cost difference — because the smaller model is much cheaper to run. Choosing the right model per task is the single biggest lever on your AI budget.
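
The arithmetic is worth doing explicitly. A minimal sketch in Python, using the output prices quoted above; the input prices and per-call token counts are illustrative assumptions — check the provider's current price sheet before budgeting.

```python
# Back-of-envelope cost model. Output prices match the figures above;
# input prices and token counts per call are ASSUMPTIONS for illustration.
PRICES_PER_MTOK = {              # (input $, output $) per 1M tokens
    "claude-opus":  (15.00, 75.00),   # ~7.5 cents per 1k output tokens
    "claude-haiku": (1.00, 5.00),     # ~0.5 cents per 1k output tokens
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single inference call."""
    in_price, out_price = PRICES_PER_MTOK[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# 10,000 support replies a day, ~600 input + 250 output tokens each:
for model in PRICES_PER_MTOK:
    print(model, f"${10_000 * call_cost(model, 600, 250):,.2f}/day")
```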

4 · Context window

How much text the model can "see" in a single inference call — input + output combined. GPT-3.5 was 4k tokens (~3,000 words). Claude 4 is 200k (~150k words, a whole novel). Gemini 2.5 Pro is up to 2M.

Why this matters: bigger context = you can stuff more reference material in. But: longer context = slower + more expensive + more likely to lose focus on early content. "Throw everything into context" rarely beats "retrieve the right 5 chunks."

5 · Embeddings & vectors

An embedding model converts text (or images, or audio) into a list of ~1,500 numbers. Texts with similar meanings produce similar number lists. That numerical fingerprint is called an embedding or vector.

The power: once text is a vector, you can do math on meaning. Search "find documents about refunds" by converting the query to a vector and finding the closest document vectors. That's semantic search. It's the foundation of RAG.
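
"Math on meaning" is just comparing vectors. A minimal sketch with toy 4-dimensional vectors (real embeddings have ~1,500 dimensions, and the values here are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 = similar meaning, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "embeddings" for three queries:
refund  = [0.23, -0.11, 0.87, 0.05]
returns = [0.21, -0.09, 0.85, 0.07]
pizza   = [0.91, 0.34, -0.52, 0.10]

print(cosine(refund, returns))  # high  -> "refund" and "return policy" are neighbours
print(cosine(refund, pizza))    # low   -> unrelated topics land far apart
```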

6 · RAG (Retrieval-Augmented Generation)

A pattern, not a product. When the user asks a question:

  1. Convert the question to an embedding.
  2. Search a vector database for the most relevant documents (semantic search).
  3. Stuff the top-K documents into the LLM's context alongside the original question.
  4. Generate the answer with that grounding.

This lets a generic LLM answer questions about your data without retraining. It's how every "chat with your docs" product works. It's also the most common way AI becomes actually useful in an enterprise.
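
The whole pattern fits in a few dozen lines. A minimal sketch using the OpenAI Python SDK and an in-memory document list — both assumptions for illustration; a production system swaps the list for a vector DB (Qdrant, pgvector) and adds chunking and reranking.

```python
# Minimal RAG loop: embed -> search -> stuff context -> generate.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Support is available 24/7 via live chat.",
]

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

INDEX = [(doc, embed(doc)) for doc in DOCS]   # step 0: embed the corpus once, offline

def answer(question: str, k: int = 2) -> str:
    q_vec = embed(question)                                           # 1. embed the question
    top_k = sorted(INDEX, key=lambda d: -cosine(q_vec, d[1]))[:k]     # 2. semantic search
    context = "\n".join(doc for doc, _ in top_k)                      # 3. stuff top-k into context
    resp = client.chat.completions.create(                            # 4. grounded generation
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer only from the context below. Say 'I don't know' otherwise.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is the refund policy?"))
```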

Visualising Tokens — What The Model Actually Sees
1 word ≠ 1 token · tokenisation is the atomic unit of cost
Tokenisation: a sentence split into the eight tokens a model sees.
The human sees this sentence: "Claude is an unbelievably helpful assistant."
The model sees these tokens (8 total): Claude · is · an · un · believ · ably · helpful · assistant.
6 words → 8 tokens. "Unbelievably" splits into 3 because the tokenizer learned "un", "believ", "ably" are common sub-words.
Rule of thumb: 1,000 tokens ≈ 750 English words. Arabic, code, and rare words produce more tokens per word.
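
You can inspect tokenisation yourself. A sketch assuming OpenAI's open-source tiktoken library; other model families use different tokenizers, so treat counts for non-OpenAI models as estimates.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # tokenizer used by GPT-4-era models
tokens = enc.encode("Claude is an unbelievably helpful assistant.")

print(len(tokens))                            # token count for this sentence
print([enc.decode([t]) for t in tokens])      # the individual sub-word pieces
```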
Visualising Embeddings — Meaning Becomes Math
Similar meanings → similar vectors → nearby points in space
Embeddings: text turns into numeric vectors that cluster by meaning in space.
Step 1 · Text becomes numbers:
  • "How do I get a refund?" → [0.23, -0.11, 0.87, ...]
  • "Return policy for items" → [0.21, -0.09, 0.85, ...]
  • "Track my shipment" → [-0.44, 0.62, 0.11, ...]
  • "Best pizza in Riyadh" → [0.91, 0.34, -0.52, ...]
  (each vector has ~1,500 numbers)
Step 2 · Math finds similar meaning: "refund" and "return policy" land close, while "shipment" and "pizza" land far apart — the model has never been told they're related, yet the geometry finds it.
This is why semantic search works. A query about "refunds" retrieves documents about "return policy" — even though not a single word matches.
Visualising RAG — How The Magic Happens
"Chat with your docs" always follows this exact shape
RAG flow: the user's question is embedded, similar docs are retrieved from a vector store, then passed with the question to the LLM to produce an answer.
  1. USER — "Refund policy?"
  2. EMBED — [0.23, -0.11, ...]
  3. VECTOR DB SEARCH (Qdrant · pgvector) — returns the TOP-5 DOCS, the most relevant chunks
  4. LLM (Claude / GPT) — generates an answer grounded in the retrieved chunks. System prompt: "Answer only from the context below. Say 'I don't know' otherwise." Context: [chunk 1] [chunk 2] [chunk 3] [chunk 4] [chunk 5] | User: "Refund policy?"
  5. Grounded answer, with citations.
The LLM by itself doesn't know your company's refund policy. RAG teaches it just long enough to answer — then forgets. No training required.
Model ≠ Provider ≠ Application — Drill It Home
Layer | Example | What you pay for
Model | Llama 3.3 70B | Nothing — open-weight, free to download
Provider | Groq · Together · Fireworks | Per-million tokens at the provider's rate
Application | A chatbot built on Groq | Subscription to the application itself

Same model, three different things to buy. Once you internalise this split, every vendor pitch becomes readable.

Agent vs Chatbot vs Automation
Chatbot
Single-turn or multi-turn conversation. User asks, model answers. No autonomy, no tool use. ChatGPT in its basic mode.
Automation
Fixed pipeline where an LLM is one step. "New Zendesk ticket → summarise with GPT → post to Slack." No decisions by the LLM about what to do next. n8n / Zapier / Make live here.
Agent
The LLM decides what to do next — which tool to call, which step to take, whether the job is done. Can loop. Can handle novel situations. More powerful, more unpredictable, harder to debug.
Rule of thumb: If the steps are predictable, build an automation. If the steps depend on the specifics of each request, build an agent. Agents cost more and break in weirder ways; use them only when automation won't do.
Fine-tuning vs Prompting vs RAG — When To Use Which
Technique | What it does | When to use
Prompting | Just write better instructions | 90% of use cases. Start here, always.
RAG | Inject relevant documents into context | Model needs knowledge it doesn't have — your company docs, a product manual, live data.
Fine-tuning | Adjust the model's weights with training examples | You need a consistent tone, a narrow format, or you're squeezing cost by making a small model imitate a big one. Expensive and slow to iterate.

Beginners reach for fine-tuning because it sounds sophisticated. Professionals reach for better prompts and RAG because they work faster, cheaper, and cover 95% of real problems.

Part 1 · Foundations
The 10-Layer Model
Every AI product — from a hobby chatbot to a trillion-dollar platform — is a set of choices across these ten layers. This is the mental model you'll use for the rest of the document, and for every AI conversation you have from now on.
Read top-down when thinking about a product · bottom-up when thinking about infrastructure
L10 · Governance
L9 · Applications
L8 · Protocols
L7 · Orchestration
L6 · DS Platforms
L5 · Data
L4 · Inference Providers
L3 · Models
L2 · Hosting
L1 · Silicon
The Ten Layers at a Glance
Layer | Name | What lives here | Example tools
L10 | Governance | Observability, evaluation, guardrails, compliance | Langfuse, Promptfoo, Llama Guard, Presidio
L9 | Applications | End-user products built on everything below | ChatGPT, Cursor, Perplexity, Copilot
L8 | Protocols | Standards for components to talk to each other | MCP, function calling, OpenAPI, A2A
L7 | Orchestration | Compose models + tools + data into workflows/agents | LangChain, LangGraph, CrewAI, n8n, Temporal
L6 | DS & ML platforms | Where data scientists prep data, train models, deploy | Dataiku, Databricks, SageMaker, Vertex AI
L5 | Data | Where knowledge lives — warehouses, databases, vectors | Snowflake, Teradata, Postgres, Qdrant, Redis
L4 | Inference providers | APIs (or local runtimes) that run models for you | Anthropic, OpenAI, Groq, Bedrock, OpenRouter · Ollama (local)
L3 | Models | The neural networks themselves | Claude, GPT, Llama, Gemini, Qwen
L2 | Hosting | Where your orchestration code lives | AWS, Vercel, a VPS, RunPod
L1 | Silicon | The physical chips | NVIDIA H100, Groq LPU, Google TPU
Why Ten? Why Not Fewer?

Because each layer has a distinct buying decision with different vendors and different competitive dynamics.

You could collapse "inference providers" into "models" — but then you can't explain why the same Llama 3.3 runs on Groq (fast) and Together (cheap) and Bedrock (compliant). You'd hide the decision that actually matters.

You could merge "protocols" and "orchestration" — but then you miss that MCP is a standards layer, chosen separately from whichever framework consumes it.

Ten layers is the minimum number that keeps the decisions visible.

Common Patterns You'll Recognise

The minimal agent

L3 (model) + L4 (provider) + L7 (orchestration) = a working agent. Three layers, ~200 lines of code. A weekend build.
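
What that weekend build looks like, sketched with the OpenAI Python SDK (an assumption — any provider with tool calling works); the get_weather tool is a hypothetical stand-in for whatever your agent actually does.

```python
# Minimal agent: one model (L3), one provider (L4), one orchestration loop (L7).
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    return f"Sunny, 31°C in {city}"           # placeholder tool implementation

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    },
}]

messages = [{"role": "user", "content": "Do I need an umbrella in Riyadh today?"}]
while True:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:                    # the model decided it is done
        print(msg.content)
        break
    messages.append(msg)                      # keep the tool request in the history
    for call in msg.tool_calls:               # the model decided to use a tool
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```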

The enterprise pattern

All ten layers. L5 (Teradata + Qdrant), L6 (Dataiku), L10 (Langfuse + Llama Guard), the rest. Months of integration.

The vendor "platform"

A product claiming to cover 6+ layers for you. Convenient at first, lock-in at scale. Readable once you know the layers.

The "AI as a feature"

L3, L4, L9 — existing product adds a "summarise" button. Notion AI, Zendesk AI. Usually OpenAI or Anthropic under the hood.

Pattern-matching drill: Next vendor pitch you hear, open Part 2 of this doc in a second tab. Map each of their capabilities to a layer. What layers do they actually cover? What's the marketing hiding? You'll read the pitch in half the time.
Part 1 · Foundations
Live Ecosystem Map
A working map of the NMO stack — what runs where, where data flows, and exactly where it leaves your network. The dots are animated on purpose: that's where your data is moving right now.
The Three Trust Zones

Every component you'll touch lives in exactly one of three zones. The zone determines what you can put through it without a compliance review.

  • Local / on-prem — your VPS, your data centre, your firewall. Data never leaves. Default for anything regulated (PII, financial, health, gov-ID).
  • Private cloud you control — single-tenant deployments inside your GCP project, behind your VPC. Data leaves the building but you control the keys, the logs, and the contract.
  • Public API — Anthropic, OpenAI, Groq cloud, public SaaS endpoints. Fast, cheap, and powerful — but every prompt and response crosses the public internet to a third party. Treat with care.
The NMO Stack — Where Each Tool Lives
Local / On-Prem
Data never leaves the perimeter
  • Teradata — L5 · Enterprise warehouse · the system of record
  • Dataiku (on-prem) — L6 · Semantic layer · RBAC · feature pipelines
  • Llama / Qwen on Ollama — L3+L4 · Local inference for sensitive prompts
  • LangChain / LangGraph — L7 · Open-source library · runs in your code
  • Postgres + pgvector — L5 · App DB + RAG vector store
  • MCP servers (your tools) — L8 · Each integration as a tool the agent calls
Live data flows · animation = real direction of bytes
  • User query → LangChain agent
  • Agent → Dataiku → Teradata
  • Agent → Groq Cloud
  • Groq → Agent
  • Agent → User
Zones: Local zone · Private cloud · Public API · moving dot = bytes in flight
Private Cloud · GCP
Single-tenant inside your GCP project
  • GCP Vertex AI — L4 · Gemini · Claude · Llama · region-pinned in your GCP project
  • Dataiku Cloud — L6 · Single-tenant SaaS option
  • Cloud Run / GKE — L2 · Serverless containers + managed Kubernetes for your services
Public API
Data crosses the internet to a vendor
  • Groq Cloud — L4 · Lightning-fast LPU inference for open models
  • Anthropic API — L4 · Claude direct · best reasoning
  • Google Gemini API — L4 · Gemini 2.5 Pro · 1–2M token context · cheap at scale
  • SaaS connectors — L9 · Salesforce / Slack / Gmail · contract-bound
The Two Risk Hotspots You Must Watch

Hotspot 1 — Prompt to public LLM API. Every prompt sent to Groq Cloud, Anthropic, OpenAI contains whatever you put in it. If you put customer PII, internal financials, or trade secrets into the prompt, you have just shipped them to a third party. Mitigation: pre-prompt redaction (Presidio), data-class allow-lists per route, contract review for the provider's data-retention terms, or fall back to a local model (Llama via Ollama).

Hotspot 2 — Tool calls to external SaaS. When the agent decides to "look up the customer in Salesforce" or "post to Slack," that call leaves the network too. Mitigation: every tool call goes through an MCP server that logs the request, redacts sensitive fields, and enforces an allow-list of which tenants/customers can be looked up.

How To Read The Diagram In A Client Meeting
  • Point at the green column first. "All of this stays inside your firewall — Teradata, Dataiku, the agent code, your vector store."
  • Then the amber column. "These run inside your GCP project, single-tenant, region-pinned. Your data leaves your building but stays under your contract and never reaches a multi-tenant pool."
  • Then the red column. "These are the only places where your data crosses the public internet to a third party. We use them deliberately, with redaction in front, only for prompts that don't contain regulated content."
  • The animated dots. "Each dot is a request travelling between components. Notice that most activity is inside the green column. The model API only sees the cleaned, redacted prompt — never the raw customer record."
When the green column isn't enough: Some clients refuse any data crossing their firewall, even redacted prompts. For them: the entire stack runs locally — Llama 3.3 70B on a GPU box, Dataiku on-prem, LangChain in your VPC, Teradata as-is. You give up some quality (frontier models are still ahead) and pay more in hardware, but the trust map collapses to a single column. Always offer this option for regulated industries.
When You Need Groq · The Decision Rule

Groq sits in the public-API zone (red), so you only reach for it when its specific advantage — sub-300ms first-token latency on open models — is the thing that makes or breaks the product. Here is the rule, every time:

Use Groq when
  • Real-time voice — under-300ms first-token is the difference between "natural" and "awkward" (see Voice Agent).
  • Live agent assist / call-centre co-pilot — a suggestion every 4 seconds needs a sub-700ms round-trip (see Call Center · VIP).
  • Inbound chat at high volume where reply speed is part of the customer experience (see Customer Care AI).
  • High-throughput batch on open weights — classification / extraction / triage at hundreds of tokens/sec for cents per million.
  • Cost-sensitive workloads on Llama / Qwen / DeepSeek — you want open-weight pricing and world-class speed.
Don't use Groq when
  • You need Claude or Gemini quality — Groq only hosts open-weight models. Frontier reasoning lives on Vertex AI / Anthropic.
  • The prompt contains regulated PII you can't redact — it's a public API; data leaves your network. Use Vertex AI in your GCP project instead.
  • Latency doesn't matter — for offline / batch / "report me by tomorrow" workloads, Vertex AI on Llama is cheaper and stays in your cloud.
  • You need long context (1M+ tokens) — that's Gemini 2.5 Pro on Vertex AI, not Groq.
  • You're inside a strict on-prem mandate — fall back to local Llama on Ollama / vLLM. Slower but never leaves the building.
Always pair Groq with: Presidio in front (strip PII before the prompt leaves your VPC) and a fallback to local Llama for top-tier tenants who refuse public APIs. Without those two, Groq's speed becomes a compliance liability.
Part 2 · The Layers
Layer 1 · Silicon — where the math actually runs
The physical chips that execute matrix multiplications. Usually invisible to you — it's whoever hosts your inference who picks. But the choice shapes latency, cost, and what's even possible.
The Chip Families
Chip | Maker | Position in 2026
H100, H200, B100, B200 | NVIDIA | The default. ~90% of production inference. CUDA ecosystem is the moat.
A100 | NVIDIA | Previous generation. Still everywhere. Cheaper to rent.
TPU v5e · v5p · Trillium | Google | Google-only. Powers Gemini. Rentable via GCP.
MI300X · MI325X | AMD | Credible NVIDIA challenger. Cheaper per FLOP. Software (ROCm) still maturing.
LPU | Groq | Language-specific chip. Not a GPU. Deterministic, extremely low latency, 5–10× faster tokens/sec on open-weight models. Groq (the company) sells API access; you don't buy LPUs.
WSE-3 | Cerebras | Wafer-scale. One chip is physically the size of a cluster of GPUs. Fastest inference on large models. Niche, expensive.
Trainium · Inferentia | AWS | AWS-exclusive silicon. Cheap. Used inside Bedrock.
Neural Engine (M-series, A-series) | Apple | On-device only. Behind every "Apple Intelligence" feature.
Snapdragon NPU | Qualcomm | Android on-device inference.
Why Groq Is Structurally Different

A GPU is general-purpose — it does graphics, crypto, scientific computing, and AI. That flexibility costs you speed. Groq built a chip that only does one thing (the math that runs LLMs) and shaved off every millisecond.

Practical consequence: Llama 3.3 70B on an H100 produces maybe 60 tokens/second. Same model on Groq: 500+ tokens/second. That's the difference between an agent that feels snappy and an agent that feels sluggish.

Trade-off: Groq only serves a curated menu of open-weight models. You cannot run your custom fine-tune. You cannot run Claude or GPT (those are closed — they run on their makers' infrastructure). You're choosing speed within a constrained model set.

When Silicon Choice Actually Matters To You

Matters

  • Voice agents (<300ms perceived round-trip)
  • Live code completion
  • High-volume batch processing (cost per million tokens)
  • Air-gapped / on-prem deployments (you pick the hardware)

Doesn't matter

  • Prototypes and MVPs
  • Internal tools with <1,000 users
  • Anything where a 2-second response is fine
  • Anything running on Claude or GPT (you can't choose anyway)
The hidden chip war: Right now every cloud provider is scrambling to reduce their NVIDIA dependency — building their own chips (AWS Trainium, Google TPU, Microsoft Maia), or funding alternatives (Groq, Cerebras). Over the next 3 years, expect inference prices to drop as alternatives mature. Don't lock in long-term contracts based on today's pricing.
Part 2 · The Layers
Layer 2 · Hosting & Infrastructure
Where your orchestration code, databases, and any self-hosted inference actually live. Separate from Layer 4 (inference providers) — you might host your app on Vercel while calling OpenAI's inference API from elsewhere.
Three Buckets
Hyperscalers — the full menu

AWS, GCP, Azure, Oracle Cloud, Alibaba Cloud. Everything is available; complexity is high. Pick when you need compliance stories (HIPAA, PDPL, SOC 2), when you already run 80% of your infrastructure there, or when the client dictates it.

AI-relevant services: AWS Bedrock (multi-model inference gateway), Azure OpenAI (Microsoft's GPT resell), GCP Vertex AI (Google's ML platform), AWS SageMaker, Azure ML.

GPU-specialised clouds — rent a GPU in 60 seconds

RunPod · Lambda Labs · CoreWeave · Modal · Replicate · Beam · Paperspace · Fluidstack.

Use case: you need GPUs now (fine-tuning, self-hosting a specific model, experimenting) without a hyperscaler commitment. Sign up, rent an H100 by the hour, shut it down. This is where most open-source AI development happens.

Self-hosted and edge

Your own VPS (Hostinger, Linode, DigitalOcean, Hetzner), bare-metal servers, on-prem, Cloudflare Workers AI (edge), Vercel.

Use case: small-scale apps, data-sovereignty requirements, cost control, internal tools. A $20/month VPS can host a surprising amount of AI application code; you call out to inference providers for the heavy compute, or run a small open-weight model locally with Ollama on the same box.

The Common Shape of an AI Application
Typical production topology
Users → Vercel / VPS (app) → Postgres + Redis (state) → Anthropic / OpenAI API → Response

The application layer is small and cheap. The inference cost is the variable. That's why hosting your app on a $20 VPS is fine for a long time — the money goes to layer 4, not layer 2.

Part 2 · The Layers
Layer 3 · Models — the brains
The actual neural networks. Model ≠ provider: Llama runs on a dozen providers; Claude runs only via Anthropic, AWS Bedrock, and GCP Vertex. The model you pick determines capability; the provider you pick determines cost, latency, and compliance.
Frontier Closed Models
Family | Maker | Strength
Claude (Opus, Sonnet, Haiku) | Anthropic | Coding, long-context reasoning (200k+), careful tool use. Preferred by Apex and by most serious agent builders.
GPT-5 · GPT-4o · o-series (o1/o3/o4) | OpenAI | General-purpose, multimodal (vision + voice), math and science via o-series reasoning models. GPT-5 is the current flagship.
Gemini 2.5 · 3 | Google | Up to 2M-token context (biggest), native multimodal, very cheap at scale.
Grok 3 · 4 | xAI | Trained on X data, fewer guardrails, fast-moving.
Open-Weight Models
Family | Maker | Strength
Llama 3.1 · 3.3 · 4 | Meta | The open-weight workhorse. Runs everywhere, fine-tunable, strong community.
Mistral · Mixtral · Codestral | Mistral AI (France) | EU privacy story, MoE (mixture-of-experts) efficiency, small-model quality.
Qwen 2.5 · 3 | Alibaba | Best open-weight coder in 2026, excellent multilingual (great for Arabic), many sizes.
DeepSeek V3 · R1 | DeepSeek | Cheap frontier reasoning. R1 was trained for roughly 1/20th of GPT-4's public cost estimates and matches o1-level reasoning on many benchmarks.
Gemma 2 · 3 | Google | Small-model sibling of Gemini. On-device friendly.
Phi-3 · Phi-4 | Microsoft | Small model, punches above its weight, good on-device.
Specialised Models
Family | Maker | Strength
Whisper · Whisper Large v3 | OpenAI | Speech-to-text. Best-in-class transcription. Free to self-host.
Flux · Flux Pro | Black Forest Labs | Image generation, open-weight, high quality. Replaces Stable Diffusion for many.
Stable Diffusion 3.5 | Stability AI | Open image generation.
Sora · Runway Gen-3 · Kling | OpenAI · Runway · Kuaishou | Video generation. Early but usable.
text-embedding-3 · voyage-3 | OpenAI · Voyage AI | Embeddings — turn text into vectors for retrieval. (You'll use these daily in RAG.)
Cohere Embed · BGE-M3 | Cohere · BAAI | Alternative embedding models. BGE-M3 is open-weight and strong on multilingual.
Choosing A Model — Practical Guide
Your situation | First pick
"I need the best coding model" | Claude Sonnet / Opus · Qwen 2.5 Coder for open
"I need the cheapest frontier-quality reasoning" | DeepSeek V3/R1 or Gemini 2.5 Flash
"I need 1M+ tokens of context" | Gemini 2.5 Pro
"I need to run it on my own hardware" | Llama 3.3 70B (general) or Qwen 2.5 Coder (coding) — fastest path is Ollama
"I need it to be fast enough for voice" | Llama 3.3 70B on Groq
"I need Arabic / multilingual strength" | Qwen 2.5 · Gemini · Claude
"I need strong vision (describe image, read PDFs)" | Claude Sonnet · GPT-4o · Gemini 2.5
"I need cheap summarisation at scale" | Claude Haiku · Gemini Flash · GPT-4o-mini
The model-cascade pattern: Don't use one model for everything. Use Haiku/Flash/mini for cheap high-volume tasks (classification, extraction, short summaries) and Opus/GPT/Gemini for the hard reasoning step. A well-designed agent might call Haiku 50 times for every Opus call. This is the easiest 10× cost reduction you'll find.
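
The cascade in code is mostly routing. A minimal sketch: call_llm is a hypothetical stand-in for your real client or gateway, and the model names are illustrative.

```python
# Route by task value: cheap model for volume, frontier model for the hard step.
CHEAP, FRONTIER = "claude-haiku", "claude-opus"   # illustrative model names

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical helper -- wire this to your provider SDK or gateway."""
    raise NotImplementedError

def classify_ticket(ticket: str) -> str:
    # High-volume, low-stakes classification -> cheap model
    return call_llm(CHEAP, f"Classify this ticket as billing/technical/other:\n{ticket}")

def draft_root_cause_analysis(incident_report: str) -> str:
    # One careful reasoning step per incident -> frontier model
    return call_llm(FRONTIER, f"Analyse the likely root cause and next actions:\n{incident_report}")
```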
Part 2 · The Layers
Layer 4 · Inference Providers
The endpoint your code actually calls. Determines price, latency, which models are available, and SLA. For closed models, you have one choice per model. For open models, you have a buffet.
First-Party Providers

The model maker serves their own model — only place you can get it (plus some hyperscaler resells).

  • Anthropic API — Claude. Best place for Claude. Also available on AWS Bedrock and GCP Vertex for compliance reasons.
  • OpenAI API — GPT and o-series. Also sold as Azure OpenAI for enterprise.
  • Google Gemini API — via Google AI Studio (dev) or Vertex AI (enterprise).
  • xAI API — Grok.
Multi-Provider Gateways

One endpoint, many models. Useful when you want to A/B test or not lock in.

OpenRouter
100+ models behind a single OpenAI-compatible API. Most popular with indie builders. Pay-as-you-go, no contracts.
AWS Bedrock
Claude, Llama, Titan, Cohere, Mistral — all served from AWS. Compliance + enterprise integration story.
Azure OpenAI
GPT and o-series with Microsoft's compliance wrapper. Required for many enterprise customers.
GCP Vertex AI
Gemini + Claude + Llama via Google's cloud.
Together AI · Fireworks · Replicate
Open-weight models hosted as APIs. Cheaper than first-party, pick-your-model.
Fast-Inference Specialists

Providers competing on speed for open-weight models.

Groq
LPU-based. 500+ tokens/second on Llama 3.3 70B. The go-to for voice agents and live UX.
Cerebras
Wafer-scale. Similar speeds to Groq. Bigger context windows on some models.
SambaNova
Third contender in the speed race.
Self-Hosted Inference Runtimes

The software you run yourself when data can't leave your network.

Ollama
Simplest. Great for local dev and small-scale. ollama run llama3.3 and you have an API.
vLLM
Production-grade serving. What you'd use for real throughput.
TGI
Hugging Face's inference server. Solid alternative to vLLM.
llama.cpp
CPU-optimised. Runs a 7B model on a Raspberry Pi. On-device / edge play.
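
Most of these runtimes expose an OpenAI-compatible HTTP API, so the calling code looks the same as a cloud call. A minimal sketch against a local Ollama server, assuming `ollama serve` is running and the model has been pulled; nothing in this request leaves the machine.

```python
# pip install openai; the api_key is required by the client but ignored locally.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = local.chat.completions.create(
    model="llama3.3",   # whatever `ollama pull` fetched
    messages=[{"role": "user", "content": "Summarise our refund policy in two sentences."}],
)
print(resp.choices[0].message.content)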
Observability Gateways

Proxies that sit between your app and the real inference provider — adding caching, logging, rate-limiting, A/B testing.

LiteLLM
Unify 100+ providers behind one OpenAI-compatible interface. Open-source. Essential if you want to swap providers later.
Portkey · Helicone
Commercial variants with richer dashboards and caching.
Practical stack: Use LiteLLM as your gateway. Point it at Anthropic for Claude, OpenAI for GPT, Groq for fast open-weight, and Together for cheap open-weight. Your code calls LiteLLM; LiteLLM decides. You can swap any provider without touching application code.
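
A sketch of that stack using LiteLLM's SDK flavour (it can also run as a standalone proxy). The model IDs and route names are illustrative assumptions — use whatever your providers actually expose.

```python
# pip install litellm -- one completion() call, provider chosen by model prefix.
from litellm import completion

ROUTES = {
    "reasoning": "anthropic/claude-opus-4-0",          # illustrative IDs
    "fast":      "groq/llama-3.3-70b-versatile",
    "cheap":     "together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
}

def ask(task: str, prompt: str) -> str:
    resp = completion(model=ROUTES[task], messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

print(ask("fast", "Classify this ticket: 'My card was charged twice.'"))
```

Swapping a provider means editing ROUTES, not application code — which is the whole point of the gateway layer.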
Part 2 · The Layers
Layer 5 · Data — where knowledge lives
Every interesting agent retrieves from something. The store shapes the decision: warehouse for history, vector DB for semantic search, Postgres for app data, Redis for session state.
The Data Landscape
Category | Players | Use case
Data warehouses (OLAP) | Snowflake · Databricks · BigQuery · Teradata · Redshift · ClickHouse | Structured analytical queries over years of history. "What was our LATAM revenue by quarter for 2020–2025?"
Data lakes | S3 + Iceberg · Delta Lake · MinIO · Azure Data Lake | Cheap raw-file storage, often the substrate under a warehouse.
Operational DBs (OLTP) | PostgreSQL · MySQL · MongoDB · DynamoDB · SQL Server | Your app's live data — users, orders, tickets. Reads and writes continuously.
Vector DBs | Qdrant · Pinecone · Weaviate · Milvus · Chroma · pgvector · LanceDB | Store embeddings for semantic search. Foundation of RAG and agent memory.
Graph DBs | Neo4j · ArangoDB · Memgraph · TigerGraph | When relationships are the point — fraud rings, supply chains, org charts.
Cache / in-memory | Redis · KeyDB · Memcached · DragonflyDB | Sub-millisecond lookups, session state, pub/sub messaging.
Search engines | Elasticsearch · OpenSearch · Meilisearch · Typesense | Keyword + filter search. Often combined with vector search for hybrid retrieval.
Teradata vs Snowflake vs Databricks — Enterprise Warehouse Picks

Teradata: the 40-year incumbent in big banks, telcos, airlines, healthcare payers. If a client has 20 years of structured history, it's probably in Teradata. Strengths: mature query optimiser, governance, predictable performance. Weaknesses: expensive, older tooling story. You don't migrate Teradata — you work with it.

Snowflake: cloud-native warehouse, separated compute + storage. Dominant with modern enterprises. Easier to use than Teradata, strong ecosystem.

Databricks: lakehouse model — warehouse + lake + ML platform in one. Preferred by data-engineering-heavy shops. Has its own MLflow, its own LLMs (DBRX), its own serving.

BigQuery: the GCP-native warehouse. Extremely cheap serverless scan. Default for any GCP-committed organisation.

ClickHouse: open-source columnar DB, blazingly fast for analytical queries on event data. Product analytics shops love it.

Qdrant vs Pinecone vs pgvector — Vector DB Decision
pgvector
Postgres extension. Zero new infrastructure. Same ACID guarantees as the rest of your app data. Pick when: <10M vectors, you already run Postgres, you want one database not two.
Qdrant
Open-source, Rust-based, fastest pure vector DB. Payload filtering, horizontal scaling. Self-host or use their cloud. Pick when: millions-to-billions of vectors, low-latency retrieval is critical, you want open-source.
Pinecone
Managed service, pioneered the category. Fully managed, good dashboards. Pick when: you want zero operational overhead and have budget.
Weaviate
Open-source with built-in ML modules (embedding generation, classification). Python-friendly.
Milvus
Heavy-duty open-source, designed for billion-scale. Overkill for most.
Chroma
Dev-friendly, embedded. Prototypes and small apps.
Hybrid Retrieval — The Pattern You'll See Constantly

Pure vector search misses exact matches (product IDs, names, specific phrases). Pure keyword search misses semantic meaning ("refund policy" vs "return guidelines"). Hybrid retrieval runs both and fuses the results.

Typical stack: Elasticsearch (or OpenSearch) for keyword + BM25 ranking + Qdrant (or pgvector) for semantic. A re-ranker model (Cohere Rerank, BGE reranker) picks the final top-K. Quality jumps significantly over either alone.
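
The fusion step itself is simple. A sketch of reciprocal-rank fusion, one common way to merge the two rankings before the reranker; it assumes you already have the keyword-ranked and vector-ranked lists of document IDs.

```python
# Reciprocal-rank fusion: documents ranked high by either system float to the top.
def rrf(keyword_ranked: list[str], vector_ranked: list[str], k: int = 60, top_n: int = 5) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# BM25 list from Elasticsearch, semantic list from Qdrant -- fuse, then optionally re-rank:
print(rrf(["doc_12", "doc_7", "doc_3"], ["doc_7", "doc_9", "doc_12"]))
```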

Part 2 · The Layers
Layer 6 · Data Science & ML Platforms
A layer above raw data, below agents. This is where analysts and data scientists live — cleaning data, building pipelines, training traditional ML models, deploying. Most agent builders never touch these. Most enterprises run on them.
The Platforms
Platform | Position | Who uses it
Dataiku | Visual + code DS platform. ETL, feature engineering, model training, deployment — in one canvas. Strong RBAC, lineage, governance. | Enterprises where analysts and data scientists share workflows. Often sits on top of Snowflake or Teradata.
Databricks | Lakehouse + ML + Spark + Delta. Code-heavy. Has its own LLM features (DBRX model, Mosaic AI). | Data engineers. ML teams at scale. Shops that live in notebooks.
Palantir Foundry | Data integration + workflow + ontology. Operational AI, not exploratory. Very opinionated. | Large enterprises with messy data across 50 source systems. Defence, healthcare, oil & gas.
AWS SageMaker | Hyperscaler DS platform. Tight AWS integration. Everything from Jupyter to model serving. | AWS-committed shops. ML engineers.
GCP Vertex AI | Google's answer. Strong AutoML, native Gemini integration. | GCP-committed shops.
Azure ML | Microsoft's answer. Tight integration with Azure services and Office. | Microsoft-shop enterprises.
H2O.ai · DataRobot | AutoML-first. "Point at a table, get a model." Less useful for LLMs, still strong for traditional ML. | Teams without deep ML expertise. Financial services modelling.
MLflow · W&B · ClearML · Comet | Experiment tracking + model registry. Not a full platform — a component. | ML teams using their own compute but wanting governance.
Dataiku In Depth — Why Clients Care

Dataiku is often called "Tableau for machine learning." It's a visual canvas where you drag boxes: read from Teradata → filter → join with a CSV → train a model → deploy as an API. Each box can be visual (for analysts) or Python/R (for data scientists). They share the same project.

What's valuable:

  • Lineage: every column in every output can be traced back to its source.
  • RBAC: who ran what, who approved deployment, who has access to which data.
  • Mixed skill-levels: business analysts and senior DS work on the same flow.
  • Model Ops: deployed models get monitored for drift, performance, retraining triggers.

Where it fits in the agent era: Dataiku's sweet spot is traditional ML (classification, regression, forecasting). For LLM-heavy agents, it's peripheral — you might publish a "scored customers" table from Dataiku that an agent then queries, but the agent itself is built elsewhere. Dataiku is adding LLM features, but the core strength remains traditional analytics.

When A Client Says "We Use Dataiku"

What they mean: they have a DS team, they've invested in governance and lineage, they likely have 50+ projects running in production. They are enterprise, not a startup.

Implications for your pitch:

  • Don't propose to replace Dataiku — you'll lose.
  • Do propose to complement it with agentic workflows that consume Dataiku outputs.
  • Leverage their existing lineage + RBAC — the compliance story is already built.
  • MCP servers pointing to Dataiku datasets are the clean integration point.
Part 2 · The Layers
Layer 7 · Orchestration & Agent Frameworks
The layer with the most churn — new frameworks appear monthly. Turns raw LLM calls + data + tools into useful workflows. Where you'll spend most of your coding time.
Code-First Agent Frameworks
Framework | Philosophy | When to use
LangChain | The original. Huge surface area, many integrations. Often criticised as "too magic." Good for getting started, painful at scale. | Prototypes. Pattern demonstrations.
LangGraph | LangChain's state-machine framework. Explicit graphs of agent decisions. Much more debuggable than raw LangChain. | Multi-step reasoning with branches. Complex agent logic.
LlamaIndex | RAG-first. Rich tooling for document loaders, chunking, retrieval pipelines. | Data-heavy agents, "chat with your docs."
AutoGen (Microsoft) | Multi-agent conversations. Agents talk to each other to solve tasks. | Research, experimentation. Production less common.
CrewAI | Role-based multi-agent ("researcher", "writer", "editor"). Higher-level than AutoGen. | Content pipelines, structured multi-agent work.
Pydantic AI | Typed, minimal, Python-idiomatic. Strong structured-output support. | Production systems where schema matters. Rising fast in 2026.
Claude Agent SDK | Anthropic-native (formerly "Anthropic Agent SDK"). Closest to the metal. No framework overhead. | Claude-specific production agents where you want control.
OpenAI Swarm · OpenAI Agents SDK | OpenAI's own lightweight framework. | OpenAI-centric agents.
Semantic Kernel (Microsoft) | Enterprise-friendly, .NET + Python + Java. Plugin architecture. | .NET shops, enterprise Microsoft integrations.
Coding Agents (IDE / Terminal-Side)

These aren't frameworks — they're end-user products that use all 10 layers internally. You use them; you rarely build with them.

Claude Code
CLI. Terminal-native. Extensive tool use. Apex wraps this for its developer agents.
Cursor
VS Code fork with deep LLM integration. Most popular paid coding tool.
Windsurf
Cursor competitor from Codeium. Similar model, different UX.
GitHub Copilot
The original. Tightly integrated with GitHub, Codespaces.
Aider
Open-source, terminal-based, git-aware. Beloved by minimalists.
Continue.dev
Open-source VS Code / JetBrains extension. Bring your own model.
Replit Agent
Browser-based, cloud-dev-env integrated. Great for prototyping.
Visual / Low-Code Workflow Automation

Drag-and-drop boxes: trigger → action → action. LLM is one box among hundreds.

n8n
Self-hostable (fair-code licence), 400+ integrations, strong community. Sweet spot: cross-system automation where an LLM is one step. Not ideal for reasoning-heavy agents.
Make (ex-Integromat)
Visual, SaaS-only, strong integrations. Similar power to n8n, cloud-hosted only.
Zapier
Deepest SaaS catalog, weakest for custom logic. Non-technical users' default.
Pipedream
Code-friendly automation. Hybrid visual + JavaScript.
Node-RED
IoT origins, visual flow. Still used in operations, edge computing.
Heavy-Duty Workflow / Data Orchestration

For pipelines measured in hours/days with retries, schedules, complex dependencies.

Airflow
The Python data-engineering standard. Schedule ETL jobs, retries, DAGs.
Prefect · Dagster
Modern Airflow alternatives, Pythonic APIs.
Temporal
Durable workflow engine. Agent workflows that survive restarts, timeouts, retries for hours or days. Increasingly the go-to for agent systems that need to be reliable.
Argo Workflows
Kubernetes-native workflow engine.
n8n vs LangGraph — When To Pick Which
Situation | Pick
Trigger from Gmail, enrich from HubSpot, post to Slack, one LLM summary in the middle | n8n
Agent that calls 5 tools, decides which based on user input, loops if result is unclear | LangGraph / Pydantic AI
Client wants "visual AI pipelines they can edit" | n8n
You need custom data models, state transitions, complex reasoning | LangGraph or custom code
Team is non-technical | n8n / Make
Team is senior engineering | Custom code with Claude Agent SDK / LangGraph
The framework trap: Every few months a new framework claims to be the one that solves agents. Most projects that succeed are built on minimal code + the model provider's SDK + maybe LangGraph for complex flows. Resist the urge to adopt the newest shiny thing — adopt the one your team can debug at 2am.
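
Where a graph is warranted, it stays small. A minimal LangGraph sketch, assuming `pip install langgraph`; the two node functions are placeholders for real retrieval and generation steps.

```python
# Explicit, debuggable state machine: retrieve -> generate -> done.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    docs: list[str]
    answer: str

def retrieve(state: State) -> dict:
    return {"docs": ["<chunks fetched from your vector store>"]}   # placeholder

def generate(state: State) -> dict:
    return {"answer": f"Answer to '{state['question']}' grounded in {len(state['docs'])} chunks"}

graph = StateGraph(State)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)

app = graph.compile()
print(app.invoke({"question": "Refund policy?"})["answer"])
```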
Part 2 · The Layers
Layer 8 · Protocols — how the pieces talk
Standards for components to communicate. A few years ago there were none and every integration was bespoke. Now there are emerging standards — and they matter because they let you swap tools without rewrites.
The Main Protocols
MCP — Model Context Protocol

Who: Anthropic, now adopted by many. What: open standard for giving LLMs structured access to tools, data, and services.

An MCP server exposes tools ("query_database", "read_file", "send_email"). Any MCP-aware client (Claude Desktop, Cursor, Claude Code, your custom agent) can discover and use them. Think: "USB for agent tools" — plug and play across vendors.

Why it matters: before MCP, wiring a tool to an agent meant writing glue code for every agent framework. After MCP, you write one server, every client works. This is becoming the industry default. Expect every major platform to ship MCP support in 2026.

Function Calling (aka Tool Calling)

Who: OpenAI introduced it in 2023; every frontier model now supports it. What: the model returns a structured JSON object saying "call function X with these arguments" instead of free-text. Your code executes it, returns the result, the model continues.

This is the raw mechanism. MCP is the standard way to package and share functions for reuse.
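
Concretely, the structured object looks like this (an illustrative payload in the OpenAI-style wire format; the ID and arguments are made up). Your code parses the arguments, runs the function, and sends the result back as a message.

```python
# What the model returns instead of free text when it wants a tool run:
tool_call = {
    "id": "call_abc123",                            # made-up ID for illustration
    "type": "function",
    "function": {
        "name": "query_database",
        "arguments": '{"customer_id": "C-1042"}',   # a JSON string, not prose
    },
}
```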

A2A — Agent-to-Agent

Who: Google's proposal (2025), others experimenting. What: lets one agent discover and call another. Still early — MCP covers most use cases, A2A is for agent-fleet scenarios.

OpenAPI · REST · GraphQL

The fallback. When no native AI protocol exists, a well-documented REST API with an OpenAPI spec is still the common ground. Most MCP servers are wrappers around existing REST APIs.

What MCP Looks Like In Practice
An MCP server exposes tools:
  • query_customer_db(customer_id)
  • list_recent_orders(days=7)
  • escalate_ticket(ticket_id, reason)

Your agent framework (LangGraph, Claude Code, a custom thing) connects to the MCP server and automatically discovers these tools. The agent decides when to call each. The server runs them. You didn't write tool-integration code in the agent. You wrote an MCP server once. Every agent can use it.
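
A sketch of such a server, assuming the official MCP Python SDK's FastMCP helper (`pip install mcp`); the tool bodies are placeholders. Each decorated function becomes a tool any MCP-aware client can discover and call.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("support-tools")

@mcp.tool()
def query_customer_db(customer_id: str) -> dict:
    """Look up a customer record."""
    return {"customer_id": customer_id, "tier": "gold"}      # placeholder lookup

@mcp.tool()
def escalate_ticket(ticket_id: str, reason: str) -> str:
    """Escalate a ticket to a human agent."""
    return f"Ticket {ticket_id} escalated: {reason}"          # placeholder action

if __name__ == "__main__":
    mcp.run()    # serves the tools (stdio transport by default)
```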
Why this is important for NMO: When pitching clients, propose that integrations be built as MCP servers. This future-proofs the work — the MCP server you build for them is reusable by any AI product they adopt later (Cursor, Copilot, ChatGPT Enterprise, or our own agents). A REST integration is locked to the product that consumes it; an MCP integration is portable.
Part 2 · The Layers
Layer 9 · Applications — end-user products
What end users actually touch. Built on all the layers below. Ranges from consumer chatbots to enterprise platforms.
By Category
Chat / General Assistants

ChatGPT · Claude.ai · Gemini · Copilot · Perplexity · Pi · You.com

Coding

Cursor · Claude Code · Windsurf · Copilot · Replit Agent · Cody · Codeium

Writing / Marketing

Jasper · Copy.ai · Notion AI · Mem · Lex · Writer

Customer Support

Intercom Fin · Zendesk AI · Decagon · Sierra · Ada

Sales / CRM

Gong · Chorus · Clay · Apollo AI · HubSpot Breeze

Meetings / Knowledge

Otter · Fireflies · Granola · Tactiq · Glean

Search

Perplexity · Phind · Exa · Kagi Assistant · You.com

Image / Video

Midjourney · Runway · Pika · Kling · Sora · Ideogram · Flux Pro

Voice

ElevenLabs · Cartesia · Deepgram · Vapi · Bland · PlayHT

Data Analysis

Hex · Julius · Rowy · Metabase AI

Legal / Compliance

Harvey · Hebbia · Spellbook · Robin AI

Healthcare

Abridge · Nuance DAX · Suki · OpenEvidence

Pattern to recognise: When you use any of these, you're using a packaged vertical slice of the 10 layers. The company that built Cursor is making your layer-3/4/7/8 decisions for you. When evaluating "should we buy this?" vs "should we build?" — ask which layers we would own, and which we would be locked into.
Part 2 · The Layers
Layer 10 · Governance, Observability & Safety
Cross-cutting. The moment you go past demo, you need some of these or you'll be debugging blind and explaining to legal why the agent leaked PII.
The Four Sub-Layers
1 · Observability & Tracing

Answers: "What did the agent do yesterday, how much did it cost, and where did it fail?"

Langfuse
Open-source, self-hostable. Rich trace view, evaluation, prompt management. Best starting point.
LangSmith
LangChain's managed service. Excellent if you're already on LangChain.
Helicone
Proxy-based. Sits between your app and the API, captures everything. Adds caching.
Arize AI · WhyLabs
Enterprise ML observability. Stronger on drift detection, weaker on LLM-specific tracing.
Datadog AI · New Relic AI
If you already run these for your regular apps, the AI modules are easy to enable.
2 · Evaluation

Answers: "Did my new prompt make things better or worse?"

Promptfoo
Open-source, runs in CI. Define test cases as YAML, run across models. Beloved by indie builders.
Braintrust
Commercial. Dataset management, experiments, human-in-the-loop scoring.
Patronus AI · DeepEval
Full eval platforms with pre-built metrics (hallucination, faithfulness, etc).
Ragas
Specifically for RAG systems. Measures faithfulness, context relevance, answer relevance.
3 · Guardrails & Safety

Blocks jailbreaks, prompt injection, unsafe outputs before they reach the user.

Llama Guard 4 (Meta)
Open-weight safety classifier. Self-host in front of your LLM calls.
NVIDIA NeMo Guardrails
Open-source framework for defining guardrails in a declarative DSL.
Lakera Guard
Commercial. Strong on prompt-injection detection.
Prompt Security · PurpleLlama
Enterprise safety platforms.
4 · PII / Compliance

Redacts sensitive data before it reaches any LLM. Critical for GDPR, PDPL, HIPAA.

Microsoft Presidio
Open-source, highly configurable. Detects names, emails, IDs, phone numbers across a wide range of languages.
Private AI · Skyflow · Nightfall
Commercial alternatives with richer UI and enterprise features.
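
The redaction step itself is a few lines. A minimal sketch assuming the open-source presidio-analyzer and presidio-anonymizer packages; the example text is invented.

```python
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

prompt = "Customer Fatima Al-Harbi (fatima@example.com, +966 50 123 4567) wants a refund."

findings = analyzer.analyze(text=prompt, language="en")          # detect PII entities
clean = anonymizer.anonymize(text=prompt, analyzer_results=findings)

print(clean.text)   # names, emails, phone numbers replaced with placeholders
# Only clean.text is ever sent to a public LLM API.
```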
The compliance reality in KSA: PDPL (Saudi Arabia's Personal Data Protection Law) applies to any AI system processing personal data. For client work in KSA: always discuss whether PII leaves the Kingdom. If it does, you likely need explicit consent or an exemption. The compliant paths are Vertex AI in me-central2 (Dammam, KSA — actually inside the Kingdom) or on-prem Llama. Presidio in front of your LLM calls turns a non-compliant design into a compliant one.
Part 3 · How It Fits Together
Example · RAG Customer Support Chatbot
The most common "AI in production" pattern. Users ask questions; the bot answers from the company's product docs. Every enterprise Proof-of-Concept you'll see is a variation of this.
Architecture · Live Data Flow
RAG Customer Support Chatbot architecture: a user question goes through Presidio for PII redaction in your VPC, then to OpenAI for embeddings, Qdrant for semantic search, Claude Haiku for the reasoned answer, Llama Guard for output safety, and back to the user.
  • YOUR VPC · LOCAL — Zendesk widget (user question) → Presidio (PII redaction, runs in your VPC) → NestJS / API server (the orchestrator: embed → search → rerank → generate → safety; ~80 lines of LangChain or plain Node) → Qdrant (vector DB · semantic search + rerank, top-20 → top-5) → Llama Guard (output safety filter · local)
  • PUBLIC API — OpenAI Embeddings (text-embedding-3-small, very small payload) · Claude Haiku via the Anthropic API (fast · cheap · enough for RAG)
A classic RAG flow. Two paths cross your network boundary (red): the embedding call (3) and the LLM call (5). Both go through Presidio first, so they only ever see redacted text. Everything else — orchestration, vector search, safety filter — stays in your VPC.
  1. User question lands at the API
  2. Presidio strips PII
  3. OpenAI returns an embedding vector
  4. Qdrant returns the top-k chunks
  5. Claude Haiku writes the answer
  6. Llama Guard checks the output
  7. Final answer back to the user
Layer-by-Layer Stack
Layer | Pick | Why
L1 Silicon | NVIDIA (invisible) | Whoever hosts Claude picks — not our choice.
L2 Hosting | Vercel (Next.js) + your GCP project (Qdrant on GKE / Cloud Run) | Vercel for the edge UI, GCP for the stateful vector DB.
L3 Model | Claude Haiku · text-embedding-3-small | Haiku is cheap + fast. The small embedding model keeps the cost of embedding millions of docs low.
L4 Inference | Anthropic API · OpenAI API | Direct, simplest.
L5 Data | Qdrant (vectors) · Postgres (tickets) | Qdrant for scale; Postgres for the support ticket system.
L6 DS platform | None | No model training. No platform needed.
L7 Orchestration | LangChain retrieval chain OR 80 lines of Node | Either works. If it's a one-off, skip LangChain.
L8 Protocol | Direct API calls | Nothing to reuse — MCP overkill here.
L9 Application | Chat widget embedded in Zendesk | Meet users where they are.
L10 Governance | Presidio (PII) · Llama Guard (safety) · Langfuse (trace) | PDPL compliance + observability from day one.
What Makes This Production-Grade vs Demo-Grade

Demo

  • Just Claude + a bunch of docs stuffed into context
  • No PII handling
  • No eval harness
  • No observability — if it breaks, you have no idea why
  • Works for 50 docs, falls over at 5,000

Production

  • Vector DB with hybrid retrieval + reranking
  • Presidio strips PII before any external API call
  • Promptfoo runs in CI, catches prompt regressions
  • Langfuse traces every turn; Helicone caches common questions
  • Scales to 500k docs without a rearchitecture
The 80/20 of RAG: 80% of the quality comes from: (1) good chunking of the source docs, (2) a reranker after vector search, (3) a clear system prompt telling the model to only answer from the provided context and say "I don't know" otherwise. The rest is polish.
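
The chunking step is the least glamorous and the most consequential. A sketch of a simple fixed-size chunker with overlap, in plain Python; real pipelines often split on headings or sentences instead, and the file name is a placeholder, but the overlap idea is the same: no fact should be cut in half at a chunk boundary.

```python
def chunk(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous."""
    words = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

docs = chunk(open("product_manual.txt").read())   # hypothetical source file
print(len(docs), "chunks ready to embed")
```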
Part 3 · How It Fits Together
Example · Sales Outreach Pipeline
When a new lead hits Salesforce, enrich from Apollo, LLM-draft a personalised email, wait for human approval, send from Gmail, log back to Salesforce. The "n8n archetype" — where visual automation beats custom code.
Architecture · n8n as Central Orchestrator
Sales Outreach Pipeline architecture: a new lead in Salesforce triggers an n8n workflow that enriches via Apollo, drafts an email with GPT-4o-mini, asks an SDR for approval in Slack, sends via Gmail, and logs the activity back in Salesforce.
  • YOUR VPS · SELF-HOSTED — n8n, the visual workflow orchestrator: runs on your $10/mo VPS, no custom code, 7 drag-in nodes, non-developers can edit.
  • EXISTING SaaS · EVERYTHING IS PUBLIC API — Salesforce (new lead webhook · log later) → Apollo (firmographics enrichment) → GPT-4o-mini (draft personalised email) → Slack (SDR approval · human gate) → Gmail (send approved email) → Salesforce (log activity row).
n8n is the only piece you actually run; everything else is existing SaaS. The whole pipeline is just n8n calling 5 APIs in sequence with one human gate. This is what "no custom code" looks like.
  1. Salesforce fires a webhook to n8n
  2. n8n calls Apollo for firmographics
  3. n8n asks GPT-4o-mini for a draft email
  4. n8n posts the draft to Slack for SDR approval
  5. On approval, n8n sends from Gmail
  6. n8n writes the activity row back to Salesforce
Why n8n Is The Right Tool Here

Look at the steps: 6 out of 7 are just "call an API on an existing SaaS." n8n has all of them as drag-in nodes. The one AI step (draft email via GPT-4o-mini) is also a drag-in node.

If you wrote this in Python, you'd be writing:

  • Salesforce webhook listener (20 lines)
  • Apollo API client (30 lines)
  • Tier-list lookup (10 lines)
  • OpenAI client + prompt (20 lines)
  • Slack approval Bolt app (60 lines)
  • Gmail SMTP client (15 lines)
  • Salesforce update call (15 lines)
  • Error handling, retries, logging (100+ lines)

That's a week of work. In n8n: an afternoon. And non-technical team members can edit it without fear.

Layer Usage
Layer | Pick
L2 Hosting | Self-hosted n8n on a $10/month VPS, OR n8n Cloud
L3 Model | GPT-4o-mini (cheap, good enough for outreach emails)
L4 Inference | OpenAI (via n8n node)
L5 Data | Salesforce is the source of truth; n8n holds no state
L7 Orchestration | n8n — the whole story lives here
L9 Application | Salesforce + Slack + Gmail (existing tools)
L10 Governance | Human-in-loop approval is the guardrail
When This Pattern Becomes An Agent Instead

If the logic gets more adaptive — "if lead responded to a previous email, personalise based on that thread" or "if the company website uses React, mention React-specific case studies" — you're past n8n's sweet spot. Rebuild in LangGraph or custom code.

The line: n8n is a decision tree. An agent loops and decides. When you start building decision trees inside n8n that are 50 boxes deep, switch tools.

Part 3 · How It Fits Together
Example · Enterprise Analytics Agent
The CFO asks "what drove the Q3 YoY revenue drop in LATAM?" The agent writes SQL, runs it against 20 years of Teradata history via the Dataiku semantic layer, narrates the answer with charts. The "Teradata + Dataiku + Claude" archetype.
Architecture · Live Data Flow
Enterprise Analytics Agent architecture: a CFO asks a question in Slack, LangGraph plans queries with Claude or Gemini on GCP Vertex AI, an MCP server proxies the call to Dataiku which executes against Teradata, charts are rendered in matplotlib, and the model narrates the findings back to Slack.
  • YOUR NETWORK · ON-PREM OR YOUR VPC — CFO in Slack ("why LATAM Q3?", asked in business English) → LangGraph Planner (the orchestrator: multi-step reasoning loop, plan → query → check → narrate · open source, in your VPC) → MCP Server (Dataiku tool · clean tool API · reusable) → Dataiku semantic layer (column names, metrics, RBAC inherited) → Teradata (20yr finance history · system of record) → matplotlib (charts rendered locally).
  • PRIVATE CLOUD · YOUR GCP PROJECT — GCP Vertex AI (Gemini 2.5 Pro · Claude): plans and narrates; single-tenant, region-pinned in your GCP project; stays in your cloud.
Notice the Teradata stack on the left runs top to bottom: MCP wraps Dataiku, Dataiku resolves metric names against Teradata, Teradata returns rows. The agent never sees raw SQL or table names — just clean tool calls. The only path that leaves your VPC is to Vertex AI in your own GCP project — single-tenant, region-pinned (e.g. me-central2 for KSA), and never reaches Google's multi-tenant model pool.
  1. CFO asks in Slack
  2. LangGraph asks Claude/Gemini (Vertex AI) to plan queries
  3. Tool call to the MCP server
  4. Dataiku resolves the semantic query
  5. Teradata returns the rows
  6. matplotlib renders the charts; Claude narrates
  7. Final reply with charts back to Slack
Why This Shape — Design Rationale

Teradata holds the history

20 years of finance data. ~$50M/year licence. You do not migrate this. You talk to it.

Dataiku already has a semantic layer

Column names, metric definitions, hierarchies, RBAC. If the agent queried raw Teradata it would hallucinate column names and bypass all the governance the enterprise spent millions building. Querying via Dataiku inherits all of it.

The agent queries Dataiku via MCP

An MCP server exposes Dataiku datasets as agent tools. The agent doesn't know or care that the underlying store is Teradata — just "query revenue_by_region_quarter for LATAM 2020-2025."

Claude Opus for the reasoning step

Planning which queries to run, narrating findings in business English, handling "drill deeper" follow-ups — this is where you want the frontier model. Cheaper models would produce shallower analysis.

Layer Usage
Layer | Pick
L2 Hosting | Client's existing infrastructure (on-prem or your GCP project)
L3 Model | Claude Opus or Gemini 2.5 Pro (reasoning) + Gemini Flash (chart captions)
L4 Inference | GCP Vertex AI (Claude or Gemini, single-tenant, region-pinned for compliance)
L5 Data | Teradata (history) via Dataiku
L6 DS Platform | Dataiku — semantic layer + RBAC
L7 Orchestration | LangGraph (structured multi-step reasoning)
L8 Protocol | MCP server for Dataiku — the clean integration point
L9 Application | Slack bot (CFO already lives there)
L10 Governance | Langfuse (trace) · Presidio (PII in logs) · Dataiku's own RBAC
What A Junior Architect Gets Wrong

Mistake 1: "Let's migrate Teradata to Snowflake for the AI project." No. Never propose a multi-million-dollar data migration as part of an AI project. Build on top.

Mistake 2: "The agent will write raw SQL against Teradata." It will hallucinate table names, miss business definitions, and bypass RBAC. Always go through the semantic layer.

Mistake 3: "We'll use a smaller model to save cost." Financial analysis needs careful reasoning. Going cheap here destroys the client's trust on the first wrong answer. Use a frontier model (Claude Opus or Gemini 2.5 Pro).

Mistake 4: "Skip MCP, call Dataiku directly." Then every future AI product the client adopts will have to re-integrate. MCP server once, reused forever.

This is the NMO sweet-spot pitch: Clients with existing Teradata + Dataiku investments are ripe for this exact pattern. Their pain: DS teams produce insights slowly; business users can't self-serve. The agent makes the existing investment conversational — without replacing any of it. You charge for the MCP integration, the agent build, and the ongoing tuning. The client keeps every dollar they've spent on their data estate.
Part 3 · How It Fits Together
Example · Real-Time Voice Agent
Users call a phone number. The agent answers in under 300ms, holds a natural conversation, books appointments. The "where Groq actually matters" archetype.
Architecture · The Voice Pipeline (latency budget)
Real-Time Voice Agent pipeline (diagram): audio enters on the phone, LiveKit transports it over SIP/WebRTC, Deepgram transcribes it with streaming partial transcripts, Llama 3.3 70B on Groq generates the reply (first token in ~100 ms versus ~500 ms on a regular GPU), Cartesia streams the TTS audio back, and the caller hears the answer. With Groq the round trip sums to roughly 430 ms and feels natural; without it the pauses get awkward.
A pure left-to-right pipeline. Every step's latency is annotated above its arrow. The Groq box is drawn larger because that's where the entire product wins or loses — Llama 3.3 70B on a regular GPU is too slow for natural voice; on Groq it's fast enough.
  1. caller dials in (PSTN)
  2. LiveKit bridges to streaming audio
  3. Deepgram streams partial transcripts
  4. Groq + Llama 3.3 70B replies in tokens
  5. Cartesia streams the audio back
Why Groq Here — And Only Here

Voice feels natural under 300ms round-trip, stilted at 500ms, broken above 800ms. Here's the latency budget:

Step | Typical latency | Notes
Network (caller → server) | ~40 ms | Geography-dependent
STT (streaming) | ~100 ms | Deepgram partial transcripts
LLM first-token (Llama on GPU) | 400–800 ms | The bottleneck
LLM first-token (Llama on Groq) | ~100 ms | The fix
TTS first-audio | ~150 ms | Cartesia is the fastest
Network (server → caller) | ~40 ms

On a regular GPU (400–800 ms to first token), the total lands around 730–1,130 ms. Caller experience: awkward pauses. On Groq: ~430 ms. Caller experience: natural conversation. That's the entire product difference.
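A throwaway sum of the table above makes the point; the GPU figure is taken as the midpoint of the 400–800 ms range.

```python
# Round-trip = network in + streaming STT + LLM first-token + TTS first-audio + network out.
FIXED_MS = 40 + 100 + 150 + 40  # everything except the LLM step

def round_trip(llm_first_token_ms: int) -> int:
    return FIXED_MS + llm_first_token_ms

print(round_trip(100))  # Groq LPU                       -> 430 ms, feels natural
print(round_trip(600))  # midpoint of the 400-800 ms GPU -> 930 ms, awkward pauses
```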

Why Not Claude or GPT-4o

Two reasons:

  • Latency: closed frontier models have first-token latencies of 400–1000ms. Fine for chat, too slow for voice.
  • Groq doesn't host them: Claude runs only on Anthropic/Bedrock/Vertex, GPT only on OpenAI/Azure. You can't put them on an LPU.

OpenAI's Realtime API (GPT-4o) is a credible alternative — it's designed for voice specifically. But you're locked into OpenAI and the pricing gets expensive fast. Groq + Llama is the open-weight path.

Layer Usage
Layer | Pick
L1 Silicon | Groq LPU — the reason this works
L3 Model | Llama 3.3 70B (LLM) · Whisper or Deepgram Nova-2 (STT) · Cartesia (TTS)
L4 Inference | Groq · Deepgram · Cartesia
L7 Orchestration | LiveKit Agents framework (voice-native) or Vapi (managed)
L9 Application | Phone via Twilio + dashboard for reviewing calls
L10 Governance | Call recording, transcription archive, Langfuse
Voice is its own world: Voice agents have different dominant concerns than text agents — interruption handling, barge-in, endpointing, silence detection. If you scope a voice project, allocate specifically for voice expertise. It is not "chatbot with a speaker".
Part 3 · How It Fits Together
Example · Multi-Agent Product Team (like Apex)
A team of specialised agents collaborating to ship software. Shows what a fully-occupied stack looks like end-to-end — and what you can responsibly leave out.
The Architecture

Nine agents with distinct roles: Project Manager, Productizer, VPS Admin, Backend Dev, Frontend Dev, Data Scientist, Security, HR, Marketing. Each has its own system prompt, toolset, and responsibilities. An orchestrator queues tasks, spawns short-lived worker containers, collects results, respects a capacity cap, escalates decisions to a human via Telegram.
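A minimal sketch of the capacity-governed queue at the centre of this pattern, using plain asyncio; the task shape and the spawn_worker stub are illustrative, not Apex's actual code.

```python
# Orchestrator sketch: a queue of tasks plus a hard cap on concurrent workers.
import asyncio

MAX_CONCURRENT = 3  # the capacity cap: never more than three workers alive

async def spawn_worker(task: dict) -> dict:
    """Placeholder: start a short-lived worker container, wait for its result."""
    await asyncio.sleep(1)
    return {"task_id": task["id"], "status": "done"}

async def orchestrate(tasks: list[dict]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def run(task: dict) -> dict:
        async with sem:  # acquire a slot; the rest of the queue waits
            return await spawn_worker(task)

    return await asyncio.gather(*(run(t) for t in tasks))

results = asyncio.run(orchestrate([{"id": i} for i in range(9)]))
```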

Architecture · Hub & Spoke (9 agents around an orchestrator)
Multi-Agent Product Team architecture (diagram, Apex pattern): nine specialised agents arranged around a central orchestrator and task queue (FastAPI + Claude Agent SDK) on a single $20/month VPS. The orchestrator spawns short-lived worker containers per task under a 3-concurrent cap and 60% governor, calls Claude Opus/Sonnet via the Anthropic API (the only prompts that leave the network), persists tasks, handoffs, audit log, agent memory, and spend to PostgreSQL + pgvector + Redis, and reports through a Next.js dashboard, a Telegram bot for human approvals, and a daily email.
Hub-and-spoke. Nine specialised agents on the perimeter, all funnelling through the central orchestrator. The orchestrator is the only piece that talks to Claude (red, public) and the only piece that writes to the data store (Postgres). Ahmed sits above as the human approver via Telegram; the dashboard below is read-only for monitoring.
  • 9 agents queue tasks at the orchestrator
  • orchestrator spawns short-lived worker containers
  • workers call Claude via Anthropic API
  • results write to Postgres + pgvector
  • approvals → Telegram for Ahmed
  • live state → Next.js dashboard
Layer-by-Layer
Layer | Pick | Rationale
L1 Silicon | Invisible | Not self-hosting inference.
L2 Hosting | Single $20/month VPS | Single operator. Small scale. Cheap.
L3 Model | Claude Opus + Sonnet | Best at coding + long-context reasoning. Consistent.
L4 Inference | Anthropic (via Claude Max subscription) | Session-based billing dodges per-token surprise.
L5 Data | PostgreSQL + pgvector + Redis | Single-user scale — pgvector is enough. No second DB.
L6 DS Platform | None | This is a software-shipping team, not a data-science team.
L7 Orchestration | FastAPI + Claude Agent SDK + Claude Code CLI | LangChain/CrewAI would add mass without buying much.
L8 Protocol | Direct API calls now; MCP later | MCP is the industry-wide direction for tool integration.
L9 Application | Next.js dashboard + Telegram bot + daily email | Three channels matching three contexts (desk, mobile, morning).
L10 Governance | Built-in audit log, capacity governor, weekly team-health review | Designed in from day one, not bolted on.
What This Stack Deliberately Omits
No Qdrant / Pinecone
At this scale, pgvector is sufficient. Adding a second stateful service would be cost without benefit. Clear migration path if it ever does scale.
No Dataiku / Databricks
Not a data platform. Ships software, not analytics.
No Groq / fast-inference specialist
Claude is fast enough for planning + coding. No voice use case.
No LangChain / CrewAI / agent framework
Claude Agent SDK is closer to the metal. Easier to debug.
No n8n
This is the orchestrator. Two would be two sources of truth.
No Langfuse / external observability
In-house audit_log + token_usage + security_log are purpose-built. Could add Langfuse later for richer analytics.
The lesson: A well-scoped AI product occupies the layers it needs and consciously leaves others empty. The discipline isn't "use everything." It's "pick the minimum that solves your problem, and know what you'll add when scale demands."
Part 3 · How It Fits Together
Example · Customer Care AI
Inbound chat / WhatsApp / email. The agent reads the customer's history, answers from a curated knowledge base, takes actions (refund, reschedule, escalate), and hands off cleanly to a human when it's out of depth. Built on the NMO stack: Dataiku · Teradata · LangChain · Groq.
Why Groq is in this stack: Customer-care replies need to feel instant. Groq runs Llama 3.3 70B at hundreds of tokens / second — 3–5× faster than the same model anywhere else — so the first token of the reply paints in under 200 ms instead of 600 ms. The trade: the prompt leaves your network, so Presidio strips PII before it ever reaches Groq, and tenants who refuse public APIs get auto-routed to local Llama on Ollama (slower but never leaves). For any non-realtime path (the post-call wrap-up, batch summarisation), use Vertex AI in your GCP project instead — cheaper and in-cloud.
Architecture · Live Data Flow
Customer Care AI architecture and live data flow (diagram): a customer message (WhatsApp, web, or email) passes through Presidio PII redaction on-prem, then the LangChain/LangGraph agent pulls customer_360 context from Dataiku and Teradata, searches the knowledge base in pgvector, sends the redacted prompt to Groq Cloud (Llama 3.3 70B on LPU, the only hop that leaves the network), receives the reply, calls MCP action tools (refund, reschedule, escalate) if policy allows, and answers the customer. Llama Guard, self-hosted Langfuse, and an audit log to Teradata run across every step.
  1. customer message arrives
  2. Presidio strips PII before anything else touches it
  3. agent fetches customer_360 from Dataiku
  4. Dataiku reads from Teradata
  5. agent searches the KB in pgvector
  6. redacted prompt sent to Groq (crosses zone)
  7. Llama 3.3 70B reply returns
  8. agent calls action tool if policy allows
  9. final answer sent back to customer
Why This Shape

Teradata holds the customer history

Every prior order, ticket, payment, and product registration. Twenty years for some customers. You do not want the agent guessing — it answers from the truth.

Dataiku exposes it as a clean semantic layer

"customer_360" view: name, tier, lifetime value, open tickets, last interaction, churn risk. The agent gets one tidy object via an MCP tool — never sees raw Teradata tables, never hallucinates column names, inherits the existing RBAC.

LangChain orchestrates the loop

Classify intent → fetch context → check policy → draft → validate → respond → log. LangGraph models the state machine explicitly so you can replay any conversation deterministically. Open source, runs in your VPC.
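A minimal sketch of that state machine in LangGraph, with stubbed node bodies; the state fields and node logic are placeholders for the real MCP tool calls.

```python
# classify -> fetch -> policy -> draft -> validate -> respond, as an explicit graph.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class CareState(TypedDict, total=False):
    message: str
    intent: str
    context: dict
    draft: str
    valid: bool

# Stub nodes; each returns a partial state update.
def classify(s: CareState) -> CareState:      return {"intent": "refund_request"}
def fetch_context(s: CareState) -> CareState: return {"context": {"tier": "gold"}}
def check_policy(s: CareState) -> CareState:  return {}
def draft(s: CareState) -> CareState:         return {"draft": "drafted reply"}
def validate(s: CareState) -> CareState:      return {"valid": True}
def respond(s: CareState) -> CareState:       return {}

g = StateGraph(CareState)
for name, fn in [("classify", classify), ("fetch", fetch_context), ("policy", check_policy),
                 ("draft", draft), ("validate", validate), ("respond", respond)]:
    g.add_node(name, fn)
g.set_entry_point("classify")
for a, b in [("classify", "fetch"), ("fetch", "policy"), ("policy", "draft"), ("draft", "validate")]:
    g.add_edge(a, b)
# If validation fails, loop back and redraft; otherwise respond.
g.add_conditional_edges("validate", lambda s: "respond" if s.get("valid") else "draft")
g.add_edge("respond", END)
agent = g.compile()
```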

Groq runs the language model — fast and cheap

Llama 3.3 70B at hundreds of tokens per second on Groq's LPU. Customer-care replies need to feel instant; Groq is 3–5× faster than the same model anywhere else. Trade-off: prompts go to Groq Cloud, so PII is redacted before they leave (Hotspot 1 from the Live Ecosystem Map).

A confidence gate before any action

Refunds, account changes, and escalations only execute if (a) the agent's self-rated confidence is ≥ 0.85, (b) the action is on the per-tier allow-list, and (c) it's within the per-customer rate limit. Else: hand to human with the draft attached.
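A sketch of that gate as plain Python; the threshold, tier allow-list, and rate limit are the numbers quoted above, everything else is illustrative.

```python
# Action gate: all three conditions must hold before the agent may act on its own.
CONFIDENCE_FLOOR = 0.85
ALLOWED_ACTIONS = {
    "gold": {"refund", "reschedule", "escalate"},
    "standard": {"reschedule", "escalate"},
}
MAX_ACTIONS_PER_HOUR = 3  # per customer

def may_execute(action: str, confidence: float, tier: str, actions_last_hour: int) -> bool:
    return (
        confidence >= CONFIDENCE_FLOOR
        and action in ALLOWED_ACTIONS.get(tier, set())
        and actions_last_hour < MAX_ACTIONS_PER_HOUR
    )

# If this returns False, the conversation goes to a human with the draft attached.
```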

Layer Usage
Layer | Pick | Zone
L2 Hosting | Your GCP project (GKE / Cloud Run) or on-prem K8s | Local
L3 Model | Llama 3.3 70B (replies) · Llama 3 8B (intent classify) | OS
L4 Inference | Groq Cloud for speed · Ollama on-prem for fallback / regulated tenants | Public / Local
L5 Data | Teradata (history) · Postgres+pgvector (KB) | Local
L6 DS Platform | Dataiku — customer_360 semantic + churn-risk feature | Local
L7 Orchestration | LangChain + LangGraph (state machine + tool calling) | Local
L8 Protocol | MCP servers: customer_history, order_actions, kb_search, escalate | Local
L9 Application | Webchat widget · WhatsApp Business API · email gateway · Zendesk for human handoff | Mixed
L10 Governance | Presidio (PII redaction) · Langfuse (traces) · Llama Guard (output filter) · in-house audit_log | Local
Where Open Source Earns Its Keep
  • Llama 3.3 70B (Meta, OS). You can run it on Groq Cloud today and switch to your own GPU box tomorrow without rewriting a line. No model lock-in.
  • LangChain / LangGraph (OS). The orchestration is your code. You can audit it, fork it, or swap it if a vendor framework would be faster. Vendor lock-in for orchestration is the most expensive lock-in to escape later.
  • Presidio (Microsoft, OS). PII detection runs on-prem before any prompt leaves. If you used a SaaS redactor, you'd be sending raw PII to that SaaS first — defeats the point.
  • pgvector (Postgres extension, OS). Your KB embeddings live in your existing Postgres. No third vector DB to license / monitor / back up.
  • Llama Guard (Meta, OS). Runs locally. Catches policy violations (hate speech, PII leak in output) without sending the response to a moderation API.
The Risk Map for This Use Case
Customer history
Safe
Teradata · in your data centre
Never leaves. Agent reads via Dataiku semantic layer, gets a structured object, never raw rows.
Knowledge base
Safe
pgvector · Postgres on your VPC
Articles, FAQs, policy docs. Embedded once (on-prem embedding model), retrieved every turn.
Conversation transcript
Watch
Logged to Teradata · trace to Langfuse (self-hosted)
Full transcripts are sensitive. Self-host Langfuse — never send traces to a SaaS observability tool unless contractually allowed.
Mitigation: retention 90 days · access role-restricted · scrubbed of names/IDs in trace UI.
The prompt to Groq
Risk
Public API · crosses the internet
Whatever the agent puts in the prompt is read by Groq's infrastructure. Groq's contract says no training on inputs and short retention — but you must still not put raw PII in.
Mitigation: Presidio strips name / phone / email / national ID before send · customer_360 object passes only tier & tags, never the customer's identity.
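A minimal sketch of that redaction step using Presidio's analyzer and anonymizer; the entity list is an assumption, and a national-ID pattern would be a custom recognizer you register yourself.

```python
# Strip common PII before the prompt leaves the network.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact(text: str) -> str:
    """Replace detected entities with placeholders like <PERSON>. A KSA
    national-ID pattern would be added as a custom recognizer (not shown)."""
    findings = analyzer.analyze(
        text=text,
        language="en",
        entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"],
    )
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

# redact("Hi, I'm Sara, call me on +966501234567")
#   -> "Hi, I'm <PERSON>, call me on <PHONE_NUMBER>"
```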
WhatsApp Business API
Risk
Meta-hosted · subject to Meta's terms
Customers expect WhatsApp; Meta sees every message. There is no on-prem equivalent.
Mitigation: sensitive transactions (refunds, account changes) move to a magic link in your own portal — not completed in WhatsApp.
When to use this pattern: Volume support (chat / WhatsApp / email), policy-bounded actions (refund up to $X, reschedule, account info), and a healthy KB. Replaces the bottom 60–70% of L1 / L2 ticket volume; everything else routes to a human with the draft + history attached. Typical ROI: 40–60% reduction in handling time for resolved-by-AI tickets.
Part 3 · How It Fits Together
Example · Call Center · High-Priority Clients
A live voice line for VIP / private-banking / enterprise tier. The AI never speaks to the customer — it sits behind the human agent, listening, retrieving, and surfacing the next best action in real time. Same NMO stack with a different shape.
Design principle for VIP: For high-priority clients, the AI is a co-pilot, not a pilot. The human is on the line and stays on the line. The AI's job is to make that human dramatically smarter and faster — not to replace them. Replacement is for L1 volume; co-pilot is for VIP and complex.
Why Groq is in this stack: The live loop fires every ~4 seconds while the call is happening; the suggestion has to land on the agent's screen before the customer's next sentence. Groq + Llama 3 8B hits sub-700ms round-trip on a 4-second sliding transcript window — anything slower than that and the human has already moved on. Vertex AI / Anthropic API can't replace Groq here because they target 400–1000ms first-token latency (fine for chat, too slow for live). The post-call wrap-up does use Vertex AI (no latency pressure → stays in your GCP project) — see Diagram 2.
Architecture · Diagram 1 · The Live Loop (during the call)
Call Center VIP live loop (diagram, during the call): audio comes in on the on-prem PBX, Whisper transcribes it on a local GPU, a rolling 4-second window (account numbers and IDs regex-masked) plus customer_360 context from Dataiku/Teradata goes to Groq Cloud running Llama 3 8B (the only hop that leaves the network), and the returned next-best-action lands on the human agent's screen panel. Target round trip: under 700 ms. The co-pilot never speaks to the customer.
The live loop runs continuously while the call is happening. Audio enters on the phone (1), Whisper transcribes locally (2), a 4-second rolling window is taken every loop tick (3), customer context is loaded from Dataiku/Teradata (4), the redacted snippet plus context goes to Groq (5 — the only path that leaves your network), the suggestion comes back (6), and the human agent sees it on their side panel (7).
  1. VIP call hits the on-prem PBX
  2. Whisper transcribes locally
  3. rolling 4s window
  4. customer_360 context loaded
  5. redacted snippet → Groq (crosses zone)
  6. next-best-action returns
  7. shown on the agent's screen
Architecture · Diagram 2 · Post-Call Wrap-Up (after hangup)
Call Center VIP post-call wrap-up (diagram, after hangup): the full redacted transcript goes to Claude Opus or Gemini 2.5 Pro on GCP Vertex AI inside your own GCP project (single-tenant, region-pinned), which writes a polished wrap-up summary that lands back in Teradata as part of the customer's relationship history.
This fires once, after the call hangs up. The full transcript (already PII-redacted) goes to Vertex AI in your own GCP project (1) — Claude Opus or Gemini 2.5 Pro writes a polished wrap-up summary that lands back in Teradata (2). Region-pinned (use me-central2 for KSA / Dammam) so the data never leaves your jurisdiction.
  1. full transcript sent to Vertex AI in your GCP project
  2. polished summary saved into Teradata
Why The Shape Differs From Customer Care

Latency budget is brutal

A suggestion that lands 6 seconds after the customer's question is useless — the human already moved on. Groq's LPU + Llama 3 8B for the suggestion loop hits sub-700ms on a 4-second sliding transcript window. This is exactly why Groq is in the stack.
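A sketch of the loop, assuming the Groq Python SDK and a few placeholder helpers for the pieces that live elsewhere in the stack (redacted rolling window, screen push); the model id is also an assumption.

```python
# Live suggestion loop: every ~4 s, send the redacted window + context to Groq,
# push the next-best-action to the agent's screen.
import time
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def next_best_action(window: str, customer_360: dict) -> str:
    """One Groq call per tick; the whole round trip should stay under ~700 ms."""
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumption: whichever 8B-class Llama id Groq currently serves
        max_tokens=120,
        messages=[
            {"role": "system", "content": "Suggest one next-best-action for the human agent. Be terse."},
            {"role": "user", "content": f"Customer context: {customer_360}\nLast 4 seconds: {window}"},
        ],
    )
    return resp.choices[0].message.content

# Placeholders for pieces that live elsewhere in the stack:
def rolling_window() -> str:
    return "already-redacted 4-second transcript snippet"  # regex + Presidio masking happens upstream

def push_to_agent_screen(text: str) -> None:
    print(text)  # in production: websocket push to the side panel

customer_360 = {"tier": "VIP", "churn_risk": 0.12}  # from Dataiku; identity stripped
while True:  # in production: while the call is live
    push_to_agent_screen(next_best_action(rolling_window(), customer_360))
    time.sleep(4)
```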

Speech recognition stays on-prem

VIP calls contain account numbers, PINs, deal terms. Use Whisper-large-v3 on a local GPU; never stream audio to a cloud STT. The transcript that goes to the LLM is already partially redacted by a regex filter (account numbers masked).

Dataiku's churn-risk & tier scores drive escalation

If the live sentiment turns negative and the customer is in the top decile of LTV, the AI silently pages the team lead — no waiting for the customer to ask.

Two model tiers

Llama 3 8B on Groq for the every-4-second loop (cheap, fast). Claude Opus or Gemini 2.5 Pro on Vertex AI in your own GCP project for the post-call wrap-up — better at writing summaries the relationship manager actually trusts. The mix matters.

Layer Usage
Layer | Pick | Zone
L2 Hosting | On-prem GPU box for Whisper · your GCP project (GKE) for the agent loop · Vertex AI region-pinned | Local / Priv
L3 Model | Whisper-large-v3 (ASR) · Llama 3 8B (live loop) · Claude Opus or Gemini 2.5 Pro (wrap-up) | OS + Frontier
L4 Inference | Local GPU (Whisper) · Groq (live Llama loop) · GCP Vertex AI (Claude / Gemini wrap-up) | Local + Public + Priv
L5 Data | Teradata (relationship history) · Postgres (live call state) | Local
L6 DS Platform | Dataiku — customer_360, churn_risk, lifetime_value features | Local
L7 Orchestration | LangGraph (4-second loop) + custom websocket runner for the agent screen | Local
L8 Protocol | MCP servers: customer_360, open_tickets, page_supervisor, book_followup | Local
L9 Application | Genesys / your existing telephony · agent-desktop side-panel (Vue) · supervisor pager (Telegram or Slack) | Mixed
L10 Governance | Hard rule: AI never speaks · transcripts retained per regulator · Langfuse self-hosted · per-rep accuracy dashboard | Local
The Risk Map for This Use Case
Live audio
Safe
Whisper on-prem GPU
Audio bytes never leave the building. Streaming STT is the seductive shortcut — don't take it.
Customer record
Safe
Teradata via Dataiku
Same as Customer Care AI — single semantic layer, same RBAC, never raw rows.
Redacted transcript chunks
Watch
Sent to Groq for the live loop
Every 4 seconds the rolling transcript window is sent to Groq Cloud for intent + next-best-action. Account numbers, IDs, and amounts are masked locally first.
Mitigation: regex + Presidio in-line · contract: zero retention · for top-tier accounts, swap to local Llama 3 8B at higher latency cost.
Wrap-up summary
Watch
Sent to Vertex AI in your GCP project
After the call ends, the redacted full transcript goes to Claude or Gemini on Vertex AI for a polished summary written into Teradata.
Mitigation: stays in your GCP project · region-pinned (me-central2 for KSA / Dammam) · no PII in summary by template.
Telephony provider
Risk
Genesys / Twilio / on-prem PBX
Whoever runs the phone line sees the call. For very-high-trust clients, on-prem PBX is the only acceptable answer.
Mitigation: contractual · or fully on-prem PBX (FreePBX / Asterisk) for KSA private-banking tier.
Why this earns the consulting fee: Replacing the human is the wrong story for VIP. The right story is "your best relationship manager, every call, with twenty years of context surfaced in 600 milliseconds." That's the pitch — and it's true with this stack.
Part 4 · Practical
Anatomy of an AI Agent
Every agent, however sophisticated, has the same handful of parts. Here's the checklist. If a design is missing a part, it's probably a chatbot with ambition.
The Seven Parts at a Glance
How the seven parts fit together
Agent anatomy (diagram): the inputs (1 system prompt, 2 context window, 3 tools) feed the loop (5: Think → Act → Observe, repeated until done and bounded by max-steps, timeout, and budget), which reads and writes memory (4: vectors and session state fed back into context); evaluators (6: gold answers, LLM judge, human review, business KPIs) and observability (7: trace every call with Langfuse or LangSmith, replay any run) cut across everything.
Inputs (1–3) feed the Loop (5). The Loop reads/writes Memory (4). Evaluators (6) and Observability (7) watch every call.
The Seven Parts
1 · System prompt — the agent's identity

A long-form instruction defining the agent's role, scope, tone, constraints, escalation triggers, and output format. Usually 500–5,000 words. Rewritten many times during development.

"You are a senior customer support agent. You answer only from the provided context. If unsure, escalate to a human. Never promise refunds — offer to file a request."

2 · Context window — what the agent "sees"

Composed of: system prompt + conversation history + relevant retrieved documents (RAG) + tool descriptions + user's current message. Fits inside the model's context window budget.

3 · Tools — what the agent can do

Functions the agent can call. Examples: search_database(query), create_ticket(data), send_email(to, subject, body). Each tool has a name, a description, a parameter schema, and a handler that executes it.

Key insight: the agent's capability is defined by its tools. A smart LLM with no tools is just a chatbot. A modest LLM with the right tools can run a business.
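A minimal sketch of one tool as data plus a handler; the ticketing handler and schema are illustrative, not a specific vendor's API.

```python
# One tool = name + description + parameter schema + handler.
import json

def create_ticket(data: dict) -> dict:
    """Handler: executes the side effect and returns a structured result."""
    return {"ticket_id": "T-1042", "status": "open", "summary": data["summary"]}

CREATE_TICKET_TOOL = {
    "name": "create_ticket",
    "description": "Open a support ticket for an issue the agent cannot resolve itself.",
    "parameters": {  # JSON Schema the model sees
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["summary"],
    },
    "handler": create_ticket,
}

# When the model returns {"tool": "create_ticket", "arguments": {...}}, the
# orchestrator validates the arguments against the schema, then calls the handler.
args = {"summary": "Customer reports duplicate charge", "priority": "high"}
print(json.dumps(CREATE_TICKET_TOOL["handler"](args), indent=2))
```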

4 · Memory — what the agent remembers

Two scopes:

  • Short-term: the current conversation. Lives in the context window.
  • Long-term: persistent across sessions. Usually vectors in Qdrant/pgvector, retrieved by relevance on each new conversation.
5 · Loop — the agent's autonomy

Agents don't just answer once. They loop: pick tool → call it → observe result → decide next action → repeat until done. The loop is where agent logic gets complex.

Guardrails: max iterations, timeout, budget cap. Without these, a buggy agent loops forever burning tokens.
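A sketch of the bounded loop with all three guardrails; call_model and run_tool are stubs standing in for your inference client and tool registry.

```python
# Think -> Act -> Observe loop, capped by steps, wall-clock time, and spend.
import time

MAX_STEPS, TIMEOUT_S, BUDGET_USD = 8, 60, 0.50

def call_model(history: list) -> dict:  # stub for the real inference call
    return {"action": "finish", "answer": "done", "cost": 0.01}

def run_tool(decision: dict) -> str:    # stub for the real tool registry
    return "tool result"

def run_agent(task: str) -> str:
    history = [{"role": "user", "content": task}]
    spent, started = 0.0, time.monotonic()
    for _ in range(MAX_STEPS):                                   # guardrail 1: max iterations
        if time.monotonic() - started > TIMEOUT_S or spent > BUDGET_USD:
            return "escalate: guardrail tripped"                 # guardrails 2 and 3
        decision = call_model(history)                           # Think
        spent += decision["cost"]
        if decision["action"] == "finish":
            return decision["answer"]
        history.append({"role": "tool", "content": run_tool(decision)})  # Act + Observe
    return "escalate: max steps reached"
```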

6 · Evaluators — how you measure quality

Tests that run against the agent. Can be: exact-match against gold answers, LLM-as-judge scoring (another LLM rates the output), human review, business metrics (tickets resolved per hour). In production, usually all of these.

7 · Observability — how you see what happened

Every inference call logged, every tool call recorded, every decision traced. Without this, you cannot debug. With it (Langfuse, LangSmith, Helicone), you can replay any conversation and see exactly what the model saw and why it chose what it chose.

Common Agent Designs
Pattern | Shape | Best for
ReAct loop | Think → Act → Observe → Think | General tool-using agents.
Plan-then-Execute | Planner writes full plan; executor runs steps | Complex multi-step tasks.
Reflexion / self-critique | Agent reviews its own output before final answer | Quality-sensitive generation.
Multi-agent crew | Specialised agents collaborate | Complex domains with clear sub-roles (like Apex).
Graph / state machine | Explicit nodes and edges | Flows where you need deterministic control.
Part 4 · Practical
Decision Cheat Sheet
When a client asks, when a teammate asks, when you're half-awake — here's the quick answer.
Model Selection
Situation | First-instinct pick
Best-in-class coding | Claude Sonnet / Opus · Qwen 2.5 Coder (open)
Cheapest frontier-quality reasoning | DeepSeek V3/R1 · Gemini 2.5 Flash
1M+ tokens of context | Gemini 2.5 Pro
Run on your own hardware | Llama 3.3 70B · Qwen 2.5 (run via Ollama, vLLM, or llama.cpp)
Voice agent speed | Llama 3.3 70B on Groq
Arabic / multilingual | Qwen 2.5 · Gemini · Claude
Strong vision | Claude Sonnet · GPT-4o · Gemini 2.5
Cheap high-volume classification | Claude Haiku · Gemini Flash · GPT-4o-mini
Infrastructure Selection
Situation | First-instinct pick
Self-host an LLM on a small VPS | Ollama + Llama 3.x 8B or Qwen 2.5 Coder 7B
Real-time voice (under 300ms first-token) | Groq + Llama 3.3 70B · OpenAI Realtime as paid alt
Live agent assist / call-centre co-pilot | Groq + Llama 3 8B (sub-700ms round-trip)
High-throughput batch on open weights | Groq for speed · Together / Vertex AI for cost
Frontier reasoning (Claude / Gemini quality) | GCP Vertex AI (not Groq; Groq doesn't host them)
Sensitive prompts you can't redact | Vertex AI in your GCP project or local Llama — not Groq (it's a public API)
Unified gateway across providers | LiteLLM (open) · Portkey (managed)
Rent an H100 for a day | RunPod · Lambda Labs · Modal
Enterprise with GCP commitment (your default) | GCP Vertex AI · Claude or Gemini
Need 1M+ token context | Gemini 2.5 Pro on Vertex AI
KSA data residency required | Vertex AI me-central2 (Dammam) OR on-prem Llama
Data Layer Selection
Situation | First-instinct pick
RAG on ≤10M vectors | pgvector (reuse existing Postgres)
RAG on 10M–1B vectors | Qdrant (self-host) · Pinecone (managed)
Modern analytical warehouse | Snowflake · BigQuery (GCP) · Databricks
Existing Teradata estate | Work with it, don't migrate
Event / product analytics | ClickHouse
Graph relationships matter | Neo4j
Hybrid (keyword + vector) search | Elasticsearch + pgvector/Qdrant + Cohere Rerank
Framework / Orchestration Selection
Situation | First-instinct pick
Visual automation, LLM is one step | n8n (self-host) · Make · Zapier
Custom agent, Claude-centric | Claude Agent SDK
Complex multi-step reasoning | LangGraph · Pydantic AI
Multi-agent collaboration | CrewAI · AutoGen · custom
RAG-heavy agent | LlamaIndex
Durable agent workflows (hours/days) | Temporal
Multi-step coding tasks | Claude Code · Cursor · Windsurf
Governance / Safety Selection
Situation | First-instinct pick
Production observability, open-source | Langfuse
Production observability, managed | Helicone · LangSmith
Prompt evaluation in CI | Promptfoo
Block prompt injection | Llama Guard 4 (self-host) · Lakera Guard (SaaS)
Redact PII before LLM | Presidio (open) · Private AI (managed)
RAG-specific evaluation | Ragas
Part 4 · Practical
Common Pitfalls
Mistakes that will cost you a client or a month. Pattern-match these early.
Architectural Pitfalls
Using one model for everything
Haiku for classification + Opus for reasoning is a 10× cost cut. Teams that standardise on "the smart model for all calls" are burning money.
Skipping the vector DB and stuffing everything into context
Works with 50 docs, breaks at 5,000. RAG from day one even if it feels like overkill.
No observability
First production bug, you have no idea what the model saw. Langfuse on day one, not day 100.
Adopting every new framework
You'll spend all your time migrating and none shipping. Pick one, stick with it for 6 months, re-evaluate.
Agent when automation would do
n8n solves 70% of "AI workflow" needs. Building a LangGraph agent for predictable pipelines is over-engineering.
Automation when agent would do
If the task requires judgement, n8n's fixed decision tree becomes 200 boxes and unmaintainable.
Client & Stakeholder Pitfalls
Promising to replace their data warehouse
Never. You build on top of Teradata/Snowflake/whatever they have. Migrations are not AI projects.
Underestimating data prep
70% of an AI project is data cleanup + chunking + indexing. Budget accordingly. Tell the client this upfront.
Demo-grade vs production-grade
Demo takes a week. Production takes months. Clients see a demo and expect production next Tuesday. Set expectations explicitly.
No human-in-the-loop for sensitive actions
Sending emails, making payments, deleting records — always approval-gated. First unapproved action that goes wrong is the end of the client relationship.
PDPL / data residency blind spot
KSA client + OpenAI API + personal data = compliance violation. Ask about this on day one. Vertex AI me-central2 (Dammam, inside KSA) or on-prem Llama are the compliant paths.
Technical Pitfalls
Hardcoding prompts in application code
Every prompt tweak requires a deploy. Use a prompt management system (Langfuse prompts, Braintrust, or a simple JSON file with versioning).
No eval suite
You changed the prompt. Did it improve things or regress them? Without a Promptfoo suite, you're guessing.
Letting agents loop forever
Max iterations, timeout, and budget cap per agent run. Always.
Trusting the model's output
LLMs hallucinate. Every structured output (SQL, JSON, code) needs validation. Every factual claim needs a citation back to the source.
Secrets in prompts
API keys, customer data, connection strings — never in a prompt. Prompts get logged, shared, debugged. Secrets leak.
The meta-pitfall: Over-estimating what agents can do today and under-estimating what they'll do in 12 months. Build for today's reliability (narrow scope, good evals, human oversight). Architect for tomorrow's capabilities (clean tool boundaries, swappable models, MCP everywhere).
Part 4 · Practical
Where Your Data Lives — The Trust Map
A drill-down on every place customer data can leave your network in an AI architecture, and what to put in front of it. The map you draw on a whiteboard at the start of every client engagement.
The Three Questions, Per Component

For every box you draw in the architecture, answer these three questions out loud:

  1. Where does the data physically sit when this component is using it? (your DC · your VPC · vendor's cloud)
  2. Who can read it there? (your team · your cloud provider · the vendor's employees · the public internet)
  3. What contract or law constrains them? (DPA · SLA · GDPR/PDPL/HIPAA · "trust me bro")

If the answers feel hand-wavy, you have a risk. If you can't answer at all, you have a problem.

The Component-by-Component Map
Teradata / on-prem warehouse
Safe
Your data centre
The system of record. By design, never leaves. The agent must talk to it through a semantic layer — never with raw SQL the model wrote.
Dataiku on-prem
Safe
Your data centre
Semantic layer + RBAC + feature pipelines. If hosted on-prem, all data stays. If hosted on Dataiku Cloud, it's single-tenant SaaS — read the contract.
LangChain / LangGraph code
Safe
In your application container
It's an OS library running in your process. No network calls of its own — every external call is something you wrote.
pgvector / Postgres
Safe
Your VPC / on-prem
RAG embeddings + app state. Embed locally with a sentence-transformers model on your GPU.
Ollama / Llama on your GPU
Safe
Your GPU box
Local inference. Slower and dumber than frontier models, but the prompts never leave. The fallback for any tenant who can't tolerate cloud LLMs.
Gemini / Claude via GCP Vertex AI
Watch
Your GCP project · region-pinned
Your single-tenant inference home. Anthropic / Google don't see the prompts; Vertex AI runs them on your behalf inside your GCP project. Gemini 2.5 Pro's 1M-token context handles whole codebases / long contracts; Claude 4 is available too via the Vertex Model Garden.
Mitigation: region-pinned (me-central2 Dammam for KSA · me-central1 Doha for Qatar · europe-west1 Belgium for EU) · DPA in place · Cloud Audit Logs on · IAM scoped per service-account · model versions pinned via Model Garden · VPC Service Controls to forbid data egress.
Self-hosted Langfuse / observability
Watch
Your VPC
Full traces of every LLM call — prompts, responses, tools used. Goldmine if breached.
Mitigation: never use the SaaS version for sensitive workloads · scrub PII before tracing · short retention · access role-restricted.
Anthropic API direct
Risk
Anthropic's cloud · US
Fastest path to Claude. Prompts cross the public internet to Anthropic. They have strong contracts (no training on API inputs, short retention) but they do see your prompts.
Mitigation: use only for non-sensitive prompts · move sensitive workloads to Vertex AI for the same Claude model behind your VPC · always Presidio-redact in front.
Groq Cloud
Risk
Groq's cloud · US
Fastest open-model inference on earth (LPU). Same trade as Anthropic API — prompts leave your network. Worth it for latency-bound use cases.
Mitigation: Presidio in front · contract review (zero retention claim) · fall back to local Llama for top-tier-tenants.
OpenAI API direct
Risk
OpenAI's cloud · US (mostly)
Same shape as Anthropic. Region options thinner. Avoid for regulated workloads — go via Azure OpenAI inside your tenant instead.
Mitigation: use Azure OpenAI for anything regulated · strict prompt redaction · never send raw customer rows.
SaaS connectors (Salesforce, Slack, Gmail, WhatsApp)
Risk
Each vendor's cloud
When the agent calls "lookup customer" or "post to Slack," that data moves to the SaaS vendor. They already had it (you signed up), but the agent is now writing more of it more often.
Mitigation: route through MCP servers that log + rate-limit · per-tenant allow-lists · never call from the LLM's prompt directly — always through tool definitions.
SaaS observability (Datadog · Honeycomb · cloud Langfuse)
Risk
Vendor cloud
Convenient. Also: every LLM trace they ingest contains prompts, which contain everything you redacted-or-didn't.
Mitigation: default to self-hosted for AI traces · if SaaS, scrub aggressively at the SDK boundary, not later.
The pitch we make: Most AI architectures we audit have two more red boxes than the team realised — usually a SaaS observability tool and a third-party SDK that "just helps with prompts." First deliverable on every engagement: a trust map of every component in their current state, marked safe/watch/risk, with a 30/60/90-day plan to move red boxes to amber or green.
Part 4 · Practical
Open Source — When & Why
Open source isn't a religion. It's the right answer when you need to control trust, cost, or lock-in. Here's the decision rule and the layer-by-layer pick list.
The Decision Rule

Use open source when any of the following is true:

  1. The data is regulated or sensitive. If you can't send it to a public API, the model has to run where you can run it — that means open weights.
  2. The component is on the hot path of your business logic. Orchestration, agent loops, RAG retrieval — anything you'll want to fork, debug, and customise. Vendor lock-in here is the most expensive lock-in to undo.
  3. Cost will explode at scale. Per-token pricing makes sense when usage is small. At a million queries a day, an open model on your hardware is 5-20× cheaper than a frontier API.
  4. You need predictable behaviour. Open weights don't change overnight. A vendor "model improvement" can break your evals on a Tuesday.

Use proprietary / SaaS when:

  1. You need the absolute best reasoning available, and the prompts are not sensitive. (Claude Opus, GPT-5-class.)
  2. The component is undifferentiated infrastructure you would never build yourself. (CDN, email delivery, payment processing.)
  3. You're prototyping and time-to-first-demo matters more than long-term cost.
Layer-by-Layer Picks
Layer | Open-source default | Proprietary when
L3 Models | Llama 3.3 70B · Llama 3 8B · Qwen 2.5 · DeepSeek | Claude / GPT for top-tier reasoning when prompts are not sensitive
L4 Inference | Ollama (local) · vLLM · TGI | Groq / GCP Vertex AI / Anthropic API for speed or scale you can't host
L5 Vector DB | pgvector · Qdrant · Weaviate | Pinecone if you want zero ops and aren't worried about lock-in
L6 ML Platform | MLflow · Metaflow · Kubeflow | Dataiku when you need a visual semantic layer + RBAC for non-coders
L7 Orchestration | LangChain / LangGraph · n8n · CrewAI · Temporal | Vendor agent platforms only when you accept the lock-in
L8 Protocols | MCP · OpenAPI | (no proprietary alternative — open is the standard)
L10 Governance | Langfuse self-hosted · Promptfoo · Llama Guard · Presidio | SaaS observability only for non-sensitive workloads
The Mix We Default To
Open source

The structural layer

  • Orchestration · LangChain / LangGraph — this is your code, never lock it in
  • Vector store · pgvector — already in your Postgres
  • Local model · Llama 3.3 70B on Ollama — the regulated-data fallback
  • Speech-to-text · Whisper local — never stream audio to a cloud STT
  • PII redaction · Presidio — must run before prompts leave
  • Output safety · Llama Guard — local moderation
  • Tracing · Langfuse self-hosted — full prompt visibility, kept private
  • Eval · Promptfoo — your evals, your test data, in your repo
Proprietary

The intelligence layer (selectively)

  • Frontier reasoning · Claude Opus or GPT-5-class — only for non-sensitive prompts, only where it earns its cost
  • Fast inference · Groq Cloud — for latency-bound loops, with redaction in front
  • Enterprise data · Teradata / Dataiku — already paid for, don't re-platform
  • Single-tenant frontier · GCP Vertex AI — Claude or Gemini inside your own GCP project, region-pinned (your default for any prompt with sensitive content)
  • Long-context reasoning · Gemini 2.5 Pro via Vertex AI — when you need 1M+ tokens (whole repos, long contracts, large case files)
  • Telephony · Genesys / Twilio — unless on-prem PBX is required
  • Channel APIs · WhatsApp Business / SMS gateway — unavoidable for the channel itself
The shape of the mix: OS for everything where you'd be unhappy giving up control. Proprietary for the parts where someone is genuinely better at it than you'd ever be. The default tilt is OS — because the regret of vendor lock-in is bigger than the regret of paying for a server.
Part 5 · Build Your Own
Build Your Own AI Use Case
An 11-step template that turns "we want to do AI" into a defensible architecture, scoped, costed, with a trust map. Use this as the structure for every new use case discussion — internal or with a client.
Use this template as a worksheet: Open this page side-by-side with a fresh doc. For each step, write 2–4 lines for your specific use case. By the end you have an architecture diagram, a layer table, a trust map, a build estimate, and a pitch. Total time: ~90 minutes for a senior engineer with the client's pain understood.

Frame the user job — one sentence

If you can't say it in one sentence, you don't understand it yet.
Question

"Who is doing what task, and what would 'much better' look like for them?"

Customer Care AI: "An L1 support agent at our retail client handles 80 tickets a day; a well-scoped AI should resolve 50 of them end-to-end and prepare drafts for the other 30."

Decide: replace, co-pilot, or augment?

This single choice changes the architecture, the risk profile, and the price.
  • Replace — AI handles the whole task, human is exception path. Use for high-volume, low-stakes, well-defined work (L1 ticket triage, simple RAG Q&A).
  • Co-pilot — AI sits next to the human in real time. Use for high-stakes, high-judgement work (VIP call centre, financial analysis, code review).
  • Augment — AI runs offline, prepares work for humans to consume. Use for batch tasks (lead enrichment, document summarisation, daily briefings).

Map the data sources — every box, every owner

List every system the agent will read from or write to. For each: where it physically sits, who owns the schema, what its access pattern is.

Use the Data layer (L5) and DS Platform layer (L6) as your starting checklist. Don't propose any data migration — work with what exists. If Teradata holds the truth, the agent talks to Teradata (via Dataiku).

Pick the model — frontier vs open, big vs small

Pick by task, not by hype. Match the model's cost/latency profile to the request rate. Then pick the inference home (Anthropic API / OpenAI API / GCP Vertex AI / Groq / Ollama) based on the trust zone you need.
  • Reasoning steps (planning, complex Q&A, code) → frontier (Claude Opus / GPT-class / Gemini 2.5 Pro) or top-tier open (Llama 3.3 70B).
  • Long-context tasks (whole codebases, long PDFs, multi-doc analysis) → Gemini 2.5 Pro on Vertex AI for 1M+ tokens.
  • Classification, extraction, simple Q&A → small open model (Llama 3 8B, Qwen 2.5 7B) or Claude Haiku / Gemini Flash. Cheaper and often faster.
  • Latency-bound loops (live voice, autocomplete, live agent assist) → small open model on Groq · sub-300ms first-token. Without Groq, voice and live-assist don't work.
  • Sensitive prompts → must be open weights hosted by you, or Claude/Gemini via Vertex AI in your own GCP project (single-tenant, region-pinned). Not Groq — it's a public API.

Pick the inference home — three trust zones, one rule per zone

For every model call in your design, write its inference home next to it. The home determines the trust zone and the price.
  • Local on-prem (green zone) → Ollama / vLLM on your GPU · use for any prompt you can't let leave the building. Open weights only. Slowest, but most compliant.
  • Your GCP project (amber zone) → GCP Vertex AI · Claude or Gemini, region-pinned (use me-central2 for KSA / Dammam). Use for sensitive prompts that need frontier quality, or any non-time-bound workload. Default for the user's stack.
  • Public API (red zone) → Groq for latency-bound open-weight, Anthropic API for fastest path to Claude when prompt is non-sensitive, OpenAI API for GPT specifically. Always Presidio-redact before any call.

Pick the orchestration shape

There are exactly four shapes. Pick deliberately.
  • Single-shot LLM call — RAG chatbot, summariser. No agent. Don't over-engineer.
  • Workflow (n8n) — fixed steps, mostly SaaS API calls, one or two LLM nodes. Sales-outreach archetype.
  • Agent (LangGraph) — model decides which tool to call next, loops until done. Customer Care, Analytics archetype.
  • Multi-agent — multiple specialised agents handing off via tasks/queue. Apex archetype. Don't reach for it unless the work genuinely splits across roles.

Define the tools (MCP servers, one per integration)

Every external system the agent touches is an MCP tool with an explicit input/output schema and a logged invocation.

Naming convention: {system}_{verb} — e.g. customer_history_get, order_refund, kb_search, ticket_escalate. Each tool has a one-line description that the model sees, an explicit JSON schema, and a unit test that the orchestrator runs at startup.
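A sketch of the startup check implied here, using the jsonschema library; the tool registry shape and the smoke-test arguments are assumptions.

```python
# Startup check: every registered tool needs a description, a valid JSON Schema,
# and smoke-test arguments that pass it. Tool names follow {system}_{verb}.
from jsonschema import Draft202012Validator

TOOLS = {
    "customer_history_get": {
        "description": "Fetch the customer_360 object for one customer id.",
        "schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
        "smoke_test_args": {"customer_id": "TEST-001"},
    },
    # order_refund, kb_search, ticket_escalate: same shape
}

def validate_tools() -> None:
    for name, tool in TOOLS.items():
        assert tool["description"], f"{name}: missing the one-line description the model sees"
        Draft202012Validator.check_schema(tool["schema"])                        # schema itself is valid
        Draft202012Validator(tool["schema"]).validate(tool["smoke_test_args"])   # example args pass it

validate_tools()  # run at orchestrator startup; refuse to boot if any tool is malformed
```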

Draw the trust map · mark every red box

Use the "Where Your Data Lives" trust map as a template. Every component you wrote down in steps 3–6 gets a green / amber / red badge.

If you have a red box, write the mitigation next to it (Presidio in front · contract terms · fallback to local model · etc.). If a red box has no mitigation, the architecture is not done.

Define the guardrails & the escalation gate

No AI ships without an explicit "when does the human take over?"
  • Confidence threshold — below X, hand to human with the draft attached.
  • Action allow-list per tier — refunds up to $X without approval, anything above goes to a human.
  • Output filters — Llama Guard for unsafe content, custom regex for never-say-this-to-a-customer phrases, Presidio for PII in outputs.
  • Per-conversation rate limits — kill switch if the agent loops on the same customer five times in an hour.

Build the eval set before writing the code

Twenty real examples with the ground-truth answer. Promptfoo runs them on every change. No regression-eval, no merge.

Source examples from real tickets / calls / emails (with PII scrubbed). Cover the happy path, the obvious failure modes, the legally-sensitive cases, and the cases where the right answer is "I don't know — escalating."

Estimate the cost & the build size · then commit

Two numbers: per-query cost at scale, and engineer-weeks to build.
  • Per-query cost = (input tokens × input price) + (output tokens × output price), summed across every model call in a typical conversation. Multiply by expected daily volume × 30. If it's offensive, switch a step to a smaller model and recompute. A throwaway calculator is sketched after this list.
  • Build size = MCP tools (1 day each) + orchestration loop (1 week) + frontend / channel integration (1–3 weeks) + evals (3 days) + observability (3 days) + hardening / load-test (1 week). Typical first MVP: 4–8 engineer-weeks.
  • Then commit. Write a one-page architecture sketch with the layer table, the trust map, the eval plan, the cost-per-query, and the timeline. That document is the basis of the SOW.
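Applying the per-query cost formula above, a throwaway calculator; the prices and token counts are placeholders to be replaced with the current price card and your own traces.

```python
# Back-of-envelope per-query cost, then scaled to a monthly bill.
PRICE_PER_MTOK = {                       # USD per million tokens (input, output): placeholder figures
    "small-open-on-groq": (0.05, 0.08),
    "frontier-on-vertex": (3.00, 15.00),
}

CALLS_PER_CONVERSATION = [               # (model, input tokens, output tokens): placeholder figures
    ("small-open-on-groq", 800, 60),     # intent classification
    ("small-open-on-groq", 2500, 350),   # drafted reply with RAG context
    ("frontier-on-vertex", 1800, 250),   # post-conversation wrap-up
]

def per_query_cost(calls) -> float:
    total = 0.0
    for model, tok_in, tok_out in calls:
        p_in, p_out = PRICE_PER_MTOK[model]
        total += tok_in / 1e6 * p_in + tok_out / 1e6 * p_out
    return total

cost = per_query_cost(CALLS_PER_CONVERSATION)
print(f"${cost:.4f} per conversation -> ${cost * 2000 * 30:,.0f}/month at 2,000 conversations/day")
```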
Worked Example — Filling The Template For "Customer Care AI"
1. User job
L1 support agent at retail client; resolve 50/80 tickets end-to-end, draft the rest.
2. Mode
Replace for L1 volume; co-pilot for the residual.
3. Data
Teradata (orders, history) via Dataiku semantic layer; pgvector (KB articles).
4. Model
Llama 3.3 70B for replies (live, latency-bound); Llama 3 8B for intent classify; Claude/Gemini for the post-conversation wrap-up.
5. Inference home
Groq for the live reply (sub-200ms first-token, the customer-experience differentiator) · Vertex AI in your GCP project for the wrap-up · local Llama on Ollama as the fallback for tenants who refuse public APIs.
6. Shape
LangGraph agent with deterministic state machine (classify → fetch → policy → draft → validate → respond).
7. Tools
customer_history_get, kb_search, order_refund, order_reschedule, ticket_escalate.
8. Trust map
Green: Teradata, Dataiku, pgvector, LangChain, Llama Guard. Amber: Vertex AI in your GCP project. Red: Groq prompt (Presidio in front), WhatsApp (sensitive flows move to portal).
9. Guardrails
Confidence ≥ 0.85 to act; refunds ≤ $200; rate limit 3 actions / customer / hour; Llama Guard + custom never-say list.
10. Evals
25 historical tickets with verified resolutions; nightly Promptfoo run; PR-blocking on regression > 5%.
11. Cost & build
~$0.012 / resolved ticket on Groq; ~6 engineer-weeks for MVP.
From this template to a client meeting: The 11 boxes above are exactly the slides you walk through in a discovery meeting. By the time you finish slide 11, the client knows what you're proposing, why each piece is there, where their data goes, and what it costs. No hand-waving. That's the difference between consulting and reselling.
Part 4 · Practical
Glossary
One-line definitions. Send this page to anyone who needs a quick lookup.
A – F
Agent
An LLM that decides its own next action, calls tools, loops until done.
Bedrock (AWS)
Multi-model inference gateway from AWS — Claude, Llama, Titan, Cohere, Mistral all behind one API.
Chunking
Splitting documents into small pieces for embedding. Chunk size massively affects RAG quality.
Context window
Max tokens a model can see in a single call. Bigger = more material; slower and more expensive.
CUDA
NVIDIA's programming toolkit. The reason NVIDIA has a software moat on GPUs.
Dataiku
Visual + code data science platform. "Tableau for ML." Common in enterprises.
Embedding
Numeric representation of meaning. A ~1500-number vector per piece of text.
Fine-tuning
Adjusting a model's weights with training examples. Harder, slower, rarely the right answer.
Function calling
Model returns JSON saying "call this function." Your code executes and returns the result.
G – L
Gemini
Google's frontier model family. Strong at long context and multimodal.
Groq
Inference provider using custom LPU chips. Dramatically faster tokens/sec on open-weight models.
GPU
Graphics Processing Unit. The default silicon for AI. NVIDIA dominates.
Guardrails
Filters on LLM input/output to block unsafe or off-topic behaviour.
Hallucination
Model producing confident-sounding false output. Mitigated via RAG, citations, validation.
Inference
Running a trained model to produce output. What every API call is.
Lakehouse
Hybrid of data lake + warehouse. Databricks' term.
LangChain
Original general-purpose agent framework. Many competitors now.
Langfuse
Open-source observability platform for LLM apps.
LLM
Large Language Model. The neural network at the heart of it all.
LiteLLM
Open-source proxy unifying 100+ LLM providers behind one OpenAI-compatible API.
LPU
Language Processing Unit. Groq's chip.
M – R
MCP
Model Context Protocol. Anthropic's open standard for agent-tool integration. Becoming the default.
Multi-agent
Multiple specialised agents collaborating, each with its own prompt and tools.
n8n
Self-hostable visual workflow automation tool. Great for LLM-in-a-pipeline use cases.
OLAP
Online Analytical Processing. Warehouse queries across lots of history.
OLTP
Online Transaction Processing. Your app's live database.
Ollama
Simplest way to self-host an LLM locally. ollama run llama3.3.
Open-weight
Model whose weights are published. You can download and run yourself.
pgvector
Postgres extension adding vector columns and similarity search. Use instead of a second DB when possible.
Prompt engineering
The craft of writing instructions that get the model to do what you want.
Qdrant
Open-source vector DB. Rust-based, fast. The go-to for non-trivial RAG.
RAG
Retrieval-Augmented Generation. Fetch relevant docs, inject into context, generate answer.
ReAct
Agent pattern: Reason + Act + Observe in a loop.
Reranker
Second-stage model that reorders retrieval results. Boosts RAG quality significantly.
S – Z
Semantic search
Search by meaning via embeddings. "Refund policy" matches "return guidelines."
Snowflake
Cloud-native data warehouse. Modern enterprise default.
System prompt
The long instruction defining an agent's identity and behaviour.
Teradata
40-year-old enterprise data warehouse. Dominant in big banks, telcos, airlines.
Temporal
Durable workflow engine. For agent workflows that survive restarts.
Token
The unit of work. ~3/4 of an English word. Everything is priced per token.
Tool use
Agent calling functions (search, email, create ticket). The capability multiplier.
TPU
Tensor Processing Unit. Google's custom AI chip.
Vector
Another word for embedding. A list of numbers representing meaning.
Vector DB
Database optimised for similarity search over embeddings.
vLLM
Production-grade inference server for open-weight models.
Warehouse
Analytical database for historical data. Snowflake, Teradata, BigQuery.
Whisper
OpenAI's speech-to-text model. Free to self-host, best-in-class quality.
Contribute back: Found a term missing? Added a tool to our stack that should be here? Send a PR to the primer repo. Keeping this glossary current is everyone's job — not just whoever originally wrote it.