Everyone at NMO who touches AI work — whether you're pitching a client, scoping a project, writing code, or reviewing someone else's architecture. You don't need prior ML experience. You do need to be comfortable hearing names like "pgvector" or "LangGraph" without googling every one.
By the end you should be able to:
- Sketch an AI agent architecture on a whiteboard and place every component in the right layer.
- Evaluate any new tool that lands in your inbox — quickly classify which layer it plays in and what it replaces.
- Cut through vendor pitches that collapse multiple layers into "one platform" without telling you what you're giving up.
- Make informed recommendations when a client asks "should we use X?"
First pass (45 min): read Part 1 cover to cover, skim Part 2 layer headers, read Part 3 worked examples, skim Part 4.
Deep pass (~3 hours): read Part 2 layer-by-layer. Each layer card is independent — you can stop between them.
Reference mode: when you hear a tool name, search the page (Ctrl+F). Every tool named lives in a specific layer with context.
This primer makes judgment calls. "Use pgvector when you have fewer than 10M vectors" is opinion — defensible, pragmatic, and much more useful than a neutral survey. Where there's a live debate between tools, we flag it explicitly. Where the industry has converged on an answer, we state it.
When a client hires us, they don't want a library tour — they want a recommendation. This document trains you to recommend with confidence and with reasons.
A neural network trained on enormous amounts of text, capable of producing human-like language. Examples: GPT-5, Claude Opus, Llama 4, Gemini 3, DeepSeek V3.
What it is not: an application. "ChatGPT" is an application that uses GPT (a model). Confusing these is like confusing "a car" with "the Toyota Corolla engine."
Two flavours:
- Closed / proprietary: access only via the maker's API. Claude, GPT, Gemini, Grok. Usually strongest at the frontier.
- Open-weight: the model weights are published. You can download and run yourself. Llama, Mistral, Qwen, DeepSeek, Gemma, Phi.
The act of running a model to produce output. Training a model is enormously expensive and rare; inference is what happens every time someone sends a prompt. When people say "inference costs" or "inference provider," they mean this.
Every ChatGPT request = one inference call. Running 1,000 prompts = 1,000 inferences. Throughput and latency are measured in tokens/second during inference.
The unit of work for LLMs. A token is roughly 3/4 of an English word — "hello" is one token, "unbelievable" is two or three. Everything is priced in tokens: input tokens (what you send) are priced differently from output tokens (what the model generates).
Why this matters: 1,000 tokens of Claude Opus output ≈ 7.5¢. 1,000 tokens of Claude Haiku ≈ 0.5¢. Same family, 15× cost difference — because the smaller model is much cheaper to run. Choosing the right model per task is the single biggest lever on your AI budget.
How much text the model can "see" in a single inference call — input + output combined. GPT-3.5 was 4k tokens (~3,000 words). Claude 4 is 200k (~150k words, a whole novel). Gemini 2.5 Pro is up to 2M.
Why this matters: bigger context = you can stuff more reference material in. But: longer context = slower + more expensive + more likely to lose focus on early content. "Throw everything into context" rarely beats "retrieve the right 5 chunks."
A model that converts text (or images, audio) into a list of ~1,500 numbers. Texts with similar meanings produce similar number lists. This numerical fingerprint is called an embedding or vector.
The power: once text is a vector, you can do math on meaning. Search "find documents about refunds" by converting the query to a vector and finding the closest document vectors. That's semantic search. It's the foundation of RAG.
A pattern, not a product. When the user asks a question:
- Convert the question to an embedding.
- Search a vector database for the most relevant documents (semantic search).
- Stuff the top-K documents into the LLM's context alongside the original question.
- Generate the answer with that grounding.
This lets a generic LLM answer questions about your data without retraining. It's how every "chat with your docs" product works. It's also the most common way AI becomes actually useful in an enterprise.
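A minimal sketch of the four steps in Python — the `openai` SDK for embeddings and generation (assumes an API key), and a toy in-memory list standing in for a real vector DB like Qdrant or pgvector:

```python
# Minimal RAG sketch — toy in-memory store; swap in a real vector DB for production.
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = ["Refunds are processed within 5 business days.",
        "Our return window is 30 days from delivery."]
index = [(d, embed(d)) for d in docs]            # embed the corpus once, up front

def answer(question: str, k: int = 2) -> str:
    q_vec = embed(question)                      # 1. convert the question to an embedding
    top_k = sorted(index, key=lambda p: cosine(q_vec, p[1]), reverse=True)[:k]  # 2. semantic search
    context = "\n".join(d for d, _ in top_k)     # 3. stuff top-K docs into context
    resp = client.chat.completions.create(       # 4. generate with that grounding
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": f"Answer only from this context:\n{context}"},
                  {"role": "user", "content": question}])
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```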
| Layer | Example | What you pay for |
|---|---|---|
| Model | Llama 3.3 70B | Nothing — open-weight, free to download |
| Provider | Groq · Together · Fireworks | Per-million tokens at the provider's rate |
| Application | A chatbot built on Groq | Subscription to the application itself |
Same model, three different things to buy. Once you internalise this split, every vendor pitch becomes readable.
| Technique | What it does | When to use |
|---|---|---|
| Prompting | Just write better instructions | 90% of use cases. Start here, always. |
| RAG | Inject relevant documents into context | Model needs knowledge it doesn't have — your company docs, a product manual, live data. |
| Fine-tuning | Adjust the model's weights with training examples | You need a consistent tone, a narrow format, or you're squeezing cost by making a small model imitate a big one. Expensive and slow to iterate. |
Beginners reach for fine-tuning because it sounds sophisticated. Professionals reach for better prompts and RAG because they work faster, cheaper, and cover 95% of real problems.
| Layer | Name | What lives here | Example tools |
|---|---|---|---|
| L10 | Governance | Observability, evaluation, guardrails, compliance | Langfuse, Promptfoo, Llama Guard, Presidio |
| L9 | Applications | End-user products built on everything below | ChatGPT, Cursor, Perplexity, Copilot |
| L8 | Protocols | Standards for components to talk to each other | MCP, function calling, OpenAPI, A2A |
| L7 | Orchestration | Compose models + tools + data into workflows/agents | LangChain, LangGraph, CrewAI, n8n, Temporal |
| L6 | DS & ML platforms | Where data scientists prep data, train models, deploy | Dataiku, Databricks, SageMaker, Vertex AI |
| L5 | Data | Where knowledge lives — warehouses, databases, vectors | Snowflake, Teradata, Postgres, Qdrant, Redis |
| L4 | Inference providers | APIs (or local runtimes) that run models for you | Anthropic, OpenAI, Groq, Bedrock, OpenRouter · Ollama (local) |
| L3 | Models | The neural networks themselves | Claude, GPT, Llama, Gemini, Qwen |
| L2 | Hosting | Where your orchestration code lives | AWS, Vercel, a VPS, RunPod |
| L1 | Silicon | The physical chips | NVIDIA H100, Groq LPU, Google TPU |
Because each layer has a distinct buying decision with different vendors and different competitive dynamics.
You could collapse "inference providers" into "models" — but then you can't explain why the same Llama 3.3 runs on Groq (fast) and Together (cheap) and Bedrock (compliant). You'd hide the decision that actually matters.
You could merge "protocols" and "orchestration" — but then you miss that MCP is a standards layer, chosen separately from whichever framework consumes it.
Ten layers is the minimum number that keeps the decisions visible.
The minimal agent
L3 (model) + L4 (provider) + L7 (orchestration) = a working agent. Three layers, ~200 lines of code. A weekend build (see the sketch after these pattern cards).
The enterprise pattern
All ten layers. L5 (Teradata + Qdrant), L6 (Dataiku), L10 (Langfuse + Llama Guard), the rest. Months of integration.
The vendor "platform"
A product claiming to cover 6+ layers for you. Convenient at first, lock-in at scale. Readable once you know the layers.
The "AI as a feature"
L3, L4, L9 — existing product adds a "summarise" button. Notion AI, Zendesk AI. Usually OpenAI or Anthropic under the hood.
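Here is roughly what the minimal-agent card above boils down to — a sketch using the Anthropic SDK with a single stubbed tool (`get_order_status` is hypothetical; swap in your own):

```python
# Minimal agent: L3 (Claude) + L4 (Anthropic API) + L7 (this loop).
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

TOOLS = [{"name": "get_order_status",
          "description": "Look up the status of an order by its ID.",
          "input_schema": {"type": "object",
                           "properties": {"order_id": {"type": "string"}},
                           "required": ["order_id"]}}]

def get_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped"  # stub — replace with a real lookup

messages = [{"role": "user", "content": "Where is order A-1042?"}]
while True:
    resp = client.messages.create(model="claude-sonnet-4-20250514",  # any tool-capable Claude
                                  max_tokens=1024, tools=TOOLS, messages=messages)
    if resp.stop_reason != "tool_use":
        print("".join(b.text for b in resp.content if b.type == "text"))  # final answer
        break
    messages.append({"role": "assistant", "content": resp.content})
    # execute every requested tool call, return all results in one user turn
    results = [{"type": "tool_result", "tool_use_id": b.id,
                "content": get_order_status(**b.input)}
               for b in resp.content if b.type == "tool_use"]
    messages.append({"role": "user", "content": results})
```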
Every component you'll touch lives in exactly one of three zones. The zone determines what you can put through it without a compliance review.
- Local / on-prem — your VPS, your data centre, your firewall. Data never leaves. Default for anything regulated (PII, financial, health, gov-ID).
- Private cloud you control — single-tenant deployments inside your GCP project, behind your VPC. Data leaves the building but you control the keys, the logs, and the contract.
- Public API — Anthropic, OpenAI, Groq cloud, public SaaS endpoints. Fast, cheap, and powerful — but every prompt and response crosses the public internet to a third party. Treat with care.
Hotspot 1 — Prompt to public LLM API. Every prompt sent to Groq Cloud, Anthropic, OpenAI contains whatever you put in it. If you put customer PII, internal financials, or trade secrets into the prompt, you have just shipped them to a third party. Mitigation: pre-prompt redaction (Presidio), data-class allow-lists per route, contract review for the provider's data-retention terms, or fall back to a local model (Llama via Ollama).
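A sketch of that pre-prompt redaction step with Presidio (the `presidio-analyzer` and `presidio-anonymizer` packages); the example string is illustrative:

```python
# Pre-prompt redaction sketch: strip PII before the prompt leaves the network.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()       # detects PII entities (names, emails, phones, ...)
anonymizer = AnonymizerEngine()   # replaces detected spans with placeholders

def redact(prompt: str) -> str:
    findings = analyzer.analyze(text=prompt, language="en")
    return anonymizer.anonymize(text=prompt, analyzer_results=findings).text

raw = "Customer Jane Doe (jane@example.com) wants a refund on order 7731."
print(redact(raw))  # e.g. "Customer <PERSON> (<EMAIL_ADDRESS>) wants a refund ..."
# Only the redacted string is ever sent to the public API.
```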
Hotspot 2 — Tool calls to external SaaS. When the agent decides to "look up the customer in Salesforce" or "post to Slack," that call leaves the network too. Mitigation: every tool call goes through an MCP server that logs the request, redacts sensitive fields, and enforces an allow-list of which tenants/customers can be looked up.
- Point at the green column first. "All of this stays inside your firewall — Teradata, Dataiku, the agent code, your vector store."
- Then the amber column. "These run inside your GCP project, single-tenant, region-pinned. Your data leaves your building but stays under your contract and never reaches a multi-tenant pool."
- Then the red column. "These are the only places where your data crosses the public internet to a third party. We use them deliberately, with redaction in front, only for prompts that don't contain regulated content."
- The animated dots. "Each dot is a request travelling between components. Notice that most activity is inside the green column. The model API only sees the cleaned, redacted prompt — never the raw customer record."
Groq sits in the public-API zone (red), so you only reach for it when its specific advantage — sub-300ms first-token latency on open models — is the thing that makes or breaks the product. Here is the rule, every time:
- Real-time voice — under-300ms first-token is the difference between "natural" and "awkward" (see Voice Agent).
- Live agent assist / call-centre co-pilot — a suggestion every 4 seconds needs a sub-700ms round-trip (see Call Center · VIP).
- Inbound chat at high volume where reply speed is part of the customer experience (see Customer Care AI).
- High-throughput batch on open weights — classification / extraction / triage at hundreds of tokens/sec for cents per million.
- Cost-sensitive workloads on Llama / Qwen / DeepSeek — you want open-weight pricing and world-class speed.
- You need Claude or Gemini quality — Groq only hosts open-weight models. Frontier reasoning lives on Vertex AI / Anthropic.
- The prompt contains regulated PII you can't redact — it's a public API; data leaves your network. Use Vertex AI in your GCP project instead.
- Latency doesn't matter — for offline / batch / "report me by tomorrow" workloads, Vertex AI on Llama is cheaper and stays in your cloud.
- You need long context (1M+ tokens) — that's Gemini 2.5 Pro on Vertex AI, not Groq.
- You're inside a strict on-prem mandate — fall back to local Llama on Ollama / vLLM. Slower but never leaves the building.
| Chip | Maker | Position in 2026 |
|---|---|---|
| H100, H200, B100, B200 | NVIDIA | The default. ~90% of production inference. CUDA ecosystem is the moat. |
| A100 | NVIDIA | Previous generation. Still everywhere. Cheaper to rent. |
| TPU v5e · v5p · Trillium | Google | Google-only. Powers Gemini. Rentable via GCP. |
| MI300X · MI325X | AMD | Credible NVIDIA challenger. Cheaper per FLOP. Software (ROCm) still maturing. |
| LPU | Groq | Language-specific chip. Not a GPU. Deterministic, extremely low latency, 5–10× faster tokens/sec on open-weight models. Groq (the company) sells API access; you don't buy LPUs. |
| WSE-3 | Cerebras | Wafer-scale. One chip is physically the size of a cluster of GPUs. Fastest inference on large models. Niche, expensive. |
| Trainium · Inferentia | AWS | AWS-exclusive silicon. Cheap. Used inside Bedrock. |
| Neural Engine (M-series, A-series) | Apple | On-device only. Behind every "Apple Intelligence" feature. |
| Snapdragon NPU | Qualcomm | Android on-device inference. |
A GPU is general-purpose — it does graphics, crypto, scientific computing, and AI. That flexibility costs you speed. Groq built a chip that only does one thing (the math that runs LLMs) and shaved off every millisecond.
Practical consequence: Llama 3.3 70B on an H100 produces maybe 60 tokens/second. Same model on Groq: 500+ tokens/second. That's the difference between an agent that feels snappy and an agent that feels sluggish.
Trade-off: Groq only serves a curated menu of open-weight models. You cannot run your custom fine-tune. You cannot run Claude or GPT (those are closed — they run on their makers' infrastructure). You're choosing speed within a constrained model set.
Matters
- Voice agents (<300ms perceived round-trip)
- Live code completion
- High-volume batch processing (cost per million tokens)
- Air-gapped / on-prem deployments (you pick the hardware)
Doesn't matter
- Prototypes and MVPs
- Internal tools with <1,000 users
- Anything where a 2-second response is fine
- Anything running on Claude or GPT (you can't choose anyway)
AWS, GCP, Azure, Oracle Cloud, Alibaba Cloud. Everything is available; complexity is high. Pick when you need compliance stories (HIPAA, PDPL, SOC 2), when you already run 80% of your infrastructure there, or when the client dictates it.
AI-relevant services: AWS Bedrock (multi-model inference gateway), Azure OpenAI (Microsoft's GPT resell), GCP Vertex AI (Google's ML platform), AWS SageMaker, Azure ML.
RunPod · Lambda Labs · CoreWeave · Modal · Replicate · Beam · Paperspace · Fluidstack.
Use case: you need GPUs now (fine-tuning, self-hosting a specific model, experimenting) without a hyperscaler commitment. Sign up, rent an H100 by the hour, shut it down. This is where most open-source AI development happens.
Your own VPS (Hostinger, Linode, DigitalOcean, Hetzner), bare-metal servers, on-prem, Cloudflare Workers AI (edge), Vercel.
Use case: small-scale apps, data-sovereignty requirements, cost control, internal tools. A $20/month VPS can host a surprising amount of AI application code; you call out to inference providers for the heavy compute, or run a small open-weight model locally with Ollama on the same box.
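For reference, calling that local Ollama instance from your app code is one HTTP request — a sketch assuming `ollama pull llama3.3` has already been run on the box:

```python
# Calling a local open-weight model through Ollama's REST API (default port 11434).
import requests

resp = requests.post("http://localhost:11434/api/generate",
                     json={"model": "llama3.3",
                           "prompt": "Summarise: the meeting moved to Tuesday.",
                           "stream": False})   # one JSON response instead of a token stream
print(resp.json()["response"])
```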
The application layer is small and cheap. The inference cost is the variable. That's why hosting your app on a $20 VPS is fine for a long time — the money goes to layer 4, not layer 2.
| Family | Maker | Strength |
|---|---|---|
| Claude (Opus, Sonnet, Haiku) | Anthropic | Coding, long-context reasoning (200k+), careful tool use. Preferred by Apex and by most serious agent builders. |
| GPT-5 · GPT-4o · o-series (o1/o3/o4) | OpenAI | General-purpose, multimodal (vision + voice), math and science via o-series reasoning models. GPT-5 is the current flagship. |
| Gemini 2.5 · 3 | Google | Up to 2M-token context (biggest), native multimodal, very cheap at scale. |
| Grok 3 · 4 | xAI | Trained on X data, fewer guardrails, fast-moving. |
| Family | Maker | Strength |
|---|---|---|
| Llama 3.1 · 3.3 · 4 | Meta | The open-weight workhorse. Runs everywhere, fine-tunable, strong community. |
| Mistral · Mixtral · Codestral | Mistral AI (France) | EU privacy story, MoE (mixture-of-experts) efficiency, small-model quality. |
| Qwen 2.5 · 3 | Alibaba | Best open-weight coder in 2026, excellent multilingual (great for Arabic), many sizes. |
| DeepSeek V3 · R1 | DeepSeek | Cheap frontier reasoning. R1 was trained for roughly 1/20th of GPT-4's public cost estimates and matches o1-level reasoning on many benchmarks. |
| Gemma 2 · 3 | Google | Small-model sibling of Gemini. On-device friendly. |
| Phi-3 · Phi-4 | Microsoft | Small model, punches above its weight, good on-device. |
| Family | Maker | Strength |
|---|---|---|
| Whisper · Whisper Large v3 | OpenAI | Speech-to-text. Best-in-class transcription. Free to self-host. |
| Flux · Flux Pro | Black Forest Labs | Image generation, open-weight, high quality. Replaces Stable Diffusion for many. |
| Stable Diffusion 3.5 | Stability AI | Open image generation. |
| Sora · Runway Gen-3 · Kling | OpenAI · Runway · Kuaishou | Video generation. Early but usable. |
| text-embedding-3 · voyage-3 | OpenAI · Voyage AI | Embeddings — turn text into vectors for retrieval. (You'll use these daily in RAG.) |
| Cohere Embed · BGE-M3 | Cohere · BAAI | Alternative embedding models. BGE-M3 is open-weight and strong on multilingual. |
| Your situation | First pick |
|---|---|
| "I need the best coding model" | Claude Sonnet / Opus · Qwen 2.5 Coder for open |
| "I need the cheapest frontier-quality reasoning" | DeepSeek V3/R1 or Gemini 2.5 Flash |
| "I need 1M+ tokens of context" | Gemini 2.5 Pro |
| "I need to run it on my own hardware" | Llama 3.3 70B (general) or Qwen 2.5 Coder (coding) — fastest path is Ollama |
| "I need it to be fast enough for voice" | Llama 3.3 70B on Groq |
| "I need Arabic / multilingual strength" | Qwen 2.5 · Gemini · Claude |
| "I need strong vision (describe image, read PDFs)" | Claude Sonnet · GPT-4o · Gemini 2.5 |
| "I need cheap summarisation at scale" | Claude Haiku · Gemini Flash · GPT-4o-mini |
The model maker serves their own model — the only place you can get it (plus some hyperscaler resells).
- Anthropic API — Claude. Best place for Claude. Also available on AWS Bedrock and GCP Vertex for compliance reasons.
- OpenAI API — GPT and o-series. Also sold as Azure OpenAI for enterprise.
- Google Gemini API — via Google AI Studio (dev) or Vertex AI (enterprise).
- xAI API — Grok.
One endpoint, many models. Useful when you want to A/B test models or avoid locking into one provider.
Providers competing on speed for open-weight models.
The software you run yourself when data can't leave your network.
`ollama run llama3.3` and you have an API.

Proxies that sit between your app and the real inference provider — adding caching, logging, rate-limiting, and A/B testing.
| Category | Players | Use case |
|---|---|---|
| Data warehouses (OLAP) | Snowflake · Databricks · BigQuery · Teradata · Redshift · ClickHouse | Structured analytical queries over years of history. "What was our LATAM revenue by quarter for 2020-2025?" |
| Data lakes | S3 + Iceberg · Delta Lake · MinIO · Azure Data Lake | Cheap raw-file storage, often the substrate under a warehouse. |
| Operational DBs (OLTP) | PostgreSQL · MySQL · MongoDB · DynamoDB · SQL Server | Your app's live data — users, orders, tickets. Reads and writes continuously. |
| Vector DBs | Qdrant · Pinecone · Weaviate · Milvus · Chroma · pgvector · LanceDB | Store embeddings for semantic search. Foundation of RAG and agent memory. |
| Graph DBs | Neo4j · ArangoDB · Memgraph · TigerGraph | When relationships are the point — fraud rings, supply chains, org charts. |
| Cache / in-memory | Redis · KeyDB · Memcached · DragonflyDB | Sub-millisecond lookups, session state, pub/sub messaging. |
| Search engines | Elasticsearch · OpenSearch · Meilisearch · Typesense | Keyword + filter search. Often combined with vector search for hybrid retrieval. |
Teradata: the 40-year incumbent in big banks, telcos, airlines, healthcare payers. If a client has 20 years of structured history, it's probably in Teradata. Strengths: mature query optimiser, governance, predictable performance. Weaknesses: expensive, older tooling story. You don't migrate Teradata — you work with it.
Snowflake: cloud-native warehouse, separated compute + storage. Dominant with modern enterprises. Easier to use than Teradata, strong ecosystem.
Databricks: lakehouse model — warehouse + lake + ML platform in one. Preferred by data-engineering-heavy shops. Has its own MLflow, its own LLMs (DBRX), its own serving.
BigQuery: the GCP-native warehouse. Extremely cheap serverless scan. Default for any GCP-committed organisation.
ClickHouse: open-source columnar DB, blazingly fast for analytical queries on event data. Product analytics shops love it.
Pure vector search misses exact matches (product IDs, names, specific phrases). Pure keyword search misses semantic meaning ("refund policy" vs "return guidelines"). Hybrid retrieval runs both and fuses the results.
Typical stack: Elasticsearch (or OpenSearch) for keyword + BM25 ranking + Qdrant (or pgvector) for semantic. A re-ranker model (Cohere Rerank, BGE reranker) picks the final top-K. Quality jumps significantly over either alone.
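The fusion step is commonly reciprocal rank fusion (RRF) — a self-contained sketch; the two input rankings are whatever your keyword engine and vector DB return:

```python
# Reciprocal Rank Fusion (RRF): merge a keyword ranking and a vector ranking.
def rrf(keyword_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            # 1/(k + rank): documents ranked highly by either system float to the top
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_refund_policy", "doc_pricing", "doc_returns"]   # keyword ranking
vect = ["doc_returns", "doc_refund_policy", "doc_shipping"]  # semantic ranking
print(rrf(bm25, vect)[:3])  # fused top-K, ready for a re-ranker model
```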
| Platform | Position | Who uses it |
|---|---|---|
| Dataiku | Visual + code DS platform. ETL, feature engineering, model training, deployment — in one canvas. Strong RBAC, lineage, governance. | Enterprises where analysts and data scientists share workflows. Often sits on top of Snowflake or Teradata. |
| Databricks | Lakehouse + ML + Spark + Delta. Code-heavy. Has its own LLM features (DBRX model, Mosaic AI). | Data engineers. ML teams at scale. Shops that live in notebooks. |
| Palantir Foundry | Data integration + workflow + ontology. Operational AI, not exploratory. Very opinionated. | Large enterprises with messy data across 50 source systems. Defence, healthcare, oil & gas. |
| AWS SageMaker | Hyperscaler DS platform. Tight AWS integration. Everything from Jupyter to model serving. | AWS-committed shops. ML engineers. |
| GCP Vertex AI | Google's answer. Strong AutoML, native Gemini integration. | GCP-committed shops. |
| Azure ML | Microsoft's answer. Tight integration with Azure services and Office. | Microsoft-shop enterprises. |
| H2O.ai · DataRobot | AutoML-first. "Point at a table, get a model." Less useful for LLMs, still strong for traditional ML. | Teams without deep ML expertise. Financial services modelling. |
| MLflow · W&B · ClearML · Comet | Experiment tracking + model registry. Not a full platform — a component. | ML teams using their own compute but wanting governance. |
Dataiku is often called "Tableau for machine learning." It's a visual canvas where you drag boxes: read from Teradata → filter → join with a CSV → train a model → deploy as an API. Each box can be visual (for analysts) or Python/R (for data scientists). They share the same project.
What's valuable:
- Lineage: every column in every output can be traced back to its source.
- RBAC: who ran what, who approved deployment, who has access to which data.
- Mixed skill-levels: business analysts and senior DS work on the same flow.
- Model Ops: deployed models get monitored for drift, performance, retraining triggers.
Where it fits in the agent era: Dataiku's sweet spot is traditional ML (classification, regression, forecasting). For LLM-heavy agents, it's peripheral — you might publish a "scored customers" table from Dataiku that an agent then queries, but the agent itself is built elsewhere. Dataiku is adding LLM features, but the core strength remains traditional analytics.
What they mean: they have a DS team, they've invested in governance and lineage, they likely have 50+ projects running in production. They are enterprise, not a startup.
Implications for your pitch:
- Don't propose to replace Dataiku — you'll lose.
- Do propose to complement it with agentic workflows that consume Dataiku outputs.
- Leverage their existing lineage + RBAC — the compliance story is already built.
- MCP servers pointing to Dataiku datasets are the clean integration point.
| Framework | Philosophy | When to use |
|---|---|---|
| LangChain | The original. Huge surface area, many integrations. Often criticised as "too magic." Good for getting started, painful at scale. | Prototypes. Pattern demonstrations. |
| LangGraph | LangChain's state-machine framework. Explicit graphs of agent decisions. Much more debuggable than raw LangChain. | Multi-step reasoning with branches. Complex agent logic. |
| LlamaIndex | RAG-first. Rich tooling for document loaders, chunking, retrieval pipelines. | Data-heavy agents, "chat with your docs." |
| AutoGen (Microsoft) | Multi-agent conversations. Agents talk to each other to solve tasks. | Research, experimentation. Production less common. |
| CrewAI | Role-based multi-agent ("researcher", "writer", "editor"). Higher-level than AutoGen. | Content pipelines, structured multi-agent work. |
| Pydantic AI | Typed, minimal, Python-idiomatic. Strong structured-output support. | Production systems where schema matters. Rising fast in 2026. |
| Claude Agent SDK | Anthropic-native (formerly the "Claude Code SDK"). Closest to the metal. No framework overhead. | Claude-specific production agents where you want control. |
| OpenAI Swarm · OpenAI Agents SDK | OpenAI's own lightweight framework. | OpenAI-centric agents. |
| Semantic Kernel (Microsoft) | Enterprise-friendly, .NET + Python + Java. Plugin architecture. | .NET shops, enterprise Microsoft integrations. |
These aren't frameworks — they're end-user products that use all 10 layers internally. You use them; you rarely build with them.
Drag-and-drop boxes: trigger → action → action. LLM is one box among hundreds.
For pipelines measured in hours/days with retries, schedules, complex dependencies.
| Situation | Pick |
|---|---|
| Trigger from Gmail, enrich from HubSpot, post to Slack, one LLM summary in the middle | n8n |
| Agent that calls 5 tools, decides which based on user input, loops if result is unclear | LangGraph / Pydantic AI |
| Client wants "visual AI pipelines they can edit" | n8n |
| You need custom data models, state transitions, complex reasoning | LangGraph or custom code |
| Team is non-technical | n8n / Make |
| Team is senior engineering | Custom code with Claude Agent SDK / LangGraph |
Who: Anthropic, now adopted by many. What: open standard for giving LLMs structured access to tools, data, and services.
An MCP server exposes tools ("query_database", "read_file", "send_email"). Any MCP-aware client (Claude Desktop, Cursor, Claude Code, your custom agent) can discover and use them. Think: "USB for agent tools" — plug and play across vendors.
Why it matters: before MCP, wiring a tool to an agent meant writing glue code for every agent framework. After MCP, you write one server, every client works. This is becoming the industry default. Expect every major platform to ship MCP support in 2026.
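A one-file MCP server sketch using the official `mcp` Python SDK's FastMCP helper; `kb_search` is a hypothetical tool standing in for your real retrieval code:

```python
# Minimal MCP server (official `mcp` Python SDK, FastMCP helper).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("support-tools")

@mcp.tool()
def kb_search(query: str) -> str:
    """Search the support knowledge base and return the best passage."""
    return "Refunds are processed within 5 business days."  # stub — wire to real retrieval

if __name__ == "__main__":
    mcp.run()  # any MCP-aware client (Claude Desktop, Cursor, ...) can now discover kb_search
```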
Who: OpenAI introduced it in 2023; every frontier model now supports it. What: the model returns a structured JSON object saying "call function X with these arguments" instead of free-text. Your code executes it, returns the result, the model continues.
This is the raw mechanism. MCP is the standard way to package and share functions for reuse.
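The raw round-trip looks like this (OpenAI-style sketch; `get_weather` is a made-up example tool):

```python
# The raw function-calling round-trip. The model doesn't execute anything —
# it returns JSON naming the function and arguments; your code runs it.
import json
from openai import OpenAI

client = OpenAI()
tools = [{"type": "function",
          "function": {"name": "get_weather",       # hypothetical example tool
                       "description": "Current weather for a city.",
                       "parameters": {"type": "object",
                                      "properties": {"city": {"type": "string"}},
                                      "required": ["city"]}}}]

resp = client.chat.completions.create(model="gpt-4o-mini",
                                      messages=[{"role": "user", "content": "Weather in Riyadh?"}],
                                      tools=tools)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))  # -> get_weather {'city': 'Riyadh'}
```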
Who: Google's proposal (2025), others experimenting. What: lets one agent discover and call another. Still early — MCP covers most use cases, A2A is for agent-fleet scenarios.
The fallback. When no native AI protocol exists, a well-documented REST API with an OpenAPI spec is still the common ground. Most MCP servers are wrappers around existing REST APIs.
ChatGPT · Claude.ai · Gemini · Copilot · Perplexity · Pi · You.com
Cursor · Claude Code · Windsurf · Copilot · Replit Agent · Cody · Codeium
Jasper · Copy.ai · Notion AI · Mem · Lex · Writer
Intercom Fin · Zendesk AI · Decagon · Sierra · Ada
Gong · Chorus · Clay · Apollo AI · HubSpot Breeze
Otter · Fireflies · Granola · Tactiq · Glean
Perplexity · Phind · Exa · Kagi Assistant · You.com
Midjourney · Runway · Pika · Kling · Sora · Ideogram · Flux Pro
ElevenLabs · Cartesia · Deepgram · Vapi · Bland · PlayHT
Hex · Julius · Rowy · Metabase AI
Harvey · Hebbia · Spellbook · Robin AI
Abridge · Nuance DAX · Suki · OpenEvidence
Answers: "What did the agent do yesterday, how much did it cost, and where did it fail?"
Answers: "Did my new prompt make things better or worse?"
Blocks jailbreaks, prompt injection, unsafe outputs before they reach the user.
Redacts sensitive data before it reaches any LLM. Critical for GDPR, PDPL, HIPAA.
me-central2 (Dammam, KSA — actually inside the Kingdom) or on-prem Llama. Presidio in front of your LLM calls turns a non-compliant design into a compliant one.

| Layer | Pick | Why |
|---|---|---|
| L1 Silicon | NVIDIA (invisible) | Whoever hosts Claude picks — not our choice. |
| L2 Hosting | Vercel (Next.js) + your GCP project (Qdrant on GKE / Cloud Run) | Vercel for the edge UI, GCP for the stateful vector DB. |
| L3 Model | Claude Haiku · text-embedding-3-small | Haiku is cheap + fast; the small embedding model keeps indexing cost per million docs low. |
| L4 Inference | Anthropic API · OpenAI API | Direct, simplest. |
| L5 Data | Qdrant (vectors) · Postgres (tickets) | Qdrant for scale; Postgres for the support ticket system. |
| L6 DS platform | None | No model training. No platform needed. |
| L7 Orchestration | LangChain retrieval chain OR 80 lines of Node | Either works. If it's a one-off, skip LangChain. |
| L8 Protocol | Direct API calls | Nothing to reuse — MCP overkill here. |
| L9 Application | Chat widget embedded in Zendesk | Meet users where they are. |
| L10 Governance | Presidio (PII) · Llama Guard (safety) · Langfuse (trace) | PDPL compliance + observability from day one. |
Demo
- Just Claude + a bunch of docs stuffed into context
- No PII handling
- No eval harness
- No observability — if it breaks, you have no idea why
- Works for 50 docs, falls over at 5,000
Production
- Vector DB with hybrid retrieval + reranking
- Presidio strips PII before any external API call
- Promptfoo runs in CI, catches prompt regressions
- Langfuse traces every turn; Helicone caches common questions
- Scales to 500k docs without a rearchitecture
Look at the steps: 6 out of 7 are just "call an API on an existing SaaS." n8n has all of them as drag-in nodes. The one AI step (draft email via GPT-4o-mini) is also a drag-in node.
If you wrote this in Python, you'd be writing:
- Salesforce webhook listener (20 lines)
- Apollo API client (30 lines)
- Tier-list lookup (10 lines)
- OpenAI client + prompt (20 lines)
- Slack approval Bolt app (60 lines)
- Gmail SMTP client (15 lines)
- Salesforce update call (15 lines)
- Error handling, retries, logging (100+ lines)
That's a week of work. In n8n: an afternoon. And non-technical team members can edit it without fear.
| Layer | Pick |
|---|---|
| L2 Hosting | Self-hosted n8n on a $10/month VPS, OR n8n Cloud |
| L3 Model | GPT-4o-mini (cheap, good enough for outreach emails) |
| L4 Inference | OpenAI (via n8n node) |
| L5 Data | Salesforce is the source of truth; n8n holds no state |
| L7 Orchestration | n8n — the whole story lives here |
| L9 Application | Salesforce + Slack + Gmail (existing tools) |
| L10 Governance | Human-in-loop approval is the guardrail |
If the logic gets more adaptive — "if lead responded to a previous email, personalise based on that thread" or "if the company website uses React, mention React-specific case studies" — you're past n8n's sweet spot. Rebuild in LangGraph or custom code.
The line: n8n is a decision tree. An agent loops and decides. When you start building decision trees inside n8n that are 50 boxes deep, switch tools.
me-central2 for KSA), and never reaches Google's multi-tenant model pool.

Teradata holds the history
20 years of finance data. ~$50M/year licence. You do not migrate this. You talk to it.
Dataiku already has a semantic layer
Column names, metric definitions, hierarchies, RBAC. If the agent queried raw Teradata it would hallucinate column names and bypass all the governance the enterprise spent millions building. Querying via Dataiku inherits all of it.
The agent queries Dataiku via MCP
An MCP server exposes Dataiku datasets as agent tools. The agent doesn't know or care that the underlying store is Teradata — just "query revenue_by_region_quarter for LATAM 2020-2025."
Claude Opus for the reasoning step
Planning which queries to run, narrating findings in business English, handling "drill deeper" follow-ups — this is where you want the frontier model. Cheaper models would produce shallower analysis.
| Layer | Pick |
|---|---|
| L2 Hosting | Client's existing infrastructure (on-prem or your GCP project) |
| L3 Model | Claude Opus or Gemini 2.5 Pro (reasoning) + Gemini Flash (chart captions) |
| L4 Inference | GCP Vertex AI (Claude or Gemini, single-tenant, region-pinned for compliance) |
| L5 Data | Teradata (history) via Dataiku |
| L6 DS Platform | Dataiku — semantic layer + RBAC |
| L7 Orchestration | LangGraph (structured multi-step reasoning) |
| L8 Protocol | MCP server for Dataiku — the clean integration point |
| L9 Application | Slack bot (CFO already lives there) |
| L10 Governance | Langfuse (trace) · Presidio (PII in logs) · Dataiku's own RBAC |
Mistake 1: "Let's migrate Teradata to Snowflake for the AI project." No. Never propose a multi-million-dollar data migration as part of an AI project. Build on top.
Mistake 2: "The agent will write raw SQL against Teradata." It will hallucinate table names, miss business definitions, and bypass RBAC. Always go through the semantic layer.
Mistake 3: "We'll use a smaller model to save cost." Financial analysis needs careful reasoning. Going cheap here destroys the client's trust on the first wrong answer. Use Opus or GPT-5-class.
Mistake 4: "Skip MCP, call Dataiku directly." Then every future AI product the client adopts will have to re-integrate. MCP server once, reused forever.
Voice feels natural under 300ms round-trip, stilted at 500ms, broken above 800ms. Here's the latency budget:
| Step | Typical latency | Notes |
|---|---|---|
| Network (caller → server) | ~40 ms | Geography-dependent |
| STT (streaming) | ~100 ms | Deepgram partial transcripts |
| LLM first-token (Llama on GPU) | 400–800 ms | The bottleneck |
| LLM first-token (Llama on Groq) | ~100 ms | The fix |
| TTS first-audio | ~150 ms | Cartesia is the fastest |
| Network (server → caller) | ~40 ms | |
On a regular GPU, total = ~870ms. Caller experience: awkward pauses. On Groq: ~430ms. Caller experience: natural conversation. That's the entire product difference.
Two reasons:
- Latency: closed frontier models have first-token latencies of 400–1000ms. Fine for chat, too slow for voice.
- Groq doesn't host them: Claude runs only on Anthropic/Bedrock/Vertex, GPT only on OpenAI/Azure. You can't put them on an LPU.
OpenAI's Realtime API (GPT-4o) is a credible alternative — it's designed for voice specifically. But you're locked into OpenAI and the pricing gets expensive fast. Groq + Llama is the open-weight path.
| Layer | Pick |
|---|---|
| L1 Silicon | Groq LPU — the reason this works |
| L3 Model | Llama 3.3 70B (LLM) · Whisper or Deepgram Nova-2 (STT) · Cartesia (TTS) |
| L4 Inference | Groq · Deepgram · Cartesia |
| L7 Orchestration | LiveKit Agents framework (voice-native) or Vapi (managed) |
| L9 Application | Phone via Twilio + dashboard for reviewing calls |
| L10 Governance | Call recording, transcription archive, Langfuse |
Nine agents with distinct roles: Project Manager, Productizer, VPS Admin, Backend Dev, Frontend Dev, Data Scientist, Security, HR, Marketing. Each has its own system prompt, toolset, and responsibilities. An orchestrator queues tasks, spawns short-lived worker containers, collects results, respects a capacity cap, escalates decisions to a human via Telegram.
| Layer | Pick | Rationale |
|---|---|---|
| L1 Silicon | Invisible | Not self-hosting inference. |
| L2 Hosting | Single $20/month VPS | Single operator. Small scale. Cheap. |
| L3 Model | Claude Opus + Sonnet | Best at coding + long-context reasoning. Consistent. |
| L4 Inference | Anthropic (via Claude Max subscription) | Session-based billing dodges per-token surprise. |
| L5 Data | PostgreSQL + pgvector + Redis | Single-user scale — pgvector is enough. No second DB. |
| L6 DS Platform | None | This is a software-shipping team, not a data-science team. |
| L7 Orchestration | FastAPI + Claude Agent SDK + Claude Code CLI | LangChain/CrewAI would add mass without buying much. |
| L8 Protocol | Direct API calls now; MCP later | MCP is industry-wide direction for tool integration. |
| L9 Application | Next.js dashboard + Telegram bot + daily email | Three channels matching three contexts (desk, mobile, morning). |
| L10 Governance | Built-in audit log, capacity governor, weekly team-health review | Designed in from day one, not bolted on. |
Teradata holds the customer history
Every prior order, ticket, payment, and product registration. Twenty years for some customers. You do not want the agent guessing — it answers from the truth.
Dataiku exposes it as a clean semantic layer
"customer_360" view: name, tier, lifetime value, open tickets, last interaction, churn risk. The agent gets one tidy object via an MCP tool — never sees raw Teradata tables, never hallucinates column names, inherits the existing RBAC.
LangChain orchestrates the loop
Classify intent → fetch context → check policy → draft → validate → respond → log. LangGraph models the state machine explicitly so you can replay any conversation deterministically. Open source, runs in your VPC.
Groq runs the language model — fast and cheap
Llama 3.3 70B at hundreds of tokens per second on Groq's LPU. Customer-care replies need to feel instant; Groq is 3–5× faster than the same model anywhere else. Trade-off: prompts go to Groq Cloud, so PII is redacted before they leave (Hotspot 1 from the Live Ecosystem Map).
A confidence gate before any action
Refunds, account changes, and escalations only execute if (a) the agent's self-rated confidence is ≥ 0.85, (b) the action is on the per-tier allow-list, and (c) it's within the per-customer rate limit. Else: hand to human with the draft attached.
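A sketch of that gate as plain code — the thresholds, allow-lists, and rate-limit numbers are illustrative, not prescriptive:

```python
# Confidence gate sketch — the three checks (a), (b), (c) from above.
ALLOW_LIST = {"gold": {"refund", "reschedule", "escalate"},
              "standard": {"escalate"}}
MAX_ACTIONS_PER_HOUR = 3

def gate(action: str, confidence: float, tier: str, actions_last_hour: int) -> str:
    if confidence < 0.85:                           # (a) self-rated confidence too low
        return "handoff: low confidence — send draft to human"
    if action not in ALLOW_LIST.get(tier, set()):   # (b) action not on the tier allow-list
        return "handoff: action not allowed for this tier"
    if actions_last_hour >= MAX_ACTIONS_PER_HOUR:   # (c) per-customer rate limit hit
        return "handoff: rate limit — possible loop"
    return "execute"

print(gate("refund", 0.91, "gold", 1))      # -> execute
print(gate("refund", 0.91, "standard", 0))  # -> handoff: action not allowed for this tier
```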
| Layer | Pick | Zone |
|---|---|---|
| L2 Hosting | Your GCP project (GKE / Cloud Run) or on-prem K8s | Local |
| L3 Model | Llama 3.3 70B (replies) · Llama 3 8B (intent classify) | OS |
| L4 Inference | Groq Cloud for speed · Ollama on-prem for fallback / regulated tenants | Public / Local |
| L5 Data | Teradata (history) · Postgres+pgvector (KB) | Local |
| L6 DS Platform | Dataiku — customer_360 semantic + churn-risk feature | Local |
| L7 Orchestration | LangChain + LangGraph (state machine + tool calling) | Local |
| L8 Protocol | MCP servers: customer_history, order_actions, kb_search, escalate | Local |
| L9 Application | Webchat widget · WhatsApp Business API · email gateway · Zendesk for human handoff | Mixed |
| L10 Governance | Presidio (PII redaction) · Langfuse (traces) · Llama Guard (output filter) · in-house audit_log | Local |
- Llama 3.3 70B (Meta, OS). You can run it on Groq Cloud today and switch to your own GPU box tomorrow without rewriting a line. No model lock-in.
- LangChain / LangGraph (OS). The orchestration is your code. You can audit it, fork it, or swap it if a vendor framework would be faster. Vendor lock-in for orchestration is the most expensive lock-in to escape later.
- Presidio (Microsoft, OS). PII detection runs on-prem before any prompt leaves. If you used a SaaS redactor, you'd be sending raw PII to that SaaS first — defeats the point.
- pgvector (Postgres extension, OS). Your KB embeddings live in your existing Postgres. No third vector DB to license / monitor / back up.
- Llama Guard (Meta, OS). Runs locally. Catches policy violations (hate speech, PII leak in output) without sending the response to a moderation API.
me-central2 for KSA / Dammam) so the data never leaves your jurisdiction.

Latency budget is brutal
A suggestion that lands 6 seconds after the customer's question is useless — the human already moved on. Groq's LPU + Llama 3 8B for the suggestion loop hits sub-700ms on a 4-second sliding transcript window. This is exactly why Groq is in the stack.
Speech recognition stays on-prem
VIP calls contain account numbers, PINs, deal terms. Use Whisper-large-v3 on a local GPU; never stream audio to a cloud STT. The transcript that goes to the LLM is already partially redacted by a regex filter (account numbers masked).
Dataiku's churn-risk & tier scores drive escalation
If the live sentiment turns negative and the customer is in the top decile of LTV, the AI silently pages the team lead — no waiting for the customer to ask.
Two model tiers
Llama 3 8B on Groq for the every-4-second loop (cheap, fast). Claude Opus or Gemini 2.5 Pro on Vertex AI in your own GCP project for the post-call wrap-up — better at writing summaries the relationship manager actually trusts. The mix matters.
| Layer | Pick | Zone |
|---|---|---|
| L2 Hosting | On-prem GPU box for Whisper · your GCP project (GKE) for the agent loop · Vertex AI region-pinned | Local / Priv |
| L3 Model | Whisper-large-v3 (ASR) · Llama 3 8B (live loop) · Claude Opus or Gemini 2.5 Pro (wrap-up) | OS + Frontier |
| L4 Inference | Local GPU (Whisper) · Groq (live Llama loop) · GCP Vertex AI (Claude / Gemini wrap-up) | Local + Public + Priv |
| L5 Data | Teradata (relationship history) · Postgres (live call state) | Local |
| L6 DS Platform | Dataiku — customer_360, churn_risk, lifetime_value features | Local |
| L7 Orchestration | LangGraph (4-second loop) + custom websocket runner for the agent screen | Local |
| L8 Protocol | MCP servers: customer_360, open_tickets, page_supervisor, book_followup | Local |
| L9 Application | Genesys / your existing telephony · agent-desktop side-panel (Vue) · supervisor pager (Telegram or Slack) | Mixed |
| L10 Governance | Hard rule: AI never speaks · transcripts retained per regulator · Langfuse self-hosted · per-rep accuracy dashboard | Local |
me-central2 for KSA / Dammam) · no PII in summary by template.

A long-form instruction defining the agent's role, scope, tone, constraints, escalation triggers, and output format. Usually 500–5,000 words. Rewritten many times during development.
"You are a senior customer support agent. You answer only from the provided context. If unsure, escalate to a human. Never promise refunds — offer to file a request."
Composed of: system prompt + conversation history + relevant retrieved documents (RAG) + tool descriptions + user's current message. All of it must fit inside the model's context window budget.
Functions the agent can call. Examples: search_database(query), create_ticket(data), send_email(to, subject, body). Each tool has a name, a description, a parameter schema, and a handler that executes it.
Key insight: the agent's capability is defined by its tools. A smart LLM with no tools is just a chatbot. A modest LLM with the right tools can run a business.
Two scopes:
- Short-term: the current conversation. Lives in the context window.
- Long-term: persistent across sessions. Usually vectors in Qdrant/pgvector, retrieved by relevance on each new conversation.
Agents don't just answer once. They loop: pick tool → call it → observe result → decide next action → repeat until done. The loop is where agent logic gets complex.
Guardrails: max iterations, timeout, budget cap. Without these, a buggy agent loops forever burning tokens.
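A skeleton of those guardrails; `call_model` and `run_tool` are hypothetical stand-ins for your inference call and tool dispatcher:

```python
# Agent-loop guardrails sketch. `call_model(history)` is assumed to return an
# object with .done, .answer, .tool, .args, .cost_usd — adapt to your SDK.
import time

def run_agent(task: str, call_model, run_tool,
              max_iters: int = 10, timeout_s: float = 60.0, budget_usd: float = 0.50):
    spent, start, history = 0.0, time.monotonic(), [task]
    for _ in range(max_iters):                          # guardrail 1: iteration cap
        if time.monotonic() - start > timeout_s:        # guardrail 2: wall-clock timeout
            return "escalate: timeout"
        if spent > budget_usd:                          # guardrail 3: budget cap
            return "escalate: budget exceeded"
        step = call_model(history)                      # pick tool — or finish
        spent += step.cost_usd
        if step.done:
            return step.answer
        history.append(run_tool(step.tool, step.args))  # observe result, loop again
    return "escalate: max iterations reached"
```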
Tests that run against the agent. Can be: exact-match against gold answers, LLM-as-judge scoring (another LLM rates the output), human review, business metrics (tickets resolved per hour). In production, usually all of these.
Every inference call logged, every tool call recorded, every decision traced. Without this, you cannot debug. With it (Langfuse, LangSmith, Helicone), you can replay any conversation and see exactly what the model saw and why it chose what it chose.
| Pattern | Shape | Best for |
|---|---|---|
| ReAct loop | Think → Act → Observe → Think | General tool-using agents. |
| Plan-then-Execute | Planner writes full plan; executor runs steps | Complex multi-step tasks. |
| Reflexion / self-critique | Agent reviews its own output before final answer | Quality-sensitive generation. |
| Multi-agent crew | Specialised agents collaborate | Complex domains with clear sub-roles (like Apex). |
| Graph / state machine | Explicit nodes and edges | Flows where you need deterministic control. |
| Situation | First-instinct pick |
|---|---|
| Best-in-class coding | Claude Sonnet / Opus · Qwen 2.5 Coder (open) |
| Cheapest frontier-quality reasoning | DeepSeek V3/R1 · Gemini 2.5 Flash |
| 1M+ tokens of context | Gemini 2.5 Pro |
| Run on your own hardware | Llama 3.3 70B · Qwen 2.5 (run via Ollama, vLLM, or llama.cpp) |
| Voice agent speed | Llama 3.3 70B on Groq |
| Arabic / multilingual | Qwen 2.5 · Gemini · Claude |
| Strong vision | Claude Sonnet · GPT-4o · Gemini 2.5 |
| Cheap high-volume classification | Claude Haiku · Gemini Flash · GPT-4o-mini |
| Situation | First-instinct pick |
|---|---|
| Self-host an LLM on a small VPS | Ollama + Llama 3.x 8B or Qwen 2.5 Coder 7B |
| Real-time voice (under 300ms first-token) | Groq + Llama 3.3 70B · OpenAI Realtime as paid alt |
| Live agent assist / call-centre co-pilot | Groq + Llama 3 8B (sub-700ms round-trip) |
| High-throughput batch on open weights | Groq for speed · Together / Vertex AI for cost |
| Frontier reasoning (Claude / Gemini quality) | GCP Vertex AI — not Groq, Groq doesn't host them |
| Sensitive prompts you can't redact | Vertex AI in your GCP project or local Llama — not Groq (it's a public API) |
| Unified gateway across providers | LiteLLM (open) · Portkey (managed) |
| Rent an H100 for a day | RunPod · Lambda Labs · Modal |
| Enterprise with GCP commitment (your default) | GCP Vertex AI · Claude or Gemini |
| Need 1M+ token context | Gemini 2.5 Pro on Vertex AI |
| KSA data residency required | Vertex AI me-central2 (Dammam) OR on-prem Llama |
| Situation | First-instinct pick |
|---|---|
| RAG on ≤10M vectors | pgvector (reuse existing Postgres) |
| RAG on 10M–1B vectors | Qdrant (self-host) · Pinecone (managed) |
| Modern analytical warehouse | Snowflake · BigQuery (GCP) · Databricks |
| Existing Teradata estate | Work with it, don't migrate |
| Event / product analytics | ClickHouse |
| Graph relationships matter | Neo4j |
| Hybrid (keyword + vector) search | Elasticsearch + pgvector/Qdrant + Cohere Rerank |
| Situation | First-instinct pick |
|---|---|
| Visual automation, LLM is one step | n8n (self-host) · Make · Zapier |
| Custom agent, Claude-centric | Claude Agent SDK |
| Complex multi-step reasoning | LangGraph · Pydantic AI |
| Multi-agent collaboration | CrewAI · AutoGen · custom |
| RAG-heavy agent | LlamaIndex |
| Durable agent workflows (hours/days) | Temporal |
| Multi-step coding tasks | Claude Code · Cursor · Windsurf |
| Situation | First-instinct pick |
|---|---|
| Production observability, open-source | Langfuse |
| Production observability, managed | Helicone · LangSmith |
| Prompt evaluation in CI | Promptfoo |
| Block prompt injection | Llama Guard 4 (self-host) · Lakera Guard (SaaS) |
| Redact PII before LLM | Presidio (open) · Private AI (managed) |
| RAG-specific evaluation | Ragas |
me-central2 (Dammam, inside KSA) or on-prem Llama are the compliant paths.

For every box you draw in the architecture, answer these three questions out loud:
- Where does the data physically sit when this component is using it? (your DC · your VPC · vendor's cloud)
- Who can read it there? (your team · your cloud provider · the vendor's employees · the public internet)
- What contract or law constrains them? (DPA · SLA · GDPR/PDPL/HIPAA · "trust me bro")
If the answers feel hand-wavy, you have a risk. If you can't answer at all, you have a problem.
me-central2 Dammam for KSA · me-central1 Doha for Qatar · europe-west1 Belgium for EU) · DPA in place · Cloud Audit Logs on · IAM scoped per service-account · model versions pinned via Model Garden · VPC Service Controls to forbid data egress.

Use open source when any of the following is true:
- The data is regulated or sensitive. If you can't send it to a public API, the model has to run where you can run it — that means open weights.
- The component is on the hot path of your business logic. Orchestration, agent loops, RAG retrieval — anything you'll want to fork, debug, and customise. Vendor lock-in here is the most expensive lock-in to undo.
- Cost will explode at scale. Per-token pricing makes sense when usage is small. At a million queries a day, an open model on your hardware is 5-20× cheaper than a frontier API.
- You need predictable behaviour. Open weights don't change overnight. A vendor "model improvement" can break your evals on a Tuesday.
Use proprietary / SaaS when:
- You need the absolute best reasoning available, and the prompts are not sensitive. (Claude Opus, GPT-5-class.)
- The component is undifferentiated infrastructure you would never build yourself. (CDN, email delivery, payment processing.)
- You're prototyping and time-to-first-demo matters more than long-term cost.
| Layer | Open-source default | Proprietary when |
|---|---|---|
| L3 Models | Llama 3.3 70B · Llama 3 8B · Qwen 2.5 · DeepSeek | Claude / GPT for top-tier reasoning when prompts are not sensitive |
| L4 Inference | Ollama (local) · vLLM · TGI | Groq / GCP Vertex AI / Anthropic API for speed or scale you can't host |
| L5 Vector DB | pgvector · Qdrant · Weaviate | Pinecone if you want zero ops and aren't worried about lock-in |
| L6 ML Platform | MLflow · Metaflow · Kubeflow | Dataiku when you need a visual semantic layer + RBAC for non-coders |
| L7 Orchestration | LangChain / LangGraph · n8n · CrewAI · Temporal | Vendor agent platforms only when you accept the lock-in |
| L8 Protocols | MCP · OpenAPI | (no proprietary alternative — open is the standard) |
| L10 Governance | Langfuse self-hosted · Promptfoo · Llama Guard · Presidio | SaaS observability only for non-sensitive workloads |
The structural layer
- Orchestration · LangChain / LangGraph — this is your code, never lock it in
- Vector store · pgvector — already in your Postgres
- Local model · Llama 3.3 70B on Ollama — the regulated-data fallback
- Speech-to-text · Whisper local — never stream audio to a cloud STT
- PII redaction · Presidio — must run before prompts leave
- Output safety · Llama Guard — local moderation
- Tracing · Langfuse self-hosted — full prompt visibility, kept private
- Eval · Promptfoo — your evals, your test data, in your repo
The intelligence layer (selectively)
- Frontier reasoning · Claude Opus or GPT-5-class — only for non-sensitive prompts, only where it earns its cost
- Fast inference · Groq Cloud — for latency-bound loops, with redaction in front
- Enterprise data · Teradata / Dataiku — already paid for, don't re-platform
- Single-tenant frontier · GCP Vertex AI — Claude or Gemini inside your own GCP project, region-pinned (your default for any prompt with sensitive content)
- Long-context reasoning · Gemini 2.5 Pro via Vertex AI — when you need 1M+ tokens (whole repos, long contracts, large case files)
- Telephony · Genesys / Twilio — unless on-prem PBX is required
- Channel APIs · WhatsApp Business / SMS gateway — unavoidable for the channel itself
Frame the user job — one sentence
"Who is doing what task, and what would 'much better' look like for them?"
Customer Care AI: "An L1 support agent at our retail client handles 80 tickets a day; a well-scoped AI should resolve 50 of them end-to-end and prepare drafts for the other 30."
Decide: replace, co-pilot, or augment?
- Replace — AI handles the whole task, human is exception path. Use for high-volume, low-stakes, well-defined work (L1 ticket triage, simple RAG Q&A).
- Co-pilot — AI sits next to the human in real time. Use for high-stakes, high-judgement work (VIP call centre, financial analysis, code review).
- Augment — AI runs offline, prepares work for humans to consume. Use for batch tasks (lead enrichment, document summarisation, daily briefings).
Map the data sources — every box, every owner
Use the Data layer (L5) and DS Platform layer (L6) as your starting checklist. Don't propose any data migration — work with what exists. If Teradata holds the truth, the agent talks to Teradata (via Dataiku).
Pick the model — frontier vs open, big vs small
- Reasoning steps (planning, complex Q&A, code) → frontier (Claude Opus / GPT-class / Gemini 2.5 Pro) or top-tier open (Llama 3.3 70B).
- Long-context tasks (whole codebases, long PDFs, multi-doc analysis) → Gemini 2.5 Pro on Vertex AI for 1M+ tokens.
- Classification, extraction, simple Q&A → small open model (Llama 3 8B, Qwen 2.5 7B) or Claude Haiku / Gemini Flash. Cheaper and often faster.
- Latency-bound loops (live voice, autocomplete, live agent assist) → small open model on Groq · sub-300ms first-token. Without Groq, voice and live-assist don't work.
- Sensitive prompts → must be open weights hosted by you, or Claude/Gemini via Vertex AI in your own GCP project (single-tenant, region-pinned). Not Groq — it's a public API.
Pick the inference home — three trust zones, one rule per zone
- Local on-prem (green zone) → Ollama / vLLM on your GPU · use for any prompt you can't let leave the building. Open weights only. Slowest, but most compliant.
- Your GCP project (amber zone) → GCP Vertex AI · Claude or Gemini, region-pinned (use me-central2 for KSA / Dammam). Use for sensitive prompts that need frontier quality, or any non-time-bound workload. Your default.
- Public API (red zone) → Groq for latency-bound open-weight work, Anthropic API for the fastest path to Claude when the prompt is non-sensitive, OpenAI API for GPT specifically. Always Presidio-redact before any call.
Pick the orchestration shape
- Single-shot LLM call — RAG chatbot, summariser. No agent. Don't over-engineer.
- Workflow (n8n) — fixed steps, mostly SaaS API calls, one or two LLM nodes. Sales-outreach archetype.
- Agent (LangGraph) — model decides which tool to call next, loops until done. Customer Care, Analytics archetype.
- Multi-agent — multiple specialised agents handing off via tasks/queue. Apex archetype. Don't reach for it unless the work genuinely splits across roles.
Define the tools (MCP servers, one per integration)
Naming convention: {system}_{verb} — e.g. customer_history_get, order_refund, kb_search, ticket_escalate. Each tool has a one-line description that the model sees, an explicit JSON schema, and a unit test that the orchestrator runs at startup.
Draw the trust map · mark every red box
If you have a red box, write the mitigation next to it (Presidio in front · contract terms · fallback to local model · etc.). If a red box has no mitigation, the architecture is not done.
Define the guardrails & the escalation gate
- Confidence threshold — below X, hand to human with the draft attached.
- Action allow-list per tier — refunds up to $X without approval, anything above goes to a human.
- Output filters — Llama Guard for unsafe content, custom regex for never-say-this-to-a-customer phrases, Presidio for PII in outputs.
- Per-conversation rate limits — kill switch if the agent loops on the same customer five times in an hour.
Build the eval set before writing the code
Source examples from real tickets / calls / emails (with PII scrubbed). Cover the happy path, the obvious failure modes, the legally-sensitive cases, and the cases where the right answer is "I don't know — escalating."
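A sketch of what "eval set before code" means in practice — cases as data, a trivial harness around them; `agent` is a stand-in for your system:

```python
# Eval-set sketch: the cases exist before the agent does.
EVAL_CASES = [
    {"input": "Where is order A-1042?",          "expect_contains": "shipped"},     # happy path
    {"input": "Refund my order to a new card",   "expect_contains": "escalat"},     # policy-sensitive
    {"input": "What's your CEO's home address?", "expect_contains": "can't help"},  # must refuse
    {"input": "Is the Q3 rebate still running?", "expect_contains": "don't know"},  # honest gap
]

def run_evals(agent) -> float:
    passed = sum(case["expect_contains"].lower() in agent(case["input"]).lower()
                 for case in EVAL_CASES)
    return passed / len(EVAL_CASES)  # track this number in CI from day one
```

Exact substring checks are the crudest tier; in production you layer LLM-as-judge and business metrics on top, as described above.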
Estimate the cost & the build size · then commit
- Per-query cost = (input tokens × input price) + (output tokens × output price), summed across every model call in a typical conversation. Multiply by expected daily volume × 30. If the number is offensive, switch a step to a smaller model and recompute (see the cost sketch after this list).
- Build size = MCP tools (1 day each) + orchestration loop (1 week) + frontend / channel integration (1–3 weeks) + evals (3 days) + observability (3 days) + hardening / load-test (1 week). Typical first MVP: 4–8 engineer-weeks.
- Then commit. Write a one-page architecture sketch with the layer table, the trust map, the eval plan, the cost-per-query, and the timeline. That document is the basis of the SOW.
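A sketch of that arithmetic as runnable code — the prices are illustrative placeholders, not a rate card:

```python
# Per-query cost sketch. Prices are illustrative $/M-token placeholders —
# substitute the current rate card for the models you actually pick.
PRICE = {"small":    {"in": 0.15, "out": 0.60},   # Haiku/Flash-class
         "frontier": {"in": 3.00, "out": 15.00}}  # Opus/GPT-class

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    p = PRICE[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

# A typical conversation: one cheap intent classify + two frontier reasoning turns.
per_query = call_cost("small", 800, 20) + 2 * call_cost("frontier", 3000, 500)
monthly = per_query * 5_000 * 30        # 5,000 conversations/day × 30 days
print(f"${per_query:.4f}/query  ≈ ${monthly:,.0f}/month")
```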
`customer_history_get`, `kb_search`, `order_refund`, `order_reschedule`, `ticket_escalate`.

`ollama run llama3.3`.