A model trained purely to predict the next token on internet text will reproduce whatever patterns dominate the internet — including misinformation, harmful content, and persuasion techniques. The "alignment" challenge is shaping the model's behaviour to be helpful, honest, and harmless without breaking its underlying capabilities.
RLHF (Reinforcement Learning from Human Feedback) is the dominant technique. After base training, human raters compare pairs of model outputs and indicate which is better. A reward model learns from these preferences. The main model is then optimised to produce outputs the reward model rates highly. The result is a model that's better at following instructions, more helpful, and less likely to produce harmful content.
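The reward-model step can be sketched with the pairwise (Bradley-Terry) objective commonly used for preference learning. This is a minimal illustration, not any lab's actual implementation: the scalar scores stand in for reward-model outputs on the two candidate responses, and minimising the loss pushes the preferred response's score above the rejected one's.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log P(chosen beats rejected).

    P is modelled as sigmoid(r_chosen - r_rejected), so the loss is
    small when the reward model already scores the human-preferred
    output higher, and large when it ranks the pair the wrong way.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Reward model agrees with the human rater: low loss.
agree = preference_loss(r_chosen=2.0, r_rejected=0.5)
# Reward model disagrees: high loss, driving a gradient update.
disagree = preference_loss(r_chosen=0.5, r_rejected=2.0)
```

Summed over millions of human comparisons, gradients of this loss train the reward model; the main model is then optimised (typically with a policy-gradient method such as PPO) to maximise that learned reward.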
Constitutional AI (CAI), developed by Anthropic, takes a different approach: rather than relying entirely on human feedback, it trains the model on a set of written principles (a "constitution") and has the model critique and revise its own outputs against those principles. This reduces reliance on expensive human labelling and makes the model's values more explicit and auditable.
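The critique-and-revise loop at the heart of CAI can be sketched as follows. Everything here is a toy stand-in: `ask_model` is a hypothetical placeholder for a real LLM call, and the two principles are illustrative, not Anthropic's actual constitution.

```python
# Illustrative principles only; a real constitution is longer and
# more carefully worded.
CONSTITUTION = [
    "Choose the response that is most helpful and honest.",
    "Avoid responses that could facilitate harm.",
]

def ask_model(prompt: str) -> str:
    # Hypothetical stub: a real system would call an LLM here.
    if prompt.startswith("Critique"):
        return "The draft includes unnecessary risky detail."
    return "Revised answer with the risky detail removed."

def critique_and_revise(draft: str) -> str:
    """Have the model critique its own draft against each principle,
    then revise the draft to address the critique."""
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Critique this response against the principle "
            f"'{principle}':\n{draft}"
        )
        draft = ask_model(
            f"Revise the response to address this critique:\n"
            f"{critique}\nOriginal:\n{draft}"
        )
    return draft
```

The revised outputs then become training data, so the model internalises the principles rather than applying them at inference time.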
⚠ The Alignment Tax
There's an ongoing debate about whether safety training reduces raw capability: whether a model optimised to refuse harmful requests also becomes slightly worse at legitimate edge cases near those refusals. This trade-off is called the "alignment tax." Evidence is mixed, and the tax appears to shrink as training techniques improve, but it partly explains why some users seek out fine-tuned models with fewer restrictions for specific research applications.
🔬 Why This Matters for Trust
Understanding RLHF explains why Claude, GPT, and Gemini behave the way they do when you ask them to help with sensitive topics. They're not running a keyword filter; they've learned patterns of what constitutes helpful versus harmful responses from millions of human preference judgements. The model "wants" to be helpful, and its refusals represent a trained judgement that the request risks harm outweighing that helpfulness. Whether that judgement is calibrated correctly for your use case is a legitimate ongoing debate.