2026-06-17

Tokenization in the AI Era: What It Is, Why It's Expensive, and What's Breaking in 2026

Tokenization is the unglamorous foundation underneath every AI cost number, context window, and weird failure you've encountered. Token prices crashed, but bills are rising anyway — and there's a real 'tokenization tax' on non-English languages.

TL;DR: Tokenization — the step that breaks text, images, and audio into the "tokens" an AI model actually processes — is the unglamorous foundation underneath every cost number, context window, and weird AI failure you've encountered. Token prices have crashed (up to 280x cheaper in two years), but usage has grown even faster, so bills are rising anyway. Non-English speakers pay a real "tokenization tax." And a growing body of expert opinion, led by Andrej Karpathy, argues that tokenization is the source of much of AI's strangest behavior — from miscounting letters in "strawberry" to outright security holes.

Executive Summary

What it is: tokenization converts any input — text, images, audio — into discrete units ("tokens") that are the actual currency a model computes on and that you're billed for. Modern tokenizers (BPE, SentencePiece, tiktoken) trade a bit of human readability for the ability to handle any language or input without ever hitting an "unknown word."
What's costing real money: token prices have fallen up to 280x in two years, but total usage has grown so much faster (OpenAI alone processes ~15 billion tokens/minute as of April 2026) that company AI bills are still rising — sometimes tripling even as per-token costs collapse. Agentic and reasoning workflows are the main driver, consuming 5-30x (and in extreme cases up to 1000x) more tokens per task than a simple chat message.
What's overhyped/contested: whether tokenization is a temporary engineering hack we'll soon delete (Karpathy's camp, plus Meta's Byte Latent Transformer research) or a "core design decision" that's here to stay (a January 2026 EACL paper pushing back on that assumption). As of mid-2026, every flagship commercial model — GPT, Claude, Gemini, Llama — still uses conventional subword tokenization; nothing tokenizer-free has shipped in production.
Where it's heading: incremental fixes (bigger vocabularies, smarter multilingual handling, security-hardened tokenizers) rather than a clean break from tokenization — even as the research case for ditching it gets stronger.

Background / Context

No AI model you use (ChatGPT, Claude, Gemini) actually "reads" your text the way you do. Before anything reaches the neural network, a tokenizer chops your input into pieces (tokens) and converts each piece into a number via a lookup table. "Hello, world!" might become ["Hello", ",", "world", "!"]. A rough rule of thumb for English: 1 token ≈ 4 characters ≈ 0.75 words. The token is also the literal unit every major provider bills you by.
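To make the lookup-table idea concrete, here is a minimal sketch in Python. The four-entry vocabulary, the regex split, and the function names are all assumptions made for illustration; production tokenizers use learned vocabularies with tens of thousands of entries and different splitting rules.

```python
import re

# Hypothetical toy vocabulary, invented for this example only.
TOY_VOCAB = {"Hello": 0, ",": 1, "world": 2, "!": 3}

def toy_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def to_ids(tokens: list[str]) -> list[int]:
    """Map each piece to its integer ID; raises KeyError for unknown pieces."""
    return [TOY_VOCAB[t] for t in tokens]

def estimate_tokens(text: str) -> int:
    """Rule-of-thumb estimate for English: about 4 characters per token."""
    return max(1, round(len(text) / 4))

pieces = toy_tokenize("Hello, world!")
ids = to_ids(pieces)
print(pieces)  # ['Hello', ',', 'world', '!']
print(ids)     # [0, 1, 2, 3]
print(estimate_tokens("Hello, world!"))  # 13 characters, so the estimate is 3
```

The estimate is only a planning heuristic for English text. The exact count depends on the specific tokenizer a provider uses.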

This sounds like a minor implementation detail. It isn't. Tokenization choices ripple into how much you pay, how long a conversation a model can hold in memory, why some languages cost more than others to use AI in, and why models do oddly dumb things like insisting 9.11 is bigger than 9.9.

Key Findings

How tokenizers actually work

The dominant technique is Byte Pair Encoding (BPE): start with individual characters/bytes, then repeatedly merge the most frequent adjacent pair into a new token, until you hit a target vocabulary size. It originated as a compression algorithm (Philip Gage, 1994) and was adapted for NLP by Sennrich et al. in a 2016 paper that's now the field's canonical reference.
SentencePiece (Google, 2018) is a tokenizer library that implements both BPE and a probabilistic "unigram" model, which picks the most likely way to split text. It was purpose-built so one tokenizer could handle 100+ languages without language-specific preprocessing. It powers Gemini/Gemma.
tiktoken (OpenAI's tokenizer) works at the byte level: its base vocabulary is just the 256 possible byte values, so it can always encode any input — any language, emoji, or garbled text — by falling back to raw bytes if nothing else matches. On average, each token covers about 4 bytes of text.
Vocabulary sizes vary a lot by model and have grown over time: GPT-4/3.5 used ~100,000 tokens; GPT-4o jumped to ~200,000; Llama 3 uses ~128,000 (up from 32,000 in Llama 2); Gemini/Gemma 3 uses ~256,000. Anthropic has never published Claude's vocabulary size — outside estimates put it around 65,000, though this is unconfirmed.
Bigger vocabularies aren't free: they mean common phrases compress into fewer tokens (cheaper, more room in the context window), but they also bloat the model's parameter count, so labs balance vocabulary size against model size rather than growing it indefinitely.
Multimodal tokenization has become standard practice, not just research: images are split into fixed-size patches (a 224×224 image cut into 16×16-pixel patches becomes 14 × 14 = 196 patches in the classic Vision Transformer approach), and in production APIs like GPT-4o, image cost is literally calculated from patch/tile count (roughly 170 tokens per 512px tile). Audio gets converted into discrete tokens via neural codecs that compress a waveform into a sequence of "codebook" entries. Kyutai's Mimi codec, for example, compresses 24kHz audio down to about 12.5 token-frames per second, which is short enough to fit comfortably in an LLM's context window.
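The BPE merge loop described at the start of this section can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's implementation: it works on the characters of a single string, while real tokenizers typically start from bytes, pre-split text into words, and train on very large corpora. The function name and the stopping rule (stop when no pair occurs at least twice) are assumptions made for the example.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> tuple[list[str], list[tuple[str, str]]]:
    """Learn up to `num_merges` merges over the characters of `text`.

    Returns the final token sequence and the ordered list of learned merges.
    """
    tokens = list(text)  # start from individual characters
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent pair in the current sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append(best)
        # Rewrite the sequence, replacing each occurrence of the best pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("aaabdaaabac", num_merges=3)
print(merges[0])    # ('a', 'a'), the most frequent adjacent pair
print(len(tokens))  # fewer tokens than the 11 original characters
```

Each merge adds one entry to the vocabulary and shortens the sequence, which is the trade-off between vocabulary size and token count discussed above.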

Why tokenization is costing (and saving) real money

Token prices have fallen dramatically: a GPT-3.5-equivalent model went from $20 per million tokens (Nov 2022) to $0.07 per million (Oct 2024), a drop of roughly 280x in two years. GPT-4-level reasoning that cost ~$60 per million output tokens in early 2023 is available for $0.30-$0.75 per million as of 2026.
But the paradox is real: one analysis found token costs falling ~99.7% while a company's actual AI bill tripled, because cheaper tokens unlocked far more usage, especially through AI agents. OpenAI's own processing volume jumped from 6 billion to 15 billion tokens per minute between October 2025 and April 2026, a 2.5x increase in roughly six months. Goldman Sachs forecasts a 24x increase in token consumption by 2030, driven by agentic AI.
Reasoning and agentic workloads are the multiplier. Chain-of-thought reasoning and multi-step agent tool-calling consume 5-30x more tokens per task than a simple chat message (some extreme tool-calling loops reportedly hit up to 1000x). AT&T's internal token usage went from ~8 billion to 27 billion tokens/day after deploying multi-agent systems. By the end of 2025, reasoning models' share of all tokens processed industry-wide had crossed 50%, up from near-zero a year earlier.
Context windows have exploded, with roughly 20,000x growth from GPT-1's 512 tokens (2018) to Llama 4 Scout's 10 million tokens (2025), enabling products to ingest whole codebases, full legal contracts, or long-running agent memory in a single pass. Several flagship models (Claude Opus, Gemini Pro) now offer 1M-token context at flat rates.
There's a documented "tokenization tax" on non-English languages. A widely cited 2023 study found some languages need up to 15x more tokens than English for equivalent content, because tokenizer vocabularies are built from training data dominated by English/Latin-script text. Using GPT-4o pricing as an example: the same content costs $2.90 per million words in English, but $4.73 in Hindi (a 63% premium) and $4.93 in Arabic (a 70% premium). This isn't a rounding error. It's a structural cost difference baked into how the tokenizer was trained, and it persists even in tokenizers explicitly designed to be multilingual.
Companies have responded with several now-standard cost-control techniques: prompt caching (Anthropic cuts cached-input cost by 90%, OpenAI by 50%, Google by 75%), batch APIs (flat discounts of 25-50% for non-real-time requests), model routing (sending easy queries to cheap models, hard ones to frontier models — shown to cut overall token usage 37-46%), and semantic caching (matching new queries to similar cached ones, with one production case reporting a 90% cache-hit rate and 80% cost reduction). Stacking these techniques together is documented to deliver 47-80% total cost reduction.
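The arithmetic behind several figures in this section can be checked with a short Python sketch. It assumes the simplest possible model, bill = usage × price per token, and the function names are invented for this example. The caching inputs ($3 per million tokens, 80% of input served from cache) are hypothetical values, not vendor prices.

```python
def price_drop_factor(old_price: float, new_price: float) -> float:
    """How many times cheaper the new price is."""
    return old_price / new_price

def implied_usage_growth(price_multiplier: float, bill_multiplier: float) -> float:
    """If bill = usage * price, usage growth = bill growth / price growth."""
    return bill_multiplier / price_multiplier

def premium_pct(cost: float, baseline: float) -> float:
    """Percentage premium of `cost` over `baseline`."""
    return (cost / baseline - 1) * 100

def cached_input_cost(input_tokens: int, cached_fraction: float,
                      price_per_million: float, cache_discount: float) -> float:
    """Input cost when a fraction of tokens is billed at a cache discount."""
    blended = (1 - cached_fraction) + cached_fraction * (1 - cache_discount)
    return input_tokens / 1_000_000 * price_per_million * blended

# $20 -> $0.07 per million tokens: about 286x, usually quoted as 280x.
print(round(price_drop_factor(20.0, 0.07)))

# Price falls ~99.7% (multiplier 0.003) while the bill triples:
# under this model, usage must have grown by roughly 1000x.
print(round(implied_usage_growth(0.003, 3.0)))

# Per-million-word costs quoted above for Hindi and Arabic vs English.
print(round(premium_pct(4.73, 2.90)), round(premium_pct(4.93, 2.90)))

# Hypothetical workload: 1M input tokens at $3/M, 80% cached, 90% discount.
print(round(cached_input_cost(1_000_000, 0.8, 3.0, 0.9), 2))
```

The second calculation is the useful one for forecasting: a bill that triples while prices fall by 99.7% implies usage growth of about three orders of magnitude, which is why published price drops alone say little about next year's spend.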

Real-world weirdness, costly surprises, and security holes

The "strawberry" problem: models notoriously miscounted the letters in "strawberry" (saying 2 R's instead of 3) because the tokenizer splits it into chunks like "straw" + "berry" — the model never sees individual letters, only token IDs. OpenAI's reasoning model was even internally codenamed "Strawberry," a direct nod to fixing this exact class of failure through step-by-step reasoning rather than a single forward pass.
"9.11 vs 9.9": a separate but related embarrassment — multiple models confidently claimed 9.11 is bigger than 9.9, traced partly to how digits get split into tokens like "9", ".", "11" rather than being understood as decimal values. Researchers caution this is "tokenization plus" something else — a deeper pattern-matching issue — not tokenization in total isolation.
Glitch tokens, the most famous being "SolidGoldMagikarp": a rare Reddit username that ended up in a tokenizer's vocabulary but was essentially untrained in the model itself, creating a "blind spot" where simply asking the model to repeat the word triggered bizarre, unrelated, or hostile outputs. This was first documented by independent researchers in early 2023 and has spawned a whole sub-field of detection research since.
Two real, recent pricing blowups for PMs to know: Cursor (the AI coding tool) had to publicly apologize in July 2025 after quietly moving from a flat "500 requests/month" plan to a token-cost-based model that spiked some users' effective bills 20x+ once heavy agentic usage hit; refunds followed. GitHub Copilot's June 2026 move to token-based billing similarly blindsided developers — some projected bill increases of 10x-50x for agentic workflows (one scenario went from $29/month to $750/month), and the GitHub community thread announcing it got 958 downvotes against 24 upvotes.
Named companies that blew past their AI budgets in 2026: Uber exhausted its entire annual AI coding budget by April after rolling out Claude Code (engineer adoption jumped from 32% to 84% in two months), with heavy users running up $500-2,000/month each. Priceline's AI coding tool contract renewal jumped 4-5x; one employee alone generated a $40,000 monthly token bill. One unnamed enterprise reportedly accumulated a $500 million bill after failing to set usage limits. A 2025 industry survey found 85% of companies missed their AI cost forecasts by more than 10%.
This is also a real security surface, not just a cost or quirk problem. "TokenBreak" (disclosed June 2025) showed that single-character tweaks to a harmful prompt — like changing "instructions" to "finstructions" — can change how a tokenizer splits the text just enough to slip past a safety filter, while the underlying model still understands the intended (harmful) meaning perfectly. Separately, "adversarial tokenization" research (March 2025) showed the same trick works across multiple model families (Llama 3, Gemma 2, OLMo) by exploiting tokenizer-level segmentation differences, and other research has found "special token" exploits achieving jailbreak success rates as high as 96%.
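Two of the items above, the "strawberry" miscount and TokenBreak-style evasion, come down to how a tokenizer segments a string. The Python sketch below uses a toy greedy longest-match tokenizer to show the effect. The vocabulary, the function names, and the keyword filter are hypothetical stand-ins; real tokenizers and real safety classifiers are far more complex, and the splits shown here are not the actual splits of any production model.

```python
# Hypothetical toy vocabulary, chosen only to illustrate the two failure modes.
TOY_VOCAB = {"straw", "berry", "instructions", "fin", "struct", "ions"}

def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation, falling back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here, so emit one raw character.
            tokens.append(word[i])
            i += 1
    return tokens

def naive_filter_flags(tokens: list[str]) -> bool:
    """Stand-in for a classifier that keys on one specific token."""
    return "instructions" in tokens

# The string has three r's, but the model only receives two opaque pieces.
print("strawberry".count("r"))                    # 3
print(greedy_tokenize("strawberry", TOY_VOCAB))   # ['straw', 'berry']

# One extra leading character changes the segmentation completely.
print(greedy_tokenize("instructions", TOY_VOCAB))   # ['instructions']
print(greedy_tokenize("finstructions", TOY_VOCAB))  # ['fin', 'struct', 'ions']
print(naive_filter_flags(greedy_tokenize("finstructions", TOY_VOCAB)))  # False
```

The point of the sketch is that anything operating on token sequences, whether a model counting letters or a filter looking for a pattern, only sees the pieces the tokenizer hands it.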

What's overhyped / contested

The debate over whether tokenization should be deleted entirely is real, current, and unresolved. Andrej Karpathy — among the most-cited voices on this — has said: "Everyone should hope that we can throw away tokenization in LLMs," and that "a lot of weird behaviors and problems of LLMs actually trace back to tokenization." He frames it as an inelegant holdover from older NLP pipelines, not a deliberate design choice.
Counter to that, a January 2026 paper ("Stop Taking Tokenizers for Granted," published at EACL 2026) argues the opposite: that tokenizer choice is a "core design decision" on par with model architecture, and that the field's habit of dismissing tokenizer differences as minor is wrong — measurable performance differences persist across tokenizer choices even at large model scale.
Tokenizer-free research is promising but not close to mainstream deployment. Meta's Byte Latent Transformer (Dec 2024) processes raw bytes directly, dynamically allocating more compute to unpredictable stretches of text, and reportedly matches Llama 3 performance using up to 50% fewer inference FLOPs. MambaByte (2024) is a similar byte-level approach built on the Mamba architecture. Both are described by researchers as "feasible at scale" and "promising" — language that signals proof-of-concept, not production-ready. As of mid-2026, every flagship commercial model (GPT, Claude, Gemini, Llama) still ships with conventional subword tokenization.
The actual industry response so far is incremental, not revolutionary: bigger vocabularies, better multilingual coverage, and tokenizer choices made partly for security reasons (e.g., Unigram tokenizers being immune to the TokenBreak exploit that hits BPE/WordPiece) — rather than the field abandoning tokenization altogether.

Where things are heading

Expect continued growth in vocabulary size and multilingual-aware tokenizer design as the main lever labs pull to address the "tokenization tax," rather than a wholesale architecture change.
Watch the tokenizer-free research lineage (BLT, MambaByte, and newer adaptations like ByteFlow). It's well-funded and active, and Meta's own research division is the most vocal advocate for eliminating tokenization, yet its production Llama models still use conventional tokenizers. That tension is a useful signal: even the labs pushing tokenizer-free hardest haven't shipped it yet.
Expect continued volatility in how AI products price themselves. The Cursor and GitHub Copilot pricing incidents are likely previews, not one-offs — as agentic and reasoning workloads keep pushing token consumption up faster than per-token prices fall, more products will face the same "our flat-rate plan didn't account for how token-hungry agents actually are" reckoning.
Expect more "outcome-based" pricing experiments (e.g., Intercom charging $0.99 per resolved support ticket rather than per token or message) as companies try to decouple what customers pay from the internal token cost they can't fully predict or control.

Implications for PMs / Practitioners

If your product has any agentic or reasoning features, budget for 5-30x the token consumption of a simple chat feature — and stress-test your pricing model against that multiplier before launch, not after a Cursor- or Copilot-style backlash forces your hand.
Don't assume "tokens got cheaper" means "this feature got cheaper to run." Falling per-token prices and rising total usage are pulling in opposite directions on your actual bill — model your cost projections on usage growth, not just published price drops.
If you have non-English-speaking users, check what they're actually paying in tokens for the same task. A 60-70% cost premium for Hindi or Arabic users isn't a hypothetical — it's documented, current, and worth knowing before you set regional pricing or usage limits.
Treat "switching models" as a real migration cost, not a config change. Different tokenizers mean different token counts for the same prompt, which can quietly break context-budget logic, cost estimates, and prompt engineering tuned for one vendor.
If you're evaluating AI safety/guardrail tooling, ask specifically how it handles tokenizer-level evasion (TokenBreak-style attacks) — a classifier that looks robust on the model's own tokenizer can still be evaded with single-character tricks that exploit the tokenizer's segmentation.
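The first recommendation above, stress-testing a budget against the 5-30x multiplier, can be sketched as a small Python calculation. All inputs here (100,000 tasks per month, 1,000 tokens per task, $1 per million tokens) are hypothetical placeholders, and the function names are invented for this example. Substitute your own measured numbers.

```python
def monthly_cost(tasks_per_month: int, tokens_per_task: int,
                 price_per_million: float, multiplier: float = 1.0) -> float:
    """Monthly token spend in dollars for a given consumption multiplier."""
    total_tokens = tasks_per_month * tokens_per_task * multiplier
    return total_tokens * price_per_million / 1_000_000

def stress_test(tasks_per_month: int, tokens_per_task: int,
                price_per_million: float,
                multipliers: tuple[float, ...] = (1, 5, 30)) -> dict[float, float]:
    """Cost at each multiplier: plain chat baseline, then agentic low and high."""
    return {
        m: monthly_cost(tasks_per_month, tokens_per_task, price_per_million, m)
        for m in multipliers
    }

# Hypothetical feature: 100k tasks/month, 1k tokens each, $1 per million tokens.
for multiplier, cost in stress_test(100_000, 1_000, 1.0).items():
    print(f"{multiplier}x -> ${cost:,.2f}/month")
```

If the 30x row breaks a flat-rate plan's margin, that is a signal to add usage caps or metered tiers before launch rather than after.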

Sources

Note on sourcing: pricing and token-consumption figures are volatile and several come from third-party aggregators citing primary data (Epoch AI, Goldman Sachs, FinOps Foundation) rather than vendor-published numbers directly — worth re-verifying before quoting any specific dollar figure publicly. A small number of model-version details (e.g. exact Claude tokenizer changes) rely on unofficial reverse-engineering since Anthropic hasn't published its tokenizer specifics.