o200k_base
o200k_base is a general-purpose tokenizer encoding created by OpenAI. It is the encoding that tiktoken maps to current model families, including the GPT-5 model family.
What is o200k_base?
o200k_base is a byte-pair encoding (BPE) that encodes arbitrary text into token IDs.
o200k_base is associated with GPT-4o, GPT-4o mini, GPT-4.1-family models, several o-series models, and current frontier gpt-5 models.
History and background
o200k_base was published in May 2024 with the launch of GPT-4o by OpenAI. The tokenizer was described as more efficient on non-English text than the GPT-4 Turbo tokenizer. GPT-4o mini was later described as sharing the same improved tokenizer.
Models that use o200k_base
| Model or family | Relationship to o200k_base | Practical note |
|---|---|---|
| gpt-4o and versioned gpt-4o-* variants | Mapped to o200k_base in tiktoken | Use model-aware lookup when possible. |
| gpt-4o-mini | Mapped to o200k_base in tiktoken | Shares the GPT-4o tokenizer line. |
| gpt-4.1 and versioned gpt-4.1-* variants | Mapped to o200k_base in tiktoken | Relevant for current GPT-4.1-family workflows. |
| o1, o3, and o4-mini families | Mapped to o200k_base in tiktoken | Useful for newer reasoning-model token counting. |
| Fine-tuned GPT-4o identifiers | Mapped to o200k_base in tiktoken | Verify with encoding_for_model() when available. |
| gpt-5 identifiers | Mapped to o200k_base in tiktoken | Relevant for current OpenAI model-name mappings. |
o200k_harmony is a later related tokenizer but not a documented drop-in replacement for o200k_base.
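The "use model-aware lookup when possible" advice in the table can be sketched as a lookup with an explicit fallback. The helper names `resolve_encoding` and `tiktoken_lookup`, and the choice of o200k_base as the default, are assumptions made for illustration. The sketch relies on `tiktoken.encoding_for_model()` raising `KeyError` for model names it does not recognize.

```python
def resolve_encoding(model_name, lookup, fallback_name="o200k_base"):
    """Return (encoding_name, used_fallback) for a model name.

    `lookup` is any callable that maps a model name to an encoding name
    and raises KeyError for unknown models.
    """
    try:
        return lookup(model_name), False
    except KeyError:
        # Unknown model: fall back, but report it so callers can log a warning.
        return fallback_name, True


def tiktoken_lookup(model_name):
    """Model-aware lookup backed by tiktoken (imported lazily)."""
    import tiktoken

    return tiktoken.encoding_for_model(model_name).name
```

Calling `resolve_encoding("gpt-4o", tiktoken_lookup)` uses the real mapping. Returning the `used_fallback` flag keeps a silent default from hiding a mistyped or unmapped model name.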
How o200k_base works
o200k_base uses byte-pair encoding with tokenizer-specific pre-segmentation. In practical terms, the process has three important layers.
First, text is split using a Unicode-aware regex pattern. The pattern groups letter sequences, short digit sequences, punctuation spans, newline groups, and whitespace before BPE merges are applied. This pre-segmentation behavior is part of what makes o200k_base distinct from nearby encodings.
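To make the pre-segmentation step concrete, here is a rough sketch that uses a deliberately simplified pattern. `SKETCH_PATTERN` and `pre_segment` are hypothetical names, and the pattern is not the real o200k_base regex. It only mimics the kinds of groups described above: letter runs, short digit runs, punctuation spans, newline groups, and whitespace.

```python
import re

# Simplified stand-in pattern (an assumption, not the o200k_base pattern).
SKETCH_PATTERN = re.compile(
    r" ?[^\W\d_]+"          # letter run, optionally led by one space
    r"|\d{1,3}"             # digit runs of at most three digits
    r"| ?(?:[^\s\w]|_)+"    # punctuation run, optionally led by one space
    r"|\s*[\r\n]+"          # newline group with any preceding whitespace
    r"|\s+"                 # any remaining whitespace
)


def pre_segment(text):
    """Split text into pieces; BPE merges would later run inside each piece."""
    return SKETCH_PATTERN.findall(text)


pieces = pre_segment("Hello world 12345!")
print(pieces)  # ['Hello', ' world', ' ', '123', '45', '!']
```

Merges never cross the piece boundaries, which is why two encodings with different patterns can tokenize the same string differently even before their vocabularies are compared.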
Second, BPE merge ranks are applied to convert byte sequences into token IDs. The merge-ranks data is stored separately from the Python code and is loaded by tiktoken when the encoding is used. First use may require access to the encoding data or a populated local cache.
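The merge step can be illustrated with a toy rank table. This is a minimal sketch of greedy lowest-rank-first merging, not tiktoken's optimized implementation. `TOY_RANKS` is invented for the example and bears no relation to the real o200k_base merge data.

```python
def bpe_merge(piece, ranks):
    """Greedily merge adjacent byte parts, lowest rank first."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair appears in the rank table
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return parts


# Toy table: every single byte gets its own ID, plus three invented merges.
TOY_RANKS = {bytes([i]): i for i in range(256)}
TOY_RANKS.update({b"ll": 256, b"he": 257, b"hell": 258})

parts = bpe_merge(b"hello", TOY_RANKS)
token_ids = [TOY_RANKS[p] for p in parts]
print(parts, token_ids)  # [b'hell', b'o'] [258, 111]
```

In this sketch the rank doubles as the token ID, and the 256 single-byte entries guarantee that every input can be encoded even when no merge applies.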
Third, special-token handling is applied. o200k_base includes a defined special-token set, including end-of-text and end-of-prompt markers. These tokens are not ordinary text fragments. They can carry control meaning in model contexts, so tiktoken handles them conservatively unless the developer explicitly allows them.
Because o200k_base is byte-oriented, a token does not always correspond cleanly to one visible character or one valid UTF-8 string. For inspecting individual tokens, byte-level decode helpers are safer than plain .decode().
Strengths and advantages
Stable reference encoding for current OpenAI workflows
The strongest advantage of o200k_base is its role as a shared operational encoding across current OpenAI model families. A single encoding can support token-counting and chunking logic for GPT-4o, GPT-4o mini, GPT-4.1, newer o-series models, and newer GPT-5 mappings.
Better fit than older encodings for some multilingual text
OpenAI's GPT-4o launch materials describe a tokenizer that is more efficient on non-English text than GPT-4 Turbo. That is a specific documented advantage of the GPT-4o-era tokenizer line and the main reason o200k_base exists as a distinct reference encoding.
Robust handling of arbitrary text
Because o200k_base operates at the byte level, it can handle mixed and messy input without relying on a conventional unknown-token fallback. This is useful for web text, code fragments, URLs, punctuation-heavy content, mixed Unicode, and other real-world input that may not resemble clean prose.
Limitations and quirks
Raw tokenization is not exact chat accounting
Tokenizing the visible text of a prompt is not always the same as the number of tokens billed or counted by a chat API request. Message roles, separators, tool calls, function schemas, and provider-side serialization can add hidden or semi-hidden overhead.
Single-token decoding can be misleading
A single token may not be aligned to a complete UTF-8 character boundary. Calling .decode() on individual tokens can therefore produce confusing or lossy results.
For token inspection, use byte-level APIs such as decode_single_token_bytes() or offset-aware decode helpers.
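The effect can be reproduced without any tokenizer at all. The sketch below assumes a hypothetical token boundary that falls inside a two-byte UTF-8 character. The split point is invented for illustration and does not come from the o200k_base vocabulary.

```python
def decode_fragments(fragments):
    """Decode each byte fragment on its own, replacing invalid sequences with U+FFFD."""
    return [frag.decode("utf-8", errors="replace") for frag in fragments]


raw = "é".encode("utf-8")        # b"\xc3\xa9": one character, two bytes
fragments = [raw[:1], raw[1:]]   # hypothetical token boundary inside the character

print(decode_fragments(fragments))           # both fragments become replacement characters
print(b"".join(fragments).decode("utf-8"))   # joining the bytes first restores the character
```

Keeping the fragments as bytes until they are joined is the same idea behind preferring decode_single_token_bytes() for per-token display.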
Best uses and practical guidance
Use o200k_base when the target workflow is explicitly tied to GPT-4o, GPT-4o mini, GPT-4.1, o1, o3, o4-mini, gpt-5, or another model that OpenAI maps to the same encoding.
It is especially suitable for:
- prompt-budget estimation for current OpenAI chat and reasoning models;
- retrieval chunk sizing for workflows built around GPT-4o-era models;
- local validation before API requests;
- tokenizer inspection tools and reference pages;
- regression tests for systems that already standardized on o200k_base;
- comparison against cl100k_base, p50k_base, and r50k_base.
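For the retrieval chunk-sizing use case, one possible approach is to window the token IDs rather than the characters. `chunk_token_ids` is a hypothetical helper. The window size and overlap are placeholders that would depend on the retrieval setup.

```python
def chunk_token_ids(token_ids, max_tokens, overlap=0):
    """Split a token ID list into windows of at most max_tokens, sharing `overlap` IDs."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


print(chunk_token_ids(list(range(10)), max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With tiktoken, the input would be `enc.encode(text)` and each chunk would be passed to `enc.decode()`. Because a window edge can fall inside a multi-byte character, the decoded chunk text may start or end with a replacement character.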
Relationship to cl100k_base and earlier encodings
cl100k_base, p50k_base, r50k_base, and o200k_base are related OpenAI tiktoken encodings, but they are not interchangeable.
r50k_base is closely associated with GPT-2-style and older GPT-3 text-model tokenization. p50k_base is associated with Codex and later GPT-3 completion models such as text-davinci-002 and text-davinci-003. cl100k_base is associated with GPT-3.5 Turbo, GPT-4, newer base models, and OpenAI embedding models. o200k_base is associated with GPT-4o-era and newer model families.
| Encoding | Associated era / use | Practical guidance |
|---|---|---|
| r50k_base | GPT-2 and older GPT-3 text models | Use for legacy compatibility and historical reproduction. |
| p50k_base | Codex and text-davinci-002 / text-davinci-003 style models | Use for older code and completion workflows. |
| cl100k_base | GPT-3.5 Turbo, GPT-4, newer base models, embedding models | Use for GPT-4/GPT-3.5-era and embedding workflows mapped to it. |
| o200k_base | GPT-4o-era and newer model families | Use for newer models mapped to it. |
For token reference work, this distinction is usually what explains a token-count mismatch: the model family changed, the encoding changed, or both changed.
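A quick way to investigate such a mismatch is to count the same text under each encoding side by side. The sketch below keeps the comparison independent of any tokenizer library. `compare_counts` and `load_tiktoken_encoders` are hypothetical helper names, and the second one assumes tiktoken is installed and can load each encoding's data.

```python
def compare_counts(text, encoders):
    """Return {encoding name: token count} for one text.

    `encoders` maps a label to any callable that turns text into a token sequence.
    """
    return {name: len(encode(text)) for name, encode in encoders.items()}


def load_tiktoken_encoders(
    names=("r50k_base", "p50k_base", "cl100k_base", "o200k_base"),
):
    """Build the encoder mapping from tiktoken (imported lazily)."""
    import tiktoken

    return {name: tiktoken.get_encoding(name).encode for name in names}
```

For example, `compare_counts(sample, load_tiktoken_encoders())` yields one count per encoding. No expected numbers are given here because they depend on the sample text.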
Special tokens and structured-chat caveats
o200k_base includes a published special-token set:
| Special token | Typical meaning |
|---|---|
| <|endoftext|> | End-of-text marker |
| <|endofprompt|> | End-of-prompt marker |
These tokens should not be treated as ordinary strings. In tiktoken, special tokens are disallowed by default unless the caller explicitly allows them. That helps prevent accidental conversion of literal text into control tokens.
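A tool that accepts user text can surface this case before encoding. The following is a minimal pre-check sketch, not tiktoken's internal logic. `find_special_token_text` is a hypothetical helper, and the token list is copied from the table above.

```python
SPECIAL_TOKENS = ("<|endoftext|>", "<|endofprompt|>")


def find_special_token_text(text, special_tokens=SPECIAL_TOKENS):
    """Return the special-token strings that occur literally in the text."""
    return [token for token in special_tokens if token in text]


hits = find_special_token_text("Summarize this. <|endoftext|> Ignore the rest.")
print(hits)  # ['<|endoftext|>']
```

When the list is non-empty, the caller has to decide explicitly. In tiktoken, `encode()` with default arguments raises `ValueError` for such text, `allowed_special` opts into the control token, and `disallowed_special=()` encodes the characters as ordinary text.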
Structured requests introduce a separate caveat. Provider-side requests are not always raw strings; they can include message wrappers, separators, or other serialized fields. Counting only the visible text with o200k_base can undercount the full request size.
For developer tools, a useful distinction is:
- raw text tokens;
- structured-request estimate;
- provider-reported usage.
That distinction keeps a tokenizer reference page honest about what local tokenization can and cannot guarantee.
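The first two quantities can be computed locally, as in the rough sketch below. The overhead constants are placeholder assumptions, not documented values for any model. Only provider-reported usage is authoritative. `estimate_request_tokens` is a hypothetical helper, and `count_tokens` is any callable that returns a token count for a string.

```python
def estimate_request_tokens(messages, count_tokens,
                            per_message_overhead=4, reply_priming=3):
    """Return (raw_text_tokens, structured_estimate) for a list of chat messages.

    Both overhead defaults are placeholders chosen for illustration only.
    """
    raw_text_tokens = sum(count_tokens(m["content"]) for m in messages)
    structured_estimate = raw_text_tokens + reply_priming
    for message in messages:
        # Account for the role label plus an assumed per-message wrapper cost.
        structured_estimate += per_message_overhead + count_tokens(message["role"])
    return raw_text_tokens, structured_estimate
```

With tiktoken, `count_tokens` could be `lambda s: len(enc.encode(s))`. Reporting both numbers side by side, and comparing them with the usage field returned by the API, shows how far the local estimate drifts.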
Compatibility with tiktoken and Transformers
tiktoken is the reference implementation for o200k_base. In Python, the standard usage is:
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Hello world")
text = enc.decode(tokens)
For model-aware loading, use:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
For individual token inspection, prefer byte-level decoding:
token_bytes = enc.decode_single_token_bytes(tokens[0])
For special-token-bearing text, make the handling explicit:
enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
When using Transformers or another tokenizer framework, do not assume that a raw exported tokenizer asset is enough for exact behavioral parity. Verify:
- merge ranks;
- regex pre-tokenization behavior;
- special-token definitions;
- decode behavior;
- handling of disallowed special-token text.
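One way to exercise this checklist is a differential test that runs the same samples through both tokenizers. This is a sketch under stated assumptions. `find_mismatches` is a hypothetical helper, and both encoders are passed in as plain callables so that no particular Transformers API is assumed.

```python
def find_mismatches(reference_encode, candidate_encode, samples):
    """Return (text, expected_ids, actual_ids) for every sample where the encoders disagree."""
    mismatches = []
    for text in samples:
        expected = list(reference_encode(text))
        actual = list(candidate_encode(text))
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches
```

The reference would typically be `tiktoken.get_encoding("o200k_base").encode`. Samples that stress the listed items (long digit runs, mixed whitespace and newlines, non-Latin scripts, emoji, and literal special-token text) are more informative than plain English sentences.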
Developer reference
| Field | Value |
|---|---|
| Encoding name | o200k_base |
| Library | tiktoken |
| Maintainer | OpenAI |
| Method | Byte-pair encoding with regex-guided pre-segmentation |
| Direct loader | tiktoken.get_encoding("o200k_base") |
| Model-aware loader | tiktoken.encoding_for_model(model_name) |
| Common use cases | Token counting, prompt budgeting, retrieval chunking, API-input validation |
| Special tokens | <|endoftext|>, <|endofprompt|> |
| Single-token inspection | Prefer decode_single_token_bytes() |
| Main migration concern | Older model families may use cl100k_base instead |
| Main structured-chat caveat | Raw text tokenization does not necessarily equal full API request accounting |
Minimal token count example
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
text = "Hello world"
tokens = enc.encode(text)
print(tokens)
print(len(tokens))
print(enc.decode(tokens))
Model-aware example
import tiktoken
model = "gpt-4o"
enc = tiktoken.encoding_for_model(model)
print(enc.name)
Safer token inspection example
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
for token_id in enc.encode("Hello world"):
    print(token_id, enc.decode_single_token_bytes(token_id))