Tokenizer profile

o200k_base

OpenAI byte-pair encoding used by newer GPT-4o-era models, with a larger multilingual vocabulary. Reference profile with a long-form technical report.

o200k_base

o200k_base is the latest tokenizer encoding created by OpenAI. It provides general-purpose encoding for current model families including the GPT-5 model family.

What is o200k_base?

o200k_base is a byte-pair encoding (BPE) that encodes arbitrary text into token IDs.

o200k_base is associated with GPT-4o, GPT-4o mini, GPT-4.1-family models, several o-series models, and current frontier gpt-5 models.

History and background

o200k_base was published May 2024 with the launch of GPT-4o by. The tokenizer was described as more efficient on non-English text than GPT-4 Turbo. GPT-4o mini was later described as sharing the same improved tokenizer.

Models that use o200k_base

Model or familyRelationship to o200k_basePractical note
gpt-4o and versioned gpt-4o-* variantsMapped to o200k_base in tiktokenUse model-aware lookup when possible.
gpt-4o-miniMapped to o200k_base in tiktokenShares the GPT-4o tokenizer line.
gpt-4.1 and versioned gpt-4.1-* variantsMapped to o200k_base in tiktokenRelevant for current GPT-4.1-family workflows.
o1, o3, and o4-mini familiesMapped to o200k_base in tiktokenUseful for newer reasoning-model token counting.
Fine-tuned GPT-4o identifiersMapped to o200k_base in tiktokenVerify with encoding_for_model() when available.
gpt-5 identifiersMapped to o200k_base in tiktokenRelevant for current OpenAI model-name mappings.

o200k_harmony is a later related tokenizer but not a documented drop-in replacement for o200k_base.

How o200k_base works

o200k_base uses byte-pair encoding with tokenizer-specific pre-segmentation. In practical terms, the process has three important layers.

First, text is split using a Unicode-aware regex pattern. The pattern groups letter sequences, short digit sequences, punctuation spans, newline groups, and whitespace before BPE merges are applied. This pre-segmentation behavior is part of what makes o200k_base distinct from nearby encodings.

Second, BPE merge ranks are applied to convert byte sequences into token IDs. The merge-ranks data is stored separately from the Python code and is loaded by tiktoken when the encoding is used. First use may require access to the encoding data or a populated local cache.

Third, special-token handling is applied. o200k_base includes a defined special-token set, including end-of-text and end-of-prompt markers. These tokens are not ordinary text fragments. They can carry control meaning in model contexts, so tiktoken handles them conservatively unless the developer explicitly allows them.

Because o200k_base is byte-oriented, a token does not always correspond cleanly to one visible character or one valid UTF-8 string. For inspecting individual tokens, byte-level decode helpers are safer than plain .decode().

Strengths and advantages

Stable reference encoding for current OpenAI workflows

The strongest advantage of o200k_base is its role as a shared operational encoding across current OpenAI model families. A single encoding can support token-counting and chunking logic for GPT-4o, GPT-4o mini, GPT-4.1, newer o-series models, and newer GPT-5 mappings.

Better fit than older encodings for some multilingual text

OpenAI's GPT-4o launch materials describe a tokenizer that is more efficient on non-English text than GPT-4 Turbo. That is a specific documented advantage of the GPT-4o-era tokenizer line and the main reason o200k_base exists as a distinct reference encoding.

Robust handling of arbitrary text

Because o200k_base operates at the byte level, it can handle mixed and messy input without relying on a conventional unknown-token fallback. This is useful for web text, code fragments, URLs, punctuation-heavy content, mixed Unicode, and other real-world input that may not resemble clean prose.

Limitations and quirks

Raw tokenization is not exact chat accounting

Tokenizing the visible text of a prompt is not always the same as the number of tokens billed or counted by a chat API request. Message roles, separators, tool calls, function schemas, and provider-side serialization can add hidden or semi-hidden overhead.

Single-token decoding can be misleading

A single token may not be aligned to a complete UTF-8 character boundary. Calling .decode() on individual tokens can therefore produce confusing or lossy results.

For token inspection, use byte-level APIs such as decode_single_token_bytes() or offset-aware decode helpers.

Best uses and practical guidance

Use o200k_base when the target workflow is explicitly tied to GPT-4o, GPT-4o mini, GPT-4.1, o1, o3, o4-mini, gpt-5, or another model that OpenAI maps to the same encoding.

It is especially suitable for:

  • prompt-budget estimation for current OpenAI chat and reasoning models;
  • retrieval chunk sizing for workflows built around GPT-4o-era models;
  • local validation before API requests;
  • tokenizer inspection tools and reference pages;
  • regression tests for systems that already standardized on o200k_base;
  • comparison against cl100k_base, p50k_base, and r50k_base.

Relationship to cl100k_base and earlier encodings

cl100k_base, p50k_base, r50k_base, and o200k_base are related OpenAI tiktoken encodings, but they are not interchangeable.

r50k_base is closely associated with GPT-2-style and older GPT-3 text-model tokenization. p50k_base is associated with Codex and later GPT-3 completion models such as text-davinci-002 and text-davinci-003. cl100k_base is associated with GPT-3.5 Turbo, GPT-4, newer base models, and OpenAI embedding models. o200k_base is associated with GPT-4o-era and newer model families.

EncodingAssociated era / usePractical guidance
r50k_baseGPT-2 and older GPT-3 text modelsUse for legacy compatibility and historical reproduction.
p50k_baseCodex and text-davinci-002 / text-davinci-003 style modelsUse for older code and completion workflows.
cl100k_baseGPT-3.5 Turbo, GPT-4, newer base models, embedding modelsUse for GPT-4/GPT-3.5-era and embedding workflows mapped to it.
o200k_baseGPT-4o-era and newer model familiesUse for newer models mapped to it.

For token reference work, this distinction is usually what explains a token-count mismatch: the model family changed, the encoding changed, or both changed.

Special tokens and structured-chat caveats

o200k_base includes a published special-token set:

Special tokenTypical meaning
<|endoftext|>End-of-text marker
<|endofprompt|>End-of-prompt marker

These tokens should not be treated as ordinary strings. In tiktoken, special tokens are disallowed by default unless the caller explicitly allows them. That helps prevent accidental conversion of literal text into control tokens.

Structured requests introduce a separate caveat. Provider-side requests are not always raw strings; they can include message wrappers, separators, or other serialized fields. Counting only the visible text with o200k_base can undercount the full request size.

For developer tools, a useful distinction is:

  • raw text tokens;
  • structured-request estimate;
  • provider-reported usage.

That distinction keeps a tokenizer reference page honest about what local tokenization can and cannot guarantee.

Compatibility with tiktoken and Transformers

tiktoken is the reference implementation for o200k_base. In Python, the standard usage is:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Hello world")
text = enc.decode(tokens)

For model-aware loading, use:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

For individual token inspection, prefer byte-level decoding:

token_bytes = enc.decode_single_token_bytes(tokens[0])

For special-token-bearing text, make the handling explicit:

enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})

When using Transformers or another tokenizer framework, do not assume that a raw exported tokenizer asset is enough for exact behavioral parity. Verify:

  • merge ranks;
  • regex pre-tokenization behavior;
  • special-token definitions;
  • decode behavior;
  • handling of disallowed special-token text.

Developer reference

FieldValue
Encoding nameo200k_base
Librarytiktoken
MaintainerOpenAI
MethodByte-pair encoding with regex-guided pre-segmentation
Direct loadertiktoken.get_encoding("o200k_base")
Model-aware loadertiktoken.encoding_for_model(model_name)
Common use casesToken counting, prompt budgeting, retrieval chunking, API-input validation
Special tokens<|endoftext|>, <|endofprompt|>
Single-token inspectionPrefer decode_single_token_bytes()
Main migration concernOlder model families may use cl100k_base instead
Main structured-chat caveatRaw text tokenization does not necessarily equal full API request accounting

Minimal token count example

import tiktoken

enc = tiktoken.get_encoding("o200k_base")
text = "Hello world"
tokens = enc.encode(text)

print(tokens)
print(len(tokens))
print(enc.decode(tokens))

Model-aware example

import tiktoken

model = "gpt-4o"
enc = tiktoken.encoding_for_model(model)
print(enc.name)

Safer token inspection example

import tiktoken

enc = tiktoken.get_encoding("o200k_base")
for token_id in enc.encode("Hello world"):
    print(token_id, enc.decode_single_token_bytes(token_id))

Browse by type

Tokenizer Tools

Token index, decoded values, search, and the live playground.