cl100k_base
cl100k_base is a tokenizer model created by OpenAI. The encoding is associated with GPT-4-era and GPT-3.5-era large language models.
It should be understood as a specific encoding rather than a general tokenizer family. In tiktoken, cl100k_base has its own constructor, merge-ranks file, regex pre-tokenization pattern, and special-token table.
cl100k_base is succeeded by o200k_base.
What is cl100k_base?
cl100k_base is a type of byte-pair encoding (BPE) used in OpenAI’s tiktoken tokenizer library. BPEs can encode arbitrary text into token IDs and decode token IDs back into bytes or text.
cl100k_base is associated with GPT-4, GPT-3.5 Turbo, text-embedding-ada-002, and the text-embedding-3 embedding models.
History and background
cl100k_base was released in December 2022 with the publication of tiktoken.
The encoding appeared after OpenAI’s earlier 50k-era encodings, including r50k_base, p50k_base, and p50k_edit. Those earlier encodings are associated with GPT-2, GPT-3 text models, Codex, and text-davinci-002 / text-davinci-003 style model families. cl100k_base became the practical reference encoding for the next major wave of OpenAI developer usage: GPT-3.5 Turbo, GPT-4, newer base models such as davinci-002 and babbage-002, and embedding models used in retrieval systems.
OpenAI did not release any press at the time of introduction, but cl100k_base enabled embedding workflows and improved support of multilingual text.
Models that use cl100k_base
The following model families are mapped to cl100k_base:
| Model or family | Relationship to cl100k_base | Practical note |
|---|---|---|
gpt-4 and versioned gpt-4-* variants | Mapped to cl100k_base in tiktoken | Use model-aware lookup when possible. |
gpt-3.5-turbo and versioned variants | Mapped to cl100k_base in tiktoken | Chat formatting can add overhead beyond raw text tokens. |
gpt-35-turbo aliases | Mapped to cl100k_base in tiktoken | Relevant for Azure-style naming. |
davinci-002 | Mapped to cl100k_base | Not the same tokenizer as older text-davinci-003. |
babbage-002 | Mapped to cl100k_base | Part of the newer base-model mapping. |
text-embedding-ada-002 | Associated with cl100k_base | Important for older embedding pipelines. |
text-embedding-3-small | OpenAI guidance points to cl100k_base | Use for token counting before embedding calls. |
text-embedding-3-large | OpenAI guidance points to cl100k_base | Use for token counting before embedding calls. |
How cl100k_base works
cl100k_base uses byte-pair encoding with tokenizer-specific pre-segmentation. In practical terms, the process has three important layers.
First, text is split using a Unicode-aware regex pattern. The regex groups pieces of text such as contractions, letter sequences, short digit sequences, punctuation spans, newlines, and whitespace before BPE merges are applied. This pre-segmentation behavior is part of what makes cl100k_base distinct from nearby encodings.
Second, BPE merge ranks are applied to convert byte sequences into token IDs. The merge-ranks data is stored separately from the Python code and is loaded by tiktoken when the encoding is used. This is why first use of a tiktoken encoding may require access to the encoding data or a populated local cache.
Third, special-token handling is applied. cl100k_base includes a defined special-token set, including end-of-text, fill-in-the-middle markers, and end-of-prompt. These tokens are not ordinary text fragments. They can carry control meaning in model contexts, so tiktoken handles them conservatively unless the developer explicitly allows them.
Because cl100k_base is byte-oriented, a token does not always correspond cleanly to one visible character or one valid UTF-8 string. For inspecting individual tokens, byte-level decode helpers are safer than plain .decode().
Strengths and advantages
Stable reference encoding for major OpenAI workflows
The strongest advantage of cl100k_base is its role as a shared operational encoding across important OpenAI model families. A single encoding can support token-counting and chunking logic for GPT-4, GPT-3.5 Turbo, davinci-002, babbage-002, text-embedding-ada-002, and the text-embedding-3 models.
That makes it especially useful in systems that combine chat, retrieval, and embedding workflows from the same generation of OpenAI tooling.
Better fit than older 50k encodings for some multilingual text
OpenAI comparison examples show cl100k_base using fewer tokens than r50k_base and p50k_base on at least some non-English text. For example, a Japanese phrase shown in OpenAI materials takes fewer tokens with cl100k_base than with the older 50k encodings.
This is a specific documented advantage, but it should not be generalized too far. It does not prove that cl100k_base is always more efficient across all languages, all domains, or all input formats.
Robust handling of arbitrary text
Because cl100k_base operates at the byte level, it can handle mixed and messy input without relying on a conventional unknown-token fallback. This is useful for web text, code fragments, URLs, punctuation-heavy content, mixed Unicode, and other real-world input that may not resemble clean prose.
The tradeoff is that the resulting token boundaries can be unintuitive to inspect manually.
Strong tooling support
cl100k_base is directly supported by OpenAI’s tiktoken library, and third-party ports exist in several ecosystems. This makes it easier to build reproducible token-counting tools, prompt-budget checks, retrieval chunkers, and tokenizer reference pages.
For exact compatibility, official tiktoken behavior should be treated as the reference implementation.
Limitations and quirks
Raw tokenization is not exact chat accounting
Tokenizing the visible text of a prompt is not always the same as the number of tokens billed or counted by a chat API request. Message roles, separators, tool calls, function schemas, and provider-side serialization can add hidden or semi-hidden overhead.
For critical limits, local counts should be treated as estimates and validated against actual API usage where possible.
It is not the default for newer OpenAI models
cl100k_base is still current for several model families, but newer flagship and reasoning models are commonly mapped to o200k_base. Code that hard-codes cl100k_base for all OpenAI models will drift out of alignment as newer model families become the default.
Prefer tiktoken.encoding_for_model(model_name) when the goal is to match a specific model.
Token efficiency is input-dependent
cl100k_base is not universally more compact than older encodings. OpenAI comparison examples show multilingual cases where it improves over older encodings, but also simple text examples where older encodings use fewer tokens.
Developers should test representative samples from their own corpus before making assumptions about cost, chunk size, or context-window utilization.
Single-token decoding can be misleading
A single token may not be aligned to a complete UTF-8 character boundary. Calling .decode() on individual tokens can therefore produce confusing or lossy results.
For token inspection, use byte-level APIs such as decode_single_token_bytes() or offset-aware decode helpers.
First-run loading and caching can matter
tiktoken may need to load encoding data on first use, and that data may be cached locally. In offline, serverless, locked-down, or reproducible-build environments, this can become a deployment issue.
Production systems should make tokenizer data availability explicit: pre-cache the encoding, set cache directories deliberately, or vendor the required files where licensing and deployment policy allow.
Best uses and practical guidance
Use cl100k_base when the target workflow is explicitly tied to GPT-4, GPT-3.5 Turbo, davinci-002, babbage-002, text-embedding-ada-002, text-embedding-3-small, or text-embedding-3-large.
It is especially suitable for:
- prompt-budget estimation for GPT-4 and GPT-3.5-era applications;
- retrieval chunk sizing for embedding pipelines that use
text-embedding-ada-002ortext-embedding-3; - local validation before embedding requests;
- tokenizer inspection tools and reference pages;
- regression tests for systems that already standardized on
cl100k_base; - historical comparison against
r50k_base,p50k_base, ando200k_base.
Relationship to r50k_base and p50k_base
r50k_base, p50k_base, and cl100k_base are related OpenAI tiktoken encodings, but they are not interchangeable.
r50k_base is closely associated with GPT-2-style and older GPT-3 text-model tokenization. p50k_base is associated with Codex and later GPT-3 completion models such as text-davinci-002 and text-davinci-003. cl100k_base is associated with GPT-3.5 Turbo, GPT-4, newer base models, and OpenAI embedding models.
Migration from cl100k_base to o200k_base
Migration from cl100k_base to o200k_base should be treated as a model-driven change, not as a blanket tokenizer upgrade.
The main migration concerns are:
- Token counts may change. The same input can become shorter, longer, or differently segmented.
- Chunk boundaries may shift. Retrieval systems that store chunk text, token counts, offsets, or embeddings may need reprocessing.
- Regression tests may fail. Token IDs and token counts are not stable across encodings.
- Cost estimates may change. More compact tokenization on some corpora can improve context packing, but this must be measured on real data.
- Debugging tools need relabeling. A token reference UI should clearly show which encoding produced a token ID.
A conservative migration approach:
1. Identify the actual target model and its mapped encoding. 2. Sample real prompts, documents, and tool-call payloads. 3. Compare token counts under cl100k_base and o200k_base. 4. Re-check chunking assumptions for retrieval workflows. 5. Re-run evals where prompt truncation or chunk boundaries affect output quality. 6. Update developer documentation so cl100k_base is not described as the default OpenAI tokenizer.
Do not describe o200k_base as the formal successor to cl100k_base unless the context is explicitly practical rather than historical. It is better to say that o200k_base is used by newer OpenAI model families, while cl100k_base remains used by several earlier and still-relevant model families.
Special tokens and structured-chat caveats
cl100k_base includes a published special-token set:
| Special token | Typical meaning |
|---|---|
<|endoftext|> | End-of-text marker |
<|fim_prefix|> | Fill-in-the-middle prefix marker |
<|fim_middle|> | Fill-in-the-middle middle marker |
<|fim_suffix|> | Fill-in-the-middle suffix marker |
<|endofprompt|> | End-of-prompt marker |
These tokens should not be treated as ordinary strings. In tiktoken, special tokens are disallowed by default unless the caller explicitly allows them. This helps prevent accidental conversion of literal text into control tokens.
Structured chat introduces a separate caveat. Chat requests are not just raw strings; they include messages, roles, separators, optional tool definitions, function-call structures, and other serialized fields. Counting only the visible message text with cl100k_base can undercount the full request.
For developer tools, a good UI distinction is:
- Raw text tokens: tokens from encoding a plain string.
- Chat estimate: approximate count for a structured message payload.
- API usage: provider-reported count after serialization and processing.
That distinction prevents a tokenizer reference page from overstating what local tokenization can guarantee.
Compatibility with tiktoken and Transformers
tiktoken is the reference implementation for cl100k_base. In Python, the standard usage is:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello world")
text = enc.decode(tokens)
For model-aware loading, use:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
For individual token inspection, prefer byte-level decoding:
token_bytes = enc.decode_single_token_bytes(tokens[0])
For special-token-bearing text, make the handling explicit:
enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
Developer reference
| Field | Value |
|---|---|
| Encoding name | cl100k_base |
| Library | tiktoken |
| Maintainer | OpenAI |
| Method | Byte-pair encoding with regex-guided pre-segmentation |
| Direct loader | tiktoken.get_encoding("cl100k_base") |
| Model-aware loader | tiktoken.encoding_for_model(model_name) |
| Common use cases | Token counting, prompt budgeting, retrieval chunking, embedding-input validation |
| Special tokens | <|endoftext|>, <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>, <|endofprompt|> |
| Single-token inspection | Prefer decode_single_token_bytes() |
| Main migration concern | Newer model families may use o200k_base instead |
| Main structured-chat caveat | Raw text tokenization does not necessarily equal full API request accounting |
Minimal token count example
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "Hello world"
tokens = enc.encode(text)
print(tokens)
print(len(tokens))
print(enc.decode(tokens))
Model-aware example
import tiktoken
model = "gpt-4"
enc = tiktoken.encoding_for_model(model)
print(enc.name)
Safer token inspection example
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
for token_id in enc.encode("Hello world"):
print(token_id, enc.decode_single_token_bytes(token_id))
Browse by type
Tokenizer Tools
Token index, decoded values, search, and the live playground.