cl100k_base

cl100k_base is a tokenizer model created by OpenAI. The encoding is associated with GPT-4-era and GPT-3.5-era large language models.

It should be understood as a specific encoding rather than a general tokenizer family. In tiktoken, cl100k_base has its own constructor, merge-ranks file, regex pre-tokenization pattern, and special-token table.

cl100k_base is succeeded by o200k_base.

What is cl100k_base?

cl100k_base is a type of byte-pair encoding (BPE) used in OpenAI’s tiktoken tokenizer library. BPEs can encode arbitrary text into token IDs and decode token IDs back into bytes or text.

cl100k_base is associated with GPT-4, GPT-3.5 Turbo, text-embedding-ada-002, and the text-embedding-3 embedding models.

History and background

cl100k_base was released in December 2022 with the publication of tiktoken.

The encoding appeared after OpenAI’s earlier 50k-era encodings, including r50k_base, p50k_base, and p50k_edit. Those earlier encodings are associated with GPT-2, GPT-3 text models, Codex, and text-davinci-002 / text-davinci-003 style model families. cl100k_base became the practical reference encoding for the next major wave of OpenAI developer usage: GPT-3.5 Turbo, GPT-4, newer base models such as davinci-002 and babbage-002, and embedding models used in retrieval systems.

OpenAI did not release any press at the time of introduction, but cl100k_base enabled embedding workflows and improved support of multilingual text.

Models that use cl100k_base

The following model families are mapped to cl100k_base:

Model or family	Relationship to `cl100k_base`	Practical note
`gpt-4` and versioned `gpt-4-*` variants	Mapped to `cl100k_base` in `tiktoken`	Use model-aware lookup when possible.
`gpt-3.5-turbo` and versioned variants	Mapped to `cl100k_base` in `tiktoken`	Chat formatting can add overhead beyond raw text tokens.
`gpt-35-turbo` aliases	Mapped to `cl100k_base` in `tiktoken`	Relevant for Azure-style naming.
`davinci-002`	Mapped to `cl100k_base`	Not the same tokenizer as older `text-davinci-003`.
`babbage-002`	Mapped to `cl100k_base`	Part of the newer base-model mapping.
`text-embedding-ada-002`	Associated with `cl100k_base`	Important for older embedding pipelines.
`text-embedding-3-small`	OpenAI guidance points to `cl100k_base`	Use for token counting before embedding calls.
`text-embedding-3-large`	OpenAI guidance points to `cl100k_base`	Use for token counting before embedding calls.

How cl100k_base works

cl100k_base uses byte-pair encoding with tokenizer-specific pre-segmentation. In practical terms, the process has three important layers.

First, text is split using a Unicode-aware regex pattern. The regex groups pieces of text such as contractions, letter sequences, short digit sequences, punctuation spans, newlines, and whitespace before BPE merges are applied. This pre-segmentation behavior is part of what makes cl100k_base distinct from nearby encodings.

Second, BPE merge ranks are applied to convert byte sequences into token IDs. The merge-ranks data is stored separately from the Python code and is loaded by tiktoken when the encoding is used. This is why first use of a tiktoken encoding may require access to the encoding data or a populated local cache.

Third, special-token handling is applied. cl100k_base includes a defined special-token set, including end-of-text, fill-in-the-middle markers, and end-of-prompt. These tokens are not ordinary text fragments. They can carry control meaning in model contexts, so tiktoken handles them conservatively unless the developer explicitly allows them.

Because cl100k_base is byte-oriented, a token does not always correspond cleanly to one visible character or one valid UTF-8 string. For inspecting individual tokens, byte-level decode helpers are safer than plain .decode().

Strengths and advantages

Stable reference encoding for major OpenAI workflows

The strongest advantage of cl100k_base is its role as a shared operational encoding across important OpenAI model families. A single encoding can support token-counting and chunking logic for GPT-4, GPT-3.5 Turbo, davinci-002, babbage-002, text-embedding-ada-002, and the text-embedding-3 models.

That makes it especially useful in systems that combine chat, retrieval, and embedding workflows from the same generation of OpenAI tooling.

Better fit than older 50k encodings for some multilingual text

OpenAI comparison examples show cl100k_base using fewer tokens than r50k_base and p50k_base on at least some non-English text. For example, a Japanese phrase shown in OpenAI materials takes fewer tokens with cl100k_base than with the older 50k encodings.

This is a specific documented advantage, but it should not be generalized too far. It does not prove that cl100k_base is always more efficient across all languages, all domains, or all input formats.

Robust handling of arbitrary text

Because cl100k_base operates at the byte level, it can handle mixed and messy input without relying on a conventional unknown-token fallback. This is useful for web text, code fragments, URLs, punctuation-heavy content, mixed Unicode, and other real-world input that may not resemble clean prose.

The tradeoff is that the resulting token boundaries can be unintuitive to inspect manually.

Strong tooling support

cl100k_base is directly supported by OpenAI’s tiktoken library, and third-party ports exist in several ecosystems. This makes it easier to build reproducible token-counting tools, prompt-budget checks, retrieval chunkers, and tokenizer reference pages.

For exact compatibility, official tiktoken behavior should be treated as the reference implementation.

Limitations and quirks

Raw tokenization is not exact chat accounting

Tokenizing the visible text of a prompt is not always the same as the number of tokens billed or counted by a chat API request. Message roles, separators, tool calls, function schemas, and provider-side serialization can add hidden or semi-hidden overhead.

For critical limits, local counts should be treated as estimates and validated against actual API usage where possible.

It is not the default for newer OpenAI models

cl100k_base is still current for several model families, but newer flagship and reasoning models are commonly mapped to o200k_base. Code that hard-codes cl100k_base for all OpenAI models will drift out of alignment as newer model families become the default.

Prefer tiktoken.encoding_for_model(model_name) when the goal is to match a specific model.

Token efficiency is input-dependent

cl100k_base is not universally more compact than older encodings. OpenAI comparison examples show multilingual cases where it improves over older encodings, but also simple text examples where older encodings use fewer tokens.

Developers should test representative samples from their own corpus before making assumptions about cost, chunk size, or context-window utilization.

Single-token decoding can be misleading

A single token may not be aligned to a complete UTF-8 character boundary. Calling .decode() on individual tokens can therefore produce confusing or lossy results.

For token inspection, use byte-level APIs such as decode_single_token_bytes() or offset-aware decode helpers.

First-run loading and caching can matter

tiktoken may need to load encoding data on first use, and that data may be cached locally. In offline, serverless, locked-down, or reproducible-build environments, this can become a deployment issue.

Production systems should make tokenizer data availability explicit: pre-cache the encoding, set cache directories deliberately, or vendor the required files where licensing and deployment policy allow.

Best uses and practical guidance

Use cl100k_base when the target workflow is explicitly tied to GPT-4, GPT-3.5 Turbo, davinci-002, babbage-002, text-embedding-ada-002, text-embedding-3-small, or text-embedding-3-large.

It is especially suitable for:

prompt-budget estimation for GPT-4 and GPT-3.5-era applications;
retrieval chunk sizing for embedding pipelines that use text-embedding-ada-002 or text-embedding-3;
local validation before embedding requests;
tokenizer inspection tools and reference pages;
regression tests for systems that already standardized on cl100k_base;
historical comparison against r50k_base, p50k_base, and o200k_base.

Relationship to r50k_base and p50k_base

r50k_base, p50k_base, and cl100k_base are related OpenAI tiktoken encodings, but they are not interchangeable.

r50k_base is closely associated with GPT-2-style and older GPT-3 text-model tokenization. p50k_base is associated with Codex and later GPT-3 completion models such as text-davinci-002 and text-davinci-003. cl100k_base is associated with GPT-3.5 Turbo, GPT-4, newer base models, and OpenAI embedding models.

Migration from cl100k_base to o200k_base

Migration from cl100k_base to o200k_base should be treated as a model-driven change, not as a blanket tokenizer upgrade.

The main migration concerns are:

Token counts may change. The same input can become shorter, longer, or differently segmented.
Chunk boundaries may shift. Retrieval systems that store chunk text, token counts, offsets, or embeddings may need reprocessing.
Regression tests may fail. Token IDs and token counts are not stable across encodings.
Cost estimates may change. More compact tokenization on some corpora can improve context packing, but this must be measured on real data.
Debugging tools need relabeling. A token reference UI should clearly show which encoding produced a token ID.

A conservative migration approach:

1. Identify the actual target model and its mapped encoding. 2. Sample real prompts, documents, and tool-call payloads. 3. Compare token counts under cl100k_base and o200k_base. 4. Re-check chunking assumptions for retrieval workflows. 5. Re-run evals where prompt truncation or chunk boundaries affect output quality. 6. Update developer documentation so cl100k_base is not described as the default OpenAI tokenizer.

Do not describe o200k_base as the formal successor to cl100k_base unless the context is explicitly practical rather than historical. It is better to say that o200k_base is used by newer OpenAI model families, while cl100k_base remains used by several earlier and still-relevant model families.

Special tokens and structured-chat caveats

cl100k_base includes a published special-token set:

Special token	Typical meaning
`<\|endoftext\|>`	End-of-text marker
`<\|fim_prefix\|>`	Fill-in-the-middle prefix marker
`<\|fim_middle\|>`	Fill-in-the-middle middle marker
`<\|fim_suffix\|>`	Fill-in-the-middle suffix marker
`<\|endofprompt\|>`	End-of-prompt marker

These tokens should not be treated as ordinary strings. In tiktoken, special tokens are disallowed by default unless the caller explicitly allows them. This helps prevent accidental conversion of literal text into control tokens.

Structured chat introduces a separate caveat. Chat requests are not just raw strings; they include messages, roles, separators, optional tool definitions, function-call structures, and other serialized fields. Counting only the visible message text with cl100k_base can undercount the full request.

For developer tools, a good UI distinction is:

Raw text tokens: tokens from encoding a plain string.
Chat estimate: approximate count for a structured message payload.
API usage: provider-reported count after serialization and processing.

That distinction prevents a tokenizer reference page from overstating what local tokenization can guarantee.

Compatibility with tiktoken and Transformers

tiktoken is the reference implementation for cl100k_base. In Python, the standard usage is:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello world")
text = enc.decode(tokens)

For model-aware loading, use:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

For individual token inspection, prefer byte-level decoding:

token_bytes = enc.decode_single_token_bytes(tokens[0])

For special-token-bearing text, make the handling explicit:

enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})

Developer reference

Field	Value
Encoding name	`cl100k_base`
Library	`tiktoken`
Maintainer	OpenAI
Method	Byte-pair encoding with regex-guided pre-segmentation
Direct loader	`tiktoken.get_encoding("cl100k_base")`
Model-aware loader	`tiktoken.encoding_for_model(model_name)`
Common use cases	Token counting, prompt budgeting, retrieval chunking, embedding-input validation
Special tokens	`<\|endoftext\|>`, `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|endofprompt\|>`
Single-token inspection	Prefer `decode_single_token_bytes()`
Main migration concern	Newer model families may use `o200k_base` instead
Main structured-chat caveat	Raw text tokenization does not necessarily equal full API request accounting

Minimal token count example

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Hello world"
tokens = enc.encode(text)

print(tokens)
print(len(tokens))
print(enc.decode(tokens))

Model-aware example

import tiktoken

model = "gpt-4"
enc = tiktoken.encoding_for_model(model)
print(enc.name)

Safer token inspection example

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for token_id in enc.encode("Hello world"):
    print(token_id, enc.decode_single_token_bytes(token_id))

Browse by type

PunctuationSymbols and marks WhitespaceSpaces, tabs, newlines EmojiPictographs and fragments MultilingualExamples across scripts CodeProgramming tokens Control/special tokensReserved tokenizer values

Tokenizer Tools

Token index, decoded values, search, and the live playground.

Open token index Open playground

cl100k_base

Models using this tokenizer

cl100k_base

What is cl100k_base?

History and background

Models that use cl100k_base

How cl100k_base works

Strengths and advantages

Stable reference encoding for major OpenAI workflows

Better fit than older 50k encodings for some multilingual text

Robust handling of arbitrary text

Strong tooling support

Limitations and quirks

Raw tokenization is not exact chat accounting

It is not the default for newer OpenAI models

Token efficiency is input-dependent

Single-token decoding can be misleading

First-run loading and caching can matter

Best uses and practical guidance

Relationship to r50k_base and p50k_base

Migration from cl100k_base to o200k_base

Special tokens and structured-chat caveats

Compatibility with tiktoken and Transformers

Developer reference

Minimal token count example

Model-aware example

Safer token inspection example

Browse by type

Tokenizer Tools