Tokenizer profile

r50k_base

OpenAI byte-pair encoding matching the public GPT-2 tokenizer vocabulary shape in tiktoken. Reference profile with a long-form technical report.

r50k_base is a legacy tokenizer encoding associated with GPT-2 and older GPT-3-era text models. It remains useful for legacy token counting, historical reproducibility, GPT-2 interoperability, and maintenance of older prompt or search workflows.

Newer OpenAI model families are generally mapped to cl100k_base or o200k_base instead.

What is r50k_base?

r50k_base is a byte-pair encoding (BPE) used in OpenAI's tiktoken tokenizer library. It is a byte-oriented subword tokenizer: it can encode arbitrary text into token IDs and decode token IDs back into bytes or text.

For practical developer use, r50k_base is best treated as a stable compatibility target for GPT-2-style and legacy GPT-3-era tokenization behavior. It is useful when a developer needs token counts that match old davinci-era workflows or wants consistency with GPT-2 tooling.

History and background

r50k_base comes from the GPT-2 tokenizer lineage. GPT-2 used byte-level BPE with a 50,257-token vocabulary, and r50k_base mirrors that public vocabulary shape inside OpenAI's later tiktoken tooling.

The encoding became publicly available with the release of the tiktoken library in December 2022, but its technical lineage is older than that. The practical reason it remained important is straightforward: GPT-3 reused GPT-2-style reversible tokenization, so OpenAI needed a stable GPT-2-compatible encoding in its tokenizer tooling for older model families.

r50k_base predates OpenAI's later encodings. p50k_base became the relevant encoding for Codex and for later completion-style models such as text-davinci-002 and text-davinci-003. cl100k_base then became the practical reference encoding for GPT-3.5 Turbo, GPT-4, and OpenAI embedding models, and newer flagship families moved on to o200k_base.

Models that use r50k_base

The following model families are documented or mapped to r50k_base in OpenAI or tiktoken materials:

| Model or family | Relationship to r50k_base | Practical note |
| --- | --- | --- |
| GPT-2 / gpt2 | Closely associated with r50k_base | Useful for GPT-2-compatible tooling and interoperability. |
| ada, babbage, curie, davinci | Mapped to r50k_base in tiktoken | Legacy GPT-3 base models; now mainly relevant for reproduction of older workflows. |
| text-ada-001, text-babbage-001, text-curie-001, text-davinci-001 | Mapped to r50k_base in tiktoken | Legacy instruct-style text models. |
| text-similarity-* families | Mapped to r50k_base in tiktoken | Relevant for older similarity pipelines only. |
| text-search-*-doc-001 families | Mapped to r50k_base in tiktoken | Legacy search-era token counting target. |
| code-search-ada-code-001, code-search-babbage-code-001 | Mapped to r50k_base in tiktoken | Older code-search embeddings, not current coding models. |
| Fine-tuned variants of mapped legacy families | Can resolve to r50k_base through model-name helpers | Verify with encoding_for_model() when available. |

Important non-use cases matter too. Codex and text-davinci-002 / text-davinci-003 are associated with p50k_base, not r50k_base. GPT-3.5 Turbo, GPT-4, GPT-4o, GPT-4.1, GPT-4.5, GPT-5, and current reasoning-model families are mapped to newer encodings.

How r50k_base works

r50k_base uses byte-pair encoding with GPT-2-style pre-segmentation. In practical terms, the process has three important layers.

First, text is split using a GPT-2-derived regex pattern. This splitting behavior is part of why r50k_base behaves like the public GPT-2 tokenizer family and why it stays close to older GPT-2-compatible implementations.
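
The splitting step can be approximated with Python's standard re module. The pattern below is a simplified stand-in, not the exact pattern tiktoken ships: the GPT-2 pattern relies on Unicode property classes that the standard library does not support, so [^\W\d_] and \d are used here as rough substitutes. The names APPROX_GPT2_SPLIT and pre_segment are illustrative.

```python
import re

# Simplified approximation of the GPT-2 split pattern using only the
# standard library. The real pattern uses \p{L} and \p{N}; here
# [^\W\d_] stands in for letters and \d for digits, so edge cases differ.
APPROX_GPT2_SPLIT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional leading space + letters
    r"| ?\d+"                    # optional leading space + digits
    r"| ?(?:[^\s\w]|_)+"         # optional leading space + punctuation/symbols
    r"|\s+(?!\S)"                # whitespace run not followed by a non-space
    r"|\s+"                      # any other whitespace
)


def pre_segment(text: str) -> list[str]:
    """Split text into the pieces that BPE merging is applied to."""
    return APPROX_GPT2_SPLIT.findall(text)


pieces = pre_segment("Hello world, it's 2024!")
print(pieces)
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Note how the space travels with the word that follows it (' world', ' it'). BPE merges are then applied inside each piece and never across piece boundaries.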

Second, BPE merge ranks are applied to convert byte sequences into token IDs. In tiktoken, r50k_base is distributed as a serialized .tiktoken asset instead of the older encoder.json and vocab.bpe pair commonly associated with GPT-2 tooling.
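
The merge step can be illustrated with a toy rank table. This is a minimal sketch of rank-ordered merging, not tiktoken's optimized implementation, and TOY_RANKS is a hypothetical four-entry table invented for the example. In a real rank table every single byte also has an entry, and the rank of each final byte sequence serves as its token ID.

```python
def bpe_merge(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Repeatedly merge the adjacent pair with the lowest rank."""
    parts = [bytes([b]) for b in piece]  # start from single bytes
    while len(parts) > 1:
        # Find the adjacent pair whose merged form has the lowest rank.
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no mergeable pair left
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return parts


# Hypothetical toy ranks; the real r50k_base table is loaded from the
# .tiktoken asset and is far larger.
TOY_RANKS = {b"ll": 0, b"he": 1, b"hell": 2, b"hello": 3}

print(bpe_merge(b"hello", TOY_RANKS))  # [b'hello']
print(bpe_merge(b"help", TOY_RANKS))   # [b'he', b'l', b'p']
```

Because merging always starts from raw bytes, any input can be represented even when no merges apply.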

Third, special-token handling is applied. r50k_base includes the published <|endoftext|> special token. This token is not just an ordinary text fragment; it can carry control meaning in model contexts, so tiktoken handles it conservatively unless the caller explicitly allows it.

Because r50k_base is byte-oriented, a token does not always correspond cleanly to one visible character or one valid UTF-8 string. For inspecting individual tokens, byte-level decode helpers are safer than plain .decode().
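
The byte-boundary issue can be reproduced without a tokenizer. The sketch below uses a hypothetical split of the UTF-8 bytes of "café" to stand in for two tokens; the boundary was chosen by hand to fall inside a multi-byte character and does not reflect actual r50k_base token boundaries.

```python
# "é" is two bytes in UTF-8, so a byte-level tokenizer may place a token
# boundary between them. The split below is hypothetical, chosen only to
# illustrate the effect.
text = "café"
raw = text.encode("utf-8")            # b'caf\xc3\xa9'
fake_tokens = [raw[:4], raw[4:]]      # boundary falls inside "é"

for token_bytes in fake_tokens:
    # Decoding a fragment on its own yields U+FFFD replacement characters.
    print(token_bytes, repr(token_bytes.decode("utf-8", errors="replace")))

# Joining the bytes first and decoding once recovers the original text.
roundtrip = b"".join(fake_tokens).decode("utf-8")
print(roundtrip)  # café
```

This is why inspecting raw bytes per token, and decoding only complete sequences, tends to give a more faithful picture than decoding token by token.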

Strengths and advantages

Stable compatibility target for GPT-2 and legacy GPT-3 workflows

The strongest advantage of r50k_base is that it preserves the tokenizer behavior expected by GPT-2 and many older GPT-3-era systems. If a team needs to reproduce historical prompt lengths, chunk sizes, or token-budget logic, r50k_base is the right reference point.

This matters most for archived prompt corpora, legacy evaluation setups, and old embeddings or search pipelines that were designed around those model families.
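
Reproducing historical chunk sizes usually comes down to slicing a token ID sequence at a fixed budget. The helper below is a minimal sketch; chunk_token_ids is an illustrative name, and the IDs are placeholders rather than real r50k_base output.

```python
def chunk_token_ids(token_ids: list[int], max_tokens: int) -> list[list[int]]:
    """Split a token ID sequence into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        token_ids[start:start + max_tokens]
        for start in range(0, len(token_ids), max_tokens)
    ]


# Placeholder IDs; in practice these would come from
# tiktoken.get_encoding("r50k_base").encode(document_text).
example_ids = list(range(10))
chunks = chunk_token_ids(example_ids, max_tokens=4)
print([len(c) for c in chunks])  # [4, 4, 2]
```

Chunking on token IDs rather than characters keeps each chunk within the budget exactly, but a chunk boundary can fall inside a multi-byte character, so decoded chunks may begin or end with a replacement character.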

Direct GPT-2 ecosystem alignment

r50k_base is closely aligned with the public GPT-2 tokenizer family. That makes it easier to work across older OpenAI tooling, GPT-2-based research code, and alternative implementations such as Hugging Face GPT-2 tokenizers.

Exact interoperability should still be tested when it matters, but the ecosystem overlap is one of the encoding's main practical benefits.

Robust handling of arbitrary text

Because r50k_base is byte-level, it can tokenize arbitrary Unicode input without depending on a conventional unknown-token fallback. This is useful for messy web text, punctuation-heavy strings, mixed encodings, and other real-world data that may not resemble clean prose.

The tradeoff is that the resulting token boundaries can be unintuitive when inspected manually.

Close relationship to p50k_base on ordinary text

r50k_base and p50k_base overlap heavily on non-code text. That makes transitions between older text-model workflows and nearby Codex-era workflows easier to reason about, at least for ordinary prose.

That does not mean they are interchangeable for code-heavy workloads. p50k_base exists precisely because OpenAI needed a related but distinct encoding for code-oriented families.

Limitations

It is a legacy encoding, not a current default

r50k_base is still available, but it is not the tokenizer developers should assume for current OpenAI work. Modern model families are mapped elsewhere, so hard-coding r50k_base in new systems will often produce the wrong counts and the wrong compatibility assumptions.

Prefer tiktoken.encoding_for_model(model_name) when the goal is to match a live model.

It is not the right reference for Codex-era or modern chat models

Older code-oriented and late-completion models such as Codex and text-davinci-002 / text-davinci-003 are associated with p50k_base, not r50k_base. Modern chat and reasoning models use newer encodings again.

That means "50k-era OpenAI tokenizer" is not precise enough when exact token counting matters.

GPT-2-style space handling can confuse manual inspection

As with GPT-2 tokenizers generally, a leading space is usually folded into the token for the word that follows it. The same visible word can therefore map to different token IDs depending on whether it appears at the start of a string or after whitespace.

That behavior is normal, but it can surprise people comparing counts across trimmed strings or reconstructed prompts.

Special-token and byte-boundary behavior need deliberate API use

By default, tiktoken's encode() raises a ValueError when the input contains literal text that matches a special token such as <|endoftext|>. Also, decoding one token at a time with plain .decode() can be misleading when a token boundary falls inside a multi-byte UTF-8 sequence.

For token inspection or tooling, use explicit special-token controls and prefer decode_single_token_bytes().

Raw text tokenization is not exact API accounting

Even for legacy systems, local tokenization of visible text is not always the same as provider-side request accounting. Hidden separators, serialization format, or endpoint-specific message structure can add overhead outside the plain string.

For hard limits or billing-sensitive logic, validate against real API behavior where possible.

Best uses and practical guidance

Use r50k_base when the target workflow is explicitly tied to GPT-2 or older GPT-3-era text models.

It is especially suitable for:

  • historical prompt-budget estimation for legacy davinci-era workflows;
  • GPT-2-compatible tokenizer tooling and regression tests;
  • reproduction of older research setups or evaluation corpora;
  • legacy similarity and search pipeline maintenance;
  • token reference pages and debugging tools for GPT-2-style vocabularies;
  • comparison against p50k_base, cl100k_base, and o200k_base.

Relationship to gpt2, p50k_base, and cl100k_base

r50k_base, gpt2, p50k_base, and cl100k_base are related OpenAI tokenizer names, but they are not interchangeable.

r50k_base is the tiktoken encoding most closely associated with GPT-2-compatible tokenization. p50k_base is associated with Codex and later completion-style model families. cl100k_base is associated with GPT-3.5 Turbo, GPT-4, newer base models, and OpenAI embedding models.

The practical difference is not just vocabulary size. Encodings can differ in merge ranks, pre-segmentation pattern, special tokens, and model mappings, so the same text can produce different token IDs and different token counts across them.

| Encoding | Associated era / use | Practical guidance |
| --- | --- | --- |
| r50k_base | GPT-2 and older GPT-3 text models | Use for legacy compatibility and historical reproduction. |
| gpt2 | GPT-2 tokenizer naming in older ecosystems | Closely related to r50k_base; useful for interoperability discussions. |
| p50k_base | Codex and text-davinci-002 / text-davinci-003 style models | Use for older code and completion workflows. |
| cl100k_base | GPT-3.5 Turbo, GPT-4, newer base models, embedding models | Use for GPT-4/GPT-3.5-era and embedding workflows mapped to it. |
| o200k_base | Newer flagship and reasoning model families | Use for newer models mapped to it. |

For token reference work, this distinction is usually what explains a token-count mismatch: the model family changed, the encoding changed, or both changed.

Legacy status and continued relevance

r50k_base is best described as a legacy encoding that still matters for compatibility. That combination is common in production systems: the newest tokenizer is not always the one that matters most, because long-lived data pipelines, eval sets, and archived prompts often depend on historical token boundaries.

Its current value is therefore narrow but real:

  • reproducing GPT-2 and older GPT-3 token counts;
  • maintaining old token-budget assumptions during migration work;
  • comparing the behavior of later OpenAI encodings against a GPT-2-style baseline;
  • keeping cross-tool compatibility with GPT-2 tokenizer ecosystems.

Special tokens and structured-input caveats

r50k_base includes a published special token:

Special token    Typical meaning
<|endoftext|>    End-of-text marker

This token should not be treated as an ordinary string. In tiktoken, special tokens are disallowed by default unless the caller explicitly allows them. That helps prevent accidental conversion of literal text into control tokens.

Structured requests introduce a separate caveat. Provider-side requests are not always raw strings; they can include message wrappers, separators, or other serialized fields. Counting only the visible text with r50k_base can undercount the full request size.

For developer tools, a useful distinction is:

  • raw text tokens;
  • structured-request estimate;
  • provider-reported usage.

That distinction keeps a tokenizer reference page honest about what local tokenization can and cannot guarantee.
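
One way to keep the three numbers separate in a tool is to store them in distinct fields. The sketch below is hypothetical throughout: TokenReport, build_report, the whitespace-based counter, and the per-part overhead of 2 are placeholders, not documented values for any API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TokenReport:
    raw_text_tokens: int                     # local count of visible text only
    structured_estimate: int                 # raw count plus assumed overhead
    provider_reported: Optional[int] = None  # filled in from a real response

    def drift(self) -> Optional[int]:
        """Difference between provider usage and the local estimate."""
        if self.provider_reported is None:
            return None
        return self.provider_reported - self.structured_estimate


def build_report(
    parts: list[str],
    count_tokens: Callable[[str], int],
    overhead_per_part: int,
) -> TokenReport:
    """Count each visible part, then add a per-part overhead guess."""
    raw = sum(count_tokens(part) for part in parts)
    return TokenReport(raw, raw + overhead_per_part * len(parts))


# Whitespace splitting stands in for a real tokenizer here; the overhead
# of 2 tokens per part is an arbitrary placeholder.
report = build_report(["You are helpful.", "Hi there"], lambda s: len(s.split()), 2)
print(report)
```

In real use, count_tokens would wrap an r50k_base encoder, and provider_reported would be filled from the usage figure in an actual response so that drift can be monitored over time.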

Compatibility with tiktoken and Transformers

tiktoken is the reference implementation for r50k_base. In Python, the standard usage is:

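import tiktoken

# Load the encoding by name, then round-trip a short string.
enc = tiktoken.get_encoding("r50k_base")
tokens = enc.encode("Hello world")   # list of integer token IDs
text = enc.decode(tokens)            # back to "Hello world"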

For model-aware loading, use:

import tiktoken

enc = tiktoken.encoding_for_model("davinci")

For individual token inspection, prefer byte-level decoding:

# Reuses enc and tokens from the first snippet; returns raw bytes, not str.
token_bytes = enc.decode_single_token_bytes(tokens[0])

For special-token-bearing text, make the handling explicit:

# Without allowed_special, this call raises ValueError.
special_ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})

When using Hugging Face Transformers or another tokenizer framework, do not assume that a raw exported tokenizer asset is enough for exact behavioral parity. Verify:

  • merge ranks;
  • regex pre-tokenization behavior;
  • special-token definitions;
  • decode behavior;
  • handling of disallowed special-token text.
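
The checks above can be automated by running two encoders over the same samples and collecting disagreements. The sketch below uses stand-in encoders so that it runs without any tokenizer installed; find_mismatches and both stand-ins are illustrative names, not part of tiktoken or Transformers.

```python
from typing import Callable, Iterable

Encoder = Callable[[str], list[int]]


def find_mismatches(
    reference: Encoder, candidate: Encoder, samples: Iterable[str]
) -> list[str]:
    """Return the samples on which the two encoders disagree."""
    return [s for s in samples if reference(s) != candidate(s)]


# Stand-in encoders so the sketch runs on its own. In practice, reference
# might be tiktoken's enc.encode and candidate the encode method of
# another GPT-2 tokenizer implementation.
def by_code_point(s: str) -> list[int]:
    return [ord(c) for c in s]


def by_code_point_trimmed(s: str) -> list[int]:
    return [ord(c) for c in s.strip()]


samples = ["Hello world", " leading space", "tabs\tand\nnewlines", "café"]
print(find_mismatches(by_code_point, by_code_point_trimmed, samples))
# [' leading space']
```

A useful sample set includes leading and trailing whitespace, non-ASCII text, long whitespace runs, and literal special-token strings, since those are the places where implementations tend to diverge.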

Developer reference

Field                          Value
Encoding name                  r50k_base
Library                        tiktoken
Maintainer                     OpenAI
Method                         Byte-pair encoding with GPT-2-style regex pre-segmentation
Direct loader                  tiktoken.get_encoding("r50k_base")
Model-aware loader             tiktoken.encoding_for_model(model_name)
Common use cases               Legacy token counting, GPT-2 interoperability, historical prompt reproduction
Special tokens                 <|endoftext|>
Single-token inspection        Prefer decode_single_token_bytes()
Main migration concern         Most current model families use newer encodings
Main structured-input caveat   Raw text tokenization does not necessarily equal full request accounting

Minimal token count example

import tiktoken

enc = tiktoken.get_encoding("r50k_base")
text = "Hello world"
tokens = enc.encode(text)

print(tokens)
print(len(tokens))
print(enc.decode(tokens))

Model-aware example

import tiktoken

model = "davinci"
enc = tiktoken.encoding_for_model(model)
print(enc.name)

Safer token inspection example

import tiktoken

enc = tiktoken.get_encoding("r50k_base")
for token_id in enc.encode("Hello world"):
    print(token_id, enc.decode_single_token_bytes(token_id))
