Token category

p50k_base Multilingual

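Multilingual text often splits into script-specific subwords, punctuation, and byte fragments. Because p50k_base is a byte-level BPE encoding, characters that have no learned token of their own are broken into pieces of their UTF-8 byte sequences. As a result, the same sentence can produce very different token sequences and token counts across languages and writing systems.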
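One reason byte fragments show up is that scripts differ in how many UTF-8 bytes each character needs. The sketch below does not run p50k_base itself; it only compares character and byte counts using the Python standard library. The sample sentences and the helper name `utf8_profile` are hypothetical and chosen for illustration only.

```python
# Illustrative only: compares characters vs. UTF-8 bytes per script.
# SAMPLES and utf8_profile are hypothetical names, not part of any tokenizer API.
SAMPLES = {
    "English": "Hello, world!",
    "Russian": "Привет, мир!",
    "Japanese": "こんにちは世界",
}


def utf8_profile(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for name, text in SAMPLES.items():
    chars, nbytes = utf8_profile(text)
    # A higher bytes-per-character ratio leaves more room for byte-level splits.
    print(f"{name:9} chars={chars:2d} bytes={nbytes:2d} ratio={nbytes / chars:.2f}")
```

Byte counts are only a rough proxy for token counts. To see the actual tokens, the `tiktoken` package, if installed, exposes this encoding through `tiktoken.get_encoding("p50k_base")` and its `encode` method.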
