Merged · Size: L · Change breakdown: 80% bug fix, 20% maintenance
#15232: tokenizer: add byte fallback for SentencePiece BPE encoding

Gemma4 tokenization now preserves Unicode accents and diacritics

Models using SentencePiece BPE encoding (such as Gemma4:31b) no longer silently drop accented characters and diacritics, so foreign-language output should now render correctly.

When Gemma4:31b tokenized text, characters not found in its vocabulary were being silently discarded rather than encoded. This caused accented characters (é, è, ą, ę) to vanish from both content and thinking output, rendering French, Polish, and other accented languages illegible.

The tokenizer now falls back to byte-level encoding when a character isn't in the vocabulary. Each UTF-8 byte becomes a dedicated token (formatted as <0xHH>), and the decoder reverses this process. The SentencePiece decoder also now passes Unicode codepoints through directly, preventing characters like ą and ę from being mangled by the GPT-2 byte reversal logic.
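The fallback described above can be sketched as follows. This is a minimal illustration, not the actual implementation; the function name, the `vocab` dictionary, and the toy vocabulary are all assumptions made for the example.

```python
# Sketch of byte fallback: characters missing from the vocabulary are
# encoded as one <0xHH> token per UTF-8 byte instead of being dropped.
def encode_with_byte_fallback(text, vocab):
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # Fall back to the dedicated byte tokens, one per UTF-8 byte.
            for b in ch.encode("utf-8"):
                ids.append(vocab[f"<0x{b:02X}>"])
    return ids

# Toy vocabulary: byte tokens occupy ids 0-255, then a few characters.
vocab = {f"<0x{i:02X}>": i for i in range(256)}
vocab.update({"c": 256, "a": 257, "f": 258})

# "é" (U+00E9) is absent from the vocab, so it becomes bytes 0xC3 0xA9.
print(encode_with_byte_fallback("café", vocab))  # [256, 257, 258, 195, 169]
```

A real SentencePiece vocabulary reserves exactly these 256 `<0xHH>` entries when byte fallback is enabled, which is what makes the per-byte encoding always possible.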

Users with gemma4:31b should now see properly accented output in French, Polish, and other languages with diacritics.

Original GitHub description:

When BPE merging produces tokens not in the vocabulary, fall back to encoding each UTF-8 byte as <0xHH> byte tokens instead of silently dropping the character. Also teach Decode to convert <0xHH> tokens back to raw bytes.
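The decode direction might look like the sketch below, assuming tokens are already strings; the regex, function name, and error handling are illustrative choices, not the project's actual code.

```python
import re

# <0xHH> tokens are collected as raw bytes; everything else passes
# through as text. The accumulated bytes are then decoded as UTF-8.
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

def decode_tokens(tokens):
    out = bytearray()
    for tok in tokens:
        m = BYTE_TOKEN.match(tok)
        if m:
            out.append(int(m.group(1), 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8", errors="replace")

print(decode_tokens(["caf", "<0xC3>", "<0xA9>"]))  # café
```

Decoding into a byte buffer first (rather than character by character) matters because a multi-byte character such as é spans two `<0xHH>` tokens that are only valid UTF-8 once rejoined.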

Fixes #15229, fixes #15231

© 2026 · via Gitpulse