How to Train a Chinese Tokenizer Model Using SentencePiece

What is SentencePiece

Tokenizer

  • A tokenizer converts text (a sequence of characters) into a sequence of token IDs (numbers).

  • Three levels of granularity: word, character, and subword (subword balances the advantages of the word and character levels).

  • Common subword algorithms:

    • BPE (Byte-Pair Encoding): Start with characters, then iteratively merge the most frequent adjacent token pair until the target vocabulary size is reached (a toy sketch of one merge step follows at the end of this section).
    • BBPE: Extends BPE from character to byte level. Each byte is treated as a character, limiting the base vocabulary to 256 symbols. Pros: cross-lingual vocab sharing, smaller vocab size. Cons: for Chinese, sequence length increases significantly.
    • WordPiece: A variant of BPE based on probability. Instead of merging the most frequent pair, it merges the pair that maximizes the language model likelihood.
    • Unigram: Initializes a large vocabulary and removes tokens iteratively based on a language model until the desired vocabulary size is reached.
    • SentencePiece: Google’s open-source subword toolkit. Treats sentences as a whole, ignoring natural word boundaries. Supports BPE or Unigram algorithms and treats spaces as special characters.
  • Popular model tokenizers:

    Tokenizer                  Example models
    WordPiece                  BERT, DistilBERT
    BBPE (byte-level BPE)      GPT-2, RoBERTa
    SentencePiece (BPE)        LLaMA, ChatGLM, Baichuan
    SentencePiece (Unigram)    T5, ALBERT, XLNet
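
To make the BPE merge step concrete, here is a minimal sketch in plain Python; the toy word counts and helper names are made up for illustration. It counts adjacent symbol pairs and applies a single merge; a real trainer repeats this until the target vocabulary size is reached.

from collections import Counter

# Toy word counts; every word starts as a tuple of single characters.
corpus = Counter({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6})

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Rewrite every word with the chosen pair fused into one new symbol.
    merged = Counter()
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

pair = most_frequent_pair(corpus)   # one merge step, e.g. ('l', 'o') -> 'lo'
corpus = merge_pair(corpus, pair)
print(pair, dict(corpus))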


Introduction to SentencePiece

  • SentencePiece is an unsupervised text tokenizer and detokenizer for neural text generation systems where the vocabulary size is fixed before model training.
  • Implements subword units (BPE, Unigram LM) and can train directly from raw sentences. This allows creating an end-to-end, language-independent NLP system.
    • Supports Chinese, English, Korean, or mixed-language text
    • Pure data-driven end-to-end pipeline
    • No language-specific preprocessing required (useful for multilingual systems)

Features:

  • Train directly from raw sentences
  • Predefined vocabulary size
  • Spaces treated as basic symbols

Environment Setup

SentencePiece has two parts: model training and model usage.

Build and Install SentencePiece CLI

Dependencies:

  • cmake
  • C++11 compiler
  • gperftools (optional, 10–40% speedup)

Install build tools on Ubuntu:

sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Build and install:

git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

Check CLI usage:

spm_train --help

Install Python Wrapper

pip install sentencepiece
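
A quick sanity check that the wrapper is importable (the printed version depends on your install):

import sentencepiece as spm
print(spm.__version__)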

Training a Model

  • Dataset: “Dream of the Red Chamber” (pre-cleaned), saved as train.txt with one sentence per line

Run training:

spm_train --input=train.txt --model_prefix=./tokenizer --vocab_size=4000 --character_coverage=0.9995 --model_type=bpe

Parameters:

  • --input: Training corpus (comma-separated files, one sentence per line). No need for tokenization or preprocessing. SentencePiece applies Unicode NFKC normalization by default.
  • --model_prefix: Output prefix; generates <prefix>.model and <prefix>.vocab.
  • --vocab_size: Vocabulary size (e.g., 4000, 8000, 16000, 32000).
  • --character_coverage: Fraction of characters to cover; for rich character sets (Chinese/Japanese) use 0.9995, otherwise 1.0.
  • --model_type: Model type; options: unigram (default), bpe, char, word. For word, input must be pre-tokenized.
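
The same run can also be launched through the Python wrapper; the sketch below mirrors the spm_train flags above (train.txt and the ./tokenizer prefix come from the command shown earlier).

import sentencepiece as spm

# Equivalent to the spm_train command above; writes tokenizer.model and tokenizer.vocab.
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="./tokenizer",
    vocab_size=4000,
    character_coverage=0.9995,
    model_type="bpe",
)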

Output files:

ls -al tokenizer.model tokenizer.vocab

View vocabulary:

head -n 20 tokenizer.vocab
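
With the files in place, the “model usage” half is just loading tokenizer.model and encoding text. A minimal sketch follows; the sample sentence is an arbitrary line from the novel, and the leading U+2581 “▁” in the output is the symbol SentencePiece uses for spaces and word boundaries.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "满纸荒唐言，一把辛酸泪。"
pieces = sp.encode(text, out_type=str)   # subword pieces; '▁' marks a word/sentence boundary
ids = sp.encode(text, out_type=int)      # corresponding token ids
print(pieces)
print(ids)
print(sp.decode(ids))                    # decodes back to the original text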

Reference:

https://zhuanlan.zhihu.com/p/630696264