mirror of
https://git.FreeBSD.org/ports.git
synced 2025-01-28 10:08:24 +00:00
SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for
neural network-based text generation systems where the vocabulary size is
fixed before the neural model is trained. SentencePiece implements subword
units (e.g., byte-pair encoding (BPE) and the unigram language model) and
extends them with direct training from raw sentences. This makes it possible
to build a purely end-to-end system that does not depend on language-specific
pre- or postprocessing.