Ucto tokenizes text files: it separates words from punctuation and splits
text into sentences. It also offers basic preprocessing steps, such as case
conversion, that make text suitable for further processing such as indexing,
part-of-speech tagging, or machine translation.

Ucto comes with tokenization rules for several languages and can easily be
extended to support others. It is used to tokenize Dutch text in Frog, the
Dutch morpho-syntactic processor by the same authors.

WWW: https://languagemachines.github.io/ucto/