Language/Multiple-languages/Culture/Text-Processing-Tools

From Polyglot Club WIKI
Revision as of 13:44, 20 May 2021 by GrimPixel


In some languages, such as Chinese, Japanese, Lao and Thai, words are not separated by spaces. This causes difficulties for computer programs such as VocabHunter, gritz and text-memorize, which detect words only by the spaces between them.
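
To make the problem concrete, here is a minimal Python sketch. The helper `split_on_spaces` and both sample sentences are made up for illustration; it only imitates the general idea of space-based word detection and is not taken from any of the programs named above.

```python
def split_on_spaces(text):
    """Hypothetical stand-in for a program that finds words by whitespace only."""
    return text.split()


english = "I like learning Chinese"
chinese = "我喜欢学习中文"  # the same sentence, written without spaces

# The English sentence is split into four words.
print(split_on_spaces(english))

# The Chinese sentence comes back as one single "word",
# because there are no spaces to split on.
print(split_on_spaces(chinese))
```

A vocabulary tool working this way would treat the whole Chinese sentence as one unknown word.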

The solution is called “word segmentation”: detecting the words and inserting spaces between them. It sounds easy, but a segmenter has to deal with ambiguities and unknown words, and both speed and accuracy have to be considered.
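
As a rough illustration of the idea, the sketch below uses greedy forward maximum matching, one of the simplest dictionary-based approaches. It is not necessarily how the tools listed on this page work, and the function `segment` and the four-word `TOY_VOCABULARY` are assumptions made up for this example. Real segmenters rely on large dictionaries and statistical models to handle ambiguity.

```python
def segment(text, vocabulary):
    """Greedy forward maximum matching (illustrative sketch only).

    At each position, take the longest dictionary word that starts there.
    If nothing matches, fall back to a single character (an "unknown word").
    """
    max_word_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, far too small for real use.
TOY_VOCABULARY = {"我", "喜欢", "学习", "中文"}

# Insert spaces between the detected words.
print(" ".join(segment("我喜欢学习中文", TOY_VOCABULARY)))
# prints: 我 喜欢 学习 中文
```

The output with spaces can then be passed to a program that detects words by spaces. A character missing from the dictionary, such as 爱 in 我爱中文, is emitted on its own, which shows why unknown words are a weak point of this simple method.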

Here are free and open-source tools to do it:

Chinese:

Japanese:

Thai:


If you don't know Python, please try this:

Contributors

GrimPixel and Maintenance script

