Language/Multiple-languages/Culture/Text-Processing-Tools

From Polyglot Club WIKI
Revision as of 14:16, 20 May 2021


In some languages, such as Chinese, Japanese, Lao and Thai, words are not separated by spaces. In Vietnamese, spaces divide syllables rather than words. This creates difficulties for computer programs like [https://vocabhunter.github.io/ VocabHunter], [https://github.com/jeffkowalski/gritz gritz] and [https://github.com/zg/text-memorize text-memorize], which detect words only by spaces.

The solution is called “[https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation word segmentation]”: detecting word boundaries and inserting spaces between the words. You may ask: if these programs recognise only ordinary spaces as word separators, how can Vietnamese be handled? The answer is to use the [https://en.wikipedia.org/wiki/Non-breaking_space non-breaking space] between the syllables of each word.
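To make the non-breaking-space idea concrete, here is a minimal Python sketch. It assumes the syllables of each word have already been identified by a segmenter; it only shows how gluing them with U+00A0 keeps a multi-syllable Vietnamese word in one piece for tools that split on the ordinary ASCII space:

```python
# U+00A0 NO-BREAK SPACE: looks like a space, but is a different character
# from the ASCII space U+0020 that space-based tools split on.
NBSP = "\u00a0"

def join_word(syllables):
    """Glue the syllables of one word together with no-break spaces."""
    return NBSP.join(syllables)

# "học sinh" ("student") is a single word made of two syllables.
word = join_word(["học", "sinh"])

# Splitting on the ASCII space (as space-based tools do) keeps the
# word intact as a single token.
print(word.split(" "))  # one token, not two
```

Note that Python's `str.split()` with no argument splits on *all* whitespace, including U+00A0, so a tool must split on the plain `" "` for this trick to work.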

The process sounds easy, but segmenters have to deal with ambiguities and unknown words, including proper names, and both speed and accuracy matter. Approaches can be dictionary-based or based on machine learning.
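As a toy illustration of the dictionary-based approach (not the algorithm of any particular tool listed below), here is greedy forward maximum matching in Python: at each position, match the longest word found in a hand-made dictionary, letting unknown single characters pass through on their own:

```python
# A tiny hand-made dictionary; real tools ship dictionaries with
# hundreds of thousands of entries.
DICTIONARY = {"我", "来到", "北京", "清华", "大学", "清华大学"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def segment(text):
    """Greedy forward maximum matching against DICTIONARY."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                # Unknown single characters become one-character "words".
                words.append(candidate)
                i += length
                break
    return words

print(" ".join(segment("我来到北京清华大学")))  # 我 来到 北京 清华大学
```

The ambiguity problem shows up even here: “清华大学” could also match as “清华” + “大学”; greedy longest-match picks one reading, and real segmenters use statistics or machine learning to pick the right one.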


Here are free and open-source tools to do it:

Chinese:

Japanese:

Thai:

Vietnamese:

Many of these tools are written in Python, an important language in machine learning. They are easy to use: create a blank “.py” file, write a line to import from the library, a line to segment the text, and a line to save the result.
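The three-step pattern above can be sketched as follows. The import line and function name vary by library, so a placeholder `segment` function stands in here to keep the sketch self-contained and runnable; replace it with the real library's import and call:

```python
# Step 1: import from the library, e.g. "from some_library import segment".
# Placeholder (an assumption, not a real API) so this sketch runs as-is:
def segment(text):
    # Pretend every character is a word; a real segmenter finds word
    # boundaries and inserts spaces between the detected words.
    return " ".join(text)

# Step 2: segment the text.
result = segment("一二三")

# Step 3: save the result.
with open("segmented.txt", "w", encoding="utf-8") as f:
    f.write(result)
```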

If you don't know Python, please try this: