In some languages, words are not separated by spaces, for example: Chinese, Japanese, Lao and Thai. This causes difficulties for computer programs like VocabHunter (https://vocabhunter.github.io/), gritz (https://github.com/jeffkowalski/gritz) and text-memorize (https://github.com/zg/text-memorize), which detect words only by the spaces between them.
The solution is called “word segmentation”: detecting word boundaries and inserting spaces between words. It sounds easy, but segmenters have to deal with ambiguity and with unknown words such as proper names, and both speed and accuracy have to be considered. For example, the Chinese string 南京市长江大桥 can be segmented as 南京市 / 长江大桥 (“Nanjing Yangtze River Bridge”) or as 南京 / 市长 / 江大桥 (“Nanjing’s mayor, Jiang Daqiao”). Segmentation can be dictionary-based or driven by machine learning.
Here are free and open-source tools to do it:
Chinese:
- Ansj https://github.com/NLPchina/ansj_seg
- CoreNLP https://github.com/stanfordnlp/CoreNLP
- FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
- FudanNLP https://github.com/FudanNLP/fnlp
- HanLP https://github.com/hankcs/HanLP
- jieba https://github.com/fxsjy/jieba
- LAC https://github.com/baidu/lac
- LTP https://github.com/HIT-SCIR/ltp
- SnowNLP https://github.com/isnowfy/snownlp
- pkuseg https://github.com/lancopku/pkuseg-python
- pyhanlp https://github.com/hankcs/pyhanlp
- THULAC https://github.com/thunlp/THULAC-Python
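
To give a feel for what these tools do, here is a minimal sketch using jieba from the list above (the example sentence comes from jieba's own documentation):

    # -*- coding: utf-8 -*-
    import jieba

    text = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"
    # jieba.cut() returns a generator of words; join them with spaces
    print(" ".join(jieba.cut(text)))
    # output: 我 来到 北京 清华大学

jieba ships with a default dictionary, and unknown words are handled by an HMM-based model, so the same three lines work for arbitrary Chinese text.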
Japanese:
- janome https://github.com/mocobeta/janome
- Juman++ https://github.com/ku-nlp/jumanpp
- Kagome https://github.com/ikawaha/kagome
- Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
- KyTea http://www.phontron.com/kytea/
- MeCab https://taku910.github.io/mecab/
- nagisa https://github.com/taishi-i/nagisa
- Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy
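
For Japanese, a similar minimal sketch with janome, which is pure Python and bundles its own dictionary (the example sentence is the classic one from janome's documentation):

    from janome.tokenizer import Tokenizer

    t = Tokenizer()
    text = "すもももももももものうち"  # "plums and peaches are both kinds of peaches"
    # tokenize() yields Token objects; .surface is the word as written
    print(" ".join(token.surface for token in t.tokenize(text)))
    # output: すもも も もも も もも の うち

Most of the other Japanese tools listed here (MeCab, Kuromoji, Sudachi) are morphological analyzers, so they also return part-of-speech and reading information along with the word boundaries.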
Thai:
- Cutkum https://github.com/pucktada/cutkum
- CutThai https://github.com/pureexe/cutthai
- Deepcut https://github.com/rkcosmos/deepcut
- PyThaiNLP https://github.com/PyThaiNLP/pythainlp
- SWATH https://www.cs.cmu.edu/~paisarn/software.html
- SynThai https://github.com/KrakenAI/SynThai
- TLTK https://pypi.org/project/tltk/
- wordcut https://github.com/veer66/wordcut / https://github.com/veer66/wordcutpy
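
And for Thai, a minimal sketch with PyThaiNLP (the example sentence is our own; word_tokenize() uses the library's default engine):

    from pythainlp.tokenize import word_tokenize

    text = "ฉันรักภาษาไทย"  # "I love the Thai language"
    # word_tokenize() returns a list of words
    print(" ".join(word_tokenize(text)))
    # typical output: ฉัน รัก ภาษาไทย
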
If you don't know Python, please try this: