Language/Multiple-languages/Culture/Text-Processing-Tools
Revision as of 14:38, 20 May 2021
This lesson introduces several linguistic tools that are useful for language learners.
Many of the tools introduced here are written in Python, a language widely used in machine learning. They are easy to use: create a blank “.py” file, then write one line to import from the library, one line to segment the text, and one line to save the result.
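As a rough illustration of the three steps just described, the sketch below reads a text file, segments it, and saves the result. The helper name `segment_file` and the file names are hypothetical. `jieba.lcut` appears only in a comment as one possible segmenter and is assumed to be installed separately.

```python
from pathlib import Path


def segment_file(src, dst, segment):
    """Read text from src, split it into words with `segment`, and save it to dst.

    `segment` is any function that takes a string and returns a list of words.
    """
    text = Path(src).read_text(encoding="utf-8")
    words = segment(text)
    # Write the words back out separated by ordinary spaces.
    Path(dst).write_text(" ".join(words), encoding="utf-8")


# Possible usage with jieba (assumption: installed with `pip install jieba`,
# and "input.txt" exists in the current folder):
#   import jieba
#   segment_file("input.txt", "output.txt", jieba.lcut)
```

Passing the segmenter as an argument keeps the sketch independent of any one library, so the same few lines could be tried with other tools from the lists in this lesson.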
If you don't know Python, please try this:
In progress.
Diacritization
In the Arabic writing system, diacritics mark short vowels and other pronunciation details, but they are often omitted in everyday writing. The process of restoring them is called diacritization.
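To see what these tools have to restore, it can help to look at the reverse operation. The sketch below removes diacritics using only Python's standard `unicodedata` module. The function name `strip_diacritics` is hypothetical, and the method is deliberately crude: it is not how any of the tools listed here work, and for Arabic it would also remove marks such as hamza that Unicode decomposition splits off from some letters.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn) from text."""
    # NFD splits letters such as "ù" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual composed form.
    return unicodedata.normalize("NFC", stripped)


# Arabic "kataba" written with its three short-vowel marks (fatha).
print(strip_diacritics("\u0643\u064e\u062a\u064e\u0628\u064e"))
# A Yoruba word with tone marks.
print(strip_diacritics("Yorùbá"))
```

Removing the marks is a mechanical step, while putting them back requires knowing the word and its context, which is why dedicated diacritization tools exist.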
Tools:
Arabic:
- Farasa https://github.com/MagedSaeed/farasapy
- Mishkal https://sourceforge.net/projects/mishkal/
- Pipeline-diacritizer https://github.com/Hamza5/Pipeline-diacritizer
- Shakkala Project https://github.com/Barqawiz/Shakkala
- Shakkelha Website https://github.com/AliOsm/shakkelha-website
Yoruba:
- Yorùbá text https://github.com/Niger-Volta-LTI/yoruba-text
Word segmentation
In some languages, such as Chinese, Japanese, Lao and Thai, words are not separated by spaces. In Vietnamese, spaces separate syllables rather than words. This causes difficulties for computer programs such as VocabHunter, gritz and text-memorize, which detect words only by spaces.
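The problem can be reproduced in a few lines. The sketch below imitates a program that detects words only by spaces. The function name `space_tokens` and both sample sentences are illustrative assumptions, not taken from any of the programs named above.

```python
def space_tokens(text):
    """Split text on ordinary spaces only, as a naive word detector would."""
    return text.split(" ")


chinese = "我喜欢学习中文"               # "I like studying Chinese"
vietnamese = "Việt Nam là một quốc gia"  # "Vietnam is a country"

# The Chinese sentence has no spaces, so it comes back as a single "word".
print(len(space_tokens(chinese)))

# The Vietnamese sentence comes back as six syllables, although
# "Việt Nam" and "quốc gia" are each one word.
print(len(space_tokens(vietnamese)))
```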
The solution is called “word segmentation”: detecting the words and inserting spaces between them. You may ask: if these programs only recognise spaces as word separators, how can Vietnamese be handled? The answer is to use the non-breaking space: the syllables of one word are joined with non-breaking spaces, so that ordinary spaces are left to separate words.
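A toy example may make both ideas concrete. The sketch below uses greedy forward maximum matching against a tiny hand-made word list; this is a textbook technique chosen for brevity, and the tools listed below generally rely on far larger dictionaries or statistical models. The function name `max_match`, the word lists, and the sample sentences are all assumptions made for illustration.

```python
NBSP = "\u00a0"  # non-breaking space


def max_match(units, lexicon, max_len=4):
    """Group units into words by always taking the longest lexicon match.

    `units` are characters (Chinese) or syllables (Vietnamese);
    `lexicon` is a set of tuples of units. Unknown units become
    one-unit words.
    """
    words = []
    i = 0
    while i < len(units):
        for n in range(min(max_len, len(units) - i), 0, -1):
            candidate = tuple(units[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Chinese: units are characters; words are written out separated by spaces.
zh_lexicon = {tuple(w) for w in ["喜欢", "学习", "中文"]}
zh_words = max_match(list("我喜欢学习中文"), zh_lexicon)
zh_result = " ".join("".join(w) for w in zh_words)
print(zh_result)  # 我 喜欢 学习 中文

# Vietnamese: units are syllables; syllables of one word are joined with
# non-breaking spaces, and ordinary spaces separate the words.
vi_lexicon = {tuple(w.split(" ")) for w in ["Việt Nam", "quốc gia"]}
vi_words = max_match("Việt Nam là một quốc gia".split(" "), vi_lexicon)
vi_result = " ".join(NBSP.join(w) for w in vi_words)
print(vi_result.split(" "))  # four words instead of six syllables
```

One detail to watch in Python: calling `split()` with no argument treats the non-breaking space as whitespace and would break the words apart again, so the example splits on `" "` explicitly.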
Tools:
Chinese:
- Ansj https://github.com/NLPchina/ansj_seg
- CoreNLP https://github.com/stanfordnlp/CoreNLP
- FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
- FudanNLP https://github.com/FudanNLP/fnlp
- HanLP https://github.com/hankcs/HanLP
- jieba https://github.com/fxsjy/jieba
- LAC https://github.com/baidu/lac
- LTP https://github.com/HIT-SCIR/ltp
- SnowNLP https://github.com/isnowfy/snownlp
- pkuseg https://github.com/lancopku/pkuseg-python
- pyhanlp https://github.com/hankcs/pyhanlp
- THULAC https://github.com/thunlp/THULAC-Python
Japanese:
- janome https://github.com/mocobeta/janome
- Juman++ https://github.com/ku-nlp/jumanpp
- Kagome https://github.com/ikawaha/kagome
- Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
- KyTea http://www.phontron.com/kytea/
- MeCab https://taku910.github.io/mecab/
- nagisa https://github.com/taishi-i/nagisa
- Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy
Thai:
- Cutkum https://github.com/pucktada/cutkum
- CutThai https://github.com/pureexe/cutthai
- Deepcut https://github.com/rkcosmos/deepcut
- PyThaiNLP https://github.com/PyThaiNLP/pythainlp
- SWATH https://www.cs.cmu.edu/~paisarn/software.html
- SynThai https://github.com/KrakenAI/SynThai
- TLTK https://pypi.org/project/tltk/
- wordcut https://github.com/veer66/wordcut / https://github.com/veer66/wordcutpy
Vietnamese:
- DongDu https://github.com/rockkhuya/DongDu
- JVnSegmenter http://jvnsegmenter.sourceforge.net/
- Roy_VnTokenizer https://github.com/roy-a/Roy_VnTokenizer
- VietSeg https://github.com/manhtai/vietseg
- VnCoreNLP https://github.com/vncorenlp/VnCoreNLP