Difference between revisions of "Language/Multiple-languages/Culture/Text-Processing-Tools"
Revision as of 13:37, 27 April 2022
This lesson introduces several linguistic tools that are useful for ordinary language learners. Their output is not always accurate, so keep that in mind when using them.
Many of the tools introduced are written in Python, which is an important language in machine learning and easy to learn.
If you don't know Python, please try this:
In progress.
Diacritisation
In the Arabic writing system, diacritics mark short vowels and other pronunciation features, but they are often omitted in everyday writing. The process of restoring the diacritics is called diacritisation.
Arabic:
- Arabycia https://github.com/mohabmes/Arabycia
- Farasa https://github.com/MagedSaeed/farasapy
- Mishkal https://sourceforge.net/projects/mishkal/
- Pipeline-diacritizer https://github.com/Hamza5/Pipeline-diacritizer
- Shakkala https://github.com/Barqawiz/Shakkala
- Shakkelha https://github.com/AliOsm/shakkelha
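To see what these tools have to restore, it can help to look at the reverse operation. The sketch below is not taken from any of the tools listed; it only removes the common Arabic diacritics (U+064B to U+0652) from a vocalised word, using the Python standard library. The function name strip_diacritics is a hypothetical name chosen for illustration.

```python
# Code points U+064B..U+0652 cover the tanwin marks, fatha, damma, kasra,
# shadda and sukun. Other, rarer marks are deliberately ignored here.
FIRST_MARK = "\u064b"
LAST_MARK = "\u0652"


def strip_diacritics(text):
    """Return text with the common Arabic diacritics removed."""
    return "".join(ch for ch in text if not (FIRST_MARK <= ch <= LAST_MARK))


# "kataba" (he wrote), written with a fatha after each consonant.
vocalised = "\u0643\u064e\u062a\u064e\u0628\u064e"
print(strip_diacritics(vocalised))  # the three bare consonants k-t-b
```

Removing the marks is trivial, while restoring them requires knowing the word and its context, which is why dedicated diacritisation tools exist.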
Lemmatisation
When you look up a word in an inflected form, a dictionary program can show you the entry for its lemma (the dictionary form). The step that maps the inflected form to its lemma is called lemmatisation.
Multiple languages:
- CST's lemmatiser https://www.cst.dk/online/lemmatiser/
- Natural Language Toolkit https://www.nltk.org/
- Pattern https://github.com/clips/pattern
- TextBlob https://textblob.readthedocs.io/
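As a rough illustration of the idea, here is a toy English lemmatiser that first checks a small table of irregular forms and then tries a few suffix rules. It is only a sketch: the table, the rules and the function name toy_lemmatise are invented for this example, and the tools listed above work from far larger dictionaries or trained models.

```python
# Irregular forms cannot be derived by rule, so they are looked up directly.
IRREGULAR = {"went": "go", "mice": "mouse", "was": "be", "better": "good"}

# (suffix, replacement) pairs, tried in order; the first match wins.
SUFFIX_RULES = [("ies", "y"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_lemmatise(word):
    """Map an inflected English word to a guessed lemma."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three letters of the stem so short words stay intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)] + replacement
    return w


for form in ["went", "studies", "walked", "cats"]:
    print(form, "->", toy_lemmatise(form))
```

Rules this crude make mistakes (for instance, "running" becomes "runn"), which is one reason the results of lemmatisers should be checked against a dictionary.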
Pitch-Accent Marking
In Japanese and some other languages, pitch accent is important for distinguishing words. It is not written in ordinary text, yet it is required for correct pronunciation.
Japanese:
- Prosody Tutor Suzuki-kun http://www.gavo.t.u-tokyo.ac.jp/ojad/phrasing/index
- tdmelodic https://github.com/PKSHATechnology-Research/tdmelodic
Stress Marking
In Russian and some other languages, stress is important for distinguishing words. Stress marks are usually omitted in writing.
Russian:
- RussianGram https://russiangram.com/
- Russian Stress Finder https://www.readyrussian.org/WebApps/StressFinder/
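The following sketch shows how a stress mark is usually represented in text: a combining acute accent (U+0301) is placed right after the stressed vowel. The function name mark_stress and the hand-supplied vowel positions are assumptions for illustration; the tools above find the stressed vowel themselves, which is the hard part.

```python
COMBINING_ACUTE = "\u0301"  # renders as an accent over the preceding letter


def mark_stress(word, vowel_index):
    """Insert a combining acute accent after the letter at vowel_index."""
    return word[: vowel_index + 1] + COMBINING_ACUTE + word[vowel_index + 1:]


# The same spelling is two different words depending on the stress:
castle = mark_stress("замок", 1)  # stress on the first syllable: "castle"
lock = mark_stress("замок", 3)    # stress on the second syllable: "lock"
print(castle, lock)
```

Because the accent is a separate combining character, the marked word is one character longer than the original, which matters when comparing or searching text.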
Part of Speech Tagging
Part-of-speech tagging labels each word in a sentence with its part of speech. Some of these tools can also draw parse trees.
Multiple languages:
- Natural Language Toolkit https://www.nltk.org/
- spaCy https://spacy.io/
- Stanford Log-linear Part-Of-Speech Tagger https://nlp.stanford.edu/software/tagger.shtml
Word Segmentation
In some languages, such as Chinese, Japanese, Lao and Thai, words are not separated by spaces. In Vietnamese, spaces divide syllables rather than words. This causes difficulties for programs such as VocabHunter, gritz and text-memorize, which detect words only by spaces.
The solution is called “word segmentation”: it detects the words and either inserts spaces between them or puts the segmented words into a list.
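A classic baseline for this task is forward maximum matching: at each position, take the longest dictionary word that matches, and fall back to a single character when nothing matches. The sketch below is a minimal version with a tiny hand-made vocabulary; the function name and the vocabulary are invented for illustration, and the tools listed below are typically more sophisticated.

```python
def forward_max_match(text, vocabulary):
    """Segment text greedily, preferring the longest known word at each step."""
    max_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


vocabulary = {"我", "喜欢", "学习", "中文"}
segments = forward_max_match("我喜欢学习中文", vocabulary)
print(segments)            # the segmented words as a list
print(" ".join(segments))  # the same words with spaces inserted
```

The two print calls correspond to the two output styles mentioned above: a list of words, or the original text with spaces inserted so that space-based programs can read it.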
Chinese:
- Ansj https://github.com/NLPchina/ansj_seg
- CoreNLP https://github.com/stanfordnlp/CoreNLP
- FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
- FudanNLP https://github.com/FudanNLP/fnlp
- HanLP https://github.com/hankcs/HanLP
- jieba https://github.com/fxsjy/jieba
- LAC https://github.com/baidu/lac
- LTP https://github.com/HIT-SCIR/ltp
- SnowNLP https://github.com/isnowfy/snownlp
- pkuseg https://github.com/lancopku/pkuseg-python
- pyhanlp https://github.com/hankcs/pyhanlp
- THULAC https://github.com/thunlp/THULAC-Python
Japanese:
- janome https://github.com/mocobeta/janome
- Juman++ https://github.com/ku-nlp/jumanpp
- Kagome https://github.com/ikawaha/kagome
- Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
- KyTea http://www.phontron.com/kytea/
- MeCab https://taku910.github.io/mecab/
- nagisa https://github.com/taishi-i/nagisa
- Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy
Lao:
- Lao Word-Segmentation https://github.com/frankxayachack/LaoWordSegmentation
Thai:
- Cutkum https://github.com/pucktada/cutkum
- CutThai https://github.com/pureexe/cutthai
- Deepcut https://github.com/rkcosmos/deepcut
- PyThaiNLP https://github.com/PyThaiNLP/pythainlp
- SWATH https://www.cs.cmu.edu/~paisarn/software.html
- SynThai https://github.com/KrakenAI/SynThai
- TLTK https://pypi.org/project/tltk/
- wordcut https://github.com/veer66/wordcut / https://github.com/veer66/wordcutpy
Vietnamese:
- DongDu https://github.com/rockkhuya/DongDu
- JVnSegmenter http://jvnsegmenter.sourceforge.net/
- Roy_VnTokenizer https://github.com/roy-a/Roy_VnTokenizer
- VietSeg https://github.com/manhtai/vietseg
- VnCoreNLP https://github.com/vncorenlp/VnCoreNLP