Difference between revisions of "Language/Multiple-languages/Culture/Text-Processing-Tools"

Revision as of 13:59, 11 March 2022

This lesson introduces several text-processing tools that are useful for language learners. Their output is not always accurate, so keep that in mind when using them.

Many of the tools introduced here are written in Python, a language that is widely used in machine learning and is easy to learn.

If you don't know Python, please try this:

In progress.

Diacritization

In the Arabic writing system, diacritics mark short vowels and other details of pronunciation, but they are often omitted so that text can be written more fluently. The process of restoring the diacritics is called diacritization.

Arabic:

Arabycia https://github.com/mohabmes/Arabycia
Farasa https://github.com/MagedSaeed/farasapy
Mishkal https://sourceforge.net/projects/mishkal/
Pipeline-diacritizer https://github.com/Hamza5/Pipeline-diacritizer
Shakkala https://github.com/Barqawiz/Shakkala
Shakkelha https://github.com/AliOsm/shakkelha
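
To make concrete what these tools restore, here is a minimal sketch of the reverse operation: removing the diacritics from a fully vocalized word. It uses only the Python standard library and assumes that every combining mark (Unicode category Mn) in the input is a diacritic to be dropped. The function name is hypothetical and is not taken from any of the tools listed above. Restoring the marks is the hard direction and needs one of those tools.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which include the
    Arabic short-vowel signs such as fatha, damma and kasra."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "kataba" (he wrote), written with a fatha (U+064E) after each consonant.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)

print(vocalized, "->", bare)
print(len(vocalized), "code points ->", len(bare), "code points")
```

The bare form is what usually appears in everyday text. A diacritizer has to choose the right marks from context, which is why its output can be wrong.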

Pitch-Accent Marking

In Japanese and some other languages, pitch accent is important for distinguishing words. It is not written in ordinary text, yet it is required for correct pronunciation.

Japanese:

Prosody Tutor Suzuki-kun http://www.gavo.t.u-tokyo.ac.jp/ojad/phrasing/index
tdmelodic https://github.com/PKSHATechnology-Research/tdmelodic
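
As an illustration of how pitch accent separates words that are spelled the same, the sketch below turns an accent number into a high/low pattern. It assumes the convention commonly used for Tokyo Japanese in accent dictionaries: 0 means the pitch never drops, and n means the pitch drops after the n-th mora. The function name and the H/L output format are hypothetical and are not the output format of the tools listed above.

```python
def pitch_pattern(mora_count: int, accent: int) -> str:
    """Return an H/L pattern for a word, followed by '-' and the pitch
    of a following particle, under the accent-number convention
    described in the lead-in (an assumption, not a tool's API)."""
    if mora_count < 1 or not 0 <= accent <= mora_count:
        raise ValueError("accent must be between 0 and mora_count")

    pattern = []
    for position in range(1, mora_count + 1):
        if accent == 1:
            high = position == 1            # high first mora, then low
        elif accent == 0:
            high = position > 1             # low first mora, then stays high
        else:
            high = 1 < position <= accent   # rises after mora 1, drops after mora n
        pattern.append("H" if high else "L")

    # A following particle stays high only when the word has no drop.
    particle = "H" if accent == 0 else "L"
    return "".join(pattern) + "-" + particle


# Three words all written "hashi" in kana, each with two morae:
print("chopsticks:", pitch_pattern(2, 1))  # HL-L
print("bridge:    ", pitch_pattern(2, 2))  # LH-L
print("edge:      ", pitch_pattern(2, 0))  # LH-H
```

The last two words differ only in the pitch of the particle that follows them, which is one reason automatic marking is useful for learners.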

Stress Generation

In Russian and some other languages, stress is important for distinguishing words. Stress marks are usually omitted in ordinary text.

Russian:

RussianGram https://russiangram.com/
Russian Stress Finder https://www.readyrussian.org/WebApps/StressFinder/
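
Stress in Russian learning materials is usually shown with an acute accent over the stressed vowel. The sketch below shows how such a mark can be added or removed, using the Unicode combining acute accent (U+0301). It assumes you already know which vowel is stressed; working that out is the job of the tools above. The function names are hypothetical.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")
COMBINING_ACUTE = "\u0301"


def mark_stress(word: str, vowel_number: int) -> str:
    """Insert a combining acute accent after the vowel_number-th vowel
    of the word (counting from 1)."""
    seen = 0
    for i, ch in enumerate(word):
        if ch in RUSSIAN_VOWELS:
            seen += 1
            if seen == vowel_number:
                return word[: i + 1] + COMBINING_ACUTE + word[i + 1 :]
    raise ValueError(f"{word!r} has fewer than {vowel_number} vowels")


def strip_stress(text: str) -> str:
    """Remove every combining acute accent from the text."""
    return text.replace(COMBINING_ACUTE, "")


castle = mark_stress("замок", 1)  # stress on the first vowel: "castle"
lock = mark_stress("замок", 2)    # stress on the second vowel: "lock"

print(castle, lock)
print(strip_stress(castle) == strip_stress(lock))  # both reduce to the same spelling
```

Without the mark the two words are spelled identically, so a reader, or a stress tool, has to decide from context.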

Word Segmentation

In some languages, such as Chinese, Japanese, Lao and Thai, words are not separated by spaces. In Vietnamese, spaces separate syllables rather than words. This causes difficulties for programs such as VocabHunter (https://vocabhunter.github.io/), gritz (https://github.com/jeffkowalski/gritz) and text-memorize (https://github.com/zg/text-memorize), which detect words only by spaces.

The solution is called “word segmentation”: detecting the words and either inserting spaces between them or putting the segmented words into a list. You may ask: if these programs recognise only spaces as word separators, how can Vietnamese be handled? The answer is to use the non-breaking space: join the syllables within each word with non-breaking spaces, leaving ordinary spaces only between words.
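
To show what segmentation produces, here is a toy sketch of forward maximum matching, one of the simplest dictionary-based approaches: at each position it takes the longest dictionary word that matches, and falls back to a single character when nothing matches. The tiny lexicon is made up for this example. The tools listed below typically rely on much larger dictionaries and on statistical or neural models, so treat this only as an illustration of the idea.

```python
def segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy forward maximum matching over a set of known words."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
lexicon = {"来到", "北京", "清华", "大学", "清华大学"}

words = segment("我来到北京清华大学", lexicon)
print(words)            # segmented words as a list
print(" ".join(words))  # the same words with spaces inserted in between
```

Either output form can then be passed to a program that detects words only by spaces.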

Chinese:

Ansj https://github.com/NLPchina/ansj_seg
CoreNLP https://github.com/stanfordnlp/CoreNLP
FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
FudanNLP https://github.com/FudanNLP/fnlp
HanLP https://github.com/hankcs/HanLP
jieba https://github.com/fxsjy/jieba
LAC https://github.com/baidu/lac
LTP https://github.com/HIT-SCIR/ltp
SnowNLP https://github.com/isnowfy/snownlp
pkuseg https://github.com/lancopku/pkuseg-python
pyhanlp https://github.com/hankcs/pyhanlp
THULAC https://github.com/thunlp/THULAC-Python

Japanese:

janome https://github.com/mocobeta/janome
Juman++ https://github.com/ku-nlp/jumanpp
Kagome https://github.com/ikawaha/kagome
Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
KyTea http://www.phontron.com/kytea/
MeCab https://taku910.github.io/mecab/
nagisa https://github.com/taishi-i/nagisa
Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy

Lao:

Lao Word-Segmentation https://github.com/frankxayachack/LaoWordSegmentation

Thai:

Cutkum https://github.com/pucktada/cutkum
CutThai https://github.com/pureexe/cutthai
Deepcut https://github.com/rkcosmos/deepcut
PyThaiNLP https://github.com/PyThaiNLP/pythainlp
SWATH https://www.cs.cmu.edu/~paisarn/software.html
SynThai https://github.com/KrakenAI/SynThai
TLTK https://pypi.org/project/tltk/
wordcut https://github.com/veer66/wordcut / https://github.com/veer66/wordcutpy

Vietnamese:

DongDu https://github.com/rockkhuya/DongDu
JVnSegmenter http://jvnsegmenter.sourceforge.net/
Roy_VnTokenizer https://github.com/roy-a/Roy_VnTokenizer
VietSeg https://github.com/manhtai/vietseg
VnCoreNLP https://github.com/vncorenlp/VnCoreNLP
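
The non-breaking space approach for Vietnamese can be sketched as follows. The input is assumed to be a list of already segmented words, such as one of the tools above might produce, where a word may contain several syllables. The function name is hypothetical. The sketch also assumes that the downstream program splits only on the ordinary space character (U+0020).

```python
NBSP = "\u00a0"  # non-breaking space


def join_segmented(words: list[str]) -> str:
    """Join syllables inside each word with NBSP and separate the
    words themselves with ordinary spaces."""
    return " ".join(NBSP.join(word.split()) for word in words)


# "Students study biology": three words, five syllables.
segmented = ["học sinh", "học", "sinh học"]
text = join_segmented(segmented)

# Splitting on the ordinary space keeps multi-syllable words together.
print(text.split(" "))

# Caution: Python's split() without an argument treats NBSP as
# whitespace too, so it would break the words back into syllables.
print(text.split())
```

A program that splits on every kind of whitespace, as the second call does, will not benefit from this trick, so it is worth checking how the target program detects words.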

