Difference between revisions of "Language/Multiple-languages/Culture/Text-Processing-Tools"

Revision as of 11:03, 27 March 2023

This lesson discusses several linguistic tools that are useful for ordinary language learners. Keep in mind that they are not always accurate.

Many of the tools introduced here are written in Python, a language that is widely used in machine learning and easy to learn.

If you don't know Python, please try this:

In progress.

Diacritisation

In the Arabic writing system, diacritics mark short vowels and other details of pronunciation, but they are usually omitted in everyday writing. The process of restoring the diacritics is called diacritisation.
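
To see what diacritisation has to restore, it can help to look at the reverse step. The sketch below is only an illustration written with Python's standard library and is not how any of the listed tools work. It assumes that the common Arabic vowel marks sit in the Unicode range U+064B to U+0652, and the name strip_diacritics is made up for this example.

```python
import re

# Assumption: the common Arabic marks (tanwin, fatha, damma, kasra,
# shadda, sukun) occupy the Unicode range U+064B..U+0652.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove the marks that everyday Arabic writing usually leaves out."""
    return ARABIC_DIACRITICS.sub("", text)


# "kataba" (he wrote), written with a fatha after each consonant.
marked = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = strip_diacritics(marked)

print(marked, "->", plain)
print(len(marked), "code points ->", len(plain), "code points")
```

A diacritiser has to do the opposite: starting from the plain form, it must guess which marks belong on each letter, which is why the results are not always accurate.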

Arabic:

Arabycia https://github.com/mohabmes/Arabycia
Farasa https://github.com/MagedSaeed/farasapy
Mishkal https://sourceforge.net/projects/mishkal/
Pipeline-diacritizer https://github.com/Hamza5/Pipeline-diacritizer
Shakkala https://github.com/Barqawiz/Shakkala
Shakkelha https://github.com/AliOsm/shakkelha

Lemmatisation

When you look up a word in an inflected form, a dictionary program can show you the entry for its lemma (the base form). The step that maps the inflected form to the lemma is called lemmatisation.
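
The idea can be sketched with a toy lookup table. This is an assumption-heavy illustration, not the method of any tool listed here: the tables LEMMAS and ENTRIES and the function names are invented for the example, and real lemmatisers rely on much larger lexicons and language-specific rules.

```python
# Hypothetical mini-lexicon mapping inflected forms to lemmas.
LEMMAS = {"went": "go", "goes": "go", "mice": "mouse", "children": "child"}

# Hypothetical dictionary entries, stored under the lemma only.
ENTRIES = {
    "go": "to move from one place to another",
    "mouse": "a small rodent",
    "child": "a young person",
}


def lemmatise(word: str) -> str:
    """Return the lemma of a word, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)


def look_up(word: str):
    """Lemmatise first, then fetch the dictionary entry (None if missing)."""
    return ENTRIES.get(lemmatise(word))


print(lemmatise("Went"))
print(look_up("mice"))
```

Searching for "mice" finds the entry stored under "mouse" because the lemmatisation step runs before the dictionary lookup.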

Multiple languages:

CST's lemmatiser https://www.cst.dk/online/lemmatiser/
Natural Language Toolkit https://www.nltk.org/
Pattern https://github.com/clips/pattern
TextBlob https://textblob.readthedocs.io/

Pitch-Accent Marking

In Japanese and some other languages, pitch accent is important for distinguishing words. It is not written in ordinary text, but it is needed for correct pronunciation.

Japanese:

Prosody Tutor Suzuki-kun http://www.gavo.t.u-tokyo.ac.jp/ojad/phrasing/index
tdmelodic https://github.com/PKSHATechnology-Research/tdmelodic

Stress Marking

In Russian and some other languages, stress is important for distinguishing words. Stress marks are usually omitted in ordinary text.
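
As a small illustration of what a stress mark looks like in text, the sketch below inserts a combining acute accent (U+0301) after a chosen vowel. It is not taken from the tools listed here. The function name mark_stress is made up, and the position of the stressed vowel is supplied by hand, whereas a real tool would look it up.

```python
COMBINING_ACUTE = "\u0301"  # combining acute accent, used here as the stress mark


def mark_stress(word: str, vowel_index: int) -> str:
    """Insert a stress mark right after the character at vowel_index."""
    if not 0 <= vowel_index < len(word):
        raise IndexError("vowel_index is outside the word")
    return word[: vowel_index + 1] + COMBINING_ACUTE + word[vowel_index + 1 :]


word = "замок"
castle = mark_stress(word, 1)  # stress on the first syllable: "castle"
lock = mark_stress(word, 3)    # stress on the second syllable: "lock"

print(castle, lock)
```

The two results are spelled the same without the mark, which is why a learner cannot tell them apart in ordinary text.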

Russian:

RussianGram https://russiangram.com/
Russian Stress Finder https://www.readyrussian.org/WebApps/StressFinder/

Transcription

Some languages are written in more than one writing system. These tools convert text from one system to another.
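
At its simplest, such a conversion is a character-by-character substitution. The sketch below shows this with Python's built-in str.translate and a hypothetical three-character table from simplified to traditional Chinese. It is only an illustration: real converters use tables with thousands of entries and must also handle characters whose conversion depends on context.

```python
# Hypothetical mini-table for illustration; characters that are not
# listed are left unchanged.
SIMPLIFIED_TO_TRADITIONAL = str.maketrans({"汉": "漢", "语": "語", "学": "學"})


def to_traditional(text: str) -> str:
    """Replace each simplified character that appears in the table."""
    return text.translate(SIMPLIFIED_TO_TRADITIONAL)


print(to_traditional("学汉语"))
print(to_traditional("中文"))  # both characters are the same in either script
```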

Chinese:

Chinese-Tools.com https://www.chinese-tools.com/tools/converter-simptrad.html
ChineseConverter.com https://www.chineseconverter.com/en/convert/simplified-to-traditional
hanzi2reading https://github.com/bdon/hanzi2reading
OMGChinese.com https://www.omgchinese.com/tools/chinese-simplified-traditional-converter

Part of Speech Tagging

These tools tag each word in a sentence with its part of speech. Some of them can also draw parse trees.
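
The output of a tagger is usually a list of word and tag pairs. The sketch below imitates that shape with a hypothetical five-word lexicon. The tag names follow the Universal POS style (DET, NOUN, VERB, and X for unknown words), which is an assumption made for this example. Real taggers use statistical models and the surrounding context rather than a fixed table.

```python
# Hypothetical lexicon for illustration only.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "cat": "NOUN",
    "bird": "NOUN",
    "sees": "VERB",
}


def tag(sentence: str) -> list:
    """Pair each space-separated word with a tag; unknown words get "X"."""
    return [(word, LEXICON.get(word.lower(), "X")) for word in sentence.split()]


for word, pos in tag("The cat sees a bird"):
    print(word, pos)
```

The call to split() only works because the example sentence is in English. For languages written without spaces, the text has to be segmented into words first.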

Multiple languages:

CoreNLP https://github.com/stanfordnlp/CoreNLP
Natural Language Toolkit https://www.nltk.org/
spaCy https://spacy.io/
Stanford Log-linear Part-Of-Speech Tagger https://nlp.stanford.edu/software/tagger.shtml

Arabic:

Arabycia https://github.com/mohabmes/Arabycia

Chinese:

FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
FudanNLP https://github.com/FudanNLP/fnlp
HanLP https://github.com/hankcs/HanLP
LAC https://github.com/baidu/lac
LTP https://github.com/HIT-SCIR/ltp
SnowNLP https://github.com/isnowfy/snownlp
pkuseg https://github.com/lancopku/pkuseg-python
pyhanlp https://github.com/hankcs/pyhanlp
THULAC https://github.com/thunlp/THULAC-Python

Japanese:

janome https://github.com/mocobeta/janome
Juman++ https://github.com/ku-nlp/jumanpp
Kagome https://github.com/ikawaha/kagome
Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
KyTea http://www.phontron.com/kytea/
MeCab https://taku910.github.io/mecab/
nagisa https://github.com/taishi-i/nagisa
Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy

Thai:

PyThaiNLP https://github.com/PyThaiNLP/pythainlp
SynThai https://github.com/KrakenAI/SynThai
TLTK https://pypi.org/project/tltk/

Vietnamese:

JVnSegmenter http://jvnsegmenter.sourceforge.net/
VnCoreNLP https://github.com/vncorenlp/VnCoreNLP

Word Segmentation

In some languages, such as Chinese, Japanese, Lao and Thai, words are not separated by spaces. In Vietnamese, spaces separate syllables rather than words. This causes difficulties for programs such as VocabHunter, gritz and text-memorize, which detect words only by spaces.

The solution is called “word segmentation”: a program detects the words and either inserts spaces between them or puts the segmented words into a list.
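
One classic way to do this is greedy forward maximum matching against a word list. The sketch below is a minimal version of that technique with a hypothetical four-word vocabulary. It is not claimed to be the algorithm used by any tool listed in this lesson, many of which rely on statistical or neural models instead.

```python
def segment(text: str, vocabulary: set) -> list:
    """Greedy forward maximum matching over a known vocabulary."""
    max_len = max(map(len, vocabulary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos : pos + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical vocabulary for illustration.
VOCABULARY = {"我", "喜欢", "学习", "中文"}
sentence = "我喜欢学习中文"

print(sentence.split())                         # splitting on spaces finds one "word"
print(segment(sentence, VOCABULARY))            # segmented words as a list
print(" ".join(segment(sentence, VOCABULARY)))  # segmented words joined by spaces
```

The last two lines correspond to the two kinds of output described above: a list of words, or the original text with spaces inserted. Greedy matching can pick the wrong split when words overlap, which is one reason the results of real tools should still be checked.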

Chinese:

Ansj https://github.com/NLPchina/ansj_seg
FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
FudanNLP https://github.com/FudanNLP/fnlp
HanLP https://github.com/hankcs/HanLP
jieba https://github.com/fxsjy/jieba
LAC https://github.com/baidu/lac
LTP https://github.com/HIT-SCIR/ltp
SnowNLP https://github.com/isnowfy/snownlp
pkuseg https://github.com/lancopku/pkuseg-python
pyhanlp https://github.com/hankcs/pyhanlp
THULAC https://github.com/thunlp/THULAC-Python

Japanese:

janome https://github.com/mocobeta/janome
Juman++ https://github.com/ku-nlp/jumanpp
Kagome https://github.com/ikawaha/kagome
Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
KyTea http://www.phontron.com/kytea/
MeCab https://taku910.github.io/mecab/
nagisa https://github.com/taishi-i/nagisa
Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy

Lao:

Lao Word-Segmentation https://github.com/frankxayachack/LaoWordSegmentation

Thai:

Cutkum https://github.com/pucktada/cutkum
CutThai https://github.com/pureexe/cutthai
Deepcut https://github.com/rkcosmos/deepcut
PyThaiNLP https://github.com/PyThaiNLP/pythainlp
SWATH https://www.cs.cmu.edu/~paisarn/software.html
SynThai https://github.com/KrakenAI/SynThai
TLTK https://pypi.org/project/tltk/
wordcut https://github.com/veer66/wordcut / https://github.com/veer66/wordcutpy

Vietnamese:

DongDu https://github.com/rockkhuya/DongDu
JVnSegmenter http://jvnsegmenter.sourceforge.net/
VietSeg https://github.com/manhtai/vietseg
VnCoreNLP https://github.com/vncorenlp/VnCoreNLP

Other Lessons

Internet Dictionaries
Astrology in different Cultures and Languages
Helpful Anki Shared Decks
Internet resources for learning specific languages
