[[Category:Computer-Knowledge]]
{{Multiple-languages-flag}}


This lesson discusses several linguistic tools that are useful for language learners. They are not always accurate, so keep that in mind.


Many of the tools introduced are written in Python, which is an important language in machine learning and easy to learn.


If you don't know Python, please try this:


<youtube>_uQrJ0TkZlc</youtube>


In progress.


== Diacritisation ==
In the Arabic writing system, diacritics indicate short vowels and other details of pronunciation, but they are often omitted in everyday writing. The process of restoring them is called diacritisation.
 


Arabic:
* Arabycia https://github.com/mohabmes/Arabycia
* Farasa https://github.com/MagedSaeed/farasapy
* Mishkal https://sourceforge.net/projects/mishkal/
* Pipeline-diacritizer https://github.com/Hamza5/Pipeline-diacritizer
* Shakkala https://github.com/Barqawiz/Shakkala
* Shakkelha https://github.com/AliOsm/shakkelha
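Restoring diacritics needs the statistical tools listed above, but the inverse operation, stripping them, is easy with Python's standard library, because the Arabic short-vowel signs (harakat) are Unicode combining marks (category "Mn"). A minimal sketch:

```python
import unicodedata

def strip_diacritics(text):
    # Drop combining marks (Unicode category Mn), which include
    # the Arabic harakat such as fatha, kasra, damma and shadda.
    return "".join(c for c in text if unicodedata.category(c) != "Mn")

print(strip_diacritics("كَتَبَ"))  # -> كتب
```

This can be handy for comparing a tool's output against undiacritised source text.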
 
== Lemmatisation ==
When you look up a word in an inflected form, a dictionary program can reduce it to its lemma and show you that entry; this reduction is called lemmatisation.
 
Multiple languages:
* CST's lemmatiser https://www.cst.dk/online/lemmatiser/
* Pattern https://github.com/clips/pattern
* Simplemma https://github.com/adbar/simplemma
* TextBlob https://textblob.readthedocs.io/
 
English:
* LemmInflect https://github.com/bjascob/LemmInflect
* Natural Language Toolkit https://www.nltk.org/
 
German:
* GermaLemma https://github.com/WZBSocialScienceCenter/germalemma
 
Hungarian:
* HuSpaCy https://github.com/huspacy/huspacy
 
Persian:
* Hazm https://github.com/roshan-research/hazm
 
Turkish:
* Zeyrek https://github.com/obulat/zeyrek
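The libraries above combine large lexicons with morphological rules; the basic idea can be illustrated with a toy lookup table (the entries below are invented for illustration only):

```python
# Toy lemmatiser: a real library replaces this dict with a full lexicon
# plus morphological rules and context-based disambiguation.
LEMMAS = {"went": "go", "mice": "mouse", "better": "good", "cats": "cat"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in "The mice went home".split()])
# -> ['The', 'mouse', 'go', 'home']
```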
 
== Pitch-Accent Marking ==
In Japanese and some other languages, pitch accent is important for distinguishing words, yet it is not written, so learners need tools or dictionaries to find it.
 
Japanese:
* Prosody Tutor Suzuki-kun http://www.gavo.t.u-tokyo.ac.jp/ojad/phrasing/index
* tdmelodic https://github.com/PKSHATechnology-Research/tdmelodic
 
== Stress Marking ==
In Russian and some other languages, stress is important for distinguishing words, but stress marks are usually omitted in writing.
 
Russian:
* RussianGram https://russiangram.com/
* Russian Stress Finder https://www.readyrussian.org/WebApps/StressFinder/
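In Unicode, Russian stress is typically marked with the combining acute accent (U+0301) placed after the stressed vowel, so adding or removing the mark is a simple string operation; a minimal sketch:

```python
STRESS = "\u0301"  # combining acute accent

def add_stress(word, vowel_index):
    # Insert the combining mark right after the stressed vowel.
    return word[:vowel_index + 1] + STRESS + word[vowel_index + 1:]

def remove_stress(word):
    return word.replace(STRESS, "")

# за́мок "castle" and замо́к "lock" differ only in stress.
print(add_stress("замок", 3))  # -> замо́к
```

The tools above do the hard part: deciding *which* vowel carries the stress.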
 
== Transcription ==
Some languages are written in more than one writing system. These tools convert text from one system to another.


Chinese:
* Chinese-Tools.com https://www.chinese-tools.com/tools/converter-simptrad.html
* ChineseConverter.com https://www.chineseconverter.com/en/convert/simplified-to-traditional
* hanzi2reading https://github.com/bdon/hanzi2reading
* OMGChinese.com https://www.omgchinese.com/tools/chinese-simplified-traditional-converter
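Character-level conversion can be sketched with a mapping table, though real converters like those above also handle one-to-many mappings and word-level context; the tiny table below covers only a few characters for illustration:

```python
# Tiny simplified -> traditional mapping; real tables have thousands of
# entries and need word context for ambiguous characters (e.g. 发 -> 發/髮).
S2T = str.maketrans({"国": "國", "语": "語", "汉": "漢", "学": "學"})

def to_traditional(text):
    return text.translate(S2T)

print(to_traditional("汉语"))  # -> 漢語
```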


== Part of Speech Tagging ==
These tools tag the words in a sentence with their parts of speech. Some of them can also draw parse trees.

Multiple languages:
* CoreNLP https://github.com/stanfordnlp/CoreNLP
* Natural Language Toolkit https://www.nltk.org/
* spaCy https://spacy.io/
* Stanford Log-linear Part-Of-Speech Tagger https://nlp.stanford.edu/software/tagger.shtml


Arabic:
* Arabycia https://github.com/mohabmes/Arabycia
 
Chinese:
* FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
* FudanNLP https://github.com/FudanNLP/fnlp
* HanLP https://github.com/hankcs/HanLP
* LAC https://github.com/baidu/lac
* LTP https://github.com/HIT-SCIR/ltp
* SnowNLP https://github.com/isnowfy/snownlp
* pkuseg https://github.com/lancopku/pkuseg-python
* pyhanlp https://github.com/hankcs/pyhanlp
* THULAC https://github.com/thunlp/THULAC-Python
 
Japanese:
* janome https://github.com/mocobeta/janome
* Juman++ https://github.com/ku-nlp/jumanpp
* Kagome https://github.com/ikawaha/kagome
* Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
* KyTea http://www.phontron.com/kytea/
* MeCab https://taku910.github.io/mecab/
* nagisa https://github.com/taishi-i/nagisa
* Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy
 
Thai:
* PyThaiNLP https://github.com/PyThaiNLP/pythainlp
* SynThai https://github.com/KrakenAI/SynThai
* TLTK https://pypi.org/project/tltk/
 
Vietnamese:
* JVnSegmenter http://jvnsegmenter.sourceforge.net/
* VnCoreNLP https://github.com/vncorenlp/VnCoreNLP
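Taggers typically return the sentence as a list of (word, tag) pairs. A toy illustration with a hand-made lookup table (the table and tag names are invented for illustration; real taggers disambiguate from context):

```python
# Toy tagger: looks each word up in a small table; real taggers use
# context to disambiguate (e.g. "book" as noun vs verb).
TAGS = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}

def pos_tag(sentence):
    # Unknown words get the catch-all tag "X".
    return [(w, TAGS.get(w.lower(), "X")) for w in sentence.split()]

print(pos_tag("The cat sat"))
# -> [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB')]
```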
 
== Romanisation ==
Multiple languages:
* Translit https://translit.cc/
 
Iranian Persian:
* Behnevis: easy farsi transliteration (pinglish) editor https://behnevis.com/en/
 
Japanese:
* NihongoDera - Romaji Converter https://nihongodera.com/tools/romaji-converter
 
Korean:
* 한국어/로마자 변환기 http://roman.cs.pusan.ac.kr/
 
Mandarin Chinese:
* Chinese Romanization Converter https://chinese.gratis/tools/zhuyin/
 
Standard Arabic:
* Romanize Arabic ALA-LC http://romanize-arabic.camel-lab.com/
 
Thai:
* thai-language.com Romanize an Arbitrary Thai Word http://thai-language.com/?nav=dictionary&anyxlit=1
* Phonetic transliteration of Thai https://www.thailit.com/transliterate.php
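Simple romanisation schemes are letter-for-letter substitutions; a toy Cyrillic example (only a fragment of a table, for illustration; real standards such as ALA-LC have many more rules and context-dependent cases):

```python
# Tiny Russian -> Latin table; a full scheme covers the whole alphabet
# and handles digraphs such as ж -> zh and щ -> shch.
CYR2LAT = str.maketrans({"м": "m", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a"})

def romanize(text):
    return text.lower().translate(CYR2LAT)

print(romanize("Москва"))  # -> moskva
```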
 
== Word Segmentation ==
In some languages, words are not separated by spaces, for example Chinese, Japanese, Khmer, Lao and Thai. In Vietnamese, spaces divide syllables rather than words. This causes difficulties for computer programs like [https://vocabhunter.github.io/ VocabHunter], [https://github.com/jeffkowalski/gritz gritz] and [https://github.com/zg/text-memorize text-memorize], which detect words only by spaces.
 
The solution is called “[https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation word segmentation]”, which detects word boundaries and either inserts spaces between the words or puts the segmented words into a list.
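The tools below use statistical or neural models; the classic baseline they improve on is dictionary-based greedy longest matching, sketched here with a toy dictionary:

```python
# Greedy longest-match segmentation with a toy dictionary; real tools
# use much larger lexicons plus statistical models for unknown words.
WORDS = {"我", "爱", "北京", "天安门"}
MAXLEN = max(len(w) for w in WORDS)

def segment(text):
    result, i = [], 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for j in range(min(len(text), i + MAXLEN), i, -1):
            if text[i:j] in WORDS:
                result.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character word.
            result.append(text[i])
            i += 1
    return result

print(segment("我爱北京天安门"))  # -> ['我', '爱', '北京', '天安门']
```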
 
Burmese, Khmer, Lao, Thai:
* Chamkho https://codeberg.org/mekong-lang/chamkho
 
Burmese:
* Myan-word-breaker https://github.com/stevenay/myan-word-breaker


Chinese:
* Ansj https://github.com/NLPchina/ansj_seg
* CoreNLP https://github.com/stanfordnlp/CoreNLP
* FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
* FudanNLP https://github.com/FudanNLP/fnlp
* HanLP https://github.com/hankcs/HanLP
* LAC https://github.com/baidu/lac
* LTP https://github.com/HIT-SCIR/ltp
* SnowNLP https://github.com/isnowfy/snownlp
* pkuseg https://github.com/lancopku/pkuseg-python
* pyhanlp https://github.com/hankcs/pyhanlp
* THULAC https://github.com/thunlp/THULAC-Python

Japanese:
* janome https://github.com/mocobeta/janome
* Juman++ https://github.com/ku-nlp/jumanpp
* Kagome https://github.com/ikawaha/kagome
* Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
* KyTea http://www.phontron.com/kytea/
* MeCab https://taku910.github.io/mecab/
* nagisa https://github.com/taishi-i/nagisa
* Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy

Lao:
* Lao Word-Segmentation https://github.com/frankxayachack/LaoWordSegmentation

Thai:
* PyThaiNLP https://github.com/PyThaiNLP/pythainlp
* SynThai https://github.com/KrakenAI/SynThai
* TLTK https://pypi.org/project/tltk/

Vietnamese:
* DongDu https://github.com/rockkhuya/DongDu
* JVnSegmenter http://jvnsegmenter.sourceforge.net/
* Roy_VnTokenizer https://github.com/roy-a/Roy_VnTokenizer
* VietSeg https://github.com/manhtai/vietseg
* VnCoreNLP https://github.com/vncorenlp/VnCoreNLP
==Other Lessons==
* [[Language/Multiple-languages/Culture/Internet-Dictionaries|Internet Dictionaries]]
* [[Language/Multiple-languages/Culture/Astrology-in-different-Cultures-and-Languages|Astrology in different Cultures and Languages]]
* [[Language/Multiple-languages/Culture/How-to-make-a-TSV-file|How to make a TSV file]]
* [[Language/Multiple-languages/Culture/Texts-and-Audios-under-a-Public-License|Texts and Audios under a Public License]]
* [[Language/Multiple-languages/Culture/Calendar-and-Clock|Calendar and Clock]]
* [[Language/Multiple-languages/Culture/Online-Specialized-Dictionaries|Online Specialized Dictionaries]]
* [[Language/Multiple-languages/Culture/Similar-Sayings|Similar Sayings]]
* [[Language/Multiple-languages/Culture/Elements-of-Traditional-Architectures:-Western-Europe|Elements of Traditional Architectures: Western Europe]]
* [[Language/Multiple-languages/Culture/Helpful-Anki-Shared-Decks|Helpful Anki Shared Decks]]
* [[Language/Multiple-languages/Culture/Internet-resources-for-learning-specific-languages|Internet resources for learning specific languages]]
<span links></span>

Latest revision as of 13:50, 16 February 2024
