Difference between revisions of "Language/Multiple-languages/Culture/Text-Processing-Tools"

From Polyglot Club WIKI
(9 intermediate revisions by the same user not shown)
Line 8: Line 8:
If you don't know Python, please try this:


<youtube>_uQrJ0TkZlc</youtube>
https://www.youtube.com/watch?v=_uQrJ0TkZlc


In progress.
Line 26: Line 26:
When you search for a word in an inflected form, a dictionary program can show the result under its lemma (the dictionary form). The step that maps the inflected form to the lemma is called lemmatisation.


Multiple languages:
multiple languages:
* CST's lemmatiser https://www.cst.dk/online/lemmatiser/
* Natural Language Toolkit https://www.nltk.org/
* Pattern https://github.com/clips/pattern
* Simplemma https://github.com/adbar/simplemma
* TextBlob https://textblob.readthedocs.io/
English:
* LemmInflect https://github.com/bjascob/LemmInflect
* Natural Language Toolkit https://www.nltk.org/
German:
* GermaLemma https://github.com/WZBSocialScienceCenter/germalemma
Hungarian:
* HuSpaCy https://github.com/huspacy/huspacy
Persian:
* Hazm https://github.com/roshan-research/hazm
Turkish:
* Zeyrek https://github.com/obulat/zeyrek


== Pitch-Accent Marking ==
Line 96: Line 112:
* JVnSegmenter http://jvnsegmenter.sourceforge.net/
* VnCoreNLP https://github.com/vncorenlp/VnCoreNLP
== Romanisation ==
multiple languages:
* Translit https://translit.cc/
Iranian Persian:
* Behnevis: easy farsi transliteration (pinglish) editor https://behnevis.com/en/
Japanese:
* NihongoDera - Romaji Converter https://nihongodera.com/tools/romaji-converter
Korean:
* 한국어/로마자 변환기 http://roman.cs.pusan.ac.kr/
Mandarin Chinese:
* Chinese Romanization Converter https://chinese.gratis/tools/zhuyin/
Standard Arabic:
* Romanize Arabic ALA-LC http://romanize-arabic.camel-lab.com/
Thai:
* thai-language.com Romanize an Arbitrary Thai Word http://thai-language.com/?nav=dictionary&anyxlit=1
* Phonetic transliteration of Thai https://www.thailit.com/transliterate.php


== Word Segmentation ==
Line 103: Line 142:


Burmese, Khmer, Lao, Thai:
* Chamkho https://github.com/veer66/chamkho
* Chamkho https://codeberg.org/mekong-lang/chamkho


Burmese:

Latest revision as of 13:50, 16 February 2024


This lesson discusses several linguistic tools that are useful for ordinary language learners. Keep in mind that they are not always accurate.

Many of the tools introduced here are written in Python, which is widely used in machine learning and easy to learn.

If you don't know Python, please try this:

https://www.youtube.com/watch?v=_uQrJ0TkZlc

In progress.

Diacritisation

In the Arabic writing system, diacritics mark short vowels and other pronunciation details, but they are often omitted in everyday writing. The process of restoring them is called diacritisation.
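As a rough illustration of why restoring diacritics is hard, the sketch below strips the marks from two fully marked words and shows that both collapse to the same bare spelling. The two-word `LEXICON` and the helper names are made up for this example; real diacritisers rely on large dictionaries or trained models and on sentence context.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

# Hypothetical mini-lexicon of fully diacritised words (illustration only).
LEXICON = [
    "\u0643\u064e\u062a\u064e\u0628\u064e",  # kataba, "he wrote"
    "\u0643\u064f\u062a\u064f\u0628",        # kutub, "books"
]

def candidates(bare_word):
    """Return every lexicon entry whose undiacritised spelling matches."""
    return [w for w in LEXICON if strip_diacritics(w) == bare_word]

bare = strip_diacritics(LEXICON[0])
for option in candidates(bare):
    print(bare, "->", option)
```

Both entries are returned for the same bare word, so a diacritiser has to choose between them from the surrounding words.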

Arabic:

Lemmatisation

When you search for a word in an inflected form, a dictionary program can show the result under its lemma (the dictionary form). The step that maps the inflected form to the lemma is called lemmatisation.
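The idea can be sketched with a plain lookup table. The `LEMMAS` table and the `lemmatise` function are hypothetical and cover only four English words; the tools listed in this section use far larger data and morphological rules.

```python
# Hypothetical lookup table from inflected forms to lemmas (illustration only).
LEMMAS = {
    "went": "go",
    "mice": "mouse",
    "children": "child",
    "running": "run",
}

def lemmatise(word):
    """Return the lemma if the form is known, else the lowercased word itself."""
    key = word.lower()
    return LEMMAS.get(key, key)

for form in ["Mice", "went", "table"]:
    print(form, "->", lemmatise(form))
```

Unknown words are returned unchanged here, which is one simple fallback among several possible ones.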

multiple languages:

English:

German:

Hungarian:

Persian:

Turkish:

Pitch-Accent Marking

In Japanese and some other languages, pitch accent helps to distinguish words that are otherwise identical. It is not written in ordinary text, yet it is needed for correct pronunciation.

Japanese:

Stress Marking

In Russian and some other languages, stress is important for distinguishing different words. Stress marks are usually omitted in ordinary text.
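A minimal sketch of what the output of a stress marker looks like: it places a combining acute accent (U+0301) after a chosen vowel. The function `mark_stress` is hypothetical, and the caller has to say which vowel is stressed; the real tools look this up in a dictionary or predict it.

```python
ACUTE = "\u0301"  # combining acute accent
VOWELS = set("аеёиоуыэюя")  # Russian vowel letters

def mark_stress(word, vowel_number):
    """Insert a combining acute after the vowel_number-th vowel (counting from 1)."""
    seen = 0
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            seen += 1
            if seen == vowel_number:
                return word[: i + 1] + ACUTE + word[i + 1 :]
    raise ValueError("word has fewer vowels than vowel_number")

print(mark_stress("замок", 1))  # stress on the first syllable: "castle"
print(mark_stress("замок", 2))  # stress on the second syllable: "lock"
```

The same spelling gives two different words depending on the stress, which is why the marks are useful for learners.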

Russian:

Transcription

Some languages are written in more than one writing system. These tools convert text from one system to another.
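For Chinese, the simplest form of such a conversion is a character-by-character mapping from simplified to traditional characters. The four-entry table below and the name `to_traditional` are made up for this sketch. Real converters use complete tables and also handle simplified characters that correspond to more than one traditional character.

```python
# Hypothetical, tiny simplified-to-traditional table (illustration only).
S2T = str.maketrans({
    "汉": "漢",
    "语": "語",
    "学": "學",
    "习": "習",
})

def to_traditional(text):
    """Replace each known simplified character; leave everything else as it is."""
    return text.translate(S2T)

print(to_traditional("学习汉语"))
```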

Chinese:

Part of Speech Tagging

These tools tag each word in a sentence with its part of speech. Some of them can also draw parse trees.

Multiple languages:

Arabic:

Chinese:

Japanese:

Thai:

Vietnamese:

Romanisation

multiple languages:

Iranian Persian:

Japanese:

Korean:

Mandarin Chinese:

Standard Arabic:

Thai:

Word Segmentation

In some languages, words are not separated by spaces, for example Chinese, Japanese, Khmer, Lao and Thai. In Vietnamese, spaces separate syllables rather than words. This causes difficulties for programs such as VocabHunter, gritz and text-memorize, which detect words only by spaces.

The solution is called “word segmentation”: it detects the words and either inserts spaces between them or puts the segmented words into a list.
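One classic way to do this is forward maximum matching: at each position, take the longest dictionary word that fits, and fall back to a single character if nothing matches. The sketch below assumes a tiny made-up `DICTIONARY`; the tools listed in this section generally use much larger dictionaries or statistical models, so treat this only as an illustration of the idea.

```python
# Hypothetical mini-dictionary (illustration only).
DICTIONARY = {"我", "喜欢", "学习", "中文"}

def segment(text, dictionary=DICTIONARY):
    """Split text into words by forward maximum matching."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

words = segment("我喜欢学习中文")
print(words)            # the segmented words as a list
print(" ".join(words))  # the same words with spaces in between
```

The two print calls correspond to the two output styles mentioned above: a list of words, or the text with spaces inserted.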

Burmese, Khmer, Lao, Thai:

Burmese:

Chinese:

Japanese:

Lao:

Thai:

Vietnamese:

Other Lessons