Difference between revisions of "Language/Multiple-languages/Culture/Text-Processing-Tools"

From Polyglot Club WIKI


== Diacritization ==
In the Arabic writing system, diacritics mark short vowels and other details of pronunciation, but they are usually omitted in everyday writing. This makes reading difficult for language learners. The process of restoring diacritics is called diacritization.


<big><b>Tools:</b></big>
* Shakkala https://github.com/Barqawiz/Shakkala
* Shakkelha https://github.com/AliOsm/shakkelha
== Pitch-accent or Stress Marker ==
In Japanese, Russian and other languages, pitch accent or stress is important for distinguishing between words.
Japanese:
* Prosody Tutor Suzuki-kun http://www.gavo.t.u-tokyo.ac.jp/ojad/phrasing/index
* tdmelodic https://github.com/PKSHATechnology-Research/tdmelodic
Russian:
* RussianGram https://russiangram.com/
* Russian Stress Finder https://www.readyrussian.org/WebApps/StressFinder/


== Word segmentation ==

Revision as of 14:04, 22 November 2021


This lesson introduces several linguistic tools that are useful for ordinary language learners. Keep in mind that their output is not always accurate.

Many of the tools introduced here are written in Python, a language that is widely used in machine learning and easy to learn.

If you don't know Python, please try this:

In progress.

Diacritization

In the Arabic writing system, diacritics mark short vowels and other details of pronunciation, but they are usually omitted in everyday writing. The process of restoring diacritics is called diacritization.
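To see what a diacritization tool has to restore, it can help to go the other way and strip the marks from a fully vocalised word. The sketch below is only an illustration and is not taken from any of the tools on this page: the helper name `strip_tashkeel` is made up, and it assumes that only the common marks in the Unicode range U+064B to U+0652 should be removed.

```python
# Common Arabic diacritics (tanwin, short vowels, shadda, sukun): U+064B..U+0652.
# Assumption: rarer marks outside this range are left untouched.
TASHKEEL = {chr(code) for code in range(0x064B, 0x0653)}


def strip_tashkeel(text):
    """Return text with the common Arabic diacritics removed (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


# "kataba" (he wrote): three consonants, each followed by a fatha (U+064E).
vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_tashkeel(vocalised)  # only the three consonants remain

print(vocalised, "->", bare)
```

A pair like `vocalised` and `bare` can also serve as a quick check of a diacritization tool: feed it the bare form and compare the result with the original.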

Tools:

Arabic:

Pitch-accent or Stress Marker

In Japanese, Russian and other languages, pitch accent or stress is important for distinguishing between words.

Japanese:

Russian:

Word segmentation

In some languages, words are not separated by spaces, for example Chinese, Japanese, Lao and Thai. In Vietnamese, spaces separate syllables rather than words. This causes difficulties for programs such as VocabHunter, gritz and text-memorize, which detect words only by spaces.
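The problem is easy to reproduce. The following sketch does not show how VocabHunter, gritz or text-memorize work internally; it only assumes that a program splits text on the ordinary space character, and the helper name `split_on_spaces` and the two sample sentences are chosen for illustration.

```python
def split_on_spaces(text):
    """Naive word detection: cut the text at every ordinary space."""
    return text.split(" ")


chinese = "我喜欢学习中文"       # "I like studying Chinese", written without spaces
vietnamese = "tôi yêu Việt Nam"  # "I love Vietnam": 3 words, but 4 syllables

chinese_tokens = split_on_spaces(chinese)        # the whole sentence as one "word"
vietnamese_tokens = split_on_spaces(vietnamese)  # "Việt" and "Nam" are split apart

print(len(chinese_tokens), chinese_tokens)
print(len(vietnamese_tokens), vietnamese_tokens)
```

The Chinese sentence comes back as a single token, while the Vietnamese name "Việt Nam" is cut into two tokens although it is one word.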

The solution is called “word segmentation”: it detects the words and either inserts spaces between them or puts the segmented words into a list. You may ask: if these programs only recognise spaces as word separators, how can Vietnamese be handled? The answer is to use the non-breaking space: the syllables of one word are joined with it, so that ordinary spaces are left to separate words.
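Both ideas can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used by any tool listed on this page: `max_match` is a simple forward maximum matching segmenter, `TOY_VOCABULARY` is a made-up word list, and `join_for_space_based_tools` is a hypothetical helper that expects the Vietnamese words to be segmented already.

```python
NBSP = "\u00a0"  # non-breaking space


def max_match(text, vocabulary, max_len=4):
    """Greedy forward maximum matching: take the longest known word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # An unknown single character is kept as a word of its own.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


def join_for_space_based_tools(words):
    """Join syllables inside a word with NBSP and separate words with a plain space."""
    return " ".join(word.replace(" ", NBSP) for word in words)


TOY_VOCABULARY = {"我", "喜欢", "学习", "中文"}  # made-up mini dictionary
segmented_zh = max_match("我喜欢学习中文", TOY_VOCABULARY)
spaced_zh = " ".join(segmented_zh)  # spaces inserted between the detected words

segmented_vi = ["tôi", "yêu", "Việt Nam"]  # assumed output of a Vietnamese segmenter
marked_vi = join_for_space_based_tools(segmented_vi)
tokens_vi = marked_vi.split(" ")  # split on the ordinary space only

print(spaced_zh)
print(tokens_vi)
```

One detail matters when testing this in Python: `str.split()` without an argument treats the non-breaking space as whitespace and would cut "Việt Nam" apart again, which is why the sketch calls `split(" ")`. Whether a particular program keeps such words together depends on how it detects spaces.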

Tools:

Chinese:

Japanese:

Lao:

Thai:

Vietnamese: