Difference between revisions of "Language/Multiple-languages/Culture/Text-Processing-Tools"

From Polyglot Club WIKI
Line 98:


== Word Segmentation ==
In some languages, words are not separated by spaces, for example: Chinese, Japanese, Lao, Thai. In Vietnamese, spaces are used to divide syllables instead of words. This brings about difficulties for computer programs like [https://vocabhunter.github.io/ VocabHunter], [https://github.com/jeffkowalski/gritz gritz] and [https://github.com/zg/text-memorize text-memorize], where words are detected only with spaces.
In some languages, words are not separated by spaces, for example: Chinese, Japanese, Khmer, Lao, Thai. In Vietnamese, spaces are used to divide syllables instead of words. This brings about difficulties for computer programs like [https://vocabhunter.github.io/ VocabHunter], [https://github.com/jeffkowalski/gritz gritz] and [https://github.com/zg/text-memorize text-memorize], where words are detected only with spaces.


The solution is called “[https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation word segmentation]”, which detects words and either inserts spaces between them or puts the segmented words into a list.

Revision as of 21:09, 3 June 2023


This lesson discusses several linguistic tools that are useful for ordinary language learners. They are not always accurate, so keep that in mind.

Many of the tools introduced are written in Python, which is an important language in machine learning and easy to learn.

If you don't know Python, please try this:

In progress.

Diacritisation

In the Arabic writing system, diacritics indicate short vowels and other details of pronunciation, but they are often omitted in everyday writing. The process of restoring them is called diacritisation.
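To show what is missing from undiacritised text, here is a minimal sketch of the reverse operation: stripping the marks. Restoring them needs a trained model or a dictionary, which this example does not attempt. The function name `strip_diacritics` is made up for illustration, and the code assumes that only the marks U+064B to U+0652 should be removed.

```python
# Arabic diacritics from fathatan (U+064B) to sukun (U+0652).
# Assumption: other marks (e.g. hamza, madda) are left untouched.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics, leaving the bare letters."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


# "kataba" (he wrote), written with all three fatha marks.
vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalised)  # the form usually seen in print
print(vocalised, "->", bare)
```

A diacritisation tool has to do the opposite: given the bare form, choose the right marks from context.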

Arabic:

Lemmatisation

When you look up a word in an inflected form, a dictionary program can show you the entry for its lemma (the dictionary form). The step that maps the inflected form to its lemma is called lemmatisation.
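The idea can be sketched with a toy English lemmatiser. This is only an illustration: the table `IRREGULAR`, the list `SUFFIX_RULES` and the function `lemmatise` are hypothetical, and real tools rely on much larger dictionaries or statistical models.

```python
# Hypothetical lookup table for irregular forms: inflected form -> lemma.
IRREGULAR = {"went": "go", "mice": "mouse", "better": "good"}

# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def lemmatise(word: str) -> str:
    """Return a guessed lemma for an English word."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three letters before the suffix,
        # so that short words such as "bus" are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


for form in ["went", "studies", "walked", "cats", "bus"]:
    print(form, "->", lemmatise(form))
```

Simple rules like these fail often (for example on "running"), which is one reason such tools are not always accurate.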

Multiple languages:

Pitch-Accent Marking

In Japanese and some other languages, pitch accent is important for distinguishing words. It is not written, yet it is required for correct pronunciation.

Japanese:

Stress Marking

In Russian and some other languages, stress is important for distinguishing words. It is usually not marked in writing.
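As a rough sketch, stress can be marked by placing a combining acute accent (U+0301) after the stressed vowel. The dictionary `STRESS` and the function `mark_stress` below are hypothetical. A real tool has to look up every word and pick the right reading from context.

```python
COMBINING_ACUTE = "\u0301"

# Hypothetical mini-dictionary: (word, meaning) -> index of the stressed vowel.
STRESS = {
    ("замок", "castle"): 1,  # stress on the first syllable
    ("замок", "lock"): 3,    # stress on the second syllable
}


def mark_stress(word: str, vowel_index: int) -> str:
    """Insert a combining acute accent right after the stressed vowel."""
    return word[: vowel_index + 1] + COMBINING_ACUTE + word[vowel_index + 1:]


for (word, meaning), index in STRESS.items():
    print(meaning, "->", mark_stress(word, index))
```

The two outputs are spelled the same without the accent, which is why stress marking helps learners.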

Russian:

Transcription

Some languages are written in more than one writing system. These tools convert text from one system to another.
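For example, assuming the two systems are simplified and traditional Chinese characters, a character-by-character conversion could look like the sketch below. The four-entry table `SIMP_TO_TRAD` and the function `to_traditional` are hypothetical. Real converters use full tables and handle characters that map to more than one target depending on the word.

```python
# Hypothetical four-entry table; a real converter covers thousands of characters.
SIMP_TO_TRAD = str.maketrans({"汉": "漢", "语": "語", "学": "學", "习": "習"})


def to_traditional(text: str) -> str:
    """Convert simplified characters to traditional ones, one by one."""
    # Characters that are not in the table are left unchanged.
    return text.translate(SIMP_TO_TRAD)


print(to_traditional("学习汉语"))
```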

Chinese:

Part of Speech Tagging

These tools tag each word in a sentence with its part of speech. Some of them can also draw parse trees.
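The output of such a tool can be pictured with a toy tagger that looks each word up in a small lexicon. The `LEXICON` table and the `tag` function are hypothetical, and the tag names follow the Universal POS tag set as an assumption. Real taggers use context to choose between several possible tags for the same word.

```python
# Hypothetical lexicon: word -> part-of-speech tag.
LEXICON = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}


def tag(sentence: str) -> list[tuple[str, str]]:
    """Pair each space-separated word with a tag; unknown words get 'X'."""
    return [(word, LEXICON.get(word.lower(), "X")) for word in sentence.split()]


for word, pos in tag("The cat sat on the mat"):
    print(word, pos)
```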

Multiple languages:

Arabic:

Chinese:

Japanese:

Thai:

Vietnamese:

Word Segmentation

In some languages, such as Chinese, Japanese, Khmer, Lao and Thai, words are not separated by spaces. In Vietnamese, spaces separate syllables rather than words. This causes difficulties for programs like VocabHunter, gritz and text-memorize, which detect words only by spaces.

The solution is called “word segmentation”, which detects words and either inserts spaces between them or puts the segmented words into a list.
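One simple approach is forward maximum matching: at each position, take the longest dictionary word that fits, and fall back to a single character if nothing matches. The sketch below is a minimal illustration with a hypothetical four-word dictionary. It is not how any particular tool works, and many modern segmenters use statistical models instead.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Split text into words by forward maximum matching."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary.
toy_dictionary = {"我", "喜欢", "学习", "汉语"}
segmented = segment("我喜欢学习汉语", toy_dictionary)
print(segmented)            # the words as a list
print(" ".join(segmented))  # the words separated by spaces
```

The list form suits further processing, while the space-separated form can be fed to programs that detect words only by spaces.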

Burmese, Khmer, Lao, Thai:

Chinese:

Japanese:

Lao:

Thai:

Vietnamese:

Other Lessons