Text Processing Tools
Latest revision as of 13:50, 16 February 2024
This lesson introduces several text-processing tools that are useful for language learners. Keep in mind that their output is not always accurate.
Many of these tools are written in Python, a language that is widely used in machine learning and is relatively easy to learn.
If you don't know Python, this video tutorial is a good place to start:
https://www.youtube.com/watch?v=_uQrJ0TkZlc
In progress.
Diacritisation
In the Arabic writing system, diacritics mark short vowels and other details of pronunciation, but they are usually left out in everyday writing. The process of restoring them is called diacritisation.
Arabic:
- Arabycia https://github.com/mohabmes/Arabycia
- Farasa https://github.com/MagedSaeed/farasapy
- Mishkal https://sourceforge.net/projects/mishkal/
- Pipeline-diacritizer https://github.com/Hamza5/Pipeline-diacritizer
- Shakkala https://github.com/Barqawiz/Shakkala
- Shakkelha https://github.com/AliOsm/shakkelha
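To make the task concrete, the sketch below goes in the opposite (easy) direction: it strips the diacritics from a fully vocalised Arabic word using only Python's standard library. The tools listed above have to do the reverse, which is much harder because the bare form can usually be read in several ways. The function name `strip_diacritics` is made up for this illustration and is not taken from any of the listed tools.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# "kataba" (he wrote), fully vocalised: kaf + fatha, ta + fatha, ba + fatha
vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalised)
print(vocalised, "->", bare)
```

The bare three-letter result is what you normally see in print; a diacritiser has to guess from context which vowels belong on it.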
Lemmatisation
Lemmatisation reduces an inflected word to its dictionary form, the lemma. This is what lets a dictionary program show you the right entry when you search for a word in an inflected form.
Multiple languages:
- CST's lemmatiser https://www.cst.dk/online/lemmatiser/
- Pattern https://github.com/clips/pattern
- Simplemma https://github.com/adbar/simplemma
- TextBlob https://textblob.readthedocs.io/
English:
- LemmInflect https://github.com/bjascob/LemmInflect
- Natural Language Toolkit https://www.nltk.org/
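German:
- GermaLemma https://github.com/WZBSocialScienceCenter/germalemma
Hungarian:
- HuSpaCy https://github.com/huspacy/huspacy
Persian:
- Hazm https://github.com/roshan-research/hazm
Turkish:
- Zeyrek https://github.com/obulat/zeyrek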
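The idea behind lemmatisation can be sketched with a toy example: look the word up in a table of irregular forms first, then fall back to stripping common English suffixes. The table, the suffix rules and the function name below are all hypothetical and far too small for real use; the tools listed above rely on large word lists or trained models instead.

```python
# Hypothetical mini-lexicon of irregular forms, for illustration only.
LEMMA_TABLE = {"went": "go", "mice": "mouse", "was": "be"}

# Naive suffix rules, tried in order: (suffix to remove, text to add back).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def lemmatise(word: str) -> str:
    """Return a guessed lemma for an English word."""
    lower = word.lower()
    if lower in LEMMA_TABLE:
        return LEMMA_TABLE[lower]
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three letters of the stem so short words stay intact.
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: len(lower) - len(suffix)] + replacement
    return lower


for form in ["went", "studies", "walked", "cats", "bus"]:
    print(form, "->", lemmatise(form))
```

Rules this simple break quickly (for example, "running" would become "runn"), which is why a dedicated lemmatiser is worth using.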
Pitch-Accent Marking
In Japanese and some other languages, pitch accent helps to distinguish one word from another. It is not shown in normal writing, yet it is needed for correct pronunciation.
Japanese:
- Prosody Tutor Suzuki-kun http://www.gavo.t.u-tokyo.ac.jp/ojad/phrasing/index
- tdmelodic https://github.com/PKSHATechnology-Research/tdmelodic
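As a rough illustration of what these tools compute, the sketch below turns an accent number into a high/low pattern, following the common textbook description of Tokyo Japanese: 0 means no drop in pitch, and n means the pitch drops after the n-th mora. The function name and the choice to include one following particle in the output are assumptions made for this example; the listed tools work quite differently, since they first have to predict the accent itself.

```python
def pitch_pattern(n_morae: int, accent: int) -> str:
    """Return H/L marks for each mora of a word plus one following particle."""
    marks = []
    for i in range(1, n_morae + 2):  # the extra position is the particle
        if accent == 0:
            high = i > 1            # low start, then high with no drop
        elif accent == 1:
            high = i == 1           # high start, drop right after mora 1
        else:
            high = 1 < i <= accent  # rise after mora 1, drop after the accent
        marks.append("H" if high else "L")
    return "".join(marks)


# Three words all written "hashi" in kana, differing only in accent.
for meaning, accent in [("edge", 0), ("chopsticks", 1), ("bridge", 2)]:
    print(meaning, pitch_pattern(2, accent))
```

Without the particle, "edge" and "bridge" would look identical, which is why the pattern is usually shown with one attached.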
Stress Marking
In Russian and some other languages, the position of the stress helps to distinguish one word from another. Stress marks are usually omitted in ordinary text.
Russian:
- RussianGram https://russiangram.com/
- Russian Stress Finder https://www.readyrussian.org/WebApps/StressFinder/
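The marking step itself is simple once the stressed syllable is known: a combining acute accent (U+0301) is placed after the stressed vowel. The sketch below shows only that step, with the syllable number supplied by hand; finding the right syllable is the hard part, and it is what the tools above do with dictionaries. The function name is made up for this illustration.

```python
VOWELS = "аеёиоуыэюя"
ACUTE = "\u0301"  # combining acute accent


def mark_stress(word: str, syllable: int) -> str:
    """Put an acute accent on the vowel of the given syllable (1-based)."""
    count = 0
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            count += 1
            if count == syllable:
                return word[: i + 1] + ACUTE + word[i + 1 :]
    raise ValueError(f"{word!r} has fewer than {syllable} syllables")


# The same spelling is two different words depending on the stress.
print(mark_stress("замок", 1))  # stress on the first syllable: "castle"
print(mark_stress("замок", 2))  # stress on the second syllable: "lock"
```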
Transcription
Some languages are written in more than one writing system. These tools convert text from one system to another.
Chinese:
- Chinese-Tools.com https://www.chinese-tools.com/tools/converter-simptrad.html
- ChineseConverter.com https://www.chineseconverter.com/en/convert/simplified-to-traditional
- hanzi2reading https://github.com/bdon/hanzi2reading
- OMGChinese.com https://www.omgchinese.com/tools/chinese-simplified-traditional-converter
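At its simplest, converting between Simplified and Traditional Chinese is a character-by-character substitution. The sketch below uses a hypothetical four-character table and Python's built-in `str.translate`; it is not how any of the listed converters is implemented.

```python
# Hypothetical four-character table; real converters cover thousands of characters.
S2T = str.maketrans({"汉": "漢", "语": "語", "学": "學", "习": "習"})


def to_traditional(text: str) -> str:
    """Replace each Simplified character that has an entry in the table."""
    return text.translate(S2T)


print(to_traditional("汉语学习"))
```

A plain character table is not enough in practice, because some Simplified characters correspond to more than one Traditional character (发 can be 發 or 髮, for example), so real converters also look at whole words.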
Part of Speech Tagging
Part-of-speech tagging labels each word in a sentence with its part of speech. Some of these tools can also draw parse trees.
Multiple languages:
- CoreNLP https://github.com/stanfordnlp/CoreNLP
- Natural Language Toolkit https://www.nltk.org/
- spaCy https://spacy.io/
- Stanford Log-linear Part-Of-Speech Tagger https://nlp.stanford.edu/software/tagger.shtml
Arabic:
- Arabycia https://github.com/mohabmes/Arabycia
Chinese:
- FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
- FudanNLP https://github.com/FudanNLP/fnlp
- HanLP https://github.com/hankcs/HanLP
- LAC https://github.com/baidu/lac
- LTP https://github.com/HIT-SCIR/ltp
- SnowNLP https://github.com/isnowfy/snownlp
- pkuseg https://github.com/lancopku/pkuseg-python
- pyhanlp https://github.com/hankcs/pyhanlp
- THULAC https://github.com/thunlp/THULAC-Python
Japanese:
- janome https://github.com/mocobeta/janome
- Juman++ https://github.com/ku-nlp/jumanpp
- Kagome https://github.com/ikawaha/kagome
- Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
- KyTea http://www.phontron.com/kytea/
- MeCab https://taku910.github.io/mecab/
- nagisa https://github.com/taishi-i/nagisa
- Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy
Thai:
- PyThaiNLP https://github.com/PyThaiNLP/pythainlp
- SynThai https://github.com/KrakenAI/SynThai
- TLTK https://pypi.org/project/tltk/
Vietnamese:
- JVnSegmenter http://jvnsegmenter.sourceforge.net/
- VnCoreNLP https://github.com/vncorenlp/VnCoreNLP
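To show what a tagger's output typically looks like, here is a toy lookup tagger that returns a list of (word, tag) pairs. The mini-lexicon and the function name are hypothetical, and the tag names merely follow the style of the Universal POS tag set; the tools listed above use statistical or neural models that take context into account.

```python
# Hypothetical mini-lexicon with Universal-POS-style tags, for illustration only.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "cat": "NOUN",
    "dog": "NOUN",
    "sees": "VERB",
    "small": "ADJ",
}


def tag(sentence: str, default: str = "X") -> list[tuple[str, str]]:
    """Tag each space-separated word; unknown words get the default tag."""
    return [(word, LEXICON.get(word.lower(), default)) for word in sentence.split()]


print(tag("The cat sees a small dog"))
```

A lookup table cannot decide between readings of an ambiguous word (such as "walk" as a noun or a verb), which is where the real tools earn their keep.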
Romanisation
Romanisation converts text written in another script into the Latin alphabet.
Multiple languages:
- Translit https://translit.cc/
Iranian Persian:
- Behnevis: easy farsi transliteration (pinglish) editor https://behnevis.com/en/
Japanese:
- NihongoDera - Romaji Converter https://nihongodera.com/tools/romaji-converter
Korean:
- 한국어/로마자 변환기 http://roman.cs.pusan.ac.kr/
Mandarin Chinese:
- Chinese Romanization Converter https://chinese.gratis/tools/zhuyin/
Standard Arabic:
- Romanize Arabic ALA-LC http://romanize-arabic.camel-lab.com/
Thai:
- thai-language.com Romanize an Arbitrary Thai Word http://thai-language.com/?nav=dictionary&anyxlit=1
- Phonetic transliteration of Thai https://www.thailit.com/transliterate.php
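For alphabetic scripts, romanisation can be sketched as a letter-by-letter table lookup. The example below handles a handful of Russian Cyrillic letters; the table is partial and hypothetical, does not follow any particular romanisation standard, and is unrelated to how the listed tools work.

```python
# Partial, hypothetical table; it does not follow any particular standard.
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "д": "d", "е": "e", "ж": "zh", "и": "i",
    "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ч": "ch", "ш": "sh",
}


def romanise(text: str) -> str:
    """Replace known Cyrillic letters; leave everything else unchanged."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(romanise("Москва"))
```

Scripts such as Thai, Arabic or Chinese need far more than a letter table, because vowels may be unwritten or a character's reading may depend on the word it appears in.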
Word Segmentation
In some languages, such as Chinese, Japanese, Khmer, Lao and Thai, words are not separated by spaces. In Vietnamese, spaces separate syllables rather than words. This causes problems for programs such as VocabHunter (https://vocabhunter.github.io/), gritz (https://github.com/jeffkowalski/gritz) and text-memorize (https://github.com/zg/text-memorize), which detect words only by the spaces between them.
The solution is called "word segmentation" (https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation): a tool detects the words and either inserts spaces between them or puts the segmented words into a list.
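One classic dictionary-based approach is forward maximum matching: starting from the left, take the longest dictionary word that fits, then continue from where it ends. The sketch below uses a hypothetical four-word dictionary; it is a minimal illustration of the idea rather than the algorithm of any tool listed here, many of which use statistical or neural models.

```python
def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Split text into words by forward maximum matching."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until one matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos : pos + size]
            # A single character is always accepted so the loop can advance.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


toy_dictionary = {"我", "喜欢", "学习", "中文"}  # hypothetical mini-dictionary
words = segment("我喜欢学习中文", toy_dictionary)
print(words)            # the segmented words as a list
print(" ".join(words))  # or with spaces inserted between them
```

The space-joined form is what space-based programs such as the ones mentioned above can then process.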
Burmese, Khmer, Lao, Thai:
- Chamkho https://codeberg.org/mekong-lang/chamkho
Burmese:
- Myan-word-breaker https://github.com/stevenay/myan-word-breaker
Chinese:
- Ansj https://github.com/NLPchina/ansj_seg
- FoolNLTK https://github.com/rockyzhengwu/FoolNLTK
- FudanNLP https://github.com/FudanNLP/fnlp
- HanLP https://github.com/hankcs/HanLP
- jieba https://github.com/fxsjy/jieba
- LAC https://github.com/baidu/lac
- LTP https://github.com/HIT-SCIR/ltp
- SnowNLP https://github.com/isnowfy/snownlp
- pkuseg https://github.com/lancopku/pkuseg-python
- pyhanlp https://github.com/hankcs/pyhanlp
- THULAC https://github.com/thunlp/THULAC-Python
Japanese:
- janome https://github.com/mocobeta/janome
- Juman++ https://github.com/ku-nlp/jumanpp
- Kagome https://github.com/ikawaha/kagome
- Kuromoji https://github.com/atilika/kuromoji / https://github.com/takuyaa/kuromoji.js/
- KyTea http://www.phontron.com/kytea/
- MeCab https://taku910.github.io/mecab/
- nagisa https://github.com/taishi-i/nagisa
- Sudachi https://github.com/WorksApplications/Sudachi / https://github.com/WorksApplications/SudachiPy
Lao:
- Lao Word-Segmentation https://github.com/frankxayachack/LaoWordSegmentation
Thai:
- Cutkum https://github.com/pucktada/cutkum
- CutThai https://github.com/pureexe/cutthai
- Deepcut https://github.com/rkcosmos/deepcut
- PyThaiNLP https://github.com/PyThaiNLP/pythainlp
- SWATH https://www.cs.cmu.edu/~paisarn/software.html
- SynThai https://github.com/KrakenAI/SynThai
- TLTK https://pypi.org/project/tltk/
- wordcut https://github.com/veer66/wordcut / https://github.com/veer66/wordcutpy
Vietnamese:
- DongDu https://github.com/rockkhuya/DongDu
- JVnSegmenter http://jvnsegmenter.sourceforge.net/
- VietSeg https://github.com/manhtai/vietseg
- VnCoreNLP https://github.com/vncorenlp/VnCoreNLP