Language/Multiple-languages/Culture/Tutorial-of-Text-to-Wordlist
Revision as of 17:28, 30 April 2024
Hi, polyglots.
When learning a language, building a wordlist is a good method. It would help to have a program that reads a text and generates a wordlist with a dictionary entry attached to each word.
There are tools like Lute and VocabSieve, but their limitations are obvious: Lute requires adding texts as books and needs many mouse clicks to add a word; VocabSieve has limited support for dictionary websites.
So why not combine existing tools? Which tools can be combined? There is a guide on how to make a TSV file, which describes the advantages of TSV as a format for wordlists.
The entire procedure can be divided into the following steps:
- Extract all words from a text (also known as “word tokenisation”);
- Omit words that are not supposed to be included;
- Add lemmas from different forms of a word (also known as “lemmatisation”);
- Extract dictionary entries from each word.
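The four steps above can be sketched in Python. This is only a minimal illustration, not the actual program: a regex tokeniser, a hard-coded stop list, a toy lemma table, and a toy dictionary stand in for spaCy, the stop word files, Simplemma, and the kaikki.org data.

```python
import re

# Step 1: word tokenisation (a regex stands in for spaCy here).
def tokenise(text):
    return re.findall(r"[^\W\d_]+", text, re.UNICODE)

# Step 2: words that are not supposed to be included.
STOP_WORDS = {"the", "a", "of"}

# Step 3: lemmatisation (a toy table stands in for Simplemma).
LEMMAS = {"cats": "cat", "ran": "run"}

# Step 4: dictionary lookup (a toy dictionary stands in for kaikki.org data).
DICTIONARY = {"cat": "a small domesticated felid", "run": "to move quickly on foot"}

def text_to_wordlist(text):
    wordlist = {}
    for token in tokenise(text.lower()):
        if token in STOP_WORDS:
            continue  # step 2: omit stop words
        lemma = LEMMAS.get(token, token)
        wordlist[lemma] = DICTIONARY.get(lemma, "")
    return wordlist

print(text_to_wordlist("The cats ran."))
# → {'cat': 'a small domesticated felid', 'run': 'to move quickly on foot'}
```

The real program replaces each stand-in with the tools listed below, but the data flow is the same.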
There are hidden steps:
- Get a tokenisation tool (solution: spaCy);
- Get a list of omitted words (solution: various sources including punctuation, currency symbols, mathematical symbols);
- Get a lemmatisation tool (solution: Simplemma);
- Get a dictionary (solution: kaikki.org).
The program that does this work is at
https://codeberg.org/GrimPixel/Text_to_Wordlist
Make sure you have installed the latest Python, and created a virtual environment. There is a guide on how to create a virtual environment.
(in progress)
Preparation
To install the requirements ruamel.yaml (which provides the ability to read YAML files), spaCy and Simplemma, run either `python -m pip install -r requirements.txt` or `py -m pip install -r requirements.txt`, as described in the official document.
The next thing to do is to download a dictionary file from kaikki.org. Those JSON/JSONL files are extracted from the Wiktionary dump by means of Wiktextract. Place them in the subdirectory defined by `s dictionary directory` in `setting.yaml`.
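Each line of such a JSONL file is a self-contained JSON object, so entries can be read one line at a time with the standard library. The field names below (`word`, `pos`, `senses`, `glosses`) follow the Wiktextract output format; the sample entry itself is made up, so verify the fields against your downloaded file.

```python
import json

# A hypothetical single line from a kaikki.org JSONL dictionary file.
sample_line = '{"word": "Haus", "pos": "noun", "senses": [{"glosses": ["house"]}]}'

entry = json.loads(sample_line)

# Each sense may carry several glosses; flatten them into one list.
glosses = [g for sense in entry.get("senses", []) for g in sense.get("glosses", [])]

print(entry["word"], entry["pos"], glosses)
# → Haus noun ['house']
```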
The words to be omitted (“stop words”) are stored either as line-separated values or as a TSV file with a header row and the stop words in the first column, placed in the subdirectory defined by `s stop word directory` in `setting.yaml`. TSV support exists so that learned words can be omitted conveniently: simply move wordlist files into the stop word directory.
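A sketch of reading both stop word formats, assuming a tab delimiter and a single header row for the TSV case (in-memory strings stand in for the actual files):

```python
import csv
import io

# Line-separated values: one stop word per line.
def read_lsv(f):
    return {line.strip() for line in f if line.strip()}

# TSV with a header row; stop words are taken from the first column.
def read_tsv(f):
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0] for row in reader if row}

lsv_file = io.StringIO("the\na\nof\n")
tsv_file = io.StringIO("word\tmeaning\ncat\tfelid\nrun\tmove fast\n")

# The union of both sources is the full stop word set.
print(read_lsv(lsv_file) | read_tsv(tsv_file))
```

Because a wordlist TSV already has the word in its first column, `read_tsv` is what lets finished wordlists double as stop word files.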
To install spaCy language models, see its guide and set the model at `s spacy pipeline` in `setting.yaml`.
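Putting the settings together, a `setting.yaml` might look like the following. The key names `s dictionary directory`, `s stop word directory` and `s spacy pipeline` come from the text above; the values are only illustrative assumptions (`en_core_web_sm` is spaCy's small English pipeline).

```yaml
# Illustrative values only; adjust paths and model to your setup.
s dictionary directory: dictionaries
s stop word directory: stop_words
s spacy pipeline: en_core_web_sm
```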