Language/Multiple-languages/Culture/Tutorial-of-Text-to-Wordlist

Hi, polyglots.

Building a wordlist is a good method for learning a language. If only there were a program that reads a text and generates a wordlist with a dictionary entry attached to each word.

There are tools like [https://github.com/jzohrab/lute-v3 Lute] and [https://github.com/FreeLanguageTools/vocabsieve VocabSieve], but their limitations are obvious: Lute requires adding texts as books and needs many mouse clicks to add a word; VocabSieve has limited support for dictionary websites.

So how about utilising existing tools? Which tools can be utilised? There is a guide on how to make a TSV file, which describes the advantages of TSV as the format for wordlists.

The entire procedure can be divided into the following steps:
# Extract all words from a text (also known as “word tokenisation”);
# Omit words that are not supposed to be included;
# Add lemmas for the different forms of a word (also known as “lemmatisation”);
# Extract dictionary entries for each word.

There are hidden steps:
* Get a tokenisation tool (solution: [https://spacy.io/ spaCy]);
* Get a list of omitted words (solution: various sources including [https://en.wikipedia.org/wiki/Template:Punctuation_marks_in_Unicode punctuation], [https://en.wikipedia.org/wiki/Currency_symbol currency symbols], [https://en.wikipedia.org/wiki/Glossary_of_mathematical_symbols mathematical symbols]);
* Get a lemmatisation tool (solution: [https://github.com/adbar/simplemma Simplemma]);
* Get a dictionary (solution: [https://kaikki.org/ kaikki.org]).
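
To show how these pieces fit together, here is a minimal sketch of steps 1 to 3 with spaCy and Simplemma. The model name, the stop-word file and the sample sentence are placeholders for this illustration, not names taken from the actual program.

<syntaxhighlight lang="python">
import spacy
import simplemma

nlp = spacy.load("en_core_web_sm")                # tokenisation tool
with open("stop_words.txt", encoding="utf-8") as f:
    stop_words = {line.strip() for line in f}     # words to omit

text = "The cats were sitting on the mats."
wordlist = set()
for token in nlp(text):                           # step 1: tokenise
    if not token.is_alpha or token.text in stop_words:
        continue                                  # step 2: omit
    # step 3: reduce the word form to its lemma
    wordlist.add(simplemma.lemmatize(token.text, lang="en"))

print(sorted(wordlist))
# step 4 would look each lemma up in the dictionary
</syntaxhighlight>
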
The program that does this work is available at


<big><big><big>https://codeberg.org/GrimPixel/Text_to_Wordlist</big></big></big>


Make sure you have installed the latest Python and created a virtual environment. There is a [https://www.dataquest.io/blog/a-complete-guide-to-python-virtual-environments/ guide on how to create a virtual environment].

(in progress)

== Preparation ==

To install the requirements (ruamel.yaml, which provides the ability to read YAML files, spaCy and Simplemma), run either `python -m pip install -r requirements.txt` or `py -m pip install -r requirements.txt`, as described in [https://pip.pypa.io/en/stable/user_guide/#requirements-files the official pip documentation].

The next step is to download a dictionary file from [https://kaikki.org/ kaikki.org]. These JSON/JSONL files are extracted from the [https://dumps.wikimedia.org/backup-index.html Wiktionary dump] by means of [https://github.com/tatuylonen/wiktextract Wiktextract]. Place them in the subdirectory defined by `s dictionary directory` in `setting.yaml`.
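
In these files, each line is one JSON object describing one dictionary entry. Here is a sketch of loading such a file into a lookup table, assuming the usual Wiktextract fields `word`, `pos` and `senses`; the file name is only an example.

<syntaxhighlight lang="python">
import json

entries = {}
with open("kaikki.org-dictionary-English.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        word = entry.get("word")
        if word is None:
            continue
        # gather every gloss of every sense under the headword
        glosses = [gloss
                   for sense in entry.get("senses", [])
                   for gloss in sense.get("glosses", [])]
        entries.setdefault(word, []).append((entry.get("pos", ""), glosses))

print(entries.get("cat"))
</syntaxhighlight>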

Words to be omitted (“stop words”) are stored either as line-separated values or as a TSV file with a header and the stop words in the first column; place these files in the subdirectory defined by `s stop word directory` in `setting.yaml`. TSV support exists so that learned words can be omitted conveniently: a finished wordlist file can simply be moved into the stop-word directory.
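
A sketch of how both stop-word formats could be read; the directory name `stop_words` is a placeholder, as the actual name comes from `s stop word directory`.

<syntaxhighlight lang="python">
import csv
from pathlib import Path

stop_words = set()
for path in Path("stop_words").iterdir():
    with open(path, encoding="utf-8") as f:
        if path.suffix == ".tsv":
            rows = csv.reader(f, delimiter="\t")
            next(rows, None)                 # skip the header row
            stop_words.update(row[0] for row in rows if row)
        else:                                # line-separated values
            stop_words.update(line.strip() for line in f if line.strip())
</syntaxhighlight>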

To install spaCy language models, see [https://spacy.io/usage/models its guide] and set the model in `s spacy pipeline` in `setting.yaml`.
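
For reference, a program can read such settings with ruamel.yaml as sketched below. The key names are the ones quoted above; that `setting.yaml` is a flat mapping is an assumption made for this illustration.

<syntaxhighlight lang="python">
from ruamel.yaml import YAML

yaml = YAML(typ="safe")
with open("setting.yaml", encoding="utf-8") as f:
    setting = yaml.load(f)

print(setting["s spacy pipeline"])        # e.g. a model name such as "en_core_web_sm"
print(setting["s dictionary directory"])
print(setting["s stop word directory"])
</syntaxhighlight>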