Language/Multiple-languages/Culture/Tutorial-of-Text-to-Wordlist
Hi, polyglots.
Building a wordlist is a good way to learn a language. If only there were a program that reads a text and generates a wordlist with a dictionary entry attached to each word.
Tools such as Lute and VocabSieve exist, but their limitations are obvious: Lute requires texts to be added as books and takes many mouse clicks to add a word; VocabSieve has limited support for dictionary websites.
Can existing tools be combined to do this, and if so, which ones? There is a guide on how to make a TSV file, which describes the advantages of TSV as a format for wordlists.
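As a rough illustration of what a wordlist in TSV form might look like, here is a minimal sketch using only Python's standard `csv` module. The column names (`word`, `lemma`, `definition`) and the helper `write_wordlist_tsv` are assumptions made for this example, not the layout used by the guide or by the program linked further down.

```python
import csv
import io


def write_wordlist_tsv(rows, stream):
    """Write (word, lemma, definition) rows as tab-separated values."""
    # The column names are hypothetical; pick whatever suits your wordlist.
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(["word", "lemma", "definition"])
    writer.writerows(rows)


# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
write_wordlist_tsv([("cats", "cat", "a small domesticated feline")], buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, pass a file opened with `open("wordlist.tsv", "w", encoding="utf-8", newline="")` in place of the buffer.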
The entire procedure can be divided into the following steps:
- Extract all words from a text (also known as “word tokenisation”);
- Filter out words that should not be included;
- Reduce the different forms of a word to its lemma (also known as “lemmatisation”);
- Look up the dictionary entry for each word.
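The four steps can be sketched end to end with the standard library alone. This is only a toy under stated assumptions: a regular expression stands in for a real tokeniser, and the hypothetical tables `TOY_LEMMAS` and `TOY_DICTIONARY` stand in for a real lemmatiser and a real dictionary. All function names are made up for this example.

```python
import re
import unicodedata

# Hypothetical toy data standing in for a real lemmatiser and dictionary.
TOY_LEMMAS = {"cats": "cat", "sat": "sit", "mats": "mat"}
TOY_DICTIONARY = {
    "cat": "a small domesticated feline",
    "sit": "to be seated",
    "mat": "a piece of floor covering",
}


def tokenise(text):
    # Step 1: runs of word characters, or single non-space symbols.
    return re.findall(r"\w+|[^\w\s]", text)


def is_unwanted(token):
    # Step 2: drop tokens made only of punctuation (P*), symbols (S*)
    # or numbers (N*), judged by Unicode general category.
    return all(unicodedata.category(ch)[0] in "PSN" for ch in token)


def lemmatise(token):
    # Step 3: map an inflected form to its lemma, falling back to lowercase.
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def build_wordlist(text):
    # Step 4: attach a dictionary entry to each distinct lemma, in text order.
    seen = set()
    rows = []
    for token in tokenise(text):
        if is_unwanted(token):
            continue
        lemma = lemmatise(token)
        if lemma in seen:
            continue
        seen.add(lemma)
        rows.append((lemma, TOY_DICTIONARY.get(lemma, "")))
    return rows


wordlist = build_wordlist("The cats sat on 2 mats, the cat sat.")
for lemma, definition in wordlist:
    print(lemma, definition, sep="\t")
```

Lemmas without an entry in the toy dictionary get an empty definition rather than being dropped, so that missing entries remain visible in the list.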
Each of these has a hidden prerequisite:
- Get a tokenisation tool (solution: spaCy);
- Get a list of words to filter out (solution: various sources covering punctuation, currency symbols, and mathematical symbols);
- Get a lemmatisation tool (solution: Simplemma);
- Get a dictionary (solution: kaikki.org).
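For the dictionary step, here is a minimal sketch of turning a machine-readable dictionary into a lookup table. It assumes the data is a JSON Lines file in which each line is one entry with a `word` field and a `senses` list whose items carry a `glosses` list. That layout is an assumption to check against the file you actually download from kaikki.org. The helper `load_glosses` and the two sample lines are invented for the example.

```python
import io
import json


def load_glosses(lines):
    """Map each headword to the list of its gloss strings."""
    glosses = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        word = entry.get("word")
        if not word:
            continue  # skip entries without a headword
        for sense in entry.get("senses", []):
            # Entries sharing a headword (e.g. noun and verb) are merged.
            glosses.setdefault(word, []).extend(sense.get("glosses", []))
    return glosses


# Two made-up lines in the assumed layout; a real run would iterate over
# an opened file instead of this in-memory sample.
sample = io.StringIO(
    '{"word": "cat", "pos": "noun", "senses": [{"glosses": ["a small domesticated feline"]}]}\n'
    '{"word": "cat", "pos": "verb", "senses": [{"glosses": ["to hoist an anchor"]}]}\n'
)
gloss_index = load_glosses(sample)
print(gloss_index["cat"])
```

Reading the file line by line keeps memory use low for the parsing itself, although the resulting table still holds every gloss that was kept.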
The program that does this work is available at
https://codeberg.org/GrimPixel/Text_to_Wordlist
Make sure you have installed the latest version of Python and created a virtual environment. There is a guide on how to create a virtual environment.
(in progress)