Tutorial of Text to Wordlist


Hi, polyglots.

When learning a language, building a wordlist is a good method. Wouldn't it be convenient if there were a program that reads a text and generates a wordlist, with dictionary entries attached to each word?

There are tools like Lute and VocabSieve, but their limitations are obvious: Lute requires adding texts as books and takes many mouse clicks to add a word; VocabSieve has limited support for dictionary websites.

So how about utilising other tools instead? And which tools can be utilised? There is a guide on how to make a TSV file, which describes the advantages of TSV as the format for wordlists.

The entire procedure can be divided into the following steps (a short Python sketch of steps 1–3 follows the list):

  1. Extract all words from a text (also known as “word tokenisation”);
  2. Omit words that are not supposed to be included;
  3. Add the lemma of each inflected word form (also known as “lemmatisation”);
  4. Extract the dictionary entry for each word.
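
For orientation, steps 1–3 can be sketched in a few lines of Python with spaCy and Simplemma. This is a minimal sketch rather than the program itself: the pipeline name `en_core_web_sm`, the language code and the stop-word set are hard-coded assumptions here, whereas the actual scripts read these values from `setting.yaml`.

```python
import spacy          # word tokenisation (step 1)
import simplemma      # lemmatisation (step 3)

nlp = spacy.load("en_core_web_sm")   # assumed English pipeline
stop_words = {",", "."}              # assumed stop words, as in the example below

text = "Here are some words."
wordlist = []
for token in nlp(text):
    word = token.text
    if word in stop_words:
        continue                     # step 2: omit stop words
    wordlist.append(word)
    lemma = simplemma.lemmatize(word, lang="en")
    if lemma != word:
        wordlist.append(lemma)       # keep both the form and its lemma

print(wordlist)
# roughly: ['Here', 'here', 'are', 'be', 'some', 'words', 'word']
```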

There are also hidden steps.

The program that does this work is available at:

https://codeberg.org/GrimPixel/Text_to_Wordlist


Make sure you have installed the latest Python, and created a virtual environment. There is a guide on how to create a virtual environment.

Definition

A “text” looks like this:

Here are some words.

The “stop words” look like this:

,
.

The “wordlist” looks like this (each word form is followed by its lemma when they differ):

Here
here
are
be
some
words
word

The “augmented wordlist” looks like this (the three columns are shown here separated by “|” for readability):

word | glosses | translations
here | {gloss1} • {gloss2} • {gloss3} • {gloss4} | {translation:ar} • {translation:bg} • {translation:ca}
are | {gloss1} |
be | {gloss1} • {gloss2} • {gloss3} • {gloss4} • {gloss5} • {gloss6} • {gloss7} | {translation:ar} • {translation:ca}
some | {gloss1} • {gloss2} • {gloss3} | {translation:bg} • {translation:ca}
word | {gloss1} • {gloss2} • {gloss3} • {gloss4} • {gloss5} | {translation:ar} • {translation:bg} • {translation:ca}

The “filtered augmented wordlist” looks like this:

word | glosses | translations
here | {gloss1} • {gloss2} | {translation:ar}
are | {gloss1} |
be | {gloss1} • {gloss2} | {translation:ar}
some | {gloss1} • {gloss2} |
word | {gloss1} • {gloss2} | {translation:ar}
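
The relationship between the two is simply filtering: each row of the augmented wordlist is cut down according to the filtering options, for example a limit on the number of glosses and a restriction on translation languages. A minimal sketch of the idea, with the gloss limit and the kept language chosen to match the example above (the variable names are hypothetical):

```python
# Hypothetical row data, mirroring the "word" row of the examples above.
glosses = ["{gloss1}", "{gloss2}", "{gloss3}", "{gloss4}", "{gloss5}"]
translations = {"ar": "{translation:ar}", "bg": "{translation:bg}", "ca": "{translation:ca}"}

gloss_limit = 2          # assumed limit for glosses
kept_languages = {"ar"}  # assumed translation languages to keep

filtered_glosses = glosses[:gloss_limit]
filtered_translations = [t for code, t in translations.items() if code in kept_languages]

print(" • ".join(filtered_glosses))       # {gloss1} • {gloss2}
print(" • ".join(filtered_translations))  # {translation:ar}
```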

Preparation

To install the requirements (ruamel.yaml, which provides the ability to read YAML files, plus spaCy and Simplemma), run either `python -m pip install -r requirements.txt` or `py -m pip install -r requirements.txt` (the `py` launcher is for Windows), as described in the official pip documentation.

In `setting.yaml`:

The required subdirectories are defined by `s text directory`, `s wordlist directory`, `s augmented wordlist directory`, `s filtered augmented wordlist directory`.
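
As a reference point, these settings are ordinary YAML keys and can be read with ruamel.yaml. The following is a minimal sketch, not the program's actual code, assuming `setting.yaml` sits in the working directory and the keys hold directory paths:

```python
from pathlib import Path
from ruamel.yaml import YAML

# Load setting.yaml; ruamel.yaml returns a dict-like mapping of the keys.
yaml = YAML(typ="safe")
with open("setting.yaml", encoding="utf-8") as f:
    settings = yaml.load(f)

# Create the required subdirectories if they do not exist yet.
for key in ("s text directory", "s wordlist directory",
            "s augmented wordlist directory",
            "s filtered augmented wordlist directory"):
    Path(settings[key]).mkdir(parents=True, exist_ok=True)
```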

The words to be omitted (“stop words”) go into the subdirectory defined by `s stop word directory`, either as line-separated values or as a TSV with a header and the stop words in the first column. The TSV format is supported so that already-learned words can be omitted conveniently: simply move finished wordlist files into the stop-word directory.
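
In other words, a stop-word file is either one value per line or a TSV whose first column (after the header row) holds the stop words. A minimal sketch of reading both formats, assuming a hypothetical `stop_words` directory:

```python
from pathlib import Path

def load_stop_words(path: Path) -> set[str]:
    """Read stop words from a line-separated file or a TSV with a header."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if path.suffix.lower() == ".tsv":
        lines = lines[1:]                 # skip the TSV header row
    # For a TSV, keep only the first column; plain files have one value per line.
    return {line.split("\t")[0] for line in lines if line.strip()}

# Hypothetical usage: collect stop words from every file in a stop-word directory.
stop_words: set[str] = set()
for file in Path("stop_words").glob("*"):
    if file.is_file():
        stop_words |= load_stop_words(file)
```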

To install a spaCy language model, follow the spaCy guide, then set the model at `s spacy pipeline`. Then set the language code at `s wiktionary lang_code`.

Set the language for Simplemma at `s simplemma language code`.

If a word is missing from the dictionary, it will be added to a list, and that list will be saved in a TXT file. The file name's prefix and suffix can be defined by `s prefix for list of the missing word` and `s suffix for list of the missing word`.

Download a dictionary file from kaikki.org. Place the JSON/JSONL files in the subdirectory defined by `s dictionary directory`, set the `s json file` value and the `s wiktionary lang_code`, and select `ls dictionary content`.

Also take a look at `ls dictionary content after filtering` and the options for filtering sound tags and translation tags and for limiting glosses/raw_glosses; change them if needed.
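
For orientation, each line of a kaikki.org JSONL file is one JSON object with fields such as `word`, `senses` (each sense carrying `glosses`/`raw_glosses`) and `translations` (each carrying a language `code` and a `word`). The following minimal sketch, with a hypothetical file path, gloss limit and language selection, shows how glosses and translations could be collected and joined with “ • ” as in the augmented wordlist above:

```python
import json

# Build a simple lookup table from a kaikki.org JSONL dictionary file.
entries = {}
with open("dictionary/en.jsonl", encoding="utf-8") as f:   # hypothetical path
    for line in f:
        entry = json.loads(line)
        entries.setdefault(entry["word"], []).append(entry)

def glosses_for(word, limit=2):
    """Collect up to `limit` glosses for a word, joined with ' • ' (limit is an assumption)."""
    glosses = []
    for entry in entries.get(word, []):
        for sense in entry.get("senses", []):
            glosses.extend(sense.get("glosses", []))
    return " • ".join(glosses[:limit])

def translations_for(word, codes=("ar",)):
    """Collect translations restricted to the given language codes (codes are an assumption)."""
    found = []
    for entry in entries.get(word, []):
        for tr in entry.get("translations", []):
            if tr.get("code") in codes and "word" in tr:
                found.append(tr["word"])
    return " • ".join(found)

print(glosses_for("word"), translations_for("word"))
```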

Generation of Wordlist

  1. Place a text in TXT format in the directory defined by `s text directory`;
  2. Execute `python extract_text.py` to get a wordlist in the directory defined by `s wordlist directory`;
  3. Execute `python extract_dictionary.py` to get an “augmented wordlist” (a wordlist with dictionary content attached to each entry) in the directory defined by `s augmented wordlist directory`;
  4. Execute `python filter_augmented_wordlist.py` to get a “filtered augmented wordlist” in the directory defined by `s filtered augmented wordlist directory`.

Contributors

GrimPixel

