Language/Multiple-languages/Culture/Tutorial-of-Text-to-Wordlist
Hi, polyglots.

When learning a language, building a wordlist is a good method. Wouldn't it be convenient if a program could read a text and generate a wordlist, with dictionary entries attached to each word?

There are tools like [https://github.com/jzohrab/lute-v3 Lute] and [https://github.com/FreeLanguageTools/vocabsieve VocabSieve], but their limitations are obvious: Lute requires adding texts as books and needs many mouse clicks to add a word; VocabSieve has limited support for dictionary websites.

So how about utilising other existing tools instead? Which tools can be utilised? There is a guide on how to make a TSV file, where the advantages of TSV as the format for wordlists are described.

The entire procedure can be divided into the following steps:
# Extract all words from a text (also known as “word tokenisation”);
# Omit words that are not supposed to be included;
# Add lemmas from different forms of a word (also known as “lemmatisation”);
# Extract dictionary entries from each word.
There are hidden steps:
* Get a tokenisation tool (solution: [https://spacy.io/ spaCy]);
* Get a list of omitted words (solution: various sources including [https://en.wikipedia.org/wiki/Template:Punctuation_marks_in_Unicode punctuation], [https://en.wikipedia.org/wiki/Currency_symbol currency symbols], [https://en.wikipedia.org/wiki/Glossary_of_mathematical_symbols mathematical symbols]);
* Get a lemmatisation tool (solution: [https://github.com/adbar/simplemma Simplemma]; a sketch combining spaCy and Simplemma follows this list);
* Get a dictionary (solution: [https://kaikki.org/ kaikki.org]).
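
To make the tokenisation and lemmatisation steps concrete, here is a minimal sketch in Python. It assumes English and the small spaCy model `en_core_web_sm` (install it first with `python -m spacy download en_core_web_sm`); the model name and the sample sentence are only examples:

<syntaxhighlight lang="python">
# A minimal sketch of steps 1-3: tokenise with spaCy, omit punctuation,
# and lemmatise with Simplemma. The English model "en_core_web_sm" and
# the sample sentence are only examples.
import spacy
import simplemma

nlp = spacy.load("en_core_web_sm")
doc = nlp("Here are some words.")

wordlist = []
for token in doc:
    if token.is_punct or token.is_space:
        continue  # step 2: omit stop words such as punctuation
    wordlist.append(token.text)
    lemma = simplemma.lemmatize(token.text, lang="en")
    if lemma != token.text:
        wordlist.append(lemma)  # step 3: also keep the lemma

# Per the wordlist example below, this should yield something like:
# ['Here', 'here', 'are', 'be', 'some', 'words', 'word']
print(wordlist)
</syntaxhighlight>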


The program that does this work is available at:

<big><big><big>https://codeberg.org/GrimPixel/Text_to_Wordlist</big></big></big>


Make sure you have installed the latest Python and created a virtual environment. There is a [https://www.dataquest.io/blog/a-complete-guide-to-python-virtual-environments/ guide on how to create a virtual environment].
== Definition ==
A “text” looks like this:
{| class="wikitable"
|Here are some words.
|}
 
“Stop words” look like this:
{| class="wikitable"
|,
|-
|.
|}
 
The “wordlist” (the extracted words, followed by the lemmas of inflected or capitalised forms) looks like this:
{| class="wikitable"
|Here
|-
|here
|-
|are
|-
|be
|-
|some
|-
|words
|-
|word
|}
 
The “augmented wordlist” looks like this:
{| class="wikitable"
!word
!glosses
!translations
|-
|here
|{gloss1} • {gloss2} • {gloss3} • {gloss4}
|{translation:ar} • {translation:bg} • {translation:ca}
|-
|are
|{gloss1}
|
|-
|be
|{gloss1} • {gloss2} • {gloss3} • {gloss4} • {gloss5} • {gloss6} • {gloss7}
|{translation:ar} • {translation:ca}
|-
|some
|{gloss1} • {gloss2} • {gloss3}
|{translation:bg} • {translation:ca}
|-
|word
|{gloss1} • {gloss2} • {gloss3} • {gloss4} • {gloss5}
|{translation:ar} • {translation:bg} • {translation:ca}
|}
 
The “filtered augmented wordlist” looks like this:
{| class="wikitable"
!word
!glosses
!translations
|-
|here
|{gloss1} • {gloss2}
|{translation:ar}
|-
|are
|{gloss1}
|
|-
|be
|{gloss1} • {gloss2}
|{translation:ar}
|-
|some
|{gloss1} • {gloss2}
|
|-
|word
|{gloss1} • {gloss2}
|{translation:ar}
|}
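
The wordlists are stored as TSV files (see the TSV guide mentioned in the introduction); the glosses and the translations of a word are each joined by “ • ”, as the tables above show. Here is a minimal sketch of writing one augmented-wordlist row in that shape, with placeholder data:

<syntaxhighlight lang="python">
# A minimal sketch of writing one augmented-wordlist row as TSV, with
# glosses and translations each joined by " • " as in the tables above.
# The entry data is placeholder.
import csv

entry = {
    "word": "be",
    "glosses": ["gloss1", "gloss2"],
    "translations": ["translation:ar", "translation:ca"],
}

with open("augmented_wordlist.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["word", "glosses", "translations"])
    writer.writerow([entry["word"],
                     " • ".join(entry["glosses"]),
                     " • ".join(entry["translations"])])
</syntaxhighlight>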
 
== Preparation ==
To install the requirements ruamel.yaml (which provides the ability to read YAML files), spaCy and Simplemma, run either `python -m pip install -r requirements.txt` or `py -m pip install -r requirements.txt`, as described in [https://pip.pypa.io/en/stable/user_guide/#requirements-files the official document].
 
In `setting.yaml`:
 
The required subdirectories are defined by `s text directory`, `s wordlist directory`, `s augmented wordlist directory`, `s filtered augmented wordlist directory`.
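
These settings live in `setting.yaml` and can be read with ruamel.yaml. A minimal sketch of reading them; the key names are the ones documented on this page, while the values depend on your own configuration:

<syntaxhighlight lang="python">
# A minimal sketch of reading setting.yaml with ruamel.yaml.
# The key names are the ones documented on this page; the values
# depend on your own configuration.
from ruamel.yaml import YAML

yaml = YAML(typ="safe")
with open("setting.yaml", encoding="utf-8") as f:
    settings = yaml.load(f)

print(settings["s text directory"])
print(settings["s wordlist directory"])
print(settings["s augmented wordlist directory"])
print(settings["s filtered augmented wordlist directory"])
</syntaxhighlight>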
 
The words to be omitted (“stop words”) are given either as line-separated values or as a TSV file with a header whose first column holds the stop words; these files go in the subdirectory defined by `s stop word directory`. The TSV support exists so that learned words can be omitted conveniently: just move wordlist files into the stop-word directory.
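
A minimal sketch of loading stop words from both supported formats; the directory name here is only an example, in practice it comes from `s stop word directory`:

<syntaxhighlight lang="python">
# A minimal sketch of loading stop words from both supported formats:
# line-separated values, and a TSV whose first column holds the words
# (its header row is skipped). The directory name is an example.
import csv
from pathlib import Path

stop_words = set()
for path in Path("stop_words").iterdir():
    with open(path, encoding="utf-8") as f:
        if path.suffix == ".tsv":
            reader = csv.reader(f, delimiter="\t")
            next(reader, None)  # skip the header
            stop_words.update(row[0] for row in reader if row)
        else:
            stop_words.update(line.strip() for line in f if line.strip())
</syntaxhighlight>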
 
To install spaCy language models, follow [https://spacy.io/usage/models its guide] and set the model at `s spacy pipeline`; then set the language code at `s wiktionary lang_code`.
 
Set the language for Simplemma at `s simplemma language code`.
 
If a word is missing from the dictionary, it is added to a list, and the list is saved as a TXT file. The prefix and suffix of those file names can be defined by `s prefix for list of the missing word` and `s suffix for list of the missing word`.
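
A minimal sketch of saving such a list; it assumes the prefix and suffix simply wrap the output file name, which may differ from the program's exact naming:

<syntaxhighlight lang="python">
# A minimal sketch of saving words that are missing from the dictionary.
# It assumes the prefix and suffix wrap the output file name; the exact
# naming may differ in the program.
missing_words = ["somerareword", "anotherrareword"]  # placeholder data
prefix = "missing_"  # from "s prefix for list of the missing word"
suffix = "_words"    # from "s suffix for list of the missing word"

with open(f"{prefix}mytext{suffix}.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(missing_words))
</syntaxhighlight>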
 
Download a dictionary file from [https://kaikki.org/ kaikki.org]. Place the JSON/JSONL files in the subdirectory defined by `s dictionary directory`, set the `s json file` value, set `s wiktionary lang_code` and select `ls dictionary content`.
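
Each kaikki.org file is JSONL: one JSON object per line, with the word, its senses (containing glosses) and its translations, following the wiktextract schema. A minimal sketch of collecting glosses and translations from such a file; the file name is an example:

<syntaxhighlight lang="python">
# A minimal sketch of reading a kaikki.org JSONL dictionary: one JSON
# object per line, with "word", "senses" (holding "glosses") and
# "translations", following the wiktextract schema.
import json

dictionary = {}
with open("kaikki.org-dictionary-English.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        word = entry.get("word")
        glosses = [gloss
                   for sense in entry.get("senses", [])
                   for gloss in sense.get("glosses", [])]
        translations = [f'{t.get("code", "")}: {t.get("word", "")}'
                        for t in entry.get("translations", [])]
        record = dictionary.setdefault(word, {"glosses": [], "translations": []})
        record["glosses"].extend(glosses)
        record["translations"].extend(translations)
</syntaxhighlight>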
 
Also take a look at `ls dictionary content after filtering` and at the options for filtering sound tags and translation tags and for limiting glosses/raw_glosses; change them if needed.
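
As an illustration of what this filtering does, here is a minimal sketch that caps the number of glosses per word and keeps only chosen translation languages; the limit and the language set are examples, not the program's defaults:

<syntaxhighlight lang="python">
# A minimal sketch of the filtering idea: keep at most gloss_limit
# glosses per word and only the translations whose language code is
# wanted. The limit and the language set are examples, not defaults.
gloss_limit = 2
wanted_langs = {"ar"}

def filter_entry(glosses, translations):
    kept_glosses = glosses[:gloss_limit]
    kept_translations = [t for t in translations
                         if t.split(":", 1)[0] in wanted_langs]
    return kept_glosses, kept_translations

print(filter_entry(["gloss1", "gloss2", "gloss3"],
                   ["ar: ...", "bg: ...", "ca: ..."]))
# -> (['gloss1', 'gloss2'], ['ar: ...'])
</syntaxhighlight>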


== Generation of Wordlist ==
# Place a text in TXT format at the directory defined by `s text directory`;
# Execute `python extract_text.py` and get a wordlist at the directory defined by `s wordlist directory`;
# Execute `python extract_dictionary.py` and get an “augmented wordlist” (wordlist with dictionary content attached to each entry) at the directory defined by `s augmented wordlist directory`;
# Execute `python filter_augmented_wordlist.py` and get a “filtered augmented wordlist” at the directory defined by `s filtered augmented wordlist directory`.
