Language/Multiple-languages/Culture/Tutorial-of-Text-to-Wordlist

From Polyglot Club WIKI
Revision as of 17:02, 30 April 2024 by GrimPixel

Hi, polyglots.

Building a wordlist is a good method for learning a language. If only there were a program that read a text and generated a wordlist with dictionary entries attached to each word.

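There are tools like Lute (https://github.com/jzohrab/lute-v3) and VocabSieve (https://github.com/FreeLanguageTools/vocabsieve), but their limitations are obvious: Lute requires adding texts as books and needs many mouse clicks to add a word; VocabSieve supports only a limited number of dictionary websites.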

How about making use of existing tools instead? Which tools can be used? There is a separate guide on how to make a TSV file, which describes the advantages of TSV as a format for wordlists.
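To give a rough idea of what a TSV wordlist looks like, here is a minimal sketch that writes and reads one with Python's standard `csv` module. The column names, the sample rows and the file name `wordlist.tsv` are assumptions made up for this illustration; they are not taken from the TSV guide.

```python
import csv

# Hypothetical sample rows: (word, lemma, definition).
ROWS = [
    ("houses", "house", "a building for people to live in"),
    ("ran", "run", "to move quickly on foot"),
]


def write_wordlist(path, rows):
    """Write the rows to a tab-separated file with a header line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["word", "lemma", "definition"])
        writer.writerows(rows)


def read_wordlist(path):
    """Read the file back as a list of dicts keyed by the header names."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


if __name__ == "__main__":
    write_wordlist("wordlist.tsv", ROWS)
    for entry in read_wordlist("wordlist.tsv"):
        print(entry["word"], "->", entry["lemma"])
```

Because each field is separated by a tab, definitions may contain commas and spaces without any extra handling.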

The entire procedure can be divided into the following steps:

  1. Extract all words from a text (also known as “word tokenisation”);
  2. Filter out words that should not be included;
  3. Add the lemma for each inflected form of a word (also known as “lemmatisation”);
  4. Extract dictionary entries for each word.
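The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Text_to_Wordlist program works: the stopword set, the lemma table and the tiny dictionary are hypothetical stand-ins for real tools, and the regular-expression tokeniser assumes a language that separates words with spaces.

```python
import re

# Hypothetical stand-ins; real tokenisers, lemmatisers and dictionaries
# would replace these hand-written tables.
STOPWORDS = {"the", "a", "and"}
LEMMAS = {"cats": "cat", "sat": "sit", "mats": "mat"}
DICTIONARY = {
    "cat": "a small domesticated feline",
    "sit": "to rest on a seat",
    "mat": "a piece of floor covering",
}


def tokenise(text):
    """Step 1: extract lowercase runs of letters (no digits, no underscores)."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def filter_words(tokens, stopwords):
    """Step 2: drop unwanted words and duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for token in tokens:
        if token in stopwords or token in seen:
            continue
        seen.add(token)
        kept.append(token)
    return kept


def lemmatise(word, lemmas):
    """Step 3: map a word form to its lemma; unknown forms stay unchanged."""
    return lemmas.get(word, word)


def look_up(lemma, dictionary):
    """Step 4: fetch the dictionary entry, or an empty string if missing."""
    return dictionary.get(lemma, "")


def text_to_wordlist(text):
    """Run all four steps and return (word, lemma, entry) rows."""
    rows = []
    for word in filter_words(tokenise(text), STOPWORDS):
        lemma = lemmatise(word, LEMMAS)
        rows.append((word, lemma, look_up(lemma, DICTIONARY)))
    return rows


if __name__ == "__main__":
    for row in text_to_wordlist("The cats sat on the mats."):
        print("\t".join(row))
```

Words missing from the lemma table or the dictionary are kept with an empty entry rather than dropped, so gaps stay visible in the resulting list. Note that this simple pattern splits contractions such as "don't" at the apostrophe.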

There are hidden steps:

The program that does this work is available at:

https://codeberg.org/GrimPixel/Text_to_Wordlist

Make sure you have installed the latest version of Python and created a virtual environment. There is a guide on how to create a virtual environment.

(in progress)

Get a tokenisation tool, then extract all words from a text

Contributors

GrimPixel

