Editing Language/Multiple-languages/Culture/How-to-make-a-TSV-file

{{Anki-menu}}
A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure. It can be very useful for storing data, language-related data in particular.
[[Category:Computer-Knowledge]]
[[File:tsv-file-polyglotclub-lessons.png|thumb]]


You may have visited [[Language/Multiple-languages/Culture/Internet-Dictionaries]] or [https://tatoeba.org/en/downloads Tatoeba's download page] and want to use some of the downloadable material to create flashcards in [https://apps.ankiweb.net/ Anki] or Mnemosyne, or to load it into [https://mouse-dictionary.netlify.app/ Mouse Dictionary]. But copying and pasting entry by entry takes far too much effort. If we could use a spreadsheet, things would be much easier. Can we do that?


You may have noticed that both programs have a “File -> Import...” option. But they don't support XLS or XLSX files. What should you do?


== What is TSV ==
If you open a spreadsheet program (e.g. [https://www.libreoffice.org/ LibreOffice] Calc, [https://www.openoffice.org/ Apache OpenOffice] Calc, [https://www.onlyoffice.com/ ONLYOFFICE] Spreadsheet Editor, [https://www.office.com/ Microsoft Office] Excel) and click on “File -> Save As...”, you can see some other formats to choose from, one of which is “CSV”.


“[https://en.wikipedia.org/wiki/Comma-separated_values CSV]” means “comma-separated values”: columns are separated by commas. If there is a comma in the text, the field needs to be surrounded by quotation marks, so that the comma is not counted as a column separator. If there are quotation marks in the text, they need to be escaped as well: depending on the program, either the quotation mark is doubled or the [https://en.wikipedia.org/wiki/Escape_character escape character] backslash “\” is placed before it. Examples: https://gitlab.com/nirooj56/Nepdict/-/blob/master/database/data.csv (click on the “</>” display-source icon) and https://github.com/skywind3000/ECDICT/blob/master/ecdict.mini.csv.


You may have realised that a CSV file doesn't store any styling data. If you save as a CSV file, all the information about fonts, colours, hyperlinks, etc. will be lost. CSV files are lightweight, so when you just need pure data, this format is ideal. Do Anki and Mnemosyne support it?


It does not seem so, but its sibling TSV is supported. In Anki, it is called “Text separated by tabs or semicolons”; in Mnemosyne, it is called “Tab-separated text files”. What is it?
 
“[https://en.wikipedia.org/wiki/Tab-separated_values TSV]” means “tab-separated values”. Some people call it a “tabfile”. It is similar to CSV and has one advantage over it: columns are separated by tabs, so there is no need for quotation marks to show that a comma is text rather than a column separator. Both TSV and CSV belong to “[https://en.wikipedia.org/wiki/Delimiter-separated_values DSV]”, delimiter-separated values.

You may wonder what a “tab” is. [https://en.wikipedia.org/wiki/Tab_key The tabular key] is usually located above the “Caps Lock” key on a PC keyboard. On typewriters it was used to align text on different lines so that tables were easier to type, and computers inherited it. If you press this key in a text editor, the result looks like several spaces; if you press it in a browser, it moves the focus to the next element (link, text box, button, etc.). In a spreadsheet program, you can press the Tab key to move to the next column or the Enter key to move to the next row. TSV files use the same two characters to separate columns and rows, which makes TSV better suited to our purpose than CSV. Examples: https://gitlab.com/C0rn3j/NorwegianToEnglishDict/blob/master/4_finalDictionary/nb-NOtoENdictionary.txt and https://www.eki.ee/litsents/vaba/ies/eestiinglise.txt.

How do you save a file as TSV? This is a bit confusing, because TSV is not as well known as CSV. If you are using LibreOffice, click on “Save As...”, select CSV, then in the dialogue box choose {Tab} as “Field delimiter” and ignore “String delimiter”. If you have neutral quotation marks ⟨"…"⟩ in the text, they will be converted to typographic quotation marks ⟨“…”⟩. The saved file has “CSV” as its file extension, but it is essentially a TSV file.

The problem with tabs is that they look like spaces and can be confused with other whitespace characters. The solution is to use a linter for sheets, such as [https://github.com/mechatroner/vscode_rainbow_csv Rainbow CSV] for VSCodium, [https://github.com/mechatroner/rainbow_csv Rainbow CSV] for Vim, or [https://github.com/emacs-vs/rainbow-csv rainbow-csv] for Emacs. For VSCodium, the extension [https://github.com/janisdd/vscode-edit-csv vscode-edit-csv] is also worth mentioning.
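The difference between the two formats can be seen in a few lines of Python. This is only a minimal sketch: the sample row is made up for illustration, and the plain tab join assumes that no field contains a tab or a line break.

```python
import csv
import io

# Hypothetical sample row: a word and a meaning that contains a comma.
row = ["bonjour", "hello, good morning"]

# CSV: the csv module puts quotation marks around the field that
# contains the delimiter.
csv_buffer = io.StringIO()
csv.writer(csv_buffer, lineterminator="\n").writerow(row)
csv_line = csv_buffer.getvalue()

# TSV: join the fields with a tab; the comma needs no special treatment.
# Assumption: no field contains a tab or a line break.
tsv_line = "\t".join(row) + "\n"

print(repr(csv_line))
print(repr(tsv_line))
```

The CSV line needs quotation marks around the second field, while the TSV line keeps the text exactly as typed.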


== How to convert to TSV ==
You can use [https://github.com/ilius/pyglossary PyGlossary] or [https://github.com/thombashi/pytablewriter pytablewriter] directly.
XML, JSON and YAML are not in the form of a sheet. They are hierarchical and tree-like. You can find a list of tools at [https://github.com/dbohdan/structured-text-tools Structured text tools]; use them to convert to JSON first, then to TSV.
=== Formats ===
==== Spreadsheet formats ====


===== Different delimiters =====
In this case, you need to use a text editor (such as Notepad on Windows) instead of a spreadsheet program to unify the delimiters, then proceed as in the “same delimiter” case.


==== Other formats ====


===== [https://www.sqlite.org/fileformat.html DB] =====
To open it, you need an open-source tool such as [https://sqlitebrowser.org/ DB Browser for SQLite] or [https://sqlitestudio.pl/ SQLiteStudio]. See which tables it contains, export the tables you need, and make sure “Field separator” is “Tab”.
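If you prefer a script to a graphical tool, the same export can be sketched with Python's standard library. This is an illustration only: the function name `table_to_tsv` is made up here, and the table name is assumed to come from a trusted source, since it is inserted directly into the SQL statement.

```python
import csv
import io
import sqlite3


def table_to_tsv(connection, table):
    """Return the whole table as TSV text, with a header row.

    Assumption: `table` is a trusted table name (it is not sanitised).
    """
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # cursor.description holds one tuple per column; item 0 is the name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Usage with a small in-memory database (hypothetical data).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE words (word TEXT, meaning TEXT)")
connection.execute("INSERT INTO words VALUES ('chat', 'cat')")
print(table_to_tsv(connection, "words"))
```

For a real file, you would pass the path of the DB file to `sqlite3.connect` instead of `":memory:"`.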


===== [https://en.wikipedia.org/wiki/DICT DICT/DICT.DZ] =====
DICT is compressed as DICT.DZ, which is essentially [https://en.wikipedia.org/wiki/Gzip gzip]. Open-source tools such as [https://github.com/ib/xarchiver Xarchiver] for Linux and BSD or [https://www.7-zip.org/ 7-Zip] for Windows can open DICT.DZ files without changing the file extension to GZ. DICT files can be edited with a text editor.
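Because DICT.DZ is gzip-compatible, a script can also read it directly. The following is a minimal sketch; the function name `read_dict_dz` is made up for illustration, and UTF-8 is only an assumed default encoding that you may need to change for your file.

```python
import gzip


def read_dict_dz(path, encoding="utf-8"):
    """Return the decompressed text of a DICT.DZ file.

    Assumption: the file is gzip-compatible and encoded as `encoding`.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return handle.read()
```

The returned text can then be processed like any other plain text file.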
 
===== [https://github.com/itkach/slob SLOB] and some others =====
It is a format designed for [http://aarddict.org/ Aard 2]. The open-source tool [https://github.com/ilius/pyglossary PyGlossary] can convert it.
 
===== [https://en.wikipedia.org/wiki/Xml XML] and some others =====
Outdated format with incredible popularity.
 
This sort of conversion is a bit different, because XML, JSON and the rising star [https://yaml.org/ YAML] are not in the form of a sheet. They are hierarchical and tree-like. You can find a list of tools at [https://github.com/dbohdan/structured-text-tools Structured text tools]; use them to convert to JSON first, then to TSV.
 
[https://www.youtube.com/watch?v=shc3CFOKp-0 A video about how to import XML into LibreOffice Calc].


=== Practice ===
It requires combined use of a text editor and a spreadsheet program. You need to look through the file, discover its patterns and separate its contents by means of those patterns. It is a bit like chemistry.


In a text editor's “replace” function, “Tab” can be represented with “\t”, and “Enter” with “\n” on Linux, BSD, macOS and Solaris, or “\r\n” on Windows (this difference is not mentioned again in the examples below). “t” stands for “tab”, while “r” and “n” stand for “[https://en.wikipedia.org/wiki/Carriage_return return]” and “[https://en.wikipedia.org/wiki/Newline newline]” respectively. The backslash is an “[https://en.wikipedia.org/wiki/Escape_character escape character]”. What's more, you can use “[https://en.wikipedia.org/wiki/Regular_expression regular expressions]”, which are complex but powerful. You can visit [https://regexone.com/ RegexOne] and [https://www.regular-expressions.info/ Regular-Expressions.info] to learn more about them.


Different text editors may process regular expressions and escape characters differently. The examples here use [https://www.geany.org/ Geany].

If you select one or more columns or rows and use the “replace” function in a spreadsheet program, it replaces characters in the selected area only.


If you find [https://en.wikipedia.org/wiki/Mojibake mojibake] in your text editor, you probably need to open the text editor first and use its “open file” function, instead of double-clicking on the file or dragging it into the editor; in the encoding selection, try possible [https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings encodings] other than UTF-8.
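Trying encodings one after another can also be sketched in Python. The function name and the list of candidate encodings below are assumptions chosen for illustration; a successful decode only means that the bytes are valid in that encoding, so you still have to look at the text to confirm that it is readable.

```python
def decode_with_fallback(raw, encodings=("utf-8", "euc-jp", "gb18030", "latin-1")):
    """Return (encoding_name, text) for the first candidate that decodes.

    Assumption: the candidate list is ordered from most to least likely.
    """
    for name in encodings:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fits")


# Usage: bytes encoded as EUC-JP are not valid UTF-8, so the second
# candidate is chosen.
name, text = decode_with_fallback("ああいう".encode("euc-jp"))
print(name, text)
```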


Please try to solve the problems yourself. Look at the steps and answers as late as possible.


Solutions involving the command line: [[Language/Multiple-languages/Culture/Licensed-Free-Databases#Manually_convert_to_TSV]].

==== Example 1: [https://freedict.org/downloads/#dictionary-downloads FreeDict English - French] ====
To be specific, this is the DICT.DZ file, not the SLOB file.
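{| class="wikitable"
|a /ə/
1. à, au milie de, en, dans, parmi
2. un, quelqu'un
3. àraisonde, par
4. une
abacus /æbəkəs/
1. abaque
2. boulier
abandon /əbændən/
1. abdiquer
2. abandonner, délaisser, livrer, quitter
3. renoncer, résigner
|}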
First, you need to put multiple lines into one.

Steps for putting multiple lines into one:
# Open with a text editor, turn off “Use regular expressions” and turn on “Use escape sequences”;
# Replace “\n2.” with “; 2.”;
# Replace “\n3.” with “; 3.”;
# Increase the number until no match is found;
# Save.
Alternatively, you can finish entirely in the text editor: replace “/\n1.” with “/\t1.”, then replace “ /” with “\t/”. This gives the final result directly and makes the spreadsheet steps below unnecessary.


Result:
{| class="wikitable"
|a /ə/
1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par; 4. une
abacus /æbəkəs/
1. abaque; 2. boulier
abandon /əbændən/
1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner
|}
Steps for distinguishing odd and even numbers:
# Open with a spreadsheet program;
# Type “0” in B1, move the cursor to the cell's lower-right corner and double click;
# Type “=MOD(B1,2)” in C1 (it returns the remainder of B1 divided by 2), move the cursor to the cell's lower-right corner and double click;
# Select column A, B, C, open “AutoFilter”;
# Create another spreadsheet;
# Filter 0 on column C;
# Copy English entries to the other spreadsheet's column A;
# Filter 1 on column C;
# Copy French entries to the other spreadsheet's column B;
# Save the other spreadsheet;
Result by step 3:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
|-
!'''1'''
|a /ə/
|0
|0
|-
!'''2'''
|1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par; 4. une
|1
|1
|-
!'''3'''
|abacus /æbəkəs/
|2
|0
|-
!'''4'''
|1. abaque; 2. boulier
|3
|1
|-
!'''5'''
|abandon /əbændən/
|4
|0
|-
!'''6'''
|1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner
|5
|1
|}
Result by step 10:
{| class="wikitable"
!
!'''A'''
!'''B'''
|-
!'''1'''
|a /ə/
|1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par; 4. une
|-
!'''2'''
|abacus /æbəkəs/
|1. abaque; 2. boulier
|-
!'''3'''
|abandon /əbændən/
|1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner
|}
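The odd/even trick with MOD corresponds to taking every second line. A minimal Python sketch of the same idea follows; the function name is made up for illustration, and it assumes that headword lines and meaning lines strictly alternate, as in the intermediate result above.

```python
def pair_alternating(lines):
    """Pair each even-indexed line (headword) with the line after it.

    Assumption: lines strictly alternate between headword and meaning.
    """
    headwords = lines[0::2]  # rows where the counter modulo 2 is 0
    meanings = lines[1::2]   # rows where the counter modulo 2 is 1
    return list(zip(headwords, meanings))


lines = [
    "abacus /æbəkəs/",
    "1. abaque; 2. boulier",
    "abandon /əbændən/",
    "1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner",
]
for headword, meaning in pair_alternating(lines):
    print(headword + "\t" + meaning)
```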
Steps for separating English IPA:
# Open with a text editor;
# Replace “ /” with “\t/”;
# Save;
Final result:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
|-
!'''1'''
|a
|/ə/
|1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par; 4. une
|-
!'''2'''
|abacus
|/æbəkəs/
|1. abaque; 2. boulier
|-
!'''3'''
|abandon
|/əbændən/
|1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner
|}
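For comparison, the whole of Example 1 can be sketched as a short Python function. It is not the method described above, only an equivalent illustration: the function name is made up, and it assumes that every sense line starts with a number followed by a full stop and that the pronunciation follows the first “ /” on a headword line.

```python
import re

# A sense line starts with a number and a full stop, e.g. "2. boulier".
SENSE = re.compile(r"^\d+\.")


def freedict_to_rows(text):
    """Turn FreeDict-style text into (headword, ipa, meanings) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if SENSE.match(line) and rows:
            rows[-1][2].append(line)  # sense of the current entry
        else:
            # Split "abacus /æbəkəs/" at the first " /".
            head, separator, rest = line.partition(" /")
            rows.append([head, "/" + rest if separator else "", []])
    return [(head, ipa, "; ".join(senses)) for head, ipa, senses in rows]


sample = "abacus /æbəkəs/\n1. abaque\n2. boulier\n"
for row in freedict_to_rows(sample):
    print("\t".join(row))
```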


==== Example 2: Free Vietnamese Dictionary Project English-Vietnamese dictionary ====
This one contains hierarchy, but we can only stuff the branches into one cell.
 
{| class="wikitable"
|@aba /'ɑ:/
 
<nowiki>*</nowiki>  danh từ
 
- áo aba (áo ngoài giống hình cái túi người A-Rập)
 
@abaci /'æbəkəs/
 
<nowiki>*</nowiki>  danh từ,  số nhiều abaci,  abacuses
 
- bàn tính
 
=to move counters of an abacus; to work an abacus+ tính bằng bàn tính, gảy bàn tính
 
- (kiến trúc)
 
- đầu cột, đỉnh cột
 
@aback /ə'bæk/
 
<nowiki>*</nowiki>  phó từ
 
- lùi lại, trở lại phía sau
 
=to stand aback from+ đứng lùi lại để tránh
 
- (hàng hải) bị thổi ép vào cột buồm (buồm)
 
=to be taken aback+ (hàng hải) bị gió thổi ép vào cột buồm
 
- (nghĩa bóng) sửng sốt, ngạc nhiên
 
=to be taken aback by the news+ sửng sốt vì cái tin đó
|}
Steps:
# Open with a text editor;
# Delete useless information (entries starting with “@00-”);
# Replace “\n@” with “\n”;
# Replace “\n=” with “ =”;
# Replace “\n-” with “ -”;
# Replace “\n*” with “ *”;
# Replace “/ ” with “/\t”;
# Replace “ /” with “\t/”;
# Save.
Final result:
{| class="wikitable"
!
!A
!B
!C
|-
!1
|aba
|/'ɑ:bə/
|*  danh từ - áo aba (áo ngoài giống hình cái túi người A-Rập)
|-
!2
|abaci
|/'æbəkəs/
|*  danh từ,  số nhiều abaci,  abacuses - bàn tính =to move counters of an abacus; to work an abacus+ tính bằng bàn tính, gảy bàn tính - (kiến trúc) - đầu cột, đỉnh cột
|-
!3
|aback
|/ə'bæk/
|*  phó từ - lùi lại, trở lại phía sau =to stand aback from+ đứng lùi lại để tránh - (hàng hải) bị thổi ép vào cột buồm (buồm) =to be taken aback+ (hàng hải) bị gió thổi ép vào cột buồm - (nghĩa bóng) sửng sốt, ngạc nhiên =to be taken aback by the news+ sửng sốt vì cái tin đó
|}
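The same flattening can be sketched in Python. This is an illustration under assumptions, not part of the steps above: the function name is made up, every entry is assumed to start with “@”, and all following lines up to the next “@” are stuffed into one cell, separated by spaces.

```python
def vdict_to_rows(text):
    """Turn "@headword /ipa/" blocks into (headword, ipa, meaning) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Split "@aback /ə'bæk/" at the first " /".
            head, separator, rest = line[1:].partition(" /")
            rows.append([head, "/" + rest if separator else "", []])
        elif rows:
            rows[-1][2].append(line)  # branch of the current entry
    return [(head, ipa, " ".join(parts)) for head, ipa, parts in rows]


sample = "@aback /ə'bæk/\n*  phó từ\n- lùi lại, trở lại phía sau\n"
for row in vdict_to_rows(sample):
    print("\t".join(row))
```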
 
==== Example 3: [https://www.mdbg.net/chinese/dictionary?page=cedict CC-CEDICT] ====
{| class="wikitable"
|一 一 [yi1] /one/1/single/a (article)/as soon as/entire/whole/all/throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit/
一一 一一 [yi1 yi1] /one by one/one after another/
一一對應 一一对应 [yi1 yi1 dui4 ying4] /one-to-one correspondence/
|}
Steps:
# Open with a text editor;
# Delete useless information (lines starting with a hash);
# Replace “pr. [” with “pr.[”;
# Replace “ [” with “\t”;
# Replace “pr.[” with “pr. [”;
# Replace “] /” with “\t”;
# Replace “/\n” with “\n” (if the file does not end with a newline, also delete the last “/” in the file);
# Save;
# Open with a spreadsheet program;
# Select column A, replace “ ” with “replace_me”;
# Save;
# Open with a text editor;
# Replace “replace_me” with “\t”;
# Save.
Explanation:

Most of the customised field separators are unified to Tab in the first pass. Steps 3, 4 and 5 make sure that the square brackets in the “meaning” field are not affected, since “pr. [” is a consistent pattern throughout the whole file.

By step 10, the field separator between Traditional and Simplified Chinese, which is a space, needs to be replaced with a more distinguishable one. In this case, it is changed to the string “replace_me”. It would be more rigorous to search for “replace_me” first and make sure it does not already exist in the file.

Alternatively, steps 9 to 14 can be skipped by splitting the first two columns with a regular expression in the text editor before step 3: turn on “Use regular expressions” and turn off “Use escape sequences”, replace “^(\S*)\s” with “\1\t”, then turn off “Use regular expressions” and turn on “Use escape sequences” again.
 
Result by step 10:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
|-
!'''1'''
|一 一
|yi1
|one/1/single/a (article)/as soon as/entire/whole/all/throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit
|-
!'''2'''
|一一 一一
|yi1 yi1
|one by one/one after another
|-
!'''3'''
|一一對應 一一对应
|yi1 yi1 dui4 ying4
|one-to-one correspondence
|}
Final Result:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
!'''D'''
|-
!'''1'''
|一
|一
|yi1
|one/1/single/a (article)/as soon as/entire/whole/all/throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit
|-
!'''2'''
|一一
|一一
|yi1 yi1
|one by one/one after another
|-
!'''3'''
|一一對應
|一一对应
|yi1 yi1 dui4 ying4
|one-to-one correspondence
|}
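The four fields can also be captured with one regular expression. The sketch below is an illustration, not the method described above; the pattern and the function name are assumptions based on the sample lines, and lines that do not fit the pattern are silently skipped.

```python
import re

# traditional, simplified, [pinyin], /meanings/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def cedict_to_rows(text):
    """Turn CC-CEDICT-style lines into 4-tuples, skipping comment lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = ENTRY.match(line)
        if match:
            rows.append(match.groups())
    return rows


sample = "# comment\n一一 一一 [yi1 yi1] /one by one/one after another/\n"
for row in cedict_to_rows(sample):
    print("\t".join(row))
```

Because the pinyin bracket is only matched right after the two headword fields, a later “pr. [yao1]” inside the meaning is left untouched.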
==== Example 4: JMdict ====
Its format is similar to that of CC-CEDICT, but the differences are large enough to justify a separate subheading.
{| class="wikitable"
|あああ;あーあ;あーー;アアア;アーア;アーー /(int) (expression of despair, resignation, boredom, disgust, etc.) (See 嗚呼・ああ・1) aah!/ooh!/oh no!/oh boy!/EntL2205270X/
ああいう(P);ああゆう /(exp,adj-pn) that sort of/like that/(P)/EntL2085090X/
ああいう風に [ああいうふうに] /(exp) (uk) in that way/like that/EntL2424550X/
|}
Its differences from CC-CEDICT are:
* Many words don't have kanji, so there are no square brackets in those entries;
* There are sequence numbers;
* There are no square brackets in the meaning field.
Steps:
# Open with a text editor (select Japanese (EUC-JP) as encoding);
# Delete useless information (the first line);
# Replace “ /” with “[] /”;
# Replace “][]” with “]”;
# Replace “[]” with “ []”;
# Replace “ [” with “\t”;
# Replace “] /” with “\t”;
# Replace “/EntL” with “\t”;
# Save;
# Open with a spreadsheet program;
# Delete column D (sequence number and a slash);
# Save.
Explanation:
Steps 3, 4 and 5 add an empty pair of square brackets to every entry, then remove it again from entries that already had a reading in square brackets.
Result by step 10:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
!'''D'''
|-
!'''1'''
|あああ;あーあ;あーー;アアア;アーア;アーー
|
|(int) (expression of despair, resignation, boredom, disgust, etc.) (See 嗚呼・ああ・1) aah!/ooh!/oh no!/oh boy!
|2205270X/
|-
!'''2'''
|ああいう(P);ああゆう
|
|(exp,adj-pn) that sort of/like that/(P)
|2085090X/
|-
!'''3'''
|ああいう風に
|ああいうふうに
|(exp) (uk) in that way/like that
|2424550X/
|}
Final Result:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
|-
!'''1'''
|あああ;あーあ;あーー;アアア;アーア;アーー
|
|(int) (expression of despair, resignation, boredom, disgust, etc.) (See 嗚呼・ああ・1) aah!/ooh!/oh no!/oh boy!
|-
!'''2'''
|ああいう(P);ああゆう
|
|(exp,adj-pn) that sort of/like that/(P)
|-
!'''3'''
|ああいう風に
|ああいうふうに
|(exp) (uk) in that way/like that
|}
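In a script, the optional reading can be handled with an optional group instead of adding and removing empty brackets. This is a sketch under assumptions: the pattern and the function name are made up from the three sample lines, and every entry is assumed to end with a sequence number of the form “/EntL…/”.

```python
import re

# headword(s), optional [reading], /meanings/EntL<sequence number>/
ENTRY = re.compile(r"^(\S+)(?: \[([^\]]*)\])? /(.*)/EntL\w+/$")


def jmdict_to_rows(text):
    """Turn JMdict-style lines into (headword, reading, meaning) tuples."""
    rows = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            head, reading, meaning = match.groups()
            rows.append((head, reading or "", meaning))  # no reading: empty cell
    return rows


sample = (
    "ああいう(P);ああゆう /(exp,adj-pn) that sort of/like that/(P)/EntL2085090X/\n"
    "ああいう風に [ああいうふうに] /(exp) (uk) in that way/like that/EntL2424550X/\n"
)
for row in jmdict_to_rows(sample):
    print("\t".join(row))
```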
==== Example 5: HanDeDict ====
Its format is similar to CC-CEDICT, but there is a lot of useless information to delete.
{| class="wikitable"
|
<nowiki>#</nowiki> ID-a00af3L
<nowiki>#</nowiki> Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei
<nowiki>#</nowiki> 直前 直前 [zhi2 qian2] /geradeaus (u.E.)/
<nowiki>#</nowiki> Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
直前 直前 [zhi2 qian2] /geradeaus/
<nowiki>#</nowiki> ID-a00aV1M
<nowiki>#</nowiki> Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei
<nowiki>#</nowiki> 公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (u.E.) (S)/
<nowiki>#</nowiki> Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (S)/
<nowiki>#</nowiki> ID-a00cf2E
<nowiki>#</nowiki> Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei
<nowiki>#</nowiki> 馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (u.E.) (S)/
<nowiki>#</nowiki> Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (S)/
|}
Steps:
# Open with a text editor;
# Delete the first lines starting with a hash;
# Replace “\n#” with “#”;
# Save;
# Open with a spreadsheet program;
# You know what to do.
Result by step 5:
{| class="wikitable"
!
!'''A'''
|-
!'''1'''
|<nowiki>#</nowiki> ID-a00af3L<nowiki>#</nowiki> Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei<nowiki>#</nowiki> 直前 直前 [zhi2 qian2] /geradeaus (u.E.)/<nowiki>#</nowiki> Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
|-
!'''2'''
|直前 直前 [zhi2 qian2] /geradeaus/
|-
!'''3'''
|<nowiki>#</nowiki> ID-a00aV1M<nowiki>#</nowiki> Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei<nowiki>#</nowiki> 公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (u.E.) (S)/<nowiki>#</nowiki> Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
|-
!'''4'''
|公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (S)/
|-
!'''5'''
|<nowiki>#</nowiki> ID-a00cf2E<nowiki>#</nowiki> Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei<nowiki>#</nowiki> 馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (u.E.) (S)/<nowiki>#</nowiki> Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
|-
!'''6'''
|馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (S)/
|}
Final result:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
!'''D'''
|-
!'''1'''
|直前
|直前
|zhi2 qian2
|geradeaus
|-
!'''2'''
|公所堂區市府
|公所堂区市府
|gong1 suo3 tang2 qu1 shi4 fu3
|Gemeindehaus (S)
|-
!'''3'''
|馬口鐵
|马口铁
|ma3 kou3 tie3
|Weißblech, verniertes Blech (S)
|}
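Removing the lines that start with a hash is a one-line filter in a script. The function name below is made up for illustration; the remaining lines have the same shape as CC-CEDICT entries and can be split in the same way.

```python
def drop_hash_lines(text):
    """Return the text without the lines that start with a hash."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("#")]
    return "\n".join(kept)


sample = (
    "# ID-a00af3L\n"
    "# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung\n"
    "直前 直前 [zhi2 qian2] /geradeaus/\n"
)
print(drop_hash_lines(sample))
```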


== How to combine data with same column from two spreadsheets ==
You have a dictionary file in spreadsheet format, but it has so many entries that you don't want to memorise them all. Then you get a list of common words in spreadsheet format. How to combine them?
Here is [https://www.youtube.com/watch?v=VmanL-Vf8Eg a guide for Microsoft Office Excel].


For LibreOffice Calc and other spreadsheet programs, you need to put the two sheets alongside, then use VLOOKUP (vertical lookup).
|frequency
|word
|meaning
|
|word
|frequency
|word
|meaning
|
|word
|}
|}
Then copy or cut column C and use paste special (shortcut Shift+Ctrl+V), making sure “formula” is unchecked, then click on “OK”. Delete columns E and F and save as TSV.
You may see that some entries are in the frequency list but not in the dictionary. In this case, [https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Producing-dictionaries-with-web-scraping web scraping] can be helpful.
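The lookup that VLOOKUP performs can be sketched in Python with a dictionary. The function name and the sample data are made up for illustration. As with an exact-match VLOOKUP, the first matching dictionary entry is used; words that are missing from the dictionary get an empty meaning here instead of an error value.

```python
def attach_meanings(frequency_rows, dictionary_rows):
    """Add a meaning to each (frequency, word) row by looking up the word."""
    meanings = {}
    for word, meaning in dictionary_rows:
        meanings.setdefault(word, meaning)  # keep the first match only
    return [(rank, word, meanings.get(word, "")) for rank, word in frequency_rows]


# Hypothetical data: a frequency list and a small dictionary.
frequency_rows = [(1, "le"), (2, "de"), (3, "zut")]
dictionary_rows = [("de", "of, from"), ("le", "the"), ("le", "him, it")]
for row in attach_meanings(frequency_rows, dictionary_rows):
    print("\t".join(str(cell) for cell in row))
```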