Language/Multiple-languages/Culture/How-to-make-a-TSV-file

From Polyglot Club WIKI
Jump to navigation Jump to search
Rate this lesson:
5.00
(2 votes)

Tab-separated values (TSV) file is a simple format for storing data in a tabular structure.

You may have visited Language/Multiple-languages/Culture/Internet-Dictionaries and want to utilise some downloadable stuff by creating flashcards on Anki or Mnemosyne. But it requires too much effort if you copy-paste entry by entry. If we can use a spreadsheet, things will be much easier. Can we do that?

You may have noticed that both programs have “File -> Import...” option. But they don't support XLS or XLSX files. What should you do?

What is TSV

If you have opened a spreadsheet program (e.g. LibreOffice Calc, Apache OpenOffice Calc, ONLYOFFICE Spreadsheet Editor, Microsoft Office Excel) and click on “File -> Save As...”, you can see some other formats to choose from, one of which is “CSV”.

CSV” means “Comma-separated values”. It uses commas to separate columns. If you have a comma in the text, then it uses quotation marks to quote your text, so the comma won't be counted as a column separator. If you have quotation marks in your text, then the escape character backslash “\” need to be placed before the quotation mark. This is an example: https://github.com/skywind3000/ECDICT/blob/master/ecdict.mini.csv.

You may have realised that a CSV file doesn't store any styling data. If you save as a CSV file, all the information about fonts, colours, hyperlinks, etc. will be lost. CSV files are lightweight, so when you just need pure data, this format is ideal. Do Anki and Mnemosyne support it?

Not seems to be so, but its sibling TSV is supported. In Anki, it is called “Text separated by tabs or semicolons”; in Mnemosyne, it is called “Tab-separated text files”. What is it?

TSV” means “Tab-separated values”. It is similar with CSV and has an advantage over CSV: it uses “tabs” to separate columns, so there is no need to use quotation marks to indicate commas as text instead of column separators. Both “TSV” and “CSV” belong to “DSV”, delimiter-separated values.

You may wonder what a “tab” means. The tabular key is the key above the “Caps Lock” key on your keyboard (in most cases). It is used for aligning text in different lines to make it easier to form a table for typewriters and is inherited by computers. If you press this key in a text editor, it will look like some spaces; if you press this key in a browser, it will move the focus to the next element (link, textbox, button, etc.). In a spreadsheet program, you can press the Tab key to move to the next column or the Enter key to move to the next row. TSV files also use these two keys to separate columns and rows. It is more ideal than CSV. This is an example: https://www.eki.ee/litsents/vaba/ies/eestiinglise.txt.

How to save as TSV file? This is a bit confusing, because TSV is not so well-known as CSV. If you are using LibreOffice, click on that “Save As...”, select CSV, then in the dialogue box, choose {Tab} as “Field delimiter” and ignore “String delimiter” (if you have neutral quotation marks ⟨"…"⟩ in the text, they will be converted to typographic quotation marks ⟨“…”⟩). The file you save has “CSV” as its file extension, but it's a TSV file essentially.

How to convert to TSV

Formats

Spreadsheet formats

In XLS format or what else, you just need to open it and save as TSV.

Sheets in other document formats

If it is a sheet in DOC, PDF or other document formats, then you can select the first several characters in the table and scroll to the bottom of the table, press a Shift key, then select the last character in the table. Copy and paste to a spreadsheet program, save as TSV.

But this can cause problems sometimes: all content are stuffed in the first cell. In this case, you can use open-source tools Tabula, Excalibur or something else to do this work. But they may be not always reliable. If they have failed, you can try online services. Some online services have page limits, then you need open-source tools PDFsam, PDFTK Builder, PDF-Shuffler or something else to split PDF files.

Custom format

Same delimiter

You need to copy the delimiter between two columns. Open the file with a spreadsheet program, paste the delimiter into “Field delimiter” (for LibreOffice).

Different delimiters

In this case, you need to use a text editor like Notepad on Windows instead of spreadsheet program.

Other formats

DB

You need open-source tools DB Browser for SQLite, SQLiteStudio or something else to open it. See which tables it contains, export tables you need, make sure “Field separator” is “Tab”.

DICT/DICT.DZ

DICT is compressed as DICT.DZ, which is gzip in essence. Open-source tools XAchiver for Linux and BSD, 7-Zip for Windows can open DICT.DZ files without changing the file extension to GZ. DICT files can be edited with a text editor.

SLOB and some others

It is a format designed for Aard 2. Open-source tool PyGlossary can do.

XML and some others

Outdated format with incredible popularity.

This sort of conversion is a bit different, because XML, JSON and the rising star YAML are not in the form of sheet. They are hierarchical, tree-like. You can find a list of tools at Structured text tools, convert to JSON, then TSV.

A video about how to import XML into LibreOffice Calc.

Skill training

It requires the skill of combined use of text editor and spreadsheet program. You need to look through the file, discover its patterns and differentiate contents through the patterns. It's similar to chemistry.

In a text editor's “replace” function, “Tab” can be represented with “\t”, “Enter” can be represented with “\n” on Linux, BSD, macOS, Solaris, or “\r\n” on Windows. “t” means “tab”, while “r” and “n” mean “return” and “newline” respectively, and different operating systems adopt different approaches and can be represented by “\R” for any operating system with “regular expression” option on (not to be mentioned in the examples below). Further reading: newline and return. What's more, you can use “.+” to represent all characters between two lines.

If you select one or more columns or rows and use “replace” function in a spreadsheet program, then it can replace characters in the selected area only.

If you find mojibake in your text editor, you probably need to open the text editor and click on “open file” in it, instead of directly double-clicking on the file; in the encoding selection, try some possible encodings.

Several operations require waiting because of the large amount of character replacements.

Please try to solve the problems yourself. See the steps as late as possible.

Solutions involving command line: Language/Multiple-languages/Culture/Licensed-Free-Databases#Manually_convert_to_TSV.

Example 1: English-French FreeDict Dictionary

a /ə/

1. à, au milie de, en, dans, parmi

2. un, quelqu'un

3. àraisonde, par

4. une

abacus /æbəkəs/

1. abaque

2. boulier

abandon /əbændən/

1. abdiquer

2. abandonner, délaisser, livrer, quitter

3. renoncer, résigner

Steps for putting multiple lines into one:

  1. Open with a text editor;
  2. Replace “\R2.” with “; 2.”;
  3. Replace “\R3.” with “; 3.”;
  4. Increase the number until such string not found;
  5. Save;

Result:

a /ə/

1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par; 4. une

abacus /æbəkəs/

1. abaque; 2. boulier

abandon /əbændən/

1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner

Steps for distinguishing odd and even numbers:

  1. Open with a spreadsheet program;
  2. Type “=MOD(ROW(),2)” in B2, double click the cell's lower-right corner;
  3. Add a new sheet;
  4. Select column B, open “AutoFilter”, click on “No” when asked if the first line is to be the column header;
  5. Select 1 in the filter;
  6. Copy column A and paste to the new sheet's column A;
  7. Select 0 in the filter;
  8. Copy column A and paste to the new sheet's column B;
  9. Delete column 1 in the new sheet;

Explanation:

After step 7, the reverse triangle of the filter will disappear. If you want to see it again, you need to close AutoFilter and open it again, or simply undo (shortcut Ctrl+Z).

MOD function returns the remainder of the division operation. Here is the row number divided by 2. Alternatively, you can type “=ISODD(ROW())” or “=ISEVEN(ROW())” to test if the row is odd or even, get Boolean (“TRUE” and “FALSE”) values.

Result by step 2:

A B
1 a /ə/ 1
2 1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par; 4. une 0
3 abacus /æbəkəs/ 1
4 1. abaque; 2. boulier 0
5 abandon /əbændən/ 1
6 1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner 0

Result:

A B
1 a /ə/ 1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par 4. une
2 abacus /æbəkəs/ 1. abaque; 2. boulier
3 abandon /əbændən/ 1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner

Steps for separating English IPA:

  1. Add a column between column A and B;
  2. Select column A, replace “ /” with “\/” (backslash and slash);
  3. Select column A, open “Text to Columns”, select “Other” in “Separated by”, type “\”, click on “OK”;
  4. Save the new sheet;

Explanation:

It would be more rigorous if you search “\” in column A at first to make sure it didn't exist.

In LibreOffice Calc, “Text to Columns” can be found under menu option “Data”. Other spreadsheet programs may put it elsewhere.

Final result:

A B C
1 a /ə/ 1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par 4. une
2 abacus /æbəkəs/ 1. abaque; 2. boulier
3 abandon /əbændən/ 1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner

Example 2: Free Vietnamese Dictionary Project English-Vietnamese dictionary

This one contains hierarchy, but we can only stuff the branches into one cell.

@abacus /'æbəkəs/

* danh từ, số nhiều abaci, abacuses

- bàn tính

=to move counters of an abacus; to work an abacus+ tính bằng bàn tính, gảy bàn tính

- (kiến trúc)

- đầu cột, đỉnh cột

@abaddon /ə'bædən/

* danh từ

- âm ti, địa ngục

- con quỷ

@abaft /ə'bɑ:ft/

* phó từ

- (hàng hải) ở phía sau bánh lái, gần phía bánh lái

* giới từ

- (hàng hải) sau, ở đằng sau, ở phía sau

=abaft the mast+ sau cột buồm

Steps:

  1. Open with a text editor;
  2. Delete useless information (entries starting with “@00-”);
  3. Replace “\R=” with “ =”;
  4. Replace “\R-” with “ -”;
  5. Replace “\R*” with “ *”;
  6. Replace “@” with “”;
  7. Replace “/ *” with “/\t*”;
  8. Replace “ /” with “\t/”;
  9. Save.

Final result:

A B C
1 abacus /'æbəkəs/ * danh từ, số nhiều abaci, abacuses - bàn tính =to move counters of an abacus; to work an abacus+ tính bằng bàn tính, gảy bàn tính - (kiến trúc) - đầu cột, đỉnh cột
2 abaddon /ə'bædən/ * danh từ - âm ti, địa ngục - con quỷ
3 abaft /ə'bɑ:ft/ * phó từ - (hàng hải) ở phía sau bánh lái, gần phía bánh lái * giới từ - (hàng hải) sau, ở đằng sau, ở phía sau =abaft the mast+ sau cột buồm

Example 3: CC-CEDICT

一 一 [yi1] /one/1/single/a (article)/as soon as/entire/whole/all/throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit/

一一 一一 [yi1 yi1] /one by one/one after another/

一一對應 一一对应 [yi1 yi1 dui4 ying4] /one-to-one correspondence/

Steps:

  1. Open with a text editor;
  2. Delete useless information (lines starting with hash);
  3. Replace “pr. [” with “pr.[”;
  4. Replace “ [” with “\t”;
  5. Replace “pr.[” with “pr. [”;
  6. Replace “] /” with “\t”;
  7. Replace “/\R” with “\R”;
  8. Delete the last “/” in the file;
  9. Save;
  10. Open with a spreadsheet program;
  11. Add a column between column A and B;
  12. Select column A;
  13. Open “Text to Columns”, select “space” in “Separated by”, click on “OK”;
  14. Save.

Explanation:

Most of the customised field separators are unified to Tab in the first edition. Step 3, 5 guarantee that the square parenthesis in “meaning” field won't be affected, as that “pr. [” is a universal pattern throughout the whole file.

Result by step 10:

A B C
1 一 一 yi1 one/1/single/a (article)/as soon as/entire/whole/all/throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit
2 一一 一一 yi1 yi1 one by one/one after another
3 一一對應 一一对应 yi1 yi1 dui4 ying4 one-to-one correspondence

Final Result:

A B C D
1 yi1 one/1/single/a (article)/as soon as/entire/whole/all/throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit
2 一一 一一 yi1 yi1 one by one/one after another
3 一一對應 一一对应 yi1 yi1 dui4 ying4 one-to-one correspondence

Example 4: JMdict

CC-CEDICT has a similar format to it, but their differences are large enough for creating a new subheading for it.

あああ;あーあ;あーー;アアア;アーア;アーー /(int) (expression of despair, resignation, boredom, disgust, etc.) (See 嗚呼・ああ・1) aah!/ooh!/oh no!/oh boy!/EntL2205270X/

ああいう(P);ああゆう /(exp,adj-pn) that sort of/like that/(P)/EntL2085090X/

ああいう風に [ああいうふうに] /(exp) (uk) in that way/like that/EntL2424550X/

Its differences with CC-CEDICT are:

  • Many words don't have Kanji, thus there is no squared parenthesis in these entires;
  • There are sequence numbers;
  • No square parenthesis in meaning field.

Steps:

  1. Open with a text editor (select Japanese (EUC-JP) as encoding);
  2. Delete useless information (the first line);
  3. Replace “ /” with “\ /”;
  4. Replace “]\” with “]”;
  5. Replace “\ /” with “ [] /”;
  6. Replace “] /” with “\t”;
  7. Replace “ [” with “\t”;
  8. Replace “/EntL” with “\t”;
  9. Save;
  10. Open with a spreadsheet program;
  11. Delete column D;
  12. Save.

Explanation:

Step 3, 4, 5 add a backslash to all entries and delete them in entries with square parenthesis.

Result by step 10:

A B C D
1 あああ;あーあ;あーー;アアア;アーア;アーー (int) (expression of despair, resignation, boredom, disgust, etc.) (See 嗚呼・ああ・1) aah!/ooh!/oh no!/oh boy! 2205270X/
2 ああいう(P);ああゆう (exp,adj-pn) that sort of/like that/(P) 2085090X/
3 ああいう風に ああいうふうに (exp) (uk) in that way/like that 2424550X/

Final Result:

A B C
1 あああ;あーあ;あーー;アアア;アーア;アーー (int) (expression of despair, resignation, boredom, disgust, etc.) (See 嗚呼・ああ・1) aah!/ooh!/oh no!/oh boy!
2 ああいう(P);ああゆう (exp,adj-pn) that sort of/like that/(P)
3 ああいう風に ああいうふうに (exp) (uk) in that way/like that

Example 5: HanDeDict

Its format is similar to CC-CEDICT, but there is a lot of useless information to delete.

# ID-a00af3L

# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei

# 直前 直前 [zhi2 qian2] /geradeaus (u.E.)/

# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung

直前 直前 [zhi2 qian2] /geradeaus/


# ID-a00aV1M

# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei

# 公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (u.E.) (S)/

# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung

公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (S)/


# ID-a00cf2E

# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei

# 馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (u.E.) (S)/

# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung

馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (S)/

Steps:

  1. Open with a text editor;
  2. Delete the first lines starting with a hash;
  3. Replace “\R#” with “#”;
  4. Save;
  5. Open with a spreadsheet program;
  6. You know what to do.

Result by step 5:

A
1 # ID-a00af3L# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei# 直前 直前 [zhi2 qian2] /geradeaus (u.E.)/# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
2 直前 直前 [zhi2 qian2] /geradeaus/
3 # ID-a00aV1M# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei# 公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (u.E.) (S)/# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
4 公所堂區市府 公所堂区市府 [gong1 suo3 tang2 qu1 shi4 fu3] /Gemeindehaus (S)/
5 # ID-a00cf2E# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei# 馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (u.E.) (S)/# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenbereinigung
6 馬口鐵 马口铁 [ma3 kou3 tie3] /Weißblech, verniertes Blech (S)/

Final result:

A B C D
1 直前 直前 zhi2 qian2 geradeaus
2 公所堂區市府 公所堂区市府 gong1 suo3 tang2 qu1 shi4 fu3 Gemeindehaus (S)
3 馬口鐵 马口铁 ma3 kou3 tie3 Weißblech, verniertes Blech (S)

How to combine data with same column from two spreadsheets

You have a dictionary file in spreadsheet format, but it has so many entries that you don't want to memorise them all. Then you get a list of common words in spreadsheet format. How to combine them?

Here's a guidance for Microsoft Office Excel.

For LibreOffice Calc and other spreadsheet programs, you need to put the two sheets alongside, then use VLOOKUP (vertical lookup).

In this example, column A and B are from the frequency list, column E and F are from the dictionary. Column C will be the result: meaning from the dictionary attached to the frequency list:

A B C D E F
1 frequency word meaning word meaning
2 1 b a ma
3 2 a b mb
4 3 c c mc
5 4 e d md

Steps:

  1. Select C2, write “=VLOOKUP(B2,$E$2:$F$5,2,0)”;
  2. Move the cursor to the cell's lower-right corner and double click.

Explanation to the VLOOKUP:

The 1st value is the cell, the value in which is to be looked up; the 2nd value is the range to be looked up, in which the leftmost cell's value is to match the 1st value; the 3rd value is the cell's position to the left border of the range defined by the 2nd value to be looked up; the 4th value is either 0, left column unsorted and look for an exact match or 1, left column sorted and may look for something else. Further reading on Apache OpenOffice and LibreOffice.

Result:

A B C D E F
1 frequency word meaning word meaning
2 1 b mb a ma
3 2 a ma b mb
4 3 c mc c mc
5 4 e #N/A d md

Then you need to copy or cut column C, then paste special (shortcut Shift+Ctrl+V), make sure “formula” is unchecked, click on “OK”. Delete column E and F, save as TSV.

Contributors

GrimPixel and Maintenance script


Create a new Lesson