Editing Language/Multiple-languages/Culture/How-to-make-a-TSV-file

{{Anki-menu}}
[[Category:Computer-Knowledge]]
[[File:tsv-file-polyglotclub-lessons.png|thumb]]
The tab-separated values (TSV) file is a simple format for storing data in a tabular structure.

You may have visited [[Language/Multiple-languages/Culture/Internet-Dictionaries]] as well as [https://tatoeba.org/en/downloads Tatoeba's download page] and want to utilise some downloadable stuff by creating flashcards on [https://apps.ankiweb.net/ Anki] or [https://mouse-dictionary.netlify.app/ Mouse Dictionary]. But it requires too much effort if you copy-paste entry by entry. If we can use a spreadsheet, things will be much easier. Can we do that?

You may have noticed that both programs have “File -> Import...” option. But they don't support XLS or XLSX files. What should you do?

== What is TSV ==
If you have opened a spreadsheet program (e.g. [https://www.libreoffice.org/ LibreOffice] Calc, [https://www.openoffice.org/ Apache OpenOffice] Calc, [https://www.onlyoffice.com/ ONLYOFFICE] Spreadsheet Editor, [https://www.office.com/ Microsoft Office] Excel) and click on “File -> Save  As...”, you can see some other formats to choose from, one of which is “CSV”.

“[https://en.wikipedia.org/wiki/Comma-separated_values CSV]” means “Comma-separated values”, where columns are separated by columns. If there is a comma in the text, then the field needs to be surrounded by quotation marks, so the comma won't be counted as a column separator. If you have quotation marks in your text, then the [https://en.wikipedia.org/wiki/Escape_character escape character] backslash “\” need to be placed before the quotation mark. This is an example: https://gitlab.com/nirooj56/Nepdict/-/blob/master/database/data.csv (click on the icon of display source “</>”).

You may have realised that a CSV file doesn't store any styling data. If you save as a CSV file, all the information about fonts, colours, hyperlinks, etc. will be lost. CSV files are lightweight, so when you just need pure data, this format is ideal. Do Anki and Mnemosyne support it?

Not seems to be so, but its sibling TSV is supported. In Anki, it is called “Text separated by tabs or semicolons”; in Mnemosyne, it is called “Tab-separated text files”. What is it?

“[https://en.wikipedia.org/wiki/Tab-separated_values TSV]” means “Tab-separated values”. Some people call it “tabfile”. It is similar to CSV and has an advantage over that: columns are separated by “tabs”, so there is no need to use quotation marks to indicate commas as text instead of column separators. Both “TSV” and “CSV” belong to “[https://en.wikipedia.org/wiki/Delimiter-separated_values DSV]”, delimiter-separated values.

You may wonder what a “tab” means. [https://en.wikipedia.org/wiki/Tab_key The tabular key] is usually located above the “Caps Lock” key on a PC keyboard. It is used for aligning text in different lines to make it easier to form a table for typewriters and is inherited by computers. If you press this key in a text editor, it will look like some spaces; if you press this key in a browser, it will move the focus to the next element (link, textbox, button, etc.). In a spreadsheet program, you can press the Tab key to move to the next column or the Enter key to move to the next row. TSV files also use these two keys to separate columns and rows. It is more ideal than CSV. This is an example: https://gitlab.com/C0rn3j/NorwegianToEnglishDict/blob/master/4_finalDictionary/nb-NOtoENdictionary.txt.

How to save as TSV file? This is a bit confusing, because TSV is not so well-known as CSV. If you are using LibreOffice, click on that “Save As...”, select CSV, then in the dialogue box, choose {Tab} as “Field delimiter” and ignore “String delimiter”. If you have neutral quotation marks ⟨""⟩ in the text, they will be converted to typographic quotation marks ⟨“”⟩. The file you save has “CSV” as its file extension, but it's a TSV file essentially.

The problem with TSV is, it looks like spaces and can get confused with other whitespace characters. The solution is using a linter for sheets, like [https://github.com/mechatroner/vscode_rainbow_csv Rainbow CSV] for VSCodium, [https://github.com/mechatroner/rainbow_csv Rainbow CSV] for Vim, [https://github.com/emacs-vs/rainbow-csv rainbow-csv] for Emacs.

For VSCodium, there is another extention [https://github.com/janisdd/vscode-edit-csv vscode-edit-csv] to be mentioned.

== How to convert to TSV ==
You can use the under-construction tool [https://codeberg.org/GrimPixel/Towards_TSV Towards_TSV] or [https://github.com/thombashi/pytablewriter pytablewriter] or [https://github.com/ilius/pyglossary PyGlossary].

XML, JSON, YAML are not in the form of a sheet. They are hierarchical, tree-like. You can find a list of tools at [https://github.com/dbohdan/structured-text-tools Structured text tools], convert to JSON, then TSV.

=== Formats ===
==== Spreadsheet formats ====
In XLS format or what else, you just need to open it and save as TSV.

==== Sheets in other document formats ====
If it is a sheet in DOC, PDF or other document formats, then you can select the first several characters in the table and scroll to the bottom of the table, press a Shift key, then select the last character in the table. Copy and paste to a spreadsheet program, save as TSV.

But this can cause problems sometimes: all content are stuffed in the first cell. In this case, you can use open-source tools [https://tabula.technology/ Tabula], [https://excalibur-py.readthedocs.io/en/master/ Excalibur] or something else to do this work. But they may be not always reliable. If they have failed, you can try online services. Some online services have page limits, then you need open-source tools [https://pdfsam.org/ PDFsam], [http://angusj.com/pdftkb/ PDFTK Builder], [https://sourceforge.net/projects/pdfshuffler/ PDF-Shuffler] or something else to split PDF files.

==== Custom format ====

===== Same delimiter =====
You need to copy the delimiter between two columns. Open the file with a spreadsheet program, paste the delimiter into “Field delimiter” (for LibreOffice).

===== Different delimiters =====
In this case, you need to use a text editor to unify the delimiters, then do it with “same delimiter”.

==== Other formats ====

===== [https://www.sqlite.org/fileformat.html DB] =====
Open-source tools [https://sqlitebrowser.org/ DB Browser for SQLite], [https://sqlitestudio.pl/ SQLiteStudio] or something else are needed to open it. See which tables it contains, export tables you need, make sure “Field separator” is “Tab”.

===== [https://en.wikipedia.org/wiki/DICT DICT/DICT.DZ] =====
DICT is compressed as DICT.DZ, which is [https://en.wikipedia.org/wiki/Gzip gzip] in essence. Open-source tools [https://github.com/ib/xarchiver XAchiver] for Linux and BSD, [https://www.7-zip.org/ 7-Zip] for Windows can open DICT.DZ files without changing the file extension. DICT files can be edited with a text editor.

=== Practice ===
It requires the skill of combined use of a text editor and a spreadsheet program. You need to look through the file, discover its patterns and differentiate contents through the patterns.

In a text editor's “replace” function, “Tab” can be represented with “\t”, “Enter” can be represented with “\n” on Linux, BSD, macOS, Solaris, or “\r\n” on Windows. “t” means “tab”, while “r” and “n” mean “[https://en.wikipedia.org/wiki/Carriage_return return]” and “[https://en.wikipedia.org/wiki/Newline newline]” respectively (not to be mentioned in the examples below). The backslash is an “[https://en.wikipedia.org/wiki/Escape_character escape character]”. What's more, you can use something called “[https://en.wikipedia.org/wiki/Regular_expression regular expression]”, which is complex and powerful. You can visit [https://regexone.com/ RegexOne] and [https://www.regular-expressions.info/ Regular-Expressions.info] to learn more about it.

Different text editors may process regular expressions and escape characters differently. Examples here are using [https://www.geany.org/ Geany].

If you find [https://en.wikipedia.org/wiki/Mojibake mojibake] in your text editor, you probably need to open the text editor and click on “open file” in it, instead of directly double-clicking on the file; in the encoding selection, try other possible [https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings encodings] than UTF8.

Please try to solve the problems yourself. See the steps as late as possible.

==== Example 1: [https://freedict.org/downloads/#dictionary-downloads FreeDict English - French] ====
{| class="wikitable"
|a /ə/

1. à, au milie de, en, dans, parmi

2. un, quelqu'un

3. àraisonde, par

4. une

abacus /æbəkəs/

1. abaque

2. boulier

abandon /əbændən/

1. abdiquer

2. abandonner, délaisser, livrer, quitter

3. renoncer, résigner
|}

Steps:
# Turn off “Use regular expression”, turn on “Use escape characters”;
# Replace “\n2.” with “; 2.”;
# Replace “\n3.” with “; 3.”;
# Increase the number until no result found;
# Replace “/\n1.” with “/\t1.”;
# Replace “ /” with “\t/”;

Result:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
|-
!'''1'''
|a
|/ə/
|1. à, au milie de, en, dans, parmi; 2. un, quelqu'un; 3. àraisonde, par; 4. une
|-
!'''2'''
|abacus
|/æbəkəs/
|1. abaque; 2. boulier
|-
!'''3'''
|abandon
|/əbændən/
|1. abdiquer; 2. abandonner, délaisser, livrer, quitter; 3. renoncer, résigner
|}

==== Example 2: [https://www.mdbg.net/chinese/dictionary?page=cedict CC-CEDICT] ====
{| class="wikitable"
|一 一 [yi1] /one/single/a (article)/as soon as/entire; whole; all; throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit/
一一 一一 [yi1 yi1] /one by one/one after another/

一一對應 一一对应 [yi1 yi1 dui4 ying4] /one-to-one correspondence/
|}
Steps:
# Turn on “Use regular expressions”, turn off “Use escape sequences”;
# Replace “^(\S*)\s” with “\1\t”;
# Turn off “Use regular expressions”, turn on “Use escape sequences”;
# Replace “ [” with “\t”;
# Replace “pr.\t” with “pr. [”;
# Replace “] /” with “\t”;
# Replace “/\n” with “\n”;

Result:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
!'''D'''
|-
!'''1'''
|一
|一
|yi1
|one/single/a (article)/as soon as/entire; whole; all; throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit
|-
!'''2'''
|一一
|一一
|yi1 yi1
|one by one/one after another
|-
!'''3'''
|一一對應
|一一对应
|yi1 yi1 dui4 ying4
|one-to-one correspondence
|}

== How to combine data with same column from two spreadsheets ==
You have a dictionary file in spreadsheet format, but it has so many entries that you don't want to memorise them all. Then you get a list of common words in spreadsheet format. How to combine them?

For LibreOffice Calc and other spreadsheet programs, you need to put the two sheets alongside, then use VLOOKUP (vertical lookup).

In this example, column A and B are from the frequency list, column E and F are from the dictionary. Column C will be the result: meaning from the dictionary attached to the frequency list:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
!'''D'''
!'''E'''
!'''F'''
|-
!'''1'''
|frequency
|word
|meaning
|
|word
|meaning
|-
!'''2'''
|1
|b
|
|
|a
|ma
|-
!'''3'''
|2
|a
|
|
|b
|mb
|-
!'''4'''
|3
|c
|
|
|c
|mc
|-
!'''5'''
|4
|e
|
|
|d
|md
|}
Steps:
# Select C2, write “=VLOOKUP(B2,$E$2:$F$5,2,0)”;
# Move the cursor to the cell's lower-right corner and double click.
Explanation to the VLOOKUP:

The 1<sup>st</sup> value is the cell, the value in which is to be looked up; the 2<sup>nd</sup> value is the range to be looked up, in which the leftmost cell's value is to match the 1<sup>st</sup> value; the 3<sup>rd</sup> value is the cell's position to the left border of the range defined by the 2<sup>nd</sup> value to be looked up; the 4<sup>th</sup> value is either 0, left column unsorted and look for an exact match or 1, left column sorted and may look for something else. Further reading on [https://wiki.openoffice.org/wiki/Documentation/How_Tos/Calc:_VLOOKUP_function Apache OpenOffice] and [https://wiki.documentfoundation.org/Documentation/Calc_Functions/VLOOKUP LibreOffice].

Result:
{| class="wikitable"
!
!'''A'''
!'''B'''
!'''C'''
!'''D'''
!'''E'''
!'''F'''
|-
!'''1'''
|frequency
|word
|meaning
|
|word
|meaning
|-
!'''2'''
|1
|b
|mb
|
|a
|ma
|-
!'''3'''
|2
|a
|ma
|
|b
|mb
|-
!'''4'''
|3
|c
|mc
|
|c
|mc
|-
!'''5'''
|4
|e
|#N/A
|
|d
|md
|}
Then you need to copy or cut column C, then paste special (shortcut Shift+Ctrl+V), make sure “formula” is unchecked, click on “OK”. Delete column E and F, save as TSV.

You can see there may be some entries in the frequency list but not in the dictionary. In this case, [https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Producing-dictionaries-with-web-scraping web scraping] can be helpful.
<span links></span>