Difference between revisions of "Language/Multiple-languages/Culture/How-to-make-a-TSV-file"

Revision as of 14:43, 30 June 2020

A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure. It can be very useful for storing data, and language-related data in particular.

What is TSV

You may have visited Language/Multiple-languages/Culture/Internet-Dictionaries and want to use some of the downloadable material to create flashcards in Anki or Mnemosyne. But copying and pasting entry by entry takes a lot of effort. If we could use a spreadsheet, things would be much easier. Can we do that?

You may have noticed that both programs have a “File -> Import...” option. But they don't support XLS or XLSX files. What should you do?

If you open a spreadsheet program (e.g. LibreOffice Calc, Apache OpenOffice Calc, ONLYOFFICE Spreadsheet Editor, Microsoft Office Excel) and click on “File -> Save As...”, you can see some other formats to choose from, one of which is “CSV”.

“CSV” means “Comma-separated values”. It uses commas to separate columns. If you have a comma in the text, then it uses quotation marks to quote your text, so the comma won't be counted as a column separator. If you have quotation marks in your text, the common convention (described in RFC 4180) is to write each of them twice inside the quoted text; some programs instead place the escape character backslash “\” before the quotation mark. This is an example: https://github.com/skywind3000/ECDICT/blob/master/ecdict.mini.csv.
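
The quoting rules can be tried out with a short Python sketch. It uses the standard csv module, whose default dialect doubles embedded quotation marks rather than adding a backslash. The function name to_csv_line and the sample entry are made up for illustration.

```python
import csv
import io


def to_csv_line(fields):
    """Serialise one row to a CSV line using Python's default dialect."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


# Hypothetical entry: the definition contains a comma and quotation marks.
row = ["hello", 'a greeting, as in "hello there"']
line = to_csv_line(row)
print(line, end="")  # prints: hello,"a greeting, as in ""hello there"""

# Reading the line back recovers the original two fields.
assert next(csv.reader(io.StringIO(line))) == row
```

Only the second field is quoted, because only it contains a comma or a quotation mark.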

You may have realised that a CSV file doesn't store any styling data. If you save as a CSV file, all the information about fonts, colours, hyperlinks, etc. will be lost. CSV files are lightweight, so when you just need pure data, this format is ideal. Do Anki and Mnemosyne support it?

No, but its sibling TSV is supported. In Anki, it is called “Text separated by tabs or semicolons”; in Mnemosyne, it is called “Tab-separated text files”. What is it?

“TSV” means “Tab-separated values”. It is similar to CSV and has an advantage over it: it uses “tabs” to separate columns, so there is no need to use quotation marks to mark commas as text instead of column separators. Both “TSV” and “CSV” are kinds of “DSV”, delimiter-separated values.

You may wonder what a “tab” is. The Tab (tabulator) key is the key above the “Caps Lock” key on your keyboard (in most cases). It was introduced on typewriters to make typing tables easier and was inherited by computers. When you use a spreadsheet program, you can press the Tab key to move to the next column or the Enter key to move to the next row. TSV files use the characters produced by these two keys to separate columns and rows, which makes the format more convenient than CSV. This is an example: https://www.eki.ee/litsents/vaba/ies/eestiinglise.txt.

How do you save a file as TSV? This is a bit confusing, because TSV is not as well known as CSV. If you are using LibreOffice, click on “Save As...”, select CSV, then in the dialogue box choose {Tab} as “Field delimiter” and ignore “String delimiter”. The file you save has “CSV” as its file extension, but it's actually a TSV file.
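
For readers who prefer a script to a spreadsheet, here is a minimal Python sketch of what a TSV file contains: fields joined by tabs, rows ended by newlines. The function name rows_to_tsv and the sample words are assumptions made for this example. Refusing fields that already contain a tab or a line break is one simple policy, not the only possible one.

```python
def rows_to_tsv(rows):
    """Join rows of strings into TSV text: tabs between fields, "\n" after each row."""
    lines = []
    for row in rows:
        for field in row:
            # A tab or line break inside a field would be read as a separator.
            if "\t" in field or "\n" in field or "\r" in field:
                raise ValueError(f"field contains a tab or line break: {field!r}")
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical two-column word list.
rows = [["word", "meaning"], ["tere", "hello"], ["head aega", "goodbye"]]
print(rows_to_tsv(rows), end="")
```

The returned text can be written to a file opened with encoding="utf-8" and then imported as tab-separated text.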

How to convert to TSV

Spreadsheet formats

If the file is in XLS or another spreadsheet format, you just need to open it and save it as TSV.

Sheets in other document formats

If it is a table in DOC, PDF or another document format, you can select the first few characters in the table, scroll to the bottom of the table, hold down the Shift key, then click after the last character in the table. Copy the selection, paste it into a spreadsheet program and save as TSV.

But this can sometimes cause a problem: all the content is stuffed into the first cell. In this case, you can use open-source tools such as Tabula or Excalibur to do the work. But they are not always reliable. If they fail, you can try online services. Some online services have page limits, in which case you need open-source tools such as PDFsam, PDFTK Builder or PDF-Shuffler to split the PDF files first.

Custom sheet format

Same delimiter

You need to copy the delimiter between two columns. Open the file with a spreadsheet program, paste the delimiter into “Field delimiter” (for LibreOffice).
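
The same idea can be sketched in Python with the standard csv module, which accepts any single character as the field delimiter. The function name read_delimited, the “|” delimiter and the sample lines are assumptions for illustration.

```python
import csv
import io


def read_delimited(text, delimiter):
    """Split text into rows using a single-character field delimiter."""
    # QUOTE_NONE treats quotation marks as ordinary characters.
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [row for row in reader]


# Hypothetical file content that uses "|" between columns.
sample = "tere|hello\nhead aega|goodbye\n"
for row in read_delimited(sample, "|"):
    print("\t".join(row))
```

Each row is printed with tabs between the fields, which is the TSV form of the same data.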

Different delimiters

In this case, you need to use a text editor instead of a spreadsheet program.

A hint: in a text editor's “replace” function, “Tab” can be represented by “\t”, and “Enter” by “\n” on Linux, BSD, macOS and Solaris, or by “\r\n” on Windows. “t” stands for “tab”, while “r” and “n” stand for “return” and “newline” respectively; different systems adopt different conventions. Further reading: the Wikipedia articles on newline and carriage return.
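
What the “replace” function does can also be sketched in Python. The source layout “headword = translation; note” is purely hypothetical, as is the function name normalise; a real file will need its own patterns.

```python
import re


def normalise(text):
    """Unify line endings, then turn two different delimiters into tabs."""
    # Convert Windows ("\r\n") and lone "\r" line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Hypothetical layout: " = " after the headword, "; " before the note.
    return re.sub(r" = |; ", "\t", text)


sample = "tere = hello; greeting\r\naitah = thanks; polite\r\n"
print(normalise(sample), end="")
```

The order matters: “\r\n” is replaced before a lone “\r”, otherwise every Windows line ending would become two newlines.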

Please try to solve the problems yourself. If you can't, see solutions: Language/Multiple-languages/Culture/Licensed-Free-Databases#Manually_convert_to_TSV.

DB format

Some people use the DB format. You need an open-source tool such as DB Browser for SQLite or SQLiteStudio to open it. If you use DB Browser for SQLite, open the DB file, see which tables it contains, select “File -> Export -> Table(s) as CSV file”, select the tables you want to export, and make sure “Field separator” is set to “Tab”.
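
If the DB file is an SQLite database, the export can also be sketched with Python's standard sqlite3 and csv modules. The table name entries, its columns and the function name table_to_tsv are assumptions for this example, and an in-memory database stands in for a real file.

```python
import csv
import io
import sqlite3


def table_to_tsv(connection, table):
    """Return one table as TSV text, with the column names as the first row."""
    # Table names cannot be passed as SQL parameters, so check the name first.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    header = [column[0] for column in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical in-memory database standing in for a real DB file.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE entries (word TEXT, meaning TEXT)")
connection.executemany("INSERT INTO entries VALUES (?, ?)", [("a", "ma"), ("b", "mb")])
print(table_to_tsv(connection, "entries"), end="")
```

For a real file, sqlite3.connect would be given the path to the DB file instead of ":memory:".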

DICT/DICT.DZ format

DICT files are compressed as DICT.DZ. Open-source tools such as Xarchiver for Linux and BSD or 7-Zip for Windows can extract DICT files from DICT.DZ files. DICT files can be edited with a text editor.
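
DICT.DZ files are written by dictzip, whose output is compatible with the gzip file format, so Python's standard gzip module should be able to extract them as well. The sketch below makes that assumption; the function name extract_dz and the sample bytes are made up, and in-memory streams stand in for real files.

```python
import gzip
import io
import shutil


def extract_dz(compressed_stream, output_stream):
    """Decompress a gzip-compatible stream (such as DICT.DZ data) into output_stream."""
    with gzip.GzipFile(fileobj=compressed_stream, mode="rb") as source:
        shutil.copyfileobj(source, output_stream)


# Stand-in for a DICT.DZ file: a few bytes compressed with gzip.
sample = io.BytesIO(gzip.compress(b"tere\thello\n"))
output = io.BytesIO()
extract_dz(sample, output)
print(output.getvalue().decode("utf-8"), end="")
```

With real files, the two streams would be files opened in binary mode ("rb" for the DICT.DZ file, "wb" for the extracted DICT file).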

SLOB format

It is a format designed for Aard 2. The slob.py tool seems to be the only program that can read it. Further research is required here.

How to combine data with same column from two spreadsheets

You have a dictionary file in spreadsheet format, but it has so many entries that you don't want to memorise them all. Then you get a list of common words in spreadsheet format. How do you combine them?

Here's a guide for Microsoft Office Excel.

For LibreOffice Calc and other spreadsheet programs, you need to put the two sheets side by side, then use VLOOKUP (vertical lookup).

In this example, columns A and B are from the frequency list, and columns E and F are from the dictionary. Column C will hold the result: the meaning from the dictionary attached to the frequency list:

  | A         | B    | C       | D | E    | F
1 | frequency | word | meaning |   | word | meaning
2 | 1         | b    |         |   | a    | ma
3 | 2         | a    |         |   | b    | mb
4 | 3         | c    |         |   | c    | mc
5 | 4         | e    |         |   | d    | md

Steps:

  1. Select C2 and enter “=VLOOKUP(B2,$E$2:$F$5,2,0)”;
  2. Move the cursor to the cell's lower-right corner and double-click to fill the formula down the column.

Explanation of VLOOKUP:

The 1st argument is the cell whose value is to be looked up; the 2nd argument is the range to search, whose leftmost column is compared with the 1st argument; the 3rd argument is the position, counted from the left edge of that range, of the column whose value is returned; the 4th argument is either 0 (the left column is unsorted and an exact match is required) or 1 (the left column is sorted and something other than an exact match may be returned). Further reading is available in the Apache OpenOffice and LibreOffice documentation.

Result:

  | A         | B    | C       | D | E    | F
1 | frequency | word | meaning |   | word | meaning
2 | 1         | b    | mb      |   | a    | ma
3 | 2         | a    | ma      |   | b    | mb
4 | 3         | c    | mc      |   | c    | mc
5 | 4         | e    | #N/A    |   | d    | md

Then copy or cut column C and use paste special (shortcut Shift+Ctrl+V): make sure “formula” is unchecked, then click on “OK”. This replaces the formulas with their results. Delete columns E and F, and save as TSV.
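
The same exact-match lookup can be sketched outside a spreadsheet. This Python example reuses the sample data from the tables above; the function name attach_meanings is made up, and the “#N/A” placeholder simply imitates what the spreadsheet shows for a missing word.

```python
def attach_meanings(frequency_rows, dictionary_rows, missing="#N/A"):
    """Attach a meaning to every (rank, word) pair by exact match on the word."""
    meanings = {}
    for word, meaning in dictionary_rows:
        # Keep the first entry if a word appears more than once.
        meanings.setdefault(word, meaning)
    return [(rank, word, meanings.get(word, missing)) for rank, word in frequency_rows]


frequency_rows = [(1, "b"), (2, "a"), (3, "c"), (4, "e")]                  # columns A and B
dictionary_rows = [("a", "ma"), ("b", "mb"), ("c", "mc"), ("d", "md")]    # columns E and F

for rank, word, meaning in attach_meanings(frequency_rows, dictionary_rows):
    print(rank, word, meaning, sep="\t")
```

The printed lines are already tab-separated, so they match the result table with columns D to F removed.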