Tab Separated Values (TSV): a format for tabular data exchange

TSV is a very simple textual data format which allows tabular data
to be exhanged between applications that use different internal data
formats. This document briefly explains this standardized format
and gives simple examples of using it.

Content:

  • The TSV format
  • Not to be shown as is!
  • Importing TSV data to MS Excel
  • Generating TSV data
  • Exporting data in TSV format from Excel
  • Converting from TSV format to HTML table format

The TSV format

The Tab Separated Values has been
officially registered
as an Internet media type
(MIME type) under the name
text/tab-separated-values. Note that
major media type text means that the data is in textual
format which can be viewed and edited using a normal text editor; in practice
however this is seldom done- instead, the format is programmatically written
and read.

The abbreviation TSV (or tsv) is commonly used, and usually the file name
suffix for TSV files is ".tsv".

The registry entry is very short, since the format is very simple, but the
description is paraphrased here in plain English (as opposite to the partly formalized
notation in the registration):

  • A file in TSV format consists of lines.
  • Each line contain fields separated
    from each other by TAB characters
    (horizontal tab, HT, Ascii control code9).
  • "Field" means here just any string of characters, excluding TABs.
    The point is simply that
    TABs divide a line into pieces, components.
  • Each line must contain the same number of fields.
  • The first line contains the names for the fields (on all lines), i.e.
    column headers.

The last rule could be taken just as allowing the
labeling of columns that way. Mostly if you wish to include just tabular data
as such, it probably works OK. But it's usually recommendable to use some labeling
there.

Since TAB is used as a separator between fields, a field cannot contain a TAB.
However, TABs usually don't appear in data items that you wish to tabulate,
so this is seldom a restriction. There are various other formats which are very
similar to TSV but use a different separator, such as
Comma Separated Values
(CSV).
Note that commas, spaces, and other characters often used
as separators in such formats
appear rather often in data to be tabulated, at least in
header fields. Special conventions can be made to allow the inclusion of
the separator into data fields, but the extreme simplicity of the data
format is then lost.

The example given in the registration is the following,
where <TAB> denotes a TAB character:

Name<TAB>Age<TAB>Address
Paul<TAB>23<TAB>1115 W Franklin
Bessy the Cow<TAB>5<TAB>Big Farm Way
Zeke<TAB>45<TAB>W Main St

Not to be shown as is!

Although TSV is a text format,
a TSV file is not expected to appear in a nice tabular format
when displayed or printed as such, e.g. by an editor. In special cases it might do that,
if tab stops are set suitably in the environment and if the fields in a column
have roughly the same width. But even the simple example probably fails
when embedded into an HTML document (inside a
PRE element, which is the best shot):

Name	Age	Address
Paul	23	1115 W Franklin
Bessy the Cow	5	Big Farm Way
Zeke	45	W Main St

Using TABs in HTML is unreliable, but more fundamentally, tabbing with TAB is
in itself not structural. It just takes to some predefined tab stop; typically the
tab stops are set at 1st, 9th, 17th, etc. character position. This means that
the data would look OK by accident only. And when using TSV, you are not
supposed to try to "fix" things by using multiple TABs; that
would mean that there are empty fields between
the TABs, defeating the very idea of TSV.

What happens if you ask your Web browser to show a TSV file, e.g. by following a
link to a file containing our example? It really depends on
many things; it's a matter of
handling media types, which can be a bit confusing.
It most essentially depends on the settings of your browser, but
also on how the server sends it. But in any case, you as a user have
at least as one option the possibility of saving the file
onto local disk, to be processed separately using some suitable program (e.g.
MSExcel).
So TSV is one way of making tabular data available on a network in a universally
accessible way, though it may require some skills on the user side to know what
could be done with the data and do something useful with it.

Importing TSV data to MS Excel

We could give a TSV file as input to a
spreadsheet application, like MSExcel.
Such programs generally accept TSV format data.
The data can then be viewed, printed, and manipulated in tabular format.
And you can save the data in MSExcel's "native" format (which cannot
be processed by other programs unless they have been specifically written
to process it) or in TSV format, or in a few other formats.
Our simple example data would get displayed roughly as follows:

Cells with long content may look truncated. This is
Excel's feature and often useful when manipulating spreadsheets.
The presentation, both on screen and on paper, can be tuned as needed,
using various tools in Excel.

When you have Excel in use, you can open the the File menu,
then select Open, and pick up the file to be opened. For finding the file,
you may need to set value of the "Files of type:" setting (which is
effectively a file type filter) to "Allfiles(*.*)". (The details
may vary depending on the version and configuration of the program.)

But you can make things easier by setting things up so that your system
(we're assuming some flavor of Windows here) automatically recognizes
.tsv files as something to be opened in Excel, when you click on their
icons. Assuming there is no such association for .tsv in your system yet,
you'll be prompted for one when you first click on the icon of a .tsv file.
When prompted for "Description of '.tsv' files:", you could reply e.g.
"text/tab-separated-values". Then pick up Excel from the list of programs
and clickOK.

When TSV data is imported into Excel, some data
transformation and interpretation may take place. For example,
some strings might be interpreted as dates and converted to a different
date format, and Excel generally treats a string of digits as a number
and right-justifies it. In such cases, "opening by clicking"
could be inadequate, since it might not give you the option of
specifying how the data is to be interpreted. Explicitly opening
the file in Excel normally lets you use "Text import wizard" where you
can specify the interpretation of each column.

Generating TSV data

For example, if you have a program of your own (say, a Fortran or C program,
or a Perl script) that writes out some tabular data, and you'd like to be
able to open (or let others open) the data in Excel, the TSV approach is rather
simple. You just need to know how to write data in TSV format, which is alot
easier than trying to generate MSExcel format. Usually you
can just use TAB characters in a string used to format the output, but see next
paragraph as regards to editors; in C and Perl for example, you can alternatively use
the notation \t inside a character constant or string constant.

And, of course, you could type TSV data "by hand", using a text editor.
Note that typing the TAB character may require some extra trick, since
many editors process it as command-like, instead of entering it into data
as a character. For example, in some modes Emacs turns TABs to sequences of
spaces, which won't do in TSV; to prevent this, type control-Q before hitting
the TABkey.

As an example, here is
Fortran 77 code that writes a matrix is TSV format:

      parameter(m=42)
      parameter(n=5)
      real a(m,n)
... code that writes data to matrix a ...
      do 200 i=1,m
 200     print 210, (a(i,j),j=1,n)
 210     format(F10.6,4('	',F10.6))

The code is not particularly elegant, since the number of
columns is hard-wired into the format statement, which
causes the first number to printed as such and 4 more with a tab character
before each of them. Between the apostrophes (single quotation marks), there
is exactly one TAB character and nothing more; the appearance on your
browser may vary.

Exporting data in TSV format from Excel

When you save a spreadsheet in Excel (using Save in the File menu,
or using control-S), it normally uses by default its own internal data format.
You can override this by using SaveAs... (in the File menu)
instead and selecting the desired format. Among the available formats,
you can pick up "Text (Tab delimited) (*.txt)", which means TSV.
(Here, too, you may encounter some variation between versions of Excel.)
Note that the default file name suffix will be .txt when; you may
wish to override this, reserving .txt for plain text files.

Saving as TSV generally loses formatting information.
Quite often this is desirable, since if you e.g. need to E-mail some data
or put a table onto the Web, information about
the specific font faces, cell widths, and other
presentational details are not needed and can cause problems.
Note that
Excel's own "Save As HTML" includes a horrendous amount of
such data. There is software called
Office 2000 HTML Filter2.0
available from Microsoft for free, but it fixes only part of the
problem (for Excel files - for Word files, it's pretty good).

You could use this possibility for the job of making data
available on the Internet, or in an intranet. If you have the data in Excel
format, it will be less universally accessible than e.g. TSV format. Although there are
Excel viewers available free of charge for Windows, not everyone
wants to install such software just to casually view some tabular data,
and not everyone uses Windows at all. And even if people have Excel installed,
they might be unwilling to open documents from unknown origin, in fear of
macro viruses for example. So TSV might be a good idea at least as an alternative.
You could always make the data available both in Excel format and as TSV, especially
since saving as TSV may remove some formatting information.

Converting from TSV format to HTML table format

For putting data onto the Web in particular, you could additionally
convert the data from TSV format into an
HTML table and
make it a separate HTML document, or insert it into an HTML document, e.g. as follows:

Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St

Since TSV is a very simple format, and since the basic structure of HTML
tables is very simple, such a conversion is rather trivial as soon as you
have any programming tool. At the simplest, you could simply add the
<table> tag before the data
and the
</table> tag after it, and
insert
<tr><td> at the start of each data line,
and replace each tab with a <td>.
That way, each line would be turned to a table row and each field to a table cell.
However, this would misrepresent the first line, which is a line of header fields,
at least by TSV definition, so it would be better to use th rather than
td there. There are some additional considerations which make a
somewhat more advanced strategy better. The following Perl code
(also available as a separate file)
tries to produce nice HTML markup and uses very simple heuristics:
it assumes that any field which contains only digits
and spaces
should be right-aligned in a cell.

print <<END;
<table border="1" cellspacing="0" cellpadding="4">
<thead>
<tr><th>
END
chomp ($_ = <STDIN>);
s?\t?</th><th>?g;
print "$_</th></tr>\n";
print "</tr>\n</thead>\n<tbody>\n";
while (<>) {
    print "<tr>\n";
    my @fields = split('\t');
    for $cell(@fields) {
	if($cell =~ /^(\d|\s)+$/) {
	    print "<td align=\"right\">$cell</td>"; }
	else {
	    print "<td>$cell</td>"; } }
    print "</tr>\n"; }
print "</tr>\n</tbody>\n</table>\n";

The code creates just the table markup, so the result can be simply
inserted into an existing HTML document. The example table above was created that way.

You don't need to know
Perl
in order to use that program.
(Of course you'd need some understanding of Perl in order to modify
the program for your needs. But you might also consider editing the result
using an editor, if you just wish to e.g. add an align="left"
attribute into the first <tr> tag, to make the header cells
left-aligned.)
But you need a Perl interpreter, and you
know how to run it. (Typically, you would, after saving the program under the
name tsv2h.pl in a file, and having TSV data in a file tsv.tsv,
go to command prompt level and
just give a command like perltsv2h.pl<tsv.tsv>tsvex.html
to get the converted data into file tsv.html.)
There are
free Perl interpreters available for virtually every platform.
For Win32, many people recommend ActivePerl.

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。