I think we should force quoting of strings, force "," for separation and "\n" for lines. Dates are ISO, decimals use ".".
That is all.
P.S.: This is similar to how I did it for my mini-lang https://tablam.org/syntax. Tabular data is very simple, and the only problem with CSV is that it is too flexible and lets everyone change it, but putting the "schema" outside the CSV works against its purpose.
This addresses part of the problem. Don't forget the character encoding, null encodings, separator, quote and escape chars,...
No doubt, you can use CSV in a well-defined way if there is agreement and a spec. The problem is that there simply isn't any de facto standard, and one won't get established in this domain...
Btw.: the Python intake package allows you to specify metadata.
Null encoding is entirely uncovered by the CSV format - if you want to communicate nulls you need to have an internal convention, some sort of reserved string. The quoting character is pretty commonly agreed to be the double quote ("), and escape characters are technically unnecessary. Deciding to support \\ and \n is a pretty nice thing to do, since some producers and parsers will erroneously choke on mid-string newlines, but the only character that actually needs to be escaped (the double quote) can just be doubled up as per RFC4180[1]. The indecision on escaping means that the literal character sequence "\n" is very hard to represent in CSV consistently; this is an unfortunate malus of C programmers getting all up in the standard.
RFC4180 is an after-the-fact RFC to try and codify common practices in the industry so some tools may refuse to conform to it - but it's really clearly written and like... super short. If you've never read it just go ahead and do so since it lays things out with exceeding care.
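For what it's worth, the doubling rule is small enough to sketch in a few lines. Here's a minimal illustration in Python (not a full writer; the function name and the quote-only-when-needed behavior are my own assumptions):

    # Minimal sketch of RFC 4180-style quoting: the only escaping needed
    # is doubling any embedded double quote, then wrapping the field.
    def quote_field(value: str) -> str:
        if any(c in value for c in (',', '"', '\n', '\r')):
            return '"' + value.replace('"', '""') + '"'
        return value

    print(quote_field('she said "hi"'))   # -> "she said ""hi"""
    print(quote_field('plain'))           # -> plain

In practice you'd just let the standard library's csv module do this for you (double-quote doubling is its default), but it shows why no backslash escape is required.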
Why do you prefer commas as delimiters? I find tabs are less likely to appear within the strings (so there's less need to quote strings because a delimiter appears inside them, and thus the files are smaller). And both can be typed with a single keystroke.
The only advantage I would see for commas is if you are editing the file in a text editor that doesn't highlight strings (like Notepad), where it's hard to distinguish tabs from spaces.
The advantage you mentioned is the one I've seen provide the most benefit. If users are ever modifying the file raw they're going to bork the tab characters - never mind some developer opening the file whose editor is configured to aggressively convert tabs to spaces.
Tabs are trivial to actually type (so they're much better off than, say, Form Feed), but they're difficult to visually distinguish. I also think it's generally a good habit to just force quoting of all fields in CSV to dodge any impending user errors.
Human readable file formats have surrendered space efficiency for legibility and the main attribute they need to focus on, IMO, is defense against human errors.
IMHO the point about .csv is that writing a parser is trivial so none of this matters.
Many writers bork on certain types of data.
No biggie.
Yes there are variants, no they are not complicated.
Trying to support _all_ variants of something that is not well-defined is a mug's game. Most of the differences are there by mistake. Some of the differences can't be fixed in the parser.
If you need csv to work, pick any one interpretation that does work.
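In practice "pick one interpretation" can be as simple as pinning a dialect up front and refusing to guess. A small sketch with Python's standard csv module, assuming the RFC 4180-ish choices discussed above (comma, double-quote doubling, quote everything; file and dialect names are placeholders):

    import csv

    # Register one fixed interpretation and use it for both reading and
    # writing; anything that doesn't match is the producer's problem.
    csv.register_dialect(
        'strict',
        delimiter=',',
        quotechar='"',
        doublequote=True,          # "" escapes a quote, RFC 4180 style
        lineterminator='\r\n',
        quoting=csv.QUOTE_ALL,     # force-quote every field on output
    )

    with open('data.csv', newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f, dialect='strict'))

A few explicit parameters like these remove most of the ambiguity without needing a new format.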
Anything that requires multiple files to be passed around together is a non-starter. Even with just one file it's so hard for people to keep track of the 'current' one.
I think that's why the archive is part of the format. But then you have a process change where you expect people to work with the archive and not the contents. That can be done but it requires some institutional will.
For instance in a code signing situation it was socialized that only the archive was 'safe' and you used one of several tools to build them or crack them open. However that company was already used to thinking about physical shipping manifests and so learning by analogy worked, after a fashion.
Even better, you can just bake that metadata into the column headers. CSVs have no mechanism or widely accepted habit for skipping over entries - it's safer and more portable to bake any control data you need into the place reserved for it: the header line.
No, we can't - since people who aren't on Macs may potentially need to use this standard and that metadata descriptor file has failed to gain any adoption traction outside of Apple.
That said - all sorts of programs that bundle files inject random metadata files, and if there were a codified standard it should be flexible enough to ignore these files (possibly agreeing that no file leading with a `.` is ever required for the format to function, since most of the random junk that gets thrown into zip files tends to at least consistently follow that habit).
Am I insane here, or does Excel generally do a very good job of handling almost all CSV files?
I don't really see a need for a metadata file, nor would I ever see Excel or other tools accepting it. The main problem is adoption, CSV isn't perfect but it's what we have. Now if you wrote this as a member of the Excel team at Microsoft, and then Excel had the option of exporting CSV files with a metadata file, then I'd be a bit more excited.
Excel does a fairly bad job when moving data between two computers that aren't configured the same way, which kind of defeats the purpose of using a data interchange format.
Where I work we have offices in the US and in Europe, where installing a localized version of Windows will swap ',' and '.' as the group and decimal separators. When Excel loads the value 100,002 in the US it will see one hundred thousand and two; in some parts of Europe it will see one hundred and two thousandths.
Character set handling can be just as bad: there is no good way to get Excel to auto-open a CSV file as UTF-8 that won't break every other CSV parser in existence. The only cross-platform option is ASCII. Excel will happily load your local OS encoding, likely some variant of ISO-8859, but any other encoding requires jumping through hoops.
Excel has some major problems with CSV ingestion, though I don't think this proposal will address them.
Here are a couple of cases that I run into frequently:
* Excel is very aggressive about forcing type conversion based on its own assumptions. It will convert strings to dates or numbers, even if data is lost in the process. It will ignore quotes and convert long numeric IDs into scientific notation, which truncates the ID unrecoverably.
* Excel cannot deal with quoted strings containing line breaks. It treats them as separate records and you get truncated records and partial records on separate rows.
Excel does an atrocious job of handling CSV files. It regularly alters data, messes up encodings and either can't or couldn't (I haven't checked in a few years) open CSV files that start with a capital letter I.
Source: dealing with CSV files people exported from Excel and the horrors that flowed from there.
> either can't or couldn't (I haven't checked in a few years) open CSV files that start with a capital letter I.
It can't do this because it confuses such files with files in SYLK format, which was YET ANOTHER attempt to standardize spreadsheet data interchange, dating from the 80s.
What I find incredible about this is that Excel decides a file with the extension ".CSV" must not be a CSV file but SYLK. Then loading it as SYLK fails, and it doesn't try to load it as a CSV file instead.
I always view CSV as a lowest common denominator; of course more precise formats exist, but not everyone can use those. CSVs normally get the job done, but like anything else you need to know its limitations. Something like a basic phone book should work; your scientific data, with dozens upon dozens of floating point numbers, may not.
These are not limitations of CSV files, that's Excel. Not being able to open a CSV file starting with the letter I has nothing to do with CSV as a format.
There just aren't that many widely used applications that deal with things like this, but Google sheets seems to be the obvious one.
LibreOffice is generally a little less aggressive about formatting, but still has issues similar to Excel.
For me personally, I usually view things in Excel/LO but don't save, and if I need to modify anything I'll use a text editor for one-off changes, or I'll use something like Python with the pandas library for more programmatic changes. Pandas does have issues forcing timestamps to convert sometimes, but that can be easily configured. Otherwise, no problems with this method.
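As a sketch of the "configure it" part: with pandas you can turn off type guessing entirely and convert only what you mean to convert (the file and column names below are made up):

    import pandas as pd

    # Read everything as strings so nothing is silently coerced to
    # dates, floats, or NaN behind your back.
    df = pd.read_csv('export.csv', dtype=str, keep_default_na=False)

    # Opt specific columns back into real types, explicitly.
    df['amount'] = pd.to_numeric(df['amount'])
    df['created'] = pd.to_datetime(df['created'])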
There's definitely a lot of friction if you want to edit .CSVs without your formatting/encoding being altered, unfortunately.
I much prefer ingesting CSVs into LibreOffice since it's a lot more upfront about what encoding it's going to try and use and the delimiter it has chosen - and those choices are flexible and extremely easy to modify. Excel also has a habit of utterly clobbering character encoding on export - once you start seeing stray umlauts and cedillas in your text you'll notice that Excel has a really bad habit of choosing latin-1.
Yeah I suppose that's the crux of it. I really like that first popup that LibreOffice has when you open it up where you can specify encoding, delimiters, etc.
LibreOffice does a similar amount of clobbering, but in general it notifies you of any would-be clobbering and allows you to abort beforehand, which is really nice. I still avoid exporting via LibreOffice though for very sensitive situations, but it is noticeably better still.
> Am I insane here, or does Excel generally do a very good job of handling almost all CSV files.
If the user does Data > From Text/CSV > Transform > use PowerQuery to set the first row as the header, I think it's true that Excel does a good job. It provides the basics at least: configurable charset and column type detection. When the user re-saves to CSV, I'm not aware of any way to configure the output (e.g., force quotation of text content), but that's sort of ok.
The easy path—double clicking on a CSV file or using File > Open—is where all the weird auto-conversion of values happens. But other posts have covered that part.
Excel will strip leading zeros from your data, even when you're escaping it; it will also assume that a single cell in a column of dot-separated floats is in fact a date. Excel is in fact so good at detecting formats that even if you construct an xls, fill it with IDs that partially start with zeros, and properly mark them as strings, it will still nag you at every single cell that there is something fishy about the file.
There's already a common way to add metadata to CSV: use the first row. It is used for column names, but we can extend it to carry any metadata. It would be easy to implement in current parsers, and it would fail in a, if not graceful, at least predictable way that is probably already accounted for (i.e. garbage in the first row).
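As a sketch of what that could look like (a hypothetical convention, not an existing standard): tuck a type hint into each header cell, e.g. `id:string,price:float,created:date`, and parsers that don't know about it just see slightly odd column names:

    import csv

    # Hypothetical "name:type" header convention; cells without a hint
    # fall back to plain strings, and unaware parsers still get a header.
    with open('data.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        columns = []
        for cell in next(reader):
            name, _, typ = cell.partition(':')
            columns.append((name, typ or 'string'))
        rows = list(reader)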
This just makes it more complex… and then what? We end up with multiple specifications, like an X12 document? No, it's not time to retire it. It's just another option, like any other delimited file format. Remember .INI files? They're still being used when there is a need for simplicity.
If you think CSV is complicated for your app's requirements, choose a different delimiter like a pipe, or else look at other alternatives. Simple as that.
I've spent years building parsers for different documents in the retail juggernaut businesses.
Oh neat - they actually decided to support the technically correct quote-escaping approach for CSV files instead of defaulting to the C-string approach!
I think they're simply acknowledging the fact that there are many permutations of formats out there and what they propose is a way to communicate what your particular format is without abandoning CSV completely.
I'm assuming it's in response to the post yesterday that outlined all the things that are wrong with CSV and how other formats like Parquet are better.
The chaos of CSV files is older than XML. And while many of the same issues have played out in HTML, I have a strong suspicion that the particular levels of hell familiar to anyone using CSV files professionally informed some of the opinionated nature of XML. Especially since XML delves into data transport instead of just typesetting of text meant for human consumption.
I know I had specific conversations about CSV versus XML and those referred to a substantial body of literature on the topic.
People don't know how to type them on a US keyboard, so it'll never pass the "if I can't hand-write this in ed(1), it isn't a valid file format" hacker adoption barrier without someone of Google's size to force us all to adopt it (e.g. protobuf).
I believe you could type it in DOS with the same trick that works today: Alt-028, though nowadays you have to say Alt-001E due to Unicode. (Actual ascii/uni codes may vary.)
Perl6 tried incorporating non-US-keyboard characters into the language and that went very badly. I'm sure it works fine for Perl6 people, but beyond that boundary, I still encounter people who can't type é on a Mac with the keyboard alone today, much less handle Alt-001E. So I am extremely pessimistic.
We actually use these as part of bulk loading CSVs into SQL Server 2016, because bulk insert can't distinguish commas within a field from the one being used as the delimiter. Pain in the ass to debug with, though.
Interesting, hadn’t heard of these! My guess is the complexity of parsing csv consistently extends a bit beyond that (e.g. handling line breaks), but the bigger problem is getting enough traction behind any solution, no matter how spot on it is, and outpacing csv. Not to mention csv is a far catchier acronym than “rssv” :P
Drop the notion of packaging them together and you have a winning idea. There's an optional metadata file that may or may not be present alongside the exact same CSV that we've always had. For systems that implement the metadata they can take advantage.
We do not need a new format. There are numerous superior options if we're just going to provide a different format (parquet, avro, orc, etc).
Could this be too general a problem to "solve" without narrowing down major use cases and industries, and likely first adopters? Will there need to be different evolutionary/adoption paths? Will Postel's law need to play a role?
A priori, the design space I'd want to look at is something between protobuf and csv. Perhaps optionally at the end of each line [somewhere non-breaking] you add metadata for that line, including specifying a parser implementation you know can perform perfectly.
The OP suggestion might address this if the metadata could apply to particular ranges of lines. It would be more fragile to the extent of being in a different file.
I thought the “time to retire csv?” thread was tongue in cheek satire, but it seems real people think it can happen. Maybe my thinking is too static, but I feel like as long as a delimited string is a thing, csv will remain ubiquitous. Everything is meta data and also not meta data.
You could also add header line(s) marked with a leading # and skip those lines when using the file for whatever task. This is used in some bioinformatics file formats to specify metadata and is simple enough to work with.
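A small sketch of how that plays out on the consuming side (file name assumed; this also assumes # lines only ever appear as whole lines, never inside quoted fields):

    import csv

    with open('annotated.csv', newline='', encoding='utf-8') as f:
        # Collect the leading metadata lines, feed the rest to the parser.
        meta, data_lines = [], []
        for line in f:
            (meta if line.startswith('#') else data_lines).append(line)
        rows = list(csv.reader(data_lines))

pandas does something similar out of the box with read_csv(comment='#'), which drops lines that start with the comment character.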
Metadata in a file is as annoying as a BOM.
Make decisions about how you plan to interpret wooly specs outside the file and do this for all files. Problem solved: with no code.
Interesting. Bridges the gap between old school and new school. Us old school folks have to walk the fine line between extinction or having some legacy in the DNA :)
Single file with an open standard binary format based on Protobuf or CapnProto. This idea of using text files by default is ancient Unix cruft. Use binary formats unless you have a damn good reason not to, always, and release open-source tools for reading and processing files in your format. Then you can do some semblance of enforcing structure and types on your data.
Time to retire the CSV? - https://news.ycombinator.com/item?id=28221654 - Aug 2021 (544 comments)