I think we should force quoting of strings, force "," for separation and "\n" for lines. Dates are ISO, decimals use ".".
That is all.
P.S.: This is similar to how I did it for my mini-lang https://tablam.org/syntax. Tabular data is very simple, and the only problem with CSV is that it is too flexible and lets everyone change it, but putting the "schema" outside the CSV works against its purpose.
This addresses part of the problem. Don't forget the character encoding, null encodings, separator, quote and escape chars,...
No doubt, you can use CSV in a well-defined way if there is agreement and a spec. The problem is that there simply isn't any de facto standard, and one won't get established in this domain...
Btw.: the Python intake package allows you to specify metadata.
Null encoding is entirely uncovered by the CSV format - if you want to communicate nulls you need to have an internal convention, some sort of reserved string. The quoting character is pretty commonly agreed to be the double quote ("), and escape characters are technically unnecessary. Deciding to support \\ and \n is a pretty nice thing to do, since some producers and parsers will erroneously choke on mid-string newlines, but the only character that actually needs to be escaped (the double quote) can just be doubled up as per RFC4180[1]. The indecision on escaping means that the literal character sequence "\n" is very hard to represent in CSV consistently; this is an unfortunate malus of C programmers getting all up in the standard.
RFC4180 is an after-the-fact RFC to try and codify common practices in the industry so some tools may refuse to conform to it - but it's really clearly written and like... super short. If you've never read it just go ahead and do so since it lays things out with exceeding care.
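For what it's worth, the doubling rule is small enough to sketch in a few lines. Here's a minimal illustration in Python (not a full writer; the function name and the quote-only-when-needed behavior are my own assumptions):

    # Minimal sketch of RFC 4180-style quoting: the only escaping needed
    # is doubling any embedded double quote, then wrapping the field.
    def quote_field(value: str) -> str:
        if any(c in value for c in (',', '"', '\n', '\r')):
            return '"' + value.replace('"', '""') + '"'
        return value

    print(quote_field('she said "hi"'))   # -> "she said ""hi"""
    print(quote_field('plain'))           # -> plain

In practice you'd just let the standard library's csv module do this for you (double-quote doubling is its default), but it shows why no backslash escape is required.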
Why do you prefer commas as delimiters? I find tabs are less likely to appear within the strings (so there's less need to quote strings because a delimiter appears inside them, and thus the files are smaller). And both can be typed with a single keystroke.
The only advantage I would see for commas is if you are editing the file in a text editor that doesn't highlight strings (like Notepad), where it's hard to distinguish tabs from spaces.
The advantage you mentioned is the one I've seen provide the most benefit. If users are ever modifying the file raw they're going to bork the tab characters - never mind some developer opening the file whose editor is configured to aggressively convert tabs to spaces.
Tabs are trivial to actually type (so they're much better off than, say, Form Feed), but they're difficult to visually distinguish. I also think it's generally a good habit to just force quoting of all fields in CSV to dodge any impending user errors.
Human readable file formats have surrendered space efficiency for legibility and the main attribute they need to focus on, IMO, is defense against human errors.
IMHO the point about .csv is that writing a parser is trivial so none of this matters.
Many writers bork on certain types of data.
No biggie.
Yes there are variants, no they are not complicated.
Trying to support _all_ variants of something that is not well-defined is a mug's game. Most of the differences are there by mistake. Some of the differences can't be fixed in the parser.
If you need csv to work, pick any one interpretation that does work.
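In practice "pick one interpretation" can be as simple as pinning a dialect up front and refusing to guess. A small sketch with Python's standard csv module, assuming the RFC 4180-ish choices discussed above (comma, double-quote doubling, quote everything; file and dialect names are placeholders):

    import csv

    # Register one fixed interpretation and use it for both reading and
    # writing; anything that doesn't match is the producer's problem.
    csv.register_dialect(
        'strict',
        delimiter=',',
        quotechar='"',
        doublequote=True,          # "" escapes a quote, RFC 4180 style
        lineterminator='\r\n',
        quoting=csv.QUOTE_ALL,     # force-quote every field on output
    )

    with open('data.csv', newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f, dialect='strict'))

A few explicit parameters like these remove most of the ambiguity without needing a new format.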
Anything that requires multiple files to be passed around together is a non-starter. Even with just one file it's so hard for people to keep track of the 'current' one.
I think that's why the archive is part of the format. But then you have a process change where you expect people to work with the archive and not the contents. That can be done but it requires some institutional will.
For instance in a code signing situation it was socialized that only the archive was 'safe' and you used one of several tools to build them or crack them open. However that company was already used to thinking about physical shipping manifests and so learning by analogy worked, after a fashion.
Even better, you can just bake that metadata into the column headers. CSVs have no mechanism or widely accepted habit for skipping over entries - it's safer and more portable to bake any control data you need into the place reserved for it: the header line.
No, we can't - since people who aren't on Macs may potentially need to use this standard and that metadata descriptor file has failed to gain any adoption traction outside of Apple.
That said - all sorts of programs that bundle files inject random metadata files, and if there were a codified standard it should be flexible enough to ignore these files (possibly agreeing that no file leading with a `.` is ever required for the format to function, since most of the random junk that gets thrown into zip files tends to at least consistently follow that habit).
Am I insane here, or does Excel generally do a very good job of handling almost all CSV files?
I don't really see a need for a metadata file, nor would I ever see Excel or other tools accepting it. The main problem is adoption, CSV isn't perfect but it's what we have. Now if you wrote this as a member of the Excel team at Microsoft, and then Excel had the option of exporting CSV files with a metadata file, then I'd be a bit more excited.
Excel does a fairly bad job when moving data between two computers that aren't configured the same way, which kind of defeats the purpose of using a data interchange format.
Where I work we have offices in the US and in Europe, where installing a localized version of Windows will swap ',' and '.' as the group and decimal separators. When Excel loads the value 100,002 in the US it will see one hundred thousand and two; in some parts of Europe it will see one hundred and two thousandths.
Character set handling can be just as bad: there is no good way to get Excel to auto-open a CSV file as UTF-8 that won't break every other CSV parser in existence. The only cross-platform option is ASCII. Excel will happily load your local OS encoding, likely some variant of ISO-8859, but any other encoding requires jumping through hoops.
Excel has some major problems with CSV ingestion, though I don't think this proposal will address them.
Here are a couple of cases that I run into frequently:
* Excel is very aggressive about forcing type conversion based on its own assumptions. It will convert strings to dates or numbers, even if data is lost in the process. It will ignore quotes and convert long numeric IDs into scientific notation, which truncates the ID unrecoverably.
* Excel cannot deal with quoted strings containing line breaks. It treats them as separate records and you get truncated records and partial records on separate rows.
Excel does an atrocious job of handling CSV files. It regularly alters data, messes up encodings and either can't or couldn't (I haven't checked in a few years) open CSV files that start with a capital letter I.
Source: dealing with CSV files people exported from Excel and the horrors that flowed from there.
> either can't or couldn't (I haven't checked in a few years) open CSV files that start with a capital letter I.
It can't do this because it confuses such files with files in SYLK format, which was YET ANOTHER attempt to standardize spreadsheet data interchange, dating from the 80s.
What I find incredible about this is that Excel decides a file with the extension ".CSV" must not be a CSV file but SYLK. Then loading it as SYLK fails, and it doesn't try to load it as a CSV file instead.
I always view CSV as a lowest common denominator; of course more precise formats exist, but not everyone can use those. CSVs normally get the job done, but like anything else you need to know its limitations. Something like a basic phone book should work; your scientific data, with dozens upon dozens of floating point numbers, may not.
These are not limitations of CSV files, that's Excel. Not being able to open a CSV file starting with the letter I has nothing to do with CSV as a format.
There just aren't that many widely used applications that deal with things like this, but Google sheets seems to be the obvious one.
LibreOffice is generally a little less aggressive about formatting, but still has issues similar to Excel.
For me personally, I usually view things in Excel/LO but don't save, and if I need to modify anything I'll use a text editor for one-off changes, or I'll use something like Python with the pandas library for more programmatic changes. Pandas does have issues forcing timestamps to convert sometimes, but that can be easily configured. Otherwise, no problems with this method.
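As a sketch of the "configure it" part: with pandas you can turn off type guessing entirely and convert only what you mean to convert (the file and column names below are made up):

    import pandas as pd

    # Read everything as strings so nothing is silently coerced to
    # dates, floats, or NaN behind your back.
    df = pd.read_csv('export.csv', dtype=str, keep_default_na=False)

    # Opt specific columns back into real types, explicitly.
    df['amount'] = pd.to_numeric(df['amount'])
    df['created'] = pd.to_datetime(df['created'])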
There's definitely a lot of friction if you want to edit .CSVs without your formatting/encoding being altered, unfortunately.
I much prefer ingesting CSVs into LibreOffice since it's a lot more upfront about what encoding it's going to try and use and the delimiter it has chosen - and those choices are flexible and extremely easy to modify. Excel also has a habit of utterly clobbering character encoding on export - once you start seeing stray umlauts and cedillas in your text you'll notice that Excel has a really bad habit of choosing latin-1.
Yeah I suppose that's the crux of it. I really like that first popup that LibreOffice has when you open it up where you can specify encoding, delimiters, etc.
LibreOffice does a similar amount of clobbering, but in general it notifies you of any would-be clobbering and allows you to abort beforehand, which is really nice. I still avoid exporting via LibreOffice though for very sensitive situations, but it is noticeably better still.
> Am I insane here, or does Excel generally do a very good job of handling almost all CSV files.
If the user does Data > From Text/CSV > Transform > use PowerQuery to set the first row as the header, I think it's true that Excel does a good job. It provides the basics at least: configurable charset and column type detection. When the user re-saves to CSV, I'm not aware of any way to configure the output (e.g., force quotation of text content), but that's sort of ok.
The easy path—double clicking on a CSV file or using File > Open—is where all the weird auto-conversion of values happens. But other posts have covered that part.
Excel will strip leading zeros from your data, even when you're escaping it; it will also assume that a single cell in a column of dot-separated floats is in fact a date. Excel is in fact so good at detecting formats that even if you construct an xls, fill it with IDs that partially start with zeros, and properly mark them as strings, it will still nag you at every single cell that there is something fishy about the file.
There's already a common way to add metadata to CSV: use the first row. It is used for column names, but we can extend it to carry any metadata. It would be easy to implement in current parsers, and it would fail in a, if not graceful, at least predictable way that is probably already accounted for (i.e. garbage in the first row).
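As a sketch of what that could look like (a hypothetical convention, not an existing standard): tuck a type hint into each header cell, e.g. `id:string,price:float,created:date`, and parsers that don't know about it just see slightly odd column names:

    import csv

    # Hypothetical "name:type" header convention; cells without a hint
    # fall back to plain strings, and unaware parsers still get a header.
    with open('data.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        columns = []
        for cell in next(reader):
            name, _, typ = cell.partition(':')
            columns.append((name, typ or 'string'))
        rows = list(reader)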
This just makes it more complex… and then what? We end up with multiple specifications, like an X12 document? No, it's not time to retire it. It's just another option, like any other delimited file format. Remember .INI files? They're still being used when there is a need for simplicity.
If you think CSV is complicated for your app's requirements, choose a different delimiter like a pipe, or else look at other alternatives. Simple as that.
I've spent years building parsers for different documents in the retail juggernaut businesses.
Oh neat - they actually decided to support the technically correct quote-escaping approach for CSV files instead of defaulting to the C-string approach!
I think they're simply acknowledging the fact that there are many permutations of formats out there and what they propose is a way to communicate what your particular format is without abandoning CSV completely.
I'm assuming it's in response to the post yesterday that outlined all the things that are wrong with CSV and how other formats like Parquet are better.
The chaos of CSV files is older than XML. And while many of the same issues have played out in HTML, I have a strong suspicion that the particular levels of hell familiar to anyone using CSV files professionally informed some of the opinionated nature of XML. Especially since XML delves into data transport instead of just typesetting of text meant for human consumption.
I know I had specific conversations about CSV versus XML and those referred to a substantial body of literature on the topic.
People don't know how to type them on a US keyboard, so it'll never pass the "if I can't hand-write this in ed(1), it isn't a valid file format" hacker adoption barrier without someone of Google's size to force us all to adopt it (e.g. protobuf).
I believe you could type it in DOS with the same trick that works today: Alt-028, though nowadays you have to say Alt-001E due to Unicode. (Actual ascii/uni codes may vary.)
Perl6 tried incorporating non-US-keyboard characters into the language and that went very badly. I'm sure it works fine for Perl6 people, but beyond that boundary, I still encounter people who can't type é on a Mac with the keyboard alone today, much less handle Alt-001E. So I am extremely pessimistic.
We actually use these as part of bulk loading CSVs into SQL Server 2016, because bulk insert can't distinguish commas within a field from the one being used as the delimiter. Pain in the ass to debug with, though.
Interesting, hadn’t heard of these! My guess is the complexity of parsing csv consistently extends a bit beyond that (e.g. handling line breaks), but the bigger problem is getting enough traction behind any solution, no matter how spot on it is, and outpacing csv. Not to mention csv is a far catchier acronym than “rssv” :P
Drop the notion of packaging them together and you have a winning idea. There's an optional metadata file that may or may not be present alongside the exact same CSV that we've always had. For systems that implement the metadata they can take advantage.
We do not need a new format. There are numerous superior options if we're just going to provide a different format (parquet, avro, orc, etc).
Could this be too general a problem to "solve" without narrowing down major use cases and industries, and likely first adopters? Will there need to be different evolutionary/adoption paths? Will Postel's law need to play a role?
A priori, the design space I'd want to look at is something between protobuf and csv. Perhaps optionally at the end of each line [somewhere non-breaking] you add metadata for that line, including specifying a parser implementation you know can perform perfectly.
The OP suggestion might address this if the metadata could apply to particular ranges of lines. It would be more fragile to the extent of being in a different file.
I thought the “time to retire csv?” thread was tongue in cheek satire, but it seems real people think it can happen. Maybe my thinking is too static, but I feel like as long as a delimited string is a thing, csv will remain ubiquitous. Everything is meta data and also not meta data.
You could also add header line(s) marked with a leading # and skip those lines when using the file for whatever task. This is used in some bioinformatics file formats to specify metadata and is simple enough to work with.
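A small sketch of how that plays out on the consuming side (file name assumed; this also assumes # lines only ever appear as whole lines, never inside quoted fields):

    import csv

    with open('annotated.csv', newline='', encoding='utf-8') as f:
        # Collect the leading metadata lines, feed the rest to the parser.
        meta, data_lines = [], []
        for line in f:
            (meta if line.startswith('#') else data_lines).append(line)
        rows = list(csv.reader(data_lines))

pandas does something similar out of the box with read_csv(comment='#'), which drops lines that start with the comment character.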
Metadata in a file is as annoying as a BOM.
Make decisions about how you plan to interpret wooly specs outside the file and do this for all files. Problem solved: with no code.
Interesting. Bridges the gap between old school and new school. Us old school folks have to walk the fine line between extinction or having some legacy in the DNA :)
Single file with an open standard binary format based on Protobuf or CapnProto. This idea of using text files by default is ancient Unix cruft. Use binary formats unless you have a damn good reason not to, always, and release open-source tools for reading and processing files in your format. Then you can do some semblance of enforcing structure and types on your data.
Time to retire the CSV? - https://news.ycombinator.com/item?id=28221654 - Aug 2021 (544 comments)