
I'm sorry to sound pessimistic, but I don't think that there's any technical solution to dirty data.

The problem is usually not "This data was sent as JSON without any schema and with syntax errors"; it's "This Avro file has a completely useless schema (e.g. everything is typed as string|null), and there are multiple enumerations where the same value is encoded as three different strings (e.g. yes, y, true)."
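To make that concrete, here is a minimal sketch of the kind of cleanup such a field needs. The function name and the accepted spellings are hypothetical and purely illustrative; the real set of spellings has to come from someone who knows the data, which is exactly the part a tool can't supply.

```python
from typing import Optional

# Hypothetical spellings for illustration only; a real dataset needs its own list.
TRUE_SPELLINGS = {"yes", "y", "true"}
FALSE_SPELLINGS = {"no", "n", "false"}


def normalize_flag(raw: Optional[str]) -> Optional[bool]:
    """Map a field typed as string|null to a real boolean, keeping null as None."""
    if raw is None:
        return None
    value = raw.strip().lower()
    if value in TRUE_SPELLINGS:
        return True
    if value in FALSE_SPELLINGS:
        return False
    # Fail loudly instead of guessing: an unknown spelling is a data question,
    # not something the code can decide.
    raise ValueError(f"unrecognized flag value: {raw!r}")
```

The code itself is trivial. Deciding what belongs in those two sets, and what to do with everything else, is the hard part.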



If we make it 10x+ easier to define schemas and reuse existing ones, by making them easier to author, concatenate, extend, share, and discover; provide great translation mechanisms to and from existing formats like SQL, JSON, CSV, and XML; and provide new automatic schema generation and data fixing from better deep learning models, I think we can get there.

We are working on all of that.



