I am surprised at how far along the PR effort is compared to the actual product development. The code seems to be still in its embryonic stage, and most of the features are simply desiderata at this point; yet there is already a catchy name, a snazzy logo, a nice website, and even stickers! And this Hacker News post, of course.
Now, it is possible that this happened by chance: for instance, perhaps the author has a friend who is a graphic designer, and this friend got really excited about the project and put together all of this as a favor. But perhaps it is a deliberate strategy, in which case I wonder: what are the benefits of front-loading the PR work like this? Did it help in securing grants, for instance?
Substack (yes, that substack) discussed this recently:
> a slick web page - this is very often (but not always) a sign of a library that put all of its time into slick marketing but has overly-broad scope and is very likely to become abandoned
Hi, dat maintainer here. I made the website in a couple of hours (https://github.com/maxogden/dat-data.com/commits/gh-pages) but have spent about 9 months on the code so far (read through e.g. closed issues for an example of the scope there). I also have tried to convey that dat is still pre-alpha and that people shouldn't use it unless they wanna get involved w/ early development, but someone (not me) decided to post it to hacker news nonetheless.
I am sorry, but are we looking at the same website? I see two static pages and a logo. Compared to the more than 2 months of work that has been put into the code base by what seems like a very talented team, the effort they have made on the "PR" is minimal (a few hours, probably even less). I think you are being extremely unfair.
Why spend two months building a product if you don't know it will be well received? To me, this is startup development done right: reducing the unknown, reducing the risk in the most efficient way. Even if it means showing a mockup of your service.
iRODS is a data storage system that provides replication (e.g. onto local filesystems, S3, and HPSS). Deployments at different organizations can be linked, so that (part of) the namespace of one company can be made visible to another. It has a permission system on namespaces.
But one of the nicest features is that it has a built-in C-like trigger language. You can specify triggers for certain namespaces. For research, for example, it is often necessary to have pointers to data (URLs) that are permanent and can be published in, e.g., papers. For this reason, EUDat[1] has triggers on namespaces that automatically create PIDs[2].
But since the trigger language is very complete, you can do nearly anything with it.
Note: I am not involved in iRODS in any way; I just know a couple of colleagues who use it for safe replication, sharing, etc.
Hi, maintainer of dat here. We're working on the first release now. The first use case is tabular data (a large volume of small rows), but we'll be adding a content-addressable blob store (for big binary attachments) on top for the beta, as well as some fun p2p (webrtc data channel) powered sync stuff (will be somewhat similar to BitTorrent sync) to maximize download speeds and let universities seed big scientific datasets.
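To make the content-addressable part concrete, here is a minimal sketch (illustrative only, not dat's actual code, and the class and field names are made up): each blob is stored under the hash of its own contents, so a small tabular row only needs to carry the hash of its big binary attachment.

```typescript
// Minimal sketch of a content-addressable blob store (illustrative, not dat's code):
// each blob is stored under the SHA-256 of its contents, so identical attachments
// deduplicate automatically and rows can reference blobs by hash alone.
import { createHash } from "crypto";

class BlobStore {
  private blobs = new Map<string, Buffer>();

  // Store a blob and return its content hash, which acts as its permanent address.
  put(data: Buffer): string {
    const hash = createHash("sha256").update(data).digest("hex");
    this.blobs.set(hash, data);
    return hash;
  }

  // Fetch a blob by hash; undefined means it still has to be synced from a peer.
  get(hash: string): Buffer | undefined {
    return this.blobs.get(hash);
  }
}

// A tabular row then stays small: it carries only the hash of its attachment.
const store = new BlobStore();
const blobHash = store.put(Buffer.from("big binary attachment goes here"));
const row = { id: 42, sample: "A-113", attachment: blobHash };
```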
Great work so far in an area I'm really interested in. I'm especially interested in the synchronisation and transformation modules; how are they coming along? Sync is an issue that distinguishes you from the work of the libre api project, but did you look at that project at all for the sharing aspect? I think they do that really well (though they still do sync with kill and load). Also, did you look at all at git assistant for sync?
I don't think this is the right approach. I'm fairly certain, after reading their website and git repo, that dat really does only deal with textual data. As someone that deals with huge datasets on a daily basis, text files are really out of the question. I have to rely on netCDF or HDF5 for my data stores.
I could be missing something here so anybody more knowledgeable should chime in.
Nope. Any really large government data is not text -- it's mostly in older open binary formats (e.g. weather data in GRIB from NOAA) or expensive proprietary binary formats (e.g. healthcare data in SAS from CMS).
> I wonder if most of the primary sources this is intended to target (i.e. govt) ...
The second sentence on their website appears to state otherwise: "As a team we have a bias towards supporting scientific + research data use cases."
Edit: Well, on further reading they do go on to say, "What we're building isn't quite ready for prime time yet, but if you want to play around...", so perhaps I should not jump the gun :)
I've been pondering hacking on git to give it a pluggable object-storage API. That way, you could have one repo dump its objects into a Fossil file-system server, or an S3 bucket, or a DHT, and then, when another repo fetched from it, it'd just be pulling metadata (branch-heads and tags), and getting the actual data from the object-storage instead. I'm pretty sure this'd fix the "git can't store large files" problem, especially if you also allowed composite file objects, and perhaps per-object concurrent BitTorrent retrieval.
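To sketch what that pluggable layer could look like (a hypothetical interface, not anything git or libgit2 exposes today): git would only ever say "store these bytes" and "give me the bytes for this hash", and the host would decide whether that means local disk, an S3 bucket, or a DHT.

```typescript
// Hypothetical pluggable object-storage API -- not an existing git or libgit2
// interface. git never knows which backend it is talking to; the host does.
interface ObjectStore {
  put(object: Buffer): Promise<string>;   // returns the content hash of the stored object
  get(hash: string): Promise<Buffer>;     // resolves the hash from wherever the bytes live
  has(hash: string): Promise<boolean>;    // may be answered locally or over the network
}

// A fetch then only transfers metadata (branch heads and tags); object contents
// are pulled lazily out of whichever ObjectStore backend the host configured.
async function materialize(store: ObjectStore, hash: string): Promise<Buffer> {
  return store.get(hash);
}
```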
See also: https://camlistore.org/ (unlike Dat, doesn't target datasets, but more generic user data, like photos and files; more Dropbox than npm). It has a pretty cool blob model, exposes a FUSE interface, has a slick photo-browsing UI, and integrates with S3 and (I think) Glacier.
git-annex is an ugly hack to, basically, reimplement git-with-this-feature on top of git. What I'm suggesting would be to reimplement git itself on top of distributed object storage -- effectively, to split git into two tools:
1. one that commits treeish diffs into objects, and then stuffs them into arbitrary content-addressable storage;
2. and one that manipulates (local or remote) mutable tables of refs to content-hash-URNs -- or "git repositories" for short.
The key component would be a facility (either a library or an OS service) for resolving content-hash-URNs into file-descriptors in some content-addressable storage system. The critical point is that this would not be a git component -- the committing tool wouldn't be in charge of where its objects go, and the repository tool wouldn't be in charge of resolving their URNs. Instead, both would just be relying on the host to decide how to get objects from/put objects into storage, and it's the host that would have a distribution strategy set up for all of its content-addressable objects, not just the ones that happen to be attached to git.
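Concretely, the split might look something like this (purely illustrative names; nothing here is an existing git interface). The point is only that neither tool knows where object bytes physically live; the host-provided resolver does.

```typescript
// Illustrative sketch of the two-tool split. The host resolver is the only
// component that knows how to turn a content-hash URN into actual bytes
// (local disk, S3, a DHT, ...).
interface HostResolver {
  store(object: Buffer): Promise<string>;   // returns a content-hash URN
  open(urn: string): Promise<Buffer>;       // resolves the URN to object bytes
}

// Tool 1: turn treeish diffs into objects and hand them to the host's storage.
async function commitObject(host: HostResolver, object: Buffer): Promise<string> {
  return host.store(object);                // tool 1 never decides where the bytes go
}

// Tool 2: a "git repository" is just a mutable table of refs -> content-hash URNs.
class RefTable {
  private refs = new Map<string, string>();
  update(ref: string, urn: string): void { this.refs.set(ref, urn); }
  resolve(ref: string): string | undefined { return this.refs.get(ref); }
}
```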
Libgit2 does this internally, although it doesn't expose the underlying store (so you wouldn't know to use S3 even if you could get objects from there).
That depends on how replicated/distributed the object storage is. In a DHT scenario, you might already be storing some of the objects yourself before you've even requested them.
The remote server can obviously be running an object-storage node itself, and so can your own local git repo -- the point is that you can have objects in more places than just your machine and the repo it's pulling from.
Dat is really exciting. A common format for sharing and hosting datasets, to say nothing of versioning and transforming data, is really necessary right now. I've been working on an open-source site that will act kind of like a GitHub for Dat[0][1]. I started four months ago, but decided that dat was still too unstable at the time to really build an API client against, so I'm waiting for at least an alpha release before going further. Super excited about the project though.
Currently, the site is just a bare-bones version that lets you submit the URL of a publicly available dataset, rather than including any actual dat integration. I'm planning on changing that though once dat becomes just a bit more stable.
Hey kcorbitt, cool stuff. I'm working with Max on dat. Before I knew about dat, I built http://datadex.io (which is like an npm for datasets -- example: http://datadex.io/jbenet/cifar-100 ). datadex will support dat data first class. Would be good to get your help when all this happens. :)
I like the Datadex idea and the promise to make datasets as easily accessible as source code repositories. I can see a number of use cases already for a standardized and fast way to collaborate on structured data sets.
Yeah, they're hidden away in their git repo [1]. They have a good set of example usage commands [2], and a nice introductory document explaining what dat is [3]. Perhaps most important are the technical notes [4] that describe the supported formats.
Git is pretty good at handling large files right now.
We use git to push configuration data to (near) real-time systems that keep this data in local memory-mapped key-value stores. It's essentially a fully replicated, eventually consistent key-value store, and git plays a starring role.
We regularly push multi-gigabyte files through this system with ease, with only a few tweaks to git's configuration. Git has some major advantages: it's fast, it can handle large files, and you can piggyback on the versioning system to ensure that multiple writers aren't competing. Git is also highly configurable and has a lot of out-of-the-box options for easily setting up git servers.
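As a very rough illustration of the pattern (the function names and paths are made up, not the actual setup): writers commit key/value files and push; consumers pull and reload their local stores, which is where the eventual consistency comes from.

```typescript
// Rough sketch of git as a replicated key-value transport (illustrative only):
// writers commit key/value files into a git repo and push; each consumer node
// periodically pulls and reloads its local memory-mapped store from disk.
import { execSync } from "child_process";
import { writeFileSync } from "fs";
import { join } from "path";

const REPO = "/srv/config-repo";  // hypothetical local clone present on each node

// Writer side: setting a key is just writing a file and committing it.
function setKey(key: string, value: string): void {
  writeFileSync(join(REPO, key), value);
  execSync(`git add -- "${key}" && git commit -m "set ${key}" && git push`, { cwd: REPO });
}

// Consumer side: converge on the latest state, then re-map the local store.
function refresh(): void {
  execSync("git pull --ff-only", { cwd: REPO });
  // ...re-open the memory-mapped key-value store from the files in the working tree...
}
```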
Git, in common with every other version control system I have ever used, considers adding a column to a CSV file to mean "every line was altered", because lines are all-important.
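For example, appending one column to a tiny CSV makes a line-based diff mark every single line as changed, even though each row only gained a field:

```diff
-id,name
-1,alice
-2,bob
+id,name,email
+1,alice,alice@example.com
+2,bob,bob@example.com
```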
Not yet, but keep an eye on https://github.com/jbenet/transformer, which will provide the data format conversion API. It would be really cool to see some sheetjs stuff hooked up to dat/transformer!