I am surprised at how far along the PR effort is compared to the actual product development. The code seems to be still in its embryonic stage, and most of the features are simply desiderata at this point; yet there is already a catchy name, a snazzy logo, a nice website, and even stickers! And this Hacker News post, of course.
Now, it is possible that this happened by chance: for instance, perhaps the author has a friend who is a graphic designer, and this friend got really excited about the project and put together all of this as a favor. But perhaps it is a deliberate strategy, in which case I wonder: what are the benefits of front-loading the PR work like this? Did it help in securing grants, for instance?
Substack (yes, that substack) discussed this recently:
> a slick web page - this is very often (but not always) a sign of a library that put all of its time into slick marketing but has overly-broad scope and is very likely to become abandoned
Hi, dat maintainer here. I made the website in a couple of hours (https://github.com/maxogden/dat-data.com/commits/gh-pages) but have spent about 9 months on the code so far (read through e.g. closed issues for an example of the scope there). I also have tried to convey that dat is still pre-alpha and that people shouldn't use it unless they wanna get involved w/ early development, but someone (not me) decided to post it to hacker news nonetheless.
I am sorry, but are we looking at the same website? I see two static pages and a logo. Compared to the more than 2 months of work that has been put into the code base by what seems like a very talented team, the effort they have made on the "PR" is minimal (a few hours, probably even less). I think you are being extremely unfair.
Why spend two months building a product if you don't know it will be well received? To me, this is startup development done right: reducing the unknown, reducing the risk in the most efficient way. Even if it means showing a mockup of your service.
iRODS is a data storage system that provides replication (e.g. onto local filesystems, S3, and HPSS). Deployments at different organizations can be linked, so that (part of) the namespace of one company can be made visible to another. It has a permission system on namespaces.
But one of the nicest features is that it has a built-in C-like trigger language. You can specify triggers for certain namespaces. For research, for example, it is often necessary to have pointers to data (URLs) that are permanent and can be published in, e.g., papers. For this reason, EUDat[1] has triggers on namespaces that automatically create PIDs[2].
But since the trigger language is very complete, you can do nearly anything with it.
Note: I am not involved in iRODS in any way; I just know a couple of colleagues who use it for safe replication, sharing, etc.
Hi, maintainer of dat here. We're working on the first release now. The first use case is tabular data (a large volume of small rows), but we'll be adding a content-addressable blob store (for big binary attachments) on top for the beta, as well as some fun p2p (webrtc data channel) powered sync stuff (will be somewhat similar to BitTorrent sync) to maximize download speeds and let universities seed big scientific datasets.
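To make the content-addressable part concrete, here is a minimal sketch (illustrative only, not dat's actual code, and the class and field names are made up): each blob is stored under the hash of its own contents, so a small tabular row only needs to carry the hash of its big binary attachment.

```typescript
// Minimal sketch of a content-addressable blob store (illustrative, not dat's code):
// each blob is stored under the SHA-256 of its contents, so identical attachments
// deduplicate automatically and rows can reference blobs by hash alone.
import { createHash } from "crypto";

class BlobStore {
  private blobs = new Map<string, Buffer>();

  // Store a blob and return its content hash, which acts as its permanent address.
  put(data: Buffer): string {
    const hash = createHash("sha256").update(data).digest("hex");
    this.blobs.set(hash, data);
    return hash;
  }

  // Fetch a blob by hash; undefined means it still has to be synced from a peer.
  get(hash: string): Buffer | undefined {
    return this.blobs.get(hash);
  }
}

// A tabular row then stays small: it carries only the hash of its attachment.
const store = new BlobStore();
const blobHash = store.put(Buffer.from("big binary attachment goes here"));
const row = { id: 42, sample: "A-113", attachment: blobHash };
```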
Great work so far in an area I'm really interested in. I'm especially interested in the synchronisation and transformation modules; how are they coming along? Sync is an issue that distinguishes you from the work of the libre api project, but did you look at that project at all for the sharing aspect? I think they do that really well (though they still do sync with kill and load). Also, did you look at all at git assistant for sync?
I don't think this is the right approach. I'm fairly certain, after reading their website and git repo, that dat really does only deal with textual data. As someone that deals with huge datasets on a daily basis, text files are really out of the question. I have to rely on netCDF or HDF5 for my data stores.
I could be missing something here so anybody more knowledgeable should chime in.
Nope. Any really large government data is not text -- it's mostly in older open binary formats (e.g. weather data in GRIB from NOAA) or expensive proprietary binary formats (e.g. healthcare data in SAS from CMS).
> I wonder if most of the primary sources this is intended to target (i.e. govt) ...
The second sentence on their website appears to state otherwise: "As a team we have a bias towards supporting scientific + research data use cases."
Edit: Well, on further reading they do go on to say, "What we're building isn't quite ready for prime time yet, but if you want to play around...", so perhaps I should not jump the gun :)
I've been pondering hacking on git to give it a pluggable object-storage API. That way, you could have one repo dump its objects into a Fossil file-system server, or an S3 bucket, or a DHT, and then, when another repo fetched from it, it'd just be pulling metadata (branch-heads and tags), and getting the actual data from the object-storage instead. I'm pretty sure this'd fix the "git can't store large files" problem, especially if you also allowed composite file objects, and perhaps per-object concurrent BitTorrent retrieval.
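To sketch what that pluggable layer could look like (a hypothetical interface, not anything git or libgit2 exposes today): git would only ever say "store these bytes" and "give me the bytes for this hash", and the host would decide whether that means local disk, an S3 bucket, or a DHT.

```typescript
// Hypothetical pluggable object-storage API -- not an existing git or libgit2
// interface. git never knows which backend it is talking to; the host does.
interface ObjectStore {
  put(object: Buffer): Promise<string>;   // returns the content hash of the stored object
  get(hash: string): Promise<Buffer>;     // resolves the hash from wherever the bytes live
  has(hash: string): Promise<boolean>;    // may be answered locally or over the network
}

// A fetch then only transfers metadata (branch heads and tags); object contents
// are pulled lazily out of whichever ObjectStore backend the host configured.
async function materialize(store: ObjectStore, hash: string): Promise<Buffer> {
  return store.get(hash);
}
```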
See also: https://camlistore.org/ (unlike Dat, doesn't target datasets, but more generic user data, like photos and files; more Dropbox than npm). It has a pretty cool blob model, exposes a FUSE interface, has a slick photo-browsing UI, and integrates with S3 and (I think) Glacier.
git-annex is an ugly hack to, basically, reimplement git-with-this-feature on top of git. What I'm suggesting would be to reimplement git itself on top of distributed object storage -- effectively, to split git into two tools:
1. one that commits treeish diffs into objects, and then stuffs them into arbitrary content-addressable storage;
2. and one that manipulates (local or remote) mutable tables of refs to content-hash-URNs -- or "git repositories" for short.
The key component would be a facility (either a library or an OS service) for resolving content-hash-URNs into file-descriptors in some content-addressable storage system. The critical point is that this would not be a git component -- the committing tool wouldn't be in charge of where its objects go, and the repository tool wouldn't be in charge of resolving their URNs. Instead, both would just be relying on the host to decide how to get objects from/put objects into storage, and it's the host that would have a distribution strategy set up for all of its content-addressable objects, not just the ones that happen to be attached to git.
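Concretely, the split might look something like this (purely illustrative names; nothing here is an existing git interface). The point is only that neither tool knows where object bytes physically live; the host-provided resolver does.

```typescript
// Illustrative sketch of the two-tool split. The host resolver is the only
// component that knows how to turn a content-hash URN into actual bytes
// (local disk, S3, a DHT, ...).
interface HostResolver {
  store(object: Buffer): Promise<string>;   // returns a content-hash URN
  open(urn: string): Promise<Buffer>;       // resolves the URN to object bytes
}

// Tool 1: turn treeish diffs into objects and hand them to the host's storage.
async function commitObject(host: HostResolver, object: Buffer): Promise<string> {
  return host.store(object);                // tool 1 never decides where the bytes go
}

// Tool 2: a "git repository" is just a mutable table of refs -> content-hash URNs.
class RefTable {
  private refs = new Map<string, string>();
  update(ref: string, urn: string): void { this.refs.set(ref, urn); }
  resolve(ref: string): string | undefined { return this.refs.get(ref); }
}
```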
Libgit2 does this internally, although it doesn't expose the underlying store (so you wouldn't know to use S3 even if you could get objects from there).
That depends on how replicated/distributed the object storage is. In a DHT scenario, you might already be storing some of the objects yourself before you've even requested them.
The remote server can obviously be running an object-storage node itself, and so can your own local git repo -- the point is that you can have objects in more places than just your machine and the repo it's pulling from.
Dat is really exciting. A common format for sharing and hosting datasets, to say nothing of versioning and transforming data, is really necessary right now. I've been working on an open-source site that will act kind of like a GitHub for Dat[0][1]. I started four months ago, but decided that dat was still too unstable at the time to really build an API client against, so I'm waiting for at least an alpha release before going further. Super excited about the project though.
Currently, the site is just a bare-bones version that lets you submit the URL of a publicly available dataset, rather than including any actual dat integration. I'm planning on changing that though once dat becomes just a bit more stable.
Hey kcorbitt, cool stuff. I'm working with Max on dat. Before I knew about dat, I built http://datadex.io (which is like an npm for datasets -- example: http://datadex.io/jbenet/cifar-100 ). datadex will support dat data first class. Would be good to get your help when all this happens. :)
I like the Datadex idea and the promise to make datasets as easily accessible as source code repositories. I can see a number of use cases already for a standardized and fast way to collaborate on structured data sets.
Yeah, they're hidden away in their git repo [1]. They have a good set of example usage commands [2], and a nice introductory document explaining what dat is [3]. Perhaps most important are the technical notes [4] that describe the supported formats.
Git is pretty good at handling large files right now.
We use git to push configuration data to (near) real-time systems that keep this data in local memory-mapped key-value stores. It's essentially a fully replicated, eventually consistent key-value store, and git plays a starring role.
We regularly push multi-gigabyte files through this system with ease, with only a few tweaks to git's configuration. Git has some major advantages: it's fast, it can handle large files, and you can piggyback on the versioning system to ensure that multiple writers aren't competing. Git is also highly configurable and has a lot of out-of-the-box options for easily setting up git servers.
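As a very rough illustration of the pattern (the function names and paths are made up, not the actual setup): writers commit key/value files and push; consumers pull and reload their local stores, which is where the eventual consistency comes from.

```typescript
// Rough sketch of git as a replicated key-value transport (illustrative only):
// writers commit key/value files into a git repo and push; each consumer node
// periodically pulls and reloads its local memory-mapped store from disk.
import { execSync } from "child_process";
import { writeFileSync } from "fs";
import { join } from "path";

const REPO = "/srv/config-repo";  // hypothetical local clone present on each node

// Writer side: setting a key is just writing a file and committing it.
function setKey(key: string, value: string): void {
  writeFileSync(join(REPO, key), value);
  execSync(`git add -- "${key}" && git commit -m "set ${key}" && git push`, { cwd: REPO });
}

// Consumer side: converge on the latest state, then re-map the local store.
function refresh(): void {
  execSync("git pull --ff-only", { cwd: REPO });
  // ...re-open the memory-mapped key-value store from the files in the working tree...
}
```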
Git, in common with every other version control system I have ever used, considers adding a column to a CSV file to mean "every line was altered", because lines are all-important.
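For example, appending one column to a tiny CSV makes a line-based diff mark every single line as changed, even though each row only gained a field:

```diff
-id,name
-1,alice
-2,bob
+id,name,email
+1,alice,alice@example.com
+2,bob,bob@example.com
```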
Not yet, but keep an eye on https://github.com/jbenet/transformer, which will provide the data format conversion API. It would be really cool to see some sheetjs stuff hooked up to dat/transformer!