I have been thinking about what this solves with respect to other datasets. Nearly all shape recognition datasets have a restriction that you can't use them unless you are an academic. I feel that open-sourcing datasets will let us be more democratic with the data and with the things generated from it. Creative Commons seems like a good license for this, though. Getting the data is only half the battle. The rest is to build open models (Google is good at this); then you could take pretrained models and never have your data leave your house. I hope and dream we can do this.
Also, many open data projects choose to put their work into the public domain (or license it as "CC0", which amounts to the same thing). This is, for example, what Wikidata does.
I agree that CC licenses aren't a good fit for data collections. I have been meaning to create a new license but I don't have the expertise or credibility to reasonably market it. ODbL seems really close to what I was envisioning. Thanks!
"Nearly all shape recognition datasets have a restriction that you can't use them unless you are an academic. I feel that open-sourcing datasets will let us be more democratic with the data"
Just because something is open source doesn't mean it can't have an academia-only restriction. Data sets should cost money when used for profit, open or not.
“Open data and content can be freely used, modified, and shared by anyone for any purpose” http://opendefinition.org/
"The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research." https://opensource.org/osd
It looks like RStudio is AGPL v3, which is a free and open source license. Enterprise users only need a different license if they don't want to abide by the AGPL's strong copyleft requirements.
I agree that people who use a dataset for profit should be required to pay for that use. I didn't mean to imply that academics should give their data away for free. Maybe there needs to be a way for academics to sell their data or receive some other tangible benefit.
Open data is a concept separate from free and open source software. However, they have similar concerns such as copyleft: GPL for source code and ODbL for data both require changes to be published.
Most of the consumers so far have been neuroscience researchers and statisticians, but we do hope (and think) that there's value for a wide variety of interests.
There's a bunch of different data, but the highlights are fMRI scans of people watching and/or listening to the movie Forrest Gump, eye tracking, and detailed annotations of the movie. We are also about to begin acquiring simultaneous EEG and fMRI.
Forgive my ignorance, but I'm not sure what a dataset like "Collectible Card Game to Code"[0] might be used for. Can anyone explain how it might be used?
One question that is not clear to me is what, in a perfect world, the dataset license should allow and restrict. For me (just a personal opinion), it would allow free (as in liberty) use, but somehow encourage those who use it to share the benefits (data, software, or algorithms) under the same license.
Unfortunately, Open Source does not help here -- I do not see how OS licensing can be applied to data sets. The main OS leverage in software development is that if you use software X to build software Y, X is usually present in some shape or form in your deliverable Y. Not so with training data -- once algorithm development is done, you can (and usually do) strip the training data out and ship a finished product that does not require X to run.
Even if one were to require open sourcing derived datasets, it is usually easy to keep a dataset with a tainted (open source) license segregated as you build up your own data, so the new datasets are not formally "derived" and thus would not need to be open sourced.
I would love a better way forward on this, or at least a cleaner explanation of options.
OS helps tremendously with reproducibility. Without the underlying data, there is no way to audit an analysis. Moreover, an algorithm is only ever "done" in the same way that software is ever "done": new techniques might come along that could enhance the model, or the business requirements might change and necessitate re-tuning the algorithm.
The benefits of OS data are the same as the benefits of OS software. The distinction between "Free" and "Open" is the same as well.
Edit 1: OS data sets are nothing new. The UCI Machine Learning Repository[1] has been around for years. There is also an entire Open Data Stack Exchange site [2], and an Open Data Subreddit [3].
Edit 2: OS data sets are essential for developing new algorithms because they can be used as benchmarks. Nobody should trust a model that's been developed on a proprietary data set for use on anything other than that one data set.
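To make that concrete, here is a minimal sketch of what evaluating against an open benchmark looks like. I'm assuming Python with scikit-learn, which ships a copy of the UCI digits data; the model choice is arbitrary and only for illustration.

    # Sketch: score a model on an openly available benchmark so anyone
    # can reproduce the numbers. Uses scikit-learn's bundled UCI digits data.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # 5-fold cross-validated accuracy on the open data set; with a
    # proprietary data set, nobody else could verify these numbers.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean(), scores.std())

Anyone can rerun that and compare their own model against the same baseline, which is exactly what a proprietary data set makes impossible.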
Maybe I was not clear -- I am not arguing for proprietary datasets. Validation on publicly available data is a key component in comparisons, assessments, etc.
However, there is a whole bestiary of open source licenses that span the spectrum from "use it any way you want" to much more restrictive. But they were mostly designed for software, and data is different; what prevents proprietary abuse of software may not have any teeth for data.
This brings up a huge point about how important data sets are to analysis and machine learning. There are so many libraries out there that make learning algorithms quick to run, and the absolute most important part of a project of that type is correct, well-formatted data.
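As a rough illustration (the file name and column names below are made up, not from any real project), the model fit is a single library call; nearly all of the code, and the real work, is getting the data correct and consistently formatted.

    # Sketch: most of the effort goes into cleaning and shaping the data.
    # "measurements.csv" and its columns are hypothetical placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    raw = pd.read_csv("measurements.csv")

    # The unglamorous part: fixing types, dropping bad rows, encoding labels.
    raw["weight"] = pd.to_numeric(raw["weight"], errors="coerce")
    clean = raw.dropna(subset=["weight", "height", "label"])
    X = clean[["weight", "height"]]
    y = clean["label"].astype("category").cat.codes

    # The part the library makes trivial: one line to fit a model.
    model = RandomForestClassifier().fit(X, y)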