I have been thinking about what this solves with respect to other datasets. Nearly all shape recognition datasets have a restriction that you can't use them unless you are an academic. I feel that open-sourcing datasets will let us be more democratic with the data and with the things generated from it. Creative Commons seems like a good license for this, though. Getting the data is only half the battle. The rest is to build open models (Google is good at this); then you could take pretrained models and never have your data leave your house. I hope and dream we can do this.
Also, many open data projects choose to put their work into the public domain (or license it as "CC0", which amounts to the same thing). This is, for example, what Wikidata does.
I agree that CC licenses aren't a good fit for data collections. I have been meaning to create a new license but I don't have the expertise or credibility to reasonably market it. ODbL seems really close to what I was envisioning. Thanks!
"Nearly all shape recognition datasets have a restriction that you can't use them unless you are an academic. I feel that open-sourcing datasets will let us be more democratic with the data"
Just because something is open source doesn't mean it can't have an academia-only restriction. Data sets should cost money when used for profit, open or not.
“Open data and content can be freely used, modified, and shared by anyone for any purpose” http://opendefinition.org/
"The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research." https://opensource.org/osd
It looks like RStudio is AGPL v3, which is a free and open source license. Enterprise users only need a different license if they don't want to abide by the AGPL's strong copyleft requirements.
I agree that people who use a dataset for profit should be required to pay for that use. I didn't mean to imply that academics should give their data away for free. Maybe there needs to be a way for academics to sell their data or receive some other tangible benefit.
Open data is a concept separate from free and open source software. However, they have similar concerns such as copyleft: GPL for source code and ODbL for data both require changes to be published.
Most of the consumers so far have been neuroscience researchers and statisticians, but we do hope (and think) that there's value for a wide variety of interests.
There's a bunch of different data, but the highlights are fMRI scans of people watching and/or listening to the movie Forrest Gump, eye tracking, and detailed annotations of the movie. We are also about to begin acquiring simultaneous EEG and fMRI.
Forgive my ignorance, but I'm not sure what a dataset like "Collectible Card Game to Code"[0] might be used for. Can anyone explain how it might be used?
One question that is not clear to me is what, in a perfect world, the dataset license should allow and restrict. For me (just a personal opinion), it would allow free (as in liberty) use, but somehow encourage those who use it to share the benefits (data, software, or algorithms) under the same license.
Unfortunately, Open Source does not help here -- I do not see how OS licensing can be applied to data sets. The main OS leverage in software development is that if you use software X to build software Y, X is usually present in some shape or form in your deliverable Y. Not so with training data -- once algorithm development is done, you can (and usually do) strip the training data out and ship a finished product that does not require X to run.
Even if one were to require open sourcing derived datasets, it is usually easy to keep a dataset with a tainted (open source) license segregated as you build up your own data, so the new datasets are not formally "derived" and thus would not need to be open sourced.
I would love a better way forward on this, or at least a cleaner explanation of options.
OS helps tremendously with reproducibility. Without the underlying data, there is no way to audit an analysis. Moreover, an algorithm is only ever "done" in the same way that software is ever "done": new techniques might come along that could enhance the model, or the business requirements might change and necessitate re-tuning the algorithm.
The benefits of OS data are the same as the benefits of OS software. The distinction between "Free" and "Open" is the same as well.
Edit 1: OS data sets are nothing new. The UCI Machine Learning Repository[1] has been around for years. There is also an entire Open Data Stack Exchange site [2], and an Open Data Subreddit [3].
Edit 2: OS data sets are essential for developing new algorithms because they can be used as benchmarks. Nobody should trust a model that's been developed on a proprietary data set for use on anything other than that one data set.
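To make that concrete, here is a minimal sketch of what evaluating against an open benchmark looks like. I'm assuming Python with scikit-learn, which ships a copy of the UCI digits data; the model choice is arbitrary and only for illustration.

    # Sketch: score a model on an openly available benchmark so anyone
    # can reproduce the numbers. Uses scikit-learn's bundled UCI digits data.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # 5-fold cross-validated accuracy on the open data set; with a
    # proprietary data set, nobody else could verify these numbers.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean(), scores.std())

Anyone can rerun that and compare their own model against the same baseline, which is exactly what a proprietary data set makes impossible.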
Maybe I was not clear -- I am not arguing for proprietary datasets. Validation on publicly available data is a key component in comparisons, assessments, etc.
However, there is a whole bestiary of open source licenses that span the spectrum from "use it any way you want" to much more restrictive. But they were mostly designed for software, and data is different; what prevents proprietary abuse of software may not have any teeth for data.
This brings up a huge point about how important data sets are to analysis and machine learning. There are so many libraries out there that make learning algorithms quick to run, and the absolute most important part of a project of that type is correct, well-formatted data.
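As a rough illustration (the file name and column names below are made up, not from any real project), the model fit is a single library call; nearly all of the code, and the real work, is getting the data correct and consistently formatted.

    # Sketch: most of the effort goes into cleaning and shaping the data.
    # "measurements.csv" and its columns are hypothetical placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    raw = pd.read_csv("measurements.csv")

    # The unglamorous part: fixing types, dropping bad rows, encoding labels.
    raw["weight"] = pd.to_numeric(raw["weight"], errors="coerce")
    clean = raw.dropna(subset=["weight", "height", "label"])
    X = clean[["weight", "height"]]
    y = clean["label"].astype("category").cat.codes

    # The part the library makes trivial: one line to fit a model.
    model = RandomForestClassifier().fit(X, y)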