I don't understand how this can be considered a technical report. No information on model architecture, distributed training methodology, or optimizations. The "Training dataset" section is a pathetic 0.5 pages long.
In that sense, it's very similar to the GPT-4 Technical Report.
The era of being "open" about LLMs or other "secret sauce" models in published papers may be over, since these things have become existential threats to companies.
Btw I've arrived at a different interpretation of the "Open" in OpenAI. It's open in the sense that the generic LLM is exposed via an API, allowing companies to build anything they want on top.
Companies like Google have been working on language models (and AI more broadly) for years but have hidden the generic intelligence of their models, exposing it only via improvements to their products. OpenAI bucked this trend and exposed an API to generic LLMs.
> Btw I've arrived at a different interpretation of the "Open" in OpenAI.
I don't understand why people keep trying to wrap their heads around the word 'Open' in OpenAI. If you saw a commercial claiming a product has a 'great new taste' but then tried it and it tasted bad, would you twist yourself into knots trying to understand where you went wrong in your interpretation of 'great'? No, that's ridiculous. Same with the 'Open' in 'OpenAI'. It's just some letters that form part of the name they chose for themselves when they filled out the form to incorporate their company.
You mean when they filled out a form to incorporate their non-profit. Which they later turned into a for-profit company after reaping all the goodwill. The “Open” used to mean something.
That is a bit reductionist. They turned it into a for-profit company controlled by a non-profit entity, with profits / returns being capped for employees / investors.
When they were founded? Yes. The issue was that the big AI players (Google, Facebook, etc.) were keeping their models and training data secret. People (rightly, IMHO) saw this opaque development style as a risk. The OpenAI founders made a big splash by registering as a non-profit and declaring that they were going to do all their model training in public and share the weights for everyone to use. In other words, they were claiming to do something more like what Stability AI is today, except with a stronger legal non-profit structure.
Because of that framing, they poached a lot of very good talent and built one of the best AI teams that has ever been assembled. Then they perverted their corporate structure into an effective for-profit and reneged on open access to their trained models, turning into a bog-standard service-oriented company.
Nonprofit status makes it much harder to extract large profits. A charity founder can pay himself a million-dollar salary, but he can't sell his shares in the nonprofit and become a billionaire.
> Nonprofit status makes it much harder to extract large profits. A charity founder can pay himself a million-dollar salary, but he can't sell his shares in the nonprofit and become a billionaire.
What difference does it make for a non-public company? They can pay themselves more salary either way. The shares aren't really worth anything until the company goes public anyway.
As for charities - if you really believe that. It doesn't even enter the books. Have you never seen an in-person donation site? Someone gives $100; the staff keeps the $100, takes out $50, records $50 and puts that in the donation box. After a few more layers the actual donation could be just $1. I've seen this at your regular big-name charities - all the time.
And let's not get started on the sponsor a child that doesn't exist options...
They are not confused. "OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. Since our research is free from financial obligations, we can better focus on a positive human impact."[1]
I think the strong connotation of the word "open" in the software community comes from "open source". If OSS was called "great new source" and a new closed source company called itself GreatNewAI you'd have a similar phenomenon of people taking apart the name.
That's true, but also not relevant to the current widespread use of the term. Concepts and common understanding evolve with language, and I'm not even sure what point you're trying to make by pointing this out. Your first link even includes the language:
> "open source" marketed as trumping "open system".
Common use and understanding of the word "open" evolved decades ago.
Your comment also tries to sidestep the issue at the heart of what people are annoyed and frustrated by. The founding principles of the OpenAI foundation laid out exactly what that usage of "Open" meant for their organization, and they have since backtracked on their own principles.
We're discussing the use of the word 'Open'. Which was first applied to systems. Then to "source", which actually did argue that openness of source was more important than openness of systems. As to system openness, that is well understood as open access to a black box via open (non-proprietary) APIs. Which is precisely what "OpenAI" is providing.
> Your comment also tries to sidestep the issue ..
We disagree. Narrowly directed and addressing the "issue", in fact.
I don't agree it's scummy. Scummy is getting someone to build a business on a 1 Billion dollar donation, going for a hostile takeover 10% of the way there, then reneging when that doesn't work.
Salvaging your business from that sort of tantrum by working with MS is called surviving.
> Companies like Google have been working on language models (and AI more broadly) for years but have hidden the generic intelligence of their models, exposing it only via improvements to their products. OpenAI bucked this trend and exposed an API to generic LLMs.
That's true. My thought was they're still 'open', in an important way, even though it's not the open source way. If they were smart they'd adopt my interpretation in their PR materials.
I wonder how special these architectures are compared to what's published.
The "secret sauce" may just be getting 2 pages (~200) worth of engineers collaborating and either rolling out your own cloud service or spending $$$ at someone else's.
Also, not sure how much it matters beyond academic interest, of course. Realistically, there are only 4-5 (US) companies with the human resources and capital to roll out something similar to these models, for what is most likely a complete write-off.
They could claim whatever they wanted and it would be near impossible to validate.
I think the secret sauce is just bucket loads of cash to spend on compute.
And because of this I don’t buy that AI is an existential threat to Google at this point. If they were really worried they could spend a tiny portion of their ~280 billion dollars in revenue to train a bigger model.
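To put "tiny portion" in perspective (the training-run cost below is a made-up round number for illustration, not a reported figure):

    # Rough sketch: what fraction of Google's stated ~$280B revenue would one
    # very large training run be? The cost here is a hypothetical assumption.
    revenue = 280e9                  # ~$280B annual revenue, as stated above
    assumed_training_cost = 500e6    # hypothetical ~$500M for a frontier-scale run
    print(f"{assumed_training_cost / revenue:.2%} of revenue")  # -> 0.18%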
I assume this is just a PR/IR-driven project to stay the "Google is Dead" headlines, hence the budget. Especially considering that an oversized chunk of it was spent on the scaling-law work, it doesn't seem they were serious about building a GPT-4 killer.
I wasn't aware autoregressive LLMs were still considered an existential threat to Google. What's the threat supposed to be: that ChatGPT just keeps eating Google's search market share while burning Microsoft's capital on infra, a la the Uber model, or do they actually make money off that at some point?
Seems far-fetched that OpenAI can compete with Google's resources, vertical integration down to the TPU, and access to significantly more training data.
I agree that if training data is what matters, it is likely that no one can compete with Google, given Google Books, which scanned 25 million volumes (source: http://www.nytimes.com/2015/10/29/arts/international/google-...), which is approximately all the books.
DeepMind's RETRO paper https://arxiv.org/abs/2112.04426 mentions a dataset called MassiveText, which includes 20 million books totalling 3T tokens. So we know Google is using Google Books, since there is simply no other source of 20 million books. Also, as far as I know, 3T tokens is more than anyone else is publicly known to have used so far: Google could train on more data than anyone else, solely from Google Books, even without using its web crawl.
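As a rough sanity check on those figures (a back-of-envelope sketch using only the numbers quoted above, plus an assumed tokens-per-word ratio that is not from the paper):

    # 20 million books totalling ~3T tokens works out to ~150k tokens per book,
    # which is plausible for full-length books.
    books = 20_000_000                    # books in MassiveText, per the comment above
    total_tokens = 3_000_000_000_000      # ~3T tokens, per the comment above
    tokens_per_book = total_tokens / books
    print(f"{tokens_per_book:,.0f} tokens per book")        # -> 150,000

    # Assuming ~1.4 tokens per English word (a rough rule of thumb, not a RETRO figure):
    print(f"~{tokens_per_book / 1.4:,.0f} words per book")  # -> ~107,143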
Edit: it was 2005(!), so it is possible that many of you haven't heard of this. George Dyson, in "Turing's Cathedral", written in 2005, says:
> My visit to Google? Despite the whimsical furniture and other toys, I felt I was entering a 14th-century cathedral: not in the 14th century but in the 12th century, while it was being built. Everyone was busy carving one stone here and another stone there, with some invisible architect getting everything to fit. The mood was playful, yet there was a palpable reverence in the air. "We are not scanning all those books to be read by people," explained one of my hosts after my talk. "We are scanning them to be read by an AI."
> The era of being "open" about LLMs or other "secret sauce" models in published papers may be over, since these things have become existential threats
Yeah, this is a holdover from where LLMs grew out of: academia. "Technical report" is what you reach for when you don't want to compare to actual competitive baselines.
I'm sorry, this is nonsense. Technical reports exist to fill in information that is useful for readers but not necessary to understand the key contributions of the work, and/or that don't fit within the journal or conference's page limit. I'm not sure where you got the idea that it is something people do to avoid competitive baselines; IME, the peer-reviewed portion of the publication is far more likely to contain misleading benchmarks than the technical report, since the paper is trying to "sell" the work in a way the technical report is not.
What this is an instance of is Google's approach to academic publishing of releasing a paper that contains almost no actionable information, but which is considered important and publishable solely because it came from Google and therefore is used in industry. This has been exhibited many times before--e.g. see the original Spanner paper, which was so light on details and confusing that they needed to release a followup paper several years later to explain what the system was even using the atomic clocks for!
I agree that's what TRs are for. However, my point is, if you want to publish academic writing without peer review, a TR is a way to go about it. You can also just publish a preprint somewhere, which - surprise surprise - is also common for these same actors.
I get what you're saying, I just think this is more of a Google thing than a TR thing. Their peer reviewed papers have the same issue as their preprints, TRs, and whitepapers, generally speaking--Google researchers feel no incentive to actually share how they did things, perform accurate or up-to-date comparisons to comparable frameworks, or even bother outlining their key contributions, because they know the paper will be published, widely read, widely cited, and influential even if they don't do any of those things. It's to the point that I think it might actually be house policy to neuter their papers of specific details as much as possible, presumably to retain what they perceive as Google's competitive advantage, because it makes no sense otherwise that wildly different papers with different authorship groups coming from so many different areas of CS could all have these same problems.
This is (IMO) quite different from, e.g., the cases of academics publishing misleading benchmarks, which is more often just being wedded to a bad idea because you spent years of work on it and your position is at risk if you didn't end up outperforming existing approaches. Often I can still get a lot out of papers with misleading benchmarks, even if what I get is "don't try this technique, it doesn't work." Whereas I frequently get nothing at all out of Google publications. If I had to describe the way Google seems to view academic publishing in one word, it would be "marketing"--it's advertising for people to either come work at Google or use their products, not something written with the intent of advancing the wider state of the art, or even the less noble goal of justifying the time and money they put into whatever they're writing about.
Come on, Google.