Yeah, I know that current automatic MT benchmarks don't reflect "user satisfaction" very accurately, and that getting a serviceable one is an open problem. However, you make it sound as though every deep learning solution performs well on the task benchmark but poorly on the real-world task the benchmark tries to approximate, whereas that's not true for MT: they are bad at the benchmarks, but they outperform non-deep-learning-based translation approaches at the real-world task.
On the subject of benchmarks, how about speech transcription? I was under the impression that those benchmarks are pretty reliable indicators of "real-world accuracy" (or about as reliable as benchmarks are in general)?
> How do you mean that machine translation outperforms non deep learning based approaches at the real world task? How is this evaluated?
There are two ways of evaluating performance that I had in mind:
1. Commercial success - what approach do popular sites like Google Translate use?
2. Human evaluation - in a research setting, ask humans to score translations - this is mentioned in your link
OK, thanks for clarifying. The problem with such metrics is that they don't give objective results. It doesn't really help us learn much to say that a system outperforms another based on subjective evaluations like that. You might as well try to figure out which is the best team in football by asking the fans of Arsenal and Manchester United.
The subjective human evaluations used in research are blinded - the humans rate the accuracy of translation without knowing what produced the translation (whether a NMT system, a non-ML MT system, or a human translator), whereas the football fans in your scenario are most definitely not blinded. There are some criticisms you could make about human evaluation, but as far as how well they correspond to the real-world task, I think they're pretty much the best we can do. I'm very curious to know if you actually think they're a bad target to optimize for.
More to the point, you still have yet to show that NMT "serves no other purpose than to show how well modern techniques can model large datasets", given that they do well on human evaluations and they're actually producing value by serving actual production traffic (you know, things humans actually want to translate) in Google Translate. If serving production traffic like this is not "serving a purpose", what is?
Regarding whether human evaluations are a good target to optimise for: no, I certainly don't think so. That's not very different from calculating BLEU scores, except that instead of comparing a machine-generated translation with one reference text, it's compared with people's subjective criteria, which actually muddies the waters even more, because who knows why different people thought the same translation was good or bad? Are they all using the same criteria? Doubtful! But if they're not, then what have we learned? That a bunch of humans agreed that some translation was good, or bad, each for their own reasons. So what? It doesn't make any difference that the human evaluators are blinded: you could run the same experiment with human translations only and you still wouldn't have learned anything about the quality of the translation, just the subjective opinions of a particular group of humans about it.
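Just to be concrete about what a single-reference BLEU comparison boils down to, here's a minimal sketch using nltk (the sentences are made up for illustration; all BLEU measures is n-gram overlap with the reference):

```python
# Minimal sketch: sentence-level BLEU against a single reference translation.
# The sentences are invented; sentence_bleu only counts n-gram overlap.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the swallow is a small migratory bird".split()   # one human reference
candidate = "the swallow is a little migratory bird".split()  # machine output

# Smoothing avoids zero scores on short sentences missing higher-order n-grams.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # high overlap says nothing about *why* it's good or bad
```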
See, the problem is not just with machine translation. Evaluating human translation results is also very hard to do, because translation itself is a very poorly characterised task. The question "what is a good translation?" is very difficult to answer. We don't have, say, a science of translation to tell us how a text should be translated between two languages. So in machine translation people try to approximate not only the task of translation, but also its evaluation, without understanding either. That's a very bad spot to be in.
In fact, a "science of translation" would be a useful goal for AI research,
but it's the kind of thing that I complain is not done anymore, having been
replaced with beating meaningless benchmarks.
Regarding the fact that neural machine translation "generates value", you mean that it's useful because it's deployed in production and people use it? Well, "even a broken clock is right twice a day", so that's really not a good criterion of quality at all. In fact, as a criterion for an AI approach it's very disappointing. Look at the promise of AI: "machines that think like humans!". Look at the reality: "We're generating value!". OK. Or we could make an app that adds bunny ears and cat noses to people's selfies (another application of AI). People buy the app, so it's "generating value". Or we can generate value by selling canned sardines. Or selling footballs. Or selling foot massages. Or in a myriad other ways. So why do we need AI? It's just another trinket that is sold and bought while remaining completely useless. And that, for me, is a big shame.
> Look at the promise of AI: "machines that think like humans!". Look at the reality: "We're generating value!". OK.
OK, I hadn't realised we had such different implicit views on what the "goal" of AI / AI research was. Of course, I agree that "having machines think like humans" is a valid goal, and "generates value by serving production traffic" is not a good subgoal for that. However, this is not the only goal of AI research, nor is it clear to me that, e.g., public funding bodies see it as the only goal.
I use MT (at least) every week for my job and for my hobbies, mostly translating stuff I want to read in another language. I love learning languages but I could not learn all the languages I need to a high enough level to read the stuff I want to read. The old non-NMT approaches produced translations that were often useless, whereas the NMT-based translations I use now (mostly deepl.com), while not perfect, are often quite good, and definitely enough for my needs. Without NMT, realistically speaking, there is no alternative for me (i.e., I can't afford to pay a human translator, and I can't afford to wait until I'd learned the language well enough). So how can you say that AI "remains completely useless"?
Basically, you have implicitly assumed that "make machines that think like humans" is the only valid goal of AI research. And, from that point of view, it is understandable that using human evaluations to judge how well NMT systems approach that goal has many downsides. However, while some people working on NMT do have that goal, many of them also have the goal of "help people (like zodiac) translate stuff", and in the context of that goal, human evaluation is a much better benchmark target.
In general, yes, that's it. But to be honest I'm actually not that interested in making "machines that think like humans". I say that's the "promise of AI" because it was certainly the goal at the beginning of the field, specifically at the Dartmouth workshop where John McCarthy coined the term "Artificial Intelligence" [1]. Researchers in the field have varying degrees of interest in that lofty goal, but the public certainly has great expectations, as seen every time OpenAI releases a language model and people start writing or tweeting about how AGI is right around the corner, etc.
Personally, I came into AI (I'm a PhD research student) because I got (really, really) interested in logic programming languages and, well, to be frank, there's no place other than academia where I can work on them. On the other hand, my interest in logic programming is very much an interest in ways to make computers not be so infuriatingly dumb as they are right now.
This explains why I dislike neural machine translation and similar statistical NLP approaches: while they can model the structure of language well, they do nothing for the meaning carried by those structures, which they completely ignore by design. My favourite example is treating sentences as a "bag of words", as if order doesn't make a difference, and yet this is a popular technique... because it improves performance on benchmarks (by approximately 1.5 fired linguists).
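To make the bag-of-words point concrete, here's a toy sketch of what that representation throws away (plain Python, invented sentences):

```python
# Toy sketch: a bag-of-words representation discards word order entirely,
# so sentences with very different meanings can become indistinguishable.
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    return Counter(sentence.lower().split())

a = bag_of_words("man bites dog")
b = bag_of_words("dog bites man")

print(a == b)  # True: identical "bags", opposite meanings
```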
The same goes for Google Translate. While I'd have to be more stubborn than I am not to realise that people use it and like it, I find it depends on the use case, and on the willingness of users to accept its dumbness. For me, it's good where I don't need it and bad where I do. For example, it translates well enough between languages I know and can translate to some extent myself, say English and French. But if I want to translate between languages that are very far from the ones I know, say from Hungarian to my native Greek, that's just not going to work, not least because the translation goes through English (because of a dearth of parallel texts, and despite the fact that Google laughably claims its model actually has an automatically learned "interlingua"), so the text is mangled twice and I get gibberish on the other end.
I could talk at length about why and how this happens, but the gist of it is that Google Translate stubbornly refuses to use any information beyond the frequencies of token collocations to decide which translation to choose among the many possible translations of an expression. So, for example, if I ask it to translate a single word, "χελιδόνι", meaning the bird swallow, from Greek to French, I get back "avaler", which is the word for the verb to swallow, because translation goes through English, where "swallow" has two meanings and the verb happens to be more common than the bird. The information that "χελιδόνι" is a noun and "avaler" is a verb exists, but Google Translate will just not use it. Why? Well, because the current trend in AI is to learn everything end-to-end from raw data and without prior knowledge. And that's because prior knowledge doesn't help to beat benchmarks, which are not designed to test world knowledge in the first place. It's a vicious circle.
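Schematically, the failure mode I'm describing looks something like this (a toy sketch with an invented dictionary and invented frequencies, not how any real system is implemented):

```python
# Toy sketch of the pivot failure mode: Greek -> English -> French,
# choosing among candidates by frequency alone and ignoring part of speech.
# The dictionaries and counts below are made up for illustration.
el_to_en = {"χελιδόνι": "swallow"}  # the bird

en_to_fr = {
    "swallow": [
        {"fr": "avaler",     "pos": "verb", "freq": 900},  # to swallow
        {"fr": "hirondelle", "pos": "noun", "freq": 100},  # the bird
    ],
}

def pivot_translate(word_el: str) -> str:
    word_en = el_to_en[word_el]
    # Frequency-only choice: the verb wins, even though the source word is a noun.
    best = max(en_to_fr[word_en], key=lambda c: c["freq"])
    return best["fr"]

print(pivot_translate("χελιδόνι"))  # "avaler" -- the wrong sense
```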
So, yes, to some extent it's what you say: I don't quite expect "machines that think like humans", but I do want machines that can interact with a human user in a slightly more intelligent manner than they do now. I gave the example of SHRDLU above because it was such a system. I'm sad the effort to reproduce such results, even in a very limited domain, has been abandoned.
P.S. Sorry, this got a bit too long, especially for the late stages of an HN conversation :)
I hear you about those language pairs that have to be round-tripped through a third language. I completely agree, too, that the big open questions in NLP are all about understanding meaning, semantic content, pragmatics, etc., rather than just syntax.
I don't think that "NMT and similar techniques" ignore meaning by design, though. What they do do by design, compared to expert systems etc., is avoid having explicitly encoded knowledge (of the kind SHRDLU had). Take word2vec, for instance: it's not NMT, but it fits the "statistical NLP" description, and its purpose is to find encodings of words that carry some semantic content. Now, of course it's very little semantic content compared to what an expert could plausibly encode, but it is some semantic content, and it improves the (subjective human) evaluation of NMT systems that use word2vec or something similar.
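As a small, concrete sketch of the kind of semantic content I mean (this assumes gensim and one of its downloadable pretrained word2vec models; it's just an illustration, not anything NMT-specific):

```python
# Small sketch of analogy arithmetic over word2vec embeddings, using gensim's
# downloader to fetch pretrained vectors (a large one-time download; any
# similar pretrained embeddings would do for the illustration).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# vector("king") - vector("man") + vector("woman") lands near "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```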
Also, we should carefully distinguish "prior knowledge" as in "prior common-sense knowledge" from "prior linguistic knowledge". The end-to-end trend eschews "prior linguistic knowledge", while current NLP systems tend to lack "common-sense" knowledge, for rather different reasons.
End-to-end training tends to eschew prior linguistic knowledge because doing so improves (subjectively evaluated) performance on real-world tasks. I believe this is true for MT as well, but an easier example, if you want to look into it, is audio transcription. I don't think there's a consensus about why this happens, but I think it's something like this: the way people previously encoded linguistic knowledge was too fragile and simplified (think about how complicated traditional linguistic grammars are), and if that information can somehow be learned in the end-to-end process, the result performs better.
Lacking "common-sense" knowledge - that's more in the realm of AGI, so there's a valid debate about the extent to which neural networks can learn such knowledge. But the other side of that debate is that expressing common-sense knowledge in today's formal systems is really hard and expensive, and AIUI this is also something that attempts to generalize SHRDLU run into. Either way, it is definitely incorrect to say that it's ignored by anyone by design...
BTW, the biggest improvements (as subjectively evaluated by me) I've seen in MT on "dissimilar languages" have come from black-box neural nets and throwing massive amounts of (monolingual or bilingual) data at them, rather than from formal systems. I use deepl.com for Japanese-English translation of some technical CS material, and that language pair used to be really horrible in the pre-deep-learning days (and it's still not that good on Google Translate for some reason).
I agree about word2vec and embeddings in general: they're meant to represent meaning, or at least capture something of it. I'm just not convinced that they work that well in that respect. Maybe I can say how king and queen are analogous to man and woman, etc., but that doesn't help me if I don't know what king, queen, man or woman mean. I don't think it's possible to represent the meaning of words by looking at their collocation with other words, whose meaning is also supposedly represented by their collocation with other words, and so on.
I confess I haven't used any machine translation systems other than Google Translate. For instance, I've never used deepl.com. I'll give it a try since you recommend it, although my use case would be to translate technical terms that I only know in English to my native Greek, and I don't think anything can handle that use case very well at all. Not even humans!
Out of curiosity, you say neural machine translation is better than earlier techniques, which I think is not controversial. But have you tried such earlier systems? I've never had the chance.