
>> We knew hand-crafted programs in limited domains could work for NLP, computer vision and voice recognition a long time ago.

Yes, we did. So, where are the natural language interfaces by which we can communicate with artificial agents in such limited domains? Where are the applications, today, that exhibit behaviour as seemingly intelligent as SHRDLU's in the '60s? I mean, have you personally seen and interacted with one? Can you show me an example of such a modern system? Edit: Note again that SHRDLU was created by a single PhD student with all the resources of ... a single PhD student. It's no stretch to imagine that an entity of the size of Google or Facebook could achieve something considerably more useful, still in a limited domain. But this has never even been attempted.

Yes, it is faffing about. Basically NLP gave up on figuring out how language works and switched to a massive attempt to model large datasets evaluated by contrived benchmarks that serve no other purpose than to show how well modern techniques can model large datasets.



> It's no stretch to imagine that an entity of the size of Google or Facebook could achieve something considerably more useful, still in a limited domain. But this has never even been attempted.

How do you know if no one has attempted it, or if all attempts so far have failed to achieve their goals? One of the claimed downsides of "hand-crafted systems in limited domains" (i.e., something like the SHRDLU approach) is that they would take too much effort to create when the domain is expanded to something even slightly bigger than SHRDLU's domain, so a lack of successful systems could be evidence of no one trying, or it could be evidence that the claimed downside is indeed true.

The fact that a working system for the limited domains of e.g. customer support or medical diagnosis would be worth a lot of money suggests to me that they must have been tried, but that nothing useful could be built, and we didn't hear about the failed attempts, meaning that those domains (at least) are too big for hand-crafted systems to work.

> Yes, it is faffing about. Basically NLP gave up on figuring out how language works and switched to a massive attempt to model large datasets evaluated by contrived benchmarks that serve no other purpose than to show how well modern techniques can model large datasets.

It is inaccurate to say that all benchmarks are useless. For instance, language translation (as in Google Translate) is a benchmark NLP task, but it is also something I personally use at least every week, and deep learning based solutions beat handcrafted systems by a lot for this particular task (speaking as an end user who has used systems based on both approaches). The same comments apply to audio transcription (e.g. generating subtitles on YouTube) as well.


>> The fact that a working system for the limited domains of e.g. customer support or medical diagnosis would be worth a lot of money suggests to me that they must have been tried, but that nothing useful could be built, and we didn't hear about the failed attempts, meaning that those domains (at least) are too big for hand-crafted systems to work.

That's actually a good point and one I've made myself a few times in the past- negative results never see the light of day. So, yes, I agree we can't know for sure that such systems haven't been attempted, despite the certainty with which I assert this in my comment above.

On the other hand, there are some strong hints. First of all, there was an AI winter in the 1980s, after which most of the field turned very hard towards statistical techniques and away from the kind of work behind SHRDLU. This kind of work became radioactive for many years and it would have been very hard to justify putting a PhD student, or ten, to work trying to even reproduce it, let alone extend it. That's in academia. In industry, it's clear that nowadays at least companies like the FANGs strongly champion statistical machine learning, and anyone proposing spending actual money on such a research program ("Hey, let's go back to the 1960s and start all over again!") would be laughed out of the building. That is, I believe there are strong hints that the political climate in academia and the culture in large companies have suppressed any attempt to do work of this kind. But that's only my conjecture, so there you have it.

>> It is inaccurate to say that all benchmarks are useless.

Of course. My point is that current benchmarks are useless.

The fact that you find Google Translate useful is unrelated to how well Google Translate scores in benchmarks, which are not designed to measure user satisfaction but instead are supposed to tell us something about the formal properties of the system. In any case, for translation in particular, it's not controversial that there are no good benchmarks and metrics, and many people in NLP will tell you that is the case. In fact, I'm saying this because I was told so during my Master's by our NLP tutor, who is a researcher in the field. Also see the following article, which includes a discussion of commonly used metrics in machine translation and the difficulty of evaluating machine translation systems:

https://www.skynettoday.com/editorials/state_of_nmt


Yeah, I know that current automatic MT benchmarks don't reflect "user satisfaction" very accurately, and that it's an open problem to come up with one that is serviceable. However, you make it sound like all deep learning solutions perform well on the task benchmark but poorly on the real-world task the benchmark tries to approximate, whereas that's not true for MT: the benchmarks may be flawed, but NMT systems outperform non-deep-learning-based translation approaches at the real-world task.

On the subject of benchmarks, how about speech transcription? I was under the impression that those benchmarks are pretty reliable indicators of "real-world accuracy" (or about as reliable as benchmarks are in general).


I don't know much about speech transcription, sorry.

How do you mean that machine translation outperforms non-deep-learning-based approaches at the real-world task? How is this evaluated?


> How do you mean that machine translation outperforms non-deep-learning-based approaches at the real-world task? How is this evaluated?

There were two ways of evaluating performance that I had in mind:

1. Commercial success - what approach do popular sites like Google Translate use?

2. Human evaluation - in a research setting, ask humans to score translations; this is mentioned in your link


OK, thanks for clarifying. The problem with such metrics is that they don't give objective results. It doesn't really help us learn much to say that a system outperforms another based on subjective evaluations like that. You might as well try to figure out which is the best team in football by asking the fans of Arsenal and Manchester United.


The subjective human evaluations used in research are blinded - the humans rate the accuracy of a translation without knowing what produced it (whether an NMT system, a non-ML MT system, or a human translator), whereas the football fans in your scenario are most definitely not blinded. There are some criticisms you could make about human evaluation, but as far as how well they correspond to the real-world task, I think they're pretty much the best we can do. I'm very curious to know if you actually think they're a bad target to optimize for.

More to the point, you still have yet to show that NMT "serves no other purpose than to show how well modern techniques can model large datasets", given that they do well on human evaluations and they're actually producing value by serving actual production traffic (you know, things humans actually want to translate) in Google Translate. If serving production traffic like this is not "serving a purpose", what is?


Sorry for the late reply.

Regarding whether human evaluations are a good target to optimise for, no, I certainly don't think so. That's not very different from calculating BLEU scores, except that instead of comparing a machine-generated translation with one reference text, it's compared against people's subjective criteria, which actually serves to muddy the waters even more - because who knows why different people thought the same translation was good or bad? Are they all using the same criteria? Doubtful! But if they're not, then what have we learned? That a bunch of humans agreed that some translation was good, or bad, each for their own reasons. So what? It doesn't make any difference that the human evaluators are blinded; you could run the same experiment with human translations only and you'd still have learned nothing about the quality of the translation - just the subjective opinions of a particular group of humans about it.
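
(For anyone unfamiliar with BLEU: it's essentially n-gram overlap with a reference translation. Here's a rough sketch in Python of a simplified BLEU-like score - unigram and bigram precision against a single reference, no clipping over multiple references and no brevity penalty, so not the real metric - just to show how mechanical the comparison is:)

    # Rough sketch of a BLEU-like score (simplified: unigram + bigram
    # precision against a single reference; real BLEU goes up to 4-grams,
    # supports multiple references and adds a brevity penalty).
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def simple_bleu(candidate, reference, max_n=2):
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            precisions.append(overlap / max(sum(cand_counts.values()), 1))
        score = 1.0
        for p in precisions:          # geometric mean of the n-gram precisions
            score *= p
        return score ** (1.0 / max_n)

    print(simple_bleu("the cat sat on the mat", "the cat is on the mat"))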

See, the problem is not just with machine translation. Evaluating human translation results is also very hard to do, because translation itself is a very poorly characterised task. The question "what is a good translation?" is very difficult to answer. We don't have, say, a science of translation to tell us how a text should be translated between two languages. So in machine translation people try to approximate not only the task of translation, but also its evaluation - without understanding either. That's a very bad spot to be in.

In fact, a "science of translation" would be a useful goal for AI research, but it's the kind of thing that I complain is not done anymore, having been replaced with beating meaningless benchmarks.

Regarding the fact that neural machine translation "generates value", you mean that it's useful because it's deployed in production and people use it? Well, "even a broken clock is right twice a day", so that's really not a good criterion of quality at all. In fact, as a criterion for an AI approach it's very disappointing. Look at the promise of AI: "machines that think like humans!". Look at the reality: "We're generating value!". OK. Or we could make an app that adds bunny ears and cat noses to people's selfies (another application of AI). People buy the app - so it's "generating value". Or we can generate value by selling canned sardines. Or selling footballs. Or selling foot massages. Or in a myriad other ways. So why do we need AI? It's just another trinket that is sold and bought while remaining completely useless. And that for me is a big shame.


> Look at the promise of AI: "machines that think like humans!". Look at the reality: "We're generating value!". OK.

OK, I hadn't realised we had such different implicit views on what the "goal" of AI / AI research is. Of course, I agree that "having machines think like humans" is a valid goal, and "generates value by serving production traffic" is not a good subgoal for that. However, this is not the only goal of AI research, nor is it clear to me that, e.g., public funding bodies see it as the only goal.

I use MT (at least) every week for my job and for my hobbies, mostly translating stuff I want to read in another language. I love learning languages but I could not learn all the languages I need to a high enough level to read the stuff I want to read. The old non-NMT approaches produced translations that were often useless, whereas the NMT-based translations I use now (mostly deepl.com), while not perfect, are often quite good, and definitely enough for my needs. Without NMT, realistically speaking, there is no alternative for me (i.e., I can't afford to pay a human translator, and I can't afford to wait until I'd learned the language well enough). So how can you say that AI "remains completely useless"?

Basically, you have implicitly assumed that "make machines that think like humans" is the only valid goal of AI research. And, from that point of view, it is understandable that using human evaluations to judge how well NMT systems approach that goal has many downsides. However, while some people working on NMT do have that goal, many of them also have the goal of "help people (like zodiac) translate stuff", and in the context of that goal, human evaluation is a much better benchmark target.


In general, yes, that's it. But to be honest I'm actually not that interested in making "machines that think like humans". I say that's the "promise of AI" because it was certainly the goal at the beginning of the field, specifically at the Dartmouth workshop where John McCarthy coined the term "Artificial Intelligence" [1]. Researchers in the field have varying degrees of interest in that lofty goal, but the public certainly has great expectations, as seen every time OpenAI releases a language model and people start writing or tweeting about how AGI is right around the corner, etc.

Personally, I came into AI (I'm a PhD research student) because I got (really, really) interested in logic programming languages and well, to be frank, there's no other place than in academia that I can work on them. On the other hand, my interest in logic programming is very much an interest in ways to make computers not be so infuriatingly dumb as they are right now.

This explains why I dislike neural machine translation and similar statistical NLP approaches: while they can model the structure of language well, they do nothing for the meaning carried by those structures, which they completely ignore by design. My favourite example is treating sentences as a "bag of words", as if order made no difference - and yet this is a popular technique... because it improves performance on benchmarks (by approximately 1.5 fired linguists).
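
(To make the "bag of words" point concrete, here's a toy sketch - my own made-up example, not taken from any particular system:)

    # A bag-of-words representation discards word order by construction,
    # so these two sentences with opposite meanings get identical features.
    from collections import Counter

    def bag_of_words(sentence):
        return Counter(sentence.lower().split())

    s1 = "the dog bit the man"
    s2 = "the man bit the dog"
    print(bag_of_words(s1) == bag_of_words(s2))  # True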

The same goes for Google Translate. I'd have to be more stubborn than I am to deny that people use it and like it, but I find it depends on the use case - and on the willingness of users to accept its dumbness. For me, it's good where I don't need it and bad where I do. For example, it translates well enough between languages I know and can translate to some extent myself, say English and French. But if I want to translate between languages that are very far from the ones I know - say I want to translate from Hungarian to my native Greek - that's just not going to work, not least because the translation goes through English (because of a dearth of parallel texts, and despite the fact that Google laughably claims its model actually has an automatically learned "interlingua"), so the result is mangled twice and I get gibberish on the other end.

I could talk at length about why and how this happens, but the gist of it is that Google Translate decides which translation to choose, among the many possible translations of an expression, by looking at the frequencies of token collocations - and stubbornly refuses to use anything else. So for example, if I ask it to translate a single word, "χελιδόνι", meaning the bird swallow, from Greek to French, I get back "avaler", which is the word for the verb to swallow - because translation goes through English, where "swallow" has two meanings and the verb happens to be more common than the bird. The information that "χελιδόνι" is a noun and "avaler" is a verb exists, but Google Translate will just not use it. Why? Well, because the current trend in AI is to learn everything end-to-end from raw data and without prior knowledge. And that's because prior knowledge doesn't help to beat benchmarks, which are not designed to test world knowledge in the first place. It's a vicious circle.
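
(A toy illustration of the pivoting problem - the dictionaries, senses and frequencies below are entirely made up, and this is of course nothing like how Google Translate is actually implemented:)

    # Toy illustration of how pivoting through an ambiguous English word
    # can pick the wrong sense when the choice is driven by frequency
    # alone. All data here is invented for illustration.
    greek_to_english = {"χελιδόνι": "swallow"}   # the bird

    # Pretend the verb sense of "swallow" is more frequent in the data,
    # so a frequency-only chooser prefers it, ignoring part of speech.
    english_to_french = {
        "swallow": [("avaler", "verb", 0.8), ("hirondelle", "noun", 0.2)],
    }

    def pivot_translate(greek_word):
        english = greek_to_english[greek_word]
        candidates = english_to_french[english]
        best = max(candidates, key=lambda c: c[2])   # frequency only
        return best[0]

    print(pivot_translate("χελιδόνι"))  # "avaler" - the verb, not the bird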

So, yes, to some extent it's what you say- I don't quite expect "machines that think like humans", but I do want machines that can interact with a human user in a slightly more intelligent manner than now. I gave the example of SHRDLU above because it was such a system. I'm sad the effort to reproduce such results, even in a very limited domain, has been abandoned.

P.S. Sorry, this got a bit too long, especially for the late stages of an HN conversation :)

___________

[1] That was in 1956: https://en.wikipedia.org/wiki/Dartmouth_workshop


I hear you about those language pairs that have to be round-tripped through a third language. I completely agree, too, that the big open questions in NLP are all about understanding meaning, semantic content, pragmatics, etc rather than just syntax.

I don't think that "NMT and similar techniques" ignore meaning by design though. What they do do by design, compared to expert systems etc, is avoid having explicitly encoded knowledge (of the kind SHRDLU had). Take word2vec for instance, it's not NMT but fits into the "statistical NLP" description - its purpose is to find encodings of words that carry some semantic content. Now, of course it's very little semantic content compared to what an expert could plausibly encode, but it is some semantic content, and this content improves the (subjective human) evaluation of NMT systems that do use word2vec or something similar.

Also, we should carefully distinguish "prior knowledge" as in "prior common-sense knowledge" and "prior linguistic knowledge". The end-to-end trend eschews "prior linguistic knowledge", while current NLP systems tend to lack "common-sense" knowledge, for rather different reasons.

End-to-end training tends to eschew prior linguistic knowledge because doing so improves (subjectively evaluated) performance on real-world tasks - I believe this is true for MT as well, but an easier example, if you want to look into it, is audio transcription. I don't think there's a consensus about why this happens, but I think it is something like: the way people previously encoded linguistic knowledge was too fragile / simplified (think about how complicated traditional linguistic grammars are), and if that information can somehow be learned in the end-to-end process instead, that performs better.

Lacking "common-sense" knowledge - that's more in the realm of AGI, so there's a valid debate about to what extent neural networks can learn such knowledge, but the other side of that debate is that expressing common-sense knowledge in today's formal systems is really hard and expensive, and AIUI this is also something that attempts to generalize SHRDLU run into. But it is definitely incorrect to say that it's ignored by anyone by design...

BTW, the biggest improvements (as subjectively evaluated by me) I've seen in MT on "dissimilar languages" have come from black box neural nets and throwing massive amounts of (monolingual or bilingual) data at it, rather than anything from formal systems. I use deepl.com for Japanese-English translation of some technical CS material, and that language pair used to be really horrible in the pre-deep-learning days (and it's still not that good on google translate for some reason).


Sorry for the late reply again.

I agree about word2vec and embeddings in general: they're meant to represent meaning, or to capture something of it anyway. I'm just not convinced that they work that well in that respect. Maybe I can say how king and queen are analogous to man and woman etc., but that doesn't help me if I don't know what king, queen, man or woman mean. I don't think it's possible to represent the meaning of words by looking at their collocation with other words - whose meaning is also supposedly represented by their collocation with other words, and so on.

I confess I haven't used any machine translation system other than Google Translate. For instance, I've never used deepl.com. I'll give it a try since you recommend it, although my use case would be to translate technical terms that I only know in English into my native Greek, and I don't think anything can handle that use case very well at all. Not even humans!

Out of curiosity: you say neural machine translation is better than earlier techniques, which I think is not controversial. But have you tried such earlier systems? I've never had the chance.


Good question!

As you say, SHRDLU was the best program of that time.

I've read Winograd's book and looked at the SHRDLU source code (not that I'm much of a Lisp hacker, but I had some time). It's built on a parser and a planner (a logic program, pre-Prolog). And it's built the old-fashioned way: rewriting the source code, with the parser rewriting input and then re-running, and other hairy things. I think this is how it achieves the parsing of idiomatic constructions at a high level. I believe the "raw Lisp" of the day was both incredibly powerful, since you could do anything, and incredibly hard to scale ... because you could do anything.

Winograd wrote it himself, but I think that's because he had to write it himself. In a sense, a programmer is always most productive when they are writing something by themselves, because they don't have to explain anything they are doing (until the fix-ups and complexity overwhelm them, but the base fact remains). And in the case of SHRDLU, Winograd would have had an especially hard time explaining what he was doing. I mean, there was a theory behind it - I've read Winograd's book. But there was lots and lots of brilliant spaghetti/glue code to actually make it work, code that jumped module and function boundaries. And the final program had a reputation for being very buggy: sometimes it worked and sometimes it didn't. And Winograd was a brilliant programmer and widely read in linguistics and other fields.

The software industry is an industry. No company wants to depend on the brilliance of its workers. A company needs to produce based on some sort of average, and a person with just average skills isn't going to do SHRDLU.

So, yeah, I think that's why actual commercial programs never reached the level of SHRDLU.


So what you're saying is that we really shouldn't be relying on industry for good AI?


Well, an "industrial model" is a model of a factory and it seems unlikely you could product a fully-functional, from scratch GFAI program like SHRDLU in something like a factory.

Perhaps one could create a different kind of enterprise for this but it's kind of an open problem.


Winograd made changes to the Lisp assembly to make SHRDLU work and never back-ported them to his SHRDLU code, but his original version worked fine and was stable. The experience of breaking refers to later versions that were expanded by his doctoral students and others, and to ports to Java and, I think, C. The original code was written in 1969 but inevitably suffered from bit rot in the intervening years, so it's true that there is no stable version today that can reliably do what Winograd's original code could do... but Winograd's original code was rock solid, according to the people who saw it working.

There's some information about all that here:

http://maf.directory/misc/shrdlu.html

[Dave McDonald] (davidmcdonald@alum.mit.edu) was Terry Winograd's first research student at MIT. Dave reports rewriting "a lot" of SHRDLU ("a combination of clean up and a couple of new ideas") along with Andee Rubin, Stu Card, and Jeff Hill. Some of Dave's interesting recollections are: "In the rush to get [SHRDLU] ready for his thesis defense [Terry] made some direct patches to the Lisp assembly code and never back propagated them to his Lisp source... We kept around the very program image that Terry constructed and used it whenever we could. As an image, [SHRDLU] couldn't keep up with the periodic changes to the ITS, and gradually more and more bit rot set in. One of the last times we used it we only got it to display a couple of lines. In the early days... that original image ran like a top and never broke. Our rewrite was equally so... The version we assembled circa 1972/1973 was utterly robust... Certainly a couple of dozen [copies of SHRDLU were distributed]. Somewhere in my basement is a file with all the request letters... I've got hard copy of all of the original that was Lisp source and of all our rewrites... SHRDLU was a special program. Even today its parser would be competitive as an architecture. For a recursive descent algorithm it had some clever means of jumping to anticipated alternative analyses rather than doing a standard backup. It defined the whole notion of procedural semantics (though Bill Woods tends to get the credit), and its grammar was the first instance of Systemic Functional Linguistics applied to language understanding and quite well done." Dave believes the hardest part of getting a complete SHRDLU to run again will be to fix the code in MicroPlanner since "the original MicroPlanner could not be maintained because it had hardwired some direct pointers into the state of ITS (as actual numbers!) and these 'magic numbers' were impossible to recreate circa 1977 when we approached Gerry Sussman about rewriting MicroPlanner in Conniver."

Regarding the advantage of a lone programmer- that's real, but large teams have built successful software projects before, very often. I guess you don't even need a big team, just a dozen people who all know what they're doing. That shouldn't be hard to put together given FANG-level resources. Hell, that shouldn't be hard to do given a pool of doctoral students from a top university... but nowadays even AI PhD students would have no idea how to recreate something like SHRDLU.

Edit: I got interested in SHRDLU recently (hence the comments in this thread) and I had a look at Winograd's thesis to see if there was any chance of recreating it. The article above includes a link to a bunch of flowcharts of SHRDLU's CFG, but even deciphering those hand-drawn and occasionally vague plans would take a month or two of doing nothing else, something for which I absolutely do not have the time. And that's only the grammar - the rest of the program would have to be reverse-engineered from Winograd's thesis, examples of output from the original code or later clones, etc. That's a project for a team of digital archaeologists, not software developers.


"... his original version worked fine and was stable. The experience of breaking refers to later versions that were expanded by his doctoral students and others and to ports to java and I think C."

I can believe this but I think your details overall reinforce my points above.

"For a recursive descent algorithm it had some clever means of jumping to anticipated alternative analyses rather than doing a standard backup."

Yeah, fabulous, but extremely hard to extend or reproduce. The aim of companies was to scale something like this. It seems like the fundamental problem was that only a few really smart people could program to this level, and no one could take their programs beyond it (the old saw that a person has to be twice as smart to debug a program as to write it comes in here, etc.).


My impression is that the systems never progressed much after SHRDLU even though there were attempts at larger scale "expert systems". But adding more advanced rules and patterns proved extremely difficult and did not always have the expected effect of making the systems more general.

There was the whole AI winter thing, of course, but that was as much a result of things not living up to the hype as a cause.


This doesn't directly address your question, though it perhaps can give you some pointers if you want to read about the history of AI and the AI winter of the '80s, but in a way SHRDLU featured prominently in the AI winter, at least in Europe, particularly in the UK.

So, in the UK at least, the AI winter was precipitated by the Lighthill Report, a report published in 1973, compiled by Sir James Lighthill and commissioned by the Science Research Council, i.e. the people who held all the research money at the time in the UK. The report was furiously damning of AI research of the time, mostly because of grave misunderstandings, e.g. with respect to combinatorial explosion, and basically accused researchers of, well, faffing about and not doing anything useful with their grant money. The only exception to this was SHRDLU, which Lighthill praised as an example of how AI should be done.

Anyway, if you have time, you can watch the televised debate between Lighthill and three luminaries of AI: John McCarthy (the man who named the field, created Lisp and did a few other notable things), Donald Michie (known for his MENACE reinforcement-learning program running on... matchboxes, and for basically setting up AI research in the UK) and Richard Gregory (a cognitive scientist, about whom I confess I don't know much). The (short) Wikipedia article on the Lighthill Report has links to all the YouTube videos:

https://en.wikipedia.org/wiki/Lighthill_report

It's interesting to see in the videos the demonstration of the Freddy robot from Edinburgh, which was capable of constructing objects by detecting their components with early machine vision techniques. In the 1960s. Incidentally:

Even with today's knowledge, methodology, software tools, and so on, getting a robot to do this kind of thing would be a fairly complex and ambitious project.

http://www.aiai.ed.ac.uk/project/freddy/

The above was written sometime in the '90s, I reckon, but it is still true today. Unfortunately, Lighthill's report killed the budding robotics research sector in the UK and it has literally never recovered since. This is typical of the AI winter of the '80s: promising avenues of research were abandoned not for scientific reasons, as is sometimes assumed ("expert systems didn't scale" etc.), but rather because the pencil pushers in charge of disbursing public money didn't get the science.

Edit: A couple more pointers. John McCarthy's review of the Lighthill Report:

http://www-formal.stanford.edu/jmc/reviews/lighthill/lighthi...

An article on the AI winter of the '80s by the editor of IEEE Intelligent Systems:

https://www.computer.org/csdl/magazine/ex/2008/02/mex2008020...


Interesting, thank you for the clarifications.


The modern natural language interfaces with limited domains are Alexa and Siri. Yes, they’re limited. But they are far more impressive and useful than SHRDLU.


Alexa and Siri (and friends) are completely incapable of interacting with a user with the precision of SHRDLU. You can ask them to retrieve information from a Google search, but e.g. they have no memory of the anaphora in earlier sentences of the same conversation. If you say "it" a few times to refer to different entities, they completely lose the plot.

They are also completely incapable of reasoning about their environment, not least because they don't have any concept of an "environment" - which was represented by the planner and the PROGRAMMAR language in SHRDLU.

And of course, systems like Siri and Alexa can't do anything even remotely like correctly disambiguating the "support support supports" show-off sentence in the excerpt above. Not even close.

Edit: Sorry, there's a misunderstanding about "limited domain" in your comment. Alexa and Siri don't operate in a limited domain. A "limited domain" would be something like being in charge of your music collection and nothing else. Alexa and Siri etc. are supposed to be general-use agents. I mean, they are, it's just that they suck at it... and would still suck in a limited domain too.


It’s not meaningful to compare SHRDLU with today’s verbal search interfaces. The world SHRDLU manipulated were only stackable blocks and the only relations it knew were ‘above’ and ‘below’. The entire scope of its endeavor was describing the state of block stacks and basic ways to reorder stacks to satisfy the single relation of above-ness and below-ness.

Time and causality and even basic probability were all absent from SHRDLU’s model. Not surprisingly the work was a dead end that even Winograd was quick to abandon, as he subsequently exited the field of experimental AI for the more conceptual models of cognitive science and HCI.


>> The world SHRDLU manipulated were only stackable blocks and the only relations it knew were ‘above’ and ‘below’.

If that were true then SHRDLU would have operated in a one-dimensional world. Instead, it moved blocks around a three-dimensional world. It could understand more spatial relations than "above" and "below", as in the example I quote above where it is asked "Is there a large block behind a pyramid?". It stacked blocks, but in doing so it also had to move others out of the way, etc. That is no big mystery; like I say, SHRDLU used a planner, and even in the 1960s there were planners capable of solving block-stacking problems in 3D environments.
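
(To give a flavour of what "a planner capable of solving block-stacking problems" means, here is a minimal breadth-first planner over a toy blocks world. It's my own illustrative sketch, not SHRDLU's Micro-Planner, which was a goal-directed, backtracking logic program and far more sophisticated:)

    # Minimal breadth-first planner for a toy blocks world: find a sequence
    # of "move the top block of stack i onto stack j" actions that reaches
    # a goal configuration. Purely illustrative.
    from collections import deque

    def moves(state):
        # state: tuple of stacks; each stack is a tuple of blocks, bottom first
        for i, src in enumerate(state):
            if not src:
                continue
            block = src[-1]
            for j in range(len(state)):
                if i == j:
                    continue
                new = [list(s) for s in state]
                new[i].pop()
                new[j].append(block)
                yield (block, j), tuple(tuple(s) for s in new)

    def plan(start, goal):
        frontier = deque([(start, [])])
        seen = {start}
        while frontier:
            state, path = frontier.popleft()
            if state == goal:
                return path
            for action, nxt in moves(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [action]))
        return None

    start = (("green", "red"), ("blue",), ())   # red sits on green; blue alone
    goal  = (("green",), (), ("blue", "red"))   # want red stacked on blue
    print(plan(start, goal))  # [('blue', 2), ('red', 2)]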

Time and causality were certainly not "absent" from SHRDLU! Like I say, it had a memory that allowed it to keep track of what it had done previously and it could explain the reasoning behind its actions, like in this excerpt from the demonstration quoted in wikipedia:

    Person: Had you touched any pyramid before you put the green one on the little cube?
    Computer: YES, THE GREEN ONE.
    Person: When did you pick it up?
    Computer: WHILE I WAS STACKING UP THE RED CUBE, A LARGE RED BLOCK AND A LARGE GREEN CUBE.
    Person: Why?
    Computer: TO GET RID OF IT.
    Person: Why did you do that?
    Computer: TO CLEAN OFF THE RED CUBE.
    Person: Why did you clear off that cube?
    Computer: TO PUT IT ON A LARGE GREEN CUBE.
    Person: Why did you do that?
    Computer: TO STACK UP THE RED CUBE , A LARGE RED BLOCK AND A LARGE GREEN CUBE.
    Person: Why did you do that?
    Computer: BECAUSE YOU ASKED ME TO.
    Person: How did you do it?
    Computer: BY PUTTING A LARGE RED BLOCK ON THE TABLE ; THEN LETTING GO OF IT ; THEN PUTTING A LARGE GREEN CUBE ON IT ; THEN LETTING GO OF THAT CUBE ; THEN PUTTING THE RED CUBE ON THAT CUBE; THEN LETTING GO OF THAT CUBE.
This is a question about time and causality: the user asks why the program performed a set of operations at an earlier time.

>> Not surprisingly the work was a dead end that even Winograd was quick to abandon, as he subsequently exited the field of experimental AI for the more conceptual models of cognitive science and HCI.

Regarding Winograd's subsequent work direction, this is what he had to say about it himself:

How would you say SHRDLU influenced your subsequent work and/or philosophy in AI?

Having insight into the limitations I encountered in trying to extend SHRDLU beyond micro-worlds was the key opening to the philosophical views that I developed in the work with Flores. The closest thing I have online is the paper Thinking machines: Can there be? Are we?

How would you characterize AI since SHRDLU? Why do you think no one took SHRDLU or SHRDLU-like applications to the next level?

There are fundamental gulfs between the way that SHRDLU and its kin operate, and whatever it is that goes on in our brains. I don't think that current research has made much progress in crossing that gulf, and the relevant science may take decades or more to get to the point where the initial ambitions become realistic. In the meantime AI took on much more doable goals of working in less ambitious niches, or accepting less-than-human results (as in translation).

What future do you see for natural language computing and/or general AI?

Continued progress in limited domain and approximate approaches (including with speech). Very long term research is needed to get a handle on human-level natural language.

http://maf.directory/misc/shrdlu.html

My reading of this is that he realised natural language understanding is not an easy thing. I don't disagree one bit, and I don't think for a moment that SHRDLU could "understand" anything at all. But it was certainly capable of much more intelligent-looking behaviour than modern statistical machine-learning based systems. Winograd's reply above says that it's hard to extend SHRDLU outside of its limited domain, but my point is that a program that can operate this well in a limited domain is still useful - much more useful than a program that can operate in arbitrary domains but is dumb as bricks, like modern conversational agents, which have nothing like a context of their world that they can refer to in order to choose appropriate contributions to a conversation. He also hints at the shift of AI research targets from figuring out how natural language works to "less ambitious niches" and "less than human results", which I also point out in my comments in this thread. This is condensed - I'm happy to elaborate if you wish.

I have to say I was very surprised by your comment, particularly the certainty with which you asserted SHRDLU's limitations. May I ask- where did you come across information that SHRDLU could only understand "above and below" relations and that time and causality were absent from its "model"?




