On second thought, I'm not sure we're capable of this. To my (limited) knowledge, all of these models disregard time. They just wait their turn and spit out however many sentences they were told to produce in advance.

An active listener would have to understand when to interrupt, which means it would have to be trained on data that carries timing: audio, or timed transcriptions.

I did have a funny thought that you could do this for Indians with a video of an attractive person "head-bobbling" at various speeds and periodically humming an encouraging note. It's so efficient to have a physical gesture that means "I (still) see, yes. You can tell that I'm getting confused when I stop doing this."