> why haven't we been able to advance on-device processing?
Probably because having the payloads shipped off for server-side processing provides a constant stream of new training material? That's my cynical take, at least.
But there's got to be more to it. Paraphrasing H.L. Mencken[0], for every complex problem there is an answer that is straightforward, easy to accept - and wrong. I remember how in the late noughties Nokia had an early version of on-device speech recognition: long-press the trigger key and you could call people by uttering a saved "voice tag"[1]. IIRC the feature picked the right entry roughly 1/3 of the time.
What comes next is hearsay, but from what I heard at the time, the feature was originally developed at MIT. Nokia then financed the team to optimise their code and the underlying recognition model so it could be ported to ARM and fit within the constrained memory/CPU envelope. Nokia definitely collaborated with MIT at the time, so this is at least plausible.
If people are expected to use speech-to-text in real life, it has to work within very tight constraints. Low latency and high accuracy are table stakes; at least some level of contextual awareness would be nice. As long as predictive text input routinely provides us with meme-worthy failures, I won't expect anything better from (fundamentally noisier) speech input. And if server-side processing is the only way to lift performance from dismal to somewhat functional, practical applications don't have much of a choice. Plus, you don't have to ship your model to end-user devices.
For what it's worth, I dislike voice interfaces. But when they do work, I dislike them less than Byzantine and user-hostile phone menu systems. I guess that qualifies as progress.
0: https://quoteinvestigator.com/2016/07/17/solution/
1: https://nokia-e71.helpdoc.net/en/nokia-e71-user-guide/phone/...