Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined. A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.
Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration. There are many failure points: it can "hear" you wrong, it can miss the wake word, it can hear correctly but interpret wrong, miss context clues, or simply be unable to process whatever the request is. In my experience, most normal people relegate voice commands to ultra-specific tasks, like timers, weather, and music, and that's that. Google and Alexa are relatively good at "trivia" questions, but Siri is a complete failure. All systems have edge cases that make them brittle.
I think there's potential here. Cortana was the most promising: an assistant that's integrated into the OS and can change any setting or perform anything on-screen would, again, be really awesome. We just don't have that. I think maybe OS-wide integration + GPT-4 (or later) might get closer to what we expect, but it's just not great right now. I really want to be able to say something as unstructured as "hey siri, create alarms every 5 minutes starting at 6am tomorrow" or "hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news". There /is/ power to be had, but nobody has really tapped it.
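To make concrete how much structure hides behind that first request: "alarms every 5 minutes starting at 6am tomorrow" has to bottom out in something like the loop below. This is just a sketch in Python with a hypothetical create_alarm() hook, and the cutoff is invented, since the utterance never says when to stop.

```python
from datetime import datetime, timedelta

def create_alarm(when: datetime) -> None:
    # Hypothetical hook into the OS alarm service; a stand-in for illustration.
    print(f"alarm set for {when:%Y-%m-%d %H:%M}")

# "starting at 6am tomorrow"
start = (datetime.now() + timedelta(days=1)).replace(
    hour=6, minute=0, second=0, microsecond=0)

# "every 5 minutes" -- the request never says when to stop, so an assistant
# would have to guess a cutoff; assume an hour's worth here.
for i in range(12):
    create_alarm(start + timedelta(minutes=5 * i))
```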
Natural language is a fundamentally wrong vehicle to convey information to a computer. It can be useful for some specific tasks: automated Q&A, simple interfaces to databases, stuff where I can't be properly f_ed to remember the syntax or the shortcut, like IDE commands.
But the idea that it can replace formal language is fundamentally and dangerously incorrect. I agree with Dijkstra's quip: we shouldn't regard formal language as a burden, but rather as a privilege.
A lisp compiler in a voice assistant would seem like an improvement in that the user could define objects and then express the actions to be performed in the same room. But these assistants seem to drop objects between commands, making them hard to program conversationally.
I guess a Lisp-like language would be ideal, and the pauses would act like parentheses.
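For what it's worth, the "pauses as parentheses" idea maps onto a very small reader. A rough Python sketch, assuming the recognizer emits explicit group markers (spoken "begin"/"end" here, but they could just as well be pause tokens) in place of parentheses:

```python
def parse(tokens):
    """Turn a flat token stream into nested lists, treating the markers
    'begin' and 'end' as spoken parentheses. In a real system these could
    just as well be pause tokens emitted by the recognizer."""
    stack = [[]]
    for tok in tokens:
        if tok == "begin":
            stack.append([])
        elif tok == "end":
            group = stack.pop()
            stack[-1].append(group)
        else:
            stack[-1].append(tok)
    return stack[0]

# "begin set lamp begin colour red end end"
print(parse("begin set lamp begin colour red end end".split()))
# -> [['set', 'lamp', ['colour', 'red']]]
```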
Not to take away from your point (I'd like the magic list too) but to some degree, this can be worked around using Shortcuts. If you use inputs, Siri will prompt for them, which is a bit slow, but you could even use a Dictate Text action and parse the result yourself if desired.
On the other side, humans have been fine using natural language to delegate commands to each other.
So maybe it's just that the subfield of natural language understanding is still too early to be really useful. Speech recognition itself has gotten really good but then understanding the context, the intent, etc, all that is natural language understanding, and that is often the problem.
Citation needed: there are a lot of disagreements and misunderstandings (some have cost lives) that could've been avoided if we didn't have 10 different ways to say the same vague thing that can be interpreted in 20 ways. You think the military uses a phonetic alphabet and specifically structured communications for fun? Or the way planes talk to ATC, for example. Where precision and unambiguity are crucial, natural language always gets ditched for something more formal.
This is actually an interesting point. In the Army, we used terms that limited ambiguity thereby increasing efficiency. Even if one eliminates the complexity of language, there's still a specification problem.
I only use voice assistants to set alarms. I cannot imagine voice as a primary input. Then again, many have opted out of owning desktops and laptops in favor of mobile phones. That also seems terribly inefficient.
>Then again, many have opted out of owning desktops and laptops in favor of mobile phones. That also seems terribly inefficient
A lot of people don't need computers in the general purpose sense. I admit my mind boggles a bit when co-workers tell me their kids don't want a computer to do their school papers because their phone is fine. But, then, I'm used to keyboards and what we think of as a "computer" and have been using one for decades--and grab one when I can for any remotely complex or input-heavy task.
> A lot of people don't need computers in the general purpose sense. I admit my mind boggles a bit when co-workers tell me their kids don't want a computer to do their school papers because their phone is fine.
I grew up in the 1980s, when handwritten papers were still the norm. I do see the advantages of using a word-processor for writing papers, but don't see why it would be a necessity (at least, until University).
It sounds ridiculous, but I'll admit that when you've got something like Dex that lets you dock the phone for usb and hdmi out and gives you close to a full desktop OS I'd imagine it really is enough for the casual user.
I certainly know colleagues in the industry who travel with just a tablet and external keyboard. No, they're not running IDEs etc., but they find it OK for emails, editing docs, taking notes, etc. Personally I'll spend the extra few pounds to also carry along a laptop. But I can imagine not needing/wanting a dedicated laptop when I travel at some point.
I'm usually carrying a tablet anyway though for entertainment/reading purposes. So it's usually a choice of tablet + laptop vs. tablet + keyboard. (I admittedly don't really have a weight optimized travel laptop these days either.)
I actually do wish there were good Mac or Chromebook choices for a travel 11" or so laptop but the market seems to have settled on a thin 13" as the floor and, admittedly, the weight/size difference isn't huge.
While I am mostly a Mac person, for travel I often prefer a tiny and cheap Lenovo Chromebook that does everything (a bit poorly): Linux containers for lightweight programming and writing, and consuming media like books, audiobooks, and streaming.
In response to a grandparent comment about weight for tablets: I prefer Apple’s folio old style of cases/keyboards because of weight. I have one for both my small and large iPad Pros. Whenever I travel, I usually just take one of my iPads if I don’t need a dev environment [1].
[1] but with GitHub Codespaces and Google Colab, development on an iPad is sort of OK.
I still don't see the point of tablets. It's just a smartphone with a larger screen, and practically all people already carry phones.
Might as well go for the laptop at that point given that it can actually do far more imo, unless you ditch the phone and go for one of those half phone half tablets I guess.
I'd rather watch movies, read, play certain games, etc. on my tablet than on a phone. (Obviously there are also specific use cases like digital art.) That said, I mostly use my tablet when traveling and it's a distant third in necessity compared to either a laptop or a phone--and only somewhat more useful than a smartwatch.
Watching movies on a tablet is terrible, though. All methods for propping the device up so you can watch the movie are inferior to the way a laptop screen props itself up via hinges and a base.
On a plane I'd rather use the tablet in my lap than have to put the tray table down. And in a hotel room I'm watching on the couch if there is one. (I do also have an attachment for my tablet that will let you prop it up on a table but I mostly don't use it because it adds weight.)
For reading, I'm probably bringing my Kindle along if I don't bring my tablet.
If you do not have one, buy a dock! I have an SP6 and an SP4, and having the dock makes it quite the device. Speakers, multiple external monitors, keyboard, mouse -- a full desktop setup. I can grab it and either stick a keyboard cover on or just use it as a reading device on the couch.
Back to work? Set it on the table, plug in one cable, and it's back to being a desktop and charging up again.
How old are you? Because larger screens become really nice as your eyes go bad. And I don't need the full size of a laptop for things I'd want to do on a tablet.
The obsession with being lighter definitely has diminishing returns. At some point another few ounces doesn't make any difference in a real, practical sense. I think we have just started to associate "lightness" == "better" despite there being no actual benefit past a certain threshold.
Right, at some point. But at the current point my tablet is too heavy to hold in hand for more than 20 secs perhaps. Phone is ok. Tablet is not (for me). I only use the tablet by placing it on a table or a stand. Then actually using a laptop is much better than a tablet.
The killer tech will be when we have a tablet that is as light as a phone.
Thanks for that. A lot of energy is currently sunk because of natural language, and I'd argue the gains from employing software (instead of human processes) for various tasks are in part due to scaling up the results of many confusing discussions in natural language about what a specific process actually comprises.
This is part of the reason Google search sucks more and more.
Around when Android appeared, and the first voice searches began, Google suddenly started to alias everything.
Search for 'Andy', 'Andrew' appears. Search for 'there', and 'they're' appears.
This has been taken further; now silly aliases such as debian → ubuntu exist, and since Google happily drops words from your search to find a match, precision becomes impossible.
But, that's the only way to make voice search remotely work, so...
I don't think this is to support voice search: Google generally knows whether a query was initiated by voice or typing. Instead, I think it's because most users find what they're looking for faster with it.
If you have terms you don't want interpreted broadly you can put them in quotes.
Google "helpfully" ignores the quotes sometimes too. They're not the hard and fast rule they used to be.
I preached the Gospel of Google when the competition was composed of web rings and Altavista, but Google in its infinite wisdom has abandoned the advanced user with changes of this nature.
I often find the voice assistant useful for operating the phone, such as opening a given setting, say making the display brighter. Trying to navigate the settings pages is very error-prone. There seems to be no universal standard as to where each setting should be found.
There is a widely accepted and straightforward view that humans have ideas, which are expressed in language, and that language being ambiguous is problematic: this I'm starting to have doubts about.
Maybe we don't have clear intentions in the first place; maybe languages are not just ambiguous, but are only meant to narrow the realm of valid interpretations down to a desired precision, rather than to form logically fully constrained statements. Maybe this is why intelligent entities are needed to "correctly" interpret natural language statements, because the act of interpretation is itself a decision and an action.
Just my thoughts, but I do think there is more to be said than "natural languages are ambiguous".
> On the other side, humans have been fine using natural language to delegate commands to each other.
Using language to instruct humans goes wrong all the time. Just a short while ago on British Bakeoff I saw 2 of the contestants make white chocolate feathering on their biscuits by making actual feathers out of white chocolate and placing them on their biscuits. And I'm sure that will confuse quite a few people reading this too. It certainly confuses image searches. Language is a fuzzy interface. Compare that to an interface like clicking a button that does the thing I want done.
How would you (easily) describe the concept of chocolate feathering to a computer without using natural language? (e.g. if you wanted the computer to generate an image, or search for an image of / recipe with chocolate feathering).
> On the other side, humans have been fine using natural language to delegate commands to each other.
And that's why all of aviation has moved to a tight phraseology, such that delegated commands are universally understood and their meaning is set in stone.
> humans have been fine using natural language to delegate commands to each other.
Not always resulting in unambiguous instructions:
"Lord Raglan wishes the cavalry to advance rapidly to the front, follow the enemy, and try to prevent the enemy carrying away the guns."
~Lord Raglan, Balaclava
"I wish him to take Cemetery Hill if practicable."
~Robert E. Lee, Gettysburg
> On the other side, humans have been fine using natural language to delegate commands to each other.
I think this is really a mischaracterization. Mostly, human communication is full of errors and problems.
What is true is that when it is important enough, humans have come up with ways to minimize communication errors and frameworks to deal with ambiguity - mostly these involve training and effort, though; it really doesn't come naturally.
> humans have been fine using natural language to delegate commands to each other.
Every time we try to minimize errors, we formalize a language. I don't even think people use natural language to issue commands often. Commanding people is often considered rude.
The problem is that it's not actually a conversation. To significantly improve it, you'd want to (see the sketch after this list):
- identify users by voice
- ask them clarifying questions
- remember the answers on a per-user basis
- understand "no, that was the wrong answer"
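A minimal sketch of that loop in Python, with invented speaker IDs and an ask() function standing in for the actual voice prompt:

```python
prefs: dict[tuple[str, str], str] = {}   # (speaker, slot) -> remembered answer
last: dict[str, tuple[str, str]] = {}    # speaker -> last (slot, value) used

def ask(speaker: str, question: str) -> str:
    # Stand-in for a real clarifying question asked over voice.
    return input(f"[to {speaker}] {question} ")

def resolve(speaker: str, slot: str) -> str:
    """Use a remembered per-speaker answer, or ask a clarifying question and remember it."""
    key = (speaker, slot)
    if key not in prefs:
        prefs[key] = ask(speaker, f"Which {slot} do you mean?")
    last[speaker] = (slot, prefs[key])
    return prefs[key]

def correct(speaker: str) -> None:
    """Handle "no, that was the wrong answer": forget the last answer and re-ask."""
    if speaker in last:
        slot, _ = last[speaker]
        prefs.pop((speaker, slot), None)
        resolve(speaker, slot)

# e.g. "play music" from Alice: resolve("alice", "music service") asks once,
# then reuses the answer on later requests; correct("alice") re-asks.
```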
If you're going to provide a formal interface to the computer, you also have to provide teaching in that formal interface, which is far more of a burden to the user than the cost of the device. And we've completely moved away from that model (not necessarily a good thing, but that's what the market has chosen).
Calling it a burden is an assumption that ignores and belittles the end user. Sure, there are people who won't want to train their personal ai.
But I imagine there are significantly more who would appreciate clarifying requests by a teachable assistant capable of interacting with the entire digital world on their behalf, efficiently and intelligently.
I think you're right. There are glimpses of this in the voice interfaces right now. For example, Alexa will distinguish between voices and preferentially take actions for me, saying "Play Music" plays Spotify, and for my kids, it plays Amazon music.
An example backing this is voice assistants that DO work, e.g. Talon voice. But these require defining a language, and then they are very accurate and powerful.
I don't see why a voice assistant for the masses couldn't "train its own users", for example by suggesting the language it does expect. But even then, most of the time people are talking in noisy environments, or talk too fast, or don't have an understanding of how the machine might work. Regardless, who cares. They ruin the audio environment of a home. They're good for setting timers while you're cooking, that's about it.
Car voice assistants do this, but they're still clunky and it takes them forever to list their options. Voice interfaces just like CLI suffer from extremely bad discoverability and presentation compared to GUIs and thus will always be limited to specialty applications. CLIs at least have a league of try-hards and hobby linux users to keep them alive.
Right - natural language works for people because we have minds that are communicating. A virtual assistant has a list of things it can do, and uses language as an interface to them. So the language just becomes obfuscation instead of allowing clarification.
I've said before, I would prefer a voice assistant that optimized for traversing its menu system in response to unambiguous noises (could be high- and low-pitched hums or whatever) that let me bypass the guessing game and use the menu it's hiding.
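Under the hood that could be dead simple. A toy sketch, with an invented menu tree and "low"/"high" tokens standing in for the two unambiguous noises:

```python
# Hypothetical menu tree: each node is either a command string (leaf) or a
# (low_branch, high_branch) pair.
MENU = (
    (("set a timer", "set an alarm"), ("play music", "play a podcast")),
    (("lights on", "lights off"), ("weather today", "weather tomorrow")),
)

def traverse(node, tones):
    """Walk the menu with unambiguous 'low'/'high' tones instead of guessing
    at phrasing. Returns the command reached, or the remaining subtree."""
    for tone in tones:
        node = node[0] if tone == "low" else node[1]
    return node

print(traverse(MENU, ["low", "high", "low"]))  # -> "play music"
```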
Otherwise, it works great :-) We love the hands-off usage mode because we cook a lot, so adding things to shopping lists or looking stuff up doesn't require cleaning hands in the middle of prep. Also the speakers are pretty darn good for the size and work well for music.
Doing complicated things is right out though. But the simple stuff works fine.
I'm just waiting for someone to finally release a voice assistant built around an actual language model, like GPT-3 or LaMDA.
It would be more error prone in a lot of ways, which is probably why nobody's done it yet, but it would also be a _lot_ more powerful, and fulfill the vision of conversational AI in a way the current rules-based assistants do not.
I think if powerful language models were easily accessible to normal people (in an inexpensive and completely unrestricted fashion, like with Stable Diffusion) we'd already see this happening in the open source world. Companies are going to be a lot more hesitant to try it though until they have a way to 100% prevent the models from making mistakes that could reflect poorly on the company, which is going to take _way_ longer to achieve.
Natural language conveys information to other people just fine. So the problem isn't that "Natural language is a fundamentally wrong vehicle to convey information to a computer". The problem is getting the computer to understand natural language to the same level as a human.
Dijkstra's full essay[1] is a bit more illuminating, but essentially it's about how, for example, developing a system of symbols and formal language around mathematics has allowed "school children [to] learn to do what in earlier days only genius could achieve".
I think his argument even generalizes to literacy in general. Remember that reading and writing skills don't develop naturally (as opposed to spoken language). They require a large educational investment, and used to be reserved for the wealthy and the privileged.
But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.
And even if you're all alone in a silent place, giving instructions out loud takes more time than configuring a screen, and will always be error prone, because the feedback will always be ambiguous and imprecise.
Except maybe if the feedback is on a screen, but then if there's already a screen, why not use it.
I think the best use cases for voice assistants are when you don’t have free hands. I have two scenarios where I use voice assistants: setting a timer while cooking and changing the music while showering. Both could be done by other means as well, but they wouldn’t be more convenient.
Exactly. For instance, in the mornings Google Assistant has been really useful for when I say "OK Google, Good Morning". It then runs through and tells me:
* Current time, and weather forecast for the day
* Upcoming meetings today
* My current commute time to work, including traffic
* NPR news podcast
So during my routine of letting the dogs out, starting the coffee, etc. in the morning, I get the daily "essential" info.
> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.
I would separate out the two, actually. There's a "natural language control system for the entire OS" and then there's the actual voice part. Voice is often mostly useful for accessibility purposes -- hands full, running, driving, etc. However, the other side is that a text-based NL assistant would also be profoundly useful. On iOS, you can enable "Type-to-siri" and you can just type sentences and Siri will respond back in text.
If we make progress on NL-driven command-lines, we can actually make progress on voice-assistants, and vice versa. The catch is that the voice side still needs recognition work.
Well, you are not trying to operate heavy machinery with Amazon Echo - hopefully. Voice as a common interface - I agree with all of that, but to me the everyday utility of being able to add something to my shopping list or my TODO list without having to fire up an APP greatly increases my quality of life. That part is magical, but I don't expect a lot more from it.
I used to use Alexa for my shopping list. I guess over time I came to the conclusion that adding something to a steno pad or my whiteboard was even easier.
If the assistant AI was advanced enough for pleasant conversations to occur, it would be useful.
It would be trivial to use the interface on screen when appropriate, and a truly smart assistant should be able to follow the context and be aware of your preferences and mood.
This is not fundamentally impossible, we're simply not there yet.
> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click
Working from home changes that. I can see many more opportunities for a multimodal input interface. Examples:
1. My fingertips now are closer to the "reply" button below this text area than they are even to the touchpad. Touching "reply" is half a second, moving one hand to the touchpad, aiming the pointer at the button and clicking takes longer. With a mouse: much longer. Anyway, my screen is not a touchscreen. I'll click.
2. Or, with an assistant, I could have said "Click reply", provided that the assistant knows where the focus is and that it can read the form I'm typing in.
Your fingertips while typing are even closer to the Tab and Enter keys on your keyboard, which, if pressed in sequence, have the exact same effect. Much simpler and much faster than either of your options.
"hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news"
I think the problem with that is that even I, as a human, struggle to know for sure what you want.
You want to turn all the lights on in the house? Does that include the lamps in the bedroom? How about new lights that you add later? Or the ones in the garden? It's full of ambiguity. What device do you want to watch the news on? Or did you mean the radio? Do you want this to apply when you get back at 2am one night, meaning your family gets woken up when you turn on all the lights and start playing the news in their bedrooms?
I think that's probably why voice interfaces aren't likely to work well for anything beyond direct, specific, well-scoped requests: turn on the lights in the bedroom; turn off the heating at home; roll up the blinds; what's the weather like today; what's the remaining range on my car. They really struggle to deal with anything more complex – not so bad in theory, but really incredibly irritating when they make the wrong decision.
If you had some kind of 24-hour live-in assistant (a butler, maybe?), then they probably have the knowledge and intuition to make sensible decisions in response to fairly unstructured requests. But I think we're miles off getting a voice assistant to do it – not because they can't, necessarily, but because if they mess it up at all it's infuriating.
You can do some of this with shortcuts, and then use Siri to trigger the shortcut. But that involves thinking; the magic of Jeeves is that he knows what you want even before you do.
The problem is there are more different combinations I might want as a shortcut than I have time to program/remember. I can remember something like a dozen commonly used shortcuts. However, when 5 years from now I arrive home at 2am (for the first time in several decades, but it will probably happen at some point again in my life), will I remember the correct shortcut - and assuming I do, is it up to date with whatever changes have been made to my house?
What about the shortcut for when I need to leave at 3am for some reason? Then a different shortcut for when it isn't just me, but my whole family leaving at 3am. And still another for my son having to leave that early.
Jeeves can figure it out when I arrive at 2am so I don't need to program it.
You've reminded me of some aspects of these platforms that I like in a more general sense – like for example the way the Apple Watch will automatically ring the alarm on my phone if I forget to put my watch on, or if I get up before my alarm goes off the watch will notice and ask if I want to skip the alarm for the day. This stuff genuinely feels almost like magic sometimes – the risk is that when anything like this goes wrong it's awful.
I might be in the minority, but I also don't want to add things to my life that make my environment noisier or that require me or others living with me to speak more. As much of a Star Trek fan as I am, I never found "The Computer" to be appealing, and always thought of it more as an artistic device. It's a lot easier to communicate a character's intent / action if they are vocalizing it for performance. Even in scenes where they are "typing" something into the computer, they will inevitably be communicating to the captain or another character what they are doing.
In practical reality these interfaces feel, to me, as extremely inefficient. As someone who doesn't particularly like to speak, and prefers silent environments, these interfaces require more energy from me to use. Unless they are serving someone who has a physical impairment then I don't see what problems exist that these solve, but I can identify lots of problems that they introduce (not only noise but privacy / security vulnerabilities etc.)
Timers and reminders alone are enough to make them a pretty nice thing to have though.
I don't really want them to be all that much more powerful, because natural language can be imprecise, and... there's just not much that I want to automate in a home setting beyond some real simple timers for lights and stuff.
What if I had a bad day and didn't want to see depressing news? Or what if I came home and was talking on the phone when it turned the news on?
True automation as opposed to just telemetry and remote control can easily be annoying more than helpful.
I like the idea of automation... but I don't actually... automate anything aside from timers and reminders.
I think that's generally true though playing music is a little more freeform. (And, guess what? Voice assistants tend to be worse at that.)
The problem is that many, many billions of dollars have been sunk into making these devices about more than setting alarms and timers. There's actually been a lot of pretty amazing progress. But it's yet another one of those things where getting to 90% isn't good enough for anyone but techies who want to fiddle with their smarthome stuff or otherwise play with the technology.
They might have a sudden increase in usefulness when smarthome stuff is more common, although smart bulbs are a bit of a hassle in most switched outlets, because the switch is usually more convenient.
Maybe they'll add an app that lets you browse possible commands so it's more discoverable.
It's probably true that a well-integrated smarthome would benefit from voice control.
But I'd observe that I'm going up to my brother's tomorrow and he has all manner of timers and other WiFi-connected stuff and none of it has any sort of centralized control and that's pretty normal even for people who have a lot of that sort of thing.
And, yeah, the only smart light thing I have at home is one thing that doesn't have a controlling light switch and I used X10 for it for years before I got an Alexa.
If I were in this space I would just build voice assistants for very specific situations where you cannot type, like driving, cooking, doing some sport, etc. There is lots of potential, but the big players are kinda trying to build a generic tool for every situation, which is a super hard problem.
My Alexa asked me today if I wanted an Avatar theme. No I really do not, Alexa. I was reminded of the article a few days ago about how they can’t monetize this well and are somehow losing $10 billion. :)
Voice assistants have reached the Unhelpful Valley stage.
When they were a novelty, I recall the excitement of trying new commands and layering in context; after many failures I've been conditioned to now only attempt, and expect success with, generic queries.
To me what’s interesting is that MS smelled that it was a problem a while ago and pulled the plug before it ate a hole in their wallet but Amazon and Google keep plugging along ploughing money into a bottomless pit. Apple has a different play and looks like they are controlling their losses there quite well and may act as a slight loss leader for other products.
I can't fathom how they managed to spend so much on it, though. The product has been around for quite a while, as well, so it's not some initial ramp-up cost. $3B/quarter $10B/year? Wow.
Edit: Maybe things like this happen because there are various nerds who lead these products and are good at talking the businesspeople into funding it. Maybe this was only possible at the big tech growth stage while business wasn't that good at telling the value proposition. So end result, lots more engineers get paid which is great in my book :-)
> Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration
My biggest frustration with Alexa is getting it to play the podcasts I want to listen to. Even popular podcasts with English names are hard to get just right for Alexa. The same goes for song titles and bands that are not popular, or that are in other languages.
Usually when I want to take a shower, I try to get the podcasts/music to play for 2 minutes, then sigh, give up and just say "Alexa play Britney Spears".
And discoverability. For a long drive I probably want to pick out some specific podcast episodes rather than play whatever. I'm just not a whatever background sound sort of person. The interfaces aren't really good enough to present me with some options with voice control only. So I end up mostly pre-populating a "Car" playlist.
>> A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.
And even then, a voice assistant is essentially a user interface, not a product or service.
It could be a service if you could reliably say "Alexa, plan my trip to customer X the week of the 30th and send me my itinerary". But for now they are an alternative to a phone UI.
The reality is that even a human personal assistant can rapidly devolve to being more of a hindrance than a help if they're not very good once you get beyond simple mechanical tasks. Even with all the knowledge about the world that most adults carry around in their heads. Yes, a poor human assistant can fall down in other ways such as forgetting to do something--but they have a lot of context.
This seems a really high bar for voice assistants aspiring to do much more than set alarms or turn the odd light etc. on or off.
These days few people have personal secretaries, but back when they were common they really were personal - once you got a personal secretary she (nearly always she; I feel like we should acknowledge the sexism even though it is irrelevant to my point) would follow you (nearly always male) as you moved from job to job and up the ladder. She went with you because once you had spent a few years training her in how you worked, a new secretary would greatly limit your effectiveness.
These days a computer can do a large part of what people relied on secretaries for, and faster, so only at the highest levels do you see them. There are still secretaries at the low levels, but not nearly as many, and they are not doing the same tasks.
That's pretty much it. We call them executive admins these days where they exist.
And, yeah, assistants shared with a bunch of other people--as with travel agents in general--aren't really all that useful. If I'm mostly just giving fairly mechanical instructions to execute, it's probably easier for me to go online and figure out the options myself.
A secretary made a lot more sense when you dictated memos for inter-office mail and retrieving information often involved making multiple phone calls.
>> This seems a really high bar for voice assistants aspiring to do much more than set alarms or turn the odd light etc. on or off.
That's kind of my point. A voice assistant is just a fancy UI until they reach the level of AGI, and I don't see the point in spending billions of dollars on them to be a simple UI as Amazon seems to be doing.
If that voice assistant were self hosted in the little device, I agree. But those simple interfaces are connected directly to a significantly larger machine that literally knows everything about you and half of everyone you know. It's not unheard of to expect it to be more useful than setting timers and playing music.
They "know" a bunch of discrete facts. They don't know that if you book me on a red-eye unnecessarily to save $100 I'll be hunting you down. Or any of a zillion other flexible preferences--some of which I'm not even very consistent about.
I don't know about you personally, but google definitely knows I've never booked a red-eye and that I haven't booked a layover since the early aughts. I'm fairly sure Google could easily figure out not only where I'd be interested in flying to in the next few months, but when and for how long, and at what price points I'd consider upgrading my flight.
I know they know this about me not only because of my Gmail account but also because I use Google flights to find the flights before I book them.
Unfortunately they're not using this data to help me. Rather they're using it to target advertising to me. But they definitely have the data and the machinery to be more useful to me with more than just a few facts
Maybe my travel is more complicated but I even not infrequently get annoyed with "past me" for various travel-related decisions. I avoid red-eyes but at some price point I won't--or maybe only if it's someone else's money. And maybe I don't have a choice based on my schedule or just what flights are available. Normally I won't do an unnecessary layover but maybe I will to fly my preferred airline.
It gets complicated in a hurry and for the cases where it is relatively simple (and when it gets into very complex international travel a voice interface is going to be completely useless), I can look up my options pretty quickly on a computer.
The potential would be there if they would focus on the assistant part, and take voice just as one means of interacting with the assistant, besides other means like clicking, typing, showing complex information on a screen, etc.
Voice alone sucks, it's just too limited to be useful on a grand scale. Similarly, command lines suck too. The shell in general has the same problems that voice assistants have, just that it has more value and had decades to mature into something actually useful. And today we have unix shells which reduce the problematic parts by many levels, and still receive constant improvements. This is missing for voice assistants, because unix shells are growing and improving in an open space, where everyone can add their own things. This is not happening in big tech.
I don't think this is actually reliably possible due to the fact that while grammar does tend to follow patterns sometimes, we're fundamentally dealing with an exponential number of ways to say things to a voice assistant.
In the spirit of the title of this post, someone else also has to say something.
If your argument is that this is a "non-visual command line" there's slim hope of the layperson learning a whole secret grammar without even a goddamn man page just to do their menial tasks.
*nix was optimized for low-bandwidth channels. That's why the command names and options are extremely terse and typically return trivial output on success. OTOH it was assumed that input would be reliable, so there's no confirmation required for potentially dangerous commands. A "*nix for voice" would need to address that, at the very least.
I’d sure be lost if I had to listen to the entirety of a manpage or dmesg output or /var/log/messages read out by voice. Some of those could take hours to read out. Nothing actually trivial about *nix command output. Just sometimes terse.
>Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined.
This got me thinking. Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager. Now there's just the language part.
Thing is, I don't want to speak to my computer using English. Aside from the enormous practical problems in natural language processing you've outlined, I just find the idea creepy[1].
What I want is to unambiguously tell it to do arbitrary things. I.e. use it as an actual computer, not a toy that can do a few tricks. I.e. actually program it. In some kind of Turing complete shell language that is optimized for being spoken aloud. You would speak words into the open source voice recognizer, it writes those to stdout, then an interpreter reads from stdin and executes the instructions.
Is there any language like this? What should it look like?
And yeah that would take effort to learn to use it right, just like any other programming language; so be it. This would be a hobbyist thing.
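The plumbing, at least, is small. A toy sketch in Python, with an entirely invented spoken vocabulary, that reads recognized words line-by-line from stdin and shells out; anything dangerous would want a spoken confirmation step before running:

```python
import subprocess
import sys

# Invented vocabulary for illustration only: spoken verb -> command prefix.
VERBS = {
    "list": ["ls", "-l"],
    "where": ["pwd"],
    "disk": ["df", "-h"],
}

def run(utterance: str) -> None:
    words = utterance.strip().lower().split()
    if not words:
        return
    verb, args = words[0], words[1:]
    if verb not in VERBS:
        print(f"unknown verb: {verb}", file=sys.stderr)
        return
    # Dangerous verbs would want a spoken confirmation step here.
    subprocess.run(VERBS[verb] + args)

for line in sys.stdin:   # e.g. piped from a local speech recognizer
    run(line)
```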
If you're using an averaged American voice - maybe. But it's really not solved for everyone. Google assistant can't set the right timer for me 1/10 times. And that's before we get to heavy accent Scots and others.
> Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager.
This is potentially far from true, depending on how exactly you draw the line between "voice recognition" and "language". I've looked at quite a few transcription services, and they fail a lot of the time for most people - those who either have a non-native accent (even if very slight!) or those who do any amount of stammering or other vocal tics.
I find the ML transcription services, given 2 people speaking English with high quality sound and without heavy accents/a lot of jargon, to be adequate for having a skimmable record--such as for extracting quotations (and just go back to the recording to confirm the exact words if it's not obvious). But if I'm publishing a transcript I get a human transcription. Cleaning up the ML stuff takes way too much time and I wouldn't publish a transcript without cleaning it up.
I was in fact looking at some transcriptions of my recent meetings, and found one that captures how even small mistakes can make for completely not-understandable transcripts, unless they are manually cleaned up.
Manual transcription:
> So no: long story short, Slum is basically the way we can have an individual [, uhhh,] instance that carries all the licenses.
(Slum is a project name in this case)
Computer transcription (MS Teams):
> So no.
> A long story shorts. Love is basically the way we can have an individual.
> Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager.
I personally don't consider this a fully-solved problem. The best transcription system I've used is OpenAI Whisper, and it doesn't work in realtime. Maybe it's fine on small amounts but it's still not perfect. You really need error to be driven down dramatically. Zoom auto-captions are a joke in terms of how badly they work for me, and Live Text (beta) on macOS is equally dreadful. YouTube auto-captions suck. All of these use industry-leading APIs. If I'm speaking a voice command and one single word is wrong, usually the whole thing fails.
There's an entirely separate issue about things that are Proper Nouns that don't exist. For example, "Todoist" is often misunderstood by Siri. Thus, people started saying "Two doist (where doist rhymes with joist)" to fool it into understanding "Todoist". Media like anime with strange titles from other languages often flat out trolls these transcription systems. ("Hey Siri, remind me to watch Kimetsu no Yaiba tomorrow".)
That reminds me of the handwriting recognition approach [1] used in old Palm Pilot devices. Even though the shapes it expected you to draw resembled the corresponding letters, you would never draw them like that if you were writing on paper.
You knew that you were drawing something designed for a computer to recognise as unambiguously as possible, while being efficient to draw quickly and easy to learn for you. I feel like that's the kind of notion that voice interfaces should somehow expand upon.
To me the hardest problem is simply remembering what every light on my network is named. Did I call the light next to my desk “desk light” or did I call it “office light”? If I don’t get the name exactly right, I cannot control the light. Multiply that by every other light in the house and it becomes a lot to remember. I have probably 15 lights controlled by Alexa and I can only remember the name of like three of them. Thus most of the time it is just “Alexa turn on the lights” so it can turn everything on in a room.
If these voice assistants were smarter about “alternative” names for every device it might be easier to use. But as it stands, it’s kind of a pain because the way you phrase each request is so unforgiving…
Oh yeah, and god help you if your device name is similar to your room name. If your room is “office” (or did I name it “the office”?) and your light is “office light” Alexa is gonna have a bad time figuring the two apart.
I have no clue how to fix this…
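The closest workaround I can imagine (a sketch only, with an invented device registry; nothing like this is exposed by the current assistants) is to keep user-defined aliases per device and fuzzy-match whatever was spoken against all of them:

```python
import difflib

# Hypothetical device registry: canonical name -> user-defined aliases.
DEVICES = {
    "office light": ["desk light", "study light"],
    "kitchen light": ["counter light"],
    "living room lamp": ["couch lamp", "reading lamp"],
}

def resolve_device(spoken):
    """Map a spoken name to a canonical device via aliases and fuzzy matching."""
    names = {}
    for canonical, aliases in DEVICES.items():
        for name in [canonical, *aliases]:
            names[name] = canonical
    match = difflib.get_close_matches(spoken.lower(), names.keys(), n=1, cutoff=0.6)
    return names[match[0]] if match else None

print(resolve_device("the desk light"))  # -> "office light"
```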
PS: this is why I question steering wheel free self driving cars. How will we tell these things exactly where to go when we cannot even reliably tell our voice assistants exactly what light to turn on?
I think the biggest potential is with Microsoft Teams in business. It is ubiquitous in people's work lives, has access to data, and is integrated with everything. And adding Cortana to calls would be an easy step for people to understand and learn. People would say "cortana share my screen". People would learn phrases from each other.
But teams hasn't figured out how to send text in a coherent way.
It's used because companies can cheap out on buying a license for other communication applications, it is fundamentally worse than anything else in any other metric. If voice lets me respond to a message without hunting for the hidden reply because Teams shoves it below the bottom of the screen then it could be a win.
Considering how poor the UX is in Teams, I doubt it will.
> There /is/ power to-be-had, but nobody has really tapped it.
This kind of thing can't be built for modern mainstream operating systems because they generally prevent subjugation of the OS components and other programs, even if the user wants that, ostensibly for security reasons.
Unlike a human operator, an assistant "app" can only operate within the bounds of APIs defined by the OS vendors and third-party developers. Gone are the days of third-party software that extends the operating system in ways that the overlords couldn't (or wouldn't) dream of.
That's not entirely true. Accessibility APIs on macOS, for example, would let you control so many aspects of the OS from user land apps given that permissions are granted. But voice assistants are not up to the task.
I think you're identifying some of the right problems here. All voice assistants are based on turn-taking, and when the VoiceAI hits one of those failure points and just comes back with "I didn't get that" it leaves the user in a frustrating state trying to debug what's wrong.
I work at SoundHound where we've been worried about these issues. (I'm going to plug our recent work...) Our new approach is to do natural language understanding in real-time instead of at the utterance (turn) taking level. That way we can give the user constant feedback in real-time. In the case of a screen that means the user sees right away that they are understood, and if not, a better hint of what went wrong. For example a likely mistake is an ASR mistranscription for a word or two.
We still need to prove this is a better paradigm for VoiceAI in products that people can try for themselves, and are working towards that goal. I hope that voice interfaces that were clunky with turn-taking will finally be more naturally usable with real-time NLU.
I tried Amazon's Alexa, the top-end model with a display. Often it would taunt you about new/interesting things on the screen, but I could never get them to work. I had to memorize things to get even the basics working. Ended up unplugging it.
However, Google's Assistant in comparison worked great: no memorization, and very useful. Sure, time, weather, timers, and alarms worked great with a very flexible set of natural language queries, as did more complex things like the temperature tomorrow at 10pm, simple calculations, and unit conversions. But also things like IMDB-style queries about directors, actors, which movies someone was in, etc. generally worked well. It seemed to really understand things, not just "A web search returned ...". Even more complex things like the wheelbase of a 2004 WRX would return an answer, not a search result.
With all that said I'm looking for a non-cloud/on site solution, even if it requires more work, most recently noticed https://github.com/rhasspy/rhasspy
The big issue is that there's no clearly defined interface for users. What commands are possible? Nobody knows. So people default to the most obvious things like setting a timer. Is it possible to set up your own commands and build your own workflows? AFAIK, no. So the tech is essentially dead in the water until companies fundamentally rethink what they're trying to do with voice assistants.
Yup. At the risk of being glib I would say this is 90% of the issue. Or more like 'the big blocking issue' at the moment.
Voice can do way more than we know, but we have no idea what it does or how to use it.
Standardizing the interface and providing tutorials would possibly change things dramatically.
And this goes for the back-end protocols as well.
The tech is way, way ahead of the UI and integration.
Imagine getting the power of 'git' with no tutorial and not really an understanding of what it does? Good luck with that.
90% of us would be using it in the car to do a lot of things if we really knew how to do it:
You: "Siri: Command. Open. Mail. Prompt. Recipients starting with S"
Siri: "Sarah, Sue, Sundar"
You: "Stop. Command. Message. To: Sundar. Thanks for the note. Stop. Send without Review"
Some of this already exists, but it's product specific etc. there needs to be some kind of natural universal interface - or we have to wait until the AI is really, really that good.
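The nice thing about a keyword-delimited grammar like the example above (which is just my own invention, not a real Siri feature) is how trivially parseable it becomes. A rough sketch, treating each spoken stop as a segment boundary:

```python
def parse_utterance(utterance: str) -> list[str]:
    """Split a keyword-delimited utterance into unambiguous segments.
    A spoken "Stop" or a long pause would play the role of the period."""
    return [seg.strip().lower() for seg in utterance.split(".") if seg.strip()]

print(parse_utterance("Command. Open. Mail. Prompt. Recipients starting with S"))
# -> ['command', 'open', 'mail', 'prompt', 'recipients starting with s']
```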
Talon voice can do everything a keyboard and mouse offers, plus more (contextual awareness, higher level abstraction). Very powerful in combination with modal editing. I'm not affiliated, just a user.
Granted, this is for a specific user base and yes, not in coffee shops.
This timeline is such a mishmash of mediocrity. Voice assistants could have been a vibrant ecosystem of different personalities, like say buying a Darth Vader voice pack or having your computer sound like a snooty English butler..
There's a great little game series called Megaman Battle Network (Rockman.exe in Japan) which diverges from the mainline by showing an alternate universe where scientists focused on AI instead of robotics, resulting in a world where "Navis" are ubiquitous.
I wonder, what if our early software engineers focused on bringing natural voice control to CLIs, before perfecting GUIs first?
I think these assistants just need to give the user a way to edit interpretations.
A 'debug' area that lets you ask a command, see what was interpreted - and immediately edit or click "that's not what I wanted". But not an afterthought and not a cumbersome process like setting up an automation that is triggered by specific commands.
Imagine telling your voice assistant "You're wrong, as usual" and instead of it giving you the boilerplate "I'm sorry...", it actually offered a way to improve itself.
I would think that a good command-line is one that responds to me within milliseconds on a crapbox i386 machine, and I can COMMAND it what to do.
A good command-line is not a binary blob that cannot parse simple instructions correctly.
At the same time, siri seems to be getting slower and fatter every iteration so perhaps it is becoming more human ;)
As a native English speaker, that seems a profoundly odd request but that is what you asked for.
And you now have me wondering how open-ended calendar requests are actually implemented given that they can't literally have entries out to infinity. (I assume they go out some finite period and some background process periodically re-populates future entries.)
A recurrence rule is added to a start event, then an occurrence cache is either generated on the fly for periods of interest, or, yes, a rolling cache a year or two in the future is maintained and updated daily.
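A sketch of the "generate on the fly for the period of interest" half, in Python with a simplified fixed-interval rule (real calendars use RFC 5545 RRULEs, which are more expressive):

```python
from datetime import datetime, timedelta

def occurrences(start: datetime, interval: timedelta,
                window_start: datetime, window_end: datetime):
    """Expand a simple 'repeat every <interval>' rule, but only within the
    window of interest, so nothing infinite is ever stored."""
    # Jump straight to the first occurrence at or before the window.
    if window_start > start:
        skipped = (window_start - start) // interval
        start = start + skipped * interval
    current = start
    while current < window_end:
        if current >= window_start:
            yield current
        current += interval

# e.g. a weekly 9am event, asked about three years out:
rule_start = datetime(2022, 1, 3, 9, 0)
day = datetime(2025, 1, 6)
print(list(occurrences(rule_start, timedelta(weeks=1), day, day + timedelta(days=1))))
```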
Perhaps trivial, but actually seems like an interesting question given you have to potentially tradeoff RPCs for routine queries (and the number of database records) vs. being wrong for the random "Am I free on this day three years from now?" query. Of course, the answer may be that, in general, the differences don't really matter.
Another pitfall of most voice assistants is that they are really designed first with the corporation in mind rather than the user. Most are proxies for surveillance, advertising, or are just steering consumers back to a preferred set of walled-garden services.
Yeah, the whole idea has a lot of potential that seems like it should be within reach, but somehow it's 2022 and my phone still can't handle "hey Google, play my driving playlist on Spotify."
Your queries continue to be money-sinks -- even in your ideal case, you aren't buying anything! This query costs them money but earns them nothing. This is useless.
Me and voice assistants are like me on the ballroom dance floor. I loved to take the lessons and learn all sorts of moves and chain them all together and look impressive, but when I got onto the floor with a partner, I just wouldn't know what to do or where to start. I kept to the "basic" steps and maybe a timid little turn once in a while.
Maybe it's possible to learn a working vocabulary and know how to command a voice assistant. I know my way around several command lines, but I have no idea what to say to Hey Google.
It almost sounds like you are describing how it feels to learn a new language. And if that's the case, and people need to learn "voice assistant" to communicate with their device effectively, hasn't it utterly failed as a natural language processor?
Also I know this is true in other domains as well, obviously there is a common "google-ese" that people learn to narrow down their searches.