Siri may be the first voice interface to get it right. (image via Marc Wathieu)
Will Siri, the new voice-recognition interface of the iPhone 4S, live up to Apple’s promises? It’s probably too soon to tell. The digital assistant demos well on stage, but talking aloud to your phone in public may prove too weird for most people. Regardless, Siri is evidence that Apple has realized something fundamental about voice interaction: the way we talk is essentially different from the way we read. And the same interface can’t work for both.
We’re all accustomed to visual and written interfaces, in which a user is asked to choose from a collection of words or symbols. Given lots of items, interface designers and content strategists tend to arrange those items in a hierarchy. Which makes sense: we can only focus on a few things at once, and without a hierarchy we’d have to deal with thousands of options laid out with equal weight in a flat and endless grid. To solve the problem of navigating this complexity, visual interfaces start at a high, general level, and work down to the specific. This holds true in print as well, in dictionaries, restaurant menus, and so on.
Many audio interfaces adopt this structure, too. Take automated phone support and its typical menu system: “Welcome. Please choose from the following options…” On the surface, an audio hierarchy is appealing, for the same reasons a visual one is. How are you supposed to know what options are available without being prompted, or without having your choices constrained? Using a hierarchy should feel streamlined, necessary, and clean. But it doesn’t. People hate it. They hit 0 without listening, trying to break out of the command tree. Look no further than GetHuman.com for evidence of how passionately people resist hierarchical voice command systems.
The fact is, audio interaction is essentially different from visual interaction. In day-to-day speech, no hierarchy is necessary. We can pluck any sentence from the bottomless pool of possible sentences. There’s no need for people to tell their roommates “Hey, New Reminder. Payments > House and Home > Utilities > Electricity > Pay Bill.” Nobody speaks that way, and no amount of user training would make it feel natural. We want our computers to understand commands on their own, without any navigational context. In other words, we want audio interfaces to be flat.
For years we’ve struggled. Technology seems to be the bottleneck — computers are bad at interpreting the human voice, so we’ve had to significantly constrain the number of things a person can say. Search engines approximate this kind of flat interface, in that they take a single input and provide possible solutions, but there’s a key difference: search engines don’t take our input and execute an action, or bring us directly to a single URL. Some refinement and judgment are always required on the part of the user. (If all we had was “I’m Feeling Lucky,” Google would be a strange and frustrating service.) We’ve never really achieved true flatness in any interface.
This is why the concept of Siri is so important. The virtual assistant is a promising step toward a true natural language interface. There are still constraints; for instance, Siri’s capabilities are limited, and she gets confused from time to time. But the concept is there, the idea of a perfectly efficient system for achieving an action: one command, one response.
This morning I asked Siri for directions home. She understood, and spent a moment thinking about my question. And then, something eerie happened. She pulled up my contact card from my address book; it was empty. “I don’t know what your home address is, Jake.” Immediately, without thinking, I tapped in my information and saved it. I did exactly as I was told.