Talking Siri

Jake Keyes   October 17, 2011

Siri may be the first voice interface to get it right. (image via Marc Wathieu)

Will Siri, the new voice-recognition interface of the iPhone 4S, live up to Apple’s promises? It’s probably too soon to tell. The digital assistant demos well on stage, but the idea of talking aloud to your phone may prove too weird for most people to do in public. Regardless, Siri is evidence that Apple has realized something fundamental about voice interaction: the way we talk is essentially different from the way we read. And the same interface can’t work for both.

We’re all accustomed to visual and written interfaces, in which a user is asked to choose from a collection of words or symbols. Given lots of items, interface designers and content strategists tend to arrange those items in a hierarchy. Which makes sense: we can only focus on a few things at once, and without a hierarchy we’d have to deal with thousands of options laid out with equal weight in a flat and endless grid. To solve the problem of navigating this complexity, visual interfaces start at a high, general level, and work down to the specific. This holds true in print as well, in dictionaries, restaurant menus, and so on.

Many audio interfaces adopt this structure, too. Take automated phone support, and the typical menu system: “Welcome. Please choose from the following options…” On the surface, an audio hierarchy is appealing, for the same reasons as a visual one. How are you supposed to know what options are available without being prompted, or without having your choices constrained? Using a hierarchy should feel streamlined, necessary, and clean. But it doesn’t. People hate it. They hit 0 without listening, trying to break out of the command tree. Look no further than GetHuman.com for evidence of how passionately people resist hierarchical voice command systems.

The fact is, audio interaction is essentially different from visual interaction. In day-to-day speech, no hierarchy is necessary. We can pluck any sentence from the bottomless pool of possible sentences. There’s no need for people to tell their roommates “Hey, New Reminder. Payments > House and Home > Utilities > Electricity > Pay Bill.” That certainly isn’t a natural way to speak, and it would likely take some training to navigate the hierarchy successfully. We want our computers to understand commands without context. In other words, we want audio interfaces to be flat.
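To make the contrast concrete, here is a rough sketch in Python. Everything in it is hypothetical: the menu tree simply mirrors the reminder path above, and the keyword matching is a toy stand-in for real language understanding, not a description of how Siri works. It only shows the difference in shape between walking a tree and issuing one flat command.

```python
# Hypothetical menu tree, mirroring the "Payments > ... > Pay Bill" path in the text.
MENU = {
    "Payments": {
        "House and Home": {
            "Utilities": {
                "Electricity": {"Pay Bill": "pay_electric_bill"},
            },
        },
    },
}


def navigate(menu, path):
    """Hierarchical style: the user must name every level, in order."""
    node = menu
    for step in path:
        node = node[step]  # raises KeyError on any wrong turn
    return node


# Flat style: one utterance is matched directly against known actions.
# These keyword sets are illustrative assumptions only.
FLAT_COMMANDS = {
    "pay_electric_bill": {"pay", "electric", "bill"},
    "directions_home": {"directions", "home"},
}


def interpret(utterance):
    """Return the action whose keywords all appear in the utterance, else None."""
    cleaned = utterance.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    for action, keywords in FLAT_COMMANDS.items():
        if keywords <= words:  # every keyword was spoken
            return action
    return None
```

In the first style the speaker has to know the whole path and recite it in order. In the second, “Remind me to pay the electric bill” and “Give me directions home” each resolve in a single step, and anything unrecognized simply returns nothing.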

For years we’ve struggled. Technology seems to be the bottleneck: computers are bad at interpreting the human voice, so we’ve had to significantly constrain the number of things a person can say. Search engines approximate something like an audio interface, in that they take a single input and provide possible solutions, but there’s a key difference: search engines don’t take our input and execute an action, or bring us directly to a single URL. Some refinement and judgement are always required on the part of the user. (If all we had was “I’m Feeling Lucky,” Google would be a strange and frustrating service.) We’ve never really achieved true flatness in any interface.

This is why the concept of Siri is so important. The virtual assistant is a promising step toward a true natural language interface. There are still constraints; for instance, Siri’s capabilities are limited, and she gets confused from time to time. But the concept is there, the idea of a perfectly efficient system for achieving an action: one command, one response.

This morning I asked Siri for directions home. She understood, and spent a moment thinking about my question. And then, something eerie happened. She pulled up my empty contact information from my address book. “I don’t know what your home address is, Jake.” Immediately, without thinking, I tapped in my information and saved it. I did exactly as I was told.

3 Responses

  1. Tosca says:

    “There’s no need for people to tell their roommates: Hey, New Reminder. Payments > House and Home > Utilities > Electricity > Pay Bill.” Amen to that. As someone who has literally screamed at both FedEx’s voice system and DMV’s, this post hit home. Hierarchy in speech isn’t simply unnecessary, it’s frustrating and clunky. Now I’m more excited than ever to get my new phone!

  2. David Capito says:

    The language you use is telling… “she.” It’s not easy for people to consider our devices in human terms, and when I read this article I get the impression that Siri really does break the traditional conventions of voice recognition.

    I could never imagine referring to my GPS as “she.”

    I myself own an Android phone, but I’m impressed with Apple’s attempt not just to make the *input* of language more natural, but also the *feedback*. Siri responds more like a human would, and that is key.

    The future of natural language will be the response.

    Or maybe I am eagerly awaiting the day my devices say, “I can’t let you do that, Dave.”

  3. Jeff Barbose says:

    There may be no need to speak that hierarchy, but the object model lives underneath it all. What matters in speech is context and what forms context is the connections between and among the objects in the model.

    As for the roommates and the electric bill, there’s not even a need to identify roommates, really. They have the same address you do, and you haven’t identified them as family or some other formal relationship. :)

    The reason I bring up an object model is that Apple has been a strong advocate to third-party developers of building underlying object models, and non-UI access to them, almost since the very beginning of the Macintosh. Non-UI access used to mean AppleEvents. Then AppleScript.

    Apple’s been building on its own technology for *that* long, and with that kind of historical perspective it’s easy to see why they acquired Siri, and not only how Siri might do its thing today, but how third-party developers might make their apps accessible to Siri in the future.

What is this site, exactly?

Scatter/Gather is a blog about the intersection of content strategy, pop culture and human behavior. Contributors are all practicing Content Strategists at the offices of Razorfish, an international digital design agency.


This blog reflects the views of the individual contributors and not necessarily the views of Razorfish.

What is content strategy?

Oooh, the elevator pitch. Here we go: There is content on the web. You love it. Or you do not love it. Either way, it is out there, and it is growing. Content strategy encompasses the discovery, ideation, implementation and maintenance of all types of digital content—links, tags, metadata, video, whatever. Ultimately, we work closely with information architects and creative types to craft delicious, usable web experiences for our clients.

Why "scatter/gather"?

It’s an iterative data clustering operation that’s designed to enable rich browsing capabilities. “Data clustering” seems rather awesome and relevant to our quest, plus we thought the phrase just sounded really cool.
