Talking Siri

Jake Keyes   October 17, 2011

Siri may be the first voice interface to get it right. (image via Marc Wathieu)

Will Siri, the new voice-recognition interface of the iPhone 4S, live up to Apple’s promises? It’s probably too soon to tell. The digital assistant demos well on stage, but the idea of talking aloud to your phone may prove too weird for most people to do in public. Regardless, Siri is evidence that Apple has realized something fundamental about voice interaction: the way we talk is essentially different from the way we read. And the same interface can’t work for both.

We’re all accustomed to visual and written interfaces, in which a user is asked to choose from a collection of words or symbols. Given lots of items, interface designers and content strategists tend to arrange those items in a hierarchy. Which makes sense: we can only focus on a few things at once, and without a hierarchy we’d have to deal with thousands of options laid out with equal weight in a flat and endless grid. To solve the problem of navigating this complexity, visual interfaces start at a high, general level, and work down to the specific. This holds true in print as well, in dictionaries, restaurant menus, and so on.

Many audio interfaces adopt this structure, too. Take automated phone support, and the typical menu system: “Welcome. Please choose from the following options…” On the surface, an audio hierarchy is appealing, for the same reasons as a visual one. How are you supposed to know what options are available without being prompted, or without having your choices constrained? Using a hierarchy should feel streamlined, necessary, and clean. But it doesn’t. People hate it. They hit 0 without listening, trying to break out of the command tree, which is evidence enough of how passionately people resist hierarchical voice command systems.

The fact is, audio interaction is essentially different from visual interaction. In day-to-day speech, no hierarchy is necessary. We can pluck any sentence from the bottomless pool of possible sentences. There’s no need for people to tell their roommates, “Hey, New Reminder. Payments > House and Home > Utilities > Electricity > Pay Bill.” That is hardly a natural way to speak, and navigating such a hierarchy successfully would take real user training. We want our computers to understand commands without that kind of preamble. In other words, we want audio interfaces to be flat.
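The contrast is easy to see in code. Below is a toy sketch, not anything from Apple or a real phone system: the menu tree and the intent table are both invented for illustration. The hierarchical model forces the user to traverse one level per prompt; the flat model maps a whole utterance to an action in a single step.

```python
# Hierarchical: the caller must walk the tree one prompt at a time.
MENU = {
    "payments": {
        "house and home": {
            "utilities": {
                "electricity": ["pay bill", "report outage"],
            }
        }
    }
}

def navigate(tree, *choices):
    """Walk the menu tree; any wrong turn raises and sends the user back to the top."""
    node = tree
    for choice in choices:
        node = node[choice]  # KeyError on an invalid option
    return node

# Flat: one utterance maps directly to one action, no traversal required.
INTENTS = {
    "pay the electric bill": "electricity/pay-bill",
    "remind me to pay the electric bill": "reminder/create",
}

def interpret(utterance):
    """Match a whole sentence against known intents in a single step."""
    return INTENTS.get(utterance.lower().strip())

# Four prompts versus one sentence, for the same action:
print(navigate(MENU, "payments", "house and home", "utilities", "electricity"))
print(interpret("Pay the electric bill"))
```

The hard part, of course, is that a real flat interface cannot rely on an exact-match lookup table; it has to recognize the endless variations of natural speech, which is exactly the problem Siri is trying to solve.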

For years we’ve struggled. Technology seems to be the bottleneck: computers are bad at interpreting the human voice, so we’ve had to significantly constrain the number of things a person can say. Search engines approximate something like an audio interface, in that they take a single input and offer possible solutions, but there’s a key difference: search engines don’t take our input and execute an action, or bring us directly to a single URL. Some refinement and judgment is always required on the part of the user. (If all we had was “I’m Feeling Lucky,” Google would be a strange and frustrating service.) We’ve never really achieved true flatness in any interface.

This is why the concept of Siri is so important. The virtual assistant is a promising step toward a true natural language interface. There are still constraints; for instance, Siri’s capabilities are limited, and she gets confused from time to time. But the concept is there, the idea of a perfectly efficient system for achieving an action: one command, one response.

This morning I asked Siri for directions home. She understood, and spent a moment thinking about my question. And then, something eerie happened. She pulled up my empty contact information from my address book. “I don’t know what your home address is, Jake.” Immediately, without thinking, I tapped in my information and saved it. I did exactly as I was told.


3 Responses

  1. Tosca says:

    “There’s no need for people to tell their roommates: Hey, New Reminder. Payments > House and Home > Utilities > Electricity > Pay Bill.” Amen to that. As someone who has literally screamed at both FedEx’s voice system and DMV’s, this post hit home. Hierarchy in speech isn’t simply unnecessary, it’s frustrating and clunky. Now I’m more excited than ever to get my new phone!

  2. David Capito says:

    The language you use is telling… “she.” It’s not easy for people to consider our devices in human terms, and when I read this article I get the impression that Siri really does break the traditional conventions of voice recognition.

    I could never imagine referring to my GPS as “she.”

    I myself own an Android phone, but I’m impressed with Apple’s attempt not just to make the *input* of language more natural, but also the *feedback*. Siri responds more like a human would, and that is key.

    The future of natural language will be the response.

    Or maybe I am eagerly awaiting the day my devices say, “I can’t let you do that, Dave.”

  3. Jeff Barbose says:

    There may be no need to speak that hierarchy, but the object model lives underneath it all. What matters in speech is context and what forms context is the connections between and among the objects in the model.

    As for the roommates and the electric bill, there’s not even a need to identify roommates, really. They have the same address you do, and you haven’t identified them as family or some other formal relationship. :)

    The reason I bring up an object model is that Apple has been a strong advocate of third-party developers building underlying object models, and of non-UI access to them, almost since the very beginning of the Macintosh. Non-UI access used to mean AppleEvents. Then AppleScript.

    Apple’s been building on its own technology for *that* long, and with that kind of historical perspective it’s easy to see why they acquired Siri, and not only how Siri might do its thing today, but how third-party developers might make their apps accessible to Siri in the future.




What is this site, exactly?

Scatter/Gather is a blog about the intersection of content strategy, pop culture and human behavior. Contributors are all practicing Content Strategists at the offices of Razorfish, an international digital design agency.

This blog reflects the views of the individual contributors and not necessarily the views of Razorfish.

What is content strategy?

Oooh, the elevator pitch. Here we go: There is content on the web. You love it. Or you do not love it. Either way, it is out there, and it is growing. Content strategy encompasses the discovery, ideation, implementation and maintenance of all types of digital content—links, tags, metadata, video, whatever. Ultimately, we work closely with information architects and creative types to craft delicious, usable web experiences for our clients.

Why "scatter/gather"?

It’s an iterative data clustering operation that’s designed to enable rich browsing capabilities. “Data clustering” seems rather awesome and relevant to our quest, plus we thought the phrase just sounded really cool.

Privacy Policy | Entries (RSS) | © Razorfish™ LLC All rights reserved.