Sunday, January 7, 2007

The Future of Human-Computer Interaction

Interfaces

The Future: Perceptual Interfaces

The other important piece of future interfaces should be "perception." The simplest example is speech recognition, or more accurately, speech-based interfaces. Another example is computer vision. Smart phones are excellent speech platforms, as already noted, but most also have cameras and a respectable amount of CPU power, especially in their digital signal processors. They are more than capable of computer vision using either still images or video from their cameras. A simple example is barcode recognition, which is already available on some camera phones (both 2D and 1D barcode readers have appeared on commercial phones). OCR (optical character recognition) for business-card recognition is also available commercially. Another example is TinyMotion, a phone software application that my lab has developed, which uses the video from a camera phone to compute the phone's motion relative to a background - just as an optical mouse does. This creates a software-only general-purpose 2D mouse for camera phones. TinyMotion is very useful for map browsing (which is why we developed it) in location-based cellphone services. It also turned out to be a nice interface for smart-phone games, which is probably a bigger market than its original target.
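To make the optical-mouse analogy concrete, here is a minimal sketch of one common way to estimate camera motion between two frames: exhaustive block matching, which tries every small shift and keeps the one with the lowest mean absolute difference. This is an illustration under stated assumptions, not TinyMotion's actual implementation; the names `estimate_shift`, `shift_frame`, and `max_shift` are hypothetical, and frames are assumed to be small grayscale images stored as lists of rows.

```python
def shift_frame(frame, dx, dy, fill=0):
    """Return a copy of frame moved right by dx and down by dy (test helper)."""
    height, width = len(frame), len(frame[0])
    return [
        [
            frame[y - dy][x - dx]
            if 0 <= y - dy < height and 0 <= x - dx < width
            else fill
            for x in range(width)
        ]
        for y in range(height)
    ]


def estimate_shift(prev, curr, max_shift=2):
    """Find (dx, dy) such that curr[y + dy][x + dx] best matches prev[y][x].

    Assumes max_shift is smaller than both frame dimensions, so every
    candidate shift leaves a non-empty overlap to compare.
    """
    height, width = len(prev), len(prev[0])
    best_shift, best_cost = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0, 0
            # Compare only the pixels that overlap under this candidate shift.
            for y in range(max(0, -dy), min(height, height - dy)):
                for x in range(max(0, -dx), min(width, width - dx)):
                    total += abs(prev[y][x] - curr[y + dy][x + dx])
                    count += 1
            cost = total / count  # mean absolute difference
            if cost < best_cost:
                best_cost, best_shift = cost, (dx, dy)
    return best_shift


# Synthetic 8x8 frame with a unique value per pixel, then moved by (1, 2).
prev_frame = [[10 * y + x for x in range(8)] for y in range(8)]
curr_frame = shift_frame(prev_frame, dx=1, dy=2)
print(estimate_shift(prev_frame, curr_frame))  # prints (1, 2)
```

Accumulating the per-frame shifts over time gives a cursor position, which is the sense in which the camera acts as a 2D mouse. A real phone implementation would presumably subsample the image and restrict the search to stay within its CPU budget.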

These niche applications for vision on phones are suggestive, but perhaps not really convincing of the economic value of computer vision for phones. Let's look for a moment at "social media," personal data such as photos and videos that are shared with friends and family. As argued before, the phone is a communicating and social platform, and photo sharing is likely to be one of the most popular uses of multimedia on the phone. With collaborators at Berkeley and in industry, we explored face recognition from camera-phone images. The application is precisely photo sharing and archival. The user will likely want to share a photo with the people who are in it, and would like metadata about who is in the photo so he or she can find it later when looking for specific people. Our results were interesting because we found not only that it was possible to recognize subjects reasonably well using computer vision, but also that the recognition accuracy improved significantly when context data was used alongside computer vision. While our system actually did its recognition on a PC rather than on the phone, we realized that the same state-of-the-art PC algorithms could easily have run on the smart phones we had used. Computer vision has a big role to play in managing personal media assets, and this reaches into the home, as well as the mobile market.
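One way to see why context can help is to treat it as a prior over who is likely to be in the photo and combine it with the vision score using Bayes' rule. The sketch below is only an illustration of that idea, not the method used in our study; the function `fuse_scores`, the names, and all of the numbers are hypothetical.

```python
def fuse_scores(vision_scores, context_prior):
    """Multiply per-person vision likelihoods by a context prior, then normalize."""
    unnormalized = {
        name: vision_scores[name] * context_prior.get(name, 0.0)
        for name in vision_scores
    }
    total = sum(unnormalized.values())
    if total == 0:
        # No candidate is supported by both sources; return all zeros.
        return dict.fromkeys(vision_scores, 0.0)
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical face-match scores: vision alone slightly prefers "alice".
vision = {"alice": 0.5, "bob": 0.4, "carol": 0.1}
# Hypothetical context prior, e.g. who usually appears in this user's photos.
prior = {"alice": 0.1, "bob": 0.8, "carol": 0.1}

posterior = fuse_scores(vision, prior)
best_match = max(posterior, key=posterior.get)
print(best_match)  # prints bob
```

Here the face matcher on its own would pick "alice", but the context prior tips the decision to "bob". An ambiguous visual match is exactly the case where context is most useful.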

Turning to ASR (automatic speech recognition) and VUIs (voice user interfaces), we saw a boom in these industries in 2000, followed by a contraction for several years. But 2000 was also the era of wild promises and unrealistic expectations. What should have happened with speech? First of all, when PCs were mostly in offices, VUIs didn't make much sense. Nothing wrong with the technology, but speech is a poor match for most office work. Let's not forget the significant advantages of text for routine business communication: You can scan text for what you want, you can read back and forth if you don't understand, you can edit text while you're writing it to make sure you say exactly what you mean, and you can forward text through a long chain of readers without losing its meaning. Written text is generally less ambiguous than spoken language that expresses the same meaning - we're not really aware of this, but we're trained from an early age to take more care with text. Furthermore, you can work on text documents without your neighbors listening in. Much knowledge work is about managing structured or semi-structured information (even before computers came along). Most organizations relied on paper to store and move this information around with precision and robustness (again before computers). Speech technology can certainly play a role, but it's wrong to think about displacing most of the "paperwork" in office environments. As Jordan Cohen (formerly of VoiceSignal, now of SRI International) points out in his interview in this issue, the way to succeed with speech technology is first to identify the market where it makes sense.

Let's remember the lessons from the Xerox Star. The Star was all about having a real-use context (office work) and identifying an appropriate set of user tasks. Phones are primarily about communicating using a variety of media (sound, images, text) and to an increasing extent about sharing and archiving those media. To support and augment those communication services, we need some knowledge of what's "in" those media, which is exactly a machine perception task. Furthermore, if phones are to provide other services (besides communication) to users, they also need to interpret the user's intent through whatever interfaces the phone possesses. I already remarked on users' toils with phone menus and buttons, while at the same time the phone is a beautifully evolved speech platform. Speech interfaces do indeed look like a great choice. They continue to improve in performance, and the state of the art is already much better than people realize.

Until last year, like most HCI researchers, I was skeptical about the value of speech interfaces in HCI. But then I saw a Samsung phone (P207) shipping with large-vocabulary speech recognition and getting very good user reviews in all kinds of publications (including the hard-to-impress business market).

I also taught a class on medical technologies and had a chance to meet with many caregivers. There is already a large speech industry in medicine, and it is widely seen as one of the key technologies moving forward (it has probably already eclipsed "office ASR" and is a significant part of the speech recognition industry overall).

I had committed the cardinal sin of generalizing experience from a technology in one context (VUIs in the office) to its application in a different context. It's the technology-in-context complex that matters. ASR-on-phones and ASR-in-medicine are brand new markets. Their users don't know or care about the history of speech in the office. They just buy it and use it, and they either like it (so far, so good) or they don't.

My only direct experience with speech interfaces had been with the burgeoning automated call-center industry, and that experience had been quite bad. But after learning more about the state of the art (Randy Allen Harris's Voice Interaction Design or Blade Kotelly's The Art and Business of Speech Recognition are excellent guides), I realized that there are many superb examples of voice interface design. It's a lot like Web sites and GUIs in the 1980s. The practice of human-centered user interface design was not widely known back then, but as the HCI discipline grew both in academia and industry, best practices spread. Products that didn't follow a good user-centered process were quickly displaced by competitors that did. There is an excellent set of user-centered design practices for speech interfaces that are very similar to the practices for core HCI. As yet, they aren't widely adopted, but the differences between systems that follow them and those that don't are so striking that this cannot last forever.

It has also become clear that the recognition accuracy of the ASR part of the interface is not the limiting factor - it's the quality of the overall VUI design and the match of the application to its context. In other words, there's no reason to wait for future technical magic before using speech interfaces. You can write excellent ones now, assuming speech interaction fits your application context.

After these epiphanies, I moved a significant amount of activity in my group to speech and dialog-based interfaces (specifically, we started four new projects). While there are very good practices in speech interface design today and many useful services that can be built with them, there are still significant challenges and room for improvement. Those limits have to do with the shared understanding between a human and a machine sharing a speech interface. This is why speech interfaces are also a rich research area. Much of the shared information is the context we have already been talking about, and all of the aforementioned projects are coupled with our work on context-awareness.