Essay #132 - Will Speech Technology Ever Work?

In closing, I must ask the question: will it ever work? The answer, of course, is yes. Speech recognition and its related technologies (e.g., speaker verification, text-to-speech, audio indexing, speech data mining, dictation) will work. Indeed, they already do. They will fill their respective application niches almost completely, and the majority will do so quite soon. What will change is the definition of “work.”

Speech recognition is primarily a user interface technology*. As such, it works when it disappears. It’s really that simple. When users are not thinking about the user interface, but are instead accomplishing the task to which the interface connects them, then and only then can the interface be said to be “working.” We have to stay on message with this fundamental fact if we are ever to bring speech to the performance level where we can legitimately claim that it “works.”

* This is almost universally true. But there are applications for ASR that do NOT need a human in the loop, are not user-interface applications, and have not really been discussed in this book. These include data mining, automatic closed-caption generation, audio indexing, and similar scanning, tagging, or bookmarking activities. Most are a little farther out, but are achievable with current paradigms. All are, in keeping with the thesis of this book, niche markets.

In one of my favorite books, Understanding Computers and Cognition, Terry Winograd and Fernando Flores make this point clearly:

“objects and properties are not inherent in the world, but arise only in an event of breaking down in which they become present-at-hand. As I sit here typing a draft on a word processor, … I think of words and they appear on my screen. [The computer as such does not exist.] There is a complex network of equipment that includes my arms and hands, a keyboard, and many complex devices that mediate between it and a screen. None of this equipment is present for me except when there is a breaking down” [pp. 36-37].
I often say, “A speech application works until it breaks. As long as it works, it is unnoticed. Once it breaks, it tends to stay broken.” The statement calls attention to the fact that speech as a medium results in unstable applications, for reasons discussed throughout this book. The inherent uncertainty of the medium, coupled with the time and memory constraints belabored elsewhere, makes breakdown almost inevitable. Once there is a breakdown, recovery becomes difficult because of error amplification problems.

As long as practitioners insist on making the speech technology the task itself (that is, not an interface, but an end goal in its own right), the technology will not “work.” This is because the definition of “work,” the rule against which success is measured, will keep changing and expanding faster than technology can deliver, until everyone eventually quits out of sheer exhaustion. There is no sustainable pleasure in conversing with a non-sentient machine. This fact will remain true no matter how often we tell ourselves otherwise.