Broca Networks: A Natural & Essential Progression In Communication

About Speech Technology & Speech Technology Trends

Advanced Speech Recognition (ASR)

Advanced speech recognition (ASR) refers to the ability of a computer to recognise the spoken word. ASR software captures and digitises the sound of a human speaking. The digitised audio is then split into small pieces, each of which provides a spectral representation (an audio map) of the sound and is matched to specific phonemes (the basic sounds of language). Each phoneme is identified, taking into consideration accent, voice quality, background noise and so on, and the resulting sequence is matched against a grammar (a predefined list of phrases the caller may say).
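
The final step, matching against a grammar, can be sketched in a few lines of Python. This is a minimal illustration only and not how any particular ASR engine works. It assumes the recogniser has already produced a text hypothesis, and it uses the standard difflib module as a stand-in for real acoustic scoring. The grammar phrases, the function name match_against_grammar and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical grammar: the only phrases the caller may say at this step.
GRAMMAR = ["check balance", "transfer funds", "speak to an agent"]


def match_against_grammar(hypothesis, grammar=GRAMMAR, threshold=0.6):
    """Return (phrase, score) for the closest grammar phrase.

    If no phrase scores at least `threshold`, return (None, score) so the
    application can ask the caller to repeat.
    """
    # Text similarity stands in for the acoustic score a real engine would use.
    scored = [
        (SequenceMatcher(None, hypothesis.lower(), phrase).ratio(), phrase)
        for phrase in grammar
    ]
    score, phrase = max(scored)
    if score < threshold:
        return None, score
    return phrase, score


phrase, score = match_against_grammar("check my balance")
print(phrase, round(score, 2))
```

Because the grammar is small, an utterance that is slightly different from the expected phrase still resolves to the right entry, while anything unrelated is rejected rather than guessed at.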

It is important to draw a distinction between telephony-based ASR and desktop-based dictation software, which is most people's first encounter with speech recognition technology. Desktop dictation systems generally have to be trained to understand the user's voice before they can be used. They have the benefit of a relatively large vocabulary, although the software usually has little or no understanding of the meaning of what is said. Telephony-based ASR, as used by Broca, requires no training and works across the whole range of UK accents. However, this flexibility comes at the cost of a smaller vocabulary that can be recognised at each step in the dialog. As such, telephony-based ASR is ideal for goal-directed dialogs, such as entering and looking up information, but is not suitable for free-form dictation.

Occasionally there is confusion between a speech recognition system and a simple PC-based speech dictation system. Speech dictation systems are an extremely basic use of speech technology. With these systems the user is constrained by a small recognition database and has to spend time training the software so that it can recognise their voice. The speech technology solutions available today are applications that run on powerful servers, and they are speaker independent.

ASR software is only one component of a complete speech technology solution.

Text-to-Speech (TTS)

Almost all speech recognition applications, and all telephone-based applications, must be able to communicate back to the user via speech. There are two options for producing this speech. The most natural and effective approach is to piece together recordings of a human (a voice artist) saying predefined phrases (prompts).

However, in some applications, the information that needs to be spoken back to the user is too extensive, or changes too often, for recordings to be a practical solution. Examples include speaking back unrestricted text such as a web page, a news article or an email.

Text-to-Speech (TTS) technology provides another solution, based on computer-generated synthesised speech. The TTS engine generates human-like speech from pieces of text. For example, a sentence from a recently received email could be converted into speech using TTS software. Synthesised speech does not have the same clarity as real human speech, but this area is rapidly improving. The TTS engines currently available do provide a real and usable solution for today's requirements.

Carefully combining human recorded prompts and TTS technology is often the best solution when deploying a complete speech solution.
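
One way to picture this combination is a playback planner that prefers a recording whenever one exists and falls back to TTS for everything else. The sketch below is a simplified illustration under stated assumptions: the prompt library, the file paths and the function name plan_playback are hypothetical, and a real platform would hand the resulting plan to its own audio and TTS components.

```python
# Hypothetical library of pre-recorded prompts, keyed by the exact phrase.
RECORDED_PROMPTS = {
    "You have": "prompts/you_have.wav",
    "new messages.": "prompts/new_messages.wav",
}


def plan_playback(segments, recordings=RECORDED_PROMPTS):
    """Map each text segment to a recording if one exists, otherwise to TTS."""
    plan = []
    for text in segments:
        if text in recordings:
            # Fixed wording: play the voice artist's recording.
            plan.append(("recording", recordings[text]))
        else:
            # Variable content: synthesise it at run time.
            plan.append(("tts", text))
    return plan


for step in plan_playback(["You have", "3", "new messages."]):
    print(step)
```

The fixed parts of the sentence keep the quality of a human voice, and only the part that changes from call to call is synthesised.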

Voice Authentication and Voice Identification

Many applications need to identify who the caller is, in order to access personal or corporate information or simply to provide a more personalised response. When high security is not required, the user can be identified by requesting that they answer a series of questions, for example, account number and PIN. With the increasing use of mobile phones as a personal communicator, a caller can often be seamlessly identified by matching their Caller Line ID (CLID) against a database of known users.

In higher-security applications the caller's identity claim can be further verified by using Voice Authentication technology. This analyses a large number of quantitative characteristics of the sound of the caller's voice and matches them against the stored biometric profiles of the valid users. In effect, Voice Authentication tries to determine the physical characteristics of the caller that modify the pitch and quality of the voice, such as the length of their vocal cords or the size of their chest cavity. These characteristics are very difficult to forge, and when Voice Authentication works in parallel with a dialog-based verification strategy, very high levels of authentication can be achieved.

Voice Identification provides the capability to identify a known user without them having to explicitly identify themselves. A typical use might be to enable access to a building using voice rather than a pass code or swipe card. Voice identification is rarely used for telephony applications, as it is unable to handle the large numbers of users that might be registered on a phone-based system (unlike Voice Authentication where the appropriate voice print is retrieved at the time it is needed, using some other identification means such as user ID, or account number).
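
The difference between the two can be sketched as a one-to-one check versus a one-to-many search. The example below is a toy illustration, not a real biometric algorithm: it assumes each voice has already been reduced to a short list of numbers, and the stored profiles, the 0.95 threshold and the function names authenticate and identify are all hypothetical.

```python
import math

# Hypothetical stored voice prints: user ID -> feature vector.
VOICEPRINTS = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def authenticate(claimed_id, sample, threshold=0.95):
    """One-to-one: compare the sample only with the claimed user's profile."""
    profile = VOICEPRINTS.get(claimed_id)
    return profile is not None and cosine(sample, profile) >= threshold


def identify(sample, threshold=0.95):
    """One-to-many: search every stored profile for the best match."""
    best_id, best_score = max(
        ((uid, cosine(sample, profile)) for uid, profile in VOICEPRINTS.items()),
        key=lambda pair: pair[1],
    )
    return best_id if best_score >= threshold else None


sample = [0.88, 0.12, 0.31]
print(authenticate("alice", sample), authenticate("bob", sample), identify(sample))
```

Authentication does the same amount of work however many users are registered, whereas identification has to compare the sample with every stored profile, which is why it scales poorly to large phone-based systems.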

VoiceXML (Also known as VXML)

VoiceXML is an emerging standard for developing speech services. VoiceXML has attracted great interest because it is the first proposed standard for the speech industry and is similar in concept to HTML - the document standard used to create Web pages.

A key advantage of VoiceXML is that it ultimately provides vendor independence. An application built with VoiceXML can run on any VoiceXML-based platform (much like a web page written in HTML can be displayed by Internet Explorer or Netscape).
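
To show the HTML-like flavour, the sketch below embeds a small VoiceXML 2.0 document in a Python string and reads it with the standard xml.etree.ElementTree parser. The vxml, form, field and prompt elements and the digits field type come from the VoiceXML 2.0 specification. The form itself, its wording and the helper name list_prompts are hypothetical, and a real platform would interpret the document rather than simply list its prompts.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical two-question dialog written as a VoiceXML 2.0 document.
DOCUMENT = """
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <field name="account" type="digits">
      <prompt>Please say your account number.</prompt>
    </field>
    <field name="pin" type="digits">
      <prompt>Please say your PIN.</prompt>
    </field>
  </form>
</vxml>
"""


def list_prompts(document):
    """Return (field name, prompt text) pairs in document order."""
    root = ET.fromstring(document.strip())
    ns = {"v": VXML_NS}
    return [
        (field.get("name"), field.findtext("v:prompt", namespaces=ns))
        for field in root.iterfind(".//v:field", ns)
    ]


for name, prompt in list_prompts(DOCUMENT):
    print(name, "->", prompt)
```

Because the dialog is plain markup, any standards-compliant tool can read it, which is the basis of the vendor independence described above.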

SALT

Speech Application Language Tags (SALT) is very similar to VoiceXML but with additional capabilities in the area of multi-modal and telephony access to information, applications and web services from PCs, telephones and PDAs.

Multi Modal access enables users to interact with an application in a variety of ways: input using speech, a keyboard, mouse etc., and output as synthesised speech, audio, plain text, motion-video and/or graphics.

Voice over IP

Voice Over IP (VoIP) converts analogue voice into computer data, which can then be transferred over a computer network rather than normal telephone lines. Voice is then transported with other data, providing an economical solution, especially in a corporate environment. The key issue to address with VoIP is that of latency. In a data environment any data loss can be managed with error handling software, typically by sending the data again. When transporting speech there is no time to do this, so any errors or lost packets would be very troublesome for a listener. Nevertheless, this issue can be overcome through robust application design and the use of VoIP networks that can guarantee prioritisation of voice data.
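
The effect of latency on speech can be illustrated with a simple playout deadline. In the sketch below, each voice frame must arrive before the moment it is due to be played, and a frame that arrives too late is as useless to the listener as one that never arrives. This is a simplified model, not a description of any particular VoIP product: the 20 ms frame length, the 60 ms playout delay and the function name classify_frames are assumptions chosen for illustration.

```python
def classify_frames(arrivals, total_frames, frame_ms=20, playout_delay_ms=60):
    """Label each voice frame as 'played', 'late' or 'lost'.

    `arrivals` maps a frame's sequence number to its arrival time in ms.
    Frame n is due for playback at n * frame_ms + playout_delay_ms.
    """
    result = {}
    for seq in range(total_frames):
        deadline = seq * frame_ms + playout_delay_ms
        if seq not in arrivals:
            result[seq] = "lost"    # never arrived
        elif arrivals[seq] <= deadline:
            result[seq] = "played"  # arrived in time
        else:
            result[seq] = "late"    # arrived after its playback slot
    return result


# Frames 2 and 3 arrive out of order; frame 4 never arrives.
print(classify_frames({0: 5, 1: 30, 3: 150, 2: 70}, total_frames=5))
```

A larger playout delay rescues more late frames but adds to the delay the listener hears, which is why networks that prioritise voice data matter.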

Multi Modal

Multi-modal interfaces will be an important element in the future of speech technology. Essentially, they enable the individual to use hand-held devices, such as WAP and wireless phones or PDAs, to mix text, graphics and voice. This allows for faster and more flexible data input by integrating speech and the keypad, and in many instances may make it easier to review the output.

Mixed Modal

Mixed-modal applications are similar to multi-modal applications except that the different input and output technologies are used at different times depending on the environment and the devices that the user has available.

For example, a stockbroking application might primarily use a web-based interface to display share price graphs on a desktop computer. This is fine for research, but does not help the user if they are in a car at the moment when their price trigger is reached. A mixed-modal application, however, would still enable the user to complete a trade using just a mobile phone and a voice interface.

In practice mixed modal applications might also be multi-modal on some devices for some users. However, whereas multi-modal applications generally require next generation communication devices (such as 3G mobile phones) many mixed modal applications can take advantage of currently available technologies.

Natural Language Understanding

Today's natural language speech recognition technologies achieve more lifelike interactions. Users are no longer limited to a narrowly constrained range of expressions that must be uttered one word at a time. Applications today can recognise and understand more complex expressions spoken in a more natural, free-style manner. Advances in natural language technology are made possible through good dialog design and new capabilities, such as powerful statistical models that recognise free-style speech and extract key concepts to determine the meaning of the user's input. These systems have some of the characteristics of an artificial intelligence system.
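
The idea of extracting key concepts from free-style speech can be sketched as follows. Note that this example uses simple keyword spotting rather than the statistical models mentioned above, so it shows only the shape of the result, not the technique. The keyword table, the function name extract_concepts and the share-dealing scenario are hypothetical.

```python
import re

# Hypothetical mapping from spoken keywords to the concept they signal.
INTENT_KEYWORDS = {"buy": "BUY", "purchase": "BUY", "sell": "SELL"}


def extract_concepts(utterance):
    """Pull an intent and a quantity out of a free-style utterance."""
    words = re.findall(r"[a-z0-9]+", utterance.lower())
    # First recognised keyword decides the intent; filler words are ignored.
    intent = next((INTENT_KEYWORDS[w] for w in words if w in INTENT_KEYWORDS), None)
    # First whole number in the utterance is taken as the quantity.
    quantity = next((int(w) for w in words if w.isdigit()), None)
    return {"intent": intent, "quantity": quantity}


print(extract_concepts("I'd like to sell 200 shares please"))
```

The caller does not have to say the words in a fixed order or one at a time. The application keeps only the concepts it needs and discards the rest.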

Future natural language understanding systems will be able to listen to more complicated speech and draw conclusions from it, in the same way that a human can draw a conclusion from reading, for example, a newspaper article. This form of Artificial Intelligence is still some way from general deployment.

Copyright 2005: Broca Networks: All Rights Reserved