SALT: A language to support voice on the web
Interactive voice response (IVR) systems have been around for many years and are, for many people, their first experience of speech technology.
After the World Wide Web took off in the 1990s and web technologies became standardized and popular, voice technology developers began looking for ways to combine voice with the web. The obvious first approach was to pair the previously specialized phone/IVR technologies with the capabilities of web server infrastructure, and it quickly became clear that a standard, specialized XML-based language was needed: one that could define the content and logic controlling IVR applications in a web environment. The VoiceXML language was developed by the VoiceXML Forum (http://www.voicexml.org) to meet this need. The forum was founded in 1999 by AT&T, IBM, Lucent and Motorola and now has hundreds of members. The introduction of VoiceXML not only enables open and more flexible IVR solutions, it also allows voice access to web applications.
In parallel with these developments in voice and web integration, personal computers (PCs) became powerful enough to handle the basic tasks of speech technology: speech recognition (input) and speech synthesis (output). To appreciate the feat, consider that in the 1970s Carnegie Mellon University's Harpy system needed the power of some 50 computers to perform continuous speech recognition at a natural conversational speed.
Thanks to advances in the PC as well as in voice technology itself, computer users began to get the chance to experience speech technology on their own machines. Today, ordinary computer users can choose voice products from many vendors that support a range of tasks, from controlling the computer with spoken commands and reading text aloud to dictating directly into a text editor.
Software developers also benefit from combining speech technology with the PC. There are many options for incorporating voice technology into applications, from the development kits of platform vendors, such as Microsoft's Speech API (SAPI) and Apple's PlainTalk, to tools from independent software firms, such as NeuVoice's voice and biometrics SDKs.
As the PC's capacity for voice processing increased, a new concept in voice integration began to emerge: the 'multimodal' application, which supports spoken interaction on the client side in addition to the usual input/output methods of keyboard, mouse and screen. Because web browsers have become ever more popular and capable of presenting a rich user interface, it is clear that developing multimodal applications requires closer integration of speech technology with the data and execution models used on the web, such as event-based scripting languages, the DOM (Document Object Model) and HTML.
Although VoiceXML has been extended to provide this level of integration, some in the industry believed a different model was needed. This led to the development of SALT (Speech Application Language Tags) by the SALT Forum (www.saltforum.org), established in 2002 by companies including Microsoft, Intel, Cisco and Philips, and now counting more than 70 member companies. The goal of SALT is to bring speech technology to a wide range of computing devices, from PCs to PDAs, through multimodal applications. To achieve this goal, SALT takes a different approach than VoiceXML.
Differences between VoiceXML and SALT
VoiceXML was originally designed to provide a comprehensive environment for building voice applications. It provides elements that define data (forms and the fields within them) and the flow of execution, and it supplies an execution environment that interprets the VoiceXML markup at run time. In general, web/voice integration is accomplished through interaction between the web server and the VoiceXML server.
For example, in a typical VoiceXML application the phone plays the same role as a web browser: the input is the user's speech, which is sent to the VoiceXML server to be interpreted against a command document made up of VoiceXML elements. Instructions in the document can tell the VoiceXML server to connect to a URL on a web server, where a web script (for example, JavaScript) interacts with the application or website and responds with the corresponding VoiceXML. The VoiceXML server receives this response from the web server, interprets it, and sends the appropriate spoken output to the user over the phone.
In contrast, SALT provides a minimal environment for building a voice application. It limits itself to defining the data and behavior specific to speech interaction, such as listening for input and specifying the grammar used to interpret that input. Everything else, such as the definition of forms and form elements, is delegated to the markup language in which SALT is embedded, such as HTML.
To provide this level of integration, SALT tags are built as XML elements that act as extensions to today's markup languages, such as HTML, XHTML and WML. Furthermore, SALT elements expose a DOM interface, so they participate as full members of the host page's data model. This means they have methods, properties and events that can be accessed directly from ECMA-compatible scripting languages such as JavaScript, in exactly the same way as any other element of a DOM-compliant web page.
In short, SALT is designed specifically to fit the data model and execution environment that web developers already use. The full SALT technical specification is available on the SALT Forum website (www.saltforum.org).
SALT application
Let's consider a basic multimodal application built with the four most important SALT elements: <prompt>, <listen>, <grammar> and <bind>.
First of all, you need a way to prompt the user for input. The SALT <prompt> element specifies content to be spoken, such as a greeting, instructions or a confirmation. The content can be text to be spoken or an audio file. With an audio file the job is simply to play it back; with text, a TTS (text-to-speech) engine is needed to turn it into speech.
Listing 1 illustrates using the <prompt> element to speak the text 'Welcome to a SALT multimodal application'. The element's oncomplete event fires when delivery of the prompt is finished and can be used to activate the next element.
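Listing 1 itself is not reproduced in this excerpt, so here is a minimal sketch of such a prompt. It assumes the salt namespace prefix and the element ids (promptWelcome, listenTag) used later in the walkthrough:

    <salt:prompt id="promptWelcome" oncomplete="listenTag.Start()">
        Welcome to a SALT multimodal application.
    </salt:prompt>

When the prompt finishes playing, its oncomplete handler starts the listen element.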
The <listen> element provides speech recognition control: it turns speech into text and handles the recognition result. Listing 2 illustrates its use. In this example it has two child elements: a <grammar> element that defines what the recognizer may accept, and a <bind> element that points to another element of the page, such as an HTML tag, and copies the result into it.
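Since Listing 2 is likewise not reproduced here, the following sketch shows what such a listen element could look like, reusing the ids from the later walkthrough (listenTag, selectTag) and the grammar file described next; the XPath in the bind element is an assumption about the shape of the recognition result:

    <salt:listen id="listenTag" onreco="ProcessInput()">
        <!-- what the recognizer is allowed to accept -->
        <salt:grammar src="TagGrammar.xml" />
        <!-- copy the recognized value into another element on the page -->
        <salt:bind targetelement="selectTag" value="//tag" />
    </salt:listen>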
The <grammar> element determines exactly what the program can recognize and can be thought of as a vocabulary list. The recognizer translates the digital signal representing the utterance into sounds and matches that phonetic representation against the words in the list defined by the grammar. If a match is found, the result is returned. The grammar can be specified inline, or it can live in a separate file pointed to by the element's src attribute. In Listing 2 the grammar is contained in the external file 'TagGrammar.xml'; the source of this file, shown in Listing 3, defines recognition of the three words 'prompt', 'listen' and 'grammar'.
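Listing 3 is not included in this excerpt either; one plausible version of TagGrammar.xml, written in the W3C SRGS XML format commonly used for SALT grammars, might be:

    <grammar xmlns="http://www.w3.org/2001/06/grammar"
             version="1.0" xml:lang="en-US" root="tag">
        <!-- accept exactly one of the three words -->
        <rule id="tag" scope="public">
            <one-of>
                <item>prompt</item>
                <item>listen</item>
                <item>grammar</item>
            </one-of>
        </rule>
    </grammar>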
With these four basic SALT elements, we can assemble the sample application shown in Listing 4. First of all, the page must reference the SALT definition on its first line.
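That first line does not appear in this excerpt; assuming the salt prefix used in the sketches above, the declaration binding the prefix to the SALT namespace would look roughly like this:

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">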
When the page loads into the browser, the page's onload handler starts the welcome prompt (sketched below).
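The onload line itself is missing from this excerpt; assuming the prompt id promptWelcome and SALT's Start() method, it would look roughly like:

    <body onload="promptWelcome.Start()">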
The TTS engine speaks the sentence 'Welcome to a SALT multimodal sample. You may choose ...'. When the prompt finishes, its oncomplete event fires and starts the listen element listenTag. This activates the recognizer to listen for spoken input; when input is detected, it attempts recognition against the grammar loaded by the child <grammar> element.
If recognition succeeds, the child <bind> element is executed. It assigns the value of a node in the recognizer's XML output to the selectTag element, which in this example is defined in HTML as a drop-down list; the effect is that the corresponding option in the HTML SELECT control becomes selected. After the bind has executed, the successful-recognition event onreco is raised, which calls the JavaScript function ProcessInput(). ProcessInput() then displays a statement reporting the text currently selected in the SELECT control. Note that ProcessInput() is also called by the SELECT control's onchange event, which fires when the user chooses an option with the mouse or keyboard; the HTML side of the sample is sketched below.
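The HTML and script from Listing 4 are not reproduced here; the sketch below shows one plausible shape for them. The drop-down list id selectTag and the handler name ProcessInput() come from the walkthrough above, while the output element id outputTag is invented for this sketch:

    <select id="selectTag" onchange="ProcessInput()">
        <option value="prompt">prompt</option>
        <option value="listen">listen</option>
        <option value="grammar">grammar</option>
    </select>
    <p id="outputTag"></p>

    <script type="text/javascript">
        // Report the option currently selected in the SELECT control, whether it
        // was chosen by voice (onreco) or by mouse/keyboard (onchange).
        function ProcessInput() {
            var sel = document.getElementById("selectTag");
            document.getElementById("outputTag").innerText =
                "You chose: " + sel.options[sel.selectedIndex].text;
        }
    </script>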
This simple multimodal application illustrates how SALT lets voice drive web page elements while still allowing users to interact with those elements with the mouse and keyboard in the traditional way. On the programming side, it also illustrates how smoothly SALT elements interact with the data model and execution environment of an ordinary web page.
SALT with Microsoft web technology
Because SALT fits the popular web data model and execution environment (HTML, the DOM, XML and scripting languages), it also fits easily into common development environments. Let's look at how Microsoft supports SALT in the IIS (Internet Information Server) and IE (Internet Explorer) environments.
First of all, the client computer must have the Microsoft Speech Add-in for IE installed. This add-in provides additional DLLs for IE 6.0 and performs SALT interpretation in the browser. Note that if you have installed the Speech Application SDK (SASDK), this step is not required.
Next, create a directory on the IIS server and set it up as the application's root directory. In this new directory, create a file containing the SALT application code (use the code from the 'SALT application' section above) and give it an appropriate extension, for example SimpleSALT.slt. Then create another file containing the grammar and name it, for example, TagGrammar.xml.
Finally, set the MIME type on the IIS server for pages containing SALT markup. For example, if the file is named SimpleSALT.slt, you need to map *.slt files to the SALT MIME type. One way to do this is with the Internet Services Manager tool: select the application's root directory or website, open its properties page, select HTTP Headers, click the File Types button in the MIME Map section, and create a new type with the extension '.slt' and the content type 'text/salt+html'. This causes the Microsoft Speech Add-in to be activated automatically and to handle SALT on behalf of IE.
You can now use your browser to open the SALT-enabled page in the directory you created and run the application. If you 'view source' in the browser, you will notice that, because of the MIME mapping, some extra markup was added to your page by IIS before it was sent to the browser.
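The exact lines depend on the version of the Speech Add-in; in general they consist of an object element that instantiates the SALT interpreter plus an import that binds the salt namespace to it, roughly of this shape (the CLSID below is only a placeholder, not the real identifier):

    <!-- placeholder CLSID; the actual value comes from the installed Speech Add-in -->
    <object id="SaltInterpreter"
            classid="clsid:00000000-0000-0000-0000-000000000000"></object>
    <?import namespace="salt" implementation="#SaltInterpreter" ?>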
These inserted lines tell the Speech Add-in to handle the SALT elements in the page on behalf of IE.
Conclusion
While this article focuses on the client side of a multimodal application, note that SALT also supports voice-only applications with no graphical interface. In that case the phone acts as the terminal, and a SALT interpreter works with speech and telephony servers in a manner similar to the traditional VoiceXML model.
In addition to the Microsoft technologies used to deploy the sample application in this article, a number of products from other vendors support SALT (see the SALT Forum website). There is also OpenSALT, an open-source solution from Carnegie Mellon University, a SALT Forum member. OpenSALT provides a SALT 1.0-compatible open-source browser based on Mozilla's open-source browser, using the open-source Festival speech synthesizer and the Sphinx speech recognizer.