W3C Voice Browser Activity

  • Standards for Voice and Dialogue applications
    • VoiceXML
    • SRGS
    • SISR
    • SSML
    • PLS
    • Call Control XML
    • State Chart XML
  • W3C Recommendations

VoiceXML

  • Language for developing dialogue applications.
  • Specification
  • Primarily targeted at telephone applications.
    • telephone support automation
    • railway/bus schedule information
    • ticket reservation
  • Describes the algorithm for dialogue flow control (the dialogue strategy).
  • The dialogue flow can alternatively be described by a finite state automaton with output (Mealy automaton).
    • SCXML
  • W3C standard (current version 2.1; version 3.0 is a Working Draft).

VoiceXML - processing

  • An application must be run on a VoiceXML platform or with a VoiceXML interpreter.
    • desktop platforms - OptimTalk, publicVoiceXML, JVoiceXML, …
    • open-source online - Asterisk+VoiceGlue, Asterisk+OpenVXI, …
    • commercial online:
      • Bevocal Cafe
      • Voxeo Prophecy
    • VoiceXML forms in XHTML documents
      • using namespaces (formerly W3C submission XHTML+Voice profile 1.0)
      • Supported in the Opera and Firefox web browsers.

VoiceXML - example

Figure: VoiceXML example

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
   <form id="pizza-mixed">
      <grammar src="pizza.grxml"/>
      <initial name="pizzaall">
       <prompt>Welcome to FI pizzeria</prompt>
       <nomatch count="2"><assign name="pizzaall" expr="true"/></nomatch>
       <noinput count="2"><assign name="pizzaall" expr="true"/></noinput>
      </initial>
      <field name="kind">
       <prompt>What kind of pizza do you want?</prompt>
       <nomatch>We have salami, mozzarella and apollo pizza</nomatch>
       <noinput>We have salami, mozzarella and apollo pizza</noinput>
       <grammar src="pizza.grxml#kind"/>
      </field>
      <field name="topping">
       <prompt>What topping do you want?</prompt>
       <nomatch>We offer ketchup and chilli.</nomatch>
       <noinput>We offer ketchup and chilli.</noinput>
       <grammar src="pizza.grxml#topping"/>
      </field>
      <field name="drink">
       <prompt>What do you want to drink?</prompt>
       <nomatch>Select one of coke, sprite and water</nomatch>
       <noinput>Select one of coke, sprite and water</noinput>
       <grammar src="pizza.grxml#drink"/>
      </field>
      <field name="ack">
       <prompt>Did you order <value expr="kind"/> pizza with <value
       expr="topping"/> and <value expr="drink"/>?</prompt>
       <grammar src="yesno.grxml"/>
      </field>
      <filled>
       <if cond="ack=='yes'">
            <prompt>Order submitted</prompt>
       <else/>
            <clear namelist="kind topping drink ack"/>
       </if>
      </filled>
   </form>
</vxml>

SRGS (Speech Recognition Grammar Specification)

  • Standard for describing context-free grammars.
    • describes the accepted inputs of particular VoiceXML fields
  • Specification
  • Part of W3C Voice Browser Activity standards
  • Current version: 1.0
  • SRGS - motivation
    • The user’s voice input needs to be recognized by continuous speech recognition.
    • The recognition success rate ranges from 50 to 99 %.
  • Ways to improve the success rate:
    • improve the language model
    • restrict the problem domain
    • improve the user model
  • Problem domain restriction + language model improvement = SRGS.

SRGS - example

Figure: SRGS grammar referenced in the previous VoiceXML example (pizza.grxml)

<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         root="mixed" xml:lang="en-US" tag-format="semantics/1.0">
<rule id="mixed">
   <item>
      <ruleref special="GARBAGE"/>
      <ruleref uri="#kind"/> pizza <ruleref special="GARBAGE"/>
      <ruleref uri="#topping"/> and <ruleref uri="#drink"/>
   </item>
   <tag>
   {
    out.kind=rules.kind;
    out.topping=rules.topping;
    out.drink=rules.drink;
   }
   </tag>
</rule>
<!-- public scope: this rule is referenced from VoiceXML as pizza.grxml#kind -->
<rule id="kind" scope="public">
   <one-of>
    <item>salami</item>
    <item>mozzarella</item>
    <item>apollo</item>
   </one-of>
</rule>
...
</grammar>

SISR (Semantic Interpretation for Speech Recognition)

  • Purpose:
    • What is the meaning of recognized input?
  • Language for deriving the semantics of recognized input.
  • Based on ECMAScript.
  • Used in speech recognition grammars (see previous slide).
  • SISR 1.0 Specification
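
A minimal sketch of how SISR tags attach a meaning to recognized words. It assumes a possible content for the yesno.grxml file referenced by the VoiceXML example; the rule name and the accepted words are illustrative and not taken from the original grammar. Several spoken variants are mapped to one semantic value, which is what the condition ack=='yes' in the VoiceXML form tests.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical yesno.grxml: the words and the rule id are assumptions -->
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="yesno"
         tag-format="semantics/1.0">
  <rule id="yesno" scope="public">
    <one-of>
      <!-- different surface forms, same semantic result -->
      <item>yes <tag>out = "yes";</tag></item>
      <item>yeah <tag>out = "yes";</tag></item>
      <item>correct <tag>out = "yes";</tag></item>
      <item>no <tag>out = "no";</tag></item>
      <item>nope <tag>out = "no";</tag></item>
    </one-of>
  </rule>
</grammar>
```

With such a grammar the dialogue logic only has to handle two values, regardless of how many spoken variants the grammar accepts.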

SSML (Speech Synthesis Markup Language)

  • link: Speech Synthesis Markup Language
  • W3C Standard
  • Current version: 1.1 (September 2010)
  • Used to describe the prosodic characteristics of synthesized speech.
    • loudness
    • pitch (intonation)
    • emphasis
    • speech rate
    • voice type (male, female, neutral)
  • Contains markup for describing the pronunciation of foreign words.
    • IPA (International Phonetic Alphabet) can be utilized.
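
The characteristics listed above could be combined in one document roughly as follows. This is a sketch only: the sentences are invented for illustration, and the IPA transcription of "mozzarella" is an approximate one chosen for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- voice type -->
  <voice gender="female">
    <!-- speech rate and loudness -->
    <prosody rate="slow" volume="soft">Welcome to FI pizzeria.</prosody>
    <break time="500ms"/>
    <!-- emphasis -->
    We have <emphasis level="strong">three</emphasis> kinds of pizza.
    <!-- pronunciation of a foreign word given in IPA (approximate) -->
    Today we recommend
    <phoneme alphabet="ipa" ph="ˌmɑːtsəˈrɛlə">mozzarella</phoneme>.
  </voice>
</speak>
```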

SSML - example of loudness and breaks

Figure: SSML Breaks and loudness control example

<?xml version="1.0" encoding="utf-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                          http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
      xml:lang="en-US">
   <prosody volume="loud">
      Good morning.<break/>
   </prosody>
   <prosody volume="default">
      How are you?
   </prosody>
</speak>

SSML - example of intonation modeling

Figure: SSML Intonation modeling

<speak ...>
   <prosody contour="(0%,50Hz) (75%,+10%) (80%,+20%) (90%,+30%)">
   Are you doing well?
   </prosody>
</speak>

PLS (Pronunciation Lexicon Specification)

  • Pronunciation Lexicon Specification
    • W3C standard
    • Current version: 1.0 (October 2008)
  • Developed to describe the pronunciation of words, abbreviations, etc.
  • Used for:
    • Speech synthesis (SSML) - pronunciation of
      • foreign words
      • abbreviations
      • number values
    • Speech recognition (SRGS) - PLS can describe alternative pronunciations of a word so that each variant is recognized correctly.

PLS Structure

  • Root element - lexicon
    • contains one or more lexicon entries - lexeme element
      • contains:
        • one or more word notations - grapheme element
        • one or more word pronunciations - phoneme element
          • pronunciation may be written using IPA, SAMPA, etc.

PLS - example

Figure: PLS pronunciation example

<?xml version="1.0" encoding="utf-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
      http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
    alphabet="ipa" xml:lang="cs-CZ">
   <lexeme>
   <grapheme>CSR</grapheme>
   <phoneme>tʃˈeː ˈes ˈer</phoneme>
   <phoneme>tʃˈeskaː rˈepublˌika</phoneme>
   </lexeme>
</lexicon>
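
A sketch of how an SSML 1.1 document might use the lexicon above during synthesis. The lexicon URI and the xml:id value are placeholders assumed for this example. The lexicon element loads the PLS document, and lookup limits its use to the enclosed text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="cs-CZ">
  <!-- placeholder URI: points to the PLS document shown above -->
  <lexicon uri="http://www.example.com/lexicon.pls" xml:id="csr"/>
  <!-- inside lookup, the grapheme CSR is pronounced as the lexicon says -->
  <lookup ref="csr">
    CSR
  </lookup>
</speak>
```

SRGS grammars have a lexicon element as well, so the same PLS document can serve both synthesis and recognition.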

Call Control XML

  • Voice Browser Call Control eXtensible Markup Language
  • Provides declarative markup to describe telephony call control
    • routing calls to the appropriate application or human operator
    • merging multiple calls into a conference call
    • the ability to place outgoing calls
    • handling a richer class of asynchronous events
    • handling the outside call queue for VoiceXML
    • etc.
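
A minimal CCXML sketch of the first task in the list above: accepting an incoming call and handing it to a VoiceXML application. The file name pizza.vxml is an assumption made for this example; the events are the standard connection and dialog events of CCXML 1.0.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- an incoming call is ringing: answer it -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- the call is connected: start the VoiceXML dialogue
         (pizza.vxml is a hypothetical file name) -->
    <transition event="connection.connected">
      <dialogstart src="'pizza.vxml'"/>
    </transition>
    <!-- the dialogue finished or the caller hung up: end the session -->
    <transition event="dialog.exit">
      <exit/>
    </transition>
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

Call control is asynchronous and event driven, while the VoiceXML document handles only the conversation itself.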

State Chart XML

State Chart XML - Relation to Dialogue

  • A dialogue can be modeled using a Mealy automaton.
    • Mealy automaton - finite state automaton with an output function.
    • States of the automaton correspond to the states of the dialogue.
    • The transition function is driven by the user input.
    • The output function produces the dialogue system response.
  • A Mealy automaton can be described using SCXML (see the example).
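
As a sketch of this mapping, the pizza dialogue could be written as a Mealy automaton in SCXML roughly as follows. The event names (user.kind, user.nomatch, ...) are assumptions for this example, and log stands in for the real system output. Each transition is triggered by a user input and its executable content is the output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml version="1.0" xmlns="http://www.w3.org/2005/07/scxml"
       datamodel="ecmascript" initial="AskKind">
  <state id="AskKind">
    <!-- input: a recognized pizza kind; output: the next question -->
    <transition event="user.kind" target="AskTopping">
      <log label="system" expr="'What topping do you want?'"/>
    </transition>
    <!-- input not understood: stay in this state and output a hint -->
    <transition event="user.nomatch">
      <log label="system" expr="'We have salami, mozzarella and apollo pizza.'"/>
    </transition>
  </state>
  <state id="AskTopping">
    <transition event="user.topping" target="AskDrink">
      <log label="system" expr="'What do you want to drink?'"/>
    </transition>
  </state>
  <state id="AskDrink">
    <transition event="user.drink" target="Done">
      <log label="system" expr="'Order submitted'"/>
    </transition>
  </state>
  <final id="Done"/>
</scxml>
```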

SCXML - Demo

Example 1: Process planning demo

Process state diagram

SCXML - Demo

Example 1: Corresponding SCXML

<?xml version="1.0" encoding="UTF-8"?>
<!-- the initial state is given by the "initial" attribute of <scxml> -->
<scxml version="1.0" xmlns="http://www.w3.org/2005/07/scxml" initial="Created">
 <state id="Created">
  <transition target="Waiting" event="enqueue"/>
 </state>
 <state id="Waiting">
  <transition target="Running" event="assign"/>
 </state>
 <state id="Running">
  <!-- event names must not contain spaces: "event" is a space-separated list -->
  <transition target="Blocked" event="wait_for_resource"/>
  <transition target="Waiting" event="timeout"/>
  <transition target="Terminated" event="terminate"/>
 </state>
 <state id="Blocked">
  <transition target="Waiting" event="resource_available"/>
 </state>
 <final id="Terminated"/>
</scxml>