W3C Voice Browser Activity

  • Standards for Voice and Dialogue applications

    • VoiceXML

    • SRGS

    • SISR

    • SSML

    • PLS

    • Call Control XML (CCXML)

    • State Chart XML (SCXML)

  • W3C Recommendations and Working Drafts

VoiceXML

  • Language for developing dialogue applications.

  • Specification

  • Primarily targeted at telephone applications.

    • telephone support automation

    • railway/bus timetable information

    • ticket reservation

  • Describes the algorithm for dialogue flow control (dialogue strategy)

  • Alternatively, the dialogue flow can be described by a finite-state automaton with output (Mealy automaton)

    • SCXML

  • W3C standard (current version 2.1; version 3.0 is a Working Draft)
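
The automaton view mentioned above can be illustrated with a minimal SCXML sketch of a pizza-ordering dialogue. This is a simplified, system-directed illustration, not a definitive design: the event names (user.kind, user.topping, user.drink, user.yes, user.no) are hypothetical and would have to be raised by whatever recognizer drives the state machine, and the log elements only stand in for the spoken output attached to each transition.

Figure: SCXML sketch of a pizza dialogue as a Mealy-style automaton (illustrative)

 <?xml version="1.0" encoding="UTF-8"?>
 <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
        datamodel="ecmascript" initial="askKind">
  <!-- One state per question; the output is produced on the transition -->
  <state id="askKind">
   <transition event="user.kind" target="askTopping">
    <log expr="'What topping do you want?'"/>
   </transition>
  </state>
  <state id="askTopping">
   <transition event="user.topping" target="askDrink">
    <log expr="'What do you want to drink?'"/>
   </transition>
  </state>
  <state id="askDrink">
   <transition event="user.drink" target="confirm">
    <log expr="'Is the order correct?'"/>
   </transition>
  </state>
  <state id="confirm">
   <transition event="user.yes" target="submitted">
    <log expr="'Order submitted'"/>
   </transition>
   <!-- On "no" the dialogue starts over -->
   <transition event="user.no" target="askKind"/>
  </state>
  <final id="submitted"/>
 </scxml>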

VoiceXML - processing

  • An application needs to run on a VoiceXML platform or in a VoiceXML interpreter.

    • desktop platforms - OptimTalk, publicVoiceXML, JVoiceXML

    • open-source online - Asterisk+VoiceGlue, Asterisk+OpenVXI

    • commercial online:

      • Bevocal Cafe

      • Voxeo Prophecy

    • VoiceXML forms in XHTML documents

      • using namespaces (formerly W3C submission XHTML+Voice profile 1.0)

      • Supported in the Opera and Firefox web browsers.
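
As a rough illustration of embedding a VoiceXML form in an XHTML document through namespaces, the following sketch follows the general pattern of the XHTML+Voice profile. The form id (welcome) and the page text are made up for this example, and actual behaviour depends on the browser's voice support.

Figure: XHTML+Voice sketch (illustrative)

 <?xml version="1.0" encoding="UTF-8"?>
 <html xmlns="http://www.w3.org/1999/xhtml"
       xmlns:vxml="http://www.w3.org/2001/vxml"
       xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
   <title>FI pizzeria</title>
   <!-- VoiceXML form embedded through the vxml namespace prefix -->
   <vxml:form id="welcome">
    <vxml:block>Welcome to FI pizzeria</vxml:block>
   </vxml:form>
  </head>
  <body>
   <!-- XML Events attributes run the voice form when the paragraph is clicked -->
   <p ev:event="click" ev:handler="#welcome">Click to hear the greeting.</p>
  </body>
 </html>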

VoiceXML - example

Figure: VoiceXML example

 <?xml version="1.0" encoding="UTF-8"?>
 <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="pizza-mixed">
   <grammar src="pizza.grxml"/>
   <initial name="pizzaall">
    <prompt>Welcome to FI pizzeria</prompt>
    <nomatch count="2"><assign name="pizzaall" expr="true"/></nomatch>
    <noinput count="2"><assign name="pizzaall" expr="true"/></noinput>
   </initial>
   <field name="kind">
    <prompt>What kind of pizza do you want?</prompt>
    <nomatch>We have salami, mozzarella and apollo pizza.</nomatch>
    <noinput>We have salami, mozzarella and apollo pizza.</noinput>
    <grammar src="pizza.grxml#kind"/>
   </field>
   <field name="topping">
    <prompt>What topping do you want?</prompt>
    <nomatch>We offer ketchup and chilli.</nomatch>
    <noinput>We offer ketchup and chilli.</noinput>
    <grammar src="pizza.grxml#topping"/>
   </field>
   <field name="drink">
    <prompt>What do you want to drink?</prompt>
    <nomatch>Select one of coke, sprite and water.</nomatch>
    <noinput>Select one of coke, sprite and water.</noinput>
    <grammar src="pizza.grxml#drink"/>
   </field>
   <field name="ack">
    <prompt>Did you order <value expr="kind"/> pizza with <value
    expr="topping"/> and <value expr="drink"/>?</prompt>
    <grammar src="yesno.grxml"/>
   </field>
   <filled>
    <if cond="ack=='yes'">
         <prompt>Order submitted</prompt>
    <else/>
         <clear namelist="kind topping drink ack"/>
    </if>
   </filled>
  </form>
 </vxml>

SRGS (Speech Recognition Grammar Specification)

  • Standard for describing context-free grammars.

    • describes the accepted inputs of particular VoiceXML fields

  • Specification

  • Part of W3C Voice Browser Activity standards

  • Current version: 1.0

  • SRGS - motivation

    • The user's voice input needs to be recognized (continuous speech recognition).

    • recognition success rate: 50-99 %

  • Ways to improve the success rate:

    • improve the language model

    • restrict the problem domain

    • improve the user model

  • Problem domain restriction + language model improvement = SRGS.

SRGS - example

Figure: SRGS grammar referenced in the previous VoiceXML example (pizza.grxml)

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
          xml:lang="en-US" root="mixed" tag-format="semantics/1.0">
  <rule id="mixed">
    <item><ruleref special="GARBAGE"/> <ruleref uri="#kind"/> pizza <ruleref special="GARBAGE"/> <ruleref uri="#topping"/> and <ruleref uri="#drink"/>
    </item>
    <tag>
     {
       out.kind=rules.kind;
       out.topping=rules.topping;
       out.drink=rules.drink;
     }
    </tag>
  </rule>

  <!-- scope="public" lets other documents reference pizza.grxml#kind -->
  <rule id="kind" scope="public">
   <one-of>
    <item>salami</item>
    <item>mozzarella</item>
    <item>apollo</item>
   </one-of>
  </rule>

 ...

 </grammar>
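
The VoiceXML example also references yesno.grxml, which is not shown in the slides. A minimal sketch of what such a grammar could look like follows; the rule name (yesno) and the accepted phrasings are assumptions made for illustration. The tags return the string "yes" or "no", which is what the condition ack=='yes' in the VoiceXML form would need.

Figure: Possible yes/no grammar (hypothetical yesno.grxml)

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
          xml:lang="en-US" root="yesno" tag-format="semantics/1.0">
  <rule id="yesno" scope="public">
   <one-of>
    <!-- Several phrasings map to the same result value -->
    <item>yes <tag>out = "yes";</tag></item>
    <item>yeah <tag>out = "yes";</tag></item>
    <item>no <tag>out = "no";</tag></item>
    <item>nope <tag>out = "no";</tag></item>
   </one-of>
  </rule>
 </grammar>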

SISR (Semantic Interpretation for Speech Recognition)

  • Purpose:

    • What is the meaning of recognized input?

  • Language for deriving the semantics of the recognized input.

  • Based on ECMAScript.

  • Used in speech recognition grammars (see previous slide).

  • SISR 1.0 Specification
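
To make the ECMAScript basis concrete, here is a small sketch of a grammar with SISR tags. The rules (order, count) are invented for this illustration and are not part of the pizza grammar in the slides. Under these assumptions, the utterance "two pizzas" would be expected to produce a result object whose count property is the number 2.

Figure: SISR tags building a structured result (illustrative)

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
          xml:lang="en-US" root="order" tag-format="semantics/1.0">
  <rule id="order" scope="public">
   <ruleref uri="#count"/>
   <one-of>
    <item>pizza</item>
    <item>pizzas</item>
   </one-of>
   <!-- rules.count holds the value returned by the count rule -->
   <tag>out.count = rules.count;</tag>
  </rule>
  <rule id="count">
   <one-of>
    <!-- Each spoken word is mapped to a number, not to its text -->
    <item>one <tag>out = 1;</tag></item>
    <item>two <tag>out = 2;</tag></item>
    <item>three <tag>out = 3;</tag></item>
   </one-of>
  </rule>
 </grammar>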

SSML (Speech Synthesis Markup Language)

  • Specification: Speech Synthesis Markup Language

  • W3C Standard

  • current version: 1.1 (September 2010)

  • Used to describe characteristics of synthesized speech:

    • loudness

    • prosody

    • emphasis

    • speech rate

    • voice kind (male, female, neutral)

  • Contains markup for describing the pronunciation of foreign words.

    • IPA (International Phonetic Alphabet) can be utilized.
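
The features listed above can be combined in one short document. The sketch below is only an illustration: the sentence and the IPA transcription of "pizza" are chosen for this example, and how closely a synthesizer honours voice, rate and emphasis requests depends on the engine.

Figure: SSML voice, rate, emphasis and IPA pronunciation (illustrative)

 <?xml version="1.0" encoding="UTF-8"?>
 <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xml:lang="en-US">
  <!-- Request a female voice and a slower speech rate -->
  <voice gender="female">
   <prosody rate="slow">
    We offer <emphasis level="strong">three</emphasis> kinds of
    <!-- The IPA transcription overrides the default pronunciation -->
    <phoneme alphabet="ipa" ph="ˈpiːtsə">pizza</phoneme>.
   </prosody>
  </voice>
 </speak>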

SSML - example of loudness and breaks

Figure: SSML Breaks and loudness control example

 <?xml version="1.0" encoding="utf-8"?>
 <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                            http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
        xml:lang="cs-CZ">
  <!-- "Good morning." spoken loudly, followed by a pause -->
  <prosody volume="loud">
   Dobre rano. <break/>
  </prosody>
  <!-- "How are you?" spoken at the default volume -->
  <prosody volume="default">
   Jak se mate?
  </prosody>
 </speak>

SSML - example of intonation modeling

Figure: SSML Intonation modeling

 <speak ...>
  <!-- Pitch targets at 0 %, 75 %, 80 % and 90 % of the text; "Are you well?" -->
  <prosody contour="(0%,50Hz) (75%,+10%) (80%,+20%) (90%,+30%)">
   Mas se dobre?
  </prosody>
 </speak>

PLS (Pronunciation Lexicon Specification)

  • Pronunciation Lexicon Specification

    • W3C standard

    • Current version: 1.0 (October 2008)

  • Developed for describing the pronunciation of words, abbreviations, etc.

  • Used for:

    • Speech synthesis (SSML) - pronunciation of

      • foreign words

      • abbreviations

      • number values

    • Speech recognition (SRGS) - PLS makes it possible to describe alternative pronunciations of a word, so that each variant is recognized correctly.

PLS Structure

  • Root element - lexicon

    • contains one or more lexicon entries - lexeme elements

      • each lexeme contains:

        • one or more written forms of the word - grapheme elements

        • one or more pronunciations of the word - phoneme elements

          • pronunciation may be written using IPA, SAMPA, etc

PLS - example

Figure: PLS pronunciation example

 <?xml version="1.0" encoding="utf-8"?>
 <lexicon version="1.0"
       xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
         http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
        alphabet="x-sampa" xml:lang="cs-CZ">
  <lexeme>
   <grapheme>CSR</grapheme>
   <phoneme>tSe: es er</phoneme>
   <phoneme>tSeska: republika</phoneme>
  </lexeme>
 </lexicon>
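
A lexicon like the one above is used by referencing it from an SSML document or an SRGS grammar. The two fragments below are separate documents and only a sketch: the file name csr.pls and the identifier abbr are assumptions for illustration, and the grammar rule (country) is invented for this example.

Figure: Referencing a PLS lexicon from SSML and from SRGS (illustrative)

 <!-- SSML 1.1: load the lexicon, then apply it to a span of text -->
 <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xml:lang="cs-CZ">
  <lexicon uri="csr.pls" xml:id="abbr"/>
  <lookup ref="abbr">CSR</lookup>
 </speak>

 <!-- SRGS 1.0: the lexicon supplies pronunciations for grammar tokens -->
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
          xml:lang="cs-CZ" root="country">
  <lexicon uri="csr.pls"/>
  <rule id="country">CSR</rule>
 </grammar>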