Dialogue Systems: Speech Recognition
Luděk Bártek
Laboratory of Searching and Dialogue, Faculty of Informatics, Masaryk University, Brno
Spring 2023

Speech Recognition
■ Continuous speech recognition - transforms continuous speech into a textual form.
■ Command recognition.
■ Recognition principle:
  1. acquire a feature vector using short-term signal analysis,
  2. classify the signal using the vector from the previous step.

Command Recognition
■ Used to recognize words (commands) distinctly separated by silence on both ends.
  ■ Unlike in a continuous utterance, there is no problem identifying the start and the end of a word.
■ Usually speaker-dependent systems:
  ■ the recognizer has to be trained,
  ■ limited size of the vocabulary used.
■ Command recognition problems - identifying the start and the end of the command:
  ■ how to distinguish noise from sibilants,
  ■ how to distinguish random sound events (a click, tapping, ...) from plosives, which include a short pause,
  ■ possible infrasound interference.

Command Recognition - Classifier Types
■ DTW-based classifiers:
  ■ try to find the maximum correspondence between the recognized word and the words in the database.
■ Classifiers based on statistical methods - speech modelling using Hidden Markov Models:
  ■ simulate the speech generation process.
■ Two-phase classifiers:
  1. segmentation of the speech and phonetic decoding of the segments,
  2. word recognition based on the decoded segments.
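The boundary-detection problem above is commonly attacked with short-term energy thresholding. A minimal Python sketch, assuming the first frames of the recording contain background noise only; the threshold rule (noise mean plus a multiple of its deviation) is an illustrative heuristic, not a method the slides prescribe:

```python
import numpy as np

def detect_endpoints(signal, frame_len=256, hop=128, k=3.0, noise_frames=10):
    # Split the signal into overlapping frames and compute short-term energy.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    # Estimate the noise floor from the leading frames (assumed silent)
    # and set the detection threshold above it.
    noise = energy[:noise_frames]
    threshold = noise.mean() + k * noise.std()
    active = np.where(energy > threshold)[0]
    if active.size == 0:
        return None  # no command detected
    start = active[0] * hop
    end = active[-1] * hop + frame_len
    return start, end
```

Real detectors combine energy with zero-crossing rate exactly because of the sibilant/plosive ambiguities listed above; this sketch shows only the energy part.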
Command Recognition - Classifier Types (cont.)
■ Solutions based on artificial neural networks - see:
  ■ Hinton, Osindero, Teh: A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 2006,
  ■ Bengio, Lamblin, Popovici, Larochelle: Greedy Layer-Wise Training of Deep Networks, NIPS 2006,
  ■ Speech recognition - Lecture 14: Neural Networks.

Dynamic Time Warping (DTW)
■ A method used to compare two series of numbers - two pieces of speech (two words).
■ Input:
  ■ a sequence of acoustic vectors acquired by one of the short-term signal analysis methods,
  ■ a database of acoustic vector sequences for the recognized words.
■ Output - the recognized word or command.

DTW - Basic Principle
■ A database of recognized words (reference sequences of acoustic vectors) is created first.
  ■ Usually several sequences per word, corresponding to several manners of pronouncing the word.
■ The word to be recognized is transformed into the corresponding sequence of acoustic vectors.
■ Using DTW we find the reference sequence with maximum conformity.

DTW - Formalization
■ The DTW algorithm searches for parametrizations f and g:

    f, g: i = f(k), j = g(k), k ∈ ⟨1, K⟩

  that minimize the expression:

    D(A, B) = Σ_{k=1}^{K} d(a_{f(k)}, b_{g(k)})

  ■ d - a distance on acoustic vectors (e.g. the Euclidean metric),
  ■ a_{f(k)}, b_{g(k)} - vectors of the reference and the recognized word/command,
  ■ f, g - non-decreasing functions.
■ Local coherence and steepness:
  ■ 0 ≤ f(k) − f(k−1) ≤ I*,
  ■ 0 ≤ g(k) − g(k−1) ≤ J*,
  ■ mostly I*, J* = 1, 2, 3.
  ■ Too steep an increase may lead to an inappropriate correspondence between a too short segment of a and a too long segment of b.
■ Boundary point restrictions:
  ■ f(1) = 1, f(K) = I, where I is the number of samples of the word a,
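The minimization above is solved by dynamic programming. A sketch with Euclidean local distance and unit local steps; the `classify` helper and the reference-dictionary layout are illustrative, not part of the slides:

```python
import numpy as np

def dtw_distance(a, b):
    # a: (I, d) array, b: (J, d) array of acoustic feature vectors.
    # D[i, j] = cost of the best warping of a[:i] onto b[:j].
    I, J = len(a), len(b)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance d(a_i, b_j)
            D[i, j] = d + min(D[i - 1, j - 1],  # diagonal step
                              D[i - 1, j],      # advance in a only
                              D[i, j - 1])      # advance in b only
    return D[I, J]

def classify(x, references):
    # Return the vocabulary word whose reference sequence is closest to x.
    return min(references, key=lambda w: dtw_distance(x, references[w]))
```

The three-way `min` encodes the local constraints with I* = J* = 1; the boundary conditions f(1) = g(1) = 1, f(K) = I, g(K) = J correspond to starting at D[0, 0] and reading off D[I, J].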
  ■ g(1) = 1, g(K) = J, where J is the number of samples of the word b.

DTW - Constraints (cont.)
■ Global limits on the growth of the DTW warping function:
  ■ lines with the minimal slope α and the maximal slope β, passing through the boundary points, delimit the allowed area in which the warping function must stay:

    1 + α(f(k) − 1) ≤ g(k) ≤ 1 + β(f(k) − 1)

  ■ α - the minimal slope of the lines delimiting the allowed area,
  ■ β - the maximal slope of the lines delimiting the allowed area.

DTW - Word Classifier Realization: Block Schema
[Figure: Block schema of the word classifier - creation of reference word patterns and their storage in the dictionary; user → signal processing → feature selection → creation of the tested word pattern → pattern comparison using the DTW algorithm → recognition of the unknown word.]

DTW - Word Classifier Realization: Training
■ General algorithm:
  1. A speaker or a group of speakers pronounces each word of the required vocabulary, either once or repeatedly.
  2. The input words are digitized and transformed by a selected method of short-term signal analysis into the corresponding feature vectors.
  3. Word boundary detection:
    ■ may be difficult, e.g. due to background noise,
    ■ incorrect word boundaries deteriorate the recognition success rate,
    ■ methods used to reduce the influence of background noise increase the computational complexity.
  4. The reference word database is created.
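The global slope constraints can be read as a membership test for a grid point (i, j): it must lie between lines of slope α and β drawn through both boundary points (1, 1) and (I, J). A sketch, with illustrative α and β values:

```python
def in_allowed_region(i, j, I, J, alpha=0.5, beta=2.0):
    # The path from (1, 1) to (i, j) and from (i, j) to (I, J) must both
    # have an overall slope between alpha and beta; alpha and beta here
    # are example values, not prescribed by the lecture.
    lower = max(1 + alpha * (i - 1), J + beta * (i - I))
    upper = min(1 + beta * (i - 1), J + alpha * (i - I))
    return lower <= j <= upper
```

Restricting the dynamic-programming computation to this parallelogram is also one of the classic ways to cut DTW's cost, since cells outside it never need to be evaluated.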
DTW Realization - Methods for Creating the Reference Word Database
■ Direct use of the training set as the reference database:
  ■ DTW does not require the reference word samples to have the same length, but it is useful to perform time normalization so that additional criteria can be applied.
■ Creating an average sample for each word class:
  ■ linear and dynamic averaging methods are used.
■ Creating sample words by clustering:
  ■ the word recordings are divided into clusters so that each cluster contains "similar" recordings and different clusters contain "different" recordings,
  ■ clustering can be done interactively (semi-automatically - the chain map method, the ISODATA algorithm) or automatically (algorithms based on the McQueen algorithm),
  ■ see the final thesis of Mgr. J. Kučera.

DTW - Computational and Memory Complexity Reduction
■ DTW disadvantage - high memory and computational complexity can make real-time classification difficult even with a relatively small dictionary.
■ Solutions:
  ■ Brute force - use of parallel processors or custom circuits - may be expensive.
  ■ Effective encoding of the reference and tested word parameters. Can be used:
    ■ vector quantization - the number of different word samples is finite, so they can be stored in a codebook and referred to by their indices,
    ■ codebook - all samples included in the signal value alphabet (the encoding is more effective than PCM).
  ■ Use of areas of spectral stationarity - the method of spectral trace segmentation:
    ■ spectral trace - the line connecting successive feature vectors,
    ■ it can be approximated, e.g. by linear segments.
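Vector quantization, mentioned above, replaces each feature vector by the index of its nearest codeword. A minimal sketch using farthest-point initialization plus Lloyd (k-means) iterations as a stand-in for full LBG-style codebook training, which the slides do not prescribe:

```python
import numpy as np

def train_codebook(vectors, size=8, iters=20):
    # Farthest-point initialization: start from the first vector and
    # repeatedly add the vector farthest from the current codebook.
    codebook = [vectors[0]]
    for _ in range(size - 1):
        dists = np.min([((vectors - c) ** 2).sum(axis=1) for c in codebook], axis=0)
        codebook.append(vectors[np.argmax(dists)])
    codebook = np.array(codebook, dtype=float)
    # Lloyd iterations: assign vectors to the nearest codeword, re-center.
    for _ in range(iters):
        idx = np.argmin(((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(size):
            members = vectors[idx == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vectors, codebook):
    # Each feature vector becomes the index of its nearest codeword.
    return np.argmin(((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
```

After quantization, a word is stored as a short sequence of small integers instead of full feature vectors, which is the memory saving the slide refers to.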
■ Nearest neighbour search optimization:
  ■ metric space search methods,
  ■ the distance used in DTW must then be a metric.

DTW - Computational and Memory Complexity Reduction (cont.)
■ Reduction of the computational requirements using comparison heuristics:
  ■ Multi-level decision procedure:
    1. the utterance is compared against the entire vocabulary using a reduced feature vector set,
    2. the results of the previous step are searched using standard DTW.
  ■ Rejection threshold:
    ■ the distance between the word and the reference word is computed at each step; when it exceeds an experimentally established threshold, the reference word is rejected.

Hidden Markov Models - HMM
■ Speech modelling using HMMs is based on the following idea of speech production:
  ■ on a short-term interval the vocal tract is in one of a finite number of articulation configurations and generates a voice signal,
  ■ the configuration then changes; this behaviour is governed by statistics.
■ A finite number of model parameters can be achieved by quantizing all the parameters.

HMM - Principles of Use in Speech Recognition
■ Two mutually tied time sequences of random variables are generated:
  ■ the underlying Markov chain - a sequence of states from a finite set,
  ■ a string of spectral patterns from a finite set.
■ A random function assigns a probability to each state-pattern relation.
■ Left-to-right Markov models are used most often for speech recognition:
  ■ they are suitable for modelling processes that evolve in time.
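A left-to-right model restricts transitions to self-loops and forward steps, which matches the left-to-right flow of time in an utterance. A sketch of building such a transition matrix; the value of `p_stay` is illustrative:

```python
import numpy as np

def left_to_right_transitions(n_states, p_stay=0.6):
    # Each state either stays in place or advances to the next state;
    # no backward transitions are allowed. The last state is absorbing.
    N = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        N[i, i] = p_stay
        N[i, i + 1] = 1.0 - p_stay
    N[-1, -1] = 1.0
    return N
```

The zero lower triangle of this matrix is precisely what makes the model "left-to-right": once a state is left, it is never revisited.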
HMM - Markov Process
■ A Markov process G with an HMM is a quintuple G = (Q, V, N, M, π):
  ■ Q = {q₁, ..., q_K} - the set of states,
  ■ V = {v₁, ..., v_k} - the set of input symbols,
  ■ N = (n_ij) - the transition matrix; evaluates the probability of a transition from state q_i at time t to state q_j at time t+1,
  ■ M = (m_ij) - the matrix assigning the probability of the acoustic vector v_j in state q_i, regardless of time,
  ■ π = (π_i) - the initial state probability vector (the probability that state i is the initial one).
■ The triple λ = (N, M, π) forms the speech segment model.
  ■ Vintsjuk's word model - 40-50 states (based on the average number of micro-segments in a word; a segment is 10 ms long).

HMM - Determining the Probability of an Utterance
■ The probability is denoted P(O|λ).
■ The utterance O is usually processed as a sequence O = (o₁, ..., o_T):
  ■ T - the number of micro-segments of the utterance,
  ■ o_t - corresponds to an output symbol.
■ Calculation of P(O|λ) - methods using recursive enumeration of the generated sequence either from the front or from the back (the forward-backward algorithm).

HMM - Utterance Probability Calculation
■ Forward calculation:
  ■ α_t(i) - the probability of being in state q_i while generating the output sequence {o₁, ..., o_t}: α_t(i) = P(o₁ ... o_t, q_i(t) | λ).
■ Recursive calculation:
  ■ Initialization: α₁(i) = π_i m_i(o₁), i ∈ ⟨1, N⟩.
  ■ Recursive step, for t = 1, ..., T−1:

    α_{t+1}(j) = Σ_{i=1}^{N} α_t(i) n_ij m_j(o_{t+1}),  j ∈ ⟨1, N⟩,

    where m_j(o_t) equals m_j(l) when o_t = v_l.
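The forward recursion above translates directly into a few lines of NumPy, with observations encoded as indices into the emission matrix; the variable names are mine, not the slides':

```python
import numpy as np

def forward_probability(pi, N, M, obs):
    # pi: (S,) initial probabilities; N: (S, S) transition matrix;
    # M: (S, V) emission matrix m_i(v); obs: sequence of symbol indices.
    pi, N, M = map(np.asarray, (pi, N, M))
    # Initialization: alpha_1(i) = pi_i * m_i(o_1)
    alpha = pi * M[:, obs[0]]
    # Recursive step: alpha_{t+1}(j) = sum_i alpha_t(i) n_ij * m_j(o_{t+1})
    for o in obs[1:]:
        alpha = (alpha @ N) * M[:, o]
    # Resulting probability: P(O | lambda) = sum_i alpha_T(i)
    return float(alpha.sum())
```

Each loop iteration is one column of the forward trellis, so the cost is O(T·S²) instead of the O(Sᵀ) cost of enumerating all state sequences.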
■ Resulting probability:

    P(O|λ) = Σ_{i=1}^{N} α_T(i)

HMM - Alternative Way of Calculating P(O|λ)
■ Disadvantage of the previous method:
  ■ the result includes the probabilities of all possible state sequences of length T.
■ Solution:
  ■ calculation of the most probable sequence of states Q.
■ The calculation is realized using the Viterbi algorithm:
  ■ the problem is solved recursively using dynamic programming techniques.

HMM - Training the Parameters of the Model λ = (N, M, π)
■ A procedure for training the model parameters must be determined.
■ Training objective:
  ■ maximization of the probability P(O|λ).
■ Problem:
  ■ there is no analytical method for finding the global maximum of a function of n variables.
■ Solution:
  ■ iterative algorithms for finding a local maximum can be used,
  ■ the most widely used one is the Baum-Welch algorithm.
■ Another problem encountered while training the model:
  ■ the finite training set problem: the smaller the training set and the bigger the matrix M, the higher the probability that some elements of M remain 0 (the missing data problem).

HMM - Isolated Word Recognition Decision Rule
■ The maximum likelihood principle is used. For a given word O and all models λ:
  1. we calculate P(O|λ),
  2. the result is the class with the maximum value of P(O|λ).
■ Command modelling:
  ■ models with 4-7 states are commonly used,
  ■ tools for creating HMMs can be used during the modelling, e.g. HTK - the Hidden Markov Model Toolkit.
■ Phoneme modelling:
  ■ usually 4-7 states,
  ■ the word model is a concatenation of phoneme models,
  ■ real-time processing problems arise; they can be solved using special algorithms for searching for the maximum of P(O|λ).
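The Viterbi alternative described above, sketched in log-space to avoid numerical underflow, with the same state/observation encoding as in the forward example:

```python
import numpy as np

def viterbi(pi, N, M, obs):
    # Returns the most probable state sequence and its probability.
    pi, N, M = map(np.asarray, (pi, N, M))
    # delta_1(i) = log pi_i + log m_i(o_1)
    delta = np.log(pi) + np.log(M[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(N)   # score of moving i -> j
        back.append(scores.argmax(axis=0))    # best predecessor of each j
        delta = scores.max(axis=0) + np.log(M[:, o])
    # Trace the best path back from the most probable final state.
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path, float(np.exp(delta.max()))
```

The only change from the forward pass is replacing the sum over predecessors with a max (plus the backpointers), which is why the two algorithms share the same trellis structure.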
HMM - Phoneme Structure Examples
[Figure: examples of phoneme model structures.]

Continuous Speech Recognition
■ The principal differences to isolated word recognition:
  ■ a pattern database cannot be created,
  ■ prosodic factors must be taken into account,
  ■ word boundaries have to be found,
  ■ filler words/noises and speech errors must be processed.
■ Solution - a statistical approach:
  ■ a language model,
  ■ a speaker model.
■ Example: the HMM returns the same probability for the Czech words "máma" (mother) and "nána" (stupid girl) - "máma" is used, as it occurs more frequently.

Continuous Speech Recognition - Language Models
■ There are:
  ■ a word sequence (utterance) W = (w₁, ..., w_n),
  ■ a sequence of acoustic vectors O = (o₁, ..., o_T).
■ Our objective is to find W* (over the set of all utterances) maximizing P(W|O).
■ According to Bayes' theorem:

    P(W*|O) = max_W P(W|O) = max_W P(W) P(O|W) / P(O)

Continuous Speech Recognition - Language Models (cont.)
■ To find the maximum of P(W*|O) we need to know:
  ■ a speaker model - P(O|W),
  ■ a language model - P(W).
■ The speaker model can be replaced by the probability of generating O by the Markov model corresponding to W.
■ The trigram model:
  ■ experimentally proven to hold:

    P(w_n | w₁ ... w_{n−1}) = P(w_n | w_{n−2} w_{n−1})

Continuous Speech Recognition - Topic Recognition
■ The speech recognition success rate ranges from approx. 50 % to 99 %, depending on the language, etc.
■ The success rate can be improved by restricting the recognition domain:
  ■ topic recognition,
  ■ using a speech recognition grammar.
■ When the topic is known:
  ■ the state space of trigrams and the trigram probabilities can be changed:
    ■ e.g. in stock market news - was the recognized word "honey" or "money"?
  ■ a more accurate language model can be created.

Speech Recognition Grammars
■ The success rate of general continuous speech recognition may drop to 50 %.
■ It can be improved by limiting the recognition domain - for example by specifying the allowed inputs.
■ Speech recognition grammars can be used to limit the allowed inputs:
  ■ context-free grammars.
■ Possible ways of writing grammars down:
  ■ using logic programming methods,
  ■ proprietary solutions,
  ■ open standards - JSGF, W3C SRGS, ...

Speech Recognition Grammars - Java Speech Grammar Specification (JSGF)
■ A textual grammar notation independent of platform and vendor.
■ Designed to be used in speech recognition.
■ Part of the Java Speech API.
■ Uses the Java style and conventions.
■ Current version: 1.0 (October 1998).
■ Used, for example, by the Sphinx-4 recognizer, the VoiceGlue VoiceXML interpreter, ...
■ More details in the 2nd half of the semester, on dialogue interfaces.

Speech Recognition Grammars - JSGF Demo
[The angle-bracketed rule names were lost in the slide export; <means>, <station> and <time> below are reconstructed placeholders.]

    #JSGF V1.0;

    <command> = I want to go by <means>
              | I want to go by <means> from <station> to <station>
              | I want to go by <means> from <station> to <station> at <time>;
    <means> = train | bus;
    <station> = ...;
    <time> = ...;
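As a closing illustration, the trigram model discussed earlier can be estimated from counts. A minimal maximum-likelihood sketch; the corpus format and sentence markers are illustrative, and a real recognizer would add smoothing for unseen trigrams:

```python
from collections import Counter

def train_trigram(corpus):
    # corpus: list of sentences, each a list of words.
    # Estimates P(w_n | w_{n-2} w_{n-1}) = count(w_{n-2} w_{n-1} w_n)
    #                                      / count(w_{n-2} w_{n-1}).
    tri, bi = Counter(), Counter()
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]  # sentence-boundary padding
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    def prob(w, w1, w2):
        # Probability of word w given the two preceding words w1, w2.
        return tri[(w1, w2, w)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return prob
```

Restricting the training corpus to one topic, as the topic-recognition slide suggests, changes exactly these counts, which is how "money" comes to outscore "honey" in stock market news.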