Dialogue Systems: Speech Recognition
Luděk Bártek
Laboratory of Searching and Dialogue, Faculty of Informatics, Masaryk University, Brno
Spring 2023

Speech Recognition
■ Continuous speech recognition - transforms continuous speech into a textual form.
■ Command recognition.
■ Recognition principle:
  1. acquire a feature vector using short-term signal analysis,
  2. classify the signal using the vector from the previous step.

Command Recognition
■ Used to recognize words (commands) distinctly separated by silence on both ends.
  ■ Unlike in a continuous utterance, there is no problem identifying the start and the end of a word.
■ Usually speaker-dependent systems:
  ■ the recognizer has to be trained,
  ■ limited size of the vocabulary used.
■ Command recognition problems - identifying the start and the end of the command:
  ■ how to distinguish noise from sibilants,
  ■ how to distinguish random sound events (a click, tapping, ...) from plosives, which include a short pause,
  ■ possible infrasound interference.

Command Recognition - Classifier Types
■ DTW-based classifiers:
  ■ try to find the maximum correspondence between the recognized word and the words in the database.
■ Classifiers based on statistical methods - speech modelling using Hidden Markov Models:
  ■ simulate the speech generation process.
■ Two-phase classifiers:
  1. segmentation of the speech and phonetic decoding of the segments,
  2. word recognition based on the decoded segments.
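The boundary-detection problem above is commonly attacked with short-term energy thresholding. A minimal Python sketch, assuming the first frames of the recording contain background noise only; the threshold rule (noise mean plus a multiple of its deviation) is an illustrative heuristic, not a method the slides prescribe:

```python
import numpy as np

def detect_endpoints(signal, frame_len=256, hop=128, k=3.0, noise_frames=10):
    # Split the signal into overlapping frames and compute short-term energy.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    # Estimate the noise floor from the leading frames (assumed silent)
    # and set the detection threshold above it.
    noise = energy[:noise_frames]
    threshold = noise.mean() + k * noise.std()
    active = np.where(energy > threshold)[0]
    if active.size == 0:
        return None  # no command detected
    start = active[0] * hop
    end = active[-1] * hop + frame_len
    return start, end
```

Real detectors combine energy with zero-crossing rate exactly because of the sibilant/plosive ambiguities listed above; this sketch shows only the energy part.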
Command Recognition - Classifier Types (cont.)
■ Solutions based on artificial neural networks - see:
  ■ Hinton, Osindero, Teh: A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 2006,
  ■ Bengio, Lamblin, Popovici, Larochelle: Greedy Layer-Wise Training of Deep Networks, NIPS 2006,
  ■ Speech recognition - Lecture 14: Neural Networks.

Dynamic Time Warping (DTW)
■ A method used to compare two series of numbers - two pieces of speech (two words).
■ Input:
  ■ a sequence of acoustic vectors acquired by one of the short-term signal analysis methods,
  ■ a database of acoustic vector sequences for the recognized words.
■ Output - the recognized word or command.

DTW - Basic Principle
■ A database of recognized words (reference sequences of acoustic vectors) is created first.
  ■ Usually several sequences per word, corresponding to several manners of pronouncing the word.
■ The word to be recognized is transformed into the corresponding sequence of acoustic vectors.
■ Using DTW we find the reference sequence with maximum conformity.

DTW - Formalization
■ The DTW algorithm searches for parametrizations f and g:

    f, g: i = f(k), j = g(k), k ∈ ⟨1, K⟩

  that minimize the expression:

    D(A, B) = Σ_{k=1}^{K} d(a_{f(k)}, b_{g(k)})

  ■ d - a distance on acoustic vectors (e.g. the Euclidean metric),
  ■ a_{f(k)}, b_{g(k)} - vectors of the reference and the recognized word/command,
  ■ f, g - non-decreasing functions.
■ Local coherence and steepness:
  ■ 0 ≤ f(k) − f(k−1) ≤ I*,
  ■ 0 ≤ g(k) − g(k−1) ≤ J*,
  ■ mostly I*, J* = 1, 2, 3.
  ■ Too steep an increase may lead to an inappropriate correspondence between a too short segment of a and a too long segment of b.
■ Boundary point restrictions:
  ■ f(1) = 1, f(K) = I, where I is the number of samples of the word a,
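The minimization above is solved by dynamic programming. A sketch with Euclidean local distance and unit local steps; the `classify` helper and the reference-dictionary layout are illustrative, not part of the slides:

```python
import numpy as np

def dtw_distance(a, b):
    # a: (I, d) array, b: (J, d) array of acoustic feature vectors.
    # D[i, j] = cost of the best warping of a[:i] onto b[:j].
    I, J = len(a), len(b)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance d(a_i, b_j)
            D[i, j] = d + min(D[i - 1, j - 1],  # diagonal step
                              D[i - 1, j],      # advance in a only
                              D[i, j - 1])      # advance in b only
    return D[I, J]

def classify(x, references):
    # Return the vocabulary word whose reference sequence is closest to x.
    return min(references, key=lambda w: dtw_distance(x, references[w]))
```

The three-way `min` encodes the local constraints with I* = J* = 1; the boundary conditions f(1) = g(1) = 1, f(K) = I, g(K) = J correspond to starting at D[0, 0] and reading off D[I, J].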
  ■ g(1) = 1, g(K) = J, where J is the number of samples of the word b.

DTW - Constraints (cont.)
■ Global limits on the growth of the DTW warping function:
  ■ lines with the minimal slope α and the maximal slope β, passing through the boundary points, delimit the allowed area in which the warping function must stay:

    1 + α(f(k) − 1) ≤ g(k) ≤ 1 + β(f(k) − 1)

  ■ α - the minimal slope of the lines delimiting the allowed area,
  ■ β - the maximal slope of the lines delimiting the allowed area.

DTW - Word Classifier Realization: Block Schema
[Figure: Block schema of the word classifier - creation of reference word patterns and their storage in the dictionary; user → signal processing → feature selection → creation of the tested word pattern → pattern comparison using the DTW algorithm → recognition of the unknown word.]

DTW - Word Classifier Realization: Training
■ General algorithm:
  1. A speaker or a group of speakers pronounces each word of the required vocabulary, either once or repeatedly.
  2. The input words are digitized and transformed by a selected method of short-term signal analysis into the corresponding feature vectors.
  3. Word boundary detection:
    ■ may be difficult, e.g. due to background noise,
    ■ incorrect word boundaries deteriorate the recognition success rate,
    ■ methods used to reduce the influence of background noise increase the computational complexity.
  4. The reference word database is created.
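The global slope constraints can be read as a membership test for a grid point (i, j): it must lie between lines of slope α and β drawn through both boundary points (1, 1) and (I, J). A sketch, with illustrative α and β values:

```python
def in_allowed_region(i, j, I, J, alpha=0.5, beta=2.0):
    # The path from (1, 1) to (i, j) and from (i, j) to (I, J) must both
    # have an overall slope between alpha and beta; alpha and beta here
    # are example values, not prescribed by the lecture.
    lower = max(1 + alpha * (i - 1), J + beta * (i - I))
    upper = min(1 + beta * (i - 1), J + alpha * (i - I))
    return lower <= j <= upper
```

Restricting the dynamic-programming computation to this parallelogram is also one of the classic ways to cut DTW's cost, since cells outside it never need to be evaluated.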
DTW Realization - Methods for Creating the Reference Word Database
■ Direct use of the training set as the reference database:
  ■ DTW does not require the reference word samples to have the same length, but it is useful to perform time normalization so that additional criteria can be applied.
■ Creating an average sample for each word class:
  ■ linear and dynamic averaging methods are used.
■ Creating sample words by clustering:
  ■ the word recordings are divided into clusters so that each cluster contains "similar" recordings and different clusters contain "different" recordings,
  ■ clustering can be done interactively (semi-automatically - the chain map method, the ISODATA algorithm) or automatically (algorithms based on the McQueen algorithm),
  ■ see the final thesis of Mgr. J. Kučera.

DTW - Computational and Memory Complexity Reduction
■ DTW disadvantage - high memory and computational complexity can make real-time classification difficult even with a relatively small dictionary.
■ Solutions:
  ■ Brute force - use of parallel processors or custom circuits - may be expensive.
  ■ Effective encoding of the reference and tested word parameters. Can be used:
    ■ vector quantization - the number of different word samples is finite, so they can be stored in a codebook and referred to by their indices,
    ■ codebook - all samples included in the signal value alphabet (the encoding is more effective than PCM).
  ■ Use of areas of spectral stationarity - the method of spectral trace segmentation:
    ■ spectral trace - the line connecting successive feature vectors,
    ■ it can be approximated, e.g. by linear segments.
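Vector quantization, mentioned above, replaces each feature vector by the index of its nearest codeword. A minimal sketch using farthest-point initialization plus Lloyd (k-means) iterations as a stand-in for full LBG-style codebook training, which the slides do not prescribe:

```python
import numpy as np

def train_codebook(vectors, size=8, iters=20):
    # Farthest-point initialization: start from the first vector and
    # repeatedly add the vector farthest from the current codebook.
    codebook = [vectors[0]]
    for _ in range(size - 1):
        dists = np.min([((vectors - c) ** 2).sum(axis=1) for c in codebook], axis=0)
        codebook.append(vectors[np.argmax(dists)])
    codebook = np.array(codebook, dtype=float)
    # Lloyd iterations: assign vectors to the nearest codeword, re-center.
    for _ in range(iters):
        idx = np.argmin(((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(size):
            members = vectors[idx == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vectors, codebook):
    # Each feature vector becomes the index of its nearest codeword.
    return np.argmin(((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
```

After quantization, a word is stored as a short sequence of small integers instead of full feature vectors, which is the memory saving the slide refers to.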
■ Nearest neighbour search optimization:
  ■ metric space search methods,
  ■ the distance used in DTW must then be a metric.

DTW - Computational and Memory Complexity Reduction (cont.)
■ Reduction of the computational requirements using comparison heuristics:
  ■ Multi-level decision procedure:
    1. the utterance is compared against the entire vocabulary using a reduced feature vector set,
    2. the results of the previous step are searched using standard DTW.
  ■ Rejection threshold:
    ■ the distance between the word and the reference word is computed at each step; when it exceeds an experimentally established threshold, the reference word is rejected.

Hidden Markov Models - HMM
■ Speech modelling using HMMs is based on the following idea of speech production:
  ■ on a short-term interval the vocal tract is in one of a finite number of articulation configurations and generates a voice signal,
  ■ the configuration then changes; this behaviour is governed by statistics.
■ A finite number of model parameters can be achieved by quantizing all the parameters.

HMM - Principles of Use in Speech Recognition
■ Two mutually tied time sequences of random variables are generated:
  ■ the underlying Markov chain - a sequence of states from a finite set,
  ■ a string of spectral patterns from a finite set.
■ A random function assigns a probability to each state-pattern relation.
■ Left-to-right Markov models are used most often for speech recognition:
  ■ they are suitable for modelling processes that evolve in time.
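A left-to-right model restricts transitions to self-loops and forward steps, which matches the left-to-right flow of time in an utterance. A sketch of building such a transition matrix; the value of `p_stay` is illustrative:

```python
import numpy as np

def left_to_right_transitions(n_states, p_stay=0.6):
    # Each state either stays in place or advances to the next state;
    # no backward transitions are allowed. The last state is absorbing.
    N = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        N[i, i] = p_stay
        N[i, i + 1] = 1.0 - p_stay
    N[-1, -1] = 1.0
    return N
```

The zero lower triangle of this matrix is precisely what makes the model "left-to-right": once a state is left, it is never revisited.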
HMM - Markov Process
■ A Markov process G with an HMM is a quintuple G = (Q, V, N, M, π):
  ■ Q = {q₁, ..., q_K} - the set of states,
  ■ V = {v₁, ..., v_k} - the set of input symbols,
  ■ N = (n_ij) - the transition matrix; evaluates the probability of a transition from state q_i at time t to state q_j at time t+1,
  ■ M = (m_ij) - the matrix assigning the probability of the acoustic vector v_j in state q_i, regardless of time,
  ■ π = (π_i) - the initial state probability vector (the probability that state i is the initial one).
■ The triple λ = (N, M, π) forms the speech segment model.
  ■ Vintsjuk's word model - 40-50 states (based on the average number of micro-segments in a word; a segment is 10 ms long).

HMM - Determining the Probability of an Utterance
■ The probability is denoted P(O|λ).
■ The utterance O is usually processed as a sequence O = (o₁, ..., o_T):
  ■ T - the number of micro-segments of the utterance,
  ■ o_t - corresponds to an output symbol.
■ Calculation of P(O|λ) - methods using recursive enumeration of the generated sequence either from the front or from the back (the forward-backward algorithm).

HMM - Utterance Probability Calculation
■ Forward calculation:
  ■ α_t(i) - the probability of being in state q_i while generating the output sequence {o₁, ..., o_t}: α_t(i) = P(o₁ ... o_t, q_i(t) | λ).
■ Recursive calculation:
  ■ Initialization: α₁(i) = π_i m_i(o₁), i ∈ ⟨1, N⟩.
  ■ Recursive step, for t = 1, ..., T−1:

    α_{t+1}(j) = Σ_{i=1}^{N} α_t(i) n_ij m_j(o_{t+1}),  j ∈ ⟨1, N⟩,

    where m_j(o_t) equals m_j(l) when o_t = v_l.
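The forward recursion above translates directly into a few lines of NumPy, with observations encoded as indices into the emission matrix; the variable names are mine, not the slides':

```python
import numpy as np

def forward_probability(pi, N, M, obs):
    # pi: (S,) initial probabilities; N: (S, S) transition matrix;
    # M: (S, V) emission matrix m_i(v); obs: sequence of symbol indices.
    pi, N, M = map(np.asarray, (pi, N, M))
    # Initialization: alpha_1(i) = pi_i * m_i(o_1)
    alpha = pi * M[:, obs[0]]
    # Recursive step: alpha_{t+1}(j) = sum_i alpha_t(i) n_ij * m_j(o_{t+1})
    for o in obs[1:]:
        alpha = (alpha @ N) * M[:, o]
    # Resulting probability: P(O | lambda) = sum_i alpha_T(i)
    return float(alpha.sum())
```

Each loop iteration is one column of the forward trellis, so the cost is O(T·S²) instead of the O(Sᵀ) cost of enumerating all state sequences.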
■ Resulting probability:

    P(O|λ) = Σ_{i=1}^{N} α_T(i)

HMM - Alternative Way of Calculating P(O|λ)
■ Disadvantage of the previous method:
  ■ the result includes the probabilities of all possible state sequences of length T.
■ Solution:
  ■ calculation of the most probable sequence of states Q.
■ The calculation is realized using the Viterbi algorithm:
  ■ the problem is solved recursively using dynamic programming techniques.

HMM - Training the Parameters of the Model λ = (N, M, π)
■ A procedure for training the model parameters must be determined.
■ Training objective:
  ■ maximization of the probability P(O|λ).
■ Problem:
  ■ there is no analytical method for finding the global maximum of a function of n variables.
■ Solution:
  ■ iterative algorithms for finding a local maximum can be used,
  ■ the most widely used one is the Baum-Welch algorithm.
■ Another problem encountered while training the model:
  ■ the finite training set problem: the smaller the training set and the bigger the matrix M, the higher the probability that some elements of M remain 0 (the missing data problem).

HMM - Isolated Word Recognition Decision Rule
■ The maximum likelihood principle is used. For a given word O and all models λ:
  1. we calculate P(O|λ),
  2. the result is the class with the maximum value of P(O|λ).
■ Command modelling:
  ■ models with 4-7 states are commonly used,
  ■ tools for creating HMMs can be used during the modelling, e.g. HTK - the Hidden Markov Model Toolkit.
■ Phoneme modelling:
  ■ usually 4-7 states,
  ■ the word model is a concatenation of phoneme models,
  ■ real-time processing problems arise; they can be solved using special algorithms for searching for the maximum of P(O|λ).
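The Viterbi alternative described above, sketched in log-space to avoid numerical underflow, with the same state/observation encoding as in the forward example:

```python
import numpy as np

def viterbi(pi, N, M, obs):
    # Returns the most probable state sequence and its probability.
    pi, N, M = map(np.asarray, (pi, N, M))
    # delta_1(i) = log pi_i + log m_i(o_1)
    delta = np.log(pi) + np.log(M[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(N)   # score of moving i -> j
        back.append(scores.argmax(axis=0))    # best predecessor of each j
        delta = scores.max(axis=0) + np.log(M[:, o])
    # Trace the best path back from the most probable final state.
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path, float(np.exp(delta.max()))
```

The only change from the forward pass is replacing the sum over predecessors with a max (plus the backpointers), which is why the two algorithms share the same trellis structure.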
HMM - Phoneme Structure Examples
[Figure: examples of phoneme model structures.]

Continuous Speech Recognition
■ The principal differences to isolated word recognition:
  ■ a pattern database cannot be created,
  ■ prosodic factors must be taken into account,
  ■ word boundaries have to be found,
  ■ filler words/noises and speech errors must be processed.
■ Solution - a statistical approach:
  ■ a language model,
  ■ a speaker model.
■ Example: the HMM returns the same probability for the Czech words "máma" (mother) and "nána" (stupid girl) - "máma" is used, as it occurs more frequently.

Continuous Speech Recognition - Language Models
■ There are:
  ■ a word sequence (utterance) W = (w₁, ..., w_n),
  ■ a sequence of acoustic vectors O = (o₁, ..., o_T).
■ Our objective is to find W* (over the set of all utterances) maximizing P(W|O).
■ According to Bayes' theorem:

    P(W*|O) = max_W P(W|O) = max_W P(W) P(O|W) / P(O)

Continuous Speech Recognition - Language Models (cont.)
■ To find the maximum of P(W*|O) we need to know:
  ■ a speaker model - P(O|W),
  ■ a language model - P(W).
■ The speaker model can be replaced by the probability of generating O by the Markov model corresponding to W.
■ The trigram model:
  ■ experimentally proven to hold:

    P(w_n | w₁ ... w_{n−1}) = P(w_n | w_{n−2} w_{n−1})

Continuous Speech Recognition - Topic Recognition
■ The speech recognition success rate ranges from approx. 50 % to 99 %, depending on the language, etc.
■ The success rate can be improved by restricting the recognition domain:
  ■ topic recognition,
  ■ using a speech recognition grammar.
■ When the topic is known:
  ■ the state space of trigrams and the trigram probabilities can be changed:
    ■ e.g. in stock market news - was the recognized word "honey" or "money"?
  ■ a more accurate language model can be created.

Speech Recognition Grammars
■ The success rate of general continuous speech recognition may drop to 50 %.
■ It can be improved by limiting the recognition domain - for example by specifying the allowed inputs.
■ Speech recognition grammars can be used to limit the allowed inputs:
  ■ context-free grammars.
■ Possible ways of writing grammars down:
  ■ using logic programming methods,
  ■ proprietary solutions,
  ■ open standards - JSGF, W3C SRGS, ...

Speech Recognition Grammars - Java Speech Grammar Specification (JSGF)
■ A textual grammar notation independent of platform and vendor.
■ Designed to be used in speech recognition.
■ Part of the Java Speech API.
■ Uses the Java style and conventions.
■ Current version: 1.0 (October 1998).
■ Used, for example, by the Sphinx-4 recognizer, the VoiceGlue VoiceXML interpreter, ...
■ More details in the 2nd half of the semester, on dialogue interfaces.

Speech Recognition Grammars - JSGF Demo
[The angle-bracketed rule names were lost in the slide export; <means>, <station> and <time> below are reconstructed placeholders.]

    #JSGF V1.0;

    <command> = I want to go by <means>
              | I want to go by <means> from <station> to <station>
              | I want to go by <means> from <station> to <station> at <time>;
    <means> = train | bus;
    <station> = ...;
    <time> = ...;
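As a closing illustration, the trigram model discussed earlier can be estimated from counts. A minimal maximum-likelihood sketch; the corpus format and sentence markers are illustrative, and a real recognizer would add smoothing for unseen trigrams:

```python
from collections import Counter

def train_trigram(corpus):
    # corpus: list of sentences, each a list of words.
    # Estimates P(w_n | w_{n-2} w_{n-1}) = count(w_{n-2} w_{n-1} w_n)
    #                                      / count(w_{n-2} w_{n-1}).
    tri, bi = Counter(), Counter()
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]  # sentence-boundary padding
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    def prob(w, w1, w2):
        # Probability of word w given the two preceding words w1, w2.
        return tri[(w1, w2, w)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return prob
```

Restricting the training corpus to one topic, as the topic-recognition slide suggests, changes exactly these counts, which is how "money" comes to outscore "honey" in stock market news.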