Dialogue systems
Speech Synthesis
Luděk Bártek
Laboratory of Searching and Dialogue, Faculty of Informatics, Masaryk University, Brno
spring 2023

Speech Synthesis
■ Objective - conversion of written text into speech.
■ The resulting speech should sound as natural as possible.
■ Natural speech should contain:
  ■ correct intonation
  ■ correctly placed stresses
    ■ word stress
    ■ sentence stress
  ■ correct co-articulation
  ■ correct rhythm (timing)

Speech Synthesis Kinds
■ Frequency-domain synthesis - simulates the human vocal tract.
■ Time-domain synthesis - concatenates speech segments into larger units of speech (sentence, utterance, ...).
■ Corpus-based synthesis - a kind of time-domain synthesis that uses a speech corpus instead of a segment database.
■ Problem-oriented synthesis:
  ■ a variant of time-domain synthesis
  ■ uses larger units of speech - sentences, ...
  ■ examples:
    ■ station radio announcements
    ■ automatic phone-support lines
    ■ ...

Speech Synthesis Phases
■ Phonetic transcription of the text.
■ Synthesis of the transcribed text:
  ■ Frequency-domain synthesis - selection of the speech synthesis parameters (F0/white-noise generator, formants and their intensities, ...)
  ■ Time-domain synthesis - selection of the proper segments and their concatenation.
■ Possible post-processing:
  ■ adding intonation
  ■ adding stress

Phonetic Transcription
■ Used for a correct, unambiguous written record of speech.
■ Uses a phonetic alphabet:
  ■ The International Phonetic Alphabet (IPA) - part of the Unicode standard
  ■ SAMPA (Speech Assessment Methods Phonetic Alphabet)
    ■ 7-bit ASCII transcription of the IPA
    ■ proposed in the 1980s
    ■ used in many TTS systems
    ■ example - SAMPA transcription of the Czech sentence meaning "Czech is a beautiful language.": tSeSTina je kra:sni: jazik

IPA Demo
[Figure: IPA chart, pulmonic consonants. Where symbols appear in pairs, the one to the right represents a voiced consonant.]

■ w → v, q → kv, y → i, ý → í
■ ě:
  ■ bě → bje, pě → pje, fě → fje, vě → vje
  ■ dě → ďe, tě → ťe, ně → ňe, mě → mňe
■ i/í:
  ■ di/dí → ďi/ďí, ti/tí → ťi/ťí, ni/ní → ňi/ňí
■ x:
  ■ x → ks: at the start of a word, before a vowel, between vowels, before a voiceless consonant, or at the end of a word
  ■ x → gz:
    ■ ex + vowel
    ■ before a voiced consonant

Changes at Consonant Junctions
■ Occur where consonants meet.
■ Caused by changes in the vocal tract.
■ Two kinds:
  ■ assimilation of voicing - the voicing of paired consonants changes:
    ■ voiced → voiceless: dub → dup, zpěv → spjev
    ■ voiceless → voiced: sběr → zbjer, když → gdiš
  ■ assimilation of articulation - where two consonants with different articulation meet:
    ■ nk/ng - banka, tango
    ■ mv/mf - tramvaj, nymfa
    ■ nť/nď - punťa, pindík
    ■ dň - odpovědně, sto dní, vodní
    ■ ts → c
    ■ ds → c

Frequency-domain Speech Synthesis
■ Simulates voice production in the vocal tract.
■ Stores:
  ■ frequency characteristics of the voice used for synthesis
  ■ excitation parameters
■ Principle:
  ■ The vocal tract is emulated using:
    ■ frequency generators
    ■ filters
    ■ amplifier(s).
  ■ The components are controlled by the model parameters.
■ The following forms of source encoding are used:
  ■ formant TTS
  ■ LPC TTS
  ■ HMM-based TTS

Formant Type Speech Synthesis
■ Reconstructs the vocal tract formants using serial and parallel connections of several resonant circuits.
■ The formant frequencies and bandwidths are controlled electronically.
■ Synthesizer parameters:
  ■ F0 - fundamental frequency of the vocal cords
  ■ Fi - formants
  ■ FN - nasal formant
  ■ Bi - bandwidths of the Fi formant filters
  ■ Gi - gain (amplification) control parameters
  ■ Ki - formants for consonants.

Serial Formant Type Synthesizer Schema
[Block diagram; labelled blocks include: computer, level control, nasal formant, white-noise generator]
[Block diagram, continued: resonant filter 1, level control, mixer, consonant formants]
Figure: Serial Formant Type Synthesizer Block Schema

LPC Synthesizer
■ LPC synthesizer characteristics:
  ■ T0 - fundamental period of the vocal cord tone
  ■ type of sound - voiced/unvoiced
  ■ G - amplitude of the excitation signal
  ■ digital filter coefficients.
■ The digital filter coefficients can be obtained from:
  ■ the peaks of the LPC spectral envelope of the analysed microsegment
  ■ the roots of the characteristic equation of the source filter
  ■ the reflection coefficients.

LPC Synthesizer Schema
[Block diagram; labelled blocks: computer, digital filter, conversion to continuous form, white-noise generator]
Figure: LPC Synthesizer Block Schema

Frequency-domain Synthesis Summary
■ Advantages and disadvantages of frequency-domain synthesis:
  + Small memory requirements - only a model of the speaker is stored.
  + The synthesis can be realized in hardware.
  - The resulting voice is not as natural as with time-domain synthesis.
    ■ The accuracy of the mathematical model is a problem.
  - Frequency-domain synthesis in software has higher computational demands than time-domain synthesis.
■ Common usage:
  ■ post-processing of time-domain synthesis:
    ■ adding sentence intonation
    ■ adding sentence and word stress
    ■ adding other prosodic factors.
  ■ Sometimes used on devices with insufficient memory capacity (mobile phones, PDAs, ...).
  ■ Sometimes used for multilingual synthesis.
■ See, for example, J. Psutka - Komunikace s počítačem mluvenou řečí.

Time-domain Speech Synthesis
■ Objective - conversion of general text into speech.
■ Based on the concatenation of speech segments.
■ Basic segments of different lengths are used:
  ■ Longer segments:
    ■ the prosodic characteristics of speech can be modelled better
    ■ higher memory demands - a higher number of segments (up to 2^n, where n is the segment length)
    ■ examples of segments - words, parts of sentences, sentences, ...
  ■ Shorter segments:
    ■ fewer possibilities to model the prosody (sentence intonation, stresses, ...)
    ■ smaller memory requirements - a smaller number of smaller segments.

Commonly Used Speech Segments
■ Allophones:
  ■ positional variants of phonemes - contain:
    ■ the phoneme
    ■ its neighbourhood affected by coarticulation.
  ■ number of allophones - n^3 (n - number of phonemes).
■ Diphones:
  ■ start in the middle of the first phoneme and end in the middle of the next phoneme
  ■ number of diphones - n^2
  ■ commonly used in speech synthesis and speech recognition (MBROLA synthesizer).
■ Triphones:
  ■ start in the middle of the previous phoneme, contain the entire middle phoneme and end in the middle of the next phoneme
  ■ number of triphones - n^3
  ■ commonly used in speech synthesis and recognition.
■ Syllable segments:
  ■ should correspond to syllables as closely as possible
  ■ length - 1 to 3 phonemes
  ■ used in the TTS system Demosthenes.

Time-domain Speech Synthesis - Syllable
■ Syllable:
  ■ First-grade primary school children learn how to divide words into syllables.
  ■ The smallest organizational unit of speech.
  ■ The syllable structure cannot always be derived unambiguously - some words can be divided into syllables in more than one way:
    ■ funk-ční vs. funkc-ní
  ■ Total number of Czech syllables - approximately 10 000.
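The inventory sizes quoted above (n^2 diphones, n^3 allophones and triphones, roughly 10 000 Czech syllables) can be compared with a minimal sketch. The phoneme count of 40 is an assumption chosen only for illustration, it is not a figure from these slides, and the function name is hypothetical. The n^2 and n^3 values are upper bounds, since not every phoneme combination occurs in a real language.

```python
def segment_inventory_sizes(n_phonemes: int) -> dict[str, int]:
    """Upper bounds on the number of segments for a language with n phonemes."""
    return {
        "diphones": n_phonemes ** 2,    # n^2: every ordered pair of phonemes
        "allophones": n_phonemes ** 3,  # n^3: phoneme plus left and right context
        "triphones": n_phonemes ** 3,   # n^3: every ordered triple of phonemes
    }


N_PHONEMES = 40  # ASSUMPTION: illustrative value only, not taken from the slides

sizes = segment_inventory_sizes(N_PHONEMES)
sizes["syllables (Czech, approx.)"] = 10_000  # figure quoted in the slides

# Print the inventories from smallest to largest.
for name, count in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name:28s} {count:>8,d}")
```

Under this assumption the diphone inventory is the smallest by a wide margin, which illustrates the trade-off between memory requirements and the ability to model prosody.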
Time-domain Speech Synthesis - Syllable Structure
■ Syllable structure:
  ■ onset (praetura)
  ■ nucleus (vocalic syllable core) - in Czech it can be:
    ■ a vowel or a diphthong
    ■ a sonorant - krk, for example
    ■ a fricative - pst, for example
    ■ a nasal - sedm, for example
  ■ coda - optional
  ■ nucleus + coda form the syllable core
  ■ slopes:
    ■ the onset and the coda
    ■ are formed by one or more consonants.

Time-domain Synthesis - Syllable Segments
■ Defined artificially:
  ■ this resolves the ambiguity of syllable borders.
■ Frequent Czech syllable types:
  ■ V (vowel/diphthong) - ú-kol
  ■ KV (consonant-vowel) - vo-da
  ■ KVK - jed-not-ka
  ■ KK - tr-sy
  ■ KKV - tma
  ■ KKVK - dmout
■ These syllable types cover more than 95 % of syllables.
■ Allow automatic text segmentation.
■ Used, for example, in the TTS system Demosthenes (doc. Kopeček, LSD FI).

Time-domain Synthesis
■ Phonetic transcription.
■ Segmentation of the text corresponding to the speech segments used.
■ Selection of the corresponding acoustic segments from a segment DB.
■ Concatenation of the segments
  ■ The concatenation should be continuous and smooth:
    ■ the end of the first segment should be the same as, or very close to, the start of the second segment
    ■ the first derivative at the end of the first segment should be the same as, or very close to, the first derivative at the start of the second segment.
■ Optional post-processing
  ■ adding prosody.

Time-domain Synthesis - Corpus-based Synthesis
■ Concatenative time-domain synthesis.
■ Uses a speech corpus as the segment database.
  ■ Contains tagged speech.
  ■ Tagging contains:
    ■ phonetic transcription of the speech
    ■ borders of the speech segments
    ■ the F0 contour and, optionally, formant contours.
  ■ Allows more specific speech segments to be selected:
    ■ this decreases the computational complexity of concatenation and post-processing.
■ Segment selection algorithm:
  ■ Select candidate segments according to the phonetic transcription.
  ■ From the candidates, select the segment that best follows the previous one.

Time-domain Synthesis - Frame-based Synthesis
■ Mostly used as problem-oriented synthesis.
■ Synthesised speech is formed from:
  ■ frames - the constant parts of the sentence
  ■ slots - the variable parts of the sentence.
■ Advantages:
  ■ The frames are pre-recorded and may contain the intonation.
  ■ Only the slot content is synthesised:
    ■ a well-specified set of words
    ■ whole words can be used.
■ Example:
  ■ train station radio announcement: The passenger train number ... from ... goes to the platform ... at ...
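The frame and slot idea can be sketched as follows. This is a minimal illustration, not the method of any particular system: the recording identifiers, the slot names (number, origin, platform, time), the slot values and the `fake_tts` stand-in are all hypothetical and were chosen only to mirror the announcement example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """A constant, pre-recorded part of the utterance."""
    recording: str  # hypothetical identifier of a pre-recorded clip


@dataclass(frozen=True)
class Slot:
    """A variable part of the utterance, synthesised at run time."""
    name: str


def build_playlist(template, values, synthesize):
    """Return the ordered audio items: frames as-is, slots synthesised."""
    playlist = []
    for part in template:
        if isinstance(part, Frame):
            playlist.append(part.recording)
        else:
            # Raises KeyError if a slot value is missing.
            playlist.append(synthesize(values[part.name]))
    return playlist


# ASSUMPTION: slot names and their positions are illustrative only.
ANNOUNCEMENT = [
    Frame("the_passenger_train_number"), Slot("number"),
    Frame("from"), Slot("origin"),
    Frame("goes_to_the_platform"), Slot("platform"),
    Frame("at"), Slot("time"),
]


def fake_tts(text):
    """Stand-in for a real synthesiser; only marks what would be synthesised."""
    return f"tts({text})"


playlist = build_playlist(
    ANNOUNCEMENT,
    {"number": "4711", "origin": "Brno", "platform": "2", "time": "10:45"},
    fake_tts,
)
print(playlist)
```

Only the four slot values pass through the synthesiser; the frames are played back unchanged, so they keep the intonation of the original recording.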