Dialogue systems
Speech Synthesis
Luděk Bártek
Laboratory of Searching and Dialogue, Faculty of Informatics, Masaryk
University, Brno
spring 2023
Speech Synthesis
Objective - conversion of written text into speech.
■ The resulting speech should sound as natural as possible. Natural speech should contain:
  ■ correct intonation
  ■ correctly placed stress:
    ■ word stress
    ■ sentence stress
  ■ correct co-articulation
  ■ correct rhythm (timing)
Speech Synthesis Kinds
■ Frequency-domain synthesis - simulates the human vocal tract.
■ Time-domain synthesis - concatenates speech segments into larger units of speech (sentence, utterance, ...).
■ Corpus-based synthesis - a kind of time-domain synthesis that uses a speech corpus instead of a segment database.
■ Problem-oriented synthesis:
  ■ a variant of time-domain synthesis
  ■ uses larger units of speech - sentences, ...
  ■ examples:
    ■ station radio announcements
    ■ automatic phone-support lines
    ■ ...
Speech Synthesis Phases
Phonetic transcription of the text.
Synthesis of the transcribed text:
■ Frequency-domain synthesis - selection of the synthesis parameters (F0 or white-noise generator, formants and their intensities, ...).
■ Time-domain synthesis - selection of suitable segments and their concatenation.
Optional post-processing:
■ adding intonation
■ adding stress
Phonetic Transcription
Provides a correct, unambiguous written record of speech. Uses a phonetic alphabet:
■ The International Phonetic Alphabet (IPA) - its symbols are part of the Unicode standard
■ SAMPA (Speech Assessment Methods Phonetic Alphabet)
  ■ 7-bit (ASCII) encoding of IPA
  ■ proposed in the 1980s
  ■ used in many TTS systems
  ■ example - transcription of the Czech sentence "Čeština je krásný jazyk." ("Czech is a beautiful language."):
    tSeSTina je kra:sni: jazik
IPA
Figure: IPA chart of pulmonic consonants (rows: nasal, plosive, fricative, approximant, trill, tap/flap, lateral fricative, lateral approximant, lateral flap). Where symbols appear in pairs, the one to the right represents a modally voiced consonant, except for murmured ɦ.
Phonetic Transcription
A computer cannot store the transcription of every sentence (there are infinitely many):
■ The transcription has to be generated by rules.
Phonetic transcription rules:
■ May have a regional character.
  ■ Example - pronunciation of the Czech phrase "na shledanou":
    ■ Bohemia - naschledanou
    ■ Moravia - nazhledanou
    ■ Both variants are correct in standard Czech.
■ The transcription need not use all letters of the given alphabet (i/y = i, c = ts, ...).
■ The rules take coarticulation into account (assimilation of voicing).
Czech Phonetic Transcription Rules
■ ch → x, ů → ú, w → v, q → kv, y → i, ý → í
■ ě:
  ■ bě → bje, pě → pje, fě → fje, vě → vje
  ■ dě → ďe, tě → ťe, ně → ňe, mě → mňe
■ i/í:
  ■ di/dí → ďi/ďí, ti/tí → ťi/ťí, ni/ní → ňi/ňí
■ x:
  ■ x → ks: at the start of a word, before a vowel, between vowels, before a voiceless consonant, or at the end of a word
  ■ x → gz:
    ■ ex + vowel
    ■ before a voiced consonant
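The context-free rules above can be applied as an ordered list of string rewrites. The following is a minimal sketch under stated assumptions: the rule table is read from this slide, the function name `transcribe` is hypothetical, the context-dependent x rules are left out, and exceptions such as loanwords are not handled.

```python
import unicodedata

# Ordered rewrite rules read from the slide. Longer patterns come first so
# that "ch" and the softening contexts win over single-letter rules.
# The x -> ks / gz rules need more context and are intentionally omitted.
RULES = [
    ("ch", "x"),
    ("bě", "bje"), ("pě", "pje"), ("fě", "fje"), ("vě", "vje"),
    ("dě", "ďe"), ("tě", "ťe"), ("ně", "ňe"), ("mě", "mňe"),
    ("di", "ďi"), ("dí", "ďí"), ("ti", "ťi"), ("tí", "ťí"),
    ("ni", "ňi"), ("ní", "ňí"),
    ("ů", "ú"), ("w", "v"), ("q", "kv"), ("y", "i"), ("ý", "í"),
]


def transcribe(word):
    """Apply the rewrite rules left to right; unmatched letters are copied."""
    # Normalise to precomposed characters so that "ě" is one code point.
    word = unicodedata.normalize("NFC", word.lower())
    out = []
    i = 0
    while i < len(word):
        for src, dst in RULES:
            if word.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            # No rule matched at this position: keep the letter as it is.
            out.append(word[i])
            i += 1
    return "".join(out)


print(transcribe("chyba"))  # xiba
print(transcribe("město"))  # mňesto
```

A table-driven design keeps the rules editable without touching the scanning loop; voicing assimilation would run as a separate pass on the result.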
Changes at Consonant Junctions
■ Occur where consonants meet.
■ Caused by changes in the vocal tract.
■ Two kinds:
  ■ assimilation of voicing - change of voicing of paired consonants:
    ■ voiced paired consonant → voiceless paired consonant: dub → dup, zpěv → spjev
    ■ voiceless paired consonant → voiced paired consonant: sběr → zbjer, když → gdiš
  ■ assimilation of articulation - where two consonants with different articulation meet:
    ■ nk/ng - banka, tango
    ■ mv/mf - tramvaj, nymfa
    ■ nť/nď - punťa, pinďík
    ■ dň - odpovědně, sto dní, vodní
    ■ ts → c
    ■ tš → č
    ■ ds → c
    ■ dš → č
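The voicing changes can be sketched as two small passes over an already transcribed word. This is an illustrative sketch, not a complete model of Czech: the pairing table, the function names, and the treatment of v (it is assumed not to trigger voicing of a preceding consonant) are assumptions, and sounds such as ř are ignored.

```python
# Paired consonants: voiced -> voiceless. "x" is the transcription of "ch".
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "ď": "ť", "g": "k",
                       "v": "f", "z": "s", "ž": "š", "h": "x"}
VOICELESS_TO_VOICED = {v: k for k, v in VOICED_TO_VOICELESS.items()}


def devoice_final(word):
    """A voiced paired consonant at the end of a word becomes voiceless."""
    if word and word[-1] in VOICED_TO_VOICELESS:
        return word[:-1] + VOICED_TO_VOICELESS[word[-1]]
    return word


def assimilate_clusters(word):
    """A paired consonant takes the voicing of the paired consonant after it."""
    chars = list(word)
    # Scan right to left so the voicing spreads through longer clusters.
    for i in range(len(chars) - 2, -1, -1):
        cur, nxt = chars[i], chars[i + 1]
        if nxt in VOICELESS_TO_VOICED and cur in VOICED_TO_VOICELESS:
            chars[i] = VOICED_TO_VOICELESS[cur]
        elif (nxt in VOICED_TO_VOICELESS and nxt != "v"
              and cur in VOICELESS_TO_VOICED):
            # Assumption: "v" does not make the preceding consonant voiced.
            chars[i] = VOICELESS_TO_VOICED[cur]
    return "".join(chars)


print(devoice_final("dub"))                         # dup
print(assimilate_clusters("sbjer"))                 # zbjer
print(assimilate_clusters(devoice_final("kdiž")))   # gdiš
```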
Frequency-domain Speech Synthesis
Simulates voice formation in the vocal tract. Stores:
■ frequency characteristics of the voice used for synthesis
■ excitation parameters
Principle:
■ Vocal tract emulation using:
  ■ frequency generators
  ■ filters
  ■ amplifier(s).
■ The components are controlled by model parameters.
The following source encoding forms are used:
■ formant TTS
■ LPC TTS
■ HMM-based TTS
Formant Type Speech Synthesis
Reconstructs the vocal tract formants using serial and parallel connections of several resonant circuits.
The formant frequencies and bandwidths are controlled electronically.
Synthesizer parameters:
■ F0 - fundamental frequency of the vocal cords
■ F_i - formants
■ F_N - nasal formant
■ B_i - bandwidths of the F_i band filters
■ G_j - gain (amplification) control parameters
■ K_j - formants for consonants.
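A serial formant synthesizer can be sketched in software as a chain of second-order digital resonators, one per formant, each set by a formant frequency and a bandwidth. The code below is a hedged sketch: it uses a commonly described two-pole resonator with unity gain at 0 Hz, the function names are hypothetical, and the formant values in the usage lines are illustrative numbers, not values from the slides.

```python
import math


def resonator_coefficients(formant_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * formant_hz * t))
    a = 1.0 - b - c  # makes the gain at 0 Hz equal to 1
    return a, b, c


def resonate(signal, formant_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through one resonator."""
    a, b, c = resonator_coefficients(formant_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, sample_rate_hz, n_samples):
    """Voiced excitation: one unit impulse per fundamental period."""
    period = max(1, round(sample_rate_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def synthesize_vowel(f0_hz, formants, sample_rate_hz=16000.0, n_samples=1600):
    """Serial connection: the output of one resonator feeds the next."""
    signal = impulse_train(f0_hz, sample_rate_hz, n_samples)
    for formant_hz, bandwidth_hz in formants:
        signal = resonate(signal, formant_hz, bandwidth_hz, sample_rate_hz)
    return signal


# Illustrative (F_i, B_i) pairs in Hz; assumed values for demonstration only.
samples = synthesize_vowel(120.0, [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)])
print(len(samples))
```

In a parallel connection the resonators would instead receive the same excitation and their outputs would be weighted by the gain parameters and summed.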
Serial Formant Type Synthesizer Schema
Figure: Serial Formant Type Synthesizer Block Schema. Recoverable block labels: computer, white-noise generator, resonance filters (bandwidths B), nasal formant, consonant formants (K), level controls, mixer.
LPC synthesizer
LPC synthesizer parameters:
■ fundamental period of the vocal cords T0
■ type of sound - voiced/unvoiced
■ excitation signal amplitude G
■ digital filter coefficients.
The digital filter coefficients can be obtained from:
■ the peaks of the LPC spectral envelope of the analysed microsegment
■ the roots of the characteristic equation of the source filter
■ the reflection coefficients.
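The parameters listed above map directly onto a small synthesis loop: an excitation (impulse train with period T0 for voiced sounds, white noise for unvoiced ones) is scaled by G and passed through an all-pole digital filter. The sketch below assumes the sign convention s[n] = G*e[n] + sum(a_k * s[n-k]); some texts define the coefficients with the opposite sign. The function and parameter names are hypothetical.

```python
import random


def lpc_synthesize(coeffs, gain, n_samples, voiced, period_samples=1, seed=0):
    """Generate samples from LPC parameters with an all-pole filter.

    coeffs         -- predictor coefficients a_1..a_p (assumed sign convention)
    gain           -- excitation amplitude G
    voiced         -- True: impulse train, False: white noise
    period_samples -- fundamental period T0 expressed in samples
    """
    rng = random.Random(seed)          # fixed seed keeps the sketch repeatable
    history = [0.0] * len(coeffs)      # s[n-1], s[n-2], ..., s[n-p]
    out = []
    for n in range(n_samples):
        if voiced:
            excitation = 1.0 if n % period_samples == 0 else 0.0
        else:
            excitation = rng.uniform(-1.0, 1.0)
        sample = gain * excitation + sum(a * h for a, h in zip(coeffs, history))
        out.append(sample)
        if history:
            history = [sample] + history[:-1]
    return out


# One-pole toy filter: each impulse decays by a factor of 0.5 per sample.
print(lpc_synthesize([0.5], 2.0, 4, voiced=True, period_samples=4))
# [2.0, 1.0, 0.5, 0.25]
```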
LPC Synthesizer Schema
Figure: LPC Synthesizer Block Schema. Recoverable block labels: computer, white-noise generator, digital filter, conversion to a continuous signal.
Frequency-domain Synthesis Summary
■ Advantages and disadvantages of frequency-domain synthesis:
  + Small memory requirements - only a model of the speaker is stored.
  + The synthesis can be implemented in hardware.
  - The resulting voice is not as natural as with time-domain synthesis.
    ■ The accuracy of the mathematical model is a problem.
  - Software frequency-domain synthesis has higher computational demands than time-domain synthesis.
■ Common usage:
  ■ post-processing of time-domain synthesis:
    ■ adding sentence intonation
    ■ adding sentence and word stress
    ■ adding other prosodic factors.
  ■ Sometimes used on devices with insufficient memory capacity (mobile phones, PDAs, ...).
  ■ Sometimes used for multilingual synthesis.
■ See, for example, J. Psutka - Komunikace s počítačem mluvenou řečí.
Time-domain Speech Synthesis
Objective - conversion of general text into speech.
Based on the concatenation of speech segments.
Basic segments of different lengths are used:
■ Longer segments:
  ■ the prosodic characteristics of speech can be modelled better
  ■ higher memory demands - a larger number of segments (up to 2^n, where n is the segment length)
  ■ examples of segments - words, parts of sentences, sentences, ...
■ Shorter segments:
  ■ prosody (sentence intonation, stress, ...) is harder to model
  ■ lower memory requirements - a smaller number of shorter segments.
Commonly Used Speech Segments
Allophones:
■ positional variants of phonemes - they contain:
  ■ the phoneme
  ■ its neighbourhood affected by coarticulation.
■ number of allophones - n^3 (n - number of phonemes).
Diphones:
■ start in the middle of the first phoneme and end in the middle of the next phoneme
■ number of diphones - n^2
■ commonly used in speech synthesis and speech recognition (e.g. the MBROLA synthesizer)
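Splitting a phoneme sequence into diphone labels is a pairwise walk over neighbouring phonemes. The sketch below is illustrative: the function name, the "a-b" label format, and the "_" symbol for the silence at the word boundaries are assumptions, not part of the slides.

```python
def to_diphones(phonemes, boundary="_"):
    """Return diphone labels for a phoneme sequence.

    Each label names the transition from one phoneme to the next; the
    sequence is padded with a boundary (silence) symbol on both sides.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [left + "-" + right for left, right in zip(padded, padded[1:])]


print(to_diphones(["j", "a", "z", "i", "k"]))
# ['_-j', 'j-a', 'a-z', 'z-i', 'i-k', 'k-_']
```

A sequence of m phonemes always yields m + 1 diphones with this padding, which is why the inventory grows with n^2 rather than with the length of the text.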
Commonly Used Speech Segments
Triphones:
■ Start in the middle of the previous phoneme, contain the entire middle phoneme, and end in the middle of the next phoneme.
■ Number of triphones - n^3.
■ Commonly used in speech synthesis and recognition.
Syllable segments:
■ should correspond to syllables as closely as possible.
■ Length: 1-3 phonemes.
■ Used in the TTS system Demosthenes.
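To make the inventory sizes concrete, the sketch below labels triphones with a left-centre+right notation and compares the upper bounds n^2 and n^3. The label format, the function names, and the phoneme count of 40 are assumptions chosen for illustration; the slides do not state the size of the phoneme set.

```python
def to_triphones(phonemes, boundary="_"):
    """Label each phoneme together with its left and right neighbour."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        left + "-" + centre + "+" + right
        for left, centre, right in zip(padded, padded[1:], padded[2:])
    ]


def inventory_upper_bound(n_phonemes, context_size):
    """Upper bound on distinct segments: n^2 for diphones, n^3 for triphones."""
    return n_phonemes ** context_size


print(to_triphones(["t", "m", "a"]))   # ['_-t+m', 't-m+a', 'm-a+_']
n = 40  # hypothetical phoneme count, for illustration only
print(inventory_upper_bound(n, 2))     # 1600
print(inventory_upper_bound(n, 3))     # 64000
```

The bounds are theoretical maxima; many combinations never occur in a language, so a practical inventory is smaller.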
Time-domain Speech Synthesis
Syllable
Syllable:
■ Children learn to divide words into syllables in the first year of primary school.
■ The smallest organizational unit of speech.
■ Syllable boundaries cannot always be derived - some words can be divided into syllables in more than one way:
  ■ funk-ční vs. funkc-ní
■ Total number of Czech syllables - approximately 10 000.
Time-domain Speech Synthesis
Syllables Structure
Syllable structure:
■ onset (praetura)
■ nucleus (vocalic core of the syllable) - in Czech it can be:
  ■ a vowel or a diphthong
  ■ a sonorant - e.g. krk
  ■ a fricative - e.g. pst
  ■ a nasal - e.g. sedm
■ coda - optional
■ nucleus + coda form the rhyme of the syllable
■ margins (slopes):
  ■ the onset and the coda
  ■ formed by one or more consonants.
Time-domain Synthesis
Syllable Segments
Defined artificially:
■ this resolves the ambiguity of syllable boundaries.
The most frequent Czech syllable types (K - consonant, V - vowel):
■ V (vowel/diphthong) - ú-kol
■ KV (consonant-vowel) - vo-da
■ KVK - jed-not-ka
■ KK - tr-sy
■ KKV - tma
■ KKVK - dmout
These syllable types cover more than 95 % of syllables.
They allow automatic text segmentation.
Used, for example, in the TTS system Demosthenes (doc. Kopeček, LSD FI).
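The K/V type of a syllable segment can be computed from its spelling. The sketch below is a simplified illustration, not the Demosthenes algorithm: the function name is hypothetical, the vowel and diphthong sets are assumptions, diphthongs count as a single V as in the list above, and the digraph "ch" counts as one consonant.

```python
VOWELS = set("aeiouyáéíóúůýě")
DIPHTHONGS = ("ou", "au", "eu")


def kv_pattern(syllable):
    """Map a syllable to its consonant/vowel type, e.g. 'dmout' -> 'KKVK'."""
    s = syllable.lower()
    pattern = []
    i = 0
    while i < len(s):
        if s.startswith(DIPHTHONGS, i):
            pattern.append("V")   # a diphthong counts as one vowel slot
            i += 2
        elif s.startswith("ch", i):
            pattern.append("K")   # the digraph "ch" is a single consonant
            i += 2
        elif s[i] in VOWELS:
            pattern.append("V")
            i += 1
        else:
            pattern.append("K")
            i += 1
    return "".join(pattern)


for syllable in ["ú", "vo", "jed", "tr", "tma", "dmout"]:
    print(syllable, kv_pattern(syllable))
```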
Synthesis
Phonetic transcription.
Segmentation of the text into units corresponding to the speech segments used.
Selection of the corresponding acoustic segments from a segment database.
Concatenation of the segments:
■ The concatenation should be continuous and smooth:
  ■ the end of the first segment should be the same as, or very close to, the start of the second segment
  ■ the first derivative at the end of the first segment should be the same as, or very close to, the first derivative at the start of the second segment.
Optional post-processing:
■ adding prosody.
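The two smoothness conditions can be checked numerically on sampled segments. In this sketch the first derivative is approximated by the difference of the two samples nearest to the join; the function name and the choice of absolute differences as the measure are assumptions.

```python
def join_mismatch(left, right):
    """Measure how well two sample sequences fit together at the join.

    Returns (value_gap, slope_gap): the difference between the last sample
    of `left` and the first sample of `right`, and the difference between
    their first derivatives, approximated by neighbouring-sample differences.
    Both segments need at least two samples.
    """
    value_gap = abs(left[-1] - right[0])
    left_slope = left[-1] - left[-2]
    right_slope = right[1] - right[0]
    slope_gap = abs(left_slope - right_slope)
    return value_gap, slope_gap


print(join_mismatch([0.0, 1.0, 2.0], [2.0, 3.0, 4.0]))  # (0.0, 0.0)
print(join_mismatch([0.0, 1.0, 2.0], [5.0, 4.0]))       # (3.0, 2.0)
```

A pair with both gaps close to zero joins without an audible click; larger gaps call for smoothing, for example a short cross-fade, or for a different candidate segment.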
Time-domain Synthesis
Corpus Based Synthesis
Concatenative time-domain synthesis.
Uses a speech corpus as the segment database.
■ The corpus contains tagged speech.
■ The tagging contains:
  ■ phonetic transcription of the speech
  ■ speech segment boundaries
  ■ the contour of F0 and, optionally, of the formants.
■ Allows more specific speech segments to be selected:
  ■ decreases the computational complexity of concatenation and post-processing.
Segment selection algorithm:
■ Select candidate segments according to the phonetic transcription.
■ Among them, select the segment that best follows the preceding one.
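The two selection steps can be sketched as a greedy loop over the target transcription. This is a simplified illustration under stated assumptions: the corpus is a list of dictionaries with hypothetical fields (`label`, `start_f0`, `end_f0`), and "follows best" is reduced to the smallest F0 jump at the join. Practical unit-selection systems usually search over all candidate sequences instead of deciding greedily.

```python
def join_cost(previous, candidate):
    """Assumed cost: the F0 jump between two neighbouring segments."""
    return abs(previous["end_f0"] - candidate["start_f0"])


def select_segments(targets, corpus):
    """Pick one corpus segment per target label, left to right."""
    chosen = []
    for label in targets:
        # Step 1: candidates that match the phonetic transcription.
        candidates = [seg for seg in corpus if seg["label"] == label]
        if not candidates:
            raise KeyError("no segment in the corpus for " + repr(label))
        # Step 2: the candidate that follows the previous segment best.
        if chosen:
            best = min(candidates, key=lambda seg: join_cost(chosen[-1], seg))
        else:
            best = candidates[0]
        chosen.append(best)
    return chosen


# Toy corpus with made-up F0 values in Hz.
corpus = [
    {"id": 1, "label": "a", "start_f0": 100.0, "end_f0": 120.0},
    {"id": 2, "label": "b", "start_f0": 200.0, "end_f0": 210.0},
    {"id": 3, "label": "b", "start_f0": 125.0, "end_f0": 130.0},
]
print([seg["id"] for seg in select_segments(["a", "b"], corpus)])  # [1, 3]
```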
Time-Domain Synthesis
Frame-based Synthesis
Mostly used as problem-oriented synthesis. The synthesised speech is formed from:
■ frames - the constant parts of the sentence
■ slots - the variable parts of the sentence.
Advantages:
■ The frames are pre-recorded and may contain the intonation.
■ Only the slot content is synthesised:
  ■ well-defined set of words
  ■ whole words can be used as segments.
Example:
■ train station radio announcement:
  The passenger train number <train number> from <station of origin> arrives at platform <platform number> at <time>.
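A frame-based announcement can be planned by splitting a template into pre-recorded frame parts and slot parts that still need to be synthesised. The sketch below uses Python's `string.Formatter` to parse the template; the function name, the slot names, and the slot values are hypothetical.

```python
from string import Formatter

TEMPLATE = ("The passenger train number {train_number} from {station_of_origin} "
            "arrives at platform {platform_number} at {time}.")


def plan_announcement(template, slots):
    """Return a playback plan as a list of (kind, text) pairs.

    "frame" items stand for pre-recorded audio, "slot" items for text
    that is synthesised when the announcement is produced.
    """
    plan = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        text = literal.strip()
        if text:
            plan.append(("frame", text))
        if field is not None:
            plan.append(("slot", str(slots[field])))
    return plan


# Made-up slot values for illustration.
plan = plan_announcement(TEMPLATE, {
    "train_number": "4711",
    "station_of_origin": "Brno",
    "platform_number": "2",
    "time": "14:35",
})
for kind, text in plan:
    print(kind, text)
```

Keeping the frames as whole recordings preserves their natural intonation, and only the short slot items pass through the synthesizer.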