About Speaker Recognition Technology
Gerik Alexander von Graevenitz
Bergdata Biometrics GmbH, Bonn, Germany
gerik @ graevenitz.de
Overview of Biometrics
Biometrics refers to the automatic identification of a living person based
on physiological or behavioural characteristics. There are many types of
biometric technologies on the market: face recognition, fingerprint
recognition, finger geometry, hand geometry, iris recognition, vein
recognition, voice and signature recognition.
Biometric identification is preferred over traditional methods involving
passwords and PINs for various reasons:
The person to be identified is required to be physically present at the
point-of-identification. The identification based on biometric techniques
obviates the need to remember a password or carry a token or a
smartcard.
With the rapid increase in the use of PINs and passwords as a result of
the information technology revolution, it is necessary to restrict
access to sensitive or personal data. By replacing PINs and passwords,
biometric techniques are more convenient for the user and can
potentially prevent unauthorised access to or fraudulent use of ATMs,
Time & Attendance Systems, cellular phones, smart cards, desktop PCs,
workstations, and computer networks. PINs and passwords may be
forgotten, and token-based methods of identification, like passports,
driver's licenses and insurance cards, may be forgotten, stolen, or lost.
Various types of biometric systems are being used for real-time
identification; the most popular are based on face recognition and
fingerprint matching. Furthermore, there are other biometric systems
that utilise iris and retinal scans, speech, and hand geometry.
Voice Recognition
Speech contains information about the identity of the speaker. A speech
signal also carries information about the language that is spoken, the
presence and type of speech pathologies, and the physical and emotional
state of the speaker. Humans are often able to extract the identity
information when the speech comes from a speaker they are acquainted with.
Lawrence Kersta at Bell Labs made the first major step from speaker
verification by humans towards speaker verification by computers in the
early 1960s, when he introduced the term voiceprint for a spectrogram,
which was generated by a complicated electro-mechanical device. The
voiceprint was matched with a verification algorithm that was based on
visual comparison.
The recording of the human voice for speaker recognition requires a
human to say something. In other words the human has to show some of
his/her speaking behavior. Therefore, voice recognition fits within the
category of behavioral biometrics.
A speech signal is a very complex function of the speaker and his or her
environment that can be captured easily with a standard microphone. In
contrast to a physical biometric technology such as fingerprint
recognition, speaker recognition has no fixed, static, physical
characteristics to measure. It has only information that depends on an
act, namely speaking.
The state-of-the-art approach to automatic speaker verification (denoted
as ASV) is to build a stochastic model of a speaker, based on speaker
characteristics extracted from the available amount of training speech.
In speaker recognition we distinguish between low-level and high-level
information. High-level information comprises features such as dialect,
accent, talking style, and the subject matter or context. These features
are currently recognized and analyzed only by humans. Low-level
information comprises features such as pitch period, rhythm, tone,
spectral magnitude, frequencies, and bandwidths of an individual's voice.
These features are used by speaker recognition systems.
Voice verification works with a microphone or with a regular telephone
handset, although performance increases with higher quality capture
devices. The hardware costs are very low, because today nearly every PC
includes a microphone or can easily have one connected. However, voice
recognition has problems with persons who are hoarse or who mimic
another voice. If this happens the user may not be recognized by the
system. Additionally, the likelihood of recognition decreases with
poor-quality microphones and with background noise. Voice verification
will be a complementary technique for e.g. finger-scan technology, as
many people see finger recognition technology as a higher form of
authentication.
In general voice authentication has a high equal error rate (EER) and is
therefore generally not used for identification. Speech varies over time,
so adaptive templates or methods are necessary.
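The EER is the operating point at which the false acceptance rate (FAR)
equals the false rejection rate (FRR). The following is a minimal sketch
of how it could be estimated from match scores. It assumes that a higher
score means a better match; the function names and the score lists are
invented for illustration and do not come from any particular system.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) when every score >= threshold is accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Try every observed score as a threshold and return the
    (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical similarity scores, invented for illustration only.
genuine_scores = [0.91, 0.85, 0.78, 0.66, 0.59, 0.95, 0.72, 0.81]
impostor_scores = [0.30, 0.42, 0.55, 0.61, 0.25, 0.48, 0.70, 0.38]

eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER estimate {eer:.3f} at threshold {threshold:.2f}")
```

Lowering the threshold trades false rejections for false acceptances,
which is why a single number such as the EER is convenient for comparing
systems.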
Intraspeaker variance versus Interspeaker variance
The variation of features caused by different speakers is called
interspeaker variance. The interspeaker variance is caused by different
vocal characteristics of individuals and provides useful information for
distinguishing different speakers. Another kind of variation,
intraspeaker variation, occurs when a speaker pronounces the same word
or sentence but cannot repeat the utterance in exactly the same way from
trial to trial.
The intraspeaker variation includes differences in speaking rate, the
emotional state of the speaker and the speaking environment. It is the
main factor that degrades the performance of speaker recognition
systems. Therefore, it is desirable to select parameters that show low
intraspeaker but high interspeaker variability. In many speaker
recognition applications, it is possible to reduce the intraspeaker
variability by requiring the user to pronounce a test sentence that
contains the same text or vocabulary as the training sentences. This is
the case in text-dependent speaker recognition methods.
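One possible way to rank candidate parameters by this criterion is an
F-ratio, the variance of the per-speaker means divided by the average
variance within each speaker. The text above does not name this measure,
so it is an assumption here, and the feature values below are made up.

```python
from statistics import mean, pvariance


def f_ratio(samples_by_speaker):
    """Interspeaker variance (variance of the per-speaker means) divided
    by the average intraspeaker variance, for one scalar feature."""
    speaker_means = [mean(v) for v in samples_by_speaker.values()]
    inter = pvariance(speaker_means)
    intra = mean([pvariance(v) for v in samples_by_speaker.values()])
    return inter / intra


# Hypothetical measurements of two features over repeated utterances.
pitch_hz = {
    "speaker_a": [110.0, 112.0, 108.0, 110.0],
    "speaker_b": [180.0, 178.0, 182.0, 180.0],
    "speaker_c": [145.0, 147.0, 143.0, 145.0],
}
loudness_db = {
    "speaker_a": [60.0, 70.0, 65.0, 55.0],
    "speaker_b": [62.0, 58.0, 68.0, 64.0],
    "speaker_c": [66.0, 56.0, 61.0, 63.0],
}

print("pitch F-ratio:   ", round(f_ratio(pitch_hz), 2))
print("loudness F-ratio:", round(f_ratio(loudness_db), 2))
```

In this invented data the pitch values separate the speakers well while
staying stable per speaker, so pitch would be the better parameter.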
Text-dependent vs. text-independent speaker recognition
Speaker recognition systems are classified as text-dependent (fixed-text)
and text-independent (free-text). Text-dependent systems require a user
to repeat some specified utterances, usually containing the same text as
the training data. There is no such constraint in text-independent
systems. In a text-dependent system, knowledge of the words or word
sequence can be exploited to improve performance.
There are two main reasons for wanting a speaker verification system to
prompt the client with a new password phrase for each new test occasion:
(a) the user does not have to remember a fixed password and (b) the
system cannot easily be defeated by replaying recordings of the user's
speech.
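A prompted phrase can be as simple as a random digit string. The sketch
below assumes a hypothetical system in which the user has enrolled all
ten spoken digits; the vocabulary, the function name and the phrase
length are illustrative choices only.

```python
import secrets

# Hypothetical prompt vocabulary; it assumes the user enrolled every digit.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def new_prompt(length=5):
    """Draw a fresh, unpredictable digit phrase for one verification attempt."""
    return " ".join(secrets.choice(DIGIT_WORDS) for _ in range(length))


phrase = new_prompt()
print("Please say:", phrase)
```

Because the phrase changes on every attempt, a recording of an earlier
session is unlikely to contain the requested word sequence.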
Text-independent speaker verification has been an active area of research
for a long time because performance degradation due to mismatched
conditions has been a significant barrier for deployment of speaker
recognition technologies.
How it works
There are a few methods that are used for speaker verification.
Text-dependent speaker recognition methods can be classified into DTW
(dynamic time warping) based and HMM (Hidden Markov Model) based
methods.
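Dynamic time warping aligns two utterances that were spoken at different
speeds before comparing them. The following is a minimal sketch of the
classic algorithm, assuming Euclidean distance as the local cost. The
one-dimensional sequences are toy data; a real system would compare one
feature vector per speech frame.

```python
import math


def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between two
    sequences of feature vectors (Euclidean local cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]


# Toy one-dimensional feature tracks, invented for illustration.
template = [(0.0,), (1.0,), (2.0,), (1.0,), (0.0,)]
slow_repeat = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (2.0,), (1.0,), (0.0,)]
other = [(2.0,), (2.0,), (0.0,), (0.0,), (2.0,)]

print("same phrase, spoken slower:", dtw_distance(template, slow_repeat))
print("different contour:         ", dtw_distance(template, other))
```

The slower repetition aligns with the template at no cost, which is the
property that makes DTW tolerant of changes in speaking rate.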
The DTW methods use instantaneous and transitional cepstra. In 1963,
Bogert et al. published a paper with the title "The Quefrency Alanysis
of Time Series for Echoes". They defined a new signal processing
technique, together with an extensive vocabulary formed by interchanging
letters of familiar terms: spectrum became cepstrum, frequency became
quefrency, and analysis became alanysis. The cepstrum is usually
computed with a Fast Fourier Transform (FFT).
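As an illustration, the real cepstrum can be sketched as the inverse FFT
of the log magnitude spectrum. The example below uses NumPy and, in the
spirit of the echo analysis named in the paper title, a synthetic signal
with one echo. The signal, the delay and the search range are
assumptions chosen for the demonstration, not values from any speaker
recognition system.

```python
import numpy as np


def real_cepstrum(signal):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(signal)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # offset avoids log(0)
    return np.fft.ifft(log_magnitude).real


# Synthetic signal (an assumption for illustration): white noise plus a
# half-amplitude echo delayed by 100 samples.
rng = np.random.default_rng(0)
source = rng.standard_normal(4096)
delay = 100
echoed = source.copy()
echoed[delay:] += 0.5 * source[:-delay]

cepstrum = real_cepstrum(echoed)
# Skip the lowest quefrencies and search the first half for the echo peak.
peak = 20 + int(np.argmax(cepstrum[20:len(cepstrum) // 2]))
print("echo found at quefrency", peak, "samples")
```

In speaker recognition the low-quefrency coefficients are the ones
normally kept as features, since they describe the smooth spectral
envelope of the voice.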
Since 1975, Hidden Markov Modeling (denoted as HMM), named after the
Russian mathematician A. A. Markov, has become a popular technique in
speech recognition research. With HMM-based methods, the statistical
variation of spectral features is measured.
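How an HMM assigns a likelihood to an utterance can be sketched with the
forward algorithm. The example assumes discrete observation symbols (for
instance, indices of quantized spectral features) and two tiny
hypothetical speaker models; all probabilities are invented, and a real
system would use many more states and work with log probabilities.

```python
def forward_likelihood(obs, start, trans, emit):
    """Forward algorithm: P(obs | model) for a discrete-output HMM."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, left-to-right models sharing start and transitions.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit_a = [[0.9, 0.1], [0.2, 0.8]]  # invented model for speaker A
emit_b = [[0.1, 0.9], [0.8, 0.2]]  # invented model for speaker B

observed = [0, 0, 1, 1]
like_a = forward_likelihood(observed, start, trans, emit_a)
like_b = forward_likelihood(observed, start, trans, emit_b)
print("better matching model:", "A" if like_a > like_b else "B")
```

A verification decision would then compare the likelihood under the
claimed speaker's model with a threshold or with a competing model.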
Examples of text-independent speaker recognition methods are the
average-spectrum-based method, the VQ-based methods and the
multivariate auto-regression (MAR) model.
The average-spectrum-based method uses a weighted cepstral distance
measure, where the phoneme effects in speech spectra are removed by
averaging the spectra.
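The averaging idea can be sketched as follows: every cepstral
coefficient is averaged over all frames of an utterance, and the
resulting long-term vectors are compared with a weighted distance. The
weights and the frame values below are placeholders; the weighting
actually used by published average-spectrum methods is not given in this
text.

```python
from statistics import mean


def long_term_average(frames):
    """Average each cepstral coefficient over all frames of an utterance."""
    return [mean(column) for column in zip(*frames)]


def weighted_distance(x, y, weights):
    """Weighted squared Euclidean distance between two averaged vectors."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


# Hypothetical cepstral frames (three coefficients each) and weights.
enrolled_frames = [[1.0, 0.2, -0.1], [1.4, 0.0, 0.1], [0.6, 0.4, 0.0]]
test_frames = [[0.5, 0.1, 0.1], [0.7, 0.3, -0.1]]
weights = [1.0, 2.0, 3.0]

reference = long_term_average(enrolled_frames)
distance = weighted_distance(reference, long_term_average(test_frames), weights)
print("distance to enrolled speaker:", round(distance, 3))
```

Because the average runs over many different phonemes, what remains is
dominated by the speaker rather than by the text that was spoken.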
In principle, a set of short-term training feature vectors of a speaker
can be used directly to represent the essential characteristics of that
speaker. However, such a direct representation is impractical when the
number of training vectors is large, since the memory and amount of
computation required become prohibitively large. The VQ-based method
therefore compresses the training data using vector quantization (VQ)
techniques.
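A minimal sketch of this idea, assuming plain k-means as the codebook
training method (other codebook designs exist) and synthetic Gaussian
data in place of real feature vectors: each speaker is reduced to a
small codebook, and a test utterance is scored by its average distance
to the nearest code vector. The function names and parameter values are
illustrative.

```python
import numpy as np


def train_codebook(vectors, size, iterations=20, seed=0):
    """Compress training vectors into `size` code vectors with k-means."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=size, replace=False)].copy()
    for _ in range(iterations):
        # Assign every training vector to its nearest code vector.
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(size):
            members = vectors[nearest == k]
            if len(members):  # keep the old code vector if a cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook


def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest code vector."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())


# Synthetic stand-ins for the feature vectors of two speakers.
rng = np.random.default_rng(1)
speaker_a_train = rng.normal(0.0, 1.0, size=(200, 4))
speaker_b_train = rng.normal(3.0, 1.0, size=(200, 4))
book_a = train_codebook(speaker_a_train, size=4)
book_b = train_codebook(speaker_b_train, size=4)

test_from_a = rng.normal(0.0, 1.0, size=(50, 4))
score_a = average_distortion(test_from_a, book_a)
score_b = average_distortion(test_from_a, book_b)
print("speaker A is the better match:", score_a < score_b)
```

Here 200 training vectors per speaker are replaced by 4 code vectors,
which is the saving in memory and computation that motivates the method.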
Montacie et al. applied a multivariate auto-regression (MAR) model to
the time series of cepstral vectors to characterize speakers and
obtained quite good results.
Sources of Verification Errors
There are a few sources of verification errors that may occur:
* Misspoken or misread prompted phrases
* Extreme emotional states (e.g. stress or duress)
* A manner of speaking that differs from the one used during
enrollment
* Time varying (intra- or intersession) microphone placement
* Poor or inconsistent room acoustics (e.g. multipath and noise)
* Channel mismatch (e.g. using different microphones for enrollment
and verification)
* Different pronunciation speed during the verification compared
with the training data.
* Sickness (e.g. head colds can alter the vocal tract)
* Aging (the vocal tract can drift away from models with age)
* Women have a considerably higher false rejection rate (FRR), because
the spectrum of the voice is smaller.
Faking a voice verification system requires a very high quality recorder,
which is not easy to find on the market. Normal voice recorders that are
on the market do not record the complete spectrum of the voice that is
necessary to fake the system. The quality loss of the voice recording
system must be very low, too. With most voice verification systems, one
person imitating the voice of another person does not lead to success.
The sources of verification errors mentioned above mean that
identification by voice is quite complex in practice. Therefore voice
systems are in most cases used for verification, in combination with a
PIN or a chip card that identifies the user in a database.
Applications
The applications of speaker verification systems are:
* Time and Attendance Systems
* Access Control Systems
* Telephone-Banking/Broking
* Biometric Login to telephone aided shopping systems
* Information and Reservation Services
* Security control for confidential information
* Forensic purposes
Conclusion
The advantage of a voice verification system is the very cheap hardware
that is needed: most computers already include a sound card and a
microphone. It is very easy to use and to integrate into
telecommunication applications.
Voice recognition has a few disadvantages, too. On the one hand the
human voice is not invariant over time, therefore the biometric template
must be adapted as time progresses. The human voice is also subject to
temporary variations caused by a cold, hoarseness, stress, different
emotional states or the vocal change of puberty. On the other hand voice
recognition systems have a higher EER than fingerprint recognition
systems, because the human voice is not as "unique" as fingerprints. To
compute the Fast Fourier Transform the system needs a co-processor and
more processing power than e.g. fingerprint matching. Therefore speaker
verification systems are not suitable for mobile applications or
battery-powered systems in the current state.