International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 11, November 2016
Gender Identification Using A Speech Dependent Voice
Frequency Estimation System
V. D. Ambeth Kumar, T. Vijaya Rajasekar, S. Malathi, P. Athiyaman and V. Kamalakannan
Department of Computer Science and Engineering, Panimalar Engineering College, Chennai, India
Department of Mathematics, Panimalar Engineering College, Chennai, India
Email: ambeth_in@yahoo.co.in
Abstract
The objective of this research paper is to design a speaker dependent system that determines the gender of the speaker using the pitch of the speaker's voice. A speaker dependent system is a system that can recognize speech from one particular speaker. The pitch of the speaker's voice is estimated by applying various pitch estimation techniques, namely FFT, Cepstral Analysis, Autocorrelation and MFCC, and the gender of the speaker is determined by classifying the estimated pitch against known frequency ranges. The proposed system can be used in implementations of AI technologies, the Internet of Things and various other future applications to detect the gender of the speaker with maximum possible efficiency and accuracy.
Keywords: FFT, Cepstral Analysis, Autocorrelation, MFCC, Python
I. Introduction
For human beings, speech has proven to be the major medium of communication with other humans, animals and even machines. Every day, research is undertaken by experts of various sciences across the world in the field of speech processing to study the characteristics of the speaker. Of all the characteristics, gender is the most fundamental and obvious one. There are many possible applications where the gender of the speaker plays an important role. For example, an AI system that controls home appliances by receiving voice input from the user can modify its behaviour based on the needs that arise from the gender of the user. To determine the gender, we use the pitch of the speaker's voice as a classifier.
Lawrence R. and Michael J. Cheng [1] defined a pitch detector as an important component of various speech processing systems, such as vocoder systems. Their comparative study of various pitch detection algorithms shows that accurate and reliable measurement of pitch from the frequency waveform is difficult. They classified pitch detection algorithms into three broad categories:
Using time-domain properties of speech
Using frequency-domain properties of speech
Using both time and frequency domains
The performance of a pitch detection algorithm is evaluated based on its speed, suitability, complexity and cost of implementation.
1025
https://sites.google.com/site/ijcsis/
ISSN 1947-5500
Brown and Puckette studied the implementation of the FFT in detail. The FFT is an algorithm that rapidly computes the Discrete Fourier Transform of a signal, which is the representation of the signal in the frequency domain.
An FFT-based estimator performs parabolic interpolation around the magnitude peak to determine a truer peak, which provides better accuracy. But the main problem with using the FFT is that it fails when the harmonics are stronger than the fundamental, which is common.
Li Tan and Montri K. [3] studied the implementation of the autocorrelation method in detail. Autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself at different points in time. It is the best method for finding the true fundamental of any repetitive wave, even with a faint or missing fundamental. But it has several disadvantages: it produces inaccurate results if the waveform is not perfectly repeating, as with inharmonic musical instruments, and, unlike the FFT, this implementation has trouble finding the true peak.
Vibha Tiwari [4] studied the implementation of the MFCC algorithm in the domain of speech processing. The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel frequencies are obtained by computing the FFT of the signal, mapping the powers onto the mel scale by applying a filterbank (i.e. a collection of filters), taking the logarithm of the mel-scale powers, and finally converting the result back to hertz.
K. Ravi Kumar, V. Ambika and K. Suri Babu [5] studied the implementation of Cepstral Analysis for identification of emotion from speech. A cepstrum is the result of taking the Inverse Fourier Transform (IFT) of the logarithm of the spectrum of a signal. The pitch is obtained by taking the FFT of the signal, determining the log of each value in the FFT, and applying the inverse FFT to the resultant array. The frequency is then determined by dividing the sampling frequency by the index of the maximum value in the inverse FFT.
This paper aims to study the implementation of the above four algorithms for gender determination purposes using Python as the language. Python is chosen because of its compatibility with various backend tools and its ease of implementation. In Python, the above algorithms are implemented using the NumPy and SciPy libraries, along with Matplotlib for signal waveform generation. The various mathematical methods provided by these libraries help in the determination of frequency from the input signal.
II. Proposed System
As mentioned in the introduction section, Python is used to create a script that will perform a set of operations to determine the frequency of the speaker's voice. The operations are divided into four phases, viz.:
Voice Acquisition
Signal Processing
Pitch Determination
Gender Identification
Fig 1: Modules of the proposed system (Voice Frequency Detection System: Voice Acquisition, Signal Processing, Pitch Determination, Gender Identification)
An in-depth representation of the operations and algorithms that constitute the four phases is given in the architecture diagram below.
To perform voice analysis, the voice of the speaker must first be recorded using a digital recording device. Following that, the audio must be modified to suit the algorithms, i.e. converted from its raw encoded form into a numeric signal representation, since the frequency cannot be determined directly from the encoded digital audio using the chosen algorithms.
To perform operations on the audio, important features such as the number of frames, the sampling frequency, etc. need to be extracted. NumPy and SciPy provide methods to extract wave file features.
Once the features have been extracted, the frequency of the voice is determined using four algorithms, namely FFT, Cepstral Analysis, Autocorrelation and MFCC.
Finally, the frequency values are converted into Hz for tabulation purposes.
Fig 2: System Architecture (modules: Voice Acquisition, Audio Normalization, Wave File Generation, Memory, Signal Feature Extraction, Array Conversion, pitch estimation via MFCC, Cepstral Analysis, FFT and Autocorrelation, Hz Conversion, Result Generation)
Voice Acquisition
The voice of the test subject is received through a microphone as an analog signal. The recording hardware converts the analog signal into a digital signal. This is achieved by implementing a script named Recorder. The Recorder module reads a single word or multiple words from the microphone and returns the data as an array of signed shorts.
The quality of the audio is improved by trimming out the silent parts of the recorded voice. To prevent the recorded audio from having a low volume, normalization is performed, which evens the volume out across the audio file. A short duration of silence is padded to the beginning and ending of the audio file to improve playback when opened in a media player. The audio is given a fixed sampling frequency in Hz and stored in secondary memory in the .wav format. This format is chosen to save the audio file because .wav files are a raw bitstream representation of the audio signal in digital form, which makes it much easier to perform the necessary mathematical and analytical operations on the audio file.
Signal Processing
Signal Processing forms the central part of the operations performed. In this phase, the .wav file is retrieved from storage and opened as a wave object.
A wave object is an object that acts as a data type for storing audio signal arrays. It is possible to open a .wav file as a numeric array, but a wave object has predefined methods and properties that make processing of the audio file easier.
The properties required for signal processing are obtained from the .wav file using wave methods. They are:
nframes: Obtained using the method wave.getnframes(), which returns the number of audio frames present in the audio file.
nchannels: Obtained using the method wave.getnchannels(), which returns the number of audio channels in the audio file. This method returns 1 for mono audio and 2 for stereo audio. The number of channels must be determined because the values stored in the wave array change depending on the number of channels in the audio file. Stereo recordings require two channels throughout the recording and processing phases; opening a stereo audio file as mono will result in incorrect values.
sampling_frequency: Obtained using the method wave.getframerate(), which returns the sampling frequency of the audio file. This will be the value in Hz specified in the Voice Acquisition phase.
After obtaining the required properties, the audio signal frames are read using the readframes() method and stored as a string of bytes.
The stored string of bytes is unpacked as a struct object in which the number of elements is the product of the number of channels and the number of frames, yielding one signed integer per sample. The struct object is then converted into an array format, which is necessary to perform mathematical operations.
Pitch Determination
The human voice is generated by the vocal folds present in the larynx. The pitch of the voice corresponds to the fundamental frequency F0 of the audio signal. The fundamental frequency, often referred to simply as the fundamental, is defined as the lowest frequency of a periodic waveform. To determine the pitch of the voice, frequency determination methods are implemented. The various algorithms to determine frequency from an audio signal are:
i. Fast Fourier Transform (FFT)
ii. Autocorrelation
iii. Cepstral Analysis
iv. Mel Frequency Cepstral Coefficient (MFCC)
Fast Fourier Transform (FFT)
An FFT computes the DFT and produces exactly the same result as evaluating the DFT definition directly; the most important difference is that an FFT is much faster.
Let x0, ..., xN-1 be complex numbers. The DFT is defined by the formula
Xk = sum from n = 0 to N-1 of xn * e^(-2*pi*i*k*n/N), for k = 0, ..., N-1.
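FFT-based pitch estimation with the parabolic interpolation mentioned in the introduction can be sketched as follows; the Hann window and the function name are our choices, not the paper's.

```python
import numpy as np

def fft_pitch(signal, fs):
    """Estimate pitch from the interpolated peak of the FFT magnitude spectrum."""
    windowed = signal * np.hanning(len(signal))   # window to reduce spectral leakage
    mag = np.abs(np.fft.rfft(windowed))
    k = int(np.argmax(mag[1:]) + 1)               # strongest bin, skipping DC
    # Parabolic interpolation over the peak and its neighbours gives a truer peak
    if 0 < k < len(mag) - 1:
        a, b, c = np.log(mag[k - 1:k + 2] + 1e-12)
        k = k + 0.5 * (a - c) / (a - 2 * b + c)
    return k * fs / len(signal)                   # bin index -> Hz

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220.3 * t)
print(fft_pitch(tone, fs))   # close to 220.3 Hz
```

Note that this picks the strongest spectral peak, which is exactly why the method mis-estimates the pitch when a harmonic is stronger than the fundamental.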
Autocorrelation
In signal processing, the autocorrelation is often used without normalization, that is, without subtracting the mean and dividing by the variance. When the autocorrelation function is normalized by the mean and variance, it is sometimes referred to as the autocorrelation coefficient [7].
Given a signal f(t), the continuous autocorrelation is most often defined as the continuous cross-correlation integral of f(t) with itself at lag tau:
R(tau) = integral over t of f(t + tau) * conj(f(t)) dt.
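A discrete version of this estimator, restricted to lags in a plausible voice range, can be sketched as below; the 50-400 Hz search window and the function name are assumed for illustration. The example deliberately omits the 100 Hz fundamental to show the method's strength with a missing fundamental.

```python
import numpy as np

def autocorr_pitch(signal, fs, fmin=50, fmax=400):
    """Estimate pitch from the strongest autocorrelation peak in the voice range."""
    signal = signal - np.mean(signal)
    corr = np.correlate(signal, signal, mode='full')
    corr = corr[len(corr) // 2:]              # keep lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag window for fmin..fmax Hz
    lag = lo + np.argmax(corr[lo:hi])         # best period in samples
    return fs / lag

fs = 8000
t = np.arange(fs) / fs
# Harmonics at 200, 300 and 400 Hz with the 100 Hz fundamental itself missing
mixture = sum(np.sin(2 * np.pi * f * t) for f in (200, 300, 400))
print(autocorr_pitch(mixture, fs))   # 100.0, the missing fundamental
```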
Mel Frequency Cepstral Coefficient (MFCC)
The calculation of the MFCC includes the following steps [8]:
The discrete Fourier transform (DFT) transforms the windowed speech segment into the frequency domain and the short-term power spectrum P(f) is obtained.
The spectrum P(f) is warped along its frequency axis f (in hertz) into the mel-frequency axis as P(M), where M is the mel frequency, using the equation below. This approximately reflects the human ear's perception.
M(f) = 2595 * log10(1 + f / 700)
The resulting warped power spectrum P(M) is then convolved with triangular band-pass filters into theta(M). The convolution with the relatively broad critical-band masking curves psi(M) significantly reduces the spectral resolution of theta(M) in comparison with the original P(f), which allows for the downsampling of theta(M). The discrete convolution of psi(M) with theta(M) yields samples of the critical-band power spectrum.
The MFCC is then computed as the discrete cosine transform of the logarithm of these critical-band energies.
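The mel warping and the filterbank, log and DCT steps above can be sketched as a toy pipeline. The filter count, coefficient count and all function names here are our assumptions, and the filterbank is a simplified triangular one rather than a production implementation.

```python
import numpy as np

def hz_to_mel(f):
    """Warp a frequency in Hz onto the mel scale: M(f) = 2595*log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse warping, mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spectrum, fs, n_filters=20, n_coeffs=12):
    """Toy MFCC: triangular mel filterbank energies -> log -> DCT-II."""
    n_bins = len(power_spectrum)
    freqs = np.linspace(0, fs / 2, n_bins)      # centre frequency of each bin
    # Filter edges equally spaced on the mel scale, mapped back to Hz
    edges = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        # Triangular filter rising lo -> mid and falling mid -> hi
        tri = np.maximum(0, np.minimum((freqs - lo) / (mid - lo + 1e-9),
                                       (hi - freqs) / (hi - mid + 1e-9)))
        energies[i] = np.dot(tri, power_spectrum) + 1e-12
    logs = np.log(energies)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(logs * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                     for k in range(n_coeffs)])
```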
Gender Determination
The above four algorithms are applied to the waveform generated for each word utterance individually, and the frequencies determined by the algorithms are tabulated. The average is taken over the frequencies of the words for each speaker. The pitch of the human voice falls within a known range of fundamental frequencies, with typical female voices higher than typical male voices [9]. The gender of the user is determined using the frequency values obtained.
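The final classification step can be sketched as below. The 160 Hz decision boundary and all numeric estimates in the example are illustrative assumptions, not values taken from the paper.

```python
def classify_gender(frequencies, threshold=160.0):
    """Average the per-algorithm pitch estimates and compare to a threshold.

    The 160 Hz boundary is an assumed illustrative value: typical male
    fundamentals sit below it and typical female fundamentals above it.
    """
    avg = sum(frequencies) / len(frequencies)
    return 'female' if avg > threshold else 'male'

# Hypothetical estimates from FFT, Cepstral Analysis, Autocorrelation and MFCC
print(classify_gender([128.4, 119.7, 123.1, 131.0]))  # male
print(classify_gender([211.2, 198.5, 204.9, 207.3]))  # female
```

Averaging over all four estimators is the point of the paper: a single outlier estimate (e.g. an FFT peak on a strong harmonic) is damped by the other three before the threshold comparison.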
III. Experimentation
The voices of several speakers of various ages and both genders were recorded and analyzed. Each speaker was recorded saying four words: "Hello", "Bye", "Vehicle" and "Help". Out of the various frequency values recorded, the following voices were chosen and the results obtained were tabulated.
Frequencies for each male test subject
Table 1: Frequencies for each male voice test subject (HARRY, GEORGE and JOHN), giving for each spoken word ("Hello", "Bye", "Vehicle", "Help") the frequency in Hz estimated by FFT, Cepstral Analysis, Autocorrelation and MFCC. [Numeric values not legible in the source.]
The above table represents the frequency values determined for each spoken word by the chosen test subjects. The frequency value in Hz obtained using each method is tabulated under its respective field.
Taking the case of HARRY (MALE), four separate frequency estimates were generated for the utterance of the word "Bye": one each from FFT, Cepstral Analysis, Autocorrelation and MFCC.
The mean of the results generated by each method is taken and represented in the table below for a better understanding of the values that a single detection algorithm provides.
Table 2: Average male frequencies for each method, giving for each speaker (HARRY, GEORGE and JOHN) the mean frequency in Hz estimated by FFT, Cepstral Analysis, Autocorrelation and MFCC. [Numeric values not legible in the source.]
Fig 3: Bar graph representation of each male test subject and frequency detection method
The above bar graph is generated using Table 2 as its data source. From the graph we can see that in the case of HARRY (MALE), the results obtained did not differ much, while in the case of GEORGE (MALE), the results show considerable discrepancies.
Frequencies for each female test subject
The process followed for male voice analysis was repeated for the female test subjects. For tabulation, the voices of women whose pitch was considerably lower were chosen, to showcase the ability of the algorithms to accurately determine the higher frequency of female voices when compared to men.
Table 3: Frequencies for each female voice test subject (ALICE, JENNA and ANNA), giving for each spoken word ("Hello", "Bye", "Vehicle", "Help") the frequency in Hz estimated by FFT, Cepstral Analysis, Autocorrelation and MFCC. [Numeric values not legible in the source.]
Similar to Table 1, the frequency values for each word were tabulated under their respective fields. Taking the example of JENNA (FEMALE), four estimates were obtained for each word, one each from FFT, Cepstral Analysis, Autocorrelation and MFCC.
This test case shows that the different algorithms are able to accurately determine the frequency of the voice.
Table 4: Average female frequencies for each method, giving for each speaker (ALICE, JENNA and ANNA) the mean frequency in Hz estimated by FFT, Cepstral Analysis, Autocorrelation and MFCC. [Numeric values not legible in the source.]
Similar to Table 2, the means of the results of the different frequency detection algorithms are tabulated for a better understanding of the values.
Fig 4: Bar graph representation of each female test subject and frequency detection method
The bar graph, using Table 4 as its data source, shows the disadvantage of using Autocorrelation in certain cases, as with ANNA (FEMALE).
From the above tables and graphs, we can see that each method applied is slightly inconsistent in determining the frequency of the speaker. This is because many background environmental variables affect the different methods.
The values reported by FFT tend to differ from the values obtained using the other methods. This is because, as stated before, the harmonics of the background noise can be stronger than the voice itself, which leads to slightly incorrect values. But on average, all the methods determine the frequency to lie in the range established by previous research in this field.
Therefore, when the average of the different methods is taken into consideration, the frequency values obtained are suitable for the determination of the gender of the speaker.
IV. Conclusion
There are various applications of the knowledge of a human being's gender in computer science. The different characteristics that separate male and female human beings can aid research into the aforementioned applications. But the most obvious characteristic, i.e. the voice, proves to be one of the most complex features of a human being to analyze, simply because of the sheer number of environmental variables that accompany it. This study was conducted to show that it is not enough to determine the frequency/pitch of a speaker using only one algorithm, due to the discrepancies in the values. The precision and efficiency of gender determination from a speaker's voice can be improved by using multiple pitch estimation algorithms. Also, to support future advancements in the field of Artificial Intelligence, the application to determine frequency was built using Python, which is a highly flexible language.
V. References
[1] Lawrence R. and Michael J. Cheng, "A Comparative Performance Study of Several Pitch Detection Algorithms", IEEE Transactions on Acoustics, Speech, and Signal Processing.
[2] Bageshree V. S. and Ashish R. P., "Extraction of Pitch and Formants and its Analysis to identify different emotional states of a person", IJCSI International Journal of Computer Science Issues.
[3] Li Tan and Montri K., "Pitch Detection Algorithm: Autocorrelation Method and AMDF".
[4] Vibha T., "MFCC and its applications in speaker recognition", International Journal on Emerging Technologies.
[5] K. Ravi Kumar, V. Ambika and K. Suri Babu, "Emotion Identification From Continuous Speech Using Cepstral Analysis", International Journal of Engineering Research and Applications (IJERA).
[6] G. Cloarec, D. Jouvet and J. Monné, "Analysis of the Modeling of Pitch and Voicing Parameters for Speaker-Independent Speech Recognition Systems", ITRW on Speech Recognition and Intrinsic Variations (SRIV), Toulouse, France.
[7] Dunn, Patrick F., Measurement and Data Analysis for Engineering and Science. New York: McGraw-Hill.
[8] Fang Zheng (郑方), Guoliang Zhang (张国亮) and Zhanjiang Song (宋战江), "Comparison of Different Implementations of MFCC", J. Computer Science & Technology, Center of Speech Technology, State Key Lab of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing.
[9] Hartmut Traunmüller and Anders Eriksson, "The frequency range of the voice fundamental in the speech of male and female adults", Institutionen för lingvistik, Stockholms universitet, Stockholm, Sweden.