International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 11, November 2016

Gender Identification Using A Speech Dependent Voice Frequency Estimation System

V.D. Ambeth Kumar, T. Vijaya Rajasekar, S. Malathi, P. Athiyaman and V. Kamalakannan
Department of Computer Science and Engineering, Panimalar Engineering College, Chennai, India
Department of Mathematics, Panimalar Engineering College, Chennai, India
Email: ambeth_in@yahoo.co.in

Abstract

The objective of this research paper is to design a speaker dependent system that determines the gender of the speaker using the pitch of the speaker's voice. A speaker dependent system is a system which can recognize speech from one particular speaker. The pitch of the speaker's voice is estimated by applying various pitch estimation techniques, namely FFT, Cepstral Analysis, Autocorrelation and MFCC, and the gender of the speaker is determined by classifying the estimated pitch against known frequency ranges. The proposed system can be used in implementations of AI technologies, the Internet of Things and various other future applications to detect the gender of the speaker with maximum possible efficiency and accuracy.

Keywords: FFT, Cepstral Analysis, Autocorrelation, MFCC, Python

I Introduction

For human beings, speech has proven to be the major medium of communication with other humans, animals and even machines. Every day, research is undertaken by experts of various sciences across the world in the field of speech processing to study the characteristics of the speaker. Of all the characteristics, gender is the most fundamental and obvious one. There are many possible applications where the gender of the speaker plays an important role. For example, an AI system that controls home appliances by receiving voice input from the user can modify its behaviour based on the needs that arise from the gender of the user. To determine the gender, we use the pitch of the speaker's voice as a classifier.

Lawrence R. Rabiner and Michael J. Cheng [1] defined a pitch detector as an important component of various speech processing systems, such as vocoder systems. A comparative study of various pitch detection algorithms they performed shows that accurate and reliable measurement of pitch from the frequency waveform is difficult. They classified pitch detection algorithms into three broad categories:

- Using time-domain properties of speech
- Using frequency-domain properties of speech
- Using both time- and frequency-domain properties of speech

The performance of a pitch detection algorithm is evaluated based on its speed, suitability, complexity and cost of implementation.

Brown and Puckette studied the implementation of the FFT in detail. The FFT is an algorithm that rapidly computes the Discrete Fourier Transform of a signal, which is the representation of the signal in the frequency domain. An FFT-based pitch estimator performs parabolic interpolation around the spectral peak to determine a truer peak, which provides better accuracy. But the main problem with using the FFT is that it fails when harmonics are stronger than the fundamental, which is common.
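To make the FFT approach concrete, the following is a minimal sketch of an FFT-based pitch estimate with parabolic peak interpolation. It is an illustration rather than the authors' code; the Hann window and the choice of NumPy routines are assumptions about one reasonable implementation.

```python
import numpy as np

def fft_pitch(signal, fs):
    """Estimate pitch from the largest FFT spectral peak, refined by
    parabolic interpolation (an illustrative sketch)."""
    windowed = signal * np.hanning(len(signal))   # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    k = int(np.argmax(spectrum[1:])) + 1          # skip the DC bin
    if 0 < k < len(spectrum) - 1:
        # Fit a parabola through the peak and its two neighbours to get
        # a sub-bin estimate of the true peak position.
        a, b, c = np.log(spectrum[k - 1:k + 2] + 1e-12)
        k = k + 0.5 * (a - c) / (a - 2 * b + c)
    # Note: this returns the strongest harmonic, which may not be the
    # fundamental when the fundamental is weak.
    return k * fs / len(windowed)                 # bin index -> Hz
```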
Li Tan and Montri K. [3] studied the implementation of the autocorrelation method in detail. Autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself at different points in time. It is the best method for finding the true fundamental of any repetitive wave, even with a faint or missing fundamental. But it has several disadvantages: it produces inaccurate results if the waveform isn't perfectly repeating, as with inharmonic musical instruments, and, unlike the FFT, this implementation has trouble finding the true peak.

Vibha Tiwari [4] studied the implementation of the MFCC algorithm in the domain of speech processing. The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel frequencies are obtained by computing the FFT of the signal, mapping the powers onto the mel scale by applying a filterbank (i.e. a collection of filters), taking the logarithm of the mel-scale powers and finally converting the mean value to hertz.

K. Ravi Kumar, V. Ambika and K. Suri Babu [5] studied the implementation of Cepstral Analysis for identification of emotion from speech. A cepstrum is the result of taking the Inverse Fourier Transform (IFT) of the logarithm of the estimated spectrum of a signal. The frequency is obtained by taking the FFT of the signal, determining the log of each value in the FFT and applying the inverse FFT to the resultant array. The frequency is then determined by dividing the sampling frequency by the index of the maximum value in the inverse FFT.
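A minimal sketch of the cepstral procedure just described, assuming the recording is already available as a NumPy array signal with sampling frequency fs; the restriction of the peak search to a 50-500 Hz voice range is an added assumption, not a detail from the paper.

```python
import numpy as np

def cepstral_pitch(signal, fs):
    """Pitch via cepstral analysis: FFT -> log magnitude -> inverse FFT,
    then divide fs by the quefrency index of the cepstral peak."""
    spectrum = np.fft.fft(signal * np.hanning(len(signal)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)    # offset avoids log(0)
    cepstrum = np.abs(np.fft.ifft(log_mag))
    # Search only quefrencies corresponding to 50-500 Hz (an assumption).
    lo, hi = int(fs / 500), int(fs / 50)
    peak = lo + int(np.argmax(cepstrum[lo:hi]))
    return fs / peak                              # quefrency index -> Hz
```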
This paper aims to study the implementation of the above four algorithms for gender determination using Python. Python is chosen because of its compatibility with various backend tools and its ease of implementation. In Python, the above algorithms are implemented using the NumPy and SciPy libraries, along with Matplotlib for signal waveform generation. The various mathematical methods provided by these libraries help in the determination of frequency from the input signal.

II Proposed System

As mentioned in the introduction, Python is used to create a script that performs a set of operations to determine the frequency of the speaker's voice. The operations are divided into four phases:

1. Voice Acquisition
2. Signal Processing
3. Pitch Determination
4. Gender Identification

Fig 1: Modules of the proposed system (Voice Acquisition -> Signal Processing -> Pitch Determination -> Gender Identification)

An in-depth representation of the operations and algorithms that constitute the four phases is given in the architecture diagram below. To perform voice analysis, the voice of the speaker must first be recorded using a digital recording device. Following that, the audio must be modified to suit the algorithms, i.e. decoded into a numeric sample array, since the frequency cannot be determined from the encoded digital audio using the chosen algorithms. To perform operations on the audio, important features such as the number of frames, the sampling frequency, etc. need to be extracted; NumPy and SciPy provide methods to extract wave file features. Once the features have been extracted, the frequency of the voice is determined using four algorithms, namely FFT, Cepstral Analysis, Autocorrelation and MFCC. Finally, the frequency values are converted into Hz for tabulation purposes.

Fig 2: System architecture (voice acquisition, wave file generation, audio normalization, signal feature extraction, array conversion, frequency estimation in memory via FFT, Cepstral Analysis, Autocorrelation and MFCC, Hz conversion, result generation)

Voice Acquisition

The voice of the test subject is received through a microphone as an analog signal, which is converted into a digital signal. This is achieved by implementing a script named Recorder. The Recorder module reads a single word or multiple words from the microphone and returns the data as an array of signed shorts. The quality of the audio is improved by trimming out the silent parts of the recorded voice. To prevent the recorded audio from having a low volume, normalization is performed, which averages the volume out across the audio file. A short duration of silence is padded to the beginning and end of the audio file to improve playback when opened in a media player. The audio is given a fixed sampling frequency and stored in secondary memory in the .wav format. This format is chosen because .wav files are a raw bitstream representation of the audio signal in digital form, which makes it much easier to perform the necessary mathematical and analytical operations on the audio file.

Signal Processing

Signal Processing forms the central part of the operations performed. In this phase, the .wav file is retrieved from storage and opened as a wave object, an object that acts as a data type for storing audio signal arrays. It is possible to open a .wav file directly as a numeric array, but a wave object has predefined methods and properties that make processing of the audio file easier. The properties required for signal processing are obtained from the .wav file using wave methods:

- nframes: obtained using the method wave.getnframes(), which returns the number of audio frames present in the audio file.
- nchannels: obtained using the method wave.getnchannels(), which returns the number of audio channels in the audio file: 1 for mono audio and 2 for stereo audio. The number of channels must be determined because the values stored in the wave array change depending on it. Stereo recordings require two channels throughout the recording and processing phases; opening a stereo audio file as mono will result in incorrect values.
- sampling_frequency: obtained using the method wave.getframerate(), which returns the sampling frequency of the audio file, as set in the Voice Acquisition phase.

After obtaining the required properties, the audio signal frames are read using the readframes() method and stored as a string of bytes. The stored string of bytes is unpacked as a struct object in which the number of elements is the product of the number of channels and the number of frames. The struct object is then converted into an array format, which is necessary to perform mathematical operations.
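A minimal sketch of these signal-processing steps, using Python's standard wave and struct modules together with NumPy; the file name and the assumption of 16-bit samples are illustrative, not taken from the paper.

```python
import struct
import wave

import numpy as np

# Open the stored recording as a wave object (file name is illustrative).
wav = wave.open("recording.wav", "rb")
nframes = wav.getnframes()        # number of audio frames
nchannels = wav.getnchannels()    # 1 = mono, 2 = stereo
fs = wav.getframerate()           # sampling frequency in Hz

# Read every frame as a byte string, then unpack it into signed shorts;
# the element count is nframes * nchannels, as described above.
raw = wav.readframes(nframes)
samples = struct.unpack("<%dh" % (nframes * nchannels), raw)
wav.close()

# Convert to a NumPy array for the mathematical operations that follow;
# for stereo input, keep only the first channel (a simplifying assumption).
signal = np.array(samples, dtype=np.float64)
if nchannels == 2:
    signal = signal[0::2]
```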
Pitch Determination

The human voice is generated by the vocal folds present in the larynx. The pitch of the voice corresponds to the fundamental frequency F0 of the audio signal. The fundamental frequency, often referred to simply as the fundamental, is defined as the lowest frequency of a periodic waveform. To determine the pitch of the voice, frequency determination methods are implemented. The algorithms used to determine frequency from an audio signal are:

i. Fast Fourier Transform (FFT)
ii. Autocorrelation
iii. Cepstral Analysis
iv. Mel Frequency Cepstral Coefficient (MFCC)

Fast Fourier Transform (FFT)

An FFT computes the DFT and produces exactly the same result as evaluating the DFT definition directly; the most important difference is that an FFT is much faster. Let $x_0, \ldots, x_{N-1}$ be complex numbers. The DFT is defined by the formula

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2 \pi k n / N}, \qquad k = 0, \ldots, N-1.$$

Autocorrelation

In signal processing, the definition of autocorrelation is often used without normalization, that is, without subtracting the mean and dividing by the variance; when the autocorrelation function is normalized by mean and variance, it is sometimes referred to as the autocorrelation coefficient [7]. Given a signal $f(t)$, the continuous autocorrelation is most often defined as the continuous cross-correlation integral of $f(t)$ with itself, at lag $\tau$:

$$R_{ff}(\tau) = \int_{-\infty}^{\infty} f(t + \tau)\, \overline{f(t)} \, dt.$$

Mel Frequency Cepstral Coefficient (MFCC)

The calculation of the MFCC includes the following steps [8]. The discrete Fourier transform (DFT) transforms the windowed speech segment into the frequency domain and the short-term power spectrum $P(f)$ is obtained. The spectrum $P(f)$ is warped along its frequency axis $f$ (in hertz) into the mel-frequency axis as $P(M)$, where $M$ is the mel frequency, using

$$M(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right).$$

This approximately reflects the pitch perception of the human ear. The resulting warped power spectrum $P(M)$ is then convolved with the triangular band-pass filter $\psi(M)$ into $\theta(M)$. The convolution with the relatively broad critical-band masking curves $\psi(M)$ significantly reduces the spectral resolution of $\theta(M)$ in comparison with the original $P(f)$, which allows for the downsampling of $\theta(M)$. The discrete convolution of $\psi(M)$ with $P(M)$ yields samples $\theta(M_k)$, $k = 1, \ldots, K$, of the critical-band power spectrum. The MFCC is then computed as the discrete cosine transform of the logarithm of these samples:

$$c_n = \sum_{k=1}^{K} \log\!\left(\theta(M_k)\right) \cos\!\left(\frac{\pi n (k - 0.5)}{K}\right), \qquad n = 1, 2, \ldots$$
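As a concrete illustration of this phase, the following is a minimal sketch of the autocorrelation method described above, assuming a NumPy array signal at sampling frequency fs; the 50-500 Hz lag bounds are an added assumption rather than a value from the paper.

```python
import numpy as np

def autocorrelation_pitch(signal, fs):
    """Pitch via autocorrelation: the lag of the strongest peak of the
    signal's correlation with itself gives the pitch period."""
    signal = signal - np.mean(signal)
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]        # keep non-negative lags only
    # Search lags corresponding to a 50-500 Hz voice range (an assumption).
    lo, hi = int(fs / 500), int(fs / 50)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return fs / lag                     # pitch period in samples -> Hz
```

As noted in the introduction, this estimator can misreport the pitch when the waveform is not perfectly repetitive.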
Gender Determination

The above four algorithms are applied to the waveform generated for each word utterance individually, and the frequencies determined by the algorithms are tabulated. The average is then taken over the frequencies of the words for all the speakers. The pitch of the human voice may range anywhere between roughly 85 Hz and 255 Hz [9]. The gender of the user is determined using the frequency values obtained.

III Experimentation

The voices of speakers of various ages and both genders were recorded and analyzed. Each of the speakers was recorded saying four words: "Hello", "Bye", "Vehicle" and "Help". Out of the various frequency values recorded, the following voices were chosen and the results obtained were tabulated.

Table 1: Frequencies (Hz) for each male voice test subject

SPEAKER          SPOKEN WORD   FFT   Cepstral   Autocorrelation   MFCC
HARRY (MALE)     Hello         …     …          …                 …
                 Bye           …     …          …                 …
                 Vehicle       …     …          …                 …
                 Help          …     …          …                 …
GEORGE (MALE)    Hello         …     …          …                 …
                 Bye           …     …          …                 …
                 Vehicle       …     …          …                 …
                 Help          …     …          …                 …
JOHN (MALE)      Hello         …     …          …                 …
                 Bye           …     …          …                 …
                 Vehicle       …     …          …                 …
                 Help          …     …          …                 …

The above table presents the frequency value determined for each word spoken by the chosen test subjects; the value in Hz obtained using each method is tabulated under its respective field. Taking the case of HARRY (MALE), the frequency results generated for the utterance of the word "Bye" were … Hz when FFT was used, … Hz when Cepstral Analysis was used, … Hz when Autocorrelation was used and … Hz when MFCC was used. The mean of the results generated for each method is taken and represented in the table below for a better understanding of the values that a single detection algorithm provides.

Table 2: Average male frequencies (Hz) for each method

SPEAKER          FFT   Cepstral   Autocorrelation   MFCC
HARRY (MALE)     …     …          …                 …
GEORGE (MALE)    …     …          …                 …
JOHN (MALE)      …     …          …                 …

Fig 3: Bar graph of each male test subject and frequency detection method (data from Table 2)

The above bar graph is generated using Table 2 as its data source. From the graph we can see that in the case of HARRY (MALE) the results obtained did not differ much, while in the case of GEORGE (MALE) the results show considerable discrepancies.

Frequencies for each female test subject

The process followed for male voice analysis was repeated for the female test subjects. For tabulation, the voices of women whose pitch was considerably lower were chosen, to showcase the ability of the algorithms to accurately determine the higher frequency of female voices compared to male voices.

Table 3: Frequencies (Hz) for each female voice test subject

SPEAKER          SPOKEN WORD   FFT   Cepstral   Autocorrelation   MFCC
ALICE (FEMALE)   Hello         …     …          …                 …
                 Bye           …     …          …                 …
                 Vehicle       …     …          …                 …
                 Help          …     …          …                 …
JENNA (FEMALE)   Hello         …     …          …                 …
                 Bye           …     …          …                 …
                 Vehicle       …     …          …                 …
                 Help          …     …          …                 …
ANNA (FEMALE)    Hello         …     …          …                 …
                 Bye           …     …          …                 …
                 Vehicle       …     …          …                 …
                 Help          …     …          …                 …

Similar to Table 1, the frequency values for each word are tabulated under their respective fields. Taking the example of JENNA (FEMALE), the results obtained were … Hz using FFT, … Hz using Cepstral Analysis, … Hz using Autocorrelation and … Hz using MFCC. This test case shows that the different algorithms are able to accurately determine the frequency of the voice.

Table 4: Average female frequencies (Hz) for each method

SPEAKER          FFT   Cepstral   Autocorrelation   MFCC
ALICE (FEMALE)   …     …          …                 …
JENNA (FEMALE)   …     …          …                 …
ANNA (FEMALE)    …     …          …                 …

Similar to Table 2, the means of the results of the different frequency detection algorithms are tabulated for a better understanding of the values.

Fig 4: Bar graph of each female test subject and frequency detection method (data from Table 4)

The bar graph, using Table 4 as its data source, shows the disadvantage of using Autocorrelation in certain cases, as with ANNA (FEMALE). From the above tables and graphs, we can see that each method applied is slightly inconsistent in determining the frequency of the speaker, because many background environmental variables affect the different methods. The values reported by FFT differ from the values obtained using the other methods because, as stated before, the harmonics of the background noise can be stronger than the voice itself, which leads to slightly incorrect values. But on average, all the methods determine the frequency to lie in the range established by previous research in this field [9]. Therefore, when the average of the different methods is taken into consideration, the frequency values obtained are suitable for the determination of the gender of the speaker.
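To illustrate how such a decision could be automated, the sketch below averages the estimates of the pitch methods sketched earlier in this paper and applies a threshold; the 165 Hz threshold and the reuse of the earlier hypothetical functions are assumptions, not details given by the authors.

```python
import numpy as np

def classify_gender(signal, fs, threshold_hz=165.0):
    """Average several pitch estimates and classify the speaker's gender
    by a threshold (165 Hz is an illustrative assumption near the gap
    between typical male and female fundamental-frequency ranges)."""
    estimates = [
        fft_pitch(signal, fs),              # sketched in Section I
        cepstral_pitch(signal, fs),         # sketched after Section I
        autocorrelation_pitch(signal, fs),  # sketched in Section II
    ]
    mean_pitch = float(np.mean(estimates))
    gender = "FEMALE" if mean_pitch >= threshold_hz else "MALE"
    return gender, mean_pitch
```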
IV Conclusion

There are various applications in computer science for information about a human being's gender, and the different characteristics that separate male and female human beings can aid research into such applications. But the most obvious characteristic, the voice, proves to be one of the most complex features of a human being to analyze, simply because of the sheer number of environmental variables that accompany it. This study was conducted to show that it is not enough to determine the frequency/pitch of a speaker using only one algorithm, owing to the discrepancies in the values obtained. The precision and efficiency of determining the gender of a speaker from their voice can be improved by using multiple pitch estimation algorithms. Also, to support future advancements in the field of Artificial Intelligence, the application to determine frequency was built using Python, which is a highly flexible language.

V References

[1] Lawrence R. Rabiner and Michael J. Cheng, "A Comparative Performance Study of Several Pitch Detection Algorithms", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 5, October 1976, pp. 399-418.

[2] Bageshree V. S. and Ashish R. P., "Extraction of Pitch and Formants and its Analysis to identify different emotional states of a person", IJCSI International Journal of Computer Science Issues, July 2012.

[3] Li Tan and Montri K., "Pitch Detection Algorithm: Autocorrelation Method and AMDF".

[4] Vibha T., "MFCC and its applications in speaker recognition", International Journal on Emerging Technologies, 1(1), 2010, pp. 19-22.

[5] K. Ravi Kumar, V. Ambika and K. Suri Babu, "Emotion Identification From Continuous Speech Using Cepstral Analysis", International Journal of Engineering Research and Applications (IJERA), September-October 2012.

[6] G. Cloarec, D. Jouvet and J. Monné, "Analysis Of The Modeling Of Pitch And Voicing Parameters For Speaker-Independent Speech Recognition Systems", ITRW on Speech Recognition and Intrinsic Variation (SRIV), Toulouse, France, May 2006.

[7] Dunn, Patrick F., Measurement and Data Analysis for Engineering and Science, New York: McGraw-Hill, 2005.

[8] Fang Zheng (郑方), Guoliang Zhang (张国亮) and Zhanjiang Song (宋战江), "Comparison of Different Implementations of MFCC", J. Computer Science & Technology, Center of Speech Technology, State Key Lab of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing.

[9] Hartmut Traunmüller and Anders Eriksson, "The frequency range of the voice fundamental in the speech of male and female adults", Institutionen för lingvistik, Stockholms universitet, Stockholm, Sweden.