Abstract
|
Speech emotion recognition (SER) is a challenging and fundamental task: an SER system identifies the emotion conveyed in an audio signal. Machine learning techniques have recently shown promising results in this field. From a machine learning viewpoint, SER is a classification problem in which an input audio sample must be assigned to one of a small set of predefined emotions. Because Mel-frequency cepstral coefficients (MFCCs) capture the spectral components that shape human auditory perception, this thesis uses MFCC features as the main input for emotion classification. After feature extraction, a genetic algorithm selects features based on their information entropy. The selected features are then classified by an 8-layer convolutional neural network with three fully connected layers. The network reached an accuracy of 80.1% with 65 MFCC coefficients and 79.6% with 39 MFCC coefficients.
|
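The entropy-based feature selection step can be illustrated with a minimal sketch. The abstract does not give the fitness function, the entropy estimator, or the genetic operators, so everything below is an assumption made for illustration: the histogram entropy estimate, the per-feature `penalty`, truncation selection, one-point crossover, bit-flip mutation, and the names `histogram_entropy`, `fitness`, and `select_features` are all hypothetical and are not the thesis's actual method. Each entry of `columns` stands for one MFCC coefficient observed over many samples.

```python
import math
import random


def histogram_entropy(values, bins=8):
    """Shannon entropy (in bits) of one feature column, estimated from a histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def fitness(mask, entropies, penalty):
    """Assumed objective: entropy of the kept features minus a fixed cost per feature."""
    return sum(e - penalty for e, keep in zip(entropies, mask) if keep)


def select_features(columns, penalty=1.0, pop_size=30, generations=60,
                    p_mut=0.02, seed=0):
    """Evolve a boolean mask over features; needs at least two feature columns."""
    rng = random.Random(seed)
    entropies = [histogram_entropy(col) for col in columns]
    n = len(columns)

    def score(mask):
        return fitness(mask, entropies, penalty)

    # Each individual is a mask: True means "keep this feature".
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection, best masks survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=score)
    return [i for i, keep in enumerate(best) if keep], entropies
```

In a full pipeline, the returned indices would pick which MFCC columns are passed to the classifier. In this sketch, `penalty` acts as an entropy threshold in bits, so raising it yields a smaller feature set.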