5-Year Impact Factor: 1.53
Pages: 34-40
Abstract:
Understanding human emotions from speech signals has applications in virtual assistants, mental health analysis, and human-computer interaction. This paper presents a Long Short-Term Memory (LSTM) network-based approach to speech emotion recognition using Mel-frequency cepstral coefficients (MFCCs) as audio features. We preprocess audio recordings from the RAVDESS and EMO-DB datasets by extracting 13-dimensional MFCC vectors, energy coefficients, and delta features. LSTM models, which can capture temporal dependencies, are trained to classify utterances into six emotion categories: happiness, sadness, anger, fear, disgust, and neutral. We compare our LSTM model with traditional classifiers such as SVMs and random forests, observing a 7–10% improvement in accuracy across datasets. On RAVDESS, our best model achieves 81.4% accuracy, outperforming CNN- and GRU-based baselines. We conduct ablation studies on input window size and recurrent layer depth to analyze their influence on performance.
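The delta features mentioned in the abstract can be illustrated with a short sketch. The abstract does not specify the exact implementation, so the code below is a minimal, pure-Python version of the standard regression formula for first-order deltas over a sequence of MFCC frames; the frame values and the window size `N=2` are illustrative assumptions, and a real pipeline would first extract 13-dimensional MFCCs from audio (e.g. with a library such as librosa).

```python
# Sketch of first-order delta-feature computation over MFCC frames.
# Hypothetical example values; not the authors' actual pipeline.

def delta(frames, N=2):
    """Delta features via the standard regression formula:
    d_t = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2).
    Indices past the sequence edges are clamped to the first/last frame."""
    T = len(frames)          # number of time frames
    D = len(frames[0])       # feature dimension (13 for MFCCs in the paper)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(D):
            num = 0.0
            for n in range(1, N + 1):
                c_next = frames[min(t + n, T - 1)][d]
                c_prev = frames[max(t - n, 0)][d]
                num += n * (c_next - c_prev)
            row.append(num / denom)
        out.append(row)
    return out

# Toy example: 4 frames of 2-dimensional "MFCCs"; the first coefficient
# rises linearly, so its deltas are positive; the second is constant,
# so its deltas are zero.
mfcc = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
deltas = delta(mfcc, N=2)
```

In practice the delta (and delta-delta) vectors are concatenated with the static MFCCs frame by frame, giving the time-ordered feature sequence that the LSTM consumes.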