An Interpretable Theory-based Deep Learning Architecture for Music Emotion
Ms. Hortense Fong
Ph.D. Candidate in Quantitative Marketing
Yale School of Management
Yale University
Music is used extensively to evoke emotion throughout the customer journey. This paper develops a theory-based, interpretable deep learning convolutional neural network (CNN) classifier, MusicEmoCNN, to predict the dynamically varying emotional response to music. To develop a theory-based CNN, we first transform the raw music data into a mel spectrogram, a format that accounts for human auditory response, which serves as the input to the CNN. Next, we design and construct novel CNN filters for higher-order music features that are based on the physics of sound waves and associated with perceptual features of music, such as consonance and dissonance, which are known to affect emotion. The key advantage of our theory-based filters is that we can connect how the predicted emotional response (valence and arousal) is related to human-interpretable features of the music. Our model outperforms traditional machine learning models and performs comparably to state-of-the-art black-box deep learning CNN models. Our approach of incorporating theory into the design of convolution filters can be applied in settings beyond music. Finally, we use our model in an application involving digital advertising. Motivated by YouTube's mid-roll advertising, we use the model's predictions to identify optimal emotion-based ad insertion positions in videos. We exogenously place ads at different times within content videos and find that ads placed in emotionally similar contexts are more memorable, as measured by higher brand recall rates.
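To make the two modeling steps in the abstract concrete, the sketch below illustrates (i) converting raw audio into a log-scaled mel spectrogram and (ii) applying a hand-designed, non-trainable convolution filter that responds to a fixed frequency-interval pattern, a crude stand-in for the kind of physics-motivated consonance filter described above. This is a minimal illustration using librosa and PyTorch; the file name, parameter values (n_mels, hop_length, interval offset), and the filter construction are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only: parameter values and the filter design are assumptions,
# not the MusicEmoCNN implementation described in the paper.
import numpy as np
import librosa
import torch
import torch.nn.functional as F


def mel_spectrogram(path, sr=22050, n_mels=128, hop_length=512):
    """Load audio and convert it to a log-scaled mel spectrogram (perceptual frequency axis)."""
    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    return librosa.power_to_db(S, ref=np.max)  # shape: (n_mels, n_frames)


def fixed_interval_filter(offset=7):
    """A hand-designed (non-trainable) 2D kernel that activates when energy appears in two
    frequency bins `offset` apart at the same time frame, mimicking a theory-based
    interval detector rather than a learned filter."""
    kernel = torch.zeros(1, 1, offset + 1, 1)  # (out_channels, in_channels, freq, time)
    kernel[0, 0, 0, 0] = 1.0
    kernel[0, 0, offset, 0] = 1.0
    return kernel


if __name__ == "__main__":
    S_db = mel_spectrogram("example.wav")                      # hypothetical input file
    x = torch.tensor(S_db, dtype=torch.float32)[None, None]    # (batch, channel, freq, time)
    response = F.conv2d(x, fixed_interval_filter())            # activation map for the interval pattern
    print(response.shape)
```

Because the filter weights are fixed by theory rather than learned, each activation map can be read directly as the strength of a named perceptual feature over time, which is what makes the downstream valence and arousal predictions interpretable.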