What makes the difference? An Empirical Comparison of Fusion Strategies for Multimodal Language Analysis

Published in Information Fusion, 2020

Recommended citation: Dimitris Gkoumas, Qiuchi Li, Christina Lioma and Dawei Song. (2021). "What makes the difference? An Empirical Comparison of Fusion Strategies for Multimodal Language Analysis." In Information Fusion. https://qiuchili.github.io/files/if2020-2.pdf

Understanding human language is an emerging interdisciplinary field that brings together artificial intelligence, natural language processing, and cognitive science. It goes beyond the linguistic modality by effectively combining non-verbal behaviour (i.e., visual and acoustic cues), which is crucial for inferring speaker intent. As the area is growing rapidly, a range of models for multimodal language analysis has been introduced within the last two years. In this paper, we present a large-scale empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches to find out which of their aspects can be effectively used to solve the problem of multimodal language analysis. An important feature of our study is the critical and experimental analysis of the SOTA approaches. In particular, we replicate diverse complex neural networks that utilize attention, memory, and recurrent components. We propose a methodology to investigate both their effectiveness and efficiency in two multimodal tasks: a) video sentiment analysis and b) emotion recognition. We evaluate all approaches on three benchmark corpora, namely, a) Multimodal Opinion-level Sentiment Intensity (MOSI), b) Multimodal Opinion Sentiment and Emotion Intensity (MOSEI), which is the largest available dataset for video sentiment analysis, and c) Interactive Emotional Dyadic Motion Capture (IEMOCAP). Comprehensive experiments show that attention mechanisms are the most effective components for modelling interactions across different modalities. In addition, using the linguistic modality as a pivot for the nonverbal modalities, incorporating long-range crossmodal interactions across multimodal sequences, and integrating modality context are also among the most effective aspects for human multimodal affect recognition tasks.

Download paper here
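To make the main finding concrete, the sketch below shows one way attention-based crossmodal fusion with language as the pivot modality can look in code. This is not the architecture of any of the eleven compared models; the feature dimensionalities, the single attention layer per modality pair, and the pooling and classifier choices are assumptions made purely for illustration.

```python
# Minimal sketch of attention-based crossmodal fusion with language as the pivot modality.
# Not the paper's architecture: the dimensions, the single MultiheadAttention layer per
# modality pair, and the pooling/classifier choices are illustrative assumptions.
import torch
import torch.nn as nn


class CrossmodalAttentionFusion(nn.Module):
    def __init__(self, d_lang=300, d_acoustic=74, d_visual=35, d_model=64, n_heads=4):
        super().__init__()
        # Project each modality sequence into a shared space.
        self.proj_l = nn.Linear(d_lang, d_model)
        self.proj_a = nn.Linear(d_acoustic, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        # Language queries attend over the nonverbal sequences, so every language
        # time step can interact with every acoustic/visual time step (long-range
        # crossmodal interactions).
        self.attn_la = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_lv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(3 * d_model, 1)  # e.g. a sentiment intensity score

    def forward(self, lang, acoustic, visual):
        # lang: (B, T_l, d_lang), acoustic: (B, T_a, d_acoustic), visual: (B, T_v, d_visual)
        q = self.proj_l(lang)
        a = self.proj_a(acoustic)
        v = self.proj_v(visual)
        lang_to_acoustic, _ = self.attn_la(q, a, a)
        lang_to_visual, _ = self.attn_lv(q, v, v)
        # Fuse by concatenating the pivot with both attended summaries, then pool over time.
        fused = torch.cat([q, lang_to_acoustic, lang_to_visual], dim=-1).mean(dim=1)
        return self.classifier(fused)


# Example usage with random features (MOSI-like dimensionalities, assumed for the sketch).
model = CrossmodalAttentionFusion()
out = model(torch.randn(8, 50, 300), torch.randn(8, 50, 74), torch.randn(8, 50, 35))
print(out.shape)  # torch.Size([8, 1])
```

Here the linguistic sequence supplies the queries, so the nonverbal streams are summarized relative to what is being said, which is one way of realizing the pivot-modality and long-range interaction aspects highlighted by the comparison.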