Temporal Multimodal Multitask Attention for Affective State Estimation in a Stressful Environment

Bria, Alessandro (Supervision)
2025-01-01

Abstract

Emotional stress significantly impacts mental and physical health, motivating the need for computational methods that can model subtle affective states. This study introduces a novel multimodal multitask architecture for the simultaneous regression of arousal and valence, two key emotional dimensions correlated with emotional stress. Utilizing the ULM-TSST dataset, the proposed model integrates video, audio, text, and physiological data through a combination of LSTM and Transformer Encoder networks, employing class tokens for task-specific representations. Experimental results demonstrate the model's effectiveness, achieving an average Concordance Correlation Coefficient (CCC) of 60% for valence and 61% for arousal, outperforming existing approaches by a 4% CCC margin. Ablation studies highlight the importance of each modality, confirming that the best performance is achieved only when all modalities are included. Additionally, comparative analysis between the multitask and single-task versions of the architecture confirms that the multitask approach outperforms single-task models in both arousal and valence prediction. This improvement underscores the benefits of shared representations and joint learning of related affective dimensions within a unified framework. The code for this project is publicly available at https://github.com/cosbidev/Temporal-Multimodal-Multitask-Attention.
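For reference, the Concordance Correlation Coefficient (CCC) used as the evaluation metric above can be computed as shown below. This is a minimal sketch assuming PyTorch; it is a standard definition of CCC and is not taken from the released repository.

```python
import torch

def concordance_cc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient between two 1-D sequences."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    return 2.0 * covariance / (pred_var + target_var + (pred_mean - target_mean) ** 2)
```

The abstract also describes per-modality temporal encoding with LSTMs, a shared Transformer Encoder, and task-specific class tokens for the joint arousal/valence regression. The sketch below illustrates that general idea only; the layer sizes, fusion strategy, and regression heads are placeholders chosen for illustration and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class MultitaskClassTokenEncoder(nn.Module):
    """Illustrative sketch: one LSTM per modality, a shared Transformer
    encoder, and two learnable class tokens whose encoded states feed the
    arousal and valence regression heads. All dimensions are placeholders."""

    def __init__(self, modality_dims, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(dim, d_model, batch_first=True) for dim in modality_dims
        )
        self.arousal_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.valence_token = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.arousal_head = nn.Linear(d_model, 1)
        self.valence_head = nn.Linear(d_model, 1)

    def forward(self, modality_inputs):
        # modality_inputs: list of (batch, time, feat_dim) tensors, one per modality
        tokens = [lstm(x)[0] for lstm, x in zip(self.lstms, modality_inputs)]
        seq = torch.cat(tokens, dim=1)  # concatenate modality sequences along time
        b = seq.size(0)
        cls = torch.cat([self.arousal_token, self.valence_token], dim=1).expand(b, -1, -1)
        encoded = self.encoder(torch.cat([cls, seq], dim=1))  # prepend the two class tokens
        return self.arousal_head(encoded[:, 0]), self.valence_head(encoded[:, 1])
```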
Files for this product:
File: 2025 - Temporal Multimodal Multitask Attention for Affective State Estimation in a Stressful Environment.pdf
Access: Open access
Type: Pre-print document
License: Publisher copyright
Size: 1.01 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11580/118405
Citations
  • Scopus: 0