Temporal Multimodal Multitask Attention for Affective State Estimation in a Stressful Environment
Bria, Alessandro (Supervision)
2025-01-01
Abstract
Emotional stress significantly impacts mental and physical health, motivating the need for computational methods that can model subtle affective states. This study introduces a novel multimodal multitask architecture for the simultaneous regression of arousal and valence, two key emotional dimensions correlated with emotional stress. Utilizing the ULM-TSST dataset, the proposed model integrates video, audio, text, and physiological data through a combination of LSTM and Transformer Encoder networks, employing class tokens for task-specific representations. Experimental results demonstrate the model's effectiveness, achieving an average Concordance Correlation Coefficient (CCC) of 60% for valence and 61% for arousal, outperforming existing approaches by a 4% CCC margin. Ablation studies highlight the importance of each modality, confirming that the best performance is achieved only when all modalities are included. Additionally, comparative analysis between the multitask and single-task versions of the architecture confirms that the multitask approach outperforms single-task models in both arousal and valence prediction. This improvement underscores the benefits of shared representations and joint learning of related affective dimensions within a unified framework. The code for this project is publicly available at https://github.com/cosbidev/Temporal-Multimodal-Multitask-Attention.
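
For reference, the Concordance Correlation Coefficient used as the evaluation metric follows Lin's standard definition. The sketch below is a minimal NumPy implementation of that definition; the function name is illustrative and is not taken from the project code.

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Lin's Concordance Correlation Coefficient.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    Ranges in [-1, 1]; 1 means perfect agreement in correlation, scale, and location.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    return 2.0 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)
```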
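
The architecture summarized in the abstract (per-modality LSTM encoders feeding a shared Transformer Encoder, with task-specific class tokens for arousal and valence) could be organized roughly as in the following PyTorch sketch. All class names, layer sizes, and the concatenation-based fusion are assumptions made for illustration, not the published implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class MultimodalMultitaskSketch(nn.Module):
    """Illustrative sketch only: one LSTM per modality, a shared Transformer
    encoder, and two learnable class tokens that yield the arousal and valence
    regression outputs. Dimensions and fusion details are placeholders."""

    def __init__(self, modality_dims, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # One LSTM encoder per modality (e.g. video, audio, text, physiological).
        self.encoders = nn.ModuleDict({
            name: nn.LSTM(dim, d_model, batch_first=True)
            for name, dim in modality_dims.items()
        })
        # Task-specific class tokens: index 0 -> arousal, index 1 -> valence.
        self.task_tokens = nn.Parameter(torch.randn(2, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.heads = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(2)])

    def forward(self, inputs):
        # inputs: dict of (batch, time, feat_dim) tensors, one per modality.
        tokens = []
        for name, x in inputs.items():
            out, _ = self.encoders[name](x)      # (batch, time, d_model)
            tokens.append(out)
        seq = torch.cat(tokens, dim=1)           # concatenate modality sequences
        batch = seq.size(0)
        cls = self.task_tokens.unsqueeze(0).expand(batch, -1, -1)
        fused = self.transformer(torch.cat([cls, seq], dim=1))
        # Regress each affective dimension from its own class token.
        arousal = self.heads[0](fused[:, 0])
        valence = self.heads[1](fused[:, 1])
        return arousal, valence

# Hypothetical usage with placeholder feature dimensions per modality:
# model = MultimodalMultitaskSketch({"video": 512, "audio": 128, "text": 768, "bio": 6})
```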
| File | Size | Format |
|---|---|---|
| 2025 - Temporal Multimodal Multitask Attention for Affective State Estimation in a Stressful Environment.pdf (open access; type: pre-print; license: publisher copyright) | 1.01 MB | Adobe PDF |