Multimodal AI · Computer Vision · Speech Processing · NLP · Robotics

Multimodal Emotion Recognition for Human-Robot Interaction

Designed a system that fuses facial expressions, speech patterns, and linguistic cues to recognize human emotions in real time, enabling intuitive human-robot interaction.

3 Modalities
7 Emotions
<100ms Real-time Inference
ROS Robot-Integrated

Why Multimodal Emotion Recognition?

For robots to interact naturally with humans, they need to understand not just what we say, but how we feel. Human emotions are expressed through multiple channels simultaneously -- facial expressions, vocal tone, and the words we choose. Relying on a single modality invites misinterpretation: a smile paired with a sarcastic tone, or a neutral face masking frustration that is audible in the voice.

Single-modality approaches consistently miss context. A vision-only system cannot detect sarcasm. An audio-only system fails when the environment is noisy. Text analysis alone misses non-verbal cues entirely. The solution is multimodal fusion -- combining all three channels to capture the full emotional picture, enabling robots to respond with genuine empathy and situational awareness.

End-to-End Multimodal Pipeline

Three parallel encoders extract features from each modality, which are then dynamically fused using an attention mechanism before classification and robot control.

Video + Audio + Text → Encode → Fuse → Classify → Robot
Visual Stream
🎥
Video Feed 30 FPS
Video Input

RGB camera captures facial expressions at 30 FPS. Frames preprocessed with face detection and alignment to normalize pose and lighting.

30 FPS · Face Detect
👀
Face Detect MTCNN
🧠
CNN Encoder ResNet-50
Visual Feature Extractor

Fine-tuned ResNet-50 extracts 2048-d facial expression features. Pre-trained on FER2013, fine-tuned on AffectNet.

ResNet-50 · 2048-d
Audio Stream
🎤
Audio 16 kHz
🎶
Mel Spec MFCC
Audio Features

128-bin mel-frequency spectrogram combined with MFCCs, pitch contour, and energy envelope for prosodic analysis.

128 bins · MFCC
🎧
Speech Enc wav2vec
Text Stream
💬
ASR Whisper
📝
NLP Encoder RoBERTa
NLP Encoder

RoBERTa-base fine-tuned on GoEmotions. Encodes semantic sentiment, emotional keywords, and contextual nuances.

RoBERTa · GoEmotions
Attention-Based Fusion
⚖️
Cross-Modal Attention Dynamic Weights
Attention Fusion

Cross-modal attention dynamically weights each modality based on signal quality. Degraded channels are automatically downweighted for robust predictions.

Dynamic · Cross-Modal
Temporal Modeling
🔁
Bi-LSTM Sequence
Bi-LSTM

Bidirectional LSTM captures emotional dynamics over time, modeling how emotions transition during conversation for context-aware predictions.

Bi-LSTM · Temporal
🎯
Classifier 7-Class
Emotion Classifier

7-class softmax: Happy, Sad, Angry, Fear, Surprise, Neutral, Disgust. Confidence scores enable threshold-based filtering.

7 Classes · Softmax
Robot Integration
🤖
ROS Control <100ms
Robot Control

ROS Action Server translates emotions into robot behaviors — gestures, facial displays, and conversational tone. End-to-end latency under 100ms.

ROS · <100ms

Processing Pipeline

A six-stage pipeline transforms raw multimodal inputs into emotionally aware robot responses.

Visual Feature Extraction

CNN-based facial expression analysis using a fine-tuned ResNet backbone. Face detection and alignment precede feature extraction for robust performance across poses and lighting conditions.
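
As a rough illustration of this stage, the sketch below chains an MTCNN face detector into a ResNet-50 backbone whose classification head has been removed, exposing the 2048-d feature vector. The facenet-pytorch dependency, stock ImageNet weights, and the helper name extract_face_features are assumptions made for the sketch; the deployed backbone is the FER2013/AffectNet fine-tuned model described above.

```python
import torch
from typing import Optional
from facenet_pytorch import MTCNN            # assumed face-detection dependency
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# MTCNN detects, crops, and aligns a 224x224 face region from each frame
mtcnn = MTCNN(image_size=224, post_process=False, device=device)

# ResNet-50 with the classification head removed exposes the 2048-d feature vector
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone = backbone.eval().to(device)

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def extract_face_features(frame: Image.Image) -> Optional[torch.Tensor]:
    """Return a 2048-d facial-expression feature for one frame, or None if no face is found."""
    face = mtcnn(frame)                      # (3, 224, 224) float tensor in [0, 255]
    if face is None:
        return None
    face = normalize(face / 255.0)           # ImageNet normalization expected by the ResNet
    with torch.no_grad():
        return backbone(face.unsqueeze(0).to(device)).squeeze(0)
```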

Audio Processing

Speech pattern and prosody analysis extracting mel-frequency cepstral coefficients, pitch contours, and energy envelopes to capture tonal emotional cues beyond lexical content.
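
A minimal librosa sketch of the feature set described here, assuming 16 kHz mono input; the MFCC count, pYIN pitch range, and frame defaults are illustrative choices rather than the project's exact configuration.

```python
import librosa

def prosodic_features(wav_path: str, sr: int = 16000) -> dict:
    """Extract log-mel spectrogram, MFCCs, pitch contour, and energy envelope for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)

    # 128-bin log-mel spectrogram feeding the learned speech encoder
    log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))

    # MFCCs summarize the spectral envelope (voice quality / timbre)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Fundamental-frequency (pitch) contour via pYIN; unvoiced frames come back as NaN
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

    # Short-time energy envelope (RMS per frame)
    rms = librosa.feature.rms(y=y)[0]

    return {"log_mel": log_mel, "mfcc": mfcc, "pitch": f0, "energy": rms}
```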

Linguistic Analysis

NLP encoder processes transcribed speech for sentiment, emotional keywords, and contextual cues using transformer-based embeddings that capture nuanced linguistic emotional expression.
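
A hedged sketch of the text branch using the Hugging Face transformers API, with stock roberta-base weights standing in for the GoEmotions fine-tuned checkpoint; the mean-pooling step and 128-token limit are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# roberta-base stands in here for the GoEmotions fine-tuned checkpoint described above
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_encoder = AutoModel.from_pretrained("roberta-base").eval()

def encode_transcript(text: str) -> torch.Tensor:
    """Return a 768-d embedding for one ASR transcript."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = text_encoder(**inputs).last_hidden_state        # (1, seq_len, 768)
    # Mean-pool token embeddings, ignoring padding positions
    mask = inputs["attention_mask"].unsqueeze(-1)                 # (1, seq_len, 1)
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)
```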

Attention Fusion

Cross-modal attention mechanism dynamically weights modalities based on signal quality and informativeness -- upweighting audio when visual is occluded, or text when audio is noisy.
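
The sketch below shows one minimal way to realize this idea in PyTorch: simple additive attention over one token per modality, so a weak or noisy channel receives a small weight and the fused vector leans on the others. Dimensions and depth are illustrative, not the production configuration.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Learn per-modality attention weights instead of naively concatenating features."""
    def __init__(self, dims=(2048, 768, 768), fused_dim=512):
        super().__init__()
        # Project visual / audio / text features into a shared embedding space
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        # A small scorer turns each projected modality into an attention logit
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, visual, audio, text):
        # One token per modality: (batch, 3, fused_dim)
        tokens = torch.stack(
            [p(x) for p, x in zip(self.proj, (visual, audio, text))], dim=1)
        weights = torch.softmax(self.score(torch.tanh(tokens)), dim=1)  # (batch, 3, 1)
        fused = (weights * tokens).sum(dim=1)                           # (batch, fused_dim)
        return fused, weights.squeeze(-1)

# A degraded channel (e.g. an occluded face yielding an uninformative visual feature)
# should receive a low weight, so the fusion leans on audio and text automatically.
fusion = CrossModalAttentionFusion()
fused, w = fusion(torch.randn(4, 2048), torch.randn(4, 768), torch.randn(4, 768))
print(fused.shape, w.shape)   # torch.Size([4, 512]) torch.Size([4, 3])
```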

Temporal Modeling

Bidirectional LSTMs capture emotional dynamics over time, modeling how emotions transition and evolve during conversation for context-aware predictions.
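
A small PyTorch sketch of the temporal head, assuming 512-d fused features arriving at 30 FPS; the hidden size and window length are illustrative.

```python
import torch
import torch.nn as nn

class TemporalEmotionHead(nn.Module):
    """Bi-LSTM over a window of fused per-frame features, emitting per-frame emotion distributions."""
    def __init__(self, fused_dim=512, hidden=256, num_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(fused_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_emotions)

    def forward(self, fused_seq):                    # (batch, time, fused_dim)
        out, _ = self.lstm(fused_seq)                # (batch, time, 2 * hidden)
        return torch.softmax(self.classifier(out), dim=-1)

# Example: a 2-second window at 30 FPS gives 60 fused frames per prediction
head = TemporalEmotionHead()
probs = head(torch.randn(1, 60, 512))                # (1, 60, 7) smoothed emotion probabilities
```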

Robot Integration

Real-time response system translates predicted emotions into robot behaviors via ROS, adjusting facial expressions, gestures, and conversational tone with end-to-end latency under 100 ms.
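
The deployed system exposes this mapping through a ROS Action Server; the sketch below shows a stripped-down, topic-based version of the same idea. The topic name, confidence threshold, and behavior names are hypothetical placeholders.

```python
import rospy
from std_msgs.msg import String

EMOTIONS = ["Happy", "Sad", "Angry", "Fear", "Surprise", "Neutral", "Disgust"]
CONFIDENCE_THRESHOLD = 0.6   # hypothetical cut-off; low-confidence frames are ignored

# Hypothetical mapping from recognized emotion to a named robot behavior
BEHAVIOR_MAP = {
    "Happy": "mirror_smile", "Sad": "comforting_gesture", "Angry": "de_escalate",
    "Fear": "reassure", "Surprise": "acknowledge", "Neutral": "idle", "Disgust": "back_off",
}

rospy.init_node("emotion_to_behavior")
behavior_pub = rospy.Publisher("/robot/behavior_cmd", String, queue_size=1)

def on_emotion(probs):
    """probs: list of 7 softmax scores from the classifier, in EMOTIONS order."""
    confidence = max(probs)
    if confidence < CONFIDENCE_THRESHOLD:
        return                                   # threshold-based filtering
    label = EMOTIONS[probs.index(confidence)]
    behavior_pub.publish(String(data=BEHAVIOR_MAP[label]))
```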

Performance Analysis

Comprehensive evaluation across seven emotion classes, modality ablation studies, and robustness testing under degraded conditions.

Per-Emotion Recognition Accuracy (radar chart)
Accuracy breakdown across 7 emotion classes

Modality Contribution Analysis (bar chart)
Overall accuracy by input modality

Confusion Matrix Approximation (stacked bar chart)
Correct classification vs. misclassification per emotion

Accuracy Under Different Conditions (line chart)
Graceful degradation of the fused system vs. a single modality

Key Performance Metrics

89%
Fused Accuracy
Overall emotion recognition accuracy with multimodal fusion
+24%
vs Best Single
Improvement over the best single-modality baseline
7
Emotion Classes
Happy, Sad, Angry, Fear, Surprise, Neutral, Disgust
<100ms
Latency
End-to-end inference time for real-time robot interaction

Core Innovations

⚖️

Dynamic Attention Fusion

Rather than naive concatenation, the system learns to dynamically weight each modality based on signal quality and informativeness. When one channel is degraded, others compensate automatically, yielding robust predictions across varied real-world conditions.

Temporal Emotion Modeling

Bidirectional LSTMs capture how emotions evolve over time during interaction. This temporal context prevents abrupt classification switches and models natural emotional transitions, producing smoother and more accurate continuous predictions.

🤖

Robotic Integration

Full ROS integration translates emotion predictions into robot behaviors in under 100ms. The system maps recognized emotions to appropriate robotic responses -- adjusting gestures, facial displays, and conversational strategies for empathetic human-robot interaction.

Tech Stack
PyTorch · CNNs · LSTMs · Attention Mechanisms · OpenCV · librosa · spaCy · ROS · ONNX Runtime

Business Impact and Delivery Scope

Problem Solved

Single-modality emotion systems miss context and perform poorly in realistic human-robot interactions.

What I Deliver

Multimodal emotion pipeline combining vision, audio, and text signals with robust fusion and evaluation.

Expected Impact

More accurate affect detection, better interaction quality, and improved system responsiveness in real environments.

Hire Me for Multimodal Emotion AI

I can implement perception stacks for HRI, assistive tech, and conversational systems that need affect awareness.

MVP Delivery

Emotion classifier prototype on your data with baseline fusion and measurable KPIs.

Production Hardening

Robustness tuning for noise, latency targets, and deployment constraints.

Advisory + Build

Model architecture and data strategy guidance for scalable multimodal HRI systems.

Other Projects

Sensor Fusion System

Multi-sensor data fusion for autonomous systems using advanced filtering and state estimation techniques.

Multimodal LLM

Large language model with multimodal capabilities for visual question answering and cross-modal reasoning.

Knowledge Engine

Intelligent knowledge retrieval and reasoning engine powered by graph neural networks and semantic search.

Math Reasoning

Neural mathematical reasoning system for automated theorem proving and step-by-step problem solving.

Pedestrian Awareness

Real-time pedestrian detection and intent prediction for autonomous driving safety systems.