Designed a system that fuses facial expressions, speech patterns, and linguistic cues to recognize human emotions in real time, enabling intuitive human-robot interaction.
For robots to interact naturally with humans, they need to understand not just what we say, but how we feel. Human emotions are expressed through multiple channels simultaneously -- facial expressions, vocal tone, and the words we choose. Relying on a single modality leads to critical misinterpretations: a smile paired with a sarcastic tone, or a neutral face masking frustration that is audible in the voice.
Single-modality approaches consistently miss context. A vision-only system cannot detect sarcasm. An audio-only system fails when the environment is noisy. Text analysis alone misses non-verbal cues entirely. The solution is multimodal fusion -- combining all three channels to capture the full emotional picture, enabling robots to respond with genuine empathy and situational awareness.
Three parallel encoders extract features from each modality, which are then dynamically fused using an attention mechanism before classification and robot control.
RGB camera captures facial expressions at 30 FPS. Frames preprocessed with face detection and alignment to normalize pose and lighting.
Fine-tuned ResNet-50 extracts 2048-d facial expression features. Pre-trained on FER2013, fine-tuned on AffectNet.
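A minimal sketch of the vision branch, assuming an OpenCV Haar cascade as a stand-in for the face detector (alignment omitted) and a torchvision ResNet-50 backbone in place of the FER2013/AffectNet fine-tuned weights:

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# Haar cascade face detector -- a simple stand-in for the detector/aligner actually used
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# ResNet-50 backbone; drop the classifier head so forward() returns 2048-d features
backbone = models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def face_features(frame_bgr):
    """Detect the largest face in a BGR frame and return a 2048-d feature vector."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face -- the fusion layer downweights the visual channel
    x, y, w, h = max(faces, key=lambda b: b[2] * b[3])
    crop = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        return backbone(preprocess(crop).unsqueeze(0)).squeeze(0)  # shape: (2048,)
```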
128-bin mel-frequency spectrogram combined with MFCCs, pitch contour, and energy envelope for prosodic analysis.
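A sketch of the prosodic feature extraction using librosa; feature sizes follow the description above, while hop lengths and the pyin pitch range are assumed defaults:

```python
import librosa
import numpy as np

def prosodic_features(wav_path, sr=16000, n_mels=128, n_mfcc=40):
    """Mel spectrogram + MFCCs + pitch contour + energy envelope for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                        # (128, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)  # pitch contour
    rms = librosa.feature.rms(y=y)                            # energy envelope (1, frames)
    f0 = np.nan_to_num(f0)[np.newaxis, :]                     # unvoiced frames -> 0
    n = min(log_mel.shape[1], mfcc.shape[1], f0.shape[1], rms.shape[1])
    return np.vstack([log_mel[:, :n], mfcc[:, :n], f0[:, :n], rms[:, :n]])
```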
RoBERTa-base fine-tuned on GoEmotions. Encodes semantic sentiment, emotional keywords, and contextual nuances.
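A sketch of the text branch with Hugging Face transformers; the stock "roberta-base" checkpoint stands in for the GoEmotions fine-tuned model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
encoder.eval()

def text_features(transcript: str) -> torch.Tensor:
    """Encode a transcribed utterance into a 768-d sentence embedding."""
    inputs = tokenizer(transcript, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden[:, 0, :].squeeze(0)                 # <s> token as the sentence summary
```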
Cross-modal attention dynamically weights each modality based on signal quality. Degraded channels are automatically downweighted for robust predictions.
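One simple way to realize quality-aware modality weighting is a learned softmax gate over per-modality projections; the project's exact attention formulation may differ, and the feature dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Learn a scalar weight per modality from its own features, then mix."""
    def __init__(self, dims=(2048, 168, 768), d_fused=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_fused) for d in dims])
        self.score = nn.ModuleList([nn.Linear(d_fused, 1) for d in dims])

    def forward(self, feats, mask=None):
        # feats: list of (batch, dim) tensors; mask: (batch, 3), 0 marks a missing modality
        z = torch.stack([torch.tanh(p(f)) for p, f in zip(self.proj, feats)], dim=1)
        scores = torch.cat([s(z[:, i]) for i, s in enumerate(self.score)], dim=1)
        if mask is not None:                        # degraded/missing channels get -inf
            scores = scores.masked_fill(mask == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)        # (batch, 3) modality weights
        return (alpha.unsqueeze(-1) * z).sum(dim=1)  # (batch, d_fused)
```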
Bidirectional LSTM captures emotional dynamics over time, modeling how emotions transition during conversation for context-aware predictions.
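A minimal sketch of the temporal model, assuming a sequence of fused embeddings per conversation (hidden sizes and layer count are placeholders):

```python
import torch.nn as nn

class TemporalEmotionModel(nn.Module):
    """Bidirectional LSTM over fused per-step embeddings, with per-step logits."""
    def __init__(self, d_fused=256, d_hidden=128, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(d_fused, d_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, fused_seq):
        # fused_seq: (batch, time, d_fused) -- one fused vector per time step
        out, _ = self.lstm(fused_seq)   # (batch, time, 2 * d_hidden)
        return self.head(out)           # logits over the 7 emotion classes per step
```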
7-class softmax: Happy, Sad, Angry, Fear, Surprise, Neutral, Disgust. Confidence scores enable threshold-based filtering.
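A sketch of how threshold-based filtering on the softmax confidence might look; the 0.6 threshold and the Neutral fallback are assumptions, not the project's tuned values:

```python
import torch

EMOTIONS = ["Happy", "Sad", "Angry", "Fear", "Surprise", "Neutral", "Disgust"]

def decode_prediction(logits: torch.Tensor, threshold: float = 0.6):
    """Map 7-class logits to a label, falling back to Neutral below the threshold."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    if conf.item() < threshold:        # low-confidence steps are treated as Neutral
        return "Neutral", conf.item()
    return EMOTIONS[idx.item()], conf.item()
```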
ROS Action Server translates emotions into robot behaviors -- gestures, facial displays, and conversational tone. End-to-end latency under 100ms.
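A simplified stand-in for the ROS interface: rather than the full action server, this sketch publishes each prediction as JSON on a topic; the topic name and message schema are illustrative, not the project's actual API:

```python
import json
import rospy
from std_msgs.msg import String

rospy.init_node("emotion_recognizer")
pub = rospy.Publisher("/emotion/state", String, queue_size=10)

def publish_emotion(label: str, confidence: float) -> None:
    """Publish one prediction for downstream behavior controllers to consume."""
    pub.publish(String(data=json.dumps({"emotion": label, "confidence": confidence})))
```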
A six-stage pipeline transforms raw multimodal inputs into emotionally aware robot responses; a schematic glue loop for chaining the stages is sketched after the stage descriptions below.
CNN-based facial expression analysis using a fine-tuned ResNet backbone. Face detection and alignment precede feature extraction for robust performance across poses and lighting conditions.
Speech pattern and prosody analysis extracting mel-frequency cepstral coefficients, pitch contours, and energy envelopes to capture tonal emotional cues beyond lexical content.
NLP encoder processes transcribed speech for sentiment, emotional keywords, and contextual cues using transformer-based embeddings that capture nuanced linguistic emotional expression.
Cross-modal attention mechanism dynamically weights modalities based on signal quality and informativeness -- upweighting audio when visual is occluded, or text when audio is noisy.
Bidirectional LSTMs capture emotional dynamics over time, modeling how emotions transition and evolve during conversation for context-aware predictions.
Real-time response system translates predicted emotions into robot behaviors via ROS, adjusting facial expressions, gestures, and conversational tone within 100ms latency.
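A self-contained sketch of how the six stages could be chained at inference time while tracking the 100ms budget; the stage callables themselves are whatever implementations back each stage (the component sketches above are one possibility):

```python
import time
from typing import Callable, Dict

def run_stages(stages: Dict[str, Callable[[dict], dict]], inputs: dict,
               budget_ms: float = 100.0) -> dict:
    """Chain named pipeline stages over a shared context dict, logging per-stage time."""
    ctx, t_start = dict(inputs), time.perf_counter()
    for name, stage in stages.items():
        t0 = time.perf_counter()
        ctx = stage(ctx)
        print(f"{name}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
    total_ms = (time.perf_counter() - t_start) * 1e3
    if total_ms > budget_ms:
        print(f"warning: {total_ms:.1f} ms exceeds the {budget_ms:.0f} ms budget")
    return ctx
```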
Comprehensive evaluation across seven emotion classes, modality ablation studies, and robustness testing under degraded conditions.
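One way the per-class and ablation numbers can be produced is with scikit-learn's standard metrics, run once per test condition; this is a generic sketch, not the project's evaluation harness:

```python
from sklearn.metrics import classification_report, confusion_matrix

LABELS = ["Happy", "Sad", "Angry", "Fear", "Surprise", "Neutral", "Disgust"]

def evaluate(y_true, y_pred, condition="full fusion"):
    """Per-class precision/recall/F1 and a confusion matrix for one test condition.

    Intended to be called once per ablation (vision-only, audio-only, text-only,
    full fusion) and once per degradation setting to compare robustness.
    """
    print(f"=== {condition} ===")
    print(classification_report(y_true, y_pred, labels=LABELS, zero_division=0))
    print(confusion_matrix(y_true, y_pred, labels=LABELS))
```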
Rather than naive concatenation, the system learns to dynamically weight each modality based on signal quality and informativeness. When one channel is degraded, others compensate automatically, yielding robust predictions across varied real-world conditions.
Bidirectional LSTMs capture how emotions evolve over time during interaction. This temporal context prevents abrupt classification switches and models natural emotional transitions, producing smoother and more accurate continuous predictions.
Full ROS integration translates emotion predictions into robot behaviors in under 100ms. The system maps recognized emotions to appropriate robotic responses -- adjusting gestures, facial displays, and conversational strategies for empathetic human-robot interaction.
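An illustrative emotion-to-behavior mapping of the kind this layer maintains; the behavior names, the tuple structure, and the confidence fallback are hypothetical, not the project's actual behavior library:

```python
# (gesture, facial display, conversational strategy) per recognized emotion
BEHAVIORS = {
    "Happy":    ("nod",       "smile",        "mirror_positive"),
    "Sad":      ("lean_in",   "concerned",    "offer_support"),
    "Angry":    ("step_back", "neutral",      "de_escalate"),
    "Fear":     ("still",     "soft_gaze",    "reassure"),
    "Surprise": ("tilt_head", "raised_brows", "clarify"),
    "Neutral":  ("idle",      "neutral",      "continue_topic"),
    "Disgust":  ("pause",     "neutral",      "change_topic"),
}

def select_behavior(emotion: str, confidence: float, threshold: float = 0.6):
    """Fall back to the Neutral behavior when the classifier is unsure."""
    key = emotion if confidence >= threshold else "Neutral"
    return BEHAVIORS[key]
```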
Single-modality emotion systems miss context and perform poorly in realistic human-robot interactions.
Multimodal emotion pipeline combining vision, audio, and text signals with robust fusion and evaluation.
More accurate affect detection, better interaction quality, and improved system responsiveness in real environments.
I can implement perception stacks for HRI, assistive tech, and conversational systems that need affect awareness.
Emotion classifier prototype on your data with baseline fusion and measurable KPIs.
Robustness tuning for noise, latency targets, and deployment constraints.
Model architecture and data strategy guidance for scalable multimodal HRI systems.
Multi-sensor data fusion for autonomous systems using advanced filtering and state estimation techniques.
Large language model with multimodal capabilities for visual question answering and cross-modal reasoning.
Intelligent knowledge retrieval and reasoning engine powered by graph neural networks and semantic search.
Neural mathematical reasoning system for automated theorem proving and step-by-step problem solving.
Real-time pedestrian detection and intent prediction for autonomous driving safety systems.