Self-Supervised Learning · Audio-Visual Fusion · Edge AI · Pedestrian Detection · Jetson Orin

Audio-Visual Fusion for Dynamic Pedestrian Awareness

Built a self-supervised audio-visual fusion system that detects and predicts pedestrian motion using ambient footstep sounds and camera imagery, achieving LiDAR-comparable performance at a fraction of the cost. Deployed on a Jetson Orin Nano.

89% LiDAR-level Performance
Edge Deployed
SSL Self-Supervised
~$200 Low-Cost Solution

Why Audio-Visual Fusion for Pedestrian Detection?

LiDAR remains the gold standard for pedestrian detection in autonomous systems, but it is prohibitively expensive ($1,000–$10,000+ per unit), power-hungry, and adds significant integration complexity. Camera-only approaches are cheaper but fundamentally limited — they fail under heavy occlusion, struggle in low-light and nighttime conditions, and cannot perceive pedestrians hidden behind obstacles.

Audio provides a complementary sensing modality that addresses these gaps directly. Footstep sounds, movement noise, and ambient audio signals propagate around corners and through occluding objects, effectively allowing the system to “see” where cameras cannot. By fusing audio and visual streams through a learned attention mechanism, we achieve detection performance approaching LiDAR baselines — on hardware costing a fraction of the price and consuming far less power. The key challenge is designing a self-supervised fusion architecture that learns robust audio-visual correspondences without requiring expensive manual annotation, and deploying it efficiently on edge hardware like the Jetson Orin Nano.

Audio-Visual Fusion Pipeline

The architecture processes two parallel sensor streams — camera and microphone array — through modality-specific backbones, fuses their representations via a learned attention mechanism with dynamic weighting, and produces pedestrian detection with motion prediction. A self-supervised training loop closes the learning cycle without manual labels.

Camera + Mic → Encode → Fuse → Detect → Edge Deploy

Camera stream: a low-cost camera module captures RGB frames, and a lightweight multi-scale CNN backbone encodes them into spatial visual embeddings of pedestrian appearance, pose, and scene context.

Microphone array: a multi-microphone array captures ambient sound such as footsteps and movement noise and enables direction-of-arrival estimation; mel spectrograms are processed by a temporal CNN into audio embeddings of footstep patterns, direction, and movement characteristics.

Self-supervised pre-training: contrastive objectives align the visual and audio embeddings using unlabeled audio-visual pairs.

Attention-based fusion: cross-modal attention dynamically reweights the two modalities (audio upweighted in darkness or occlusion, visual dominant in clear conditions) and feeds a pedestrian detector plus a motion predictor that outputs detection boxes, predicted trajectories, and crossing intent for collision avoidance.

Optimization and edge deployment: INT8 quantization via TensorRT reduces model size and accelerates inference by 2.3x with <1% accuracy loss; an asynchronous, lock-free sensor pipeline sustains 28 FPS on the Jetson Orin Nano within a 10W power envelope, at ~$200 total hardware cost versus $1,000+ for LiDAR.

Pipeline Steps

The system follows a six-stage pipeline from self-supervised pre-training through edge deployment, with each stage designed for minimal supervision and maximum efficiency on constrained hardware.

Self-Supervised Pre-training

Learn audio-visual correspondences without manual labels by training contrastive objectives that align temporally co-occurring audio and visual features while pushing apart non-corresponding pairs.
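
A minimal sketch of this objective, assuming one visual embedding and one audio embedding per clip from the two backbones; the InfoNCE-style formulation, the temperature value, and the function name audio_visual_contrastive_loss are illustrative rather than the exact training configuration.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(visual_emb, audio_emb, temperature=0.07):
    """InfoNCE-style loss: temporally aligned (visual, audio) pairs are
    positives; every other pairing in the batch acts as a negative.

    visual_emb, audio_emb: (B, D) embeddings from the two backbones.
    """
    v = F.normalize(visual_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)

    # Cosine-similarity logits between every visual/audio pair in the batch.
    logits = v @ a.t() / temperature            # (B, B)
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric loss: match visual -> audio and audio -> visual.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)
```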

Visual Feature Extraction

Camera frames processed through a lightweight CNN backbone to extract spatial features encoding pedestrian appearance, pose, and scene context at multiple resolution scales.
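
A minimal PyTorch sketch of such a backbone; the class name LightVisualBackbone, the channel widths, and the number of stages are placeholders rather than the deployed network.

```python
import torch
import torch.nn as nn

class LightVisualBackbone(nn.Module):
    """Small strided CNN that returns feature maps at multiple scales."""

    def __init__(self, in_ch=3, widths=(32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, w, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            ch = w

    def forward(self, x):
        # Collect one feature map per stage (1/2, 1/4, 1/8 resolution).
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

# Example: a 640x480 RGB frame yields three progressively coarser maps.
# maps = LightVisualBackbone()(torch.randn(1, 3, 480, 640))
```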

Audio Feature Extraction

Ambient sound captured by microphone array, converted to mel spectrograms, and processed through a temporal CNN to extract features encoding footstep patterns, direction of arrival, and movement characteristics.
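
A sketch of this audio front end, assuming librosa for log-mel extraction and a small CNN over the time-frequency map; the sample rate, mel-band count, and module names are illustrative choices, not the production settings.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_features(wav_path, sr=16000, n_mels=64):
    """Load a mono clip and return a log-mel spectrogram of shape (1, n_mels, T)."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return torch.from_numpy(log_mel).float().unsqueeze(0)

class AudioBackbone(nn.Module):
    """Temporal CNN over log-mel maps producing a fixed-size audio embedding."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, log_mel):                 # (B, 1, n_mels, T)
        x = self.conv(log_mel).flatten(1)       # (B, 64)
        return self.proj(x)                     # (B, embed_dim)

# Usage: add a batch dimension before the backbone, e.g.
# emb = AudioBackbone()(mel_features("clip.wav").unsqueeze(0))
```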

Attention-Based Fusion

A cross-modal attention mechanism dynamically weights visual and audio modalities based on environmental conditions — increasing audio weight in darkness or occlusion, and visual weight in clear daylight.
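
A minimal sketch of cross-modal fusion built on torch.nn.MultiheadAttention; the sigmoid gate used here to shift weight between modalities is an illustrative stand-in for the learned weighting described above, not the exact mechanism.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Visual tokens attend over audio tokens; a learned gate mixes the
    attended audio context back into the visual stream."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, Nv, D) spatial tokens from the visual backbone
        # audio_tokens:  (B, Na, D) temporal tokens from the audio backbone
        audio_ctx, _ = self.attn(query=visual_tokens,
                                 key=audio_tokens,
                                 value=audio_tokens)
        # The gate decides, per token, how much audio context to mix in
        # (more in darkness or occlusion, less in clear daylight).
        g = self.gate(torch.cat([visual_tokens, audio_ctx], dim=-1))
        return visual_tokens + g * audio_ctx
```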

Quantized Inference

INT8 quantization via TensorRT reduces model size and accelerates inference by 2.3x with less than 1% accuracy degradation, enabling real-time operation on resource-constrained edge hardware.
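
A hedged sketch of the export-and-quantize path: ONNX export from PyTorch, followed by an INT8 engine build with TensorRT's trtexec tool on the device. The model and file names are placeholders, and a production INT8 build would additionally supply representative calibration data.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the trained audio-visual fusion detector.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 4, 1))
model.eval()

# Export to ONNX; input shape and tensor names are illustrative.
dummy_frame = torch.randn(1, 3, 480, 640)
torch.onnx.export(
    model, dummy_frame, "av_fusion.onnx",
    input_names=["frame"], output_names=["detections"],
    opset_version=17,
)

# On the Jetson, the ONNX graph is then built into an INT8 engine with the
# trtexec tool bundled with TensorRT, for example:
#   trtexec --onnx=av_fusion.onnx --int8 --saveEngine=av_fusion_int8.engine
# A real INT8 build would also provide a calibration dataset or cache so the
# quantization scales reflect the deployment data distribution.
```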

Jetson Orin Deployment

Asynchronous sensor pipelines decouple camera and microphone processing, with lock-free queues and double buffering to maximize GPU utilization and maintain consistent frame rates on the Jetson Orin Nano.
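
A simplified sketch of the decoupled producer/consumer layout using Python threads and bounded standard-library queues; the real deployment uses lock-free queues and double buffering, so this illustrates the structure rather than the production implementation.

```python
import queue
import threading
import time

# Bounded queues decouple sensor capture from GPU inference; dropping the
# oldest item when full keeps end-to-end latency low instead of backing up.
frame_q = queue.Queue(maxsize=2)
audio_q = queue.Queue(maxsize=2)

def capture_frame():           # stub standing in for the camera driver
    time.sleep(1 / 30)
    return "frame"

def capture_audio_chunk():     # stub standing in for the mic-array driver
    time.sleep(1 / 30)
    return "audio"

def producer(grab, q):
    while True:
        item = grab()
        if q.full():
            try:
                q.get_nowait()         # discard the stale item
            except queue.Empty:
                pass
        q.put(item)

def consumer():
    while True:
        frame = frame_q.get()
        chunk = audio_q.get()
        # Placeholder for the TensorRT engine call on the fused inputs.
        print("inference on", frame, chunk)

for target, args in ((producer, (capture_frame, frame_q)),
                     (producer, (capture_audio_chunk, audio_q)),
                     (consumer, ())):
    threading.Thread(target=target, args=args, daemon=True).start()

time.sleep(1)                  # let the pipeline run briefly in this sketch
```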

Interactive Charts

Quantitative evaluation across detection accuracy, adverse conditions, real-time performance, and range degradation. All benchmarks measured against LiDAR ground truth on a custom urban pedestrian dataset.

Detection Performance Comparison

Detection rate (%) across modalities

Bar Chart

Performance Under Adverse Conditions

Robustness across six challenging scenarios

Radar Chart

Real-Time FPS on Jetson Orin Nano

Frame rate stability over 100 frames

Line Chart

Detection Accuracy vs Distance

Accuracy degradation over 5m–50m range

Line Chart

System Comparison

A comprehensive comparison of sensing approaches across cost, detection capability, environmental robustness, and power requirements. Our audio-visual fusion system achieves the best cost-to-performance ratio.

System | Hardware Cost | Detection Rate | Works in Darkness | Works Through Occlusion | Power Consumption
LiDAR System | $1,000–$10,000+ | 92% | Yes | No | 15–30W
Camera-Only | ~$50 | 71% | No | No | 2–5W
Audio-Only | ~$30 | 48% | Yes | Yes | 1–2W
Our Fusion | ~$200 | 89% | Yes | Yes | 7–10W

Key Outcomes

The fused system was evaluated against LiDAR ground truth on a custom urban pedestrian dataset spanning daylight, nighttime, occluded, and adverse weather conditions.

89%
Detection Rate

Overall pedestrian detection accuracy across all conditions, closing the gap with expensive LiDAR systems.

97%
Of LiDAR Performance

Achieves 97% of the LiDAR baseline detection rate at a small fraction of the hardware cost (~$200 vs $1,000–$10,000+).

28
FPS on Edge

Real-time inference at 28 FPS on Jetson Orin Nano with INT8 quantization, exceeding the 24 FPS target.

~$200
Hardware Cost

Total system cost including Jetson Orin Nano, camera module, and microphone array — a fraction of LiDAR.

Design Highlights

Self-Supervised Learning

Contrastive pre-training learns audio-visual correspondences from unlabeled data by exploiting the natural temporal alignment between footstep sounds and visual pedestrian motion. This eliminates the need for costly manual annotation and enables rapid adaptation to new environments.

Attention-Based Fusion

A learned cross-modal attention mechanism dynamically adjusts the contribution of each modality based on environmental context. In darkness or heavy occlusion, audio features are upweighted; in clear conditions, visual features dominate — ensuring robust detection everywhere.

Edge-Optimized Deployment

INT8 quantization via TensorRT, asynchronous sensor pipelines with lock-free queues, and optimized memory layout deliver 28 FPS on the Jetson Orin Nano. The entire system fits within a 10W power envelope suitable for battery-powered or embedded applications.

Technologies Used

Frameworks & Tools
PyTorch · ONNX · TensorRT · Jetson Orin Nano · librosa · OpenCV · INT8 Quantization · Async Pipelines

Business Impact and Delivery Scope

Problem Solved

Camera-only pedestrian awareness degrades under occlusion, darkness, and noisy real-world urban conditions.

What I Deliver

Audio-visual fusion pipeline with edge-optimized inference for low-cost, real-time pedestrian detection.

Expected Impact

Safer detection coverage at lower hardware cost with practical deployment on constrained devices.

Hire Me for Edge Perception Deployment

I can build cost-efficient perception pipelines for robotics and mobility products running on edge hardware.

MVP Delivery

Audio-visual detection baseline with performance measurement on your target environment.

Production Hardening

Quantization, throughput optimization, and reliability testing on device.

Advisory + Build

End-to-end support from sensing strategy to deployment and field validation.

Other Projects

Real-Time Multi-Sensor Fusion for Autonomous Perception

Unified perception pipeline fusing camera, LiDAR, and radar through cross-modal BEV architecture with attention-based alignment.

Instruction-Tuned Multimodal LLM for Scene Understanding

Vision Transformer integrated with a decoder-only LLM for conversational VQA, referring expressions, and multimodal grounding.

Knowledge-Augmented Reasoning Engine via Fine-Tuned LLM

PEFT fine-tuning with RAG pipeline injecting knowledge graph sub-graphs and Chain-of-Thought prompting for factual reasoning.

Enhancing Math Reasoning in LLMs via Self-Supervised Fine-Tuning

Qwen 2.5-32B fine-tuned with a novel "Wait" token technique achieving 56.7% on AIME 2024.

Multimodal Emotion Recognition for Human-Robot Interaction

Multimodal system combining vision, speech, and NLP with CNNs, LSTMs, and attention for real-time emotion recognition.