Built a self-supervised audio-visual fusion system that detects and predicts pedestrian motion using ambient footstep sounds and camera imagery, achieving LiDAR-comparable performance at a fraction of the cost. Deployed on an NVIDIA Jetson Orin Nano.
LiDAR remains the gold standard for pedestrian detection in autonomous systems, but it is prohibitively expensive ($1,000–$10,000+ per unit), power-hungry, and adds significant integration complexity. Camera-only approaches are cheaper but fundamentally limited — they fail under heavy occlusion, struggle in low-light and nighttime conditions, and cannot perceive pedestrians hidden behind obstacles.
Audio provides a complementary sensing modality that addresses these gaps directly. Footstep sounds, movement noise, and ambient audio signals propagate around corners and through occluding objects, effectively allowing the system to “see” where cameras cannot. By fusing audio and visual streams through a learned attention mechanism, we achieve detection performance approaching LiDAR baselines — on hardware costing a fraction of the price and consuming far less power. The key challenge is designing a self-supervised fusion architecture that learns robust audio-visual correspondences without requiring expensive manual annotation, and deploying it efficiently on edge hardware like the Jetson Orin Nano.
The architecture processes two parallel sensor streams — camera and microphone array — through modality-specific backbones, fuses their representations via a learned attention mechanism with dynamic weighting, and produces pedestrian detection with motion prediction. A self-supervised training loop closes the learning cycle without manual labels.
Low-cost camera module captures RGB frames. Serves as primary visual sensor for pedestrian appearance and scene context.
Lightweight CNN extracts spatial features encoding pedestrian appearance, pose, and scene context at multiple resolution scales.
Multi-mic array captures ambient sounds, including footsteps and movement noise. Enables direction-of-arrival estimation for spatial audio.
Mel spectrogram processed through temporal CNN to extract features encoding footstep patterns, direction, and movement characteristics.
Cross-modal attention dynamically adjusts modality contribution. In darkness audio is upweighted; in clear conditions visual dominates.
Predicts pedestrian trajectory and crossing intent. Outputs detection boxes + predicted future positions for collision avoidance.
INT8 quantization via TensorRT reduces model size and accelerates inference by 2.3x with <1% accuracy loss.
Real-time inference at 28 FPS within a 10W power envelope. Total hardware cost ~$200 vs $1,000+ for LiDAR systems.
The system follows a six-stage pipeline from self-supervised pre-training through edge deployment, with each stage designed for minimal supervision and maximum efficiency on constrained hardware.
Learn audio-visual correspondences without manual labels by training contrastive objectives that align temporally co-occurring audio and visual features while pushing apart non-corresponding pairs.
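A minimal PyTorch sketch of this objective, assuming a symmetric InfoNCE-style loss over a batch of temporally aligned clips (function and variable names here are illustrative, not the production code):

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(visual_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of temporally aligned clips.

    visual_emb, audio_emb: (B, D) embeddings from the two backbones,
    where row i of each tensor comes from the same time window.
    """
    v = F.normalize(visual_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)

    # Cosine-similarity logits; the diagonal holds the co-occurring pairs.
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Pull matching audio-visual pairs together and push the rest apart,
    # in both retrieval directions.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)
```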
Camera frames processed through a lightweight CNN backbone to extract spatial features encoding pedestrian appearance, pose, and scene context at multiple resolution scales.
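A rough sketch of what such a backbone can look like; the layer widths and strides below are assumptions, the point being a small strided CNN that returns a three-level feature pyramid:

```python
import torch.nn as nn

class VisualBackbone(nn.Module):
    """Tiny strided CNN returning feature maps at three scales
    (1/8, 1/16, and 1/32 of the input resolution)."""

    def __init__(self, width=32):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        self.stem = nn.Sequential(block(3, width), block(width, width * 2))  # 1/4
        self.stage3 = block(width * 2, width * 4)   # 1/8
        self.stage4 = block(width * 4, width * 8)   # 1/16
        self.stage5 = block(width * 8, width * 8)   # 1/32

    def forward(self, rgb):                         # rgb: (B, 3, H, W)
        x = self.stem(rgb)
        p3 = self.stage3(x)
        p4 = self.stage4(p3)
        p5 = self.stage5(p4)
        return p3, p4, p5
```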
Ambient sound captured by microphone array, converted to mel spectrograms, and processed through a temporal CNN to extract features encoding footstep patterns, direction of arrival, and movement characteristics.
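A hedged sketch of the audio path using torchaudio's MelSpectrogram front end; the microphone count, layer widths, and embedding size are placeholders:

```python
import torch.nn as nn
import torchaudio

class AudioBackbone(nn.Module):
    """Mel-spectrogram front end followed by a small temporal CNN. The mic
    channels are stacked so inter-mic level/timing differences (the
    direction-of-arrival cue) remain visible to the network."""

    def __init__(self, n_mics=4, sample_rate=16000, n_mels=64, emb_dim=256):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels)
        self.db = torchaudio.transforms.AmplitudeToDB()
        self.net = nn.Sequential(
            nn.Conv2d(n_mics, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, waveform):                 # waveform: (B, n_mics, T) raw audio
        spec = self.db(self.melspec(waveform))   # (B, n_mics, n_mels, frames)
        feat = self.net(spec).flatten(1)         # (B, 128)
        return self.proj(feat)                   # (B, emb_dim)
```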
A cross-modal attention mechanism dynamically weights visual and audio modalities based on environmental conditions — increasing audio weight in darkness or occlusion, and visual weight in clear daylight.
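A simplified sketch of the idea as a learned gating network over the two embeddings, with the detection and trajectory heads attached; the deployed model uses fuller cross-modal attention, so this reduced form only illustrates the dynamic weighting:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """A small MLP inspects both streams and predicts per-modality weights,
    so audio can dominate in darkness/occlusion and vision in clear daylight."""

    def __init__(self, dim=256, horizon=12):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 2), nn.Softmax(dim=-1),
        )
        self.det_head = nn.Linear(dim, 5)             # (score, x, y, w, h)
        self.traj_head = nn.Linear(dim, 2 * horizon)  # future (x, y) offsets

    def forward(self, visual_emb, audio_emb):         # both (B, dim)
        w = self.gate(torch.cat([visual_emb, audio_emb], dim=-1))   # (B, 2)
        fused = w[:, :1] * visual_emb + w[:, 1:] * audio_emb
        return self.det_head(fused), self.traj_head(fused)
```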
INT8 quantization via TensorRT reduces model size and accelerates inference by 2.3x with less than 1% accuracy degradation, enabling real-time operation on resource-constrained edge hardware.
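One common route on JetPack is to export the trained network to ONNX and build an INT8 engine with trtexec; the export shapes and calibration workflow below are assumptions rather than the exact deployed recipe:

```python
import torch

# `model` is the trained audio-visual fusion network from the stages above.
model.eval()
dummy_rgb = torch.randn(1, 3, 480, 640)
dummy_audio = torch.randn(1, 4, 16000)
torch.onnx.export(model, (dummy_rgb, dummy_audio), "av_fusion.onnx",
                  input_names=["rgb", "audio"], opset_version=17)

# Build an INT8 TensorRT engine on the Jetson. A calibration cache built from
# representative sensor clips supplies the INT8 scales. trtexec ships with
# TensorRT on JetPack:
#
#   trtexec --onnx=av_fusion.onnx --int8 \
#           --calib=calibration.cache --saveEngine=av_fusion_int8.plan
```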
Asynchronous sensor pipelines decouple camera and microphone processing, with lock-free queues and double buffering to maximize GPU utilization and maintain consistent frame rates on the Jetson Orin Nano.
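The decoupling pattern, sketched in Python for readability; the standard-library queue is lock-based, so this only mirrors the shape of the design rather than the lock-free, double-buffered implementation on device, and grab_frame, grab_audio, and run_trt_engine are stand-in stubs:

```python
import threading, queue, time

def grab_frame():                    # stand-in for the camera capture call
    time.sleep(1 / 30)
    return "rgb_frame"

def grab_audio():                    # stand-in for the microphone-array capture call
    time.sleep(0.05)
    return "audio_window"

def run_trt_engine(frame, audio):    # stand-in for the INT8 TensorRT engine call
    print("inference on", frame, audio)

rgb_q = queue.Queue(maxsize=1)       # Python queues are lock-based; on device the
audio_q = queue.Queue(maxsize=1)     # pipeline uses lock-free ring buffers instead

def producer(capture_fn, q):
    """Each sensor runs on its own thread and keeps only the newest item,
    so inference never waits on a stale frame."""
    while True:
        item = capture_fn()
        try:
            q.put_nowait(item)
        except queue.Full:
            try:
                q.get_nowait()       # drop the stale item
            except queue.Empty:
                pass
            q.put_nowait(item)

threading.Thread(target=producer, args=(grab_frame, rgb_q), daemon=True).start()
threading.Thread(target=producer, args=(grab_audio, audio_q), daemon=True).start()

while True:                          # consumer: pair the freshest frame and audio window
    run_trt_engine(rgb_q.get(), audio_q.get())
```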
Quantitative evaluation across detection accuracy, adverse conditions, real-time performance, and range degradation. All benchmarks measured against LiDAR ground truth on a custom urban pedestrian dataset.
Detection rate (%) across modalities
Robustness across six challenging scenarios
Frame rate stability over 100 frames
Accuracy degradation over 5m–50m range
A comprehensive comparison of sensing approaches across cost, detection capability, environmental robustness, and power requirements. Our audio-visual fusion system achieves the best cost-to-performance ratio.
| System | Hardware Cost | Detection Rate | Works in Darkness | Works Through Occlusion | Power Consumption |
|---|---|---|---|---|---|
| LiDAR System | $1,000–$10,000+ | 92% | Yes | No | 15–30W |
| Camera-Only | ~$50 | 71% | No | No | 2–5W |
| Audio-Only | ~$30 | 48% | Yes | Yes | 1–2W |
| Our Fusion | ~$200 | 89% | Yes | Yes | 7–10W |
The fused system was evaluated against LiDAR ground truth on a custom urban pedestrian dataset spanning daylight, nighttime, occluded, and adverse weather conditions.
Contrastive pre-training learns audio-visual correspondences from unlabeled data by exploiting the natural temporal alignment between footstep sounds and visual pedestrian motion. This eliminates the need for costly manual annotation and enables rapid adaptation to new environments.
A learned cross-modal attention mechanism dynamically adjusts the contribution of each modality based on environmental context. In darkness or heavy occlusion, audio features are upweighted; in clear conditions, visual features dominate — ensuring robust detection everywhere.
INT8 quantization via TensorRT, asynchronous sensor pipelines with lock-free queues, and optimized memory layout deliver 28 FPS on the Jetson Orin Nano. The entire system fits within a 10W power envelope suitable for battery-powered or embedded applications.
Camera-only pedestrian awareness degrades under occlusion, darkness, and noisy real-world urban conditions.
Audio-visual fusion pipeline with edge-optimized inference for low-cost, real-time pedestrian detection.
Safer detection coverage at lower hardware cost with practical deployment on constrained devices.
I can build cost-efficient perception pipelines for robotics and mobility products running on edge hardware.
Audio-visual detection baseline with performance measurement on your target environment.
Quantization, throughput optimization, and reliability testing on device.
End-to-end support from sensing strategy to deployment and field validation.
Unified perception pipeline fusing camera, LiDAR, and radar through cross-modal BEV architecture with attention-based alignment.
Vision Transformer integrated with a decoder-only LLM for conversational VQA, referring expressions, and multimodal grounding.
PEFT fine-tuning with RAG pipeline injecting knowledge graph sub-graphs and Chain-of-Thought prompting for factual reasoning.
Qwen 2.5-32B fine-tuned with a novel "Wait" token technique achieving 56.7% on AIME 2024.
Multimodal system combining vision, speech, and NLP with CNNs, LSTMs, and attention for real-time emotion recognition.