Computer Vision · Autonomous Vehicles · Sensor Fusion · 3D Detection

Real-Time Multi-Sensor Fusion for Autonomous Perception

Engineered a unified perception pipeline fusing camera, LiDAR, and radar data through a cross-modal Bird's-Eye-View (BEV) architecture with attention-based alignment and temporal self-attention for consistent 3D object detection. Deployed on an autonomous vehicle testbed with INT8 quantization and TensorRT optimization, achieving sub-50ms inference on embedded GPU hardware.

+17% Object Recall
<50ms Latency
20% Less Overhead
3 Modalities

Why Multi-Sensor Fusion Matters

Autonomous driving demands a perception system that works reliably in every condition — bright sunlight, heavy rain, fog, darkness, and cluttered urban environments. No single sensor modality is sufficient. Cameras deliver rich semantic and texture information but lack depth estimation at range and degrade in poor lighting. LiDAR provides precise 3D point clouds but becomes sparse at distance and struggles with adverse weather like rain or fog that scatters its laser pulses. Radar is robust to weather and measures velocity directly, but its angular resolution is too coarse for fine-grained object classification.

Existing approaches often process each sensor in isolation and attempt late fusion at the decision level, losing complementary information early in the pipeline. This project addresses that gap by designing a unified, BEV-centric perception architecture that fuses multi-modal features at the representation level, preserving geometric and semantic synergies across sensors. The result is a system that maintains high recall and precision across the full detection range and all environmental conditions, while meeting the strict latency budget required for real-time autonomous operation.

Complete Perception Pipeline

From raw sensor streams to 3D tracked objects: the pipeline processes three modalities in parallel, fuses them in BEV space, and applies temporal reasoning for coherent tracking.

Sensor Inputs → Feature Extraction → BEV Fusion → Detection → Deployment
Camera Stream

Camera Input (RGB frames, 1920×1080)
Multi-camera rig captures a 360° surround view at 30 FPS. Images are undistorted and synchronized across all cameras.

Visual Feature Extractor (ResNet-50 backbone + FPN)
Pre-trained ResNet-50 extracts multi-scale feature maps; an FPN neck generates hierarchical features for dense prediction tasks.
LiDAR Stream

LiDAR Input (point cloud, ~300K points/frame)
64-beam rotating LiDAR produces dense 3D point clouds at 10 Hz. Points encode XYZ coordinates, intensity, and ring index.

3D Feature Encoder (PointPillars voxelization)
PointPillars voxelizes the point cloud into vertical columns and applies a per-pillar PointNet to produce a pseudo-image BEV representation.
Radar Stream

Radar Input (77 GHz FMCW)
77 GHz FMCW radar provides range-Doppler maps with direct velocity measurements and remains robust in all weather conditions, including fog and rain.

Radar Processing (CFAR on range-Doppler maps)
CFAR detection on the range-Doppler maps extracts targets; velocity and angle features are encoded into a dense tensor for fusion.
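To make the radar branch concrete, below is a minimal cell-averaging CFAR sketch over a single range profile. The training/guard cell counts, threshold scale, and toy input are illustrative assumptions, not the deployed signal chain.

```python
import numpy as np

def ca_cfar_1d(power, num_train=16, num_guard=4, scale=5.0):
    """Cell-averaging CFAR over a 1D power profile (e.g. one Doppler slice).

    For each cell under test, the noise floor is estimated from num_train
    training cells on each side (skipping num_guard guard cells); a target is
    declared when the cell exceeds scale times that estimate.
    """
    n = len(power)
    half = num_train + num_guard
    detections = np.zeros(n, dtype=bool)
    for i in range(half, n - half):
        left = power[i - half : i - num_guard]              # training cells, left side
        right = power[i + num_guard + 1 : i + half + 1]     # training cells, right side
        noise = np.mean(np.concatenate([left, right]))
        detections[i] = power[i] > scale * noise
    return detections

# Toy range profile: exponential noise plus two synthetic targets.
rng = np.random.default_rng(0)
profile = rng.exponential(1.0, size=256)
profile[[80, 170]] += 30.0
print(np.flatnonzero(ca_cfar_1d(profile)))   # indices near 80 and 170
```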
Cross-Modal BEV Fusion

Bird's-Eye-View Projection (unified BEV grid)
Camera features are projected to BEV via learned, LSS-style depth estimation; LiDAR and radar features are mapped directly onto the same BEV grid.

Attention-Based Fusion (cross-modal attention)
Cross-attention aligns features across modalities in the shared BEV space and dynamically weights each sensor based on signal quality and overlap.

Fused Feature Map (256×256 BEV grid)
Dense 256×256 BEV feature grid encoding geometry, appearance, and velocity from all three sensor modalities.
Detection & Tracking

3D Object Detection (CenterPoint-style head)
CenterPoint-style head predicts 3D bounding boxes (x, y, z, w, l, h, yaw) for vehicles, pedestrians, and cyclists from the fused BEV features.

Multi-Object Tracking (Hungarian matching)

Trajectory Prediction (3 s horizon)

Deployment

Real-Time AV Inference (TensorRT, INT8)
Full pipeline runs in under 50 ms on NVIDIA Orin with TensorRT INT8 optimization. Asynchronous sensor ingestion uses lock-free queues.

Pipeline Steps

The perception pipeline follows a six-stage process, from raw sensor calibration through to final 3D bounding box prediction, with each stage designed for parallelism and low latency.

Sensor Calibration

Extrinsic and intrinsic calibration of camera, LiDAR, and radar via joint optimization over calibration targets. Temporal synchronization uses hardware PTP timestamps to align frames to within 2 ms.
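As a small illustration of how the calibrated extrinsics and intrinsics are consumed downstream, the sketch below projects LiDAR points into a camera image. The matrix names and the near-plane cutoff are assumptions for the example.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Project LiDAR points (N, 3) into pixel coordinates.

    T_cam_from_lidar: 4x4 extrinsic transform from the joint calibration.
    K: 3x3 camera intrinsic matrix.
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera.
    """
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])     # homogeneous coordinates
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]        # LiDAR frame -> camera frame
    in_front = pts_cam[:, 2] > 0.1                         # drop points behind the image plane
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                            # perspective divide
    return uv, in_front
```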

Feature Extraction

Camera images processed through ResNet-50 with Feature Pyramid Network. LiDAR point clouds encoded via PointPillars into pseudo-images. Radar signals processed through CFAR detection and a lightweight MLP backbone.
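The pillar step can be summarized with the minimal PyTorch sketch below: a shared per-point MLP stands in for the per-pillar PointNet, point features are max-pooled into their BEV cell, and the result is scattered into a dense pseudo-image. Grid size, channel counts, and the precomputed pillar_ij cell indices are assumptions for illustration (requires PyTorch ≥ 1.12 for index_reduce_).

```python
import torch
import torch.nn as nn

class TinyPillarEncoder(nn.Module):
    """Minimal PointPillars-style encoder: per-point MLP, max-pool per pillar,
    then scatter into a dense BEV pseudo-image. All sizes are illustrative."""

    def __init__(self, in_dim=4, feat_dim=64, grid_hw=(256, 256)):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.grid_hw = grid_hw

    def forward(self, points, pillar_ij):
        # points: (N, 4) = x, y, z, intensity; pillar_ij: (N, 2) integer BEV cell per point.
        h, w = self.grid_hw
        feats = self.mlp(points)                                # (N, C) per-point features
        flat_idx = pillar_ij[:, 0] * w + pillar_ij[:, 1]        # (N,) flattened cell index
        canvas = feats.new_zeros(h * w, feats.shape[1])
        canvas.index_reduce_(0, flat_idx, feats, "amax")        # max-pool points into their pillar
        return canvas.view(h, w, -1).permute(2, 0, 1)           # (C, H, W) pseudo-image

# Example with random points scattered over the grid.
pts = torch.randn(1000, 4)
cells = torch.randint(0, 256, (1000, 2))
print(TinyPillarEncoder()(pts, cells).shape)   # torch.Size([64, 256, 256])
```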

BEV Projection

Camera features lifted to 3D using learned depth estimation and projected onto a unified BEV grid. LiDAR pillars naturally map to BEV. Radar detections scatter into the same grid with velocity channels.
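A minimal view of the lift step, assuming the pixel-plus-depth-bin to BEV-cell lookup (bev_index) has been precomputed from calibration: a small head predicts a per-pixel depth distribution, its outer product with the image features fills a frustum, and the frustum features are sum-splatted into BEV cells. All shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class LiftSplat(nn.Module):
    """LSS-style lift-splat sketch: per-pixel depth distribution times image
    features, splatted onto a BEV grid with index_add_. Sizes are illustrative."""

    def __init__(self, feat_dim=64, num_depth_bins=48, bev_hw=(256, 256)):
        super().__init__()
        self.depth_head = nn.Conv2d(feat_dim, num_depth_bins, kernel_size=1)
        self.bev_hw = bev_hw

    def forward(self, img_feats, bev_index):
        # img_feats: (C, H, W) camera features; bev_index: (D, H, W) precomputed
        # long tensor giving the BEV cell id of every (depth bin, pixel) pair.
        depth_prob = self.depth_head(img_feats.unsqueeze(0)).softmax(dim=1)[0]  # (D, H, W)
        frustum = depth_prob.unsqueeze(1) * img_feats.unsqueeze(0)              # (D, C, H, W)
        c = frustum.shape[1]
        flat_feats = frustum.permute(0, 2, 3, 1).reshape(-1, c)                 # (D*H*W, C)
        bh, bw = self.bev_hw
        bev = flat_feats.new_zeros(bh * bw, c)
        bev.index_add_(0, bev_index.reshape(-1), flat_feats)                    # sum-splat
        return bev.view(bh, bw, c).permute(2, 0, 1)                             # (C, H_bev, W_bev)
```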

Cross-Modal Attention

A deformable cross-attention module aligns and fuses BEV features from all three modalities. Learned query positions attend to relevant spatial locations across sensor maps, resolving geometric misalignments.
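The sketch below conveys the fusion idea with standard multi-head cross-attention standing in for the deformable module used in the project, and on a deliberately tiny BEV grid: dense attention over a full 256×256 grid would be prohibitively expensive, which is exactly why the real pipeline restricts each query to a few sampled locations. Dimensions and names are assumptions, and each modality is assumed to have been projected to a common channel width upstream.

```python
import torch
import torch.nn as nn

class BEVCrossModalFusion(nn.Module):
    """Simplified cross-modal fusion: learned BEV queries attend over the
    concatenated camera/LiDAR/radar BEV tokens. Vanilla attention on a tiny
    grid is a stand-in for the project's deformable cross-attention."""

    def __init__(self, dim=64, heads=8, bev_hw=(32, 32)):
        super().__init__()
        self.bev_hw = bev_hw
        self.query = nn.Parameter(torch.randn(bev_hw[0] * bev_hw[1], dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_bev, lidar_bev, radar_bev):
        # Each input: (C, H, W). Flatten each map to tokens and concatenate.
        tokens = torch.cat(
            [m.flatten(1).transpose(0, 1) for m in (cam_bev, lidar_bev, radar_bev)],
            dim=0,
        ).unsqueeze(0)                                   # (1, 3*H*W, C)
        fused, _ = self.attn(self.query.unsqueeze(0), tokens, tokens)
        h, w = self.bev_hw
        return fused[0].transpose(0, 1).view(-1, h, w)   # (C, H, W) fused BEV map

fused = BEVCrossModalFusion()(torch.randn(64, 32, 32),
                              torch.randn(64, 32, 32),
                              torch.randn(64, 32, 32))
print(fused.shape)   # torch.Size([64, 32, 32])
```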

Temporal Self-Attention

Fused BEV features from the current frame attend to warped features from previous frames using ego-motion compensation. This temporal module stabilizes detections across frames and enables velocity estimation.
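A compact view of the ego-motion compensation that feeds the temporal attention: the previous frame's BEV map is warped into the current ego frame with an affine sampling grid. The yaw/translation inputs and the 51.2 m grid half-extent are assumptions, and sign conventions depend on how the grid is laid out.

```python
import math
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev, yaw, tx, ty, bev_range=51.2):
    """Warp the previous frame's BEV features (1, C, H, W) into the current
    ego frame given inter-frame yaw (rad) and translation (m)."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    # 2x3 affine matrix in normalized grid coordinates ([-1, 1] spans the grid).
    theta = torch.tensor([[cos_y, -sin_y, tx / bev_range],
                          [sin_y,  cos_y, ty / bev_range]],
                         dtype=prev_bev.dtype).unsqueeze(0)
    grid = F.affine_grid(theta, list(prev_bev.shape), align_corners=False)
    return F.grid_sample(prev_bev, grid, padding_mode="zeros", align_corners=False)

# The warped map is then attended to by (or concatenated with) the current
# frame's fused BEV features inside the temporal self-attention block.
warped = warp_prev_bev(torch.randn(1, 64, 256, 256), yaw=0.02, tx=0.8, ty=0.0)
```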

3D Detection

CenterPoint-style detection heads regress 3D bounding box centers, dimensions, orientation, and velocity from the temporally fused BEV map. Non-maximum suppression produces the final detection output.
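To show the shape of the detection head and its decode, here is a toy CenterPoint-style sketch: per-class center heatmaps plus per-cell box regression, with local-maximum suppression on the heatmap as a cheap stand-in for the full NMS. Class list, channel layout, and thresholds are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterHead(nn.Module):
    """Toy CenterPoint-style head: per-class center heatmaps + box regression."""

    def __init__(self, in_ch=256, num_classes=3):           # vehicle, pedestrian, cyclist
        super().__init__()
        self.heatmap = nn.Conv2d(in_ch, num_classes, 3, padding=1)
        self.box = nn.Conv2d(in_ch, 8, 3, padding=1)         # dx, dy, z, w, l, h, sin(yaw), cos(yaw)

    def forward(self, bev):
        return self.heatmap(bev).sigmoid(), self.box(bev)

def decode_peaks(heat, box, score_thresh=0.3):
    """Keep heatmap local maxima above the threshold (a cheap stand-in for NMS)
    and gather the regressed box parameters at those BEV cells."""
    pooled = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    peaks = (heat == pooled) & (heat > score_thresh)
    batch, cls, ys, xs = torch.nonzero(peaks, as_tuple=True)
    return cls, ys, xs, box[batch, :, ys, xs]                # class id, BEV cell, (N, 8) box params

heat, box = CenterHead()(torch.randn(1, 256, 128, 128))
print([t.shape for t in decode_peaks(heat, box)])
```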

Interactive Charts

Quantitative evaluation across detection range, weather conditions, inference latency, and overall system capabilities. All benchmarks measured on the nuScenes validation set and a proprietary adverse-weather test suite.

Object Recall vs Detection Range

Recall at IoU=0.5 across distance bins

Line Chart


Performance Across Weather Conditions

mAP comparison under degraded conditions

Bar Chart


Inference Latency per Frame

End-to-end latency on NVIDIA Orin (ms)

Line Chart


System Capabilities

Multi-dimensional comparison

Radar Chart


Key Outcomes

The fused system was evaluated on the nuScenes benchmark and a proprietary adverse-weather test suite. All metrics are reported relative to the strongest single-sensor baseline (LiDAR-only CenterPoint).

+17%
Object Recall

Improvement over single-sensor baselines at IoU=0.5, especially at long range (>60m) and under occlusion.

<50ms
Latency

End-to-end inference on NVIDIA Orin with INT8 quantization and TensorRT, meeting the 20 Hz real-time budget.

-20%
Runtime Overhead

Compared to running three separate detection models, the unified pipeline reduces total compute by 20%.

3
Sensor Modalities

Camera, LiDAR, and radar fused at the feature level in a shared BEV representation for complementary perception.

Design Highlights

BEV-Centric Fusion

Projecting all sensor features into a unified Bird's-Eye-View space eliminates the geometric ambiguity of perspective fusion and enables straightforward 3D reasoning. The shared BEV grid serves as the backbone for both spatial and temporal aggregation.

Temporal Self-Attention

By attending to ego-motion-warped BEV features from prior frames, the model stabilizes detections across time, reduces false positives from single-frame noise, and enables implicit velocity estimation without explicit tracking modules.

Edge Deployment Ready

INT8 quantization via TensorRT, operator fusion, and optimized memory layout bring inference below 50ms on NVIDIA Orin. The pipeline is containerized with NVIDIA Triton for production-grade serving on the vehicle compute platform.
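For context on the deployment path, the sketch below traces a stand-in model to ONNX and builds an INT8 engine with trtexec. FusionBEVDetector, the input shapes, and the file names are placeholders; the INT8 calibration cache is produced separately from representative driving data.

```python
import subprocess
import torch

class FusionBEVDetector(torch.nn.Module):
    """Stand-in for the fused BEV detector so the export call is runnable."""
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Conv2d(64 + 64 + 16, 8, kernel_size=1)

    def forward(self, cam_bev, lidar_bev, radar_bev):
        return self.head(torch.cat([cam_bev, lidar_bev, radar_bev], dim=1))

# 1) Trace the detector to ONNX with representative input shapes.
model = FusionBEVDetector().eval()
inputs = (torch.randn(1, 64, 256, 256),    # camera features in BEV
          torch.randn(1, 64, 256, 256),    # LiDAR pseudo-image
          torch.randn(1, 16, 256, 256))    # radar BEV tensor
torch.onnx.export(model, inputs, "fusion_bev.onnx", opset_version=17)

# 2) Build an INT8 TensorRT engine on the target Orin with trtexec.
subprocess.run(["trtexec", "--onnx=fusion_bev.onnx", "--int8",
                "--saveEngine=fusion_bev_int8.engine"], check=True)
```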

Technologies Used

Frameworks & Tools
PyTorch · TensorRT · NVIDIA Orin · CUDA · OpenCV · Open3D · nuScenes SDK · PointPillars · CenterPoint · Deformable DETR · ONNX · Docker · NVIDIA Triton · ROS2 · Python · C++

Business Impact and Delivery Scope

Problem Solved

Perception stacks fail under adverse conditions when camera, LiDAR, and radar are not aligned in a unified architecture.

What I Deliver

Feature-level BEV fusion pipeline, calibration-aware training, and deployment optimization for real-time multi-sensor inference.

Expected Impact

Higher object recall at range, lower miss rate in edge cases, and stable latency on embedded automotive hardware.

Hire Me for Sensor Fusion Delivery

I can support dataset strategy, fusion model implementation, and edge deployment validation for AV perception programs.

MVP Delivery

Camera + LiDAR + radar fusion baseline with reproducible metrics on your target scenarios.

Production Hardening

Latency profiling, quantization, and reliability testing under weather and occlusion conditions.

Advisory + Build

Architecture review and hands-on execution to accelerate internal AV perception teams.

Other Projects

Instruction-Tuned Multimodal LLM for Scene Understanding

Vision Transformer integrated with a decoder-only LLM for conversational VQA, referring expressions, and multimodal grounding.

Knowledge-Augmented Reasoning Engine via Fine-Tuned LLM

PEFT fine-tuning with RAG pipeline injecting knowledge graph sub-graphs and Chain-of-Thought prompting for factual reasoning.

Enhancing Math Reasoning in LLMs via Self-Supervised Fine-Tuning

Qwen 2.5-32B fine-tuned with a novel "Wait" token technique achieving 56.7% on AIME 2024.

Multimodal Emotion Recognition for Human-Robot Interaction

Multimodal system combining vision, speech, and NLP with CNNs, LSTMs, and attention for real-time emotion recognition.

Audio-Visual Fusion for Dynamic Pedestrian Awareness

Self-supervised audio-visual fusion achieving LiDAR-comparable pedestrian detection deployed on Jetson Orin Nano.