Computer Vision Autonomous Driving Adverse Weather Multi-Sensor Fusion

Weather-Resilient Multi-Sensor Perception for Autonomous Driving

Built a weather-resilient perception pipeline fusing camera, LiDAR, radar, and map priors through a BEV-centric architecture with condition-aware gating. The system adapts fusion weights under rain, fog, and nighttime scenarios and maintains stable 3D detection with low-latency edge deployment.

+24% Object Recall
<55ms Latency
31% Less Overhead
4 Modalities

Why Weather Resilience Matters

Autonomous driving demands a perception system that works reliably in every condition — bright sunlight, heavy rain, fog, darkness, and cluttered urban environments. No single sensor modality is sufficient. Cameras deliver rich semantic and texture information but estimate depth poorly at range and degrade in low light. LiDAR provides precise 3D point clouds but becomes sparse at distance and struggles in rain or fog, which scatter its laser pulses. Radar is robust to weather and measures velocity directly, but its angular resolution is too coarse for fine-grained object classification.

Existing approaches often process each sensor in isolation and attempt late fusion at the decision level, losing complementary information early in the pipeline. This project addresses that gap by designing a unified, BEV-centric perception architecture that fuses multi-modal features at the representation level, preserving geometric and semantic synergies across sensors. The result is a system that maintains high recall and precision across the full detection range and all environmental conditions, while meeting the strict latency budget required for real-time autonomous operation.

Complete Perception Pipeline

From raw sensor streams to 3D tracked objects: the pipeline processes three modalities in parallel, fuses them in BEV space, and applies temporal reasoning for coherent tracking.

Sensor Inputs → Feature Extraction → Weather-Adaptive Fusion → Detection → Edge Deployment
Camera Stream
📷
RGB Frames 1920×1080
Camera Input

Multi-camera rig captures 360° surround view at 30 FPS. Images are undistorted and synchronized across all cameras.

30 FPS · Surround
🧠
CNN Backbone ResNet-50
Visual Feature Extractor

Pre-trained ResNet-50 extracts multi-scale feature maps. FPN neck generates hierarchical features for dense prediction tasks.

ResNet-50 · FPN
LiDAR Stream
☁️
Point Cloud 300K pts/frame
LiDAR Input

64-beam rotating LiDAR produces dense 3D point clouds at 10 Hz. Points encode XYZ coordinates, intensity, and ring index.

64-beam · 10 Hz
🔬
PointPillars Voxelize
3D Feature Encoder

PointPillars voxelizes the point cloud into vertical columns and applies per-pillar PointNet to produce a pseudo-image BEV representation.

PointPillars · BEV
Radar Stream
📡
Radar Signals 77 GHz FMCW
Radar Input

77 GHz FMCW radar provides range-Doppler maps with velocity data. Robust in all weather conditions including fog and rain.

77 GHz · FMCW
Signal Proc Range-Doppler
Radar Processing

CFAR detection on range-Doppler maps extracts targets. Velocity and angle features are encoded into a dense tensor for fusion.

CFAR · Doppler
Cross-Modal BEV Fusion
🔄
BEV Projection Unified Grid
Bird’s Eye View Projection

Camera features are projected to BEV via learned depth estimation. LiDAR and radar features are directly mapped to the same BEV grid.

LSS · BEV Grid
⚖️
Attention Fusion Cross-Modal
Attention-Based Fusion

Cross-attention aligns features across modalities in the shared BEV space. Dynamically weights each sensor based on signal quality and overlap.

Cross-Attn · Dynamic
📊
Fused BEV 256×256
Fused Feature Map

Dense 256×256 BEV feature grid encoding geometry, appearance, and velocity from all three sensor modalities.

Detection & Tracking
🚗
3D Detection CenterPoint
3D Object Detection

CenterPoint-style head predicts 3D bounding boxes (x,y,z,w,l,h,yaw) for vehicles, pedestrians, and cyclists from the fused BEV features.

CenterPoint · 3D BBox
🚶
Multi-Object Tracking Hungarian
📍
Trajectory Prediction 3s Horizon
Deployment
TensorRT INT8 Quant
🚘
AV Runtime <50ms
Real-Time AV Inference

Full pipeline runs at <50ms on NVIDIA Orin with TensorRT INT8 optimization. Asynchronous sensor ingestion with lock-free queues.

Orin · <50ms

Pipeline Steps

The perception pipeline follows a six-stage process, from raw sensor calibration through to final 3D bounding box prediction, with each stage designed for parallelism and low latency.

Sensor Calibration

Extrinsic and intrinsic calibration of camera, LiDAR, and radar using a joint optimization over calibration targets. Temporal synchronization via hardware PTP timestamps to align frames within 2ms.
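As a toy illustration of the synchronization step, the sketch below pairs each LiDAR sweep with the nearest camera frame by timestamp and drops pairs outside the 2 ms tolerance. The function name and sample timestamps are hypothetical; a production system would work with raw PTP nanosecond stamps per sensor.

```python
# Hypothetical sketch: pairing sensor frames by nearest PTP timestamp.
# Timestamps are in seconds; pairs whose gap exceeds the tolerance are dropped.

def sync_frames(cam_ts, lidar_ts, tol=0.002):
    """Match each LiDAR timestamp to the nearest camera timestamp within tol."""
    pairs = []
    for lt in lidar_ts:
        ct = min(cam_ts, key=lambda t: abs(t - lt))  # nearest camera frame
        if abs(ct - lt) <= tol:                      # enforce 2 ms budget
            pairs.append((ct, lt))
    return pairs

cam = [0.0, 0.033, 0.066, 0.1]       # 30 FPS camera stamps
lidar = [0.001, 0.050, 0.099]        # 10 Hz LiDAR sweep stamps
print(sync_frames(cam, lidar))       # the 0.050 sweep has no frame within 2 ms
```

The middle LiDAR sweep is rejected because its closest camera frame is 16 ms away, which is exactly the failure mode hardware-triggered capture avoids.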

Feature Extraction

Camera images processed through ResNet-50 with Feature Pyramid Network. LiDAR point clouds encoded via PointPillars into pseudo-images. Radar signals processed through CFAR detection and a lightweight MLP backbone.
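The LiDAR branch can be sketched in a few lines of numpy. This toy version of PointPillars-style voxelization bins points into a BEV grid of vertical pillars and max-pools a single intensity channel per pillar; the real encoder applies a learned per-pillar PointNet rather than a hand-written max, and the 4×4 m grid here is an assumption for illustration.

```python
import numpy as np

# Toy PointPillars-style voxelization: points (N, 4) = x, y, z, intensity
# are binned into vertical pillars on a BEV grid, then max-pooled per pillar
# into a single-channel pseudo-image (a learned PointNet replaces this max
# in the actual encoder).

def pillarize(points, x_range=(0, 4), y_range=(0, 4), pillar=1.0):
    nx = int((x_range[1] - x_range[0]) / pillar)
    ny = int((y_range[1] - y_range[0]) / pillar)
    bev = np.zeros((ny, nx), dtype=np.float32)   # one channel: max intensity
    ix = ((points[:, 0] - x_range[0]) / pillar).astype(int).clip(0, nx - 1)
    iy = ((points[:, 1] - y_range[0]) / pillar).astype(int).clip(0, ny - 1)
    np.maximum.at(bev, (iy, ix), points[:, 3])   # per-pillar max pooling
    return bev

pts = np.array([[0.5, 0.5, 0.0, 0.9],   # two points share pillar (0, 0)
                [0.6, 0.4, 1.0, 0.2],
                [3.5, 3.5, 0.5, 0.7]])
bev = pillarize(pts)
print(bev[0, 0], bev[3, 3])  # per-pillar max intensities
```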

BEV Projection

Camera features lifted to 3D using learned depth estimation and projected onto a unified BEV grid. LiDAR pillars naturally map to BEV. Radar detections scatter into the same grid with velocity channels.
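The camera lift can be illustrated with the core Lift-Splat idea in toy dimensions: each pixel predicts a categorical distribution over depth bins, the pixel's feature vector is weighted by that distribution to form frustum features, and the result is splatted along the pixel's ray into BEV cells. Dimensions and the one-pixel "splat" below are assumptions for clarity.

```python
import numpy as np

# Lift-Splat-style camera lifting, toy sizes: one pixel, D depth bins,
# C feature channels. The depth head's softmax output spreads the pixel's
# feature mass along its viewing ray in the BEV grid.

D, C = 4, 2                                   # depth bins, feature channels
feat = np.ones((C,))                          # per-pixel feature vector
depth_prob = np.array([0.1, 0.6, 0.2, 0.1])   # softmax over depth bins

lifted = depth_prob[:, None] * feat[None, :]  # (D, C) frustum features
bev = np.zeros((D, C))                        # this ray maps to D BEV cells
bev += lifted                                 # "splat" (toy: single ray)
print(bev[1])                                 # most mass at the 0.6 bin
```

Because the depth distribution is learned end-to-end, uncertain pixels spread their features across many cells instead of committing to a single wrong depth.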

Cross-Modal Attention

A deformable cross-attention module aligns and fuses BEV features from all three modalities. Learned query positions attend to relevant spatial locations across sensor maps, resolving geometric misalignments.
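A minimal sketch of the fusion mechanism, with assumed toy sizes: BEV query vectors attend over per-modality tokens via scaled dot-product attention, so each cell's fused feature is a learned, quality-dependent mix of the sensors. Note this is plain cross-attention, not the deformable variant with learned sampling offsets used in the actual module.

```python
import numpy as np

# Plain scaled dot-product cross-attention over modality tokens
# (the real module is deformable, with learned sampling locations).

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 8))          # 16 BEV cell queries, dim 8
K = rng.normal(size=(3, 8))           # one token per modality (toy)
V = rng.normal(size=(3, 8))           # camera / LiDAR / radar features

attn = softmax(Q @ K.T / np.sqrt(8))  # (16, 3): per-cell modality weights
fused = attn @ V                      # (16, 8): fused BEV features
print(fused.shape, attn.sum(axis=1)[0])
```

The attention weights are exactly where condition-aware gating acts: in fog, the rows of `attn` shift mass away from the camera token toward radar.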

Temporal Self-Attention

Fused BEV features from the current frame attend to warped features from previous frames using ego-motion compensation. This temporal module stabilizes detections across frames and enables velocity estimation.
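The ego-motion compensation step can be sketched as re-sampling the previous BEV map into the current ego frame before the two frames attend to each other. The version below handles only integer-cell translation with nearest-neighbor lookup; a real pipeline applies the full SE(2) ego transform with bilinear grid sampling.

```python
import numpy as np

# Toy ego-motion warp: shift the previous BEV map by the ego translation
# expressed in grid cells (nearest-neighbor; bilinear sampling and rotation
# are omitted for clarity).

def warp_bev(prev, dx_cells, dy_cells):
    """Resample prev so that static objects align with the current frame."""
    warped = np.zeros_like(prev)
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y, src_x = ys - dy_cells, xs - dx_cells
    ok = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    warped[ys[ok], xs[ok]] = prev[src_y[ok], src_x[ok]]
    return warped

prev = np.zeros((4, 4)); prev[1, 1] = 1.0     # object seen last frame
cur = warp_bev(prev, dx_cells=1, dy_cells=0)  # ego moved one cell
print(np.argwhere(cur == 1.0))                # object lands at (1, 2)
```

After warping, a static object occupies the same grid cell in both frames, so temporal attention reinforces it while single-frame noise averages out.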

3D Detection

CenterPoint-style detection heads regress 3D bounding box centers, dimensions, orientation, and velocity from the temporally fused BEV map. Non-maximum suppression produces the final detection output.
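The final suppression step can be sketched with axis-aligned BEV boxes; the real head suppresses rotated boxes (yaw-aware IoU), which is omitted here for brevity. Boxes and scores below are made up for illustration.

```python
# Greedy NMS over axis-aligned BEV boxes (x1, y1, x2, y2); the deployed
# head uses rotated-box IoU, but the greedy loop is identical.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a; bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thr=0.5):
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:  # accept a box only if it overlaps no kept box
        if all(iou(boxes[i], boxes[j]) < thr for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 2, 2), (0.1, 0.1, 2.1, 2.1), (5, 5, 7, 7)]
scores = [0.9, 0.8, 0.95]
print(nms(boxes, scores))  # [2, 0] — the 0.8 duplicate is suppressed
```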

Evaluation Charts

Quantitative evaluation across detection range, weather conditions, inference latency, and overall system capabilities. All benchmarks measured on the nuScenes validation set and a proprietary adverse-weather test suite.

Recall vs Distance Across Conditions

Recall at IoU=0.5 under clear and adverse weather

Line Chart


Weather Robustness by Sensor Stack

mAP comparison across clear, rain, fog, and night

Bar Chart


Inference Latency Under Sensor Load

End-to-end latency on NVIDIA Orin across traffic density

Line Chart


Autonomy Stack Capabilities

Multi-dimensional comparison vs camera-only baseline

Radar Chart


Key Outcomes

The fused system was evaluated on the nuScenes benchmark and a proprietary adverse-weather test suite. All metrics are reported relative to the strongest single-sensor baseline (LiDAR-only CenterPoint).

+24%
Object Recall

Improvement over single-sensor baselines at IoU=0.5, especially at long range (>60m) and under occlusion.

<55ms
Latency

End-to-end inference on NVIDIA Orin with INT8 quantization and TensorRT, within the real-time budget required for closed-loop autonomous operation.

-31%
Runtime Overhead

Compared to running three separate detection models, the unified pipeline reduces total compute by 31%.

4
Sensor Modalities

Camera, LiDAR, radar, and map priors fused at the feature level in a shared BEV representation for complementary perception.

Design Highlights

BEV-Centric Fusion

Projecting all sensor features into a unified Bird's-Eye-View space eliminates the geometric ambiguity of perspective fusion and enables straightforward 3D reasoning. The shared BEV grid serves as the backbone for both spatial and temporal aggregation.

Temporal Self-Attention

By attending to ego-motion-warped BEV features from prior frames, the model stabilizes detections across time, reduces false positives from single-frame noise, and enables implicit velocity estimation without explicit tracking modules.

Edge Deployment Ready

INT8 quantization via TensorRT, operator fusion, and optimized memory layout bring inference below 50ms on NVIDIA Orin. The pipeline is containerized with NVIDIA Triton for production-grade serving on the vehicle compute platform.
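The core of the INT8 step can be illustrated by hand. The sketch below performs symmetric per-tensor weight quantization — conceptually what a TensorRT calibrator computes per layer — though the deployed pipeline uses entropy calibration over activation histograms rather than a simple max-abs scale.

```python
import numpy as np

# Toy symmetric per-tensor INT8 quantization: pick a scale from the
# max-abs weight, round to int8, and check the round-trip error.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                       # symmetric range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([-0.5, 0.0, 0.25, 1.27])
q, scale = quantize_int8(w)
print(q, scale)                          # q = [-50, 0, 25, 127], scale = 0.01
dequant = q.astype(np.float32) * scale
print(np.abs(dequant - w).max())         # small quantization error
```

Halving weight and activation precision is what makes the memory-bound BEV layers fit the Orin latency budget; the round-trip error above is the accuracy cost the calibration set is used to bound.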

Technologies Used

Frameworks & Tools
PyTorch · TensorRT · NVIDIA Orin · CUDA · OpenCV · Open3D · nuScenes SDK · PointPillars · CenterPoint · Deformable DETR · ONNX · Docker · NVIDIA Triton · ROS2 · Python · C++

Business Impact and Delivery Scope

Problem Solved

Autonomy perception stacks often regress in rain, fog, and low-light conditions where safety margins matter most.

What I Deliver

Weather-aware fusion architecture with condition-adaptive weighting and robust performance validation.

Expected Impact

More stable detection quality across degraded environments without unacceptable latency overhead.

Hire Me for Weather-Robust Perception

I can help teams improve perception reliability under adverse environmental conditions for safety-critical systems.

MVP Delivery

Condition-aware benchmark setup and baseline model tuned for your critical scenarios.

Production Hardening

Domain adaptation, calibration checks, and failure-case mitigation under weather shift.

Advisory + Build

Architecture and evaluation strategy support for reliability-focused AV teams.

Other Projects

Instruction-Tuned Multimodal LLM for Scene Understanding

Vision Transformer integrated with a decoder-only LLM for conversational VQA, referring expressions, and multimodal grounding.

Knowledge-Augmented Reasoning Engine via Fine-Tuned LLM

PEFT fine-tuning with RAG pipeline injecting knowledge graph sub-graphs and Chain-of-Thought prompting for factual reasoning.

Enhancing Math Reasoning in LLMs via Self-Supervised Fine-Tuning

Qwen 2.5-32B fine-tuned with a novel "Wait" token technique achieving 56.7% on AIME 2024.

Multimodal Emotion Recognition for Human-Robot Interaction

Multimodal system combining vision, speech, and NLP with CNNs, LSTMs, and attention for real-time emotion recognition.

Audio-Visual Fusion for Dynamic Pedestrian Awareness

Self-supervised audio-visual fusion achieving LiDAR-comparable pedestrian detection deployed on Jetson Orin Nano.