Engineered a unified perception pipeline fusing camera, LiDAR, and radar data through a cross-modal Bird's-Eye-View (BEV) architecture with attention-based alignment and temporal self-attention for consistent 3D object detection. Deployed on an autonomous vehicle testbed with INT8 quantization and TensorRT optimization, achieving sub-50ms inference on embedded GPU hardware.
Autonomous driving demands a perception system that works reliably in every condition: bright sunlight, heavy rain, fog, darkness, and cluttered urban environments. No single sensor modality is sufficient. Cameras deliver rich semantic and texture information but provide no direct depth measurement and degrade in poor lighting. LiDAR provides precise 3D point clouds but becomes sparse at distance and struggles in adverse weather such as rain or fog, which scatter its laser pulses. Radar is robust to weather and measures velocity directly, but its angular resolution is too coarse for fine-grained object classification.
Existing approaches often process each sensor in isolation and attempt late fusion at the decision level, losing complementary information early in the pipeline. This project addresses that gap by designing a unified, BEV-centric perception architecture that fuses multi-modal features at the representation level, preserving geometric and semantic synergies across sensors. The result is a system that maintains high recall and precision across the full detection range and all environmental conditions, while meeting the strict latency budget required for real-time autonomous operation.
From raw sensor streams to 3D tracked objects — hover each stage for details. The pipeline processes three modalities in parallel, fuses them in BEV space, and applies temporal reasoning for coherent tracking.
Multi-camera rig captures 360° surround view at 30 FPS. Images are undistorted and synchronized across all cameras.
Pre-trained ResNet-50 extracts multi-scale feature maps. FPN neck generates hierarchical features for dense prediction tasks.
64-beam rotating LiDAR produces dense 3D point clouds at 10 Hz. Points encode XYZ coordinates, intensity, and ring index.
PointPillars voxelizes the point cloud into vertical columns and applies per-pillar PointNet to produce a pseudo-image BEV representation.
77 GHz FMCW radar provides range-Doppler maps with velocity data. Robust in all weather conditions including fog and rain.
CFAR detection on range-Doppler maps extracts targets. Velocity and angle features are encoded into a dense tensor for fusion.
Camera features are projected to BEV via learned depth estimation. LiDAR and radar features are directly mapped to the same BEV grid.
Cross-attention aligns features across modalities in the shared BEV space. Dynamically weights each sensor based on signal quality and overlap.
Dense 256×256 BEV feature grid encoding geometry, appearance, and velocity from all three sensor modalities.
CenterPoint-style head predicts 3D bounding boxes (x,y,z,w,l,h,yaw) for vehicles, pedestrians, and cyclists from the fused BEV features.
Full pipeline runs at <50ms on NVIDIA Orin with TensorRT INT8 optimization. Asynchronous sensor ingestion with lock-free queues.
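The stage cards above describe the data flow at a glance. Below is a minimal wiring sketch of how those pieces compose; every submodule and parameter name is a placeholder for illustration, not the project's actual API.

```python
import torch.nn as nn

class BEVFusionPipeline(nn.Module):
    """High-level wiring of the stages above. Each submodule is a placeholder
    standing in for the components detailed in the methodology section below."""
    def __init__(self, cam_encoder, lidar_encoder, radar_encoder,
                 fusion, temporal, det_head):
        super().__init__()
        self.cam_encoder = cam_encoder      # ResNet-50 + FPN, lifted to BEV
        self.lidar_encoder = lidar_encoder  # PointPillars pseudo-image
        self.radar_encoder = radar_encoder  # CFAR targets -> dense BEV tensor
        self.fusion = fusion                # cross-attention over the shared BEV grid
        self.temporal = temporal            # attention over ego-motion-warped history
        self.det_head = det_head            # CenterPoint-style boxes + velocity

    def forward(self, images, points, radar, prev_bev=None):
        cam_bev = self.cam_encoder(images)
        lidar_bev = self.lidar_encoder(points)
        radar_bev = self.radar_encoder(radar)
        bev = self.fusion(cam_bev, lidar_bev, radar_bev)   # fused 256x256 grid
        if prev_bev is not None:
            bev = self.temporal(bev, prev_bev)             # temporal smoothing
        return self.det_head(bev), bev                     # detections + state for next frame
```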
The perception pipeline follows a six-stage process, from raw sensor calibration through to final 3D bounding box prediction, with each stage designed for parallelism and low latency.
Extrinsic and intrinsic calibration of camera, LiDAR, and radar using a joint optimization over calibration targets. Temporal synchronization via hardware PTP timestamps to align frames within 2ms.
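As a concrete illustration of how the calibration is consumed downstream, the sketch below projects LiDAR points into a camera image using a 4×4 extrinsic and a 3×3 intrinsic matrix. The function and variable names are illustrative, not the project's actual interfaces.

```python
import numpy as np

def project_lidar_to_image(points_xyz: np.ndarray,
                           T_lidar_to_cam: np.ndarray,
                           K: np.ndarray) -> np.ndarray:
    """points_xyz: (N, 3) LiDAR points -> (M, 2) pixel coordinates.
    T_lidar_to_cam (4x4) and K (3x3) come from the joint calibration step."""
    # Homogeneous coordinates, then the rigid-body extrinsic transform.
    pts_h = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])
    pts_cam = (T_lidar_to_cam @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera before perspective division.
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]
```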
Camera images processed through ResNet-50 with Feature Pyramid Network. LiDAR point clouds encoded via PointPillars into pseudo-images. Radar signals processed through CFAR detection and a lightweight MLP backbone.
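The LiDAR branch's key trick is turning sparse pillars back into a dense, image-like tensor that a 2D CNN can consume. A minimal sketch of that scatter step follows; shapes and the grid size are assumptions for illustration.

```python
import torch

def scatter_pillars_to_bev(pillar_features: torch.Tensor,  # (P, C) per-pillar PointNet output
                           pillar_coords: torch.Tensor,    # (P, 2) long tensor of (row, col) BEV cells
                           grid_hw: tuple = (256, 256)) -> torch.Tensor:
    """Place per-pillar feature vectors onto a dense BEV canvas (pseudo-image)."""
    C = pillar_features.shape[1]
    H, W = grid_hw
    canvas = pillar_features.new_zeros(C, H * W)
    flat_idx = pillar_coords[:, 0] * W + pillar_coords[:, 1]
    canvas[:, flat_idx] = pillar_features.t()
    return canvas.view(C, H, W)  # consumed by the 2D BEV backbone
```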
Camera features lifted to 3D using learned depth estimation and projected onto a unified BEV grid. LiDAR pillars naturally map to BEV. Radar detections scatter into the same grid with velocity channels.
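The camera lift follows the familiar depth-distribution ("lift-splat") pattern. The sketch below shows only the lift: a per-pixel depth distribution is predicted and multiplied with context features to form a frustum of 3D features, which is then splatted onto the BEV grid using camera geometry (omitted here). Channel and bin counts are illustrative.

```python
import torch
import torch.nn as nn

class DepthLift(nn.Module):
    """Predict a per-pixel depth distribution and spread image features
    across depth bins (lift-splat style). Shapes are illustrative."""
    def __init__(self, in_ch: int = 256, out_ch: int = 80, n_depth_bins: int = 64):
        super().__init__()
        # A single 1x1 conv predicts depth logits and context features jointly.
        self.head = nn.Conv2d(in_ch, n_depth_bins + out_ch, kernel_size=1)
        self.n_depth_bins = n_depth_bins

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, C, H, W)
        x = self.head(feats)
        depth = x[:, :self.n_depth_bins].softmax(dim=1)       # (B, D, H, W)
        context = x[:, self.n_depth_bins:]                     # (B, C', H, W)
        # Outer product: each pixel's feature is distributed over its depth bins.
        return depth.unsqueeze(1) * context.unsqueeze(2)       # (B, C', D, H, W) frustum
```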
A deformable cross-attention module aligns and fuses BEV features from all three modalities. Learned query positions attend to relevant spatial locations across sensor maps, resolving geometric misalignments.
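To make the fusion step concrete, the sketch below uses standard dense multi-head attention with learned BEV queries over the concatenated camera, LiDAR, and radar feature maps. Dense attention at the full 256×256 resolution is far too expensive, which is exactly why the actual module uses deformable cross-attention (sparse sampled offsets); this stand-in only illustrates the interface, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BEVCrossAttentionFusion(nn.Module):
    """Learned BEV queries attend over the three modality BEV maps.
    A dense stand-in for the deformable cross-attention module."""
    def __init__(self, dim: int = 256, heads: int = 8, grid: int = 256):
        super().__init__()
        # `grid` must equal the side length of the input BEV maps.
        self.queries = nn.Parameter(torch.randn(grid * grid, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_bev, lidar_bev, radar_bev):  # each (B, dim, H, W)
        B, C, H, W = lidar_bev.shape
        # Flatten each modality's BEV map into tokens and concatenate them.
        kv = torch.cat([m.flatten(2).transpose(1, 2)
                        for m in (cam_bev, lidar_bev, radar_bev)], dim=1)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        fused, _ = self.attn(q, kv, kv)                     # (B, H*W, dim)
        return fused.transpose(1, 2).reshape(B, C, H, W)    # fused BEV grid
```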
Fused BEV features from the current frame attend to warped features from previous frames using ego-motion compensation. This temporal module stabilizes detections across frames and enables velocity estimation.
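Ego-motion compensation amounts to resampling the previous frame's BEV grid in the current ego frame before attention. A minimal sketch, assuming the ego motion has already been converted to a 2×3 affine transform in normalized BEV coordinates:

```python
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev: torch.Tensor, ego_to_prev: torch.Tensor) -> torch.Tensor:
    """Warp the previous frame's BEV features into the current ego frame so
    temporal attention compares spatially aligned cells.
    prev_bev: (B, C, H, W); ego_to_prev: (B, 2, 3) affine transform."""
    grid = F.affine_grid(ego_to_prev, size=prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)
```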
CenterPoint-style detection heads regress 3D bounding box centers, dimensions, orientation, and velocity from the temporally fused BEV map. Non-maximum suppression produces the final detection output.
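The suppression step in a CenterPoint-style head is effectively peak picking on the class heatmap: a 3×3 max-pool suppresses non-maxima, and the top-k surviving cells become box centers whose regression channels (z, size, yaw, velocity) are read out at the same locations. A minimal sketch of the peak extraction, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def decode_center_heatmap(heatmap: torch.Tensor, k: int = 100):
    """heatmap: (B, num_classes, H, W) sigmoid scores -> top-k peak locations."""
    B, C, H, W = heatmap.shape
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap)            # keep local maxima only
    scores, idx = peaks.view(B, -1).topk(k)          # flatten classes x grid cells
    classes = torch.div(idx, H * W, rounding_mode="floor")
    spatial = idx % (H * W)
    ys = torch.div(spatial, W, rounding_mode="floor")
    xs = spatial % W
    return scores, classes, ys, xs  # grid cells map to metric x, y via the BEV resolution
```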
Quantitative evaluation across detection range, weather conditions, inference latency, and overall system capabilities. All benchmarks measured on the nuScenes validation set and a proprietary adverse-weather test suite.
Recall at IoU=0.5 across distance bins
mAP comparison under degraded conditions
End-to-end latency on NVIDIA Orin (ms)
Multi-dimensional comparison
The fused system was evaluated on the nuScenes benchmark and a proprietary adverse-weather test suite. All metrics are reported relative to the strongest single-sensor baseline (LiDAR-only CenterPoint).
Projecting all sensor features into a unified Bird's-Eye-View space eliminates the geometric ambiguity of perspective fusion and enables straightforward 3D reasoning. The shared BEV grid serves as the backbone for both spatial and temporal aggregation.
By attending to ego-motion-warped BEV features from prior frames, the model stabilizes detections across time, reduces false positives from single-frame noise, and enables implicit velocity estimation without explicit tracking modules.
INT8 quantization via TensorRT, operator fusion, and optimized memory layout bring inference below 50ms on NVIDIA Orin. The pipeline is containerized with NVIDIA Triton for production-grade serving on the vehicle compute platform.
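For context on the deployment path, the sketch below shows one standard way to get from a PyTorch model to an INT8 TensorRT engine: export to ONNX, then build the engine with trtexec. The model and input shapes here are placeholders; a real INT8 build supplies calibration data, and serving is handled by the Triton configuration rather than shown here.

```python
import torch
import torch.nn as nn

# Stand-in for the fused perception network, for illustration only.
model = nn.Sequential(nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 10, 1))
model.eval()
dummy_input = torch.randn(1, 256, 256, 256)   # placeholder BEV feature map

# Export to ONNX so TensorRT can consume the graph.
torch.onnx.export(
    model, (dummy_input,), "bev_fusion.onnx",
    input_names=["bev_features"], output_names=["detections"],
    opset_version=17,
)

# Typical engine build on the Orin target (standard trtexec flags):
#   trtexec --onnx=bev_fusion.onnx --int8 --fp16 --saveEngine=bev_fusion.plan
```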
Perception stacks fail under adverse conditions when camera, LiDAR, and radar are not aligned in a unified architecture.
Feature-level BEV fusion pipeline, calibration-aware training, and deployment optimization for real-time multi-sensor inference.
Higher object recall at range, lower miss rate in edge cases, and stable latency on embedded automotive hardware.
I can support dataset strategy, fusion model implementation, and edge deployment validation for AV perception programs.
Camera + LiDAR + radar fusion baseline with reproducible metrics on your target scenarios.
Latency profiling, quantization, and reliability testing under weather and occlusion conditions.
Architecture review and hands-on execution to accelerate internal AV perception teams.