In this comprehensive 3D Computer Vision guide, you will learn how to extract and leverage three-dimensional information from images and video. We cover depth estimation methods, point cloud processing pipelines, and simultaneous localization and mapping (SLAM) techniques that enable robots, autonomous vehicles, and AR/VR systems to perceive and navigate complex environments. By the end of this guide, you will understand:
- The difference between 2D, 2.5D, and 3D data
- Classical and deep-learning approaches to depth estimation
- Point cloud acquisition, filtering, registration, and segmentation
- 3D reconstruction techniques: multi-view stereo and volumetric fusion
- SLAM fundamentals and state-of-the-art visual and lidar-based systems
- Public datasets and evaluation metrics for depth and point clouds
- Deployment strategies for real-time 3D vision on edge devices
- Detailed case studies in autonomous driving and indoor mapping
- FAQs covering key concepts
- A glossary and quick-reference cheat-sheet
Whether you are building a depth-aware robot, an AR application, or a self-driving car, this guide equips you with the theory, practical workflows, and tools to master 3D computer vision.
1. From 2D to 2.5D to 3D Data
1.1 2D Images
Standard images capture intensity or color at each pixel in a width × height grid; they contain no explicit depth information.
1.2 2.5D Depth Maps
Depth maps assign a distance value to each pixel and are typically obtained from stereo cameras or RGB-D depth sensors. Because they encode the scene from a single viewpoint, they represent 2.5D rather than full 3D structure.
1.3 3D Point Clouds
A point cloud is an unordered set of (x,y,z) coordinates representing surface geometry in three-dimensional space. Point clouds can include additional attributes such as color or intensity.
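To make the relationship between 2.5D and 3D concrete, the short sketch below back-projects a depth map into a point cloud with the pinhole camera model. The intrinsics fx, fy, cx, cy are illustrative placeholder values (roughly those of a PrimeSense-style RGB-D camera); substitute your own calibration.
```python
import numpy as np

def depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with valid depth
```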
2. Depth Estimation

Depth estimation recovers per-pixel distance from a camera.
2.1 Stereo Depth Estimation
- Stereo Matching compares left/right image pairs to compute disparity, then depth = baseline × focal_length / disparity.
- Block Matching and Semi-Global Matching are classical algorithms.
- Deep Networks such as PSMNet and GANet learn end-to-end matching for improved accuracy.
```python
# Example: computing depth from a stereo pair with OpenCV's SGBM matcher
import cv2
import numpy as np

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16
stereo = cv2.StereoSGBM_create(minDisparity=0,
                               numDisparities=128, blockSize=5)

# compute() returns fixed-point disparities scaled by 16
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# focal_length (pixels) and baseline (meters) come from stereo calibration
focal_length, baseline = 721.5, 0.54  # example calibration values
# Pixels with non-positive disparity are invalid; their depths are not meaningful
depth = (focal_length * baseline) / np.maximum(disparity, 1e-6)
```
2.2 Monocular Depth Estimation
- Structure from Motion (SfM) reconstructs sparse 3D points by matching features across multiple views and solving for camera poses.
- Deep Monocular Models such as MiDaS and DepthNet use CNNs trained on diverse datasets to predict depth from a single image, as sketched below.
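As one hedged example, MiDaS can be run through torch.hub as in the sketch below; this assumes PyTorch and timm are installed and that the model weights can be downloaded on first use. Note that MiDaS predicts relative (inverse) depth, not metric distance.
```python
import cv2
import torch

# Load a small MiDaS model and its matching input transform via torch.hub
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("image.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the prediction back to the input resolution
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()
```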
3. Point Cloud Processing
Once you have a point cloud from LiDAR, stereo, or RGB-D sensors, you must preprocess and analyze it.
3.1 Acquisition and Formats
- LiDAR sensors (Velodyne, Ouster) capture millions of points per second.
- RGB-D Cameras (RealSense, Kinect) provide aligned color and depth.
- File Formats: PLY and LAS/LAZ for stored point clouds; PCAP for raw lidar packet streams.
3.2 Filtering and Denoising
- Statistical Outlier Removal removes isolated points
- Voxel Grid Downsampling reduces density via 3D grid averaging
- Radius Outlier Removal deletes points with too few neighbors
```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.ply")
# Downsample to a 5 cm voxel grid
pcd = pcd.voxel_down_sample(voxel_size=0.05)
# Remove points that are statistical outliers relative to their 20 nearest neighbors
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
```
3.3 Registration (Alignment)
- ICP (Iterative Closest Point) aligns a source cloud to a target cloud by iteratively minimizing distances between corresponding points (see the sketch below).
- Global Registration uses feature matching (e.g., FPFH features) to handle large initial misalignment before ICP refinement.
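A minimal point-to-point ICP refinement with Open3D might look like the sketch below. It assumes two scans, source.ply and target.ply, that are already roughly aligned (otherwise run a global FPFH-based registration first) and a recent Open3D with the pipelines.registration module.
```python
import numpy as np
import open3d as o3d

source = o3d.io.read_point_cloud("source.ply")
target = o3d.io.read_point_cloud("target.ply")

threshold = 0.05  # maximum correspondence distance, in the clouds' units
result = o3d.pipelines.registration.registration_icp(
    source, target, threshold, np.eye(4),
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

print("fitness:", result.fitness, "inlier RMSE:", result.inlier_rmse)
source.transform(result.transformation)  # apply the estimated alignment
```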
3.4 Segmentation and Clustering
- Plane Segmentation via RANSAC to extract dominant planes (floors, walls).
- Euclidean Clustering groups nearby points into object clusters (a segmentation and clustering sketch follows this list).
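The sketch below extracts the dominant plane with RANSAC and then clusters the remaining points. Open3D exposes DBSCAN rather than a dedicated Euclidean-clustering call, so DBSCAN stands in for that step here; the thresholds are illustrative.
```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.ply")

# Fit the dominant plane (e.g. the floor) with RANSAC
plane_model, inliers = pcd.segment_plane(distance_threshold=0.02,
                                         ransac_n=3, num_iterations=1000)
floor = pcd.select_by_index(inliers)
objects = pcd.select_by_index(inliers, invert=True)

# Group the remaining points into candidate object clusters
labels = np.array(objects.cluster_dbscan(eps=0.1, min_points=20))
print("detected clusters:", labels.max() + 1)  # label -1 marks noise points
```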

4. 3D Reconstruction
Turning multiple depth views into a coherent 3D model.
4.1 Multi-View Stereo
- Photogrammetry algorithms reconstruct meshes from overlapping images
- PatchMatch Stereo accelerates dense correspondence search
4.2 Volumetric Fusion
- TSDF (Truncated Signed Distance Function) integrates depth frames into a volumetric grid (KinectFusion).
- Poisson Surface Reconstruction builds watertight meshes from oriented point sets, while Marching Cubes extracts meshes directly from the TSDF grid (a fusion sketch follows).
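A hedged TSDF-fusion sketch with Open3D follows. The frame list, camera poses, and the PrimeSense default intrinsics are placeholders; in practice the poses come from a SLAM or SfM pipeline, and the extrinsic passed to integrate() is the inverse of the camera-to-world pose.
```python
import numpy as np
import open3d as o3d

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01, sdf_trunc=0.04,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)

frames = [("color_000.png", "depth_000.png", np.eye(4))]  # placeholder frames
for color_path, depth_path, pose in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.io.read_image(color_path), o3d.io.read_image(depth_path),
        convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))  # world-to-camera

mesh = volume.extract_triangle_mesh()
```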
5. SLAM Fundamentals
SLAM simultaneously builds a map of an environment and localizes the sensor.
5.1 Visual SLAM
- ORB-SLAM2 uses ORB features for tracking and mapping with loop closure.
- LSD-SLAM performs direct semi-dense tracking without features.
5.2 Visual-Inertial SLAM
- Fuses IMU data with camera frames for robust tracking in low-texture regions.
5.3 LiDAR SLAM
- Cartographer and LOAM use lidar scans for 3D mapping and pose estimation.

6. Datasets and Evaluation Metrics
6.1 Depth Datasets
- KITTI Depth: Stereo and lidar ground truth for driving
- NYU Depth v2: Indoor RGB-D scenes
6.2 Point Cloud Datasets
- SemanticKITTI: Labeled lidar sequences for semantic segmentation
- ScanNet: 3D indoor scans with annotations
6.3 SLAM Benchmarks
- TUM RGB-D: Ground truth trajectories and maps
- KITTI Odometry: LiDAR and vision odometry sequences
6.4 Evaluation Metrics
- Depth: AbsRel, RMSE, δ < 1.25
- Point Clouds: Chamfer distance, F-Score (depth and Chamfer metrics are sketched after this list)
- SLAM: Absolute Trajectory Error (ATE), Relative Pose Error (RPE)
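For reference, the depth metrics and a naive Chamfer distance can be computed as in the sketch below; pred and gt are metric depth maps, mask selects pixels with valid ground truth, and the Chamfer version is the brute-force O(N·M) form suitable only for small clouds.
```python
import numpy as np

def depth_metrics(pred, gt, mask):
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)  # threshold accuracy
    return abs_rel, rmse, delta1

def chamfer_distance(a, b):
    # Symmetric mean nearest-neighbor distance between (N, 3) and (M, 3) arrays
    d_ab = np.min(np.linalg.norm(a[:, None] - b[None], axis=-1), axis=1)
    d_ba = np.min(np.linalg.norm(b[:, None] - a[None], axis=-1), axis=1)
    return d_ab.mean() + d_ba.mean()
```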
7. Real-Time Deployment
7.1 Optimization
- Model Pruning and Quantization reduce model size and latency (a quantization sketch follows this list)
- CUDA and TensorRT accelerate inference on NVIDIA GPUs
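As a small illustration of the quantization step, the sketch below applies PyTorch dynamic quantization to the linear layers of a stand-in model; a real depth network would typically go through post-training or quantization-aware flows and a TensorRT export instead.
```python
import torch

# Stand-in model; replace with your depth or segmentation network
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 1))

# Quantize Linear layers to int8 weights for smaller size and faster CPU inference
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```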
7.2 Edge Frameworks
- Open3D-ML for point cloud deep learning on edge
- ROS integration for robotics applications
8. Best Practices and Pitfalls
- Calibrate sensors accurately to avoid depth biases
- Normalize point densities before processing
- Handle dynamic objects in SLAM by rejecting moving points
- Regularly retrain monocular models on new environments to prevent domain shift
9. Case Studies
9.1 Autonomous Driving Depth Perception
A self-driving car company fused stereo depth and lidar to produce dense depth maps for obstacle avoidance. They used PSMNet for stereo matching and calibrated KITTI sequences. The combined system achieved AbsRel 0.055 and enabled real-time planning at 20 Hz on an NVIDIA DRIVE AGX.
9.2 Indoor Mapping with RGB-D SLAM
A robotics lab used ORB-SLAM2 with an Intel RealSense camera to create 3D maps of office interiors. They logged ATE of 0.02 m over 50 m trajectories. The resulting maps drove a navigation stack for service robots.
10. FAQs
What is 3D vision and depth estimation?
3D vision recovers spatial structure from 2D images, while depth estimation computes per-pixel distance to the camera using stereo, structure from motion, or deep networks.
What is 3D point cloud processing?
Point cloud processing involves filtering, downsampling, registration, segmentation, and analysis of unordered 3D point sets captured by lidar or depth sensors.
What is 3D vision in computer vision?
3D vision extends traditional image analysis to infer three-dimensional geometry, enabling applications like robotics, AR/VR, and autonomous navigation.
What is LiDAR point cloud processing?
LiDAR processing includes converting raw range measurements into point clouds, filtering noise, aligning scans, and extracting semantic or geometric features.
Glossary
- TSDF: Truncated Signed Distance Function, used for volumetric fusion
- ATE: Absolute Trajectory Error, used for SLAM evaluation
- FPFH: Fast Point Feature Histograms, used for point cloud matching
- SfM: Structure from Motion, 3D reconstruction from multiple images
Quick-Reference Cheat-Sheet
| Task | Common Method | Key Metric |
|---|---|---|
| Depth from Stereo | Semi-Global Matching, PSMNet | AbsRel, RMSE |
| Monocular Depth | MiDaS, DepthNet | δ < 1.25 |
| Point Cloud Filtering | Voxel Grid, Outlier Removal | Chamfer Distance |
| Point Cloud Registration | ICP, RANSAC | RMSE of alignment |
| SLAM | ORB-SLAM2, LOAM | ATE, RPE |