In this comprehensive 3D Computer Vision guide, you will learn how to extract and leverage three-dimensional information from images and video. We cover depth estimation methods, point cloud processing pipelines, and simultaneous localization and mapping (SLAM) techniques that enable robots, autonomous vehicles, and AR/VR systems to perceive and navigate complex environments. By the end of this guide, you will understand:
- The difference between 2D, 2.5D, and 3D data
- Classical and deep-learning approaches to depth estimation
- Point cloud acquisition, filtering, registration, and segmentation
- 3D reconstruction techniques: multi-view stereo and volumetric fusion
- SLAM fundamentals and state-of-the-art visual and lidar-based systems
- Public datasets and evaluation metrics for depth and point clouds
- Deployment strategies for real-time 3D vision on edge devices
- Detailed case studies in autonomous driving and indoor mapping
- FAQs covering key concepts
- A glossary and quick-reference cheat-sheet
Whether you are building a depth-aware robot, an AR application, or a self-driving car, this guide equips you with the theory, practical workflows, and tools to master 3D computer vision.
1. From 2D to 2.5D to 3D Data
1.1 2D Images
Standard images capture intensity or color at each pixel in a width × height grid; they contain no explicit depth information.
1.2 2.5D Depth Maps
Depth maps assign a distance value to each pixel and are typically obtained from stereo cameras or RGB-D depth sensors. Because they encode the scene from a single viewpoint, they represent 2.5D rather than full 3D structure.
1.3 3D Point Clouds
A point cloud is an unordered set of (x,y,z) coordinates representing surface geometry in three-dimensional space. Point clouds can include additional attributes such as color or intensity.
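To make the relationship between 2.5D and 3D concrete, the short sketch below back-projects a depth map into a point cloud with the pinhole camera model. The intrinsics fx, fy, cx, cy are illustrative placeholder values (roughly those of a PrimeSense-style RGB-D camera); substitute your own calibration.
```python
import numpy as np

def depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with valid depth
```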
2. Depth Estimation

Depth estimation recovers per-pixel distance from a camera.
2.1 Stereo Depth Estimation
- Stereo Matching compares left/right image pairs to compute disparity, then depth = baseline × focal_length / disparity.
- Block Matching and Semi-Global Matching are classical algorithms.
- Deep Networks such as PSMNet and GANet learn end-to-end matching for improved accuracy.
```python
# Example: computing depth from a stereo pair with OpenCV's SGBM matcher
import cv2
import numpy as np

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16
stereo = cv2.StereoSGBM_create(minDisparity=0,
                               numDisparities=128, blockSize=5)

# compute() returns fixed-point disparities scaled by 16
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# focal_length (pixels) and baseline (meters) come from stereo calibration
focal_length, baseline = 721.5, 0.54  # example calibration values
# Pixels with non-positive disparity are invalid; their depths are not meaningful
depth = (focal_length * baseline) / np.maximum(disparity, 1e-6)
```
2.2 Monocular Depth Estimation
- Structure from Motion (SfM) reconstructs sparse 3D points by matching features across multiple views and solving for camera poses.
- Deep Monocular Models such as MiDaS and DepthNet use CNNs trained on diverse datasets to predict depth from a single image, as sketched below.
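As one hedged example, MiDaS can be run through torch.hub as in the sketch below; this assumes PyTorch and timm are installed and that the model weights can be downloaded on first use. Note that MiDaS predicts relative (inverse) depth, not metric distance.
```python
import cv2
import torch

# Load a small MiDaS model and its matching input transform via torch.hub
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("image.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the prediction back to the input resolution
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()
```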
3. Point Cloud Processing
Once you have a point cloud from LiDAR, stereo, or RGB-D sensors, you must preprocess and analyze it.
3.1 Acquisition and Formats
- LiDAR sensors (Velodyne, Ouster) capture millions of points per second.
- RGB-D Cameras (RealSense, Kinect) provide aligned color and depth.
- File Formats: PLY and LAS/LAZ for stored point clouds; PCAP for raw lidar packet streams.
3.2 Filtering and Denoising
- Statistical Outlier Removal removes isolated points
- Voxel Grid Downsampling reduces density via 3D grid averaging
- Radius Outlier Removal deletes points with too few neighbors
```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.ply")
# Downsample to a 5 cm voxel grid
pcd = pcd.voxel_down_sample(voxel_size=0.05)
# Remove points that are statistical outliers relative to their 20 nearest neighbors
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
```
3.3 Registration (Alignment)
- ICP (Iterative Closest Point) aligns a source cloud to a target cloud by iteratively minimizing distances between corresponding points (see the sketch below).
- Global Registration uses feature matching (e.g., FPFH features) to handle large initial misalignment before ICP refinement.
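A minimal point-to-point ICP refinement with Open3D might look like the sketch below. It assumes two scans, source.ply and target.ply, that are already roughly aligned (otherwise run a global FPFH-based registration first) and a recent Open3D with the pipelines.registration module.
```python
import numpy as np
import open3d as o3d

source = o3d.io.read_point_cloud("source.ply")
target = o3d.io.read_point_cloud("target.ply")

threshold = 0.05  # maximum correspondence distance, in the clouds' units
result = o3d.pipelines.registration.registration_icp(
    source, target, threshold, np.eye(4),
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

print("fitness:", result.fitness, "inlier RMSE:", result.inlier_rmse)
source.transform(result.transformation)  # apply the estimated alignment
```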
3.4 Segmentation and Clustering
- Plane Segmentation via RANSAC to extract dominant planes (floors, walls).
- Euclidean Clustering groups nearby points into object clusters (a segmentation and clustering sketch follows this list).
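The sketch below extracts the dominant plane with RANSAC and then clusters the remaining points. Open3D exposes DBSCAN rather than a dedicated Euclidean-clustering call, so DBSCAN stands in for that step here; the thresholds are illustrative.
```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.ply")

# Fit the dominant plane (e.g. the floor) with RANSAC
plane_model, inliers = pcd.segment_plane(distance_threshold=0.02,
                                         ransac_n=3, num_iterations=1000)
floor = pcd.select_by_index(inliers)
objects = pcd.select_by_index(inliers, invert=True)

# Group the remaining points into candidate object clusters
labels = np.array(objects.cluster_dbscan(eps=0.1, min_points=20))
print("detected clusters:", labels.max() + 1)  # label -1 marks noise points
```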

4. 3D Reconstruction
Turning multiple depth views into a coherent 3D model.
4.1 Multi-View Stereo
- Photogrammetry algorithms reconstruct meshes from overlapping images
- PatchMatch Stereo accelerates dense correspondence search
4.2 Volumetric Fusion
- TSDF (Truncated Signed Distance Function) integrates depth frames into a volumetric grid (KinectFusion).
- Poisson Surface Reconstruction builds watertight meshes from oriented point sets, while Marching Cubes extracts meshes directly from the TSDF grid (a fusion sketch follows).
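A hedged TSDF-fusion sketch with Open3D follows. The frame list, camera poses, and the PrimeSense default intrinsics are placeholders; in practice the poses come from a SLAM or SfM pipeline, and the extrinsic passed to integrate() is the inverse of the camera-to-world pose.
```python
import numpy as np
import open3d as o3d

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01, sdf_trunc=0.04,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)

frames = [("color_000.png", "depth_000.png", np.eye(4))]  # placeholder frames
for color_path, depth_path, pose in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.io.read_image(color_path), o3d.io.read_image(depth_path),
        convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))  # world-to-camera

mesh = volume.extract_triangle_mesh()
```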
5. SLAM Fundamentals
SLAM simultaneously builds a map of an environment and localizes the sensor.
5.1 Visual SLAM
- ORB-SLAM2 uses ORB features for tracking and mapping with loop closure.
- LSD-SLAM performs direct semi-dense tracking without features.
5.2 Visual-Inertial SLAM
- Fuses IMU data with camera frames for robust tracking in low-texture regions.
5.3 LiDAR SLAM
- Cartographer and LOAM use lidar scans for 3D mapping and pose estimation.

6. Datasets and Evaluation Metrics
6.1 Depth Datasets
- KITTI Depth: Stereo and lidar ground truth for driving
- NYU Depth v2: Indoor RGB-D scenes
6.2 Point Cloud Datasets
- SemanticKITTI: Labeled lidar sequences for semantic segmentation
- ScanNet: 3D indoor scans with annotations
6.3 SLAM Benchmarks
- TUM RGB-D: Ground truth trajectories and maps
- KITTI Odometry: LiDAR and vision odometry sequences
6.4 Evaluation Metrics
- Depth: AbsRel, RMSE, δ < 1.25
- Point Clouds: Chamfer distance, F-Score (depth and Chamfer metrics are sketched after this list)
- SLAM: Absolute Trajectory Error (ATE), Relative Pose Error (RPE)
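For reference, the depth metrics and a naive Chamfer distance can be computed as in the sketch below; pred and gt are metric depth maps, mask selects pixels with valid ground truth, and the Chamfer version is the brute-force O(N·M) form suitable only for small clouds.
```python
import numpy as np

def depth_metrics(pred, gt, mask):
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)  # threshold accuracy
    return abs_rel, rmse, delta1

def chamfer_distance(a, b):
    # Symmetric mean nearest-neighbor distance between (N, 3) and (M, 3) arrays
    d_ab = np.min(np.linalg.norm(a[:, None] - b[None], axis=-1), axis=1)
    d_ba = np.min(np.linalg.norm(b[:, None] - a[None], axis=-1), axis=1)
    return d_ab.mean() + d_ba.mean()
```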
7. Real-Time Deployment
7.1 Optimization
- Model Pruning and Quantization reduce model size and latency (a quantization sketch follows this list)
- CUDA and TensorRT accelerate inference on NVIDIA GPUs
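As a small illustration of the quantization step, the sketch below applies PyTorch dynamic quantization to the linear layers of a stand-in model; a real depth network would typically go through post-training or quantization-aware flows and a TensorRT export instead.
```python
import torch

# Stand-in model; replace with your depth or segmentation network
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 1))

# Quantize Linear layers to int8 weights for smaller size and faster CPU inference
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```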
7.2 Edge Frameworks
- Open3D-ML for point cloud deep learning on edge
- ROS integration for robotics applications
8. Best Practices and Pitfalls
- Calibrate sensors accurately to avoid depth biases
- Normalize point densities before processing
- Handle dynamic objects in SLAM by rejecting moving points
- Regularly retrain monocular models on new environments to prevent domain shift
9. Case Studies
9.1 Autonomous Driving Depth Perception
A self-driving car company fused stereo depth and lidar to produce dense depth maps for obstacle avoidance. They used PSMNet for stereo matching and calibrated KITTI sequences. The combined system achieved AbsRel 0.055 and enabled real-time planning at 20 Hz on an NVIDIA DRIVE AGX.
9.2 Indoor Mapping with RGB-D SLAM
A robotics lab used ORB-SLAM2 with an Intel RealSense camera to create 3D maps of office interiors. They logged ATE of 0.02 m over 50 m trajectories. The resulting maps drove a navigation stack for service robots.
10. FAQs
What is 3D vision and depth estimation?
3D vision recovers spatial structure from 2D images, while depth estimation computes per-pixel distance to the camera using stereo, structure from motion, or deep networks.
What is 3D point cloud processing?
Point cloud processing involves filtering, downsampling, registration, segmentation, and analysis of unordered 3D point sets captured by lidar or depth sensors.
What is 3D vision in computer vision?
3D vision extends traditional image analysis to infer three-dimensional geometry, enabling applications like robotics, AR/VR, and autonomous navigation.
What is LiDAR point cloud processing?
LiDAR processing includes converting raw range measurements into point clouds, filtering noise, aligning scans, and extracting semantic or geometric features.
Glossary
- TSDF: Truncated Signed Distance Function, used for volumetric fusion
- ATE: Absolute Trajectory Error, used for SLAM evaluation
- FPFH: Fast Point Feature Histograms, used for point cloud matching
- SfM: Structure from Motion, 3D reconstruction from multiple images
Quick-Reference Cheat-Sheet
| Task | Common Method | Key Metric |
|---|---|---|
| Depth from Stereo | Semi-Global Matching, PSMNet | AbsRel, RMSE |
| Monocular Depth | MiDaS, DepthNet | δ < 1.25 |
| Point Cloud Filtering | Voxel Grid, Outlier Removal | Chamfer Distance |
| Point Cloud Registration | ICP, RANSAC | RMSE of alignment |
| SLAM | ORB-SLAM2, LOAM | ATE, RPE |