3DCV Workshop 02 - Monocular Depth Estimation

Furtwangen University




Comparison

[Image grid, five rows. Columns: Kinect Color Image, Kinect Depth Image (03/2020), DINOv2 (2023), Depth Anything v1 (01/2024), Depth Anything v2 (05/2024), DepthPro (10/2024), MoGe (10/2024). Images marked as missing: row 1 has no Depth Anything v2 and no MoGe result; row 2 has no MoGe result; row 5 has no Depth Anything v1 and no Depth Anything v2 result.]


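Requirements
  Kinect Depth Image (03/2020): Azure Kinect camera (discontinued, originally $399)
  Other entries (model columns unclear): Linux, NVIDIA GPU; Apple System

Real-time
  Kinect Depth Image (03/2020): Yes
  DINOv2 (2023): No
  Depth Anything v1 (01/2024): No
  Depth Anything v2 (05/2024): No
  DepthPro (10/2024): No
  MoGe (10/2024): No

Quality
  Kinect Depth Image (03/2020): limited range, low depth resolution
  DINOv2 (2023): very patchy
  Depth Anything v1 (01/2024): good, but not very detailed
  Depth Anything v2 (05/2024): good
  DepthPro (10/2024): good, can detect glass panes
  MoGe (10/2024): edges not detected properly

Robustness
  Kinect Depth Image (03/2020): doesn't work on reflective/transparent surfaces; noise/artifacts due to dust
  DINOv2 (2023): more robust, doesn't work on transparent surfaces
  Depth Anything v1 (01/2024): more robust, doesn't work on transparent surfaces
  Depth Anything v2 (05/2024): more robust, doesn't work on transparent surfaces
  DepthPro (10/2024): more robust
  MoGe (10/2024): more robust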
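
The quality ratings above are qualitative. As a minimal sketch of how they could be backed by numbers, the following computes three commonly used depth error measures (mean absolute relative error, RMSE, and the share of pixels with a ratio below 1.25) between a model prediction and the Kinect depth image. It is not part of the workshop results. The function name `depth_errors`, the flattened pixel lists, and the convention that the reference marks invalid pixels with 0 are assumptions for illustration, and the prediction is assumed to already be in the same metric units as the reference.

```python
from math import sqrt


def depth_errors(pred, ref, invalid=0.0):
    """Compare predicted depths against reference depths (same units).

    `pred` and `ref` are flattened depth images of equal length.
    Pixels where the reference equals `invalid` (assumed marker for
    "no measurement") or where the prediction is not positive are skipped.
    Returns (abs_rel, rmse, delta1).
    """
    pairs = [(p, r) for p, r in zip(pred, ref) if r != invalid and p > 0]
    if not pairs:
        raise ValueError("no valid pixels to compare")
    n = len(pairs)
    # Mean absolute error relative to the reference depth.
    abs_rel = sum(abs(p - r) / r for p, r in pairs) / n
    # Root mean squared error in depth units.
    rmse = sqrt(sum((p - r) ** 2 for p, r in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25.
    delta1 = sum(1 for p, r in pairs if max(p / r, r / p) < 1.25) / n
    return abs_rel, rmse, delta1


# Toy example: the last reference pixel is invalid and is ignored.
abs_rel, rmse, delta1 = depth_errors([1.0, 2.0, 4.0, 3.0], [1.0, 2.0, 2.0, 0.0])
```

Because the Kinect itself fails on reflective and transparent surfaces, such numbers would only describe agreement with the sensor where it returns a measurement, not absolute correctness.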

3D cameras vs. monocular AI models

3D cameras | Monocular AI models
(+) Real-time | (-) Higher computational cost
(+) Absolute measurements | (-) Relative measurements -> calibration necessary
(-) Limited range, low quality | (+) Very high quality
(-) Requires specialized camera -> more expensive | (+) Possible without specialized hardware (camera) -> any images can be used, cheaper
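
The calibration step mentioned for relative measurements can be sketched as a least-squares fit of one global scale and shift that maps the relative prediction onto a few known metric depths (for example, valid Kinect pixels). This is an illustrative sketch, not the procedure used in the workshop. The names `fit_scale_shift` and `to_metric`, the flattened pixel lists, and the 0 marker for invalid reference pixels are assumptions. Some models predict inverse depth rather than depth, in which case the fit would be done against 1/depth instead.

```python
def fit_scale_shift(rel, metric, invalid=0.0):
    """Least-squares scale s and shift t so that s * rel + t ~ metric.

    `rel` is the relative model output, `metric` the reference depth in
    metric units; both are flattened images of equal length. Reference
    pixels equal to `invalid` (assumed "no measurement" marker) are skipped.
    """
    pairs = [(x, y) for x, y in zip(rel, metric) if y != invalid]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid reference pixels")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        raise ValueError("relative depth is constant; scale is undetermined")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    # Closed-form solution of ordinary least squares with two parameters.
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def to_metric(rel, scale, shift):
    """Apply the fitted scale and shift to every relative depth value."""
    return [scale * x + shift for x in rel]


# Toy example: the last reference pixel is invalid, so only the first
# three pixels are used for the fit; the prediction still covers all four.
rel = [0.0, 0.5, 1.0, 0.25]
metric = [1.0, 2.0, 3.0, 0.0]
scale, shift = fit_scale_shift(rel, metric)
aligned = to_metric(rel, scale, shift)
```

A single global scale and shift is the simplest possible calibration; it cannot correct errors that vary across the image.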

Conclusion

3D cameras such as the Azure Kinect offer real-time depth sensing at absolute scale, but they are limited in range and depth resolution, are sensitive to dust and to reflective or transparent surfaces, and require dedicated hardware. Monocular AI models, in contrast, deliver markedly higher depth quality and robustness without specialized sensors, although none of the models tested here runs in real time and their depth estimates are relative rather than absolute. Given the rapid progress in AI depth estimation, monocular approaches are becoming a strong alternative for many applications, especially where cost, flexibility, and high fidelity are priorities. For scenarios that require precise, real-time metric measurements, however, dedicated 3D cameras still retain an important role.