Monocular depth estimation infers scene depth from a single RGB image using CNNs/Transformers, learning cues like perspective, size, and texture to produce depth maps. It’s low-cost for driving, robotics, and AR/VR, though absolute scale remains challenging.
Check out the answers to some of the most common questions about monocular depth estimation below:
Monocular depth estimation is a computer vision method for predicting scene depth from a single RGB image. A neural network generates a depth map where each pixel’s intensity represents distance, learning to infer 3D structure from 2D visuals using only one camera viewpoint.
It uses deep learning models like CNNs or Transformers to detect depth cues—perspective, object size, and texture—from a single image. Trained on large datasets with ground-truth depth, the model learns to translate 2D images into depth maps without stereo or LiDAR input.
A 2D image lacks direct depth data, making the task ambiguous. Models must infer scale and distance from visual cues such as object size or position. They typically predict relative depth accurately but struggle with absolute (metric) scale without additional information.
It’s used in autonomous driving, robotics, and AR/VR for obstacle detection and realistic 3D scene rendering. Smartphones use it for portrait effects, and it also supports 3D reconstruction, medical imaging, and other tasks where depth sensors are impractical.
Unlike stereo or LiDAR, which measure depth directly, monocular methods infer it from one camera image. They’re cheaper and more flexible but less precise in absolute terms. Monocular depth works well for relative depth estimation where specialized sensors aren’t feasible.
In computer vision, we have come a long way from identifying objects in 2D images to understanding the 3D space they occupy.
Perceiving depth (depth awareness) is crucial for many real-world applications, such as self-driving cars and augmented reality.
Monocular depth estimation is a key technology that lets machines perceive the world in three dimensions using only one camera.
In this guide, we’ll cover:
If you want to adapt self-supervised learning to create a domain-specific vision model for tasks such as monocular depth estimation, use LightlyTrain. It provides a self-supervised learning pipeline that pretrains vision models on your unlabeled images and reduces the need for labeled datasets.
Interested in trying LightlyTrain? Try it out for free!

Monocular depth estimation is a computer vision task that infers depth information from a single 2D image. In simple terms, it teaches a machine to estimate the distance of objects in a scene from a single camera viewpoint, similar to how a person sees with one eye.
The depth estimation process outputs a depth map (also called a depth image or depth mask) with the same resolution as the input RGB image.
Each pixel's intensity in the map (its depth value) represents the distance to the corresponding point in the original scene. Typically, lighter pixels indicate closer objects, while darker pixels indicate objects farther away.
Practically, monocular depth estimation uses deep learning to estimate depth. Since a single image contains no direct distance measurements, a neural network learns to recognize visual patterns that correlate with depth (monocular cues).
These cues are similar to those our brain uses to perceive 3D space, including:

For example, when the model takes an image of a chessboard, it processes it and produces a depth map (often a normalized relative or inverse depth map, where brighter values indicate closer points).
In this map, the chess pieces in the foreground appear bright white, while those further back are shown in gradually darker shades of gray, with the most distant parts fading to nearly black. This provides a detailed, pixel-level understanding of the scene geometry.

When discussing depth estimation, it's important to distinguish between two main categories:

Depth information is crucial for many vision tasks because depth maps enable us to go beyond 2D object recognition by turning a flat image into a 3D understanding.
Here are some of the most common applications.
Depth information helps estimate distances to vehicles, plan paths, and avoid collisions in autonomous driving and robotics.

Traditionally, these systems depended on costly sensors like LiDAR or stereo cameras to measure depth. Monocular depth estimation provides a more affordable alternative that can augment or even replace LiDAR.
It can serve as the primary depth sensor in budget-conscious applications or act as a redundant safety layer alongside other sensors, providing crucial distance data for real-time decision-making.

Monocular depth estimation improves AR and VR by enabling the natural placement of virtual objects in the real world on devices like smartphones and AR glasses.
Depth maps ensure proper occlusion and make interactions between virtual and real objects seem more believable.

Monocular depth models can convert video or single photos into 3D models of a scene. These reconstructions can be used for applications like interior scanning, cultural heritage preservation, and creating assets for virtual reality.
3D reconstruction is achieved by combining each pixel's 2D coordinates (x, y) with its predicted depth value and back-projecting through the camera intrinsics to obtain a 3D point (X, Y, Z) in space. When done for all pixels, this process generates a 3D point cloud of the scene geometry.
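To make the back-projection concrete, here is a minimal sketch assuming a pinhole camera model; the intrinsics (fx, fy, cx, cy) below are made-up values and should be replaced with your camera's calibration.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H x W, in meters) into a 3D point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal point (cx, cy) in pixels.
    """
    h, w = depth.shape
    # Pixel coordinate grids: u varies along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    # Invert the pinhole projection: X = (u - cx) * z / fx, Y = (v - cy) * z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop invalid pixels (zero or missing depth).
    return points[points[:, 2] > 0]

# Example usage with made-up intrinsics for a 640x480 camera.
depth_map = np.random.uniform(0.5, 5.0, size=(480, 640)).astype(np.float32)
cloud = depth_to_point_cloud(depth_map, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (307200, 3)
```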
For example, this paper proposed the framework MURRe (Multi-view Reconstruction via SfM-guided Monocular Depth Estimation), which uses a multi-stage pipeline for 3D reconstruction.

In MURRe, individual depth predictions from various viewpoints are combined to create a single 3D reconstruction of the environment (a collection of objects, an indoor room, or a large-scale cityscape).

Depth maps improve semantic segmentation and object detection by providing a third dimension (a sense of how far away objects are).
For example, a standard object detection system can detect a car in an image, draw a box around it, and report the object's location and class (the what and where in 2D). However, it doesn't know how far away the car is or its real-world scale.
Monocular depth estimation adds this information by estimating the distance to objects.
For example, by combining the 2D bounding box from an object detection model with the predicted depth values for that object, we can perform 3D object detection: a 3D bounding box that describes the object's position and scale in the real world (see the sketch below).
This fusion of data is important for any autonomous agent. A self-driving car needs to know whether a pedestrian is 10 meters or 100 meters away, since the required safety action is completely different.
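As a rough sketch of this fusion (not any particular detector's API), the snippet below takes a hypothetical 2D bounding box and a metric depth map, estimates the object's distance from the median depth inside the box, and lifts the box center into camera coordinates using assumed pinhole intrinsics.

```python
import numpy as np

def localize_object_3d(depth, box, fx, fy, cx, cy):
    """Estimate an object's 3D position from a 2D box and a metric depth map.

    depth: (H, W) depth map in meters.
    box:   (x_min, y_min, x_max, y_max) in pixels, e.g. from a 2D detector.
    """
    x_min, y_min, x_max, y_max = box
    # Median depth inside the box is more robust to background pixels than the mean.
    z = float(np.median(depth[y_min:y_max, x_min:x_max]))
    # Back-project the box center using the pinhole model.
    u, v = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z  # Object center in camera coordinates (meters).

# Example with a made-up box and depth map.
depth_map = np.full((480, 640), 12.0, dtype=np.float32)
print(localize_object_3d(depth_map, (200, 150, 320, 300), fx=720.0, fy=720.0, cx=320.0, cy=240.0))
```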

Monocular depth estimation principles are also used in medicine.
During minimally invasive surgeries, an endoscopic camera provides a single video feed from inside the body.
A depth model can analyze this feed to create a 3D map of organs and tissues. This helps surgeons perceive the anatomy in 3D, making it easier to navigate and operate accurately.
For example, the neural network takes endoscopic images as input and uses an encoder-decoder structure to extract feature maps.
From these features, the model generates a disparity (depth) map and a confidence map, which are used to reconstruct the 3D scene geometry of the internal anatomy in real time.

Estimating depth from one image is inherently difficult (an ill-posed problem). Since no stereo parallax is available, the model has to rely on learned priors and visual cues. Key challenges include:
Developing a monocular depth estimation model involves a thorough process that spans data collection, architecture design, training, and evaluation.
Each step presents unique considerations that an ML engineer must address to build a powerful system.
Training a good depth estimation model starts with a high-quality, relevant dataset of RGB images paired with ground-truth depth maps.
The data you choose depends on your use case, and there are many publicly available datasets you can utilize.
So, let's go over some popular datasets.





Pro tip: Check our Guide to Synthetic Data Generation (And How Lightly Can Help)

Most of these datasets provide RGB images as PNG or JPG files and depth maps as 16-bit PNGs or NumPy arrays. It's important to align the depth maps with the images (same width/height and camera parameters).
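For instance, a minimal loading sketch could look like the following; the depth scale factor (here 1000, i.e. millimeters to meters) is an assumption that varies by dataset, so check the dataset's documentation.

```python
import cv2
import numpy as np

def load_rgbd_pair(rgb_path, depth_path, depth_scale=1000.0):
    """Load an RGB image and its 16-bit depth map, and align their resolutions."""
    rgb = cv2.cvtColor(cv2.imread(rgb_path), cv2.COLOR_BGR2RGB)
    # IMREAD_UNCHANGED keeps the 16-bit values instead of clipping them to 8 bits.
    depth_raw = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED).astype(np.float32)
    depth_m = depth_raw / depth_scale  # e.g. millimeters -> meters (dataset dependent)
    # Resize depth to match the RGB resolution; nearest-neighbor avoids inventing depth values.
    h, w = rgb.shape[:2]
    if depth_m.shape[:2] != (h, w):
        depth_m = cv2.resize(depth_m, (w, h), interpolation=cv2.INTER_NEAREST)
    return rgb, depth_m
```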
All of the datasets mentioned above are a good starting point. But if you want a depth estimation model that understands your specific domain, you need to collect your own domain data and perform depth annotation with appropriate software or methods (especially if LiDAR is too costly).
Pro tip: If you are looking for the perfect data annotation tool or service, make sure to check out our list of 12 Best Data Annotation Tools for Computer Vision (Free & Paid) and 5 Best Data Annotation Companies in 2025.
To collect domain data, you can use LightlyEdge. It is a smart data selection SDK that runs on edge devices (data collection devices). It analyzes incoming data in real time and only collects the most valuable and informative frames that match your criteria, such as scenes of interest.

After collecting the data, you can use LightlyStudio for further curation and labeling. It can help select a diverse, high-quality subset from large image collections.
LightlyStudio ensures that when you send data for depth annotation, you focus on samples that actually improve model generalization.

If you prefer working in code, LightlyStudio offers a Python SDK and API support, built on open-source standards designed for flexibility and scale.

Now, let's understand the model architectures.
Monocular depth models are typically encoder–decoder neural networks.
This two-part architecture turns a high-resolution input image into a dense, pixel-aligned depth map.
The encoder is usually a Convolutional Neural Network (CNN) that processes the input image to extract a rich set of feature maps. It learns to recognize an implicit relation between color pixels and depth (capturing semantic and geometric cues).
Since building and training a CNN encoder from scratch is difficult, we usually use a pretrained backbone (such as ResNet or DenseNet). These backbones are trained on general datasets, but with transfer learning we adapt them to our more specific use case.
The decoder then takes these compact features and upsamples them, often using skip connections to reintroduce fine details from the encoder, to reconstruct the final full-resolution depth map.
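To make this concrete, here is a deliberately simplified PyTorch sketch of such an encoder-decoder with a pretrained ResNet-18 encoder and skip connections; real models use deeper decoders and more careful upsampling, so treat this as an illustration rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torchvision

class SimpleDepthNet(nn.Module):
    """Toy encoder-decoder for monocular depth: ResNet-18 encoder + upsampling decoder."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        # Encoder stages producing features at 1/4, 1/8, 1/16, and 1/32 resolution.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.enc1, self.enc2 = backbone.layer1, backbone.layer2
        self.enc3, self.enc4 = backbone.layer3, backbone.layer4
        # Decoder: upsample and fuse skip connections from the encoder.
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec3 = nn.Conv2d(512 + 256, 256, 3, padding=1)
        self.dec2 = nn.Conv2d(256 + 128, 128, 3, padding=1)
        self.dec1 = nn.Conv2d(128 + 64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 1, 3, padding=1)  # one depth value per pixel

    def forward(self, x):
        f0 = self.stem(x)
        f1 = self.enc1(f0)   # 1/4 resolution
        f2 = self.enc2(f1)   # 1/8
        f3 = self.enc3(f2)   # 1/16
        f4 = self.enc4(f3)   # 1/32
        d = torch.relu(self.dec3(torch.cat([self.up(f4), f3], dim=1)))
        d = torch.relu(self.dec2(torch.cat([self.up(d), f2], dim=1)))
        d = torch.relu(self.dec1(torch.cat([self.up(d), f1], dim=1)))
        # Upsample back to the input resolution and keep depth non-negative.
        d = nn.functional.interpolate(d, scale_factor=4, mode="bilinear", align_corners=False)
        return torch.relu(self.head(d))

# depth = SimpleDepthNet()(torch.randn(1, 3, 224, 224))  # -> (1, 1, 224, 224)
```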

Modern approaches add further sophistication, for instance transformer or self-attention blocks for global scene understanding, or multi-scale feature fusion.
For example, AdaBins uses a CNN backbone plus a transformer-based "adaptive binning" module that divides the depth range into learned bins, allowing more precise predictions.

Similarly, models like DPT (Dense Prediction Transformer) use a vision transformer backbone for dense, per-pixel prediction, analyzing the entire scene context at once.
Self-attention gives the depth model a higher-capacity interpretation of scene geometry and helps it resolve ambiguities.

Methods like DPT work well with supervised learning and depend on large labeled datasets (ground truth), which are expensive and time-consuming to collect and can contain biases.
These limitations can be overcome using unsupervised and self-supervised learning for monocular depth estimation.
Self-supervised learning uses the data itself as the supervision signal (guidance). The key task of the model is to predict depth maps from images (or stereo pairs), without requiring expensive ground-truth depth labels.
A depth network and a pose network are jointly trained to reconstruct target views from source views using photometric consistency (reprojection loss + image similarity).

The architecture choice also depends on the use case. For example, if you're planning to run the model on mobile devices, lightweight networks such as DepthNet or MobileNet-based encoders are good options.
On the other hand, heavy transformer models require GPUs to get top accuracy on benchmarks.
After the data and model are prepared, we start teaching the model to make accurate predictions. It's an iterative loop guided by a loss function, which measures the error between the model’s depth prediction and the ground truth.
Common supervised loss choices include:
Furthermore, in a self-supervised setting where no ground-truth depth is available, the main guide is the photometric reconstruction error. Here, the model predicts a depth map from one image and uses it to "warp" that image to a neighboring viewpoint (from a video or stereo pair).
The loss is the difference (typically a mix of L1 and SSIM) between this synthesized image and the actual target image.
Similarly, unsupervised methods (including self-supervised ones) add an edge-aware smoothness loss that penalizes large local depth gradients (since depth should be piecewise smooth).
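Below is a hedged sketch of these self-supervised loss terms (an SSIM + L1 photometric loss between the warped and target views, plus edge-aware smoothness) in the spirit of Monodepth-style training; the view-warping step and loss weights are omitted and depend on your pipeline.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified SSIM over 3x3 windows, returned as a (1 - SSIM) / 2 dissimilarity map."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return torch.clamp((1 - ssim_map) / 2, 0, 1)

def photometric_loss(warped, target, alpha=0.85):
    """Mix of SSIM and L1 between the view synthesized from predicted depth and the real target view."""
    l1 = (warped - target).abs()
    return (alpha * ssim(warped, target) + (1 - alpha) * l1).mean()

def edge_aware_smoothness(depth, image):
    """Penalize depth gradients except where the image itself has strong edges."""
    d_dx = (depth[:, :, :, :-1] - depth[:, :, :, 1:]).abs()
    d_dy = (depth[:, :, :-1, :] - depth[:, :, 1:, :]).abs()
    i_dx = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    i_dy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```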
The calculated error loss is then fed to an optimizer, such as Adam, which uses this signal to adjust the model's weights. This process is fine-tuned by hyperparameters like the learning rate (how big a step the optimizer takes), the number of epochs, and the batch size.
Throughout training, the model's performance is monitored on a separate validation set to check for overfitting and determine when the model has learned effectively.
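Putting the pieces together, a stripped-down supervised training loop might look like the following; the model, dataloaders, and hyperparameter values are placeholders.

```python
import torch

def train_depth_model(model, train_loader, val_loader, epochs=20, lr=1e-4, device="cuda"):
    """Minimal supervised training loop: L1 loss between predicted and ground-truth depth."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for images, gt_depth in train_loader:
            images, gt_depth = images.to(device), gt_depth.to(device)
            pred = model(images)
            loss = torch.nn.functional.l1_loss(pred, gt_depth)  # simple supervised loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Monitor validation error each epoch to catch overfitting.
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                torch.nn.functional.l1_loss(model(x.to(device)), y.to(device)).item()
                for x, y in val_loader
            ) / len(val_loader)
        print(f"epoch {epoch + 1}: val L1 = {val_loss:.4f}")
```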
Sometimes, the model's raw output can be slightly noisy. To reduce this, you can apply post-processing filters, such as a bilateral filter, to the predicted depth map to smooth out imperfections while preserving sharp edges.
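For example, with OpenCV this could look like the snippet below; the filter diameter and sigma values are arbitrary starting points to tune for your depth range.

```python
import cv2
import numpy as np

def smooth_depth(depth_map):
    """Edge-preserving smoothing of a predicted depth map with a bilateral filter."""
    depth_f32 = depth_map.astype(np.float32)  # bilateralFilter expects 8-bit or float32 input
    # d: neighborhood diameter; sigmaColor/sigmaSpace control how strongly
    # depth differences and pixel distances reduce a neighbor's influence.
    return cv2.bilateralFilter(depth_f32, d=9, sigmaColor=0.1, sigmaSpace=7)
```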
Another way to improve output quality is to pretrain the model before fine-tuning. With LightlyTrain, pretraining is as easy as writing a few lines.
See the sample code below. For an end-to-end implementation of monocular depth estimation with fastai U-Net + LightlyTrain, check it out here.
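As a rough sketch (the paths and backbone name are placeholders, and the exact options may differ, so check the LightlyTrain documentation), a pretraining run can look like this:

```python
import lightly_train

if __name__ == "__main__":
    # Pretrain a backbone on your own unlabeled images with self-supervised learning.
    lightly_train.train(
        out="out/depth_pretrain",      # where checkpoints and logs are written (placeholder)
        data="data/unlabeled_frames",  # folder of unlabeled RGB images (placeholder)
        model="torchvision/resnet34",  # backbone to pretrain (placeholder choice)
    )
    # The exported checkpoint can then be loaded as the encoder of a depth
    # estimation network (e.g. a fastai U-Net) before fine-tuning.
```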

After training is complete, the model's final performance is measured on a hidden test set. We use specific evaluation metrics to quantify its accuracy.
Common evaluation metrics include:


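As a reference sketch, these standard metrics can be computed roughly as follows, assuming predictions and ground truth are metric depths of the same shape:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics: AbsRel, RMSE, and delta accuracy thresholds."""
    pred, gt = np.asarray(pred, dtype=np.float64), np.asarray(gt, dtype=np.float64)
    mask = gt > 0  # evaluate only on valid ground-truth pixels
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "AbsRel": abs_rel,
        "RMSE": rmse,
        "delta<1.25": np.mean(ratio < 1.25),
        "delta<1.25^2": np.mean(ratio < 1.25 ** 2),
        "delta<1.25^3": np.mean(ratio < 1.25 ** 3),
    }
```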
Pre-trained depth models available through platforms like PyTorch Hub make it much simpler to get started and achieve strong results.
So, let's write some code and perform inference using the MiDaS model.
First, we define functions to load the pre-trained MiDaS model from the PyTorch Hub and set it to evaluation mode. The transform function prepares the image by resizing and normalizing it to the model's expected input format.
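A sketch of these loading helpers, using the public intel-isl/MiDaS entry points on PyTorch Hub (the model type "MiDaS_small" is a lightweight choice; "DPT_Large" is heavier and more accurate):

```python
import torch

def load_midas(model_type="MiDaS_small"):
    """Load a pre-trained MiDaS model from PyTorch Hub and set it to eval mode."""
    model = torch.hub.load("intel-isl/MiDaS", model_type)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device).eval()
    return model, device

def load_transforms(model_type="MiDaS_small"):
    """Fetch the matching preprocessing transform (resize + normalize) for the chosen model."""
    midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    if model_type in ("DPT_Large", "DPT_Hybrid"):
        return midas_transforms.dpt_transform
    return midas_transforms.small_transform
```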

Next, we need a function to load an image from either a URL or a local file path. The load_image function ensures the image is in RGB format, which the model requires.
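One possible implementation of load_image, assuming Pillow and requests as dependencies:

```python
from io import BytesIO

import numpy as np
import requests
from PIL import Image

def load_image(source):
    """Load an image from a URL or a local path and return it as an RGB numpy array."""
    if source.startswith(("http://", "https://")):
        response = requests.get(source, timeout=10)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(source)
    return np.array(image.convert("RGB"))  # MiDaS transforms expect an RGB array
```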

Now, the function below (estimate_depth) takes the loaded model, the transforms, and the image to perform the depth estimation, then resizes the output depth map back to the original image's dimensions.
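A sketch of estimate_depth along those lines; the MiDaS transform produces a batched tensor, and the raw prediction is resized back to the original image size:

```python
import torch

def estimate_depth(model, transform, image, device):
    """Run MiDaS on an RGB numpy image and return a depth map at the original resolution."""
    input_batch = transform(image).to(device)
    with torch.no_grad():
        prediction = model(input_batch)
        # Resize the (1, H', W') prediction back to the input image's height and width.
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=image.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()
    return prediction.cpu().numpy()
```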

Here, we create functions to normalize and visualize the depth map to interpret the model's output.
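A simple normalization helper that maps the raw (relative, inverse-depth) output to the [0, 1] range for display:

```python
import numpy as np

def normalize_depth(depth):
    """Scale a raw depth prediction to [0, 1] so it can be rendered as an image."""
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min + 1e-8)  # epsilon avoids division by zero
```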

We will display the original image alongside the depth map rendered with various colormaps (Magma, Viridis, and Grayscale) to highlight different depth details with the function below.
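A sketch of this visualization step with Matplotlib:

```python
import matplotlib.pyplot as plt

def show_results(image, depth_norm):
    """Display the input image and its depth map in Magma, Viridis, and Grayscale."""
    fig, axes = plt.subplots(1, 4, figsize=(18, 5))
    panels = [
        (image, None, "Original"),
        (depth_norm, "magma", "Depth (Magma)"),
        (depth_norm, "viridis", "Depth (Viridis)"),
        (depth_norm, "gray", "Depth (Grayscale)"),
    ]
    for ax, (img, cmap, title) in zip(axes, panels):
        ax.imshow(img, cmap=cmap)
        ax.set_title(title)
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```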

With all the helper functions in place, a main function ties the entire process together, calling each helper in sequence to go from a source image URL to a final visualization.
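A possible main function wiring the helpers above together (the image URL is a placeholder):

```python
def main():
    model_type = "MiDaS_small"
    model, device = load_midas(model_type)
    transform = load_transforms(model_type)

    image_url = "https://example.com/sample.jpg"  # placeholder: swap in your own image
    image = load_image(image_url)

    depth = estimate_depth(model, transform, image, device)
    show_results(image, normalize_depth(depth))

if __name__ == "__main__":
    main()
```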

Output:

Monocular depth estimation has seen rapid progress, with methods and training approaches constantly pushing the limits of performance. The table below compares several landmark models.
Simply put, monocular depth models are now usable for many tasks. But you must choose the right model for the job: a fast, lightweight model for real-time, on-device use versus a heavier, more accurate model for offline processing.
Training robust depth estimation models needs extensive data to account for diverse lighting, weather, and complex scene geometry.
But collecting and then labeling this data is inefficient. Much of it, like repetitive driving footage, is redundant and adds little value to the model while driving up costs.
This is where Lightly AI's platform steps in to address the data bottleneck, providing tools for developing high-accuracy depth models while minimizing costs and effort.
The best part is that LightlyEdge + LightlyTrain easily integrates with LightlyStudio. Data intelligently captured by Edge can be fed into Studio for quality assurance and further curation.
Then, LightlyTrain can pretrain the model using that unlabeled data. Together, these tools create an end-to-end pipeline for building better datasets and models with less effort.

Monocular depth estimation turns a flat RGB image into 3D scene understanding. Despite their challenges, advances in neural networks and large-scale training have yielded impressive results.
Depth estimation is also increasingly combined with other vision tasks, letting models learn depth, semantics, and geometry jointly.
By following best practices in data preparation and training, and by using the latest models and data curation tools, ML engineers can create systems that bring depth awareness to any single-camera application.


