Achieving reliable real-time localization and mapping is essential for robotics and AR
applications. The Computer Vision Toolbox™ provides a performant, configurable, and easy-to-use interface that offers an
out-of-the-box solution for visual simultaneous localization and mapping (vSLAM), handling
tasks such as feature extraction, matching, pose estimation, mapping, loop closure, and IMU
sensor fusion internally. To meet performance demands, you can improve the accuracy,
robustness, and efficiency of your visual SLAM system by optimizing sensor use for loop
closure and tuning key parameters. For a general description of why SLAM matters and how it
works for different applications, see What is SLAM?
Using Verbose Mode to Diagnose SLAM Errors
During SLAM processing, you can diagnose and troubleshoot errors using runtime
messages returned to the command line as the algorithm runs. To display these messages,
set the Verbose name-value argument to true for
the monovslam,
stereovslam, or rgbdvslam
object. In addition to enabling Verbose mode, see Techniques to Improve Accuracy for common sources of
inaccuracy and ways to improve SLAM accuracy.
Progress information display, specified as [], 1, 2, or 3. When log files are created, their paths are displayed in the command window.
| Verbose value | Display description | Display location |
| --- | --- | --- |
| [] or false | Display is turned off | — |
| 1 or true | Stages of vSLAM execution | Command window |
| 2 | Stages of vSLAM execution, with details on how each frame is processed, such as the artifacts used to initialize the map | The MATLAB® command window displays a link to the log file |
| 3 | Stages of vSLAM execution, artifacts used to initialize the map, poses and map points before and after bundle adjustment, and loop closure optimization data | The MATLAB command window displays a link to the log file |
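For example, this sketch enables the most detailed logging level when constructing a monocular vSLAM object. The intrinsic parameter values are placeholders; substitute your own calibration results.

```matlab
% Camera intrinsics from a prior calibration (placeholder values).
intrinsics = cameraIntrinsics([800 800],[640 360],[720 1280]);

% Create the vSLAM object with the most detailed diagnostics (level 3).
% The MATLAB command window displays a link to the generated log file.
vslam = monovslam(intrinsics,Verbose=3);
```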
This table lists some of the most common messages and root causes you may
encounter.
| Verbose Message | Root Cause | Parameters to Tune |
| --- | --- | --- |
| Not enough matched points with frame {K}. NumMatchedPoints=X is less than MinNumPoints=Y; Not enough feature points in frame {K}. NumMatchedPoints=X is less than MinExtractedPoints=Y; Not enough world points in frame {K}. NumWorldPoints=X is less than MinWorldPoints=Y | Occasionally, the count of tracked features or points may fall below a critical threshold, resulting in initialization failures or a loss of tracking. This issue can arise from factors such as inadequate image quality, abrupt variations in brightness, or rapid movements. | To mitigate this problem, consider extracting a larger number of 2-D features or reducing the number of frames skipped between each pair of keyframes. |
| Loop not closed. All loop candidates were rejected | Loop closure failures typically arise from two primary factors: the loop closure threshold may be set too high, resulting in missed matches, or the bag of words used may not be well suited to the input data. | Gradually lower the loop closure threshold to improve results, but be cautious not to set it too low, as this may lead to false positives. When other methods do not yield sufficient performance, consider generating a new bag-of-words model using data from a camera sensor with characteristics similar to the target sensor. |
Achieving high accuracy in visual SLAM is challenging because errors can arise from
many sources. Issues with sensor calibration, data association, or environmental
complexity can all lead to drift or inaccurate maps. Understanding where these
inaccuracies originate is the first step toward improving system performance. The
accuracy of SLAM systems can be affected by several factors, including:
Camera Calibration — Inaccurate camera calibration, such as errors in the intrinsic parameters, can lead to incorrect pose estimation and mapping results.
SLAM Initialization — If the system cannot extract or reliably match enough visual features between the initial frames, it may struggle to track motion or build a consistent map.
Tracking and Keyframe Management — Tracking can be lost due to factors such as motion blur, fast camera movements, or scenes with few distinctive visual features.
Loop Closure — A missed loop closure can occur if the system fails to recognize that it has revisited a location (a false negative) or incorrectly detects a loop closure when it has not (a false positive). In both cases, accumulated errors in the position estimate may not be properly corrected.
Visual-Inertial SLAM (Sensor Fusion) — Poor fusion of camera and IMU data is often caused by inaccurate IMU calibration or incorrect noise models.
Techniques to Improve Accuracy
Improving SLAM accuracy involves optimizing several key components of the system. This
section outlines techniques such as camera calibration, initialization, tracking and
keyframe management, loop closure, and visual-inertial sensor fusion, each contributing
to more reliable and precise mapping and localization.
Camera Calibration Accuracy
Accurate camera calibration is essential in SLAM because it ensures precise
mapping of 3-D environments and reliable pose estimation. A camera calibration is
accurate when the reprojection error is low, typically below one pixel, and remains
evenly distributed across all images. Undistorting images that contain straight
lines should preserve their straightness, with no bending or structured artifacts.
The calibration should also perform reliably in downstream tasks such as pose
estimation or SLAM and should not introduce curvature, drift, or scale inconsistencies.
Camera intrinsic parameters describe how a camera projects 3-D world points onto a 2-D image plane, and include the focal length, principal point, and lens distortion coefficients. Accurate intrinsic parameters are critical for 3-D reconstruction, visual SLAM, and image undistortion.
To improve calibration accuracy, capture a sufficient number of high-quality images that provide diverse views and full image coverage of the calibration pattern. For image capture guidelines, see Prepare Camera and Capture Images for Camera Calibration.
Image quality directly affects the accuracy of SLAM and other feature-based
computer vision tasks. Lens distortion can bend straight lines and bias feature
detection, reducing geometric consistency across images. To improve image
quality and downstream accuracy, undistort images using the camera parameters
obtained from calibration. Image undistortion removes lens distortion and
preserves scene geometry, enabling feature detection and matching with corrected
images.
Use the undistortImage function with
the distorted image and the cameraParameters object to generate an undistorted image. This
correction ensures that straight lines in the real world appear straight in the
image.
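As a minimal sketch, the calibration and undistortion workflow might look like the following. The folder name and checkerboard square size are placeholder assumptions.

```matlab
% Calibrate from checkerboard images (folder name and square size are
% placeholders for this sketch).
imds = imageDatastore("calibrationImages");
[imagePoints,boardSize] = detectCheckerboardPoints(imds.Files);
squareSize = 25; % checkerboard square size in millimeters
worldPoints = generateCheckerboardPoints(boardSize,squareSize);

I = readimage(imds,1);
params = estimateCameraParameters(imagePoints,worldPoints, ...
    ImageSize=size(I,[1 2]));

% Remove lens distortion so straight lines in the scene stay straight.
J = undistortImage(I,params);
imshowpair(I,J,"montage")
```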
Fisheye images introduce strong lens distortion that can reduce feature
tracking and geometric consistency. To improve SLAM accuracy and compatibility,
convert fisheye images to a standard pinhole model using the undistortFisheyeImage function with parameters estimated from
fisheye calibration. This conversion generates new intrinsic parameters
corresponding to the equivalent pinhole camera model. Providing these
undistorted images and updated parameters to the monovslam object ensures compatibility with algorithms designed
for pinhole cameras while maintaining accurate feature detection and
tracking.
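A sketch of the fisheye-to-pinhole conversion, assuming fisheyeParams holds the result of a prior fisheye calibration; the image file name is a placeholder.

```matlab
% Undistort a fisheye frame and obtain equivalent pinhole intrinsics.
I = imread("fisheyeFrame.png");  % placeholder file name
[J,pinholeIntrinsics] = undistortFisheyeImage(I,fisheyeParams.Intrinsics);

% Feed the undistorted frames and the new intrinsics to monovslam,
% which expects a pinhole camera model.
vslam = monovslam(pinholeIntrinsics);
```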
SLAM initialization establishes the first reference frame and creates the initial
3-D map of the environment. During this phase, the system detects and matches visual
features to estimate the camera’s pose and the positions of scene keypoints. A good
initialization means that the position of the camera is stable and does not jump
around unexpectedly. It also requires a non-degenerate baseline between the first
keyframes, which means the camera must move enough so that the 3-D structure of the
scene can be estimated clearly. In addition, the points in the map should be
well-triangulated, with positive depth and enough parallax for accurate
reconstruction. If the first few frames can be tracked reliably and the map does not
quickly collapse, change shape, or scale incorrectly, then the initialization is
sufficient for normal SLAM operation to continue.
The initialization process differs primarily based on how depth information is obtained:
Monocular SLAM — Depth must be inferred from motion. The system
analyzes feature correspondences across several frames to estimate
relative depth through parallax. This process introduces scale
ambiguity, meaning the map is created up to an unknown scale until
additional data (e.g., IMU or a known object size in the scene)
resolves it. Parallax, the apparent shift in
the position of objects when the camera moves sideways, is essential
for estimating depth and constructing an accurate map. During
initialization, insufficient camera motion can reduce parallax,
making it difficult to estimate depth reliably. These early errors
can persist throughout the mapping process, affecting overall
accuracy.
Stereo or RGB-D SLAM — These systems have immediate access to
metric depth information, either through disparity computation
(stereo) or a depth sensor (RGB-D), allowing initialization with
absolute scale and improved robustness.
The initialization stage
relies heavily on feature extraction and matching to estimate the initial camera
pose and the 3-D positions of keypoints. Image resolution directly affects this
process by determining how many distinctive features can be detected and matched
across frames. Finding the right balance between feature richness and processing
speed is essential for stable and efficient initialization.
Use an image resolution between 480x640 (SD) and 1080x1920 (HD) and adjust the tuning parameters accordingly. These tuning parameters are typically specified as name-value arguments of SLAM objects such as monovslam, stereovslam, or rgbdvslam.
MaxNumPoints — Controls the number of ORB keypoints extracted from each frame. Higher values improve map density and matching reliability but result in more computation.
ScaleFactor — Determines the scale step between pyramid levels during feature extraction. Smaller values produce more pyramid levels, increasing scale invariance and matching robustness at the cost of speed.
NumLevels — Defines the number of pyramid levels for feature detection. More levels improve robustness to scale changes but result in more computation.
Recommended MaxNumPoints values for different resolutions:

| Resolution | MaxNumPoints | Characteristics |
| --- | --- | --- |
| Low (~480x640) | 1000 | Fewer, less distinctive features; fast processing but low robustness |
| Medium (~720x1280) | 2000 | Moderate feature density and distinctiveness; balanced accuracy and speed |
| High (~1080x1920) | 2000-3000 | Rich, detailed features; slower initialization due to a greater number of computations |
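For example, a medium-resolution (~720x1280) sequence might be configured as in this sketch. The intrinsics object is a placeholder, and the ScaleFactor and NumLevels values are illustrative starting points rather than prescriptions.

```matlab
% Feature extraction tuned for a ~720x1280 sequence (sketch).
vslam = monovslam(intrinsics, ...
    MaxNumPoints=2000, ...  % moderate feature density
    ScaleFactor=1.2, ...    % finer pyramid steps for scale robustness
    NumLevels=8);           % more levels improve scale invariance
```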
These examples show the tuning of ScaleFactor and
NumLevels and their effect on the total number of matches
and runtime. Runtime can vary based on your hardware configuration.
For stereo visual SLAM, initialization relies on accurate disparity estimation between the left and right camera images to reconstruct depth from pixel correspondences. The DisparityRange name-value argument of the stereovslam object, specified as a two-element array [minDisparity maxDisparity], defines the valid pixel range used during this stereo matching process.
The disparity range directly affects both the quality of depth reconstruction
and the computational efficiency of the initialization process:
Range too narrow — Valid depth points are lost, resulting in
incomplete or noisy map reconstruction.
Range too wide — Computational cost increases and false
correspondences may occur, reducing map accuracy.
Best practice range — Select a range that fully spans the expected
depth variation of your environment. For guidance on choosing
appropriate values, refer to Choosing Range of Disparity.
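For instance, a stereo rig observing scenes a few meters away might be configured as in this sketch. The intrinsics, extrinsics, and range values are all placeholders to tune for your hardware.

```matlab
% Stereo vSLAM with an explicit disparity range (sketch; all inputs
% are placeholders for your calibrated stereo parameters).
vslam = stereovslam(intrinsics1,intrinsics2,rotationOfCamera2, ...
    translationOfCamera2, ...
    DisparityRange=[0 48]); % [minDisparity maxDisparity], in pixels
```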
Tracking and Keyframe Management
Tracking and keyframe management are critical components of SLAM systems. Tracking
estimates the camera's motion over time, while keyframes are selected frames that
capture significant changes in viewpoint and serve as stable reference points to
maintain map consistency and support robust localization. The methods for managing
tracking and keyframes are described in these techniques:
In monocular visual SLAM, tracking continuously estimates the camera pose by
detecting and matching visual features between the current frame and the
existing key frames. This process allows the system to localize the camera,
decide when to add new keyframes, and update the map with newly observed
features.
Stable tracking depends on maintaining a sufficient number of reliable feature
correspondences across frames. If tracking is lost, mapping halts and
relocalization may be required. Tracking behavior and keyframe selection are
primarily controlled by the SkipMaxFrames and
TrackFeatureRange name-value arguments, which can be
configured by the monovslam, stereovslam, or rgbdvslam object.
SkipMaxFrames — Defines the maximum number of
frames that can be skipped before forcing a new keyframe. Lower
values are recommended for sequences with fast or irregular motion.
If videos are not recorded at 30 fps or have already been downsampled, consider reducing the value of SkipMaxFrames.
| Frame Rate/Motion Scenario | SkipMaxFrames | Characteristics |
| --- | --- | --- |
| Slow or static motion | ~20 | Skips more frames between keyframes to improve speed when motion is minimal. Safe for static or slow sequences. Excessive skipping during motion can cause drift. |
| Moderate motion/handheld | 10-15 | Balances performance and robustness. Maintains consistent localization with manageable computational load. |
| Fast or abrupt motion | 5-10 | Reduces skipped frames to maintain robustness during rapid camera movement. Increases computational load but prevents tracking failure. |
TrackFeatureRange — Specifies the lower and upper limits for the number of tracked points required for keyframe creation, and helps control the rate of new keyframe insertion. The lower limit should be in the range [30,50]. The upper limit should be approximately 15% of the MaxNumPoints value.
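Putting these together, a fast handheld 30 fps sequence might be configured as in this sketch. The intrinsics object is a placeholder, and the upper TrackFeatureRange limit follows the 15%-of-MaxNumPoints guideline (0.15*2000 = 300).

```matlab
% Keyframe management tuned for fast or abrupt motion (sketch).
vslam = monovslam(intrinsics, ...
    MaxNumPoints=2000, ...
    SkipMaxFrames=10, ...          % skip fewer frames during fast motion
    TrackFeatureRange=[40 300]);   % lower limit in [30,50]; upper ~15% of MaxNumPoints
```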
The checkStatus enumeration provides diagnostic feedback
during runtime, indicating the health of the tracking process. Use these
messages to identify issues such as insufficient feature matches or complete
tracking loss, and adjust parameters like MaxNumPoints,
SkipMaxFrames, or feature extraction settings as
needed.
| checkStatus | Definition | Recommended Action |
| --- | --- | --- |
| TrackingLost | Too few reliable correspondences exist. The number of tracked feature points in the current frame is below the lower limit set by TrackFeatureRange, indicating that the image does not contain enough features or that the camera is moving too fast. | Increase the upper limit of TrackFeatureRange, decrease the SkipMaxFrames value to add keyframes more frequently, or both. |
| TrackingSuccessful | Tracking is successful. The number of tracked feature points in the current frame is between the lower and upper limits set by TrackFeatureRange. | Continue mapping. |
| FrequentKeyFrames | Tracking adds keyframes too frequently. The number of tracked feature points exceeds the upper limit of TrackFeatureRange. | Increase the lower limit of TrackFeatureRange so that keyframes are not inserted too frequently, or reduce MaxNumPoints to limit feature density. |
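A processing loop might poll checkStatus after each frame, as in this sketch. The imageDatastore folder name is a placeholder, and the comparison assumes the returned enumeration member can be compared against its name as a string.

```matlab
% Monitor tracking health while adding frames (sketch; vslam is an
% existing monovslam object).
imds = imageDatastore("sequenceFrames"); % placeholder folder
while hasdata(imds)
    I = read(imds);
    addFrame(vslam,I);
    if checkStatus(vslam) == "TrackingLost"
        % Too few tracked points: widen TrackFeatureRange or
        % decrease SkipMaxFrames.
        disp("Tracking lost on current frame.")
    end
end
```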
Loop Closure
Loop closure is a process in SLAM that detects when the camera revisits a
previously mapped area. By recognizing these revisits, the system can correct
accumulated drift and refine both the trajectory and the map, ensuring consistency.
Loop closure typically runs in the background using feature-based place recognition,
matching visual features from the current view against those from past keyframes.
Effective loop closure significantly improves the accuracy and robustness of SLAM in
large or repeatedly traversed environments.
LoopClosureThreshold — Sets the similarity
threshold for confirming a loop closure between keyframes.
CustomBagOfFeatures — Custom bag of words (BoW)
vocabulary for loop closure detection. Using this argument requires
a pre-trained BoW vocabulary.
| Argument | Purpose | Sensitivity | Best Practice |
| --- | --- | --- | --- |
| CustomBagOfFeatures | Define a custom bag-of-words (BoW) vocabulary to improve place recognition during loop closure. | Using an untrained or generic vocabulary may cause missed matches or false positives, especially in scenes with repetitive textures or unique lighting. | Train a BoW vocabulary on representative images from the target environment using the bagOfFeaturesDBoW object (based on DBoW2). A well-trained vocabulary improves loop closure detection reliability and reduces false matches. |
| LoopClosureThreshold | Set the similarity score threshold for confirming a loop closure candidate. | If set too high, the system may miss valid loop closures; if set too low, it increases the risk of incorrect matches and map distortion. | Start with the default threshold, then adjust: increase it in feature-rich environments to reduce false positives, or decrease it in low-texture scenes to avoid missed closures. When increasing MaxNumPoints, raise this threshold proportionally. |
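The two arguments might be combined as in this sketch. The image folder and threshold value are placeholders, and the constructor call assumes bagOfFeaturesDBoW accepts an imageDatastore of training images.

```matlab
% Train a custom DBoW vocabulary from representative environment
% images (sketch; folder name is a placeholder).
trainImds = imageDatastore("environmentImages");
bag = bagOfFeaturesDBoW(trainImds);

% Use the custom vocabulary and a tuned similarity threshold.
vslam = monovslam(intrinsics, ...
    CustomBagOfFeatures=bag, ...
    LoopClosureThreshold=75);  % placeholder value; adjust as described above
```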
Visual-Inertial SLAM (Sensor Fusion)
Visual-inertial SLAM uses both camera and IMU data to improve motion tracking. By
combining these measurements, the system stays accurate even during rapid motion or
challenging visual conditions, where feature extraction degrades. Key techniques for leveraging IMU data and optimizing its integration include the following:
Incorporating IMU data into SLAM improves robustness by providing continuous motion information, which helps maintain accurate tracking during rapid movements, which can produce blurry images, and in texture-poor environments. IMU measurements supply accelerations and angular velocities at high rates, filling gaps between camera frames and compensating for visual ambiguities.
Visual–Inertial SLAM (VI-SLAM) combines camera and IMU data to achieve robust
localization and mapping, even in challenging environments. The fusion of visual
camera and inertial IMU sensor data provides scale information, stabilizes
tracking, and improves accuracy during fast motion or visual degradation.
Achieving precise results requires careful calibration, parameter tuning, and
initialization.
To enable visual-inertial fusion, you must configure a factorIMUParameters (Navigation Toolbox) object that stores IMU-specific parameters
such as the sampling rate, sensor noise characteristics, and biases for both the
accelerometer and gyroscope. These parameters are typically provided by the IMU
sensor manufacturer. However, if they are not available, you can estimate them
using techniques such as Allan variance analysis. For example, by using the
allanvar (Navigation Toolbox) function. For an example
that uses this function, see Inertial Sensor Noise Analysis Using Allan Variance (Navigation Toolbox).
In monovslam, IMU noise values are expected as covariances rather
than standard deviations. If the manufacturer provides random walk standard
deviations (or if you obtain them from Allan variance analysis), square them to
convert to covariance. This differs from some open-source frameworks, which
typically use standard deviations.
For example, if the gyroscope bias random walk standard deviation is 1e-4 rad/s, then:

GyroscopeBiasNoise = (1e-4)^2 = 1e-8 (rad/s)^2
You must ensure that all arguments are specified in the correct units:
GyroscopeBiasNoise — (rad/s)^2
AccelerometerBiasNoise — (m/s^2)^2
GyroscopeNoise — (rad/s)^2
AccelerometerNoise — (m/s^2)^2
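The squaring step can be made explicit in code, as in this sketch. The sample rate and random-walk values are placeholder datasheet numbers.

```matlab
% Convert random-walk standard deviations to the covariances expected
% by factorIMUParameters (values are placeholder datasheet numbers).
gyroBiasRW  = 1e-4; % rad/s
accelBiasRW = 1e-3; % m/s^2

imuParams = factorIMUParameters( ...
    SampleRate=100, ...                      % IMU rate in Hz
    GyroscopeBiasNoise=gyroBiasRW^2, ...     % (1e-4)^2 = 1e-8 (rad/s)^2
    AccelerometerBiasNoise=accelBiasRW^2);   % (1e-3)^2 = 1e-6 (m/s^2)^2
```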
A well-tuned IMU parameter set improves sensor fusion accuracy, reduces drift,
and ensures consistent pose estimation across long sequences.
IMU initialization in monocular visual-inertial SLAM involves estimating both
gravity rotation and pose scale. These steps are essential to resolve the scale
ambiguity inherent in monocular vision and to align inertial and visual data
within a consistent reference frame.
Gravity rotation — The gravity rotation estimation aligns the
inertial measurements with the visual data, ensuring the orientation
of the system reflects the true gravitational direction. This
alignment is essential for accurate motion estimation because
accelerometer readings include the constant acceleration due to
gravity, which does not represent actual motion and must be removed
before sensor fusion.
Since the input pose reference frame may not match the IMU local
navigation frame, typically North–East–Down (NED) or East–North–Up
(ENU), in which the gravity direction is known, it is necessary to
transform the estimated camera poses to the local navigation frame
to remove the known gravity effect. The estimated rotation provides
this transformation, aligning the input pose reference frame to the
IMU local navigation reference frame.
The estimated gravity alignment is returned in the GravityRotation property. When this alignment is successfully estimated, the IsIMUAligned property is set to true.
Pose scale — Pose scale estimation determines the real-world metric scale of the scene, enabling accurate and drift-free 3-D reconstruction and trajectory estimation.
For monocular systems, estimating the pose scale is necessary
because the real-world scale of the scene cannot be directly
inferred from images alone. By leveraging inertial data, the system
can resolve this scale ambiguity, resulting in more accurate and
reliable mapping and localization.
The estimated scale factor is available in the IMUScale property.
Together, the gravity rotation and pose scale estimations enable
the system to produce metrically accurate 3-D reconstructions and trajectories.
It is important to note that camera-IMU fusion cannot proceed if the IMU
initialization is not successful.
The animation illustrates the effects of properly tuning the gravity rotation
and pose scale estimations by showing the SLAM trajectory before and after
alignment. After the estimation is applied, the trajectory plot is automatically
updated to reflect the path in the newly aligned reference frame, incorporating
the corrected (estimated) scale. This ensures that the visualized trajectory is
both spatially accurate and metrically consistent with the real-world
environment.
The monovslam, stereovslam, and
rgbdvslam SLAM objects automatically estimate gravity
rotation and pose scale using internally designed factor graphs. With sufficient data coverage and appropriate tuning of the NumPosesThreshold and AlignmentFraction name-value arguments, reliable initialization can be achieved with minimal user intervention.
Best Practices for a Successful Camera-IMU
Alignment:
The estimation of gravity rotation and pose scale is typically performed early
in the sequence once a sufficient number of camera poses have been collected.
For more information on this calibration technique, see Gravity Rotation and Pose Scale (Navigation Toolbox).
To obtain a reliable estimation of gravity rotation and pose scale, the
collected poses should satisfy several conditions:
Number of poses — Try to keep the number of poses under 30 for
most cameras and frame rates. A larger number increases drift, while
fewer than 10 poses may not provide enough information for a robust
estimation.
Motion diversity — Include rotation around all three axes.
Vertical translation — Incorporate upward motion (opposite the
gravity direction).
Pose accuracy — Ensure accurate camera pose estimates by tuning
SLAM name-value arguments, such as
TrackFeatureRange or
SkipMaxFrames.
The number of camera poses used for camera–IMU alignment is controlled by the
NumPosesThreshold and
AlignmentFraction name-value arguments. These settings
determine when the alignment process begins and how much of the available data
is used.
To perform accurate calibration between the camera and IMU, a sufficient
number of camera-only poses must be collected. The
NumPosesThreshold defines the minimum number of camera
poses required before alignment can begin, while
AlignmentFraction determines the proportion of the total
dataset to use during the alignment process.
In summary, these arguments help ensure that enough spatial and temporal
information is available to reliably align the camera and IMU data streams. Each
plays a distinct role in the calibration process:
NumPosesThreshold — Number of estimated camera poses required to trigger IMU alignment. Choosing an appropriate threshold is critical: a value set too low may not provide enough data for accurate calibration, while a value set too high can introduce drift and noise from accumulated pose errors. Try to keep the number of poses under 30 for most cameras and frame rates, and use at least 10.
AlignmentFraction — Subset of the most recent poses used for alignment, specified as a scalar in the range (0,1]. The number of poses considered is calculated as round(NumPosesThreshold*AlignmentFraction). This filters out initial, potentially noisy pose estimates, ensuring that only the most relevant data contributes to the alignment for improved accuracy.
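As a sketch, the alignment settings might be passed alongside the IMU parameters. This assumes the monovslam syntax that accepts a factorIMUParameters object for visual-inertial SLAM; all values are placeholders.

```matlab
% Camera-IMU alignment configuration (sketch). With these values,
% round(20*0.8) = 16 of the most recent poses are used for alignment.
vslam = monovslam(intrinsics,imuParams, ...
    NumPosesThreshold=20, ...
    AlignmentFraction=0.8);
```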
Key Takeaways for Improving SLAM Accuracy
Achieving robust and accurate SLAM depends on careful tuning and validation. After
setting parameters for camera calibration, initialization, tracking, loop closure, and
IMU fusion, validate your system by visualizing trajectories, checking for drift, and
confirming that loop closures and IMU alignment occur consistently. To compare estimated
trajectories against ground truth, you can use the compareTrajectories function.
Use the diagnostic messages, mapping visualizations, and performance metrics to
identify weak points in the processing of your data and environment. Adjust parameters
as needed until tracking remains stable under varying motion, lighting, and
environmental conditions.
Improving SLAM accuracy is an iterative process that combines precise sensor
calibration, thoughtful parameter tuning, and validation against real-world data. By
systematically refining your configuration and verifying performance using the
visualization and diagnostic tools in the Computer Vision Toolbox and the Navigation Toolbox™, you can achieve high-accuracy, real-time SLAM suitable for robotics, AR,
and autonomous navigation applications.