Introduction

Loopy-SLAM is a dense RGBD SLAM system equipped with loop closing. This system inherits the point-based scene encoding of Point-SLAM, since point-based representations are especially suitable for performing map corrections.

Recent online dense 3D reconstruction methods are either coupled (using the same representation for tracking and mapping) or decoupled (using independent frameworks). While decoupled methods currently achieve better tracking accuracy, they suffer from data redundancy since tracking is performed independently of the estimated dense map.

In terms of loop closure and map corrections, most coupled methods implement only frame-to-model tracking, which leads to significant camera drift on noisy real-world data and results in corrupted maps. Decoupled methods use multi-resolution hash grids, which are not easily transformable for map corrections and therefore require expensive gradient-based updates and storage of the input frames.

Point-based representations are especially suitable for map corrections, as they can be transformed quickly and independently.

The main contribution of this work is a direct way of implementing loop closure for dense neural SLAM that does not require any gradient updates of the scene representation. The authors also introduce a feature fusion strategy for the submaps in overlapping regions to avoid visible seams.

Related Work

Dense Visual SLAM and Online Mapping

Test-time optimization techniques have gained popularity due to differentiable renderers that enable effective reprojection error minimization. Inspired by Neural Radiance Fields, these methods have evolved into full dense SLAM pipelines that use a coupled scene representation for both mapping and tracking. The inherent coupling of these tasks suggests they should be addressed together.

This work builds upon the Point-SLAM framework, which is particularly suited for loop closure due to its simple, point-based scene representation. This allows for map corrections without needing to reintegrate each frame, avoiding the resource demands of storing the entire history of input frames in larger scenes.

Loop Closure on Dense Maps

To achieve globally consistent maps, many dense SLAM methods divide the map into submaps that are rigidly registered using pose graph optimization. This step is sometimes followed by global bundle adjustment for refinement. Loopy-SLAM also splits the map into submaps and employs online pose graph optimization.

Among recent dense neural SLAM approaches, Orbeez-SLAM and NEWTON follow a decoupled strategy: they integrate ORB-SLAM2 for tracking and use multi-resolution hash grids for the map. GO-SLAM also takes a decoupled approach, extending DROID-SLAM for online loop closure and coupling it with a map via Instant-NGP.

A common limitation among these hash grid-based methods is that they require additional training iterations for map corrections and necessitate storing all input frames, limiting scalability. In contrast, Loopy-SLAM rigidly aligns submaps without these restrictions.

Method

Neural Point Cloud-based SLAM

Building on Point-SLAM, Loopy-SLAM uses the same neural point cloud-based scene representation. The difference is that the neural point cloud is redefined as a set of submaps indexed by $s$, each containing a neural point cloud $P_s$ of $N$ neural points.

$P_s = \{ (p_i, f^{s, g}_i, f^{s, c}_i) \mid i = 1, \dots, N \}$

Each neural point contains its position $p_i$ together with a geometric feature descriptor $f^{s, g}_i$ and a color feature descriptor $f^{s, c}_i$.
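As a rough illustration of this representation, the sketch below stores one submap as three arrays and applies a rigid correction to the positions only. The class name, the method name, and the 32-dimensional feature size are assumptions made for this example and are not taken from the authors' code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class NeuralPointSubmap:
    """One submap P_s: N points, each with a position and two feature vectors."""

    positions: np.ndarray       # (N, 3) point locations p_i
    geo_features: np.ndarray    # (N, D) geometric descriptors f^{s,g}_i
    color_features: np.ndarray  # (N, D) color descriptors f^{s,c}_i

    def rigidly_transformed(self, T: np.ndarray) -> "NeuralPointSubmap":
        """Apply a 4x4 rigid transform to the positions; features are reused as-is."""
        R, t = T[:3, :3], T[:3, 3]
        return NeuralPointSubmap(
            self.positions @ R.T + t, self.geo_features, self.color_features
        )


# Hypothetical sizes: 4 points with 32-dimensional features.
rng = np.random.default_rng(0)
submap = NeuralPointSubmap(
    rng.normal(size=(4, 3)), rng.normal(size=(4, 32)), rng.normal(size=(4, 32))
)

# A correction that shifts the whole submap by 0.5 along x.
T = np.eye(4)
T[0, 3] = 0.5
corrected = submap.rigidly_transformed(T)
```

Because only the position array changes, a map correction of this kind needs no gradient step on the features, which is the property the point-based representation is chosen for.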

Mapping and tracking are performed on the active submap – the most recently created one. The first frame of each submap serves as a global keyframe, defining the submap's pose in the global reference frame. Submaps grow progressively as points are added to the active submap using the dynamic resolution strategy from Point-SLAM.

Rendering also follows the Point-SLAM framework. Points are sampled along the ray based on the camera pose, and occupancies and colors are decoded using MLPs. To support map corrections, the Gaussian positional encodings are made learnable. They are optimized on the fly to handle point shifts during loop closure without expensive feature updates.

Tracking and mapping are alternated on the active submap. The basic methodology is the same as in Point-SLAM.

A new submap is created when a new global keyframe is added. Global keyframe selection is based on camera motion: a new global keyframe is added when the rotation or translation exceeds a threshold. Each new submap is initialized by projecting the previous submap's neural point cloud into the new global keyframe, which speeds up the mapping process.
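The motion-based trigger can be sketched as follows. This is a minimal example that assumes 4x4 camera-to-world poses and measures motion relative to the current global keyframe; the function name and the threshold values are hypothetical and not taken from the source.

```python
import numpy as np


def needs_new_submap(T_anchor, T_current, max_rot_deg=20.0, max_trans=0.5):
    """Return True when the camera has moved far enough from the anchor pose.

    T_anchor is the pose of the current global keyframe, T_current the latest
    tracked pose. Both thresholds are illustrative values only.
    """
    T_rel = np.linalg.inv(T_anchor) @ T_current
    # Rotation angle recovered from the trace of the relative rotation matrix.
    cos_angle = np.clip((np.trace(T_rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_deg = np.degrees(np.arccos(cos_angle))
    trans = np.linalg.norm(T_rel[:3, 3])
    return bool(rot_deg > max_rot_deg or trans > max_trans)


anchor = np.eye(4)

small_step = np.eye(4)
small_step[:3, 3] = [0.1, 0.0, 0.0]   # 0.1 units of translation

large_step = np.eye(4)
large_step[:3, 3] = [0.6, 0.0, 0.0]   # 0.6 units of translation

angle = np.radians(30.0)              # pure 30 degree rotation about z
turned = np.eye(4)
turned[:2, :2] = [[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle), np.cos(angle)]]
```

With these illustrative thresholds, `small_step` keeps the active submap, while `large_step` and `turned` would each start a new one.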

Within each submap, local keyframes are generated at regular intervals to constrain mapping, similar to previous methods but on a per-submap basis. These local keyframes are deleted when a new submap is initialized to manage computational resources efficiently.

Loop Closure and Refinement

At the end of each submap creation, the loop closure system is triggered. First, global place recognition is performed: every time a global keyframe is created, it is added to a bag-of-words (BoW) database, which is queried for loop candidates.
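To make the place recognition step concrete, here is a toy BoW database. It assumes each global keyframe is summarized by a histogram of visual-word counts and that candidates are ranked by cosine similarity. The class, its methods, and the score threshold are hypothetical; a real system would typically rely on a trained visual vocabulary and a dedicated library.

```python
import numpy as np


class BowDatabase:
    """Toy place-recognition database over bag-of-words histograms."""

    def __init__(self):
        self.keyframe_ids = []
        self.descriptors = []

    def add(self, keyframe_id, word_histogram):
        """Store the L2-normalized histogram of a global keyframe."""
        v = np.asarray(word_histogram, dtype=float)
        self.keyframe_ids.append(keyframe_id)
        self.descriptors.append(v / np.linalg.norm(v))

    def query(self, word_histogram, min_score=0.8):
        """Return (keyframe_id, score) pairs with score >= min_score, best first."""
        if not self.descriptors:
            return []
        q = np.asarray(word_histogram, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.descriptors) @ q   # cosine similarities
        order = np.argsort(-scores)
        return [
            (self.keyframe_ids[i], float(scores[i]))
            for i in order
            if scores[i] >= min_score
        ]


db = BowDatabase()
db.add(0, [5, 0, 1, 0])   # keyframe 0 mostly sees visual word 0
db.add(1, [0, 4, 0, 3])   # keyframe 1 sees different words
candidates = db.query([4, 0, 1, 0])
```

Here only keyframe 0 shares visual words with the query, so it is the only loop candidate returned.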

Then, the pose graph is built upon the global keyframes. The objective function of the pose graph optimization is as follows:

$E(\mathbb{T}, \mathbb{L}) = \sum\limits_s f(T_s, T_{s + 1}, I_s) + \lambda \left( \sum\limits_{s, t} l_{st} f(T_s, T_t, T_{st}) + \mu \sum\limits_{s, t} \left( \sqrt{l_{st}} - 1 \right)^2 \right)$

The first term encodes the odometry constraints between consecutive nodes. It measures the difference between the transformed points of submap $s$ and submap $s + 1$ using the identity transformation $I_s$.

The second term encodes the loop closure constraints. It measures the difference between the transformed points of submap $s$ and submap $t$ using the estimated relative transformation $T_{st}$, which is computed for the loop candidates returned by querying the BoW database.

The last term is the line process regularization term, which prevents the trivial solution in which all weights $l_{st}$ are zero. It encourages the weights $l_{st}$ to be close to 1, meaning that loop closure constraints are trusted by default. However, during optimization, if a loop closure constraint does not fit well (i.e., introduces high error), the optimization can reduce $l_{st}$ towards 0, effectively down-weighting or rejecting that constraint.
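The down-weighting behavior can be checked numerically. Assuming the regularizer has the form $\mu (\sqrt{l} - 1)^2$, the per-edge cost $l \cdot r + \mu (\sqrt{l} - 1)^2$ for a fixed residual $r$ is minimized where $r + \mu (\sqrt{l} - 1) / \sqrt{l} = 0$, which gives $l = (\mu / (\mu + r))^2$. The sketch below evaluates this closed form; the function names and the value of `mu` are chosen for illustration only.

```python
def optimal_line_weight(residual: float, mu: float) -> float:
    """Weight l that minimizes l * residual + mu * (sqrt(l) - 1) ** 2."""
    return (mu / (mu + residual)) ** 2


def line_process_cost(l: float, residual: float, mu: float) -> float:
    """Per-edge cost: weighted residual plus the line process penalty."""
    return l * residual + mu * (l ** 0.5 - 1.0) ** 2


mu = 1.0  # hypothetical value
for residual in (0.0, 0.1, 1.0, 10.0):
    weight = optimal_line_weight(residual, mu)
    print(f"residual={residual:5.1f} -> l={weight:.3f}")
```

A perfectly consistent loop edge keeps weight 1, and the weight decays towards 0 as the residual grows relative to `mu`.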

The function $f$ is the dense surface registration term. It is defined as the sum of squared distances between corresponding points in submaps $P_s$ and $P_t$.
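A minimal sketch of such a registration term is shown below. It assumes the correspondences are already known, so that row `i` of both arrays refers to the same surface point, and that each pose is a 4x4 matrix mapping submap coordinates into the global frame. The function name is hypothetical.

```python
import numpy as np


def registration_term(T_s, T_t, points_s, points_t):
    """Sum of squared distances between corresponding points of two submaps.

    Each submap's points are first moved into the global frame with its pose.
    points_s[i] and points_t[i] are assumed to be corresponding points.
    """
    def apply(T, pts):
        return pts @ T[:3, :3].T + T[:3, 3]

    diff = apply(T_s, points_s) - apply(T_t, points_t)
    return float(np.sum(diff ** 2))


rng = np.random.default_rng(0)
points_s = rng.normal(size=(3, 3))
# Submap t observes the same points, but its frame is offset by 0.1 along x.
points_t = points_s - np.array([0.1, 0.0, 0.0])

T_s = np.eye(4)
T_t_drifted = np.eye(4)            # pose estimate that ignores the offset
T_t_corrected = np.eye(4)
T_t_corrected[0, 3] = 0.1          # pose estimate that accounts for it

f_before = registration_term(T_s, T_t_drifted, points_s, points_t)
f_after = registration_term(T_s, T_t_corrected, points_s, points_t)
```

With three points each off by 0.1, the drifted pose gives a residual of about 0.03, and the corrected pose brings it to about zero.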

Finally, this objective function is optimized with Levenberg-Marquardt.

A few tricks make this loop closure system more robust. First, when performing the dense surface registration, the points are sampled from the TSDF fusion result to suppress the depth noise of individual points. Second, $T_{st}$ is computed with coarse-to-fine registration: the method of Rusu et al. provides the coarse alignment, and ICP on the full-resolution point cloud provides the fine alignment.
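ICP alternates between finding nearest-neighbor correspondences and solving for the rigid transform that best aligns them. The sketch below shows only that inner closed-form solve (the SVD-based Kabsch solution) for known correspondences. It is not the authors' pipeline, it omits the coarse feature-based stage, and the function name is hypothetical.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src[i] + t ~ dst[i]."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


# Synthetic check: rotate by 10 degrees about z, translate, then recover.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = np.radians(10.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.05])
dst = src @ R_true.T + t_true

R_est, t_est = best_rigid_transform(src, dst)
```

In a full ICP loop this solve would be repeated after each correspondence update, starting from the coarse alignment, until the estimate stops changing.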

After the loop closure is completed, the neural features of the overlapping regions are fused. This is done by averaging the features of the overlapping points. This prevents visible seams between submaps.
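The averaging step might look like the sketch below. It assumes both submaps are already expressed in a common frame and that a point pair counts as overlapping when the points lie within a small radius of each other. The pairing rule, the radius, and the function name are assumptions for illustration, since the text above only states that features in overlapping regions are averaged.

```python
import numpy as np


def fuse_overlapping_features(pos_a, feat_a, pos_b, feat_b, radius=0.02):
    """Average the features of point pairs that lie within `radius` of each other.

    Each point in A is paired with its nearest point in B (brute force). If
    several A points share one B point, the last pair decides B's feature.
    """
    dists = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    nearest_b = dists.argmin(axis=1)
    fused_a, fused_b = feat_a.copy(), feat_b.copy()
    for i, j in enumerate(nearest_b):
        if dists[i, j] < radius:
            mean = 0.5 * (feat_a[i] + feat_b[j])
            fused_a[i] = mean
            fused_b[j] = mean
    return fused_a, fused_b


pos_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pos_b = np.array([[0.01, 0.0, 0.0], [5.0, 0.0, 0.0]])
feat_a = np.array([[1.0, 1.0], [5.0, 5.0]])
feat_b = np.array([[3.0, 3.0], [4.0, 4.0]])

fused_a, fused_b = fuse_overlapping_features(pos_a, feat_a, pos_b, feat_b)
```

Only the first point of each submap overlaps here, so those two points end up with the same averaged feature while the other points keep their original features.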

Experiments

The experiments use the Replica, TUM-RGBD, and ScanNet datasets. Mesh quality is evaluated with the F-score, depth accuracy with depth L1, tracking with ATE RMSE, and rendering quality with PSNR, SSIM, and LPIPS. The baseline methods are ESLAM, Point-SLAM, and GO-SLAM.
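Two of these metrics have simple closed forms, sketched below under stated assumptions. The ATE RMSE function expects trajectories that are already time-associated and aligned, and it omits the alignment step that standard evaluation tools perform first. The PSNR function expects images with values in [0, 1]. The function names are hypothetical.

```python
import numpy as np


def ate_rmse(est_positions, gt_positions):
    """Root-mean-square translational error between two aligned trajectories."""
    errors = np.linalg.norm(est_positions - gt_positions, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    mse = np.mean((rendered - reference) ** 2)
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Four camera positions, each estimated 5 cm away from the ground truth.
gt_traj = np.zeros((4, 3))
est_traj = gt_traj + np.array([0.03, 0.04, 0.0])
tracking_error = ate_rmse(est_traj, gt_traj)

# A rendered image that is uniformly 0.1 brighter than the reference.
reference = np.zeros((8, 8, 3))
rendered = reference + 0.1
rendering_quality = psnr(rendered, reference)
```

Lower ATE RMSE indicates better tracking, while higher PSNR indicates better rendering.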

Reconstruction

The table above compares Loopy-SLAM to state-of-the-art dense RGBD neural SLAM methods in terms of geometric reconstruction accuracy. Loopy-SLAM outperforms all methods on the majority of scenes.

Tracking

Loopy-SLAM outperforms the existing methods on all scenes except one. The authors attribute this to robust frame-to-model local pose estimation coupled with pose graph optimization which globally aligns the submap frames.

When evaluated on real-world data from the TUM-RGBD dataset, Loopy-SLAM outperforms the existing neural SLAM methods on average, but it is still not as accurate as traditional methods such as ORB-SLAM2.

Rendering

Loopy-SLAM achieves the best rendering quality among all methods.

Conclusion

Loopy-SLAM is a dense RGBD SLAM system that utilizes submaps of neural point clouds for local mapping and tracking, along with loop closure. This point-based representation allows for efficient local map updates by shifting points. Also, this submap-based integration strategy offers better scalability.

As for limitations, a more robust tracker could be built by combining frame-to-model and frame-to-frame cues. More robust and faster registration could also be obtained by using not only 3D point features but also image features from the associated keyframes. The current implementation uses PyTorch and Open3D via Python bindings and is not optimized for real-time operation; a direct CUDA implementation could improve the runtime. Finally, the system does not implement relocalization, which is an important part of a robust SLAM system.