EgoLifter: Open-world 3D Segmentation for Egocentric Perception

Preprint

¹University of Toronto, ²Meta Reality Labs

EgoLifter takes an egocentric video as input and lifts 2D instance segmentation results to 3D by contrastive learning. The results are 3D Gaussian models that can be decomposed into individual object instances by querying or clustering.

Interactive 3D Segmentation
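The decomposed representation supports interactive use: clicking a pixel yields a query feature, and the Gaussians whose instance features match it form the selected object. Below is a minimal PyTorch sketch of this querying step; the cosine-similarity test and the threshold value are illustrative assumptions, not the exact procedure used in the paper.

import torch
import torch.nn.functional as F

def query_object(gaussian_features, query_feature, threshold=0.9):
    """Select the Gaussians whose instance features match a query feature.

    gaussian_features: (N, D) learned per-Gaussian instance features
    query_feature:     (D,)   feature at a clicked pixel (e.g., sampled
                              from a rendered feature map)
    Returns a boolean mask over the N Gaussians forming the queried object.
    """
    sims = F.cosine_similarity(
        gaussian_features, query_feature.unsqueeze(0), dim=-1
    )
    return sims > threshold

Alternatively, running a clustering algorithm over the per-Gaussian features yields a full decomposition of the scene into object instances without any query.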

Abstract

In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically designed for egocentric data, where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects, and uses segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible and promptable definitions of object instances, free of any specific object taxonomy. To handle the challenge of dynamic objects in egocentric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D reconstruction. The result is a fully automatic pipeline that reconstructs 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates EgoLifter's state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We also run EgoLifter on several egocentric activity datasets, demonstrating the promise of the method for 3D egocentric perception at scale.

Approach

EgoLifter solves 3D reconstruction and open-world segmentation simultaneously from egocentric videos. It augments 3D Gaussian Splatting with instance features and lifts open-world 2D segmentation to 3D by contrastive learning, in which 3D Gaussians belonging to the same object are trained to have similar features. In this way, EgoLifter solves the multi-view mask association problem and establishes a consistent 3D representation that can be decomposed into object instances.
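To make the lifting step concrete, below is a minimal PyTorch sketch of a contrastive objective over rendered features: pixel pairs that fall inside the same SAM mask are pulled together, and pairs from different masks are pushed apart. The pair-sampling scheme, the binary cross-entropy form, and the temperature are illustrative assumptions; the paper's exact loss may differ.

import torch
import torch.nn.functional as F

def contrastive_lifting_loss(feat, mask_id, num_pairs=4096, temperature=0.1):
    """Pull rendered features of pixels in the same SAM mask together and
    push features of pixels from different masks apart.

    feat:    (H, W, D) per-pixel features rendered from the 3D Gaussians
    mask_id: (H, W)    integer SAM mask id for each pixel
    """
    H, W, D = feat.shape
    flat_feat = F.normalize(feat.reshape(-1, D), dim=-1)
    flat_id = mask_id.reshape(-1)

    # Sample random pixel pairs.
    idx_a = torch.randint(0, H * W, (num_pairs,), device=feat.device)
    idx_b = torch.randint(0, H * W, (num_pairs,), device=feat.device)

    # Cosine similarity between paired features, scaled by a temperature.
    sim = (flat_feat[idx_a] * flat_feat[idx_b]).sum(-1) / temperature
    same = (flat_id[idx_a] == flat_id[idx_b]).float()

    # Binary contrastive objective: high similarity for same-mask pairs,
    # low similarity for cross-mask pairs.
    return F.binary_cross_entropy_with_logits(sim, same)

Because supervision comes only from pairwise same/different relations within each image, no mask identities need to be tracked across views; the shared 3D Gaussians make the learned features multi-view consistent.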

Naive 3D reconstruction from egocentric videos produces many "floaters" and leads to blurry rendered images and erroneous instance features (bottom right). EgoLifter tackles this problem with a transient prediction network, which predicts a per-pixel probability mask of transient objects in the image and uses it to guide the reconstruction process. In this way, EgoLifter obtains a much cleaner reconstruction of the static background in both RGB and feature space (top right), which in turn leads to better object decomposition of 3D scenes.
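One simple way to realize this idea is to down-weight the photometric loss wherever the transient network predicts dynamic content, so that only static regions drive the 3D Gaussian fit. The sketch below assumes a predicted per-pixel transient probability, an L1 photometric term, and a sparsity regularizer; the exact formulation in the paper may differ.

import torch

def transient_weighted_loss(rendered, target, transient_prob, reg_weight=0.01):
    """Down-weight the photometric loss where the transient network predicts
    dynamic content, with a regularizer that discourages marking every pixel
    as transient.

    rendered, target: (H, W, 3) rendered and ground-truth RGB images
    transient_prob:   (H, W)    predicted probability that a pixel is transient
    """
    l1 = (rendered - target).abs().mean(-1)        # per-pixel photometric error
    recon = ((1.0 - transient_prob) * l1).mean()   # static regions drive the fit
    reg = reg_weight * transient_prob.mean()       # keep the transient mask sparse
    return recon + reg

The regularizer matters: without it, the network could trivially mark every pixel as transient and zero out the reconstruction loss.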


Experiments

Egocentric videos capture frequent human-object interactions and thus contain a large amount of dynamic motion with challenging occlusions. EgoLifter is designed to provide useful scene understanding from egocentric data by extracting hundreds of different objects while remaining robust to sparse and rapid dynamics.

To demonstrate this, we compare the following variants in our experiments:

  • EgoLifter: our proposed full method.
  • EgoLifter-Static: a baseline with the transient prediction network disabled; a vanilla static 3DGS model is trained to reconstruct the scene.
  • EgoLifter-Dynamic: a baseline that uses a dynamic variant of 3DGS, instead of the transient prediction network, to handle the dynamics in the scene.
  • Gaussian Grouping: a concurrent work that also learns instance features in 3DGS. However, Gaussian Grouping uses a video object tracker to associate instance identities across frames, rather than contrastive learning.

Below are qualitative results of EgoLifter on the Aria Digital Twin (ADT) dataset; quantitative evaluation and further analysis can be found in the paper. Note that the baseline leaves ghostly floaters in the regions of transient objects, whereas EgoLifter filters them out and produces a cleaner reconstruction of both RGB images and feature maps.

Below is a qualitative comparison with Gaussian Grouping. Note that EgoLifter produces a cleaner feature map, likely because our contrastive loss learns more cohesive identity features than the classification loss used in Gaussian Grouping.

More Qualitative Results

Below are more qualitative results of EgoLifter on the Ego-Exo4D and Aria Everyday Activities datasets.

BibTeX

@article{gu2024egolifter,
  author    = {Gu, Qiao and Lv, Zhaoyang and Frost, Duncan and Green, Simon and Straub, Julian and Sweeney, Chris},
  title     = {EgoLifter: Open-world 3D Segmentation for Egocentric Perception},
  journal   = {arXiv preprint arXiv:2403.18118},
  year      = {2024},
}