Diminished reality (DR) is the process of using technology for eliminating or diminishing objects that exist in our visual perception by using tools such as inpainting. It does not lie alongside the traditional immersive technologies, Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR), since it does not explicitly mix reality types. Nevertheless, diminished reality can be used in conjunction with AR to provide high-quality experiences to the users.
Increasing availability of commercial 360° cameras
Huge interest in emerging technologies namely AR
Need for novel applications such as interior space re-design
We aim to diminish objects in indoor spherical panoramas. On the one hand, the regional nature of our problem is similar to image inpainting, since part of the content needs to be photo-realistically hallucinated. On the other hand, the inherent ambiguity of generating content that hallucinates occluded areas necessitates drawing away from image inpainting, towards image-to-image translation, where only specific parts of the original image need to be preserved, and others need to be adapted to another domain. To this end, we approximate the DR task by translating a scene filled with objects to an empty, background only, scene.
Bridging the two “worlds” of Image Inpainting
and image-to-image translation
In our work, we bring the best of the two worlds. We leverage the advantages of image inpainting and image-to-image translation to generate photo-realistic and semantic-aware content, respectively. More specifically, we exploit the photorealistic generation capacity of generative models in combination with the boundary preserving properties of image-to-image translation, to replace parts of a 360° image with the occluded background.
The PanoDR model that preserves the structural reality of the scene while counterfactually inpainting it. The input masked image is encoded twice, once densely by the structure encoder outputting a layout segmentation map, and once by the surrounding context encoder, capturing the scene’s context while taking the mask into account via a series of gated convolutions. These are then combined by the structure-aware decoder with a set of per layout component style codes that are extracted from the complete input image. The final diminished result is created via compositing the predicted and input images using the diminishing mask.
Mobirise site builder - Click now