What can we learn from predicting shadows?

 Kushagra Tiwary

By predicting sparse shadow cues, our physics-inspired machine learning algorithm can reconstruct the underlying 3D scene. 

Abstract. We present a method that learns neural scene representations from only the shadows present in a scene. While traditional shape-from-shadow (SfS) algorithms reconstruct geometry from shadows, they assume a fixed scanning setup and fail to generalize to complex scenes. Neural rendering algorithms, on the other hand, rely on photometric consistency between RGB images but largely ignore physical cues such as shadows, which have been shown to provide valuable information about the scene. We observe that shadows are a powerful cue that can constrain neural scene representations to learn SfS, and can even outperform NeRF at reconstructing otherwise hidden geometry. We propose a graphics-inspired differentiable approach that renders accurate shadows with volumetric rendering, predicting a shadow map that can be compared to the ground-truth shadow. Even with just binary shadow maps, we show that neural rendering can localize the object and estimate coarse geometry. Our approach reveals that sparse cues in images can be used to estimate geometry with differentiable volumetric rendering. Moreover, our framework is highly generalizable and can work alongside existing 3D reconstruction techniques that otherwise rely only on photometric consistency. Our code is made available in our supplementary materials.

Neural Shadow Fields

We propose Neural Shadow Fields that learn object structure from binary shadow masks. We create a graphics-inspired differentiable forward model that can render shadows using volumetric rendering. 
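The paper does not spell out the depth-rendering step in code, but the standard way to obtain a differentiable depth map from a volumetric representation is to compute the expected ray-termination distance under the volume-rendering weights. A minimal NumPy sketch (function name and sampling scheme are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def expected_depth(sigmas, ts):
    """Expected termination depth of one ray under volume rendering.

    sigmas: per-sample densities predicted by the network, shape (N,)
    ts:     sample distances along the ray, shape (N,), strictly increasing
    """
    deltas = np.diff(ts, append=ts[-1] + 1e10)   # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)      # per-sample opacity
    # transmittance: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                     # contribution of each sample
    return np.sum(weights * ts)                  # expected depth along the ray
```

Because every operation is smooth in the densities, gradients from a loss on the rendered depth (or on the shadow map derived from it) flow back into the network.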

A point x in the scene is in shadow if no direct path exists from x to the light source, which implies that there is an occluding surface between x and the light. We differentiably render the scene's depth at each pixel from both the camera's and the light's perspective, then project each camera pixel and its depth into the light's frame of reference, giving a depth z2L. We then index the light's depth map at the projected location to get z1L. If z1L is less than z2L, a ray cast from the light's perspective terminates early, i.e. there must be an occluding surface, so the point is in shadow. Figure (b) shows a 2D slice of our approach: a volume (cloud) with the shadow mask unraveled. The network learns an opacity per point (dots) via the shadow-mapping objective, which penalizes predicted geometries that do not cast the ground-truth shadows. Through this, the network learns 3D geometry that is consistent across all shadow maps for all cameras given a particular light source.
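The depth comparison above is the classic shadow-mapping test. To keep it differentiable, the hard z1L < z2L comparison can be relaxed into a sigmoid of the depth gap; the sketch below assumes the projection into the light's frame has already been done, and the sharpness k and bias eps are illustrative choices, not values from the paper:

```python
import numpy as np

def shadow_mask(z_cam_in_light, light_depth, eps=1e-2, k=50.0):
    """Soft shadow-mapping test so that gradients can flow.

    z_cam_in_light: depth of each camera pixel's 3D point expressed in the
                    light's frame (z2L in the text), shape (H, W)
    light_depth:    the light's depth map indexed at the projected pixel
                    locations (z1L in the text), shape (H, W)
    Returns values in (0, 1); values near 1 mean "in shadow".
    """
    # hard test: shadowed iff light_depth < z_cam_in_light (occluder exists)
    # soft version: sigmoid of the depth gap; the small bias eps guards
    # against self-shadowing ("shadow acne") from depth quantization
    return 1.0 / (1.0 + np.exp(-k * (z_cam_in_light - light_depth - eps)))
```

Comparing this predicted soft mask against the binary ground-truth shadow image yields the per-pixel supervision signal.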


We demonstrate our method on several meshes and compare it with a vanilla framework that relies on RGB images, whereas ours uses only binary shadow images.

We can also visualize the evolution of depth maps and shadow masks during training to better understand what our algorithm is learning.
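The quantity driving that training evolution is the discrepancy between predicted and ground-truth shadow masks. One plausible form of the shadow-mapping objective is a per-pixel binary cross-entropy; the exact loss used by the method is not stated here, so this is a hedged sketch:

```python
import numpy as np

def shadow_loss(pred_shadow, gt_shadow, eps=1e-7):
    """Binary cross-entropy between a predicted soft shadow map and the
    ground-truth binary mask -- one plausible instantiation of the
    shadow-mapping objective (an assumption, not the paper's exact loss).
    """
    p = np.clip(pred_shadow, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(gt_shadow * np.log(p) + (1.0 - gt_shadow) * np.log(1.0 - p))
```

Minimizing this over all camera views, for a fixed light, is what pushes the learned opacities toward geometry whose cast shadows match the observations.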