Researchers at York University and the University of Utah have developed a perceptually inspired computer vision system called LS3D that automatically infers the 3D structure of complex architecture from a single image.
The project is a collaboration between York University PhD candidate Yiming Qian, Professor James Elder at York University’s Centre for Vision Research (CVR) and Professor Srikumar Ramalingam from the School of Computing at the University of Utah. The work will be presented this week at the 2018 Asian Conference on Computer Vision in Perth, Australia.
“We live in a 3D world, but an image is only 2D,” said Elder. “The human visual system can extract the missing depth dimension by triangulating features detected in both left and right eyes (stereopsis) and by tracking features over multiple images as we move over time (structure from motion). However, since the early Renaissance, it has been known that we can also extract depth information from a single image by using the cue of linear perspective.”
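To illustrate the triangulation Elder describes, here is a minimal sketch of how stereopsis recovers depth under a standard rectified pinhole-camera model; the focal length, baseline, and disparity values below are hypothetical, not taken from the study.

```python
def depth_from_disparity(focal_length_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("zero or negative disparity: point at infinity or behind the cameras")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical numbers: 700 px focal length, 6.5 cm baseline (roughly the
# human interocular distance), 10 px disparity -> the point is ~4.55 m away.
print(depth_from_disparity(700.0, 0.065, 10.0))
```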
LS3D uses this linear perspective cue to “hallucinate” the 3D configuration of structures from a single image, employing a staged computation in which line segments detected in the image are first grouped in 2D according to Gestalt principles, and then “lifted” to 3D using linear perspective constraints. These 3D linear scaffolds are then “skinned” using cuboidal solid models, and any remaining depth ambiguities are resolved using Gestalt principles, said Qian.
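The geometric core of the "lifting" stage is back-projecting image features into 3D under perspective constraints. The sketch below is not the authors' code; it shows the standard operation behind that idea, intersecting a pixel's viewing ray with a known plane, using hypothetical camera intrinsics and plane parameters.

```python
import numpy as np

def backproject_to_plane(pixel, K, plane_normal, plane_d):
    """Lift an image pixel to 3D by intersecting its viewing ray with a plane.

    pixel: (u, v) image coordinates.
    K: 3x3 camera intrinsics matrix.
    plane_normal, plane_d: plane defined by n . X = d in camera coordinates.
    """
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])  # ray direction through the pixel
    t = plane_d / (plane_normal @ ray)                            # scale at which the ray meets the plane
    return t * ray                                                # 3D point in camera coordinates

# Hypothetical setup: 700 px focal length, principal point at (320, 240),
# and a fronto-parallel wall 5 m in front of the camera (n = [0, 0, 1], d = 5).
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])
point_3d = backproject_to_plane((400, 200), K, np.array([0.0, 0.0, 1.0]), 5.0)
print(point_3d)  # -> approximately [0.571, -0.286, 5.0]
```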
While at this point LS3D works only for modern buildings where surfaces meet at right angles (the so-called "Manhattan" constraint), the researchers believe the approach can be generalized to a broader range of architectural styles.
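Under the Manhattan constraint, every edge and surface aligns with one of three mutually orthogonal scene directions. Here is a minimal sketch, not drawn from the paper, of classifying an edge direction against assumed axis-aligned Manhattan directions; the alignment threshold is illustrative.

```python
import numpy as np

def manhattan_axis(direction, axes=np.eye(3), cos_threshold=0.95):
    """Return the index of the Manhattan axis a 3D direction aligns with, or None.

    direction: 3D edge direction (need not be unit length).
    axes: 3x3 matrix whose rows are the scene's three orthogonal axes.
    cos_threshold: minimum |cosine| with an axis to count as aligned.
    """
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    cosines = np.abs(axes @ d)              # |cos| of the angle with each axis
    best = int(np.argmax(cosines))
    return best if cosines[best] >= cos_threshold else None

print(manhattan_axis([0.02, 0.01, 1.0]))   # -> 2 (aligned with the z axis)
print(manhattan_axis([1.0, 1.0, 0.0]))     # -> None (45 degrees off every axis)
```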
Elder, Qian’s PhD supervisor and senior author of the study, said the LS3D system is very different from recent deep neural network approaches to single-view 3D reconstruction that learn a non-linear regression model relating depth to image pixels. These deep models, he said, have tens of millions of free parameters and require large volumes of labelled training data. Moreover, once they are trained, it is difficult to explain how these deep networks arrive at an estimate, making them less attractive for performance-critical applications.
“LS3D, on the other hand, uses a very different form of AI based upon a small set of principles rooted in perceptual psychology, combined with powerful ‘sparse’ mathematical optimization techniques,” said Elder. “As a result, the method has only two parameters, requires no training data, and its operation is easily understood and explained.”
Despite its simplicity, LS3D outperforms deep networks on 3D Manhattan building reconstruction by a large margin, and it produces CAD models that architectural software systems can use directly, rather than the noisy 3D point clouds that deep networks output.
“The success of the approach highlights how a deeper understanding of human visual processing can lead to more powerful computer vision algorithms,” said Elder, who holds the York Research Chair in Human and Computer Vision and is jointly appointed to both the Department of Psychology and the Department of Electrical Engineering & Computer Science at York.
The work was supported by the Natural Sciences & Engineering Research Council (NSERC) CREATE Training Program in Data Analytics & Visualization and by the Ontario-funded project Intelligent Systems for Sustainable Urban Mobility (ISSUM), both led at York by Elder.
The researchers expect LS3D to be most useful for architectural applications (building management, renovation, city mapping, urban planning), but they also see potential in 2D-to-3D film conversion; these applications are being pursued with the assistance of York University's Vision: Science to Applications program.