What Motivates My Research - John K. Tsotsos

I am interested in understanding vision, whether it be in humans or computers. There are thousands of talented people with the same goal around the world; vision has been studied seriously since at least the time of the ancient Greeks, and there is still much to be learned and understood. This is mostly a testament to its inherent difficulty.

What were the roots of my research interests? It’s pretty simple. As an undergraduate student I subscribed to Scientific American in early 1971. In that year’s issues, two papers caught my fancy and I haven’t looked back since. They were “Advances in pattern recognition” by R. Casey and G. Nagy, and “Eye movements and visual perception” by D. Noton and L. Stark. The first dealt in part with optical character recognition by computer, defining algorithms that might capture the process of vision and allow a computer to see. The second described the possible role of eye movements in vision and how they might define our internal representations of what we see. There had to be a connection! I have been trying to understand vision and what the connection between machine and biological vision might be since about 1974.

My work has led me to:

The first computer vision system to interpret visual motion in high-level terms (ALVEN), with continuing examination of the design of high-level Motion Understanding systems and of how the more abstract levels of motion representation interact with the earlier levels via attentive processes
The first proofs showing the inherent computational difficulty of vision (Foundations for Attentive Processes)
A novel model of Vision and Attention – the Selective Tuning model – predicting many aspects of human/primate visual processes
A new conception of and solution to the Binding Problem, a problem that has eluded solution since 1961, when Frank Rosenblatt first described it

These represent my current main research path.

Thirty or so years ago, attention was an obvious topic for those in computer vision or image processing. One had to go to great lengths to process only the most relevant parts of images because computers had so little power. But now computer power abounds and is cheap. It is the era of the large database, machine learning, and brute-force solutions. The results are impressive and promise to become more so as computer power increases and costs decline.

But the goal for vision systems is really still the same as it was 30 years ago. We want systems to be robust to all the variability in the visual world, to the way we view the world, and to the knowledge we have of the world. We want them to be flexible rather than single-task systems, and we want them to do more than classify an image; visual reasoning and problem solving remain important unsolved problems. We want them to deal properly with the unexpected or with a not-previously-viewed scene (for a wonderful view of these goals, as relevant today as when it was written, see Zucker, Rosenfeld and Davis, General Purpose Models: Expectations about the Unexpected, Proc. IJCAI 1975).

Today’s powerful statistical classifiers may yield impressive single task performance but they do not lead to representations suitable for visual reasoning and cognition (Dickinson, S., The Evolution of Object Categorization and the Challenge of Image Abstraction, in S. Dickinson, A. Leonardis, B. Schiele, and M. Tarr, eds., Object Categorization: Computer and Human Vision Perspectives, Cambridge University Press, 2009). Further, they ignore the fact that the world is 3D and that vision is an active process and cameras must move in the world to find items of interest (R. Bajcsy, Active Perception vs Passive Perception, Proc. IEEE Workshop on Computer Vision: Representation and Control, Bellaire, MI, 1985). Even the databases on which they are tested require re-consideration (Pinto N, Cox DD, DiCarlo JJ. Why is real-world visual object recognition hard? PLoS Comput Biol 4(1): e27. 2008, doi:10.1371/journal.pcbi.0040027).

Part of the problem is that many assumptions or manipulations often precede processing in order to put the data into forms amenable to the classifier approach, but these assumptions and manipulations do not reflect the reality of vision: a vision system is embodied in a behaving agent in a real world. Among the assumptions that do not hold in real situations: fixed camera systems negate the need for selection of visual field; pre-segmentation or user interaction eliminates the need to select a region of interest; ‘clean’ backgrounds ameliorate the segmentation problem; task domain knowledge negates the need to search a stored set of all domains; knowledge of which objects/events appear in scenes negates the need to search a stored set of all objects/events. In effect, these assumptions reduce or eliminate the need for attention but cannot be justified for systems that must function in the real world. Finally, it is clear that although knowledge of human object recognition and vision is growing rapidly, corresponding theoretical modeling remains a challenge, as a recent review concludes (Peissig, J., Tarr, M. (2007). Visual Object Recognition: Do we know more than we did 20 years ago?, Ann. Rev. Psychol. 58:75-96).

It seems that the basic proofs of my 1989 IJCAI, 1992 IJCV and 1995 AIJ papers still have relevance: the general problem has an exponential complexity character that is not easily defeated. Yet our brains, and likely the brains of most seeing animals, have found a solution. This is what I seek. I am firmly convinced that one of the major elements of the solution is visual attention, and I thus seek a deep understanding of this multi-faceted and complex phenomenon of human and animal vision. The approach is highly interdisciplinary and connects science and engineering, computation and neurobiology, experiment and empiricism. We develop theory, make real predictions about human behavior that we and others have tested using visual psychophysics and brain imaging (fMRI, MEG) techniques, build vision systems, and embody them in real robots. Our experimental and empirical efforts refine our theories, and science, hopefully, moves forward. But always, the hope is that better knowledge of biological vision will lead to the kind of robust, flexible and extensible systems we have been searching for since the late 1960s.

There have been many related side trips as well. Active Vision, Robotics and Object Recognition have also been topics of interest. We have developed:

Active binocular camera robotic systems
Active visual search algorithms
Active object recognition methods
Theoretical results on the computational complexity of active vision strategies
Theoretical results on the computational complexity of behaviour-based robot control

We have also worked on many application areas, with the bulk in the medical area: cardiology, dentistry, and assistive technology have all been of interest. My students, postdocs and I have tried to develop methods:

To assess the dynamics of the human left ventricle from X-Ray movies
To characterize the growth patterns of children’s teeth
To detect arrhythmia in the electrocardiogram
To develop an autonomous wheelchair for disabled children and the mobility impaired
To analyze MRI images of children’s hearts
To construct a visually-guided mobile platform for the nuclear power industry

Of course, very little would have been possible without the amazing cadre of undergraduate and graduate students, post-doctoral fellows, and research associates that I have had the pleasure to work with. For more about them, click Lab Alumni.

And last, but not least, I try hard to teach my students and post-docs how to be good researchers and respected/respectful members of the scientific community. Here is a recipe that many seem to like: