Thursday, March 29, 2007

Computer Vision (10)

Let’s come back to depth, which is our main topic. To perceive depth, we know that both our eyes need to capture some common region. Even within this common region you cannot look at two different objects at once, even though you have two eyes. Try it out right now! Take a long word and try to see the first and the last characters at the same time. You will not be able to do it, because wherever you look, the same object gets placed on the macula of both eyes. Now isn’t that redundant? No, that is exactly what is responsible for the perception of depth. But how does one eye know where the other is looking? What if we have many similar objects placed around us, will our brain be fooled? This is exactly what computer vision scientists have been trying to crack for several decades. The concept is called stereo correspondence. In order to mimic what our eyes are doing, we use two cameras, place them at an offset similar to how our eyes are placed, and take an image from each of them. When you look at such a pair of photographs (I have a sample below), a lot of objects appear in both images; they are redundant for 2D perception but required for 3D viewing. These are the objects we are interested in, and we need to match them in both the images to get the relative depth.
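
For the curious, here is a minimal sketch of what “matching them in both the images” could look like in code. It assumes a rectified grayscale pair (rows already aligned) stored as NumPy arrays and uses simple sum-of-absolute-differences block matching; the function and parameter names are my own, not from any particular library.

```python
# A toy illustration of stereo correspondence by block matching.
# Assumes a rectified pair: a scene point appears on the same row in
# both images, shifted horizontally by its "disparity".
import numpy as np

def match_point(left, right, row, col, half_window=5, max_disparity=64):
    """Find the disparity of the patch around (row, col) in the left image
    by searching along the same row of the right image and minimising the
    sum of absolute differences (SAD)."""
    patch = left[row - half_window:row + half_window + 1,
                 col - half_window:col + half_window + 1].astype(np.float32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disparity):
        c = col - d                      # candidate column in the right image
        if c - half_window < 0:
            break
        candidate = right[row - half_window:row + half_window + 1,
                          c - half_window:c + half_window + 1].astype(np.float32)
        cost = np.abs(patch - candidate).sum()
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d                        # larger disparity ~ closer object
```

A real system does this for every pixel, with smarter matching costs and geometric constraints, but the essence is the same: the horizontal shift of a matched patch encodes its relative depth.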

In the case of our eyes, since the entire surroundings cannot be captured on the macula, we have to keep moving them to bring different objects onto it. Our retina is not a uniform sensor; in order to see something clearly we have to place it on the macula, hence the need for this movement. A camera sensor, on the other hand, is uniform in density, so the entire surroundings can be analyzed with just a single shot from each of the cameras, no movement required.

Your brain can actually perceive depth from these two 2D images if they are viewed properly, though it takes some practice. Here’s how, if you are interested. To appreciate how our brain creates 3D out of two 2D images, and why we are so keen on copying it, it is better to learn this first and only then proceed.
In my next post I will explain triangulation, which is the central idea behind calculating depth from two 2D images.

Tuesday, March 27, 2007

Computer Vision (9)

For a moment let’s forget about 3D and depth and concentrate on one more observation, related to the angle of view and the kind of image sensor we have in our eyes. When you open both your eyes you get a nearly 180 deg view of your surroundings, but how much of that 180 deg can you really see or perceive? Not much, and here’s why. There are two kinds of specialized sensors in our retina: one that is necessary for perception and another that is specialized for detecting changes. The sensors responsible for perception are placed at the center of the retina, in a region called the macula. They are densely packed there and are responsible for the clear vision necessary for reading and perception. To prove this, look at a particular word somewhere at the center of a page and try reading the line at the top of the page. Even though the entire page falls within your visual field, you can’t really get a clear picture of whatever falls outside the macula. Cats don’t have this region at all, so however much you try, you can’t train one to read. In the remaining region of the retina lie the other kind of sensors, which are responsible for detecting motion or changes in the surroundings. When we wave a hand in front of people to get their attention (when they are lost in some deep thought), it is this region that interrupts their brain. In our day-to-day life we fail to observe these minor things and feel as though we can clearly see the entire 180 deg around us. So if at all you want to see something clearly, you have to position your macula over it. That is the reason our eyeballs keep darting about as we look at different objects. Observe this right now!
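
If you want a rough computational feel for this non-uniform sensor, here is a small toy of my own (not a model of the real retina): it keeps an image sharp around a chosen fixation point and blurs it progressively towards the periphery, using NumPy and SciPy.

```python
# Rough simulation of non-uniform acuity: sharp at the "macula",
# increasingly blurred towards the periphery. Purely illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, fixation, sharp_radius=40, max_sigma=8.0):
    """Blend a sharp and a blurred copy of a grayscale image, weighting
    by distance from the fixation point (row, col)."""
    rows, cols = np.indices(image.shape)
    dist = np.hypot(rows - fixation[0], cols - fixation[1])
    # 0 inside the sharp region, ramping up to 1 far away from it
    weight = np.clip((dist - sharp_radius) / (4 * sharp_radius), 0.0, 1.0)
    blurred = gaussian_filter(image.astype(np.float32), sigma=max_sigma)
    return (1 - weight) * image + weight * blurred
```

Move the fixation point around and only the region under it stays readable, which is roughly the page-reading experiment described above.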

Saturday, March 24, 2007

Feats

Had been to IIMA this weekend to attend the CIIE award distribution ceremony. I was awarded silver for my entry under the Ideaz category. Gave a short talk on it too. Here are some of the snaps from my visit (does not include the award distribution).

Thursday, March 22, 2007

Computer Vision (8)

By this time the reader would have understood the problem at hand and also how we are looking to solve it. If not, just keep a few things in mind. With just one eye, it is not possible to perceive depth. Without depth, it is not possible to segment the objects around us so effectively. Without segmentation, it is not possible for us to learn or update our knowledge. If you have got the essence of it, some of the questions that would definitely pop up in your mind are:
  1. Is solving this problem so difficult?
  2. Why would we want to solve it the way our brain does, isn’t there a better way?
  3. When a camera autofocus system can estimate depth using IR, why can’t we use, say, a LASER to get the exact depth?

To explain why we would want to solve it in the same way our brain does, I would like to quote these lines from the introduction section of one of the related papers from MIT. It states:
“The challenge of interacting with humans constrains how our robots appear physically, how they move, how they perceive the world, and how their behaviors are organized. We want it to interact with us as easily as any other human. We want it to do the things assigned to it with a minimum of interaction from us. In other words, we can never predict how it is going to react to a stimulus and what decision it is going to take.

“For robots and humans to interact meaningfully, it is important that they understand each other enough to be able to shape each other’s behavior. This has several implications. One of the most basic is that robots and humans should have at least some overlapping perceptual abilities. Otherwise, they can have little idea of what the other is sensing and responding to. Vision is one important sensory modality for human interaction, and the one in focus here. We have to endow our robots with visual perception that is human-like in its physical implementation. Similarity of perception requires more than similarity of sensors. Not all sensed stimuli are equally behaviorally relevant. It is important that both human and robot find the same types of stimuli salient in similar conditions. Our robots have a set of perceptual biases based on the human pre-attentive visual system. Computational steps are applied much more selectively, so that behaviorally relevant parts of the visual field can be processed in greater detail.”

I think this completely justifies the claim made above. What is important for us is how useful it will be to us humans. Take, for example, the compression algorithms used in audio and image processing. Audio compression is based on our ability to perceive or reject certain frequencies and intensities. The audio is compressed such that there is no perceptual difference between the original and the compressed data for our auditory system. To a dog it might sound really weird! Image compression works on the same basic concept.
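
To make the analogy a little more concrete, here is a deliberately crude caricature in code, nothing like a real codec such as MP3: it simply discards spectral components that are too quiet or too high in frequency to matter perceptually, under thresholds I have made up.

```python
# Crude caricature of perceptual audio coding: throw away spectral
# components we are unlikely to hear. Real codecs use psychoacoustic
# masking models; this only illustrates the principle.
import numpy as np

def crude_compress(signal, sample_rate, quiet_fraction=0.01, cutoff_hz=16000):
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    keep = (np.abs(spectrum) > quiet_fraction * np.abs(spectrum).max()) \
           & (freqs < cutoff_hz)
    # zeroed bins need not be stored -- that is where the "compression" is
    return np.fft.irfft(np.where(keep, spectrum, 0), n=len(signal))
```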

As you go on reading my posts you will get to know whether the problem is difficult or not (that is the main reason why I started writing); I can’t answer that question in one or two lines here. Coming to the third question, a LASER will always give you the exact depth or distance of an object, but our brain doesn’t work on exactness. Even though your brain perceives depth, it doesn’t really measure it. Secondly, getting intelligence out of a LASER-based system is tough. If you use a single ray to measure the depth of your surroundings, what if your LASER always happens to be pointing at an object moving in unison with it? We need a kind of parallel processing mechanism here, like the one we get from an image: the entire surroundings are captured in one shot and analyzed, which a LASER fails to do. You cannot simply use multiple LASERs either, because in that case how would you match the received signals with the ones that were sent out? The ray that leaves the transmitter at a particular point need not come back to the same point (due to deflections). And what should the resolution of the transmitters and receivers be, or how densely should we pack them? What if there were something we wanted to perceive in the space left out between them? This is neither the best way to design a recognition system nor a competitor to our brain, so let’s just throw it away.

Assuming that evolution has designed the best system for us, which has been tried and tested continuously for millions of years, we don’t want to think of something else. We have a working model in front of us, so why not replicate it? And this is not something new for us; we have designed planes based on birds, boats based on marine animals, robots based on us and other creatures, etc, etc.

Tuesday, March 20, 2007

Photography with Computer Vision (7)

This is with regard to the post where I was talking about the design of the visual system of predators and prey. I had taken the chameleon as an example for my explanation but could not photograph one in time :(. A frog is nevertheless an equally good specimen for this dual role played by many creatures in nature. Moreover, I had got a really good side shot of a frog when I visited Agumbe and wanted a reason to share it. At that time I didn’t know I would be coming up with this blog, or I would have photographed the FV (front view) as well. I thank Kalyan Varma for allowing me to use the FV of the frog captured by him here.
Observe the design of its face and eyes. The eyes are placed at almost 45 deg to the face. In the side view it can keep an eye on almost one complete side of its body and watch out for predators.


In the front view you can see that they still have some overlap in their visual fields in order to perceive depth, which is used to strike prey with their tongue.

Monday, March 19, 2007

Computer Vision (6)

In my earlier post I was saying that even though we have two eyes we cannot use them independently. If our eyes cannot move independently of each other, what is it that holds them together? For both eyes to see the same object, either our brain has to be doing some kind of correlation between the two images and providing feedback to the eyes, asking them to position themselves on a common point, or the eyes themselves know where they have to be pointing. I mean, either it is a process of learning or it comes along when our system (life) is booted.
This is actually a debatable topic. I tried to find an answer by observing small babies, but haven’t been successful enough to conclude anything. Anyway, I have some other observations to share. Depth perception does not produce an interrupt in the brain the way sound, motion or color do. During the initial learning stages it is the interrupt that matters, because you need to draw the attention of a baby’s brain to observe something, so depth takes a back seat. I term it an interrupt because it immediately brings your brain into action. To achieve this you generally get some colorful toys that make interesting sounds and wave them in front of the baby. So how does this work?
Sound, as you know, definitely produces an interrupt in your brain, which is why you use an alarm to wake up in the morning. Colorful objects produce high-contrast images in your brain, which are like step and impulse functions: strong signals that your brain becomes interested in. Now you know what kind of dress to wear to draw the attention of everyone around you!
If you remember awakening a daydreamer by waving your hand in front of him, you know how motion produces an interrupt in your brain. This is actually because of the way our visual processor and retina are designed, which I will come to shortly. So the next time you are buying a toy, think about these.

Secondly, the reason interrupts matter is that a newborn baby’s brain is like a formatted hard disk: ready to accept data, but holding nothing. When it doesn’t understand anything around it, there is absolutely no meaning in perceiving depth. Whether it perceives depth or not, an object is just going to be a colored patch and nothing else. Again, it wouldn’t even know which color it is! So interrupts help it make sense of its surroundings, and once that is done, depth and motion help it segment objects from one another to build its database.
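
To tie this “interrupt” idea to something a machine can actually compute, here is a minimal sketch of motion detection by frame differencing: a frame is flagged as interesting when enough pixels change compared with the previous frame. The thresholds are arbitrary choices of mine.

```python
# Minimal motion "interrupt": flag a frame when enough pixels change
# between consecutive grayscale frames (2D uint8 NumPy arrays).
import numpy as np

def motion_interrupt(prev_frame, curr_frame, pixel_threshold=25, area_fraction=0.01):
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    changed_fraction = (diff > pixel_threshold).mean()
    return changed_fraction > area_fraction   # True == "wake up and look"
```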

Thursday, March 15, 2007

Computer Vision (5)

If you closely observe the different species in the animal kingdom you will see that there are two kinds of creatures: ones that perceive depth through vision and others that don’t. The ones that don’t have their eyes towards the sides. Haven’t seen such a creature? Give it a thought; you would even have painted them in your childhood drawing classes. Fishes have their eyes towards the sides and hence cannot perceive depth. Then how do they move about, when you found it so difficult with just one eye open? Won’t their survival be affected by it? Not really! On the contrary, it is evolution that has given them such eyesight precisely for survival.
In general the observation is this: predators have their eyes towards the front and prey have them on the sides. Let’s take the ferocious tiger for instance. A tiger needs to pounce on its prey exactly and can’t use the trial-and-error method that you used to catch hold of the wire :) (refer to my earlier post). To take that decision it needs to know the exact location of its prey, which is given by depth. For prey, on the other hand, it is enough to know the presence of a predator; its exact location takes lower priority. The predator’s focus is on the prey and not the surroundings, while the prey’s focus is on the surroundings, because it needs to look out for possible danger from all sides. Evolution has hence given predators narrow-angle but overlapping vision to perceive depth, while prey have much wider-angle vision but lack depth perception.
This does not mean that prey have no depth perception at all; it is just that a wide angle is more important to them than depth, so the overlapping region between the two eyes is very small. Their faces are designed that way. We, for example, along with the tiger and other predators, have flatter faces that hold both eyes at the front, while a deer has a curved face so that its eyes point somewhat towards the sides. There is also a special case: creatures that play the dual role of predator and prey. Chameleons have adjustable eye sockets. When they sense danger the sockets move towards the sides to get a wider look, and while hunting they come closer together to get an overlapped view! In the overlapped view both eyes look at the same object, while in the independent view they can process the two images separately. In our case that is not possible. Even though we have two eyes, we cannot see two different objects at the same time; our eyes cannot move independently of each other.

Wednesday, March 14, 2007

Computer Vision (4)

We humans have five different senses: touch, smell, sight, hearing and taste (correct me if I missed something). We have one tongue, two eyes, two ears, two nostrils and of course skin for the sense of touch (skin is a special case; I will come to it later). Ever wondered why we don’t have two tongues? Does this number, two or one, make any sense to our senses? Let me illustrate their significance with some examples.
  1. You can pick up a pen that is lying in front of you at one go. (Vision)
  2. When someone calls you from your left you immediately turn towards your left instead of searching for the voice all around you. (perception of sound)
  3. And of course fragrance definitely attracts you towards it. (sense of smell)

Each of these senses is highly developed, in the order mentioned. As you can observe in these examples, when you have a pair of sensors they answer the question WHERE. WHERE is the object, WHERE is the sound coming from, and WHERE is the smell coming from? You don’t have two tongues because to taste something you have to place it on your tongue; you can’t do it wirelessly. WHERE becomes obvious in that case. The final sense is touch, and when it comes to skin there is nothing like one or two: it covers our entire body. But we all know that it is sufficient to be touched at one place to feel it, rather than at two. You have to make contact to have a sense of touch, which eliminates the need to answer the question WHERE.
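
As an aside, the WHERE that two ears answer can be sketched in code: the same sound reaches the two ears with a slight time offset, and cross-correlating the two recorded signals recovers that offset, which tells you which side the source is on. The signals and sampling rate below are assumed inputs.

```python
# Toy sound localisation from two microphones ("ears"): the lag that
# maximises the cross-correlation between the two recordings tells us
# which ear the sound reached first.
import numpy as np

def interaural_delay(left_signal, right_signal, sample_rate):
    """Return how much earlier the sound reached the left ear, in seconds.
    Positive => source is towards the left, negative => towards the right."""
    corr = np.correlate(right_signal, left_signal, mode="full")
    lag = np.argmax(corr) - (len(left_signal) - 1)   # lag in samples
    return lag / sample_rate
```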

Just the presence of two sensors needn’t always guarantee the answer to WHERE; it is their placement that gives them the extra edge. In general there needs to be some common signal that passes through both sensors of the same kind. Light is a high-frequency wave and cannot bend around corners. I mean, you can’t light up your room, go outside behind the room’s wall and read something there, while this is not the case with sound or smell. So irrespective of where on your head the two ears or nostrils are placed, common signals will definitely reach them; but you can’t place one eye at the front of your face and one behind. Light can’t bend, so you wouldn’t get any overlap, in other words no common signal in the two eyes. We humans have both our eyes on the front of our face, so it’s very easy to get common signals. Want to experiment? Fix the position of your face and close one of your eyes, say the left one first. Remember the region that your right eye is seeing. Now close your right eye, open the left and compare the two regions. Most of the region that one of your eyes sees will also be seen by the other; this is the common region. The right eye will not be able to see the leftmost portion of the region seen by the left eye, and vice versa. It is in this common region that we perceive depth. How? I will explain that in detail later; for now you just need to remember that “2 sensors == 3D perception”.

Computer Vision (3)

One statement that I usually get to hear from people is: “I understand the significance of depth in our perception of the surroundings, but a photograph, as you said, is a 2D image and my brain still manages to extract all the information from it. So should robotic vision depend so much on depth? Why can’t we do away with it? Also take movies for example, which are a sequence of 2D images. You can actually feel depth in them, can’t you? I still don’t understand why we need 3D.”
We understand a 2D image completely because of our previous knowledge, and not necessarily due to the image processing happening in our brain. Previous knowledge does not mean that we should have seen the exact image before; it means that we are aware of the context and the content. It is not like in image processing, where the image is first segmented and then understood; in our brain both go hand in hand. What kind of segmentation our brain uses is still not very clear, but I can demonstrate how knowledge rules over the kind of segmentation we can perform on an image. Look at the image below, for example. What do you think it contains? I am sure 100% of the people would say “a man and a woman”. You are totally wrong; the artist had actually drawn jumping dolphins on a pot! Now that you know the content of the image (knowledge), you can easily extract the dolphins out.



You feel the perception of 3D in a cinema due to motion; a cinema is a motion picture! Motion can be obtained in two ways: by keeping the camera static and having motion in the subject, or by bringing about motion in the camera itself, irrespective of the subject. What our brain does using two eyes could also have been done with a single eye, by oscillating it left and right to get the two images it needs. The only difference is that the two images would not be from the same instant of time. From the time we start learning about our surroundings, it is 3D vision that helps us segment the objects around us and put them in our database. Once we have gained sufficient knowledge about our surroundings we no longer need 3D to perceive them, which is why we understand a 2D photograph without any problem.
I will be dealing with these topics in detail later on, under illusions and 3D perception through motion. I have only been introducing you to them for now.

Monday, March 12, 2007

Computer Vision (2)

First of all, why are we so fascinated by our ability to perceive depth, or, for a layman, what does all this mean? After having vision (eyes) for so many years, imagine a world without it. Frightening, right? Now imagine having sight in just one eye. Most people would be okay with it, and some even ask me what difference it makes. Now that is really frightening to us computer vision researchers. We have been chasing this problem for so many decades, many researchers have even spent their entire lives in vain trying to decode it, and here we have people who do not know its significance in spite of using it. No problem, what’s this article for, then? There are two major things involved in vision: sight and depth. Many people fail to distinguish between the two. Sight is the perception of light, and depth is the perception of the space around you. “An experience is worth reading 1000 pages”, so you had better try it out yourself. Right from the time you get up in the morning, spend the entire day keeping one of your eyes closed. Observe whether you can live life as easily as you could with both eyes open. (Disclaimer: I take no responsibility for any accidents that might happen as a result of performing this experiment.) But to get a feel of what is driving so many people to pour so much effort into giving a machine the perception of depth, you have got to try it out. Do not read my other posts till you have got at least something out of this activity. One experiment that I don’t want you to miss out on is this: hang a rope, a wire, a stick, anything, from a point such that there is space all around it. Get your fingers ready in the wire-grasping position and move your hand towards the wire, in the direction perpendicular to it, and grasp it. Remember to close one eye! If you get it right, believe me, you are the luckiest person. If not, you would definitely want to know the magic that your brain is doing with two images. That is exactly what all our research concentrates on. Also try judging the depth between two objects placed at different depths with just one eye open. Try experimenting on as many objects as possible. It is impossible for you to know the distance between the two objects without opening both eyes, except from monocular cues (I will come to these later).

If you think about it carefully, there is nothing new in what I am saying. When I say one eye, it is equivalent to taking an image from a camera. In a camera image the 3D surroundings are projected onto a 2D surface. From just this projection it is impossible to know at what depth an object originally was. Take a look at the image below. The square and the circle are two objects in front of the sensor. Assume they are initially placed at 10 m (circle) and 15 m (square). Their projection on the sensor would be as shown at the right. Try placing the circle anywhere along its line, and likewise the square along its line. Do you see any difference in where they are projected? Not at all; you get the same image irrespective of where the two are relative to each other along their respective lines. Some people argue with me saying that you should definitely be able to observe the change in the size of the object on the sensor as it moves farther from the sensor, so in some way you know whether the object is far or near. I totally agree, but what difference does it make? Who knows the size of the objects? I just have their projection with me at a particular instant of time and nothing else.
When I move an object closer to the sensor, its size on the sensor definitely increases, but here we are talking about the depth between two objects, which our brain works out with two images. Even if the size of the object changes as you move it away from or towards the sensor, how does that give you the absolute depth of the object? We can always solve for two distances and sizes of objects such that one is big and far from the sensor and the other small and closer to it, both giving the same projection. Looking at the sensor alone you never know where the objects were, because you don’t know their actual sizes!
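
This ambiguity can be seen directly from a simple pinhole projection model: a point at lateral offset X and depth Z lands at x = f·X/Z on the sensor, so scaling X and Z by the same factor leaves the image unchanged. A tiny sketch with made-up numbers:

```python
# Pinhole projection: x = f * X / Z. A bigger object twice as far away
# projects to exactly the same sensor position, so a single image cannot
# recover depth. All numbers are arbitrary.
def project(X, Z, focal_length=0.05):       # focal length in metres (made up)
    return focal_length * X / Z

print(project(X=1.0, Z=10.0))   # e.g. the circle: 1 m off-axis, 10 m away -> 0.005
print(project(X=2.0, Z=20.0))   # twice as big and twice as far            -> 0.005
```

Only the ratio X/Z survives the projection; the absolute X and Z are lost, which is exactly the point of the figure above.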

When you look at a photograph you almost get to know the depth associated with it due to a lot of monocular cues that your brain uses along with the knowledge gained over the years. I will have a separate post on monocular cues, so wait for that.

Computer Vision (1)

I have been doing research in the field of vision for the last few years, but a universal solution is nowhere to be seen, not just with me but at any of the research institutes either. Sometimes I feel it is impossible to find one universal solution to obtaining depth through stereo. Researchers have been thinking very deeply about just the stereo correspondence problem for many decades, but in vain. Either the problem is very difficult, or we missed the right track at the very beginning; I believe it is one of these. The problem might be very difficult because it has taken its present shape through evolution, over millions of years. When I say evolution I refer to Darwin’s theory of “the survival of the fittest”. All these complex biological systems have taken rigorous stress tests from nature and survived to date, and cracking them might be a very challenging task. On the other side of the court we have man, who has been able to build high-speed miniature components and complex systems that should easily be able to replicate these relatively low-speed systems. This is because biological vision is seen even in small organisms like insects, which have hardly a few thousand neurons dedicated to visual processing. Aren’t our GHz processors able to achieve what has evolved in these small creatures?

After so much research and thought, I at least believe that it is impossible to achieve depth perception through the currently tried-out image processing techniques. Towards the end of my discussion I will try to take you on a walk along the path I believe can solve this problem. It might not be practical as of today, but who knows what technology is waiting for us at the door. If the current image representation and processing techniques are so poor at achieving recognition, why are all those researchers glued to them even now? That’s because we always take our brain as the reference for developing any recognition system, and our brain is still able to perceive depth given a 2D stereo image pair. I believe the reader understands what a stereo image pair is. If not, Google it right now! Or just skip it for now; I am going to take you all on a long journey covering each and every topic related to stereo images, depth perception, etc., etc. A lot of questions and answers came up during my research period and I have tried to give my best possible solutions to all of them. The whole idea behind writing these articles is to share these Q&As so that a person fascinated by vision today will be able to start off from a much more thoughtful point rather than repeating the experiments again and again. To race against nature we have to make sure we compress those millions of years of evolution into a much smaller duration.