Wednesday, November 26, 2008

Computer Vision (40): 3D Physical to 2D Projection of Depths

When our brain is not sure of what is being sensed, it uses prior knowledge to come to a conclusion. When some audio is playing under a lot of foreground noise, your brain finds it hard to make sense of it only until it figures out the language the background audio is in. If the language is one it knows, it decodes it; otherwise everything remains noise. With illusions too, you can tune your brain to recognize an image in one particular way and not the others. I guess one of the things that might have happened to your friend is this. The other is the way he might have viewed the image; I have clearly mentioned in the footer of the image that this applies only when it is viewed as a cross stereo. I will rule out the latter case, as even you tend to see it either way.
Sensing in humans is fuzzy: you know something is hot but cannot tell its exact temperature, you know one thing is farther away than another but cannot tell the exact distance between the two, you know one sound is louder than another but not by how much, and so on. As long as you are interested only in relative information, our brain performs well; the moment you ask it for exact values, everything turns relative and fuzzy.
Let me explain relative information taking the top view of the scenario as shown below.
Currently the objects are placed one beside the other (images not to scale). The relative depth between them will be zero, as the projected distance between the objects is the same in both eyes. To observe relative depth between them we will first have to create physical depth between the two by pushing either one of them back.
CASE 1: Let me push the rectangle back first, as shown below. ‘L’ and ‘R’ are the projected distances between the objects in the left and the right eye respectively. Whenever the object on the right is in front of the one on the left, the projected distance in the left eye will be greater than that in the right.

CASE 2: Now consider the image below, in which the circle is pushed back. For our brain to perceive this physical relative depth, the projected distance in the right eye has to be greater than that in the left.
Let me bring into the picture the image that actually created this doubt. If you cross-view this stereogram, the 2D distance between the objects that the left eye perceives will be greater than that perceived by the right, which is CASE 1.
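To make the two cases concrete, here is a minimal sketch of the same top-view geometry. It assumes a simple pinhole model for each eye; the function name and every distance below are my own made-up numbers, purely for illustration:

```python
# Top view: x is horizontal, z is the viewing direction (units arbitrary, say cm).
# With a pinhole eye, the projected position of an object is proportional
# to (object_x - eye_x) / object_z.

def separation_seen_by(eye_x, left_obj, right_obj):
    """Projected 2D distance between the two objects as seen from one eye."""
    (xl, zl), (xr, zr) = left_obj, right_obj
    return (xr - eye_x) / zr - (xl - eye_x) / zl

LEFT_EYE, RIGHT_EYE = -3.0, +3.0          # roughly a 6 cm inter-ocular distance

# CASE 1: rectangle (left object) pushed back, circle (right object) in front.
rect, circle = (-10.0, 110.0), (+10.0, 100.0)
print(separation_seen_by(LEFT_EYE, rect, circle) >
      separation_seen_by(RIGHT_EYE, rect, circle))    # True -> L > R

# CASE 2: circle (right object) pushed back instead.
rect, circle = (-10.0, 100.0), (+10.0, 110.0)
print(separation_seen_by(LEFT_EYE, rect, circle) <
      separation_seen_by(RIGHT_EYE, rect, circle))    # True -> R > L
```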

Saturday, August 30, 2008

Computer Vision (39): SDF continued

S1 and S2 - Two sensor positions.
C1 and C2 - Cones of two real points whose focused images are P1 and P2 respectively.
L - Lens.

To solve for a cone you need its base diameter and height, if it is right circular. But for points away from the optical axis the cone is no longer right circular, so to solve for it you need to know it at two cross sections. Joining these two sections at all points around them gives you the cone, as shown above. If you can solve this for every point on the sensor, you can then control the point of focus through software, post capture. But there is a problem.
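Before getting to the problem, here is a rough numeric sketch of what "solving for the cone from two cross sections" could mean. It assumes the blur-circle centre and radius of a point can actually be measured at two sensor positions; the function name and all the numbers are hypothetical:

```python
def cone_apex(z1, c1, r1, z2, c2, r2):
    """Extrapolate a (possibly oblique) cone known at two cross sections.

    z1, z2 : sensor positions along the optical axis
    c1, c2 : blur-circle centres (x, y) measured at z1 and z2
    r1, r2 : blur-circle radii measured at z1 and z2
    Returns the apex, i.e. where that point would be in perfect focus.
    """
    dr_dz = (r2 - r1) / (z2 - z1)          # the radius changes linearly along a cone
    z_apex = z1 - r1 / dr_dz               # plane where the radius reaches zero
    t = (z_apex - z1) / (z2 - z1)
    apex_xy = (c1[0] + t * (c2[0] - c1[0]),
               c1[1] + t * (c2[1] - c1[1]))
    return z_apex, apex_xy

# Hypothetical measurements at sensor positions S1 and S2:
print(cone_apex(z1=10.0, c1=(2.0, 0.5), r1=0.8,
                z2=11.0, c2=(1.9, 0.45), r2=0.6))    # apex at about z = 14.0
```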
C3 - Newly introduced cone.

Wednesday, August 27, 2008

Computer Vision (38): Software Defined Focus

A software solution is generally rated above the corresponding hardware one, some of the reasons being that software is more portable, does not rot or burn, and scales easily. Macro photographers have a hard time getting a perfectly focused image in every one of their shots; in post processing you can only sharpen the entire image a bit, not refocus the region you want. Video surveillance systems may have captured the criminal’s footage, but it might be blurred and hence difficult to recognize. What if we could find a way to correct the focus even after the video or photo was captured?

I have already written a detailed article on what focus actually means to a camera, using simple ray diagrams. Now I will take it further and try to find a way to define focus through software. I will again use simple ray diagrams to explain my observations, not the complexities of QED.


To revise the concepts a bit:

1. Light diverges in all possible directions from a point after getting reflected or emitted.
2. When it intersects the camera aperture, only a section of this spherical region enters to reach the camera sensor.
3. A cross section of this spread taken at the face of the lens, with respect to the point where the light commenced, gives us a cone, which in turn is converged by the lens to fall on the sensor.
4. The image this cone creates on the sensor (a circle or a point) depends on where along this convergence path the sensor intersects it; a small numeric sketch of this convergence follows the list.
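Here is that sketch, a minimal one assuming an ideal thin lens; the focal length, the object distances and the helper name are all my own illustrative choices:

```python
def image_distance(f, d_object):
    """Ideal thin-lens equation: 1/f = 1/d_object + 1/d_image."""
    return 1.0 / (1.0 / f - 1.0 / d_object)

f = 50.0                            # focal length in mm (assumed)
print(image_distance(f, 1000.0))    # a point 1 m away converges ~52.6 mm behind the lens
print(image_distance(f, 500.0))     # a point 0.5 m away converges ~55.6 mm behind the lens
# A sensor placed at one of these distances records that point as a point and
# the other one as a circle, depending on where it cuts the converging cone.
```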

Pr - real point of commencement of light.
Po - some other real point.
L - Lens.
Puu - Unfocused Uncrossed Image of the point Pr with sensor at S1.
Pf - Focused Image of the point Pr with sensor at S2.
Puc - Unfocused Crossed Image of the point Pr with sensor at S3.
S1, S2 and S3 - Different positions of the sensor.

An unfocused image is the one obtained when the sensor is at a position other than S2 for the point Pr. Points at different distances from the lens converge at different sensor positions. So, though the point Pr produces a focused image with the sensor at S2, the point Po registers a circle there. Focusing a point in one plane (with respect to the lens) throws the points in other planes out of focus.
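To put a rough number on that, here is a minimal sketch under the same ideal thin-lens assumption; the aperture, focal length, distances and function names are all made up for illustration. With the sensor at Pr’s convergence distance, Po’s cone is cut before it converges and leaves a circle whose diameter follows from similar triangles:

```python
def image_distance(f, d_object):
    return 1.0 / (1.0 / f - 1.0 / d_object)        # thin-lens equation

def blur_diameter(aperture, d_image, sensor_pos):
    """Circle-of-confusion diameter when the sensor is not at the convergence plane."""
    return aperture * abs(sensor_pos - d_image) / d_image

f, aperture = 50.0, 25.0                 # mm; an f/2 lens, assumed values
s2 = image_distance(f, 1000.0)           # sensor placed where Pr (1 m away) focuses
d_po = image_distance(f, 500.0)          # Po (0.5 m away) converges further back
print(blur_diameter(aperture, d_po, s2)) # Po is recorded as a circle ~1.3 mm wide
```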

One way to get most or all of the points to appear in focus is to increase the depth of focus by making the aperture as small as possible. This makes the light cone very narrow, so the digital sensor does not register it as a circle even though it is not focused.
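A rough sketch of that trade-off, assuming the blur circle only becomes visible once it exceeds one pixel (the pixel pitch and every number here are assumptions of mine): the tolerable sensor displacement grows as the aperture shrinks.

```python
def sensor_tolerance(aperture, d_image, pixel_pitch):
    """How far the sensor may sit from the convergence plane before the blur
    circle exceeds one pixel (from blur = aperture * |ds| / d_image)."""
    return pixel_pitch * d_image / aperture

d_image, pixel_pitch = 52.6, 0.005                  # mm; ~5 micron pixels, assumed
print(sensor_tolerance(25.0, d_image, pixel_pitch)) # wide aperture:   ~0.011 mm of slack
print(sensor_tolerance(3.0, d_image, pixel_pitch))  # narrow aperture: ~0.088 mm of slack
```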


Nowadays, due to the increase in sensor pixel density, even a slight movement of the sensor away from the position S2 gives an out-of-focus image of Pr. Though autofocus systems may be very accurate, even a small movement of the camera can throw the desired point out of focus, especially for macro shots at wide apertures. Once the image is frozen there is no way you can correct the focus except for a slight sharpening. This is because a circle does not uniquely define a cone. For example, in the figure shown below, the circle Ic imaged on the sensor S could have been formed by any of the cones converging at F-1 (before the sensor), or at F1, F2 or F3 (after the sensor). In this format there is no way we can bring the focus back once the image is captured, since we do not know the cone whose cross section this is.
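The ambiguity is easy to see numerically. A minimal sketch, with every value hypothetical: for one measured circle diameter there is a whole family of cones, converging before or after the sensor, that explain it equally well, so a single defocused frame cannot be inverted.

```python
# One measurement: a circle of diameter D on the sensor at position s (mm, assumed).
D, s = 1.3, 52.6

# Any apex offset x (negative = converges before the sensor, positive = after)
# is consistent with that measurement, provided the cone opens at slope D / (2 * |x|).
for x in (-6.0, -3.0, +3.0, +6.0):
    slope = D / (2.0 * abs(x))
    print(f"apex {x:+.1f} mm from the sensor, cone slope {slope:.3f} -> same {D} mm circle")
```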


Assuming an even intensity distribution, some would argue that, irrespective of where this point converges, we could simply sum the spread-out intensity and put it at the location of focus. This can only hold for points on the optical axis and for a single-lens system; camera lenses generally contain many groups of lenses and hence are complex to analyze. If there are any experts who can solve this problem, do let me know.

Secondly, in the software solution I am talking about, as one point gets focused the others in different planes should change correspondingly, which is difficult with the above method. Moreover, it is not possible to isolate the circle of a particular point in a natural scene, where millions of points get mixed up at the sensor. So what is the way out?

Monday, August 18, 2008

Photography and travel: Melkote

A one-day trip from Bangalore, around 130 km away. Take Mysore road initially and then a deviation to the right after reaching Mandya. There is not much to see other than the place where the movie GURU was shot, so it is not a place I would recommend planning an exclusive trip for.

This is the place where Aishwarya Rai runs and dances with the ducks in some song that I don't remember right now.

Sunday, August 17, 2008

Photography and Travel: Masinagudi

A weekend trip for wildlife and nature enthusiasts. Masinagudi is 7 km from the Mudhumalai elephant reserve in Tamil Nadu. Masinagudi itself is just a small town and a resting place, while the safaris are mainly conducted in Mudhumalai. One can also opt for a night safari. If you can leave early in the morning, Gopalswamy Betta, near Bandipur, is also a good place to stop on the way. Ooty is just 30 km from here, and if there is a day to spare you can drive up to it.

Mudhumalai Morning Safari

Ooty landscape

Gopalaswamy Betta

Wednesday, June 11, 2008

Photography and Travel: Savandurga

Savandurga has the largest monolithic rock in Asia. It is around 45 km from Bangalore and is good for a one-day drive.

The road is pretty good in some places along the stretch, as you can see below.

One can also visit the nearby dam and backwaters, which make a good spot to pitch a tent at night. On the way back you can also visit Dodda Alada Mara (the big banyan tree). This is where one of the fight scenes of Khalnayak was shot, if you remember.

Wednesday, April 2, 2008

Stargazing Olympics 2008:

Being close to a beta version of my MSMM software, I find Olympics 2008 just the right place to demonstrate it. Here is the list of events to be held: http://en.beijing2008.cn/cptvenues/schedule/. MSMM can be used to depict motion events like Athletics, Badminton, Basketball, Canoe/Kayak -- Slalom, Artistic Gymnastics, Gymnastics -- Trampoline, Rhythmic Gymnastics, Aquatics -- Diving, Taekwondo, Volleyball and Beach Volleyball in an effective way. In fact, from the video-transmission perspective, once an MSMM image is created there would be no need to transmit the replay of the entire sequence, as this single MSMM image would depict all of it in one frame. So it would also be a kind of compression.
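As a back-of-the-envelope illustration of that compression angle (the replay length and frame rate below are assumptions, not measurements): one composite frame would stand in for every frame of the replay it summarizes.

```python
replay_seconds = 10      # assumed length of a typical action replay
frame_rate = 25          # assumed broadcast frame rate in fps
frames_in_replay = replay_seconds * frame_rate

# One MSMM composite frame would replace the whole sequence,
# ignoring whatever gains the video codec would already have made.
print(frames_in_replay, "replay frames vs 1 composite frame ->", frames_in_replay, ": 1")
```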

Any company willing to promote its camera through this software??? Check out the online demo here: http://www.multishotimaging.com/

Friday, March 7, 2008

Voice munching

Lip reading in computer vision tries to uncover a conversation from a video sequence that has no audio. It exploits the movements of the lips and the jaw, which are assumed to correspond uniquely to what is spoken. When it comes to broadcasting media content, say a news show, it would then ideally be enough to transmit only the video of the person, with software on the receiving side giving you the news through lip reading. But would it really be worth the effort? How much bandwidth does the audio signal take, after all?

I would rather be impressed if the whole concept were reversed: generate the lip movement by looking at the voice. Of course this would not replicate the exact video of the person talking, but no better example could be found to support the concept. Transmit only the first frame of video and guess the next frames, I mean the lip movement, from the transmitted voice. The bandwidth needed to transmit a news channel would then be just the bandwidth of voice, which means I could use my landline phone, with no special modem, to make a video call. Crazy stuff! But it all depends on how well you can make the person’s lips dance to the tune of his voice.
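To put a rough number on the bandwidth argument (the figures below are ballpark values I am assuming, not measurements): a telephone-grade voice codec runs at tens of kilobits per second, while even a modest standard-definition video stream needs a few megabits per second.

```python
voice_kbps = 64        # assumed: roughly G.711 telephone-quality speech
video_kbps = 2000      # assumed: ballpark for a standard-definition broadcast stream

# If the receiver synthesized the lip movement locally from the voice,
# only the audio (plus an initial frame) would need to travel.
print(f"video/voice bandwidth ratio ~ {video_kbps / voice_kbps:.0f}x")
```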