Software developers as a group seem to be particularly interested in the topic and promise of artificial intelligence. I am not going to speculate as to the reasons why this is the case, but I will confess to being a software developer intrigued by the pursuit of modeling intelligent thought. I had heard of a company called Numenta and recently had the time to go to their web site and see what they are up to. If you happen to share my interest, I strongly recommend you check them out. Their papers on HTM (hierarchical temporal memory) are insightful and their demo on image recognition is the coolest thing I have seen in a while.
The founder, Jeff Hawkins, who happens to be a co-founder of Palm Computing and Handspring, has posed the statement: while any child can identify a drawing of a cat or dog the first time it's seen, computers find the same task nearly impossible. It is an interesting and indisputable (to date) statement. In this article I would like to discuss how image recognition is a mathematical exercise that our minds perform routinely. These thoughts could apply to other types of pattern recognition (sound, touch, etc.) as well, but I will focus on the recognition of two-dimensional images.
Disclaimer: There are many sub-disciplines and countless publications in the area of artificial intelligence. I have only been able to read a small sampling of these writings. If the ideas in this article infringe on the work of somebody else, I apologize.
To begin, consider this image:
You recognize it as a line segment. Let us not take that accomplishment for granted. You do not think of the image as one thousand dots lined up right next to each other. Nor do you think of it as a blue square with a white square with a slit in it placed on top of the blue square. Your mind compresses the image into the most logical, efficient representation: a short blue line.
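Now I do not know how this line is modeled in our brain, but I do know how I could model it using algebra and a Cartesian coordinate system. The formula for a straight line is y = mx + b, where m is the slope of the line and b is the y-intercept (the value of y where the line crosses the y-axis).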
Let us arbitrarily place this segment in the Cartesian plane of our mind (you mean you didn't know you had one?).
The equation that represents this line is:
where x is constrained between the values 1 and 4 (1 <= x <= 4).
And by the way, the length of this line is determined with the distance formula:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
where (x1, y1) and (x2, y2) are the endpoints of the segment. Here the length turns out to be the square root of eighteen, or roughly 4.24.
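As a rough illustration, the length calculation can be sketched in Python. The slope of one and the x range of 1 to 4 come from the text; the intercept of 0 and the function name `segment_length` are assumptions made only for this example (the intercept does not affect the length).

```python
import math


def segment_length(m, b, x_start, x_end):
    """Length of the piece of y = m*x + b between x_start and x_end."""
    # Find the y coordinate of each endpoint from the line equation.
    y_start = m * x_start + b
    y_end = m * x_end + b
    # math.hypot applies the distance formula to the two differences.
    return math.hypot(x_end - x_start, y_end - y_start)


# Slope 1 and 1 <= x <= 4 are from the article; b = 0 is assumed.
blue_length = segment_length(m=1, b=0, x_start=1, x_end=4)
print(round(blue_length, 2))  # 4.24
```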
Now let us look at the first line along with another line:
You recognize that these objects are similar in nature (they are both line segments) but the red one is a bit steeper and a little longer than the blue. Let us place them both in the Cartesian plane:
The formula that represents this red line is:
You see the slope is greater than that of the blue line (three instead of one, making the line "steeper").
And using the distance formula above, we determine a length of roughly 6.32.
So algebra confirms your assessment. Even if you never took a day of algebra in your life and these formulas look foreign to you, your mind is performing these calculations in some manner. Your mind interprets these two-dimensional images as some type of Cartesian line function.
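The comparison of the two segments can be sketched the same way. The slopes (one and three) and the resulting lengths are from the text; the intercepts and the red line's x range of 1 to 3 are assumptions, chosen only because they reproduce a length of roughly 6.32. The `Segment` class is a hypothetical helper, not a claim about how the mind stores a line.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    m: float        # slope
    b: float        # y-intercept (assumed; not given in the article)
    x_start: float
    x_end: float

    def length(self) -> float:
        # Horizontal run, and the rise implied by the slope over that run.
        dx = self.x_end - self.x_start
        dy = self.m * dx
        return math.hypot(dx, dy)


blue = Segment(m=1, b=0, x_start=1, x_end=4)
red = Segment(m=3, b=0, x_start=1, x_end=3)  # x range is an assumption

print(red.m > blue.m)                 # True: the red line is steeper
print(red.length() > blue.length())   # True: the red line is longer
print(round(red.length(), 2))         # 6.32
```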
Look at this image:
You recognize it as a circle. The Cartesian coordinate formula for a circle centered at the origin is:
x^2 + y^2 = r^2
where r is the radius and x and y are the coordinates of any point on the circle.
As your eye traces and your mind interprets the shape you see, it evaluates the shape against this formula (again, in some manner) and categorizes the shape as a circle. You may say, no, I just recognize a circle because I know what a circle is. But that is not the case. Consider these similar shapes and decide which of them you consider a circle.
By definition, every point on the circle is equidistant from its center. The second and fourth shapes are circles; the others are more elliptical. Your mind really does evaluate the Cartesian function that determines these shapes.
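One way to sketch the "equidistant from the center" test in Python is shown below. The function names, the sample shapes, and the 5% tolerance are all assumptions for illustration; nothing here is meant to describe how the brain actually performs the check.

```python
import math


def looks_like_circle(points, center=(0.0, 0.0), tolerance=0.05):
    """True if every point is roughly the same distance from the center."""
    cx, cy = center
    distances = [math.hypot(x - cx, y - cy) for x, y in points]
    mean_radius = sum(distances) / len(distances)
    # Allow each distance to deviate from the mean by a small fraction.
    return all(abs(d - mean_radius) <= tolerance * mean_radius
               for d in distances)


def sample_outline(a, b, n=36):
    """n points on an outline with horizontal radius a and vertical radius b."""
    return [(a * math.cos(2 * math.pi * k / n),
             b * math.sin(2 * math.pi * k / n)) for k in range(n)]


circle = sample_outline(2, 2)    # equal radii: a circle
ellipse = sample_outline(3, 2)   # unequal radii: an ellipse

print(looks_like_circle(circle))   # True
print(looks_like_circle(ellipse))  # False
```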
Consider this circle in a Cartesian plane:
The formula for the circle above is:
confirms your statement. The radius (r) of the first circle is 2, and the radius of the second circle is 4.
Now let us look at these two images:
It is clear to see that the circles in the box on the left are closer to each other than the circles in the box on the right. If we placed these images in a Cartesian plane and drew a line between the centers of each pair, we could calculate (using the distance formula) that the line between the centers of the first two circles is shorter than the line drawn between the centers of the second pair of circles. Once again, as you visually assess the images, your brain performs this calculation in some manner.
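The proximity comparison can be sketched as follows. The center coordinates are invented for the example, since the article does not give any; only the use of the distance formula comes from the text.

```python
import math

# Hypothetical circle centers for each box (not taken from the images).
left_pair = ((1.0, 2.0), (3.0, 2.0))
right_pair = ((1.0, 2.0), (6.0, 2.0))


def center_distance(pair):
    """Distance between the two centers, via the distance formula."""
    first, second = pair
    return math.dist(first, second)


print(center_distance(left_pair))    # 2.0
print(center_distance(right_pair))   # 5.0
print(center_distance(left_pair) < center_distance(right_pair))  # True
```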
So far we have investigated how we can represent simple two-dimensional shapes as functions in a Cartesian plane and how our mind is able to assess position, proximity and proportion. We are ready for the next step -- what does this demonstration have to do with distinguishing between the images of cats and dogs?
To start answering that, look at the shapes below:
Fundamentally, all four images consist of a circle at the end of something that resembles a line (a thick line in some cases).
Now I will ask you to match each shape to one of the labels below, the one that best matches your mental template of that label:
Spoiler alert - the answers are below:
This image is best labeled as a matchstick. The detached circle in the second image defies our understanding of a matchstick. The third and fourth images have circular heads that are proportionally too large to match the template of a matchstick.
This is the best match for a lowercase "i". The dot is detached from the body of the letter and the widths of body and dot are proportionally similar. And unlike the first image, this one is vertically straight.
This is the best match for a lollipop. The length of the stick and the proportion of the "candy" to the stick are closest to our template of a lollipop. The first image, which fit the matchstick, could be considered a near match, but as we are applying labels one-to-one with images, its proportions do not make it as good a match as this image.
This is the best match for a tree. The trunk width in proportion to the crown of leaves matches our picture of a tree. None of the other images present a better representation of a tree template.
The four images are very similar to each other in a geometric sense. I would imagine that many image recognition programs would have trouble telling them apart. But given the choices, you are able to make a best match. And you used the relative properties of scale, positioning, and rotation to make your assessment.
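To make the idea of a "best match" concrete, here is a minimal sketch of one-to-one template matching in Python. Everything in it is an assumption made for illustration: the three features (head size relative to stem length, stem width relative to head size, and whether the head is detached), the numeric values, and the brute-force search over assignments. It is one possible way to express the exercise, not a description of how the mind or any particular program does it.

```python
import math
from itertools import permutations

# Hypothetical templates: (head/stem-length ratio, stem-width/head ratio, detached)
TEMPLATES = {
    "matchstick": (0.15, 0.60, 0),
    "letter i":   (0.25, 1.00, 1),
    "lollipop":   (0.50, 0.15, 0),
    "tree":       (1.20, 0.30, 0),
}

# Hypothetical measurements of the four shapes, in the order shown.
SHAPES = [
    (0.20, 0.50, 0),
    (0.30, 0.90, 1),
    (0.60, 0.20, 0),
    (1.00, 0.35, 0),
]


def best_one_to_one(shapes, templates):
    """Give each shape a different label, minimizing the total mismatch."""
    def total_mismatch(assignment):
        return sum(math.dist(shape, templates[label])
                   for shape, label in zip(shapes, assignment))

    # Try every ordering of the labels and keep the cheapest one.
    return list(min(permutations(templates, len(shapes)), key=total_mismatch))


print(best_one_to_one(SHAPES, TEMPLATES))
# ['matchstick', 'letter i', 'lollipop', 'tree']
```

Searching every ordering is only practical for a handful of labels, which is all this exercise needs.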
The point is that our mind identifies, isolates and aggregates the line functions that make up the boundaries of everyday objects. For boxes and balls, these functions are on the simple side. For a person's face or a sports car, the collection of functions gets more complicated. Here are some examples that are somewhere between the very simple and the more complicated. Look at these images:
Although I am not going to make any money selling these pictures at an art auction, you almost certainly recognize them as a house and a car. Just to emphasize the fact that these images are a collection of line functions, place them in a Cartesian coordinate system:
Both pictures are collections of simple lines, but based on the directionality, size, proportion, and placement of the lines, they map to specific concepts we have established in our minds, in this case, a house and a car.
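A picture like the house can be written down as nothing more than a list of line segments. The sketch below uses made-up coordinates (a 4-by-3 box with a peaked roof) and a hypothetical `describe` helper to report each segment's direction and size, the kinds of properties mentioned above.

```python
import math

# Hypothetical house outline: each entry is ((x1, y1), (x2, y2)).
HOUSE = [
    ((0, 0), (4, 0)),  # floor
    ((4, 0), (4, 3)),  # right wall
    ((4, 3), (0, 3)),  # ceiling
    ((0, 3), (0, 0)),  # left wall
    ((0, 3), (2, 5)),  # left side of the roof
    ((2, 5), (4, 3)),  # right side of the roof
]


def describe(segment):
    """Return (orientation, length) for one line segment."""
    (x1, y1), (x2, y2) = segment
    if x1 == x2:
        orientation = "vertical"
    elif y1 == y2:
        orientation = "horizontal"
    else:
        orientation = "slanted"
    return orientation, round(math.dist((x1, y1), (x2, y2)), 2)


for segment in HOUSE:
    print(describe(segment))
# ('horizontal', 4.0)
# ('vertical', 3.0)
# ('horizontal', 4.0)
# ('vertical', 3.0)
# ('slanted', 2.83)
# ('slanted', 2.83)
```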
Now as I mentioned, the processing our mind does on the images delivered to it ranges from the very simple to the very complex. In the case of the complex, the mathematical representations of those shapes and patterns would be complicated, but definable. This variance is why it is easier for us to recall from memory and sketch a ball as opposed to the painting of Washington crossing the Delaware.
So now let us look at some rough sketches of the outlines of a cat and dog in a Cartesian coordinate system:
The first picture is a cat; the second is a dog. The contours of the face, particularly around the nose and ears, are different. The tails have different outlines. There are differences in the body shapes and the leg length proportions. Of course cats and dogs vary within each species. These are just representative images. But if you repeat the exercise with different examples, you find that the images that are clearly "more dog than cat" are composed of line patterns that drive that distinction.
Now of course this article does not completely describe how our mind interprets what it sees. The examples are all two-dimensional snapshots isolated to one point in time. We do not address the mathematics that deal with three-dimensional rotation, stretching, skewing, and other manipulation of images. And we only deal with the lines that identify boundaries, not the colors and other visual textures. And context plays a big part in how we interpret things. In the case of dogs vs. cats, what noise does the object make? Is somebody walking it? Is it playing with string? And so forth. Each of these considerations is a topic for further investigation in understanding the working of our intelligence. But for now, as you look out the window or at photos of celebrities or at a lasagna, be impressed with the amount of analysis and processing you are constantly doing.