Compare and contrast AI APIs at ESC Silicon Valley 2017
I sat in on some very interesting sessions at the Embedded Systems Conference (ESC) in Minneapolis a couple of weeks ago.
One of these talks, Cognitive Computing Evaluation & Comparison, was given by Mak Agashe, who is the CEO of Pifocal. The reason I mention this here is that the evaluation is ongoing, and Mak will be presenting an updated version of this session at the forthcoming ESC Silicon Valley, which will take place December 5-7, 2017, at the San Jose Convention Center, San Jose, California.
There was so much in Mak's talk that I either hadn't thought of before or hadn't considered in nitty-gritty detail. Let's start with the fact that, as reported in the 2017 Embedded Markets Study, a lot of embedded designers are looking to build advanced technologies into their next-generation devices, where these technologies include embedded speech, embedded vision, artificial intelligence (AI), deep learning, and cognitive (thinking, reasoning) capabilities.
The problem is that fields like mathematics, cognitive science, artificial intelligence, machine learning, and deep learning are highly specialized. Building your own versions of these technologies would be extremely time-consuming and expensive. To address this, companies like Amazon, Google, IBM, and Microsoft have developed APIs for vision, speech, and language.
Let's say you need to analyze a series of images to detect human faces, for example. For each face, you might want to infer information such as gender and age. You might also be interested in understanding the emotions of the people involved (happy, angry, disgusted...). What you can do is make an API call that hands your image over to an AI in the cloud, which then responds with the data you've requested.
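To give a feel for the round trip just described, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the request layout, the field names (image, features, faces, age_range, emotions), and the sample response are hypothetical and do not correspond to the actual API of Amazon, Google, IBM, or Microsoft. The response is hard-coded so the example runs without any cloud account.

```python
import base64
import json


def build_request(image_bytes, features):
    """Package an image and the requested attributes as a JSON string.

    The payload layout is hypothetical; each real service defines its own.
    """
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "features": list(features),
    })


def summarize_faces(response_text):
    """Reduce a (hypothetical) JSON response to one summary tuple per face."""
    faces = json.loads(response_text).get("faces", [])
    summaries = []
    for face in faces:
        emotions = face.get("emotions", {})
        # Pick the emotion with the highest score, if any were returned.
        top_emotion = max(emotions, key=emotions.get) if emotions else None
        summaries.append((face.get("gender"), face.get("age_range"), top_emotion))
    return summaries


# Made-up response standing in for what a cloud service might send back.
SAMPLE_RESPONSE = json.dumps({
    "faces": [
        {"gender": "female", "age_range": [25, 35],
         "emotions": {"happy": 0.91, "angry": 0.02, "disgusted": 0.01}},
        {"gender": "male", "age_range": [40, 55],
         "emotions": {"happy": 0.10, "angry": 0.72, "disgusted": 0.05}},
    ]
})

if __name__ == "__main__":
    request_body = build_request(b"raw image bytes", ["gender", "age", "emotion"])
    print(len(request_body), "characters of request JSON")
    for summary in summarize_faces(SAMPLE_RESPONSE):
        print(summary)
```

In a real project, the request would be sent over HTTPS with whatever authentication the chosen service requires, and the parsing code would follow that service's documented response format.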
But which API and associated cloud service should you use? Are you working with images containing individuals, small groups of people, or crowds? What about images containing artistic renderings of faces (paintings, sculptures, abstract pieces)? Would you be surprised to discover that one API might be better when working with small groups of people, while another API may shine when working with larger crowds?
What Mak and his colleagues have done is create a cognitive capability framework that can be used to compare and contrast different APIs. As a starting point, they focused on still images containing one or more faces. They took an open dataset comprising 1,000 images containing 2,500+ faces, then analyzed and annotated these images by hand. Next, they fed these images to the APIs from Amazon, Google, IBM, and Microsoft to see how well each API performed against the hand annotations.
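I don't know how Mak's framework scores the APIs internally, so the following is only a rough Python sketch of the general idea of comparing an API's output against hand annotations. It assumes we have nothing more than a face count per image, and it groups images into "individual," "small group," and "crowd" buckets using thresholds I picked arbitrarily. The function names and the sample numbers are all made up for illustration.

```python
from collections import defaultdict


def bucket(face_count):
    """Assign an image to a size bucket; the thresholds are arbitrary assumptions."""
    if face_count <= 1:
        return "individual"
    if face_count <= 5:
        return "small group"
    return "crowd"


def score_by_bucket(annotated, detected):
    """Return {bucket: (precision, recall)} from per-image face counts.

    annotated and detected map an image name to a face count. Counting
    min(truth, found) as matches is a simplification: it ignores whether
    the detected faces actually line up with the annotated ones.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # matched, found, truth
    for image, truth in annotated.items():
        found = detected.get(image, 0)
        entry = totals[bucket(truth)]
        entry[0] += min(truth, found)
        entry[1] += found
        entry[2] += truth
    scores = {}
    for name, (matched, found, truth) in totals.items():
        precision = matched / found if found else 0.0
        recall = matched / truth if truth else 0.0
        scores[name] = (precision, recall)
    return scores


if __name__ == "__main__":
    # Made-up counts: hand annotations versus one hypothetical API.
    annotated = {"a.jpg": 1, "b.jpg": 4, "c.jpg": 20}
    detected = {"a.jpg": 1, "b.jpg": 5, "c.jpg": 10}
    for name, (precision, recall) in score_by_bucket(annotated, detected).items():
        print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Running the same scoring over each vendor's results would show at a glance whether one API holds up better on crowds while another does better on small groups. A more careful comparison would match detected bounding boxes to annotated faces rather than just counting them.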
The results may surprise you -- they certainly surprised me. One analysis that was missing from the Minneapolis presentation was how well the APIs did at inferring the ages of the faces in the images. After his talk, Mak told me that they were planning on adding these results to the Silicon Valley session.
I've been thinking about this quite a lot. What I would like to see is a comparison of the actual ages of the people in the images, the age range for each face as guesstimated by a team of human observers, and the inferred age range for each face as evaluated by the various AIs. It will be interesting to see how Mak presents his age-related results.
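The three-way comparison I have in mind could be sketched in Python along the following lines. To be clear, this is my own back-of-the-envelope idea and not anything Mak presented: the two metrics (how often the actual age falls inside the estimated range, and how wide the ranges are on average) and all of the numbers below are assumptions invented for illustration.

```python
def in_range(age, age_range):
    """True if age lies within the inclusive (low, high) range."""
    low, high = age_range
    return low <= age <= high


def hit_rate(actual_ages, estimated_ranges):
    """Fraction of faces whose actual age lies inside the estimated range."""
    if not actual_ages:
        return 0.0
    hits = sum(in_range(a, r) for a, r in zip(actual_ages, estimated_ranges))
    return hits / len(actual_ages)


def mean_width(estimated_ranges):
    """Average width of the estimated ranges, in years."""
    if not estimated_ranges:
        return 0.0
    return sum(high - low for low, high in estimated_ranges) / len(estimated_ranges)


if __name__ == "__main__":
    # Made-up data for four faces.
    actual = [23, 41, 67, 8]
    human_guess = [(20, 30), (35, 45), (60, 75), (6, 10)]
    ai_guess = [(25, 35), (38, 48), (55, 65), (4, 12)]

    for label, ranges in (("humans", human_guess), ("AI", ai_guess)):
        print(f"{label}: hit rate {hit_rate(actual, ranges):.2f}, "
              f"mean range width {mean_width(ranges):.2f} years")
```

Looking at both numbers together matters, because an estimator can always boost its hit rate by quoting very wide ranges, which would make the estimate less useful.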
Will you be attending ESC Silicon Valley? If so, be sure to stop me and say "Hi." I'll be the one in the Hawaiian shirt. As always, all you have to do is shout "Max, Beer!" or "Max, Bacon!" to be assured of my undivided attention.