Meta AI researchers recently unveiled OpenEQA, an open-source benchmark dataset designed to assess an artificial intelligence system’s capacity for “embodied question answering”: developing an understanding of an environment well enough to answer natural language questions about it.
Meta is positioning the dataset as a key benchmark for “embodied AI.” It comprises more than 1,600 questions covering 180 environments such as homes and offices, spanning seven question categories that test capabilities including object recognition, attribute recognition, spatial reasoning, functional reasoning, and commonsense knowledge.
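In practical terms, each benchmark entry pairs a natural language question with the sensor data an agent can draw on and the human answers it will be graded against. The sketch below shows one plausible shape for such a record; the field names and example values are illustrative assumptions, not the dataset’s actual schema.

```python
# Illustrative sketch of an OpenEQA-style benchmark record.
# Field names and values are assumptions for explanation, not the official schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EQAQuestion:
    question: str                  # natural language query about the environment
    category: str                  # e.g. "object recognition", "spatial reasoning"
    episode_history: str           # reference to the video / 3D scan the agent sees
    human_answers: List[str] = field(default_factory=list)  # reference answers


example = EQAQuestion(
    question="How many chairs are around the dining table?",
    category="object recognition",
    episode_history="scans/home_042",
    human_answers=["four", "4"],
)
```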
“Considering this context, we argue that Embodied Question Answering (EQA) serves as both an end application and an evaluation tool,” the researchers write in a paper released today. EQA refers to understanding an environment well enough to answer questions about it in natural language.
Robotics, Computer Vision and Language AI
The OpenEQA project lies at the crossroads of some of the hottest areas of artificial intelligence: computer vision, natural language processing, knowledge representation, and robotics. The goal is to create artificial agents that can perceive and interact with the world, communicate naturally with humans, and draw on knowledge to assist people in daily life.
Researchers envision two primary near-term applications for this “embodied intelligence.” The first is AI assistants embedded in augmented reality glasses or headsets that use video and sensor data to give users a photographic memory, answering questions such as “Where did I leave my keys?” The second is mobile robots that autonomously explore an environment to gather information, for example searching the home to answer “Do I still have any coffee left?”
Establishing a Robust Benchmark
To create the OpenEQA dataset, Meta researchers collected video footage and 3D scans of real environments. They then showed the videos to human volunteers and asked them to pose the questions they would want answered by an agent with access to that visual data.
The 1,636 questions cover an array of perception and reasoning capabilities. For example, to answer “How many chairs are around the dining table?” an AI would need to recognize objects in the scene, understand the spatial concept “around,” and count relevant objects; other questions require basic knowledge about objects’ uses and attributes.
Each question includes answers produced by multiple humans, reflecting that a question can often be answered correctly in more than one way. To evaluate AI agents, the researchers employ large language models that automatically score how similar an agent’s generated answer is to those provided by humans.
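The sketch below illustrates this LLM-as-judge idea in the spirit of the paper’s protocol; the prompt wording, the 1-to-5 rating scale, and the `call_llm` placeholder are assumptions for illustration, not the exact rubric or API used by the researchers.

```python
# Sketch of LLM-based answer scoring, assuming a prompt-graded 1-5 scale.
# `call_llm` is a placeholder for whatever language-model client is available.
from typing import List


def build_judge_prompt(question: str, human_answers: List[str], agent_answer: str) -> str:
    """Assemble a grading prompt asking an LLM to rate answer similarity."""
    references = "; ".join(human_answers)
    return (
        "You are grading an embodied question answering system.\n"
        f"Question: {question}\n"
        f"Reference answers from humans: {references}\n"
        f"Candidate answer: {agent_answer}\n"
        "On a scale of 1 (completely wrong) to 5 (matches the references), "
        "reply with a single integer score."
    )


def call_llm(prompt: str) -> str:
    """Placeholder: connect this to a real language-model API of your choice."""
    raise NotImplementedError("Swap in an actual LLM client here.")


def score_answer(question: str, human_answers: List[str], agent_answer: str) -> float:
    """Return a normalized correctness score in [0, 1] for one question."""
    prompt = build_judge_prompt(question, human_answers, agent_answer)
    rating = int(call_llm(prompt).strip())  # expect "1".."5" from the judge
    return (rating - 1) / 4                 # map the 1-5 rating onto 0-1


# Example usage once call_llm is wired up:
# score_answer("How many chairs are around the dining table?",
#              ["four", "4 chairs"], "There are four chairs.")
```

Averaging such per-question scores across the benchmark yields an overall correctness figure that can be compared across models.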