
Apple researchers have unveiled an artificial intelligence system that can understand ambiguous references to on-screen entities, as well as conversational and background context, enabling more natural interactions with voice assistants, according to a paper published Friday.
ReALM (Reference Resolution As Language Modeling) uses large language models to convert the complex task of reference resolution, including understanding references to visual elements on a screen, into a pure language modeling problem, yielding substantial performance gains over existing approaches.
“Understanding context, such as references, is vital for an effective voice assistant experience,” the Apple researchers wrote. Allowing users to issue queries about what is displayed on their screens is also key to delivering truly hands-free functionality in voice assistants.
Enhancing conversational assistants
ReALM’s key innovation for handling screen-based references lies in reconstructing the screen: parsed on-screen entities and their locations are used to generate a textual representation that captures the visual layout. The researchers found that this approach, combined with language models fine-tuned specifically for reference resolution, could outperform GPT-4 on the task.
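To make the idea concrete, here is a minimal sketch of what such a textual screen encoding might look like, assuming parsed entities arrive with a text label and bounding-box coordinates. The ScreenEntity fields, the line_tolerance parameter, and the grouping heuristic are illustrative assumptions, not details taken from Apple's paper.

```python
# Illustrative sketch only: entity fields and grouping rules are assumptions,
# not Apple's published implementation.
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    text: str   # parsed label, e.g. "(415) 555-0123"
    x: float    # left edge of the entity's bounding box
    y: float    # vertical center of the entity's bounding box


def screen_to_text(entities: list[ScreenEntity], line_tolerance: float = 10.0) -> str:
    """Render parsed on-screen entities as plain text that preserves rough layout.

    Entities whose vertical centers fall within `line_tolerance` pixels are
    treated as one visual line and ordered left to right; lines are ordered
    top to bottom. The resulting string can be placed in an LLM prompt so the
    model reasons over the screen as text.
    """
    ordered = sorted(entities, key=lambda e: (e.y, e.x))
    lines: list[list[ScreenEntity]] = []
    for entity in ordered:
        if lines and abs(entity.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(entity)      # same visual line
        else:
            lines.append([entity])        # start a new line
    return "\n".join(
        "\t".join(e.text for e in sorted(line, key=lambda e: e.x))
        for line in lines
    )


# Example: a toy parsed screen with a business name above a phone number.
screen = [
    ScreenEntity("Joe's Pizza", x=20, y=100),
    ScreenEntity("Open now", x=200, y=102),
    ScreenEntity("(415) 555-0123", x=20, y=140),
]
print(screen_to_text(screen))
```

Encoding the screen this way lets a tuned language model resolve a request like "call that number" against the textual layout, with no image input required.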
The researchers report significant improvements over an existing system with similar functionality across various reference types, with their smallest model achieving absolute gains of over 5% for on-screen references alone, and their larger models substantially outperforming GPT-4.
Practical applications and limitations
This work highlights the promise of focused language models for tasks like reference resolution in production systems, where massive end-to-end models would be infeasible due to latency or compute constraints. By publishing the research, Apple signals its continuing investment in making Siri and other products more conversational and context-aware.
However, the researchers note that automated parsing of screens has its limits. Handling more complex visual references, such as distinguishing between multiple images, would likely require incorporating computer vision and multimodal techniques for more robust results.