Understanding Visual Reference Expressions in the Real World: How to Use Visual Features

Taiki Funakoshi (AY 2023)

Reference expressions are natural language expressions that distinguish a particular object from others, such as "the blue cup on the table" or "the chocolate near the cup." The task of visual reference expression comprehension is to have a computer understand such expressions. The task has conventionally been formulated as a vision-and-language reasoning task and approached with deep learning models trained on large numbers of bounding boxes paired with reference expressions. Recently emerged large language models, however, allow the task to be reformulated as a pure language reasoning task: it has become clear that a large language model can comprehend visual reference expressions when meta-information about visual features is given to it as text. This research was conducted to clarify the mechanism of visual reference expression comprehension in large language models.
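
As a rough illustration of what "meta-information about visual features given as text" could look like, the following minimal sketch serializes a scene into a text prompt. The function name `build_prompt`, the feature names, and the prompt wording are all assumptions made for this example and are not taken from the study; the call to an actual language model is deliberately left out.

```python
def build_prompt(objects, expression):
    """Serialize object meta-information and a reference expression as text.

    `objects` is a list of dicts with an "id" and a "features" dict.
    The layout of the prompt is hypothetical, chosen only for illustration.
    """
    lines = ["Objects in the scene:"]
    for obj in objects:
        # Sort feature names so the prompt is deterministic.
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(obj["features"].items()))
        lines.append(f"- id={obj['id']}: {attrs}")
    lines.append(f'Reference expression: "{expression}"')
    lines.append("Answer with the id of the object being referred to.")
    return "\n".join(lines)


# Made-up scene, loosely based on the examples in the text above.
scene = [
    {"id": 1, "features": {"type": "cup", "color": "blue", "on": "table"}},
    {"id": 2, "features": {"type": "chocolate", "color": "brown", "near": "cup"}},
]

prompt = build_prompt(scene, "the blue cup on the table")
print(prompt)
```

The resulting string would then be passed to whichever language model is being studied; nothing in this sketch depends on a particular model or API.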

This study investigates a feedback method for incomplete reference expressions, which has not previously been used in visual reference expression comprehension. Because comprehension with a large language model relies on visual features supplied as meta-information, the visual features that play an important role in uniquely identifying objects can be determined in advance. This makes it possible to detect when the visual features contained in a user's expression do not carry enough information to identify the object, and to give feedback accordingly. Taking bookshelves as the target scene, we conducted experiments to determine the percentage of objects that can be uniquely identified by each combination of visual features obtained from a bookshelf, and to identify the visual features that play an important role in object identification.
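
The kind of measurement described above can be sketched as follows: for every combination of visual features, count the fraction of objects whose values for those features are shared with no other object. This is a minimal sketch under stated assumptions, not the procedure used in the study; the feature names (`color`, `thickness`, `position`) and the four-book data set are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def unique_ratio(objects, features):
    """Fraction of objects uniquely identified by the given features."""
    keys = [tuple(obj[f] for f in features) for obj in objects]
    counts = Counter(keys)
    # An object is uniquely identified if no other object shares its key.
    return sum(1 for k in keys if counts[k] == 1) / len(objects)


def rank_feature_sets(objects, feature_names):
    """Compute the unique-identification ratio for every feature subset."""
    results = {}
    for r in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, r):
            results[subset] = unique_ratio(objects, subset)
    return results


# Hypothetical bookshelf: values are made up, not experimental data.
books = [
    {"color": "red", "thickness": "thin", "position": "left"},
    {"color": "red", "thickness": "thick", "position": "left"},
    {"color": "blue", "thickness": "thin", "position": "right"},
    {"color": "red", "thickness": "thin", "position": "right"},
]

ratios = rank_feature_sets(books, ["color", "thickness", "position"])
for subset, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(subset, ratio)
```

In this toy data no single feature identifies more than one book in four, while all three features together identify every book. A table like this could, in principle, indicate which feature a user should be asked to add when an expression is ambiguous.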

The results of this study are expected to contribute to a smooth understanding of reference expressions in scenarios that may occur in the real world and to improve information transfer in everyday life.

(Translated by DeepL)
