Abstract
The goal of this thesis is to model the resolution of referring expressions (e.g., "the red ball") to visual entities in real world. This task is known as visual reference resolution. In order to address it, two types of information have to be combined: the visual aspects of the objects in the world and the linguistic information provided by the speaker. In this thesis, we use a machine learning approach to construct a model that incorporates both types of information. For each object in the world and each referring expression, we calculate the probability of resolving this referring expression to each object given this referring expression and the visual aspects of the world. A binary logistic regression classifier using a combination of visual and linguistic features is trained to resolve such references. Both simple references ("the red ball") and relational references ("the red ball under the green cube") are handled. The model has been evaluated on two datasets using both virtual and real-world scenes. The evaluation shows that the model performs well, in several cases outperforming existing baselines. It is also shown to be robust to visual uncertainty in the world and to noisy speech input. The model can be extended to incorporate other modalities.