Abstract
With the increasing accuracy and usability of Artificial Intelligence (AI), and of deep neural networks in particular, demand for these methods has grown rapidly. They are deployed across many domains to increase productivity, create new industries, and enhance people’s lives. However, these networks are often large and complex, offering little insight into how they arrive at their predictions. To make such models more useful and to improve them, humans need to understand how they reason. This work studies explanatory models and how they can provide value and insight into how a fully trained underlying model interprets data. The experiments specifically examine how Visual Question Answering (VQA) models can be explained in both the visual and linguistic domains. Two distinct methods are proposed to bridge the gap between high accuracy and interpretability. The first combines the task of VQA with the Explainable Artificial Intelligence (XAI) method Faithful Linguistic Explanations (FLEX). The second encodes extracted image features into the text prompt of a Large Language Model (LLM). Quantitative experiments provide the necessary insights: the language model is explained using visualizations of its transition scores, and a proxy model is explained using Local Interpretable Model-agnostic Explanations (LIME). The main finding of this research is that large and complex models, such as LLMs, can be explained by smaller post-hoc methods applied after the primary model has completed training. These methods add layers of explanation that yield valuable insights at no cost to the accuracy of the primary model.