To test whether this problem holds for today’s large multimodal models, the team conducted a controlled evaluation. They fine-tuned the selected models on five target tasks: fine-grained bird classification, counting, medical visual question answering, OCR reading, and time reading. They then measured how much performance dropped across eight standard benchmarks that were not part of the fine-tuning set.
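To make the measurement concrete, here is a minimal sketch of how such a held-out evaluation can be scored; the benchmark names and numbers below are purely illustrative, not the paper’s actual tasks or results.

```python
# Illustrative only: "forgetting" here is the average accuracy drop on
# held-out benchmarks after fine-tuning on a narrow target task.
baseline = {"benchmark_a": 0.78, "benchmark_b": 0.62, "benchmark_c": 0.55}   # hypothetical pre-tuning scores
after_tuning = {"benchmark_a": 0.71, "benchmark_b": 0.58, "benchmark_c": 0.49}  # hypothetical post-tuning scores

drops = [baseline[b] - after_tuning[b] for b in baseline]
print(f"mean held-out drop: {sum(drops) / len(drops):.3f}")
```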
These experiments led to two key discoveries, according to the paper. First, tuning only the self-attention projection layers (SA Proj), the part of the model that decides which input elements to focus on, allowed the models to learn new tasks with little or no measurable forgetting. Second, what initially appeared to be forgotten knowledge often resurfaced when the model was later trained on another specialized task.
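In practice, restricting fine-tuning to the self-attention projections amounts to freezing every other parameter. The sketch below shows the idea on a toy PyTorch transformer; it is not the authors’ code, and the `self_attn` name filter is an assumption (real multimodal LLMs typically expose names like `q_proj`, `k_proj`, `v_proj`, `o_proj` instead).

```python
# Minimal sketch: freeze everything except the self-attention projection
# layers, so only those weights receive gradient updates during fine-tuning.
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)

# In this toy model, parameter names containing "self_attn" cover the
# Q/K/V input projections and the attention output projection.
for name, param in model.named_parameters():
    param.requires_grad = "self_attn" in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g.:", trainable[:2])

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```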
“We thus hypothesize that perhaps what looks like forgetting or interference after fine-tuning on a narrow target task is actually bias in the output distribution due to the task distribution shift,” the researchers added. “Through in-depth analysis when tuning the counting task, we confirm this hypothesis: tuning the MLP increases target accuracy but also increases the likelihood of outputting numeric tokens and a highly correlated drop in held-out task accuracy, while tuning the self-attention achieves the target learning without much bias toward numeric tokens and without losing held-out accuracy.”
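A simple way to probe the bias the researchers describe is to track how much next-token probability mass the model places on numeric tokens before and after fine-tuning. The sketch below is an assumed illustration of that idea, not the paper’s exact analysis; the token ids and logits are synthetic stand-ins.

```python
# Hedged sketch of a numeric-token bias probe: compare the probability mass
# assigned to digit tokens before vs. after fine-tuning on a counting task.
import torch

def numeric_token_mass(logits: torch.Tensor, numeric_ids: torch.Tensor) -> torch.Tensor:
    """Fraction of next-token probability assigned to numeric tokens.

    logits: [batch, vocab_size] next-token logits.
    numeric_ids: 1-D tensor of vocabulary ids whose strings are digits
                 (in practice collected by scanning the tokenizer vocab).
    """
    probs = logits.softmax(dim=-1)
    return probs[:, numeric_ids].sum(dim=-1)

# Toy illustration: in a real probe the two logit tensors would come from
# the base model and the counting-tuned model on the same prompts.
vocab_size = 1000
numeric_ids = torch.arange(10)          # pretend ids 0-9 decode to "0".."9"
before = torch.randn(4, vocab_size)
after = before.clone()
after[:, numeric_ids] += 2.0            # simulate a shift toward numeric outputs

print("mass before:", numeric_token_mass(before, numeric_ids).mean().item())
print("mass after: ", numeric_token_mass(after, numeric_ids).mean().item())
```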