Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

Abstract

A hallmark of human vision is its active, selective nature. Guided by internal goals or task demands, we focus on relevant parts of the visual world while ignoring distractions. In contrast, today’s leading vision models produce static, pre-computed representations without reference to the specific query they are meant to serve. We address this gap with LIVE (Language-Instructed Vision Embeddings), a simple and effective framework for creating language-steered vision embeddings. LIVE enables dynamic, fine-grained control of a vision encoder by training it to follow textual instructions. We use LLMs to generate synthetic instruction-response pairs, which we combine with images into contrastive triplets. This teaches the vision encoder to steer its embeddings based on textual commands, allowing it to highlight relevant attributes or suppress adversarial cues.
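
To make the triplet idea concrete, here is a minimal sketch of how a contrastive objective over (image, instruction, response) triplets might look. The abstract does not specify the loss, so the InfoNCE form, the temperature value, and every name below (`info_nce_loss`, `steered_image_embs`, `response_embs`) are assumptions for illustration, not the authors' implementation. The sketch also assumes that an instruction-conditioned vision encoder has already produced one embedding per (image, instruction) pair and that a text encoder has embedded each response.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(steered_image_embs, response_embs, temperature=0.07):
    """Mean InfoNCE loss over a batch of triplets (assumed objective).

    steered_image_embs[i] is the hypothetical encoder output for image i
    conditioned on instruction i; response_embs[i] is the embedding of the
    matching response. All other responses in the batch act as negatives.
    """
    imgs = [l2_normalize(v) for v in steered_image_embs]
    txts = [l2_normalize(v) for v in response_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, t) / temperature for t in txts]
        # Numerically stable log-sum-exp over all candidate responses.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching response.
        total += log_denom - logits[i]
    return total / len(imgs)


# Toy batch of two triplets with 2-D embeddings (made-up numbers).
steered = [[1.0, 0.0], [0.0, 1.0]]
responses = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce_loss(steered, responses)         # matched pairs
shuffled = info_nce_loss(steered, responses[::-1])  # mismatched pairs
print(aligned < shuffled)
```

Under this assumed objective, the loss is low when each steered image embedding sits closest to its own response and high when the pairing is shuffled, which is the pressure that would push the encoder to change its embedding according to the instruction.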

Publication
In International Conference on Learning Representations (ICLR)
Date