Multimodal►Image-to-Text

InstructBLIP

InstructBLIP model using Flan-T5-xl as language model. InstructBLIP was introduced in the paper InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Dai et al.

Disclaimer: The team releasing InstructBLIP did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description

InstructBLIP is a visual instruction tuned version of BLIP-2. Refer to the paper for details.

InstructBLIP architecture