ViT Survey

Peeling Back the Layers: Interpreting the Storytelling of ViT

MM 2024. Decoding ViT layer by layer: revealing the image understanding process

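The paper uses InstructBLIP as its base model, which consists of a 40-layer image encoder (EVA-CLIP-ViT) and a large language model as the text decoder, and it analyzes the internal structure of the ViT layer by layer and head by head. The same approach could be borrowed to analyze ViT-B/16.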
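As a rough starting point for the ViT-B/16 analysis, the sketch below lays out the grid a layer-by-layer, head-by-head study would iterate over. It assumes the standard ViT-B/16 configuration (12 layers, 12 heads, hidden size 768, 16x16 patches) and a 224x224 input, which the note does not state. The attention logits are random placeholders and the entropy statistic is only a stand-in metric; neither comes from the paper, whose method decodes each layer's features with a language model. The names `ViTConfig`, `analysis_grid`, and `demo` are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    """Assumed ViT-B/16 geometry (standard values, 224x224 input)."""
    image_size: int = 224
    patch_size: int = 16
    num_layers: int = 12
    num_heads: int = 12
    hidden_dim: int = 768

    @property
    def num_patches(self) -> int:
        # Number of non-overlapping patches per image.
        return (self.image_size // self.patch_size) ** 2

    @property
    def seq_len(self) -> int:
        # Patch tokens plus one [CLS] token.
        return self.num_patches + 1

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_heads


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def analysis_grid(cfg):
    """Every (layer, head) pair that a per-head study would visit."""
    return [(layer, head)
            for layer in range(cfg.num_layers)
            for head in range(cfg.num_heads)]


def demo(seed=0):
    """Fill the grid with a placeholder statistic.

    The logits are random stand-ins for the [CLS] attention row; a real
    analysis would read them from the model's attention layers instead.
    """
    cfg = ViTConfig()
    rng = random.Random(seed)
    report = {}
    for layer, head in analysis_grid(cfg):
        logits = [rng.gauss(0.0, 1.0) for _ in range(cfg.seq_len)]
        report[(layer, head)] = attention_entropy(softmax(logits))
    return cfg, report


if __name__ == "__main__":
    cfg, report = demo()
    print(cfg.seq_len, cfg.head_dim, len(report))
```

Under these assumptions each image yields 196 patch tokens plus one [CLS] token, and the grid has 144 (layer, head) cells, far fewer than a 40-layer encoder would produce, so an exhaustive per-head pass over ViT-B/16 is cheap by comparison.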

Last updated: 2025/06/25, 11:25:50