Retrieval with Vision Language Model

The discussion traced the evolution of document retrieval systems from traditional text-based approaches to multimodal retrieval. Conventional lexical methods such as TF-IDF and BM25 are simple and fast, but they ignore document layout and visual information. Neural approaches improve relevance, with bi-encoders comparing dense embeddings and cross-encoders scoring each query-document pair jointly, yet they still operate primarily on extracted text. Vision-Language Models (VLMs) can process both the textual and the visual elements of a document, though balancing retrieval quality against scalability remains a challenge.

ColPali was presented as a solution to these challenges. The system treats each PDF page as an image and divides it into a 32x32 grid of patches (1,024 patches, which together with a few additional tokens yields approximately 1,030 vectors per page). The patches are processed by the vision-language model to gain contextual understanding, and each is represented as a 128-dimensional vector. ColPali shows significant retrieval improvements over pipelines built on traditional PDF parsers and over methods such as SigLIP. The system is also interpretable: similarity maps highlight the parts of a page that are most relevant to each query term.
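The per-patch vectors and similarity maps described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not ColPali's actual code: it assumes a 32x32 patch grid (1,024 vectors per page, the figure quoted later in this summary), uses random vectors in place of real model output, and scores with ColBERT-style late interaction (MaxSim), which the summary itself does not spell out. The names maxsim_score and similarity_map are hypothetical.

```python
import random

GRID = 32   # assumed 32 x 32 patch grid per page
DIM = 128   # vector size stated in the summary


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score (assumed): each query vector keeps only its
    best-matching page vector, and those maxima are summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def similarity_map(query_vec, page_vecs, grid=GRID):
    """Lay one query vector's patch similarities out as grid rows,
    so high values mark the page regions that match that query term."""
    sims = [dot(query_vec, p) for p in page_vecs]
    return [sims[r * grid:(r + 1) * grid] for r in range(grid)]


def random_vecs(n, dim, rng):
    """Stand-in for model output: n random vectors of size dim."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
page = random_vecs(GRID * GRID, DIM, rng)   # 1,024 patch vectors for one page
query = random_vecs(4, DIM, rng)            # a hypothetical 4-token query

score = maxsim_score(query, page)           # one relevance score for the page
heat = similarity_map(query[0], page)       # 32 x 32 map for the first token
print(len(page), len(heat), len(heat[0]))
```

With real embeddings, the rows of heat would be overlaid on the page image to show where a query term matches.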

A significant portion of the discussion addressed the challenges of scaling ColPali and strategies for optimizing it. Because the system generates over 1,000 vectors per page, searching a large document collection (e.g., 20,000 pages) was said to require trillions of comparisons. The proposed remedy is a hybrid, two-stage approach: pooling compresses each page's 1,024 vectors into 32 vectors for a fast initial retrieval, and the original ColPali vectors are then used to rerank the shortlisted pages. This approach aims to maintain retrieval quality while substantially reducing computational cost, which makes the system more practical for real-world applications.
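The pool-then-rerank idea can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the summary does not say which pooling technique was used, so this sketch assumes mean pooling over each row of the patch grid (32 rows of 32 patches give 32 vectors), and it assumes MaxSim scoring in both stages. The function names, the shortlist parameter, and the tiny 2x2 demo grid are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, page_vecs):
    """Sum over query vectors of the best dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def mean_pool_rows(page_vecs, grid=32):
    """Assumed pooling: average each grid row, so grid*grid vectors
    become grid vectors (1,024 -> 32 for a 32 x 32 grid)."""
    dim = len(page_vecs[0])
    pooled = []
    for r in range(grid):
        row = page_vecs[r * grid:(r + 1) * grid]
        pooled.append([sum(v[d] for v in row) / grid for d in range(dim)])
    return pooled


def two_stage_search(query_vecs, pages, shortlist=2, grid=32):
    """Stage 1: rank all pages on pooled vectors and keep a shortlist.
    Stage 2: rerank only the shortlist on the full patch vectors."""
    # In a real index the pooled vectors would be computed once, offline.
    pooled = [mean_pool_rows(p, grid) for p in pages]
    coarse = sorted(range(len(pages)),
                    key=lambda i: maxsim(query_vecs, pooled[i]),
                    reverse=True)[:shortlist]
    return sorted(coarse,
                  key=lambda i: maxsim(query_vecs, pages[i]),
                  reverse=True)


# Toy demo with a 2 x 2 grid and 2-dimensional vectors.
pages = [
    [[1, 0], [3, 0], [0, 2], [0, 4]],
    [[5, 0], [0, 0], [0, 0], [0, 0]],
    [[0, 1], [0, 1], [0, 1], [0, 1]],
]
ranked = two_stage_search([[1, 0]], pages, shortlist=2, grid=2)
print(ranked)
```

The first stage touches 32 vectors per page instead of 1,024, so the expensive full comparison runs only on the shortlist. The trade-off is that a page whose relevant patch is averaged away during pooling may never reach the reranker.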

Slides link: https://docs.google.com/presentation/d/1hDXIDHBC3P-QaF7wmHccuqKNbh_Qf44ZCnjjBSrPgUI/edit?usp=sharing

Session 6: Retrieval with Vision Language Model
