INF-MLLM: Multimodal Large Language Models from INF Tech

Introduction

INF-MLLM is a series of open-source multimodal large language models developed by INF Tech. This repository contains the code, models, and documentation for our projects, which aim to advance the state of the art in visual-language understanding and document intelligence. We are committed to open research and have released our models and datasets to the community to foster collaboration and innovation.

Updates

  • [2025/06/30] The Infinity-Doc-55K dataset and Infinity-Parser web demo are now available.
  • [2025/05/27] We have added an introduction to our latest model, Infinity-Parser.
  • [2025/04/22] VL-Rethinker models (7B & 72B) are released! They achieve new state-of-the-art results on MathVista, MathVerse, and MathVision benchmarks.
  • [2024/08/19] We have released INF-MLLM2, with the INF-MLLM2-7B model and evaluation code now available.
  • [2023/12/06] The models and evaluation code for INF-MLLM1 are now available.
  • [2023/11/06] We have released INF-MLLM1 and uploaded the initial version of the manuscript to arXiv.

Models

Here is a brief overview of the models available in this repository. For more details, please refer to the respective project directories.

Infinity-Parser is an end-to-end scanned document parsing model trained with reinforcement learning. It is designed to maintain the original document's structure and content with high fidelity by incorporating verifiable rewards based on layout and content. Infinity-Parser demonstrates state-of-the-art performance on various benchmarks for text recognition, table and formula extraction, and reading-order detection.
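As a rough illustration of what a layout- and content-based verifiable reward could look like, the sketch below scores a predicted parse against a reference by combining text similarity with agreement on block reading order. The weighting and metrics are assumptions for exposition, not the rewards actually used by Infinity-Parser.

```python
# Hypothetical verifiable reward for document parsing. Infinity-Parser's rewards
# are based on layout and content; the exact formulation here is an assumption.
from difflib import SequenceMatcher

def content_score(pred_text: str, ref_text: str) -> float:
    """Similarity of the extracted text, in [0, 1]."""
    return SequenceMatcher(None, pred_text, ref_text).ratio()

def order_score(pred_blocks: list[str], ref_blocks: list[str]) -> float:
    """Rough fraction of reference blocks recovered in the right relative order."""
    if not ref_blocks:
        return 1.0
    matched = [b for b in pred_blocks if b in ref_blocks]
    if not matched:
        return 0.0
    in_order = sum(1 for a, b in zip(matched, matched[1:])
                   if ref_blocks.index(a) < ref_blocks.index(b))
    return (in_order + 1) / len(ref_blocks)

def parsing_reward(pred_text, ref_text, pred_blocks, ref_blocks,
                   w_content: float = 0.5, w_order: float = 0.5) -> float:
    """Scalar reward in [0, 1] combining content fidelity and reading order."""
    return (w_content * content_score(pred_text, ref_text)
            + w_order * order_score(pred_blocks, ref_blocks))
```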

VL-Rethinker is a project designed to incentivize the self-reflection capabilities of Vision-Language Models (VLMs) through Reinforcement Learning. The research introduces a novel technique called Selective Sample Replay (SSR) to enhance the GRPO algorithm, addressing the "vanishing advantages" problem. It also employs "Forced Rethinking" to explicitly guide the model through a self-reflection reasoning step. By combining these methods, VL-Rethinker significantly advances the state-of-the-art performance on multiple vision-language benchmarks, including MathVista, MathVerse, and MathVision.
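A minimal sketch of the selective-replay idea, assuming a GRPO-style setup where each prompt yields a group of sampled responses with group-normalized advantages. The buffer design, threshold, and function names are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative sketch of Selective Sample Replay (SSR) on top of a GRPO-style
# update. When every response in a group receives the same reward, the
# group-normalized advantages vanish and the batch carries no gradient signal;
# SSR counters this by replaying rollouts that still have non-zero advantage.
import random
from dataclasses import dataclass, field

@dataclass
class Rollout:
    prompt: str
    response: str
    advantage: float  # group-normalized reward (GRPO-style)

@dataclass
class ReplayBuffer:
    capacity: int = 1024
    items: list = field(default_factory=list)

    def add(self, rollouts):
        # Keep only rollouts whose advantage is non-zero, i.e. informative.
        informative = [r for r in rollouts if abs(r.advantage) > 1e-6]
        self.items.extend(informative)
        self.items = self.items[-self.capacity:]

    def sample(self, k):
        # Favor rollouts with larger |advantage| so replayed batches keep a
        # useful learning signal.
        if not self.items or k <= 0:
            return []
        weights = [abs(r.advantage) for r in self.items]
        return random.choices(self.items, weights=weights, k=k)

def build_batch(fresh_rollouts, buffer, batch_size):
    """Mix fresh informative rollouts with replayed high-advantage ones."""
    buffer.add(fresh_rollouts)
    fresh = [r for r in fresh_rollouts if abs(r.advantage) > 1e-6]
    return fresh + buffer.sample(batch_size - len(fresh))
```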

INF-MLLM2 is an advanced multimodal model with significant improvements in high-resolution image processing and document understanding. It supports dynamic image resolutions up to 1344x1344 pixels and features enhanced OCR capabilities for robust document parsing, table and formula recognition, and key information extraction.
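A hedged example of how a released checkpoint like this is typically loaded with Hugging Face Transformers; the repository ID and generation interface below are assumptions, so check the INF-MLLM2 project directory for the actual usage instructions.

```python
# Sketch of loading a released checkpoint with Hugging Face Transformers.
# The model ID is a hypothetical placeholder; see the project directory for
# the actual repository name and inference API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "infly-ai/INF-MLLM2-7B"  # assumed Hub ID, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision for a 7B model on one GPU
    trust_remote_code=True,      # custom multimodal code ships with the repo
).eval().cuda()
```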

INF-MLLM1 is a unified model for a wide range of visual-language tasks. It is designed to handle both multitask and instruction-tuning scenarios, demonstrating strong performance on various VQA and visual grounding datasets.
