INF-MLLM: Multimodal Large Language Models from INF Tech

Introduction

INF-MLLM is a series of open-source multimodal large language models developed by INF Tech. This repository contains the code, models, and documentation for our projects, which aim to advance the state of the art in visual-language understanding and document intelligence. We are committed to open research and have released our models and datasets to the community to foster collaboration and innovation.

Updates

  • [2025/06/30] The Infinity-Doc-55K dataset and Infinity-Parser web demo are now available.
  • [2025/05/27] We have added an introduction to our latest model, Infinity-Parser.
  • [2025/04/22] VL-Rethinker models (7B & 72B) are released! They achieve new state-of-the-art results on MathVista, MathVerse, and MathVision benchmarks.
  • [2024/08/19] We have released INF-MLLM2, with the INF-MLLM2-7B model and evaluation code now available.
  • [2023/12/06] The models and evaluation code for INF-MLLM1 are now available.
  • [2023/11/06] We have released INF-MLLM1 and uploaded the initial version of the manuscript to arXiv.

Models

Here is a brief overview of the models available in this repository. For more details, please refer to the respective project directories.

Infinity-Parser is an end-to-end scanned document parsing model trained with reinforcement learning. It is designed to maintain the original document's structure and content with high fidelity by incorporating verifiable rewards based on layout and content. Infinity-Parser demonstrates state-of-the-art performance on various benchmarks for text recognition, table and formula extraction, and reading-order detection.
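As a rough illustration of what a layout- and content-based verifiable reward could look like, the sketch below scores a predicted parse against a reference by combining text similarity with agreement on block reading order. The weighting and metrics are assumptions for exposition, not the rewards actually used by Infinity-Parser.

```python
# Hypothetical verifiable reward for document parsing. Infinity-Parser's rewards
# are based on layout and content; the exact formulation here is an assumption.
from difflib import SequenceMatcher

def content_score(pred_text: str, ref_text: str) -> float:
    """Similarity of the extracted text, in [0, 1]."""
    return SequenceMatcher(None, pred_text, ref_text).ratio()

def order_score(pred_blocks: list[str], ref_blocks: list[str]) -> float:
    """Rough fraction of reference blocks recovered in the right relative order."""
    if not ref_blocks:
        return 1.0
    matched = [b for b in pred_blocks if b in ref_blocks]
    if not matched:
        return 0.0
    in_order = sum(1 for a, b in zip(matched, matched[1:])
                   if ref_blocks.index(a) < ref_blocks.index(b))
    return (in_order + 1) / len(ref_blocks)

def parsing_reward(pred_text, ref_text, pred_blocks, ref_blocks,
                   w_content: float = 0.5, w_order: float = 0.5) -> float:
    """Scalar reward in [0, 1] combining content fidelity and reading order."""
    return (w_content * content_score(pred_text, ref_text)
            + w_order * order_score(pred_blocks, ref_blocks))
```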

VL-Rethinker is a project designed to incentivize the self-reflection capabilities of Vision-Language Models (VLMs) through Reinforcement Learning. The research introduces a novel technique called Selective Sample Replay (SSR) to enhance the GRPO algorithm, addressing the "vanishing advantages" problem. It also employs "Forced Rethinking" to explicitly guide the model through a self-reflection reasoning step. By combining these methods, VL-Rethinker significantly advances the state-of-the-art performance on multiple vision-language benchmarks, including MathVista, MathVerse, and MathVision.
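A minimal sketch of the selective-replay idea, assuming a GRPO-style setup where each prompt yields a group of sampled responses with group-normalized advantages. The buffer design, threshold, and function names are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative sketch of Selective Sample Replay (SSR) on top of a GRPO-style
# update. When every response in a group receives the same reward, the
# group-normalized advantages vanish and the batch carries no gradient signal;
# SSR counters this by replaying rollouts that still have non-zero advantage.
import random
from dataclasses import dataclass, field

@dataclass
class Rollout:
    prompt: str
    response: str
    advantage: float  # group-normalized reward (GRPO-style)

@dataclass
class ReplayBuffer:
    capacity: int = 1024
    items: list = field(default_factory=list)

    def add(self, rollouts):
        # Keep only rollouts whose advantage is non-zero, i.e. informative.
        informative = [r for r in rollouts if abs(r.advantage) > 1e-6]
        self.items.extend(informative)
        self.items = self.items[-self.capacity:]

    def sample(self, k):
        # Favor rollouts with larger |advantage| so replayed batches keep a
        # useful learning signal.
        if not self.items or k <= 0:
            return []
        weights = [abs(r.advantage) for r in self.items]
        return random.choices(self.items, weights=weights, k=k)

def build_batch(fresh_rollouts, buffer, batch_size):
    """Mix fresh informative rollouts with replayed high-advantage ones."""
    buffer.add(fresh_rollouts)
    fresh = [r for r in fresh_rollouts if abs(r.advantage) > 1e-6]
    return fresh + buffer.sample(batch_size - len(fresh))
```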

INF-MLLM2 is an advanced multimodal model with significant improvements in high-resolution image processing and document understanding. It supports dynamic image resolutions up to 1344x1344 pixels and features enhanced OCR capabilities for robust document parsing, table and formula recognition, and key information extraction.
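A hedged example of how a released checkpoint like this is typically loaded with Hugging Face Transformers; the repository ID and generation interface below are assumptions, so check the INF-MLLM2 project directory for the actual usage instructions.

```python
# Sketch of loading a released checkpoint with Hugging Face Transformers.
# The model ID is a hypothetical placeholder; see the project directory for
# the actual repository name and inference API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "infly-ai/INF-MLLM2-7B"  # assumed Hub ID, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision for a 7B model on one GPU
    trust_remote_code=True,      # custom multimodal code ships with the repo
).eval().cuda()
```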

INF-MLLM1 is a unified model for a wide range of visual-language tasks. It is designed to handle both multitask and instruction-tuning scenarios, demonstrating strong performance on various VQA and visual grounding datasets.
