IBM has released Granite-4.0-3B-Vision, a compact vision-language model designed specifically for enterprise document data extraction tasks. The 3-billion parameter model is now available on Hugging Face and focuses on specialized extraction capabilities that smaller models typically struggle with.
The model targets three primary use cases: chart extraction, table extraction, and semantic key-value pair extraction from document images. For chart processing, it can convert visual charts into structured formats including CSV data (Chart2CSV), descriptive summaries (Chart2Summary), and executable code that recreates the chart (Chart2Code).
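A Chart2CSV response is plain CSV text, so downstream code typically needs to parse it and coerce the numeric values the prompt asks for. A minimal post-processing sketch (the sample string is an invented stand-in for a model response, not actual model output):

```python
import csv
import io

def parse_chart2csv(raw: str) -> list[dict]:
    """Parse a Chart2CSV-style response (CSV text with a header row)
    into a list of dicts, coercing numeric fields where possible."""
    reader = csv.DictReader(io.StringIO(raw.strip()))
    rows = []
    for row in reader:
        typed = {}
        for key, value in row.items():
            try:
                typed[key] = float(value)  # numeric series from the chart
            except ValueError:
                typed[key] = value         # categorical labels stay strings
        rows.append(typed)
    return rows

# Invented sample mimicking a Chart2CSV response for a grouped bar chart
sample = """Quarter,Revenue,Costs
Q1,120,80
Q2,135,90
"""
rows = parse_chart2csv(sample)
```

In practice a validation step like this also catches the common failure modes the prompt guards against, such as a missing header row or non-numeric values in a data column.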
For table extraction, Granite-4.0-3B-Vision can process complex table layouts and output them in multiple structured formats. The model supports JSON output with detailed cell-level metadata including row and column indices, span information, and content type classification. It also generates HTML tables and OTSL, a specialized table markup format that uses tags such as “<fcel>” for filled cells, “<ecel>” for empty cells, and “<lcel>” for merged cells.
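To make the OTSL tags concrete, here is a minimal sketch that converts an OTSL string into an HTML table, handling only the three tags named above. It assumes rows are separated by an “<nl>” tag, as in the published OTSL format, and renders “<lcel>” as a colspan on the preceding cell; a full converter would also handle vertical and two-dimensional spans.

```python
import re

def otsl_to_html(otsl: str) -> str:
    """Convert a minimal OTSL string to an HTML table.
    Handles <fcel> (filled), <ecel> (empty), and <lcel> (merged with
    the left neighbour, rendered as a colspan). Rows are assumed to be
    separated by <nl> tags."""
    html_rows = []
    for row in otsl.split("<nl>"):
        tokens = re.findall(r"<(fcel|ecel|lcel)>([^<]*)", row)
        if not tokens:
            continue
        cells = []  # each entry: [text, colspan]
        for tag, text in tokens:
            if tag == "lcel" and cells:
                cells[-1][1] += 1           # extend previous cell's span
            elif tag == "ecel":
                cells.append(["", 1])
            else:
                cells.append([text.strip(), 1])
        tds = "".join(
            f'<td colspan="{span}">{text}</td>' if span > 1 else f"<td>{text}</td>"
            for text, span in cells
        )
        html_rows.append(f"<tr>{tds}</tr>")
    return "<table>" + "".join(html_rows) + "</table>"

# Invented sample: a 2x2 table whose second row is one merged cell
sample = "<fcel>Region<fcel>Sales<nl><fcel>Total<lcel><nl>"
html = otsl_to_html(sample)
```

The appeal of OTSL over raw HTML is that its small, regular tag vocabulary is easier for a compact model to emit reliably, while remaining mechanically convertible to HTML as above.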
The model includes built-in chat templates with task-specific prompts. For CSV extraction, the template instructs the model that the output should “include a header row with clear column names” and “represent all data series/categories shown in the chart” while using “numeric values that match the chart as closely as possible.” The JSON extraction prompt specifies a detailed schema structure with dimensions, cell properties, and content classification.
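Cell-level JSON with explicit indices and spans lends itself to straightforward reconstruction of the table grid. The sketch below materializes a 2D grid from such records; the field names used here (rows, cols, row, col, row_span, col_span, text) are illustrative, since the model's actual schema is defined by its chat template.

```python
def cells_to_grid(table: dict) -> list[list[str]]:
    """Materialize a 2D grid from cell records carrying row/column
    indices and span information. Field names are hypothetical."""
    grid = [["" for _ in range(table["cols"])] for _ in range(table["rows"])]
    for cell in table["cells"]:
        r0, c0 = cell["row"], cell["col"]
        # Spanned cells repeat their text into every covered position
        for r in range(r0, r0 + cell.get("row_span", 1)):
            for c in range(c0, c0 + cell.get("col_span", 1)):
                grid[r][c] = cell["text"]
    return grid

# Invented sample: a 2x2 table with a header cell spanning both columns
sample = {
    "rows": 2, "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "col_span": 2, "text": "Header"},
        {"row": 1, "col": 0, "text": "A"},
        {"row": 1, "col": 1, "text": "B"},
    ],
}
grid = cells_to_grid(sample)
```

The same index-and-span representation is what makes the JSON output convenient for loading into spreadsheet or dataframe tools, where merged cells must be resolved to concrete coordinates.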
IBM positions this as an “enterprise-grade” solution for document processing workflows, though the company has not disclosed training data details, benchmark performance metrics, or availability timeline beyond the current Hugging Face release.
