IBM Research released Granite-4.0-3B-Vision on March 27, 2026, a vision-language model purpose-built for structured data extraction from enterprise documents. The model totals approximately 4 billion parameters — a 3.5B Granite 4.0 Micro language model backbone plus 0.5B in LoRA adapters — and supports seven distinct extraction tasks spanning charts, tables, and key-value pairs. On the VAREX benchmark for key-value pair extraction, IBM reports the model achieved 85.5% exact-match accuracy in a zero-shot setting, ranking third among models in the 2–4B parameter class as of the release date.
- Granite-4.0-3B-Vision was released by IBM Research on March 27, 2026, and is available on Hugging Face under the Apache 2.0 license
- The model supports seven task types: Chart2CSV, Chart2Code, Chart2Summary, table extraction in JSON/HTML/OTSL formats, and key-value pair extraction
- On the VAREX KVP benchmark, the model scored 85.5% exact-match accuracy zero-shot, placing third among 2–4B parameter models
- Training used the ChartNet dataset — described as million-scale — and ran on IBM’s Blue Vela cluster using 32 NVIDIA H100 GPUs for approximately 200 hours
What Happened
IBM Research published Granite-4.0-3B-Vision to Hugging Face on March 27, 2026, under an Apache 2.0 license. The model is part of the Granite 4.0 family and is designed for document-centric extraction tasks where general-purpose vision-language models produce unstructured text rather than directly usable structured output. The release includes model weights in BF16 precision, a Jinja2 chat template, and inference code. Author-level attribution for the research team was not provided in the public model card.
The release references two companion papers: ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding (arXiv:2603.27064) and Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence (arXiv:2502.09927). The ChartNet paper describes the primary training dataset used for chart tasks.
Why It Matters
Structured data extraction from document images — charts embedded in analyst reports, multi-header tables in regulatory filings, form fields in insurance or tax documents — requires more than text generation. Outputs need to be directly machine-readable, whether as CSV rows, HTML tables, or JSON schemas. Most general-purpose multimodal models are not fine-tuned for this, producing descriptive prose that requires a secondary parsing step before the data can enter a pipeline.
IBM has been investing in document intelligence infrastructure for several years. The company’s Docling open-source library, released in 2024, addressed document parsing through rule-based and neural methods on text-based PDFs. Granite-4.0-3B-Vision extends that direction to image inputs, accepting scanned documents and chart screenshots where text-based approaches cannot operate.
The 4B effective parameter count positions the model for single-GPU deployment; at BF16 precision the weights alone occupy roughly 8 GB, before activation and KV-cache overhead. By comparison, competitive multimodal models such as Qwen2-VL and InternVL2 are commonly deployed at 7B to 72B parameters, requiring substantially more GPU memory.
Technical Details
The model’s architecture combines a SigLIP2 vision encoder (google/siglip2-so400m-patch16-384) with the Granite 4.0 Micro 3B language model, connected via a Window Q-Former projector that applies 4× feature compression. Vision features are injected into the language model at eight points using a technique IBM calls Deepstack — four encoder depth levels mapped to different LLM layers (LayerDeepstack), and the deepest features divided into four spatial groups (SpatialDeepstack). LoRA adapters with rank 256 fine-tune the language model component for extraction tasks.
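The multi-point injection can be pictured with a short sketch. The module below is only a conceptual illustration of adding projected vision features into the hidden states of selected decoder layers; the dimensions, layer indices, and the choice to add features onto the leading token positions are assumptions for illustration, not IBM's implementation.

```python
# Conceptual sketch only: Deepstack-style injection of vision features into
# selected LLM layers. Dimensions, layer indices, and placement are assumptions.
import torch
import torch.nn as nn

class DeepstackInjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, inject_layers: list[int]):
        super().__init__()
        self.inject_layers = list(inject_layers)  # e.g. 8 injection points
        self.proj = nn.ModuleDict({
            str(layer): nn.Linear(vision_dim, llm_dim) for layer in inject_layers
        })

    def forward(self, hidden_states: torch.Tensor, layer_idx: int,
                vision_feats: dict[int, torch.Tensor]) -> torch.Tensor:
        # Pass through untouched except at the designated injection layers.
        if layer_idx not in self.inject_layers:
            return hidden_states
        feats = self.proj[str(layer_idx)](vision_feats[layer_idx])  # (B, n_img, llm_dim)
        n_img = feats.shape[1]
        out = hidden_states.clone()
        # Add image features onto the first n_img token positions (illustrative choice).
        out[:, :n_img, :] = out[:, :n_img, :] + feats
        return out

# Toy usage: two encoder depths injected at decoder layers 4 and 12.
injector = DeepstackInjector(vision_dim=1152, llm_dim=2048, inject_layers=[4, 12])
hidden = torch.randn(1, 64, 2048)
vision = {4: torch.randn(1, 16, 1152), 12: torch.randn(1, 16, 1152)}
hidden = injector(hidden, layer_idx=4, vision_feats=vision)
```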
Chart training used the ChartNet dataset, which IBM describes as million-scale. The methodology applies code-guided augmentation: charts are generated from code, producing aligned quadruples of rendering code, chart image, CSV data, and natural-language summary. The model’s Chart2CSV prompt, as defined in the model card, instructs it to “include a header row with clear column names,” “represent all data series/categories shown in the chart,” and use “numeric values that match the chart as closely as possible.”
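To make the quadruple idea concrete, here is a minimal sketch of the alignment it describes: a chart rendered from code, with the same values emitted as ground-truth CSV and paired with a summary. The example data and file names are invented; this is not the ChartNet generation pipeline.

```python
# Minimal illustration of a code-guided (code, image, CSV, summary) quadruple.
# Example data and file names are invented; this is not the ChartNet pipeline.
import csv

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [4.1, 4.6, 5.0, 5.4]

# 1. Rendering code: the chart image is produced directly from this snippet.
chart_code = """
import matplotlib.pyplot as plt
plt.bar({quarters!r}, {revenue!r})
plt.title("Revenue by quarter (USD bn)")
plt.savefig("chart.png")
""".format(quarters=quarters, revenue=revenue)
exec(chart_code)  # 2. Chart image: chart.png

# 3. Ground-truth CSV aligned with the values used by the rendering code.
with open("chart.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["quarter", "revenue_usd_bn"])
    writer.writerows(zip(quarters, revenue))

# 4. Natural-language summary of the same data.
summary = "Quarterly revenue rises steadily from 4.1 to 5.4 billion USD."
```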
Table extraction supports three output formats. The JSON schema returns a dimensions object (with rows, columns, header_rows, and total_rows) and a cells array where each entry carries row, col, colspan, rowspan, type, header_level, and content fields. The OTSL format uses a purpose-built tag vocabulary: <fcel> for filled cells, <ecel> for empty, <lcel> for horizontal merges, <ucel> for vertical merges, <xcel> for both, <ched> for column headers, and <nl> for row breaks. Table performance is measured using the TEDS (Tree-Edit Distance-based Similarity) metric on three datasets: TableVQA-Extract, OmniDocBench-tables, and PubTablesV2, evaluated in both cropped-table and full-page settings. IBM has not published the numeric TEDS scores in the public model card at time of release.
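As a concrete view of the JSON schema described above, the sketch below renders a cells array into an HTML table. The example payload and the "header"/"data" type values are assumptions for illustration; only the field names follow the model-card description.

```python
# Sketch: render the described JSON table schema as HTML.
# The payload and the type values ("header"/"data") are invented for illustration.
def cells_to_html(table: dict) -> str:
    rows: dict[int, list] = {}
    for cell in table["cells"]:
        rows.setdefault(cell["row"], []).append(cell)
    html = ["<table>"]
    for r in sorted(rows):
        html.append("<tr>")
        for cell in sorted(rows[r], key=lambda c: c["col"]):
            tag = "th" if cell["type"] == "header" else "td"
            span = f' colspan="{cell["colspan"]}" rowspan="{cell["rowspan"]}"'
            html.append(f"<{tag}{span}>{cell['content']}</{tag}>")
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)

example = {
    "dimensions": {"rows": 2, "columns": 2, "header_rows": 1, "total_rows": 2},
    "cells": [
        {"row": 0, "col": 0, "colspan": 1, "rowspan": 1, "type": "header", "header_level": 1, "content": "Metric"},
        {"row": 0, "col": 1, "colspan": 1, "rowspan": 1, "type": "header", "header_level": 1, "content": "Value"},
        {"row": 1, "col": 0, "colspan": 1, "rowspan": 1, "type": "data", "header_level": 0, "content": "Revenue"},
        {"row": 1, "col": 1, "colspan": 1, "rowspan": 1, "type": "data", "header_level": 0, "content": "5.4"},
    ],
}
print(cells_to_html(example))
```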
Training ran on IBM’s Blue Vela supercomputing cluster using 32 NVIDIA H100 GPUs for approximately 200 hours. The release requires Python 3.11, PyTorch 2.10.0, Transformers 4.57.6, and PEFT 0.18.1.
Who’s Affected
Enterprise developers building automated document processing pipelines are the primary intended users, particularly in financial services, healthcare, legal, and scientific publishing — sectors where charts and complex tables appear in high volumes and manual extraction is costly. The model accepts English-language instructions and images in PNG or JPEG format.
Organizations already using IBM’s Granite text models can integrate Granite-4.0-3B-Vision without switching toolchains, as it shares the same model family and Hugging Face distribution pattern. The Apache 2.0 license permits commercial use without royalty obligations. Developers can run inference via the Hugging Face Transformers library using the PEFT library for LoRA adapter loading, or access weights directly for self-hosted deployment.
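A hedged sketch of that loading path follows. The repository ID, processor classes, adapter path, and chat-message layout are assumptions modeled on earlier Granite Vision releases and the general Transformers/PEFT pattern; the actual model card should be checked for the confirmed identifiers.

```python
# Sketch only: loading and prompting via Transformers + PEFT. The repo ID,
# adapter path, and message format are assumptions, not confirmed identifiers.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel  # used only if the LoRA adapters ship separately

base_id = "ibm-granite/granite-4.0-3b-vision"  # hypothetical repository ID

processor = AutoProcessor.from_pretrained(base_id)
model = AutoModelForVision2Seq.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
# If the extraction LoRA is distributed as a separate PEFT checkpoint:
# model = PeftModel.from_pretrained(model, "path/to/extraction-lora")

image = Image.open("chart.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Convert this chart to CSV. Include a header row with clear column names."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```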
What’s Next
IBM has not announced an integration timeline for Granite-4.0-3B-Vision into its Docling library or the watsonx platform. Numeric TEDS benchmark scores for table extraction tasks were not included in the public model card at time of publication, limiting independent comparison against competing table extraction models. The ChartNet benchmark results referenced in the model card use an LLM-as-judge evaluation methodology; the judge model and scoring rubric are detailed in the companion arXiv paper (arXiv:2603.27064) rather than the Hugging Face release itself.