Michael Hafftka, a figurative painter whose work is held in permanent collections at the Metropolitan Museum of Art, the Museum of Modern Art (MoMA), the San Francisco Museum of Modern Art (SFMOMA), and the British Museum, has released his catalog raisonné as a publicly accessible AI training dataset. Published in late March 2026 on Hugging Face and announced in a post on Reddit’s r/artificial forum, the dataset contains between 3,000 and 4,000 documented works spanning roughly fifty years of practice. It is available under a CC-BY-NC-4.0 license, which permits non-commercial use with attribution.
- Hafftka’s catalog raisonné — 3,000 to 4,000 works — is now freely available on Hugging Face under a CC-BY-NC-4.0 license permitting non-commercial AI research use.
- The dataset received more than 2,500 downloads within its first week; Hafftka credited the research community with finding it quickly.
- The works span oil on canvas, drawings, etchings, lithographs, and digital pieces produced since the 1970s. By Hafftka’s estimate, the release covers roughly half of his total output.
- Hafftka described the release as a deliberate act of agency: “I would rather engage with that on my own terms than wait for it to happen to me.”
What Happened
Hafftka announced the release directly to the r/artificial community on Reddit, identifying himself as a painter with institutional representation at MoMA and the Metropolitan Museum and explaining his reasoning in plain terms. The dataset is hosted at Hafftka/michael-hafftka-catalog-raisonne on Hugging Face. The release appears to be among the first instances of a living artist with major museum representation voluntarily constructing a formal AI training dataset from their own body of work and publishing it with defined licensing terms.
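For readers who want to inspect the release themselves, the following is a minimal sketch of how the repository might be pulled from the Hugging Face Hub. Only the repository id and the license name come from the announcement. The helper names (`dataset_url`, `attribution_line`, `load_catalog`), the wording of the credit line, and the assumption that the repository loads with its default configuration are illustrative, and the dataset's splits and column names are not documented here.

```python
"""Sketch: fetch the catalog from the Hugging Face Hub for non-commercial study."""

# Repository id and license as stated in the announcement.
REPO_ID = "Hafftka/michael-hafftka-catalog-raisonne"
LICENSE = "CC-BY-NC-4.0"


def dataset_url(repo_id: str = REPO_ID) -> str:
    """Return the Hub page URL for a dataset repository."""
    return f"https://huggingface.co/datasets/{repo_id}"


def attribution_line(repo_id: str = REPO_ID) -> str:
    """Build a credit line. The wording is hypothetical, not prescribed by the artist."""
    return f"Michael Hafftka, catalog raisonné dataset ({dataset_url(repo_id)}), {LICENSE}"


def load_catalog(repo_id: str = REPO_ID):
    """Download the dataset. Requires `pip install datasets` and network access.

    Assumes the repository loads with its default configuration. Split and
    column names are unknown here, so inspect the returned object first.
    """
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id)
```

Calling `load_catalog()` and printing the result should list whatever splits and columns the repository defines. Keeping `attribution_line()` next to the loader is one way to make the license's attribution requirement hard to overlook in downstream notebooks.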
Hafftka’s stated motivation was direct: “I did this because I want my work to have a future and the future involves AI.” He identified himself not as a developer or researcher but as “an artist who has spent fifty years painting the human figure,” positioning the release as an act of artistic continuity rather than a technical project.
Why It Matters
The release arrives at a moment when questions about AI training data and artistic consent are being actively litigated in multiple jurisdictions. Several AI image-generation companies face ongoing legal challenges from artists and rights organizations who argue their work was incorporated into training corpora without permission or compensation. Hafftka’s approach offers a structurally different model: a documented, licensed, artist-initiated release that specifies the conditions under which the work may be used.
The CC-BY-NC-4.0 license permits non-commercial use with attribution, which covers research, but it does not authorize deployment in commercial AI products without further negotiation. That restriction excludes the kind of unlicensed commercial use at the center of current AI copyright litigation, and it positions the dataset as research infrastructure with clear authorial boundaries rather than an unrestricted asset.
Technical Details
The dataset covers between 3,000 and 4,000 documented works. Hafftka noted that his total artistic output is approximately double that figure, which would put it at roughly 6,000 to 8,000 works and means the current release accounts for about half of what he has produced. The works span multiple media (oil on canvas, works on paper, drawings, etchings, lithographs, and digital works), all produced across a practice that began in the 1970s.
The dataset logged more than 2,500 downloads in its first week of publication, a pace Hafftka acknowledged as faster than expected. “What surprised me is how quickly the research community found it and engaged with it,” he wrote in his Reddit post. He has stated plans to expand the dataset incrementally as additional works are catalogued, though no timeline has been announced.
Who’s Affected
AI researchers working on style representation, visual perception, and figurative image modeling now have access to a large, single-artist corpus with documented provenance and a coherent fifty-year arc of practice. That structural consistency — one artist, one body of work, multiple decades and media — distinguishes it from the heterogeneous web-scraped corpora typically used in image-generation research. Developers building tools for art analysis, provenance verification, or art-historical applications may also find the structured catalog format useful.
For working artists evaluating how to engage with AI systems, the release provides a concrete example of asserting authorial control over training data without stepping away from the conversation. The CC-BY-NC-4.0 license also serves as a public record of intent, separating Hafftka’s dataset from the legally ambiguous corpora that have drawn litigation elsewhere.
What’s Next
Hafftka has indicated the dataset will continue to grow as more of his catalog is processed and uploaded, though he has not announced a completion target or any institutional research partnerships. The questions he poses about perception — “What does the machine see that the human does not? What does the human see that the machine cannot?” — were not framed as problems he intends to resolve, only as the same questions his paintings have asked across five decades.
“I do not have answers,” he wrote in his Reddit announcement. “I have fifty years of looking.”