ANALYSIS

Developer Ports Microsoft TRELLIS.2 to Apple Silicon via PyTorch MPS

Marcus Rivera · Apr 20, 2026 · 3 min read
Engine Score 7/10 — Important
  • Developer Shivam Kumar published a Mac-native port of Microsoft’s TRELLIS.2, replacing all CUDA-only libraries with PyTorch MPS-compatible alternatives.
  • The port generates meshes of more than 400,000 vertices from single images in approximately 3.5 minutes on an M4 Pro with 24GB unified memory.
  • Pure-PyTorch sparse convolution runs roughly 10x slower than the original CUDA kernel; texture export and mesh hole filling remain unsupported.
  • The RMBG-2.0 background removal component carries a CC BY-NC 4.0 license that restricts commercial deployment without a separate agreement from BRIA AI.

What Happened

Developer Shivam Kumar published trellis-mac, a port of Microsoft’s TRELLIS.2 image-to-3D model that replaces every CUDA-only dependency with Apple Silicon-compatible code running through PyTorch’s Metal Performance Shaders (MPS) backend. The project, shared on GitHub and posted to Hacker News on April 20, 2026, allows Mac users with M1-generation chips or later to generate detailed 3D meshes from single photographs without an NVIDIA GPU. Before this port, running TRELLIS.2 required CUDA-capable hardware.

Why It Matters

TRELLIS.2 produces textured OBJ and GLB files with physically-based rendering (PBR) materials from a single input photograph — a workflow previously inaccessible on macOS without cloud GPU infrastructure or a dedicated NVIDIA workstation. The effort follows a broader pattern of the open-source community adapting CUDA-native models to PyTorch MPS, which gained Metal acceleration support beginning with PyTorch 1.12 in mid-2022. Apple Silicon’s unified memory architecture, which pools CPU and GPU allocations into a single region, makes it a practical inference target for large models that would otherwise require high-VRAM discrete GPUs.
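Ports like this typically gate on the MPS backend at startup and fall back gracefully when it is absent. A minimal sketch of that device-selection pattern (not the trellis-mac code itself):

```python
import torch

# Prefer MPS (PyTorch's Metal backend on Apple Silicon), then CUDA, then CPU.
# torch.backends.mps.is_available() returns False on Intel Macs and non-macOS.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Tensors and modules are moved with .to(device), same as on CUDA.
x = torch.randn(4, 3, device=device)
print(x.device.type)
```

Because unified memory backs both CPU and GPU allocations on Apple Silicon, the usual CUDA concern of fitting weights into discrete VRAM does not apply, though total system memory still bounds model size.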

Technical Details

Kumar replaced five CUDA-specific components in the original codebase. The sparse 3D convolution kernel (flex_gemm) was rewritten in backends/conv_none.py as a gather-scatter operation over a spatial hash of active voxels, with neighbor maps cached per-tensor to avoid redundant lookups. CUDA hashmap operations for mesh extraction were replaced with Python dictionary lookups in backends/mesh_extract.py, which triangulates quads using normal-alignment heuristics. Flash attention was substituted with PyTorch's native torch.nn.functional.scaled_dot_product_attention, with variable-length sequences padded into batches before processing.
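The attention swap is the most mechanical of the three: scaled_dot_product_attention accepts the same (batch, heads, sequence, head_dim) layout that flash-attention kernels expect and dispatches to whatever backend is available on the current device. A minimal illustration with toy shapes (not taken from the trellis-mac source):

```python
import torch
import torch.nn.functional as F

# Toy attention inputs: (batch, heads, seq_len, head_dim).
# On MPS this dispatches to a Metal-backed kernel; on CPU, to the math fallback.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Drop-in replacement for a flash-attention call; scaling and softmax are internal.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # same shape as q
```

Padding variable-length sequences into a common length, as the port does, trades some wasted compute for the ability to batch through this single fused call.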

Two dependencies — nvdiffrast for differentiable rasterization and cumesh for hole filling — were stubbed out rather than ported, disabling texture baking and mesh repair entirely. Benchmarks on an M4 Pro with 24GB unified memory using the 512 pipeline type show a total generation time of approximately 3.5 minutes, with shape SLat sampling alone accounting for roughly 90 seconds and memory peaking at approximately 18GB. Kumar states in the repository documentation: “The pure-PyTorch sparse convolution is ~10x slower than the CUDA flex_gemm kernel. This is the main bottleneck.”
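The ~10x slowdown is easy to see in outline: a gather-scatter sparse convolution loops over kernel offsets in Python rather than fusing everything into one GPU kernel. A hypothetical sketch of the technique, with illustrative names and shapes that are not the trellis-mac API:

```python
import torch

def sparse_conv3d(coords, feats, weight, offsets):
    """Gather-scatter sparse 3D convolution over active voxels.

    coords:  (N, 3) integer voxel coordinates of active sites.
    feats:   (N, C_in) features at those voxels.
    weight:  (K, C_in, C_out), one weight slice per kernel offset.
    offsets: (K, 3) integer kernel offsets.
    """
    # Spatial hash: coordinate tuple -> row index, for O(1) neighbor lookup.
    table = {tuple(c.tolist()): i for i, c in enumerate(coords)}
    out = feats.new_zeros(feats.shape[0], weight.shape[2])
    for k, off in enumerate(offsets):
        # Gather: for each output voxel i, find the input voxel at coord + offset.
        src, dst = [], []
        for i, c in enumerate(coords):
            j = table.get(tuple((c + off).tolist()))
            if j is not None:
                src.append(j)
                dst.append(i)
        if not src:
            continue
        # Scatter: accumulate this offset's contribution via one dense matmul.
        out[dst] += feats[src] @ weight[k]
    return out

# 3x3x3 kernel -> 27 offsets in {-1, 0, 1}^3.
offsets = torch.stack(
    torch.meshgrid(*[torch.arange(-1, 2)] * 3, indexing="ij"), dim=-1
).reshape(-1, 3)
```

Caching the (src, dst) neighbor maps per tensor, as the port reportedly does, amortizes the Python-side lookup loop across layers that share the same voxel set; the remaining cost is the per-offset launch overhead a fused CUDA kernel avoids.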

Who’s Affected

3D artists, game developers, and researchers on Apple Silicon hardware gain local access to TRELLIS.2 without cloud GPU costs, though the hardware floor is steep: Kumar recommends 24GB of unified memory — limiting practical use to M-series Pro, Max, or Ultra configurations — and approximately 15GB of free disk space for model weights downloaded on first run. Commercial users face an additional constraint: the RMBG-2.0 background removal component is licensed under CC BY-NC 4.0, requiring a separate commercial license from BRIA AI for production deployments. Meta’s DINOv3 weights are gated on Hugging Face under a custom license and require individual access approval before download.

What’s Next

Kumar’s README notes that example output images are forthcoming. The two largest unresolved gaps — texture export and mesh hole filling — depend on MPS-compatible replacements for nvdiffrast and cumesh, neither of which currently has a community-maintained port for Apple Silicon. The trellis-mac porting code is released under the MIT License; upstream TRELLIS.2 model weights also carry MIT terms, leaving licensing straightforward for most non-commercial deployments.
