PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

¹Peking University   ²ByteDance   ³Carnegie Mellon University
* denotes equal contribution, † denotes project lead

PartCrafter is a structured 3D generative model that
jointly generates multiple parts and objects from a single RGB image in one shot.

🎥   Overview Video   🎥

🧩   Abstract   🧩

We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes.

PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.

🔮   Method   🔮

Architecture

PartCrafter's ability to generate structured 3D assets, e.g., part-decomposable objects or 3D scenes composed of multiple objects, rests on two components: (1) a Compositional Latent Space, in which each 3D part is represented by its own set of disentangled latent tokens; to distinguish parts, we add a learnable part identity embedding to each part's tokens. (2) a Local-Global Denoising Transformer, in which information is fused across the sets of latent tokens to enable both part-level (local) and scene-level (global) reasoning. Injecting the image condition into both the local and global features ensures that the generated parts remain independent yet semantically coherent. A minimal sketch of this attention pattern is shown below.
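The exact layer layout is not spelled out on this page, so the following PyTorch sketch only illustrates the general idea under assumed names and shapes: `LocalGlobalBlock`, the batch/part/token dimensions, and the single cross-attention placement are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    """One local-global denoising block (illustrative sketch, not the released code).

    Assumed shapes (ours): B objects per batch, P parts, T latent tokens per part,
    hidden width D, plus N image-condition tokens of the same width D.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # x: (B, P, T, D) part-wise latent tokens; image_tokens: (B, N, D)
        B, P, T, D = x.shape

        # Local attention: tokens of each part attend only within that part.
        h = x.reshape(B * P, T, D)
        hn = self.norms[0](h)
        h = h + self.local_attn(hn, hn, hn, need_weights=False)[0]

        # Global attention: all tokens of all parts attend to one another.
        g = h.reshape(B, P * T, D)
        gn = self.norms[1](g)
        g = g + self.global_attn(gn, gn, gn, need_weights=False)[0]

        # Image condition injected via cross-attention (applied once here for brevity;
        # the paper conditions both the local and global features).
        g = g + self.cross_attn(self.norms[2](g), image_tokens, image_tokens,
                                need_weights=False)[0]
        g = g + self.mlp(self.norms[3](g))
        return g.reshape(B, P, T, D)


# Learnable part identity embeddings distinguish the otherwise symmetric token sets,
# e.g. (hypothetical usage):
# part_embed = nn.Embedding(max_parts, dim)
# x = x + part_embed(torch.arange(P))[None, :, None, :]  # broadcast over batch and tokens
```

Because every part owns an identical-length token set, the part identity embedding is what lets the denoiser assign each set a distinct role; the local pass preserves part-level detail while the global pass keeps the composition coherent.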

Dataset Curation

Large-scale 3D object datasets often contain rich part annotations. The pie chart and bar chart visualize the distribution of per-object part counts. We curate a dataset by combining multiple sources, yielding 130,000 3D objects, of which 100,000 contain multiple parts. We further refine this dataset by filtering on texture quality, part count, and average part-level Intersection over Union (IoU) to ensure high-quality supervision; a sketch of the IoU filter is given below. The resulting dataset comprises approximately 50,000 part-labeled objects and 300,000 individual parts.
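The filter thresholds are not stated on this page. The sketch below shows one plausible form of the geometric filter, assuming (our assumption) that "average part-level IoU" means the mean pairwise IoU between voxelized part occupancies; `min_parts`, `max_parts`, and `max_mean_iou` are placeholder values, and the texture-quality check is omitted.

```python
from itertools import combinations

import numpy as np


def pair_iou(occ_a: np.ndarray, occ_b: np.ndarray) -> float:
    """IoU between two boolean voxel-occupancy grids of the same resolution."""
    union = np.logical_or(occ_a, occ_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(occ_a, occ_b).sum()) / float(union)


def keep_object(part_occupancies: list[np.ndarray],
                min_parts: int = 2,          # placeholder thresholds, not from the paper
                max_parts: int = 16,
                max_mean_iou: float = 0.1) -> bool:
    """Keep an object whose part count is in range and whose parts barely overlap."""
    if not (min_parts <= len(part_occupancies) <= max_parts):
        return False
    ious = [pair_iou(a, b) for a, b in combinations(part_occupancies, 2)]
    return float(np.mean(ious)) <= max_mean_iou
```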

📝   Image to 3D Part-Level Object Generation   📝

🖼️   Image to 3D Scene Generation   🖼️

🛠️   Ablation Study   🛠️

(Gallery: each input image is generated with 2 to 8 parts, illustrating control over the number of generated parts.)

🎨   More Results   🎨

A. Comparison with HoloPart

(Qualitative comparison gallery: Input / HoloPart / Ours)

B. Comparison with MIDI

(Qualitative comparison gallery: Input / MIDI / Ours)

📚   BibTeX   📚

If you find our work helpful, please consider citing:

@misc{lin2025partcrafterstructured3dmesh,
  title={PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers}, 
  author={Yuchen Lin and Chenguo Lin and Panwang Pan and Honglei Yan and Yiqiang Feng and Yadong Mu and Katerina Fragkiadaki},
  year={2025},
  eprint={2506.05573},
  url={https://arxiv.org/abs/2506.05573}, 
}