GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

¹University of Tübingen · ²Tübingen AI Center · ³Westlake University · ⁴Max Planck Institute for Informatics
2026
In-the-wild results. GRAFT handles diverse single- and multi-person scenes with physically plausible human–scene interactions.

Abstract

Reconstructing physically plausible 3D human–scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20 s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human–scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ~50× lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred 64.8% of the time in a three-way user study.
GRAFT teaser figure
Fast, geometry-grounded human–scene reconstruction. From a single image, GRAFT reconstructs humans and scenes with physically coherent interactions (left), breaking the speed–accuracy tradeoff of prior methods (right).

Method

GRAFT initializes humans and scene geometry from a single image, then iteratively refines body parameters via a lightweight transformer. Interaction state is encoded as 24 compact HSI tokens grounded via Geometric Probes, which the transformer updates recurrently—predicting an Interaction Gradient and re-probing the scene each step.

GRAFT method overview
Overview of GRAFT. Foundation models provide coarse initialization (left). Geometric probes encode local contact cues into compact HSI tokens, from which GRAFT predicts iterative parameter updates (center). The transformer alternates geometry self-attention and visual cross-attention (right).
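The refine-and-re-probe loop can be illustrated with a deliberately simplified sketch. Everything below is a toy stand-in, not the paper's implementation: the real model predicts body-parameter deltas with a transformer over HSI tokens, whereas this version moves joints directly along their probe offsets.

```python
import numpy as np

def probe_offsets(joints, scene_points):
    """Brute-force probe: offset from each joint to its nearest scene point."""
    diffs = scene_points[None, :, :] - joints[:, None, :]    # (J, P, 3)
    nearest = np.linalg.norm(diffs, axis=-1).argmin(axis=1)  # (J,)
    return diffs[np.arange(len(joints)), nearest]            # (J, 3)

def refine(joints, scene_points, n_iters=8, step=0.5):
    """Toy refine-and-re-probe loop (illustrative only). GRAFT feeds probe
    features into HSI tokens and predicts parameter updates; this stand-in
    just moves each joint a fraction of the way toward the surface,
    re-probing the scene after every update."""
    joints = joints.copy()
    for _ in range(n_iters):
        joints = joints + step * probe_offsets(joints, scene_points)
    return joints
```

Because the probes are recomputed inside the loop, each update sees the mesh's current relationship to the scene rather than a stale snapshot, which is the essential property of the recurrent design.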

Geometric Probes & Iterative Refinement

Geometric Probes in action. Each probe queries the nearest scene point at a body joint or vertex, capturing offset and surface normal. Probes are recomputed each iteration, giving the model direct physical feedback as the mesh updates.
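A minimal version of a single probe query, assuming the scene is available as a point cloud with per-point normals (function and variable names here are ours, not the paper's, and the real probes may encode richer local features):

```python
import numpy as np
from scipy.spatial import cKDTree

def geometric_probes(query_points, scene_points, scene_normals):
    """Toy probe query: for each body-anchored query point, return the
    offset vector to the nearest scene point, the surface normal there,
    and the distance. (Illustrative sketch, not the paper's code.)"""
    tree = cKDTree(scene_points)
    dists, idx = tree.query(query_points)       # nearest neighbour per query
    offsets = scene_points[idx] - query_points  # vector from body to scene
    return offsets, scene_normals[idx], dists
```

Re-running this query after every parameter update is what gives the refiner direct physical feedback as the mesh moves.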

Learned HSI Prior

On geometry alone—no image features—GRAFT acts as a plug-and-play HSI prior, projecting any perturbed mesh back to a valid interaction. This lets it improve other methods without retraining, boosting Human3R’s contact F1 by up to 44%.
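The plug-and-play usage pattern can be sketched with a deliberately simplified prior that only knows about a ground plane; the actual GRAFT prior predicts full pose and translation corrections from geometric probes, not a single vertical shift.

```python
import numpy as np

def hsi_prior_project(vertices, floor_z=0.0):
    """Toy stand-in for GRAFT-as-prior: translate the body so its lowest
    vertex rests exactly on a ground plane, removing floating and
    penetration along the vertical axis. (Illustrative only; GRAFT
    predicts richer corrections from scene geometry.)"""
    correction = np.array([0.0, 0.0, floor_z - vertices[:, 2].min()])
    return vertices + correction

# Plug-and-play: post-process the output of any feed-forward reconstructor.
rng = np.random.default_rng(0)
perturbed = rng.normal(size=(100, 3)) + np.array([0.0, 0.0, 0.5])  # floats above floor
fixed = hsi_prior_project(perturbed)  # lowest vertex now on the floor
```

The key point is that the prior consumes only geometry, so it can be appended to any reconstructor's output without retraining either model.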

GRAFT as a learned HSI prior. After applying translation and pose perturbations (red), GRAFT recovers a valid human–scene interaction (green) using geometry alone—no visual features.

Qualitative Comparisons

Comparison against feed-forward baselines UniSH and Human3R on in-the-wild images. Both prior methods lack explicit interaction modeling, leading to hovering and penetration artifacts. GRAFT produces scene-consistent, physically coherent contact.

Qualitative comparison with Human3R and UniSH
Qualitative comparison vs. UniSH and Human3R. Prior feed-forward methods often produce hovering or penetration artifacts. GRAFT recovers physically coherent, scene-consistent interactions.

Interactive 3D Results

Select an image to inspect the corresponding GRAFT reconstruction. Drag to rotate · scroll to zoom.

Acknowledgments: Carl Zeiss Stiftung · Tübingen AI Center · University of Tübingen · IMPRS-IS · MPI

BibTeX

@misc{ym2026graft,
  title         = {GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction},
  author        = {Pradyumna YM and Yuxuan Xue and Yue Chen and Nikita Kister and Istv{\'a}n S{\'a}r{\'a}ndi and Gerard Pons-Moll},
  year          = {2026},
  eprint        = {2604.19624},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2604.19624},
}