GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction
GRAFT initializes humans and scene geometry from a single image, then iteratively refines body parameters with a lightweight transformer. Interaction state is encoded as 24 compact human-scene interaction (HSI) tokens grounded via Geometric Probes. The transformer updates these tokens recurrently: at each step it predicts an Interaction Gradient and re-probes the scene.
Operating on geometry alone, with no image features, GRAFT acts as a plug-and-play HSI prior that projects any perturbed mesh back to a valid interaction. This lets it improve other methods without retraining, boosting Human3R’s contact F1 by up to 44%.
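The probe, predict, update, re-probe loop described above can be illustrated with a deliberately tiny sketch. This is not GRAFT's implementation: the scene is assumed to be a flat ground plane, the "probe" is just each body point's height above that plane, and a hand-written rule stands in for the learned transformer and its Interaction Gradient. All names (`probe_scene`, `predict_update`, `refine`) are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def probe_scene(points: List[Point], ground_z: float = 0.0) -> List[float]:
    """Toy stand-in for a geometric probe: signed height of each
    body point above an assumed flat ground plane."""
    return [p[2] - ground_z for p in points]


def predict_update(probes: List[float], step_size: float = 0.5) -> float:
    """Hand-written stand-in for the learned model. It returns a vertical
    correction that moves the lowest body point part of the way toward
    the plane (positive if penetrating, negative if hovering)."""
    return -step_size * min(probes)


def refine(points: List[Point], num_steps: int = 10) -> List[Point]:
    """Recurrent refinement: probe the scene, predict a correction,
    apply it, then probe again from the updated pose."""
    for _ in range(num_steps):
        probes = probe_scene(points)      # re-probe at the current pose
        dz = predict_update(probes)       # predicted correction
        points = [(x, y, z + dz) for x, y, z in points]
    return points


if __name__ == "__main__":
    # A "hovering" body: its lowest point floats 0.3 units above the ground.
    hovering = [(0.0, 0.0, 0.3), (0.1, 0.0, 0.5), (0.0, 0.2, 1.2)]
    refined = refine(hovering)
    print(min(z for _, _, z in refined))  # lowest point ends close to 0
```

Because the correction depends only on probed geometry, the same loop pulls a hovering or a penetrating input back toward contact, which is the toy analogue of using a geometry-only prior on another method's output. In the actual method the update is learned and acts on body parameters through the HSI tokens rather than on a single vertical offset.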
Comparison against feed-forward baselines UniSH and Human3R on in-the-wild images. Both prior methods lack explicit interaction modeling, leading to hovering and penetration artifacts. GRAFT produces scene-consistent, physically coherent contact.