{"id":1149051,"date":"2025-09-10T09:00:00","date_gmt":"2025-09-10T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1149051"},"modified":"2025-09-08T13:16:21","modified_gmt":"2025-09-08T20:16:21","slug":"renderformer-how-neural-networks-are-reshaping-3d-rendering","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/renderformer-how-neural-networks-are-reshaping-3d-rendering\/","title":{"rendered":"RenderFormer: How neural networks are reshaping 3D rendering"},"content":{"rendered":"\n

3D rendering\u2014the process of converting three-dimensional models into two-dimensional images\u2014is a foundational technology in computer graphics, widely used across gaming, film, virtual reality, and architectural visualization. Traditionally, this process has depended on physics-based techniques like ray tracing and rasterization, which simulate light behavior through mathematical formulas and expert-designed models.<\/p>\n\n\n\n

Now, thanks to advances in AI, especially neural networks, researchers are beginning to replace these conventional approaches with machine learning (ML). This shift is giving rise to a new field known as neural rendering.<\/p>\n\n\n\n

Neural rendering combines deep learning with traditional graphics techniques, allowing models to simulate complex light transport without explicitly modeling physical optics. This approach offers significant advantages: it eliminates the need for handcrafted rules, supports end-to-end training, and can be optimized for specific tasks. Yet, most current neural rendering methods rely on 2D image inputs, lack support for raw 3D geometry and material data, and often require retraining for each new scene\u2014limiting their generalizability.<\/p>\n\n\n\n

RenderFormer: Toward a general-purpose neural rendering model<\/h2>\n\n\n\n

To overcome these limitations, researchers at Microsoft Research have developed RenderFormer, a new neural architecture designed to support full-featured 3D rendering using only ML\u2014no traditional graphics computation required. RenderFormer is the first model to demonstrate that a neural network can learn a complete graphics rendering pipeline, including support for arbitrary 3D scenes and global illumination, without relying on ray tracing or rasterization. This work<\/a> has been accepted at SIGGRAPH 2025 and is open-sourced on GitHub (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n

Architecture overview<\/h2>\n\n\n\n

As shown in Figure 1, RenderFormer represents the entire 3D scene using triangle tokens\u2014each one encoding spatial position, surface normal, and physical material properties such as diffuse color, specular color, and roughness. Lighting is also modeled as triangle tokens, with emission values indicating intensity.<\/p>\n\n\n\n
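To make the token idea concrete, here is a minimal sketch of packing one triangle into a flat feature vector. The field layout, sizes, and function name are illustrative assumptions for this post, not RenderFormer's actual encoding:

```python
import numpy as np

# Hypothetical triangle-token packing (layout and sizes are assumptions,
# not the model's real format): one triangle's geometry, material, and
# emission are flattened into a single feature vector.
def triangle_token(vertices, normals, diffuse, specular, roughness,
                   emission=(0.0, 0.0, 0.0)):
    """Pack one triangle's attributes into a single token vector."""
    vertices = np.asarray(vertices, dtype=np.float32).reshape(9)   # 3 vertices x xyz
    normals = np.asarray(normals, dtype=np.float32).reshape(9)     # 3 normals x xyz
    material = np.asarray([*diffuse, *specular, roughness],
                          dtype=np.float32)                        # 7 material values
    emission = np.asarray(emission, dtype=np.float32)              # 3 emission values
    return np.concatenate([vertices, normals, material, emission]) # shape (28,)

tok = triangle_token(
    vertices=[[0, 0, 0], [1, 0, 0], [0, 1, 0]],
    normals=[[0, 0, 1]] * 3,
    diffuse=(0.8, 0.2, 0.2), specular=(0.04, 0.04, 0.04), roughness=0.5,
)
```

Note that a light source needs no special representation here: it is simply a triangle token whose emission values are nonzero.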

<figure><figcaption>Figure 1. Architecture of RenderFormer<\/figcaption><\/figure>\n\n\n\n

To describe the viewing direction, the model uses ray bundle tokens derived from a ray map\u2014each pixel in the output image corresponds to one of these rays. To improve computational efficiency, pixels are grouped into rectangular blocks, with all rays in a block processed together.<\/p>\n\n\n\n
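The ray-map and block-grouping steps can be sketched as follows; the pinhole camera model, block size, and function names are assumptions made for illustration, not the paper's exact construction:

```python
import numpy as np

# Illustrative sketch (camera model and block size are assumed): build a
# per-pixel ray map, then group pixels into rectangular blocks so each
# block's rays are processed together as one bundle.
def ray_map(height, width, focal=1.0):
    """Unit view directions for every pixel, shape (H, W, 3)."""
    ys, xs = np.meshgrid(
        np.linspace(-1, 1, height), np.linspace(-1, 1, width), indexing="ij")
    dirs = np.stack([xs, -ys, -np.full_like(xs, focal)], axis=-1)
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def to_blocks(rays, block=8):
    """Group an (H, W, 3) ray map into (num_blocks, block*block, 3) bundles."""
    h, w, c = rays.shape
    rays = rays.reshape(h // block, block, w // block, block, c)
    return rays.transpose(0, 2, 1, 3, 4).reshape(-1, block * block, c)

bundles = to_blocks(ray_map(32, 32), block=8)  # 16 bundles of 64 rays each
```

Grouping rays this way trades a small loss of per-pixel granularity inside the transformer for far fewer tokens, which is what makes attention over a full image tractable.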

The model outputs a set of tokens that are decoded into image pixels, completing the rendering process entirely within the neural network.<\/p>\n\n\n\n\t


Dual-branch design for view-independent and view-dependent effects<\/h2>\n\n\n\n

The RenderFormer architecture is built around two transformers: one for view-independent features and another for view-dependent ones.<\/p>\n\n\n\n
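The two-branch flow can be sketched with a toy stand-in for the transformer blocks; the attention function, token counts, and feature width below are placeholders, not RenderFormer's actual layers:

```python
import numpy as np

# Conceptual sketch of the dual-branch design (all sizes and the attention
# stand-in are assumptions): a view-independent branch updates triangle
# tokens among themselves, then a view-dependent branch lets ray bundle
# tokens read from those scene features.
rng = np.random.default_rng(0)
d = 16                                  # token feature width (assumed)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention, standing in for a transformer block."""
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

triangles = rng.normal(size=(12, d))    # 12 triangle tokens
rays = rng.normal(size=(4, d))          # 4 ray bundle tokens

# Branch 1: view-independent -- triangles attend only to each other,
# so the result can be reused across viewpoints.
scene_features = attend(triangles, triangles)

# Branch 2: view-dependent -- ray bundles attend to the scene features
# to produce the tokens that are decoded into pixels.
pixel_tokens = attend(rays, scene_features)
```

The payoff of this split is that the view-independent pass depends only on the scene, so its output can in principle be computed once and reused as the camera moves.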