Laboratory mice play a crucial role in biomedical research, yet accurate 3D reconstruction of mouse surface motion remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders progress beyond sparse keypoint tracking. To bridge this gap, we present MoReMouse, the first monocular dense 3D reconstruction network tailored for laboratory mice. To achieve this goal, we highlight three key designs. First, we construct the first high-fidelity dense-view synthetic dataset for mice by rendering a realistic Gaussian mouse avatar of our own design. Second, MoReMouse adopts a transformer-based feedforward architecture with a triplane representation, achieving high-quality 3D surface generation from a single image. Third, we create geodesic-based continuous correspondence embeddings on the mouse surface, which serve as strong semantic priors that improve reconstruction stability and surface consistency. Extensive quantitative and qualitative experiments demonstrate that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness.
Given a single-view input (top-left, captured with an iPhone 15 Pro), MoReMouse predicts high-fidelity surface geometry and appearance using a transformer-based triplane architecture. The model is trained on a synthetic dataset built by fitting a Gaussian mouse avatar to real multi-view videos. Our method supports fast inference and outputs RGB renderings, semantic embeddings, and normal maps from a single image.
Given a monocular input image, a DINO-V2 encoder extracts high-level image features, which are processed by a transformer-based decoder to generate a triplane representation. These triplanes, combined with learnable positional embeddings, are queried by a Multi-Head MLP to predict per-point color, density, and geodesic embedding. Rendering is performed via either NeRF (for volumetric supervision) or DMTet (for explicit geometry output), producing RGB images, color-coded embeddings, and surface meshes.
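To make the querying step concrete, the following is a minimal PyTorch sketch of sampling a triplane at 3D points and decoding them with a multi-head MLP. All module names, feature dimensions, and the three-channel embedding size are illustrative assumptions, not the paper's implementation (the learnable positional embeddings are omitted here for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneHeads(nn.Module):
    """Hypothetical sketch: query a triplane at 3D points and decode
    per-point color, density, and a geodesic embedding via a multi-head MLP."""

    def __init__(self, feat_dim=32, hidden=64, embed_dim=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.color_head = nn.Linear(hidden, 3)          # RGB
        self.density_head = nn.Linear(hidden, 1)        # volume density
        self.embed_head = nn.Linear(hidden, embed_dim)  # geodesic embedding

    def forward(self, planes, pts):
        # planes: (3, C, H, W) for the XY, XZ, YZ planes; pts: (N, 3) in [-1, 1]
        coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
        feats = 0.0
        for plane, uv in zip(planes, coords):
            # grid_sample expects input (B, C, H, W) and grid (B, H_out, W_out, 2)
            grid = uv.view(1, -1, 1, 2)
            sampled = F.grid_sample(plane[None], grid, align_corners=True)
            feats = feats + sampled.view(plane.shape[0], -1).t()  # (N, C)
        h = self.trunk(feats)
        return self.color_head(h), self.density_head(h), self.embed_head(h)
```

The predicted color and density feed the NeRF branch, while the same per-point queries can supervise the DMTet branch; this sketch only illustrates the shared triplane-to-attributes mapping.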
Our synthetic dataset is built using a high-fidelity Gaussian mouse avatar. We begin by fitting a skeletal mesh model to multi-view recordings of real mice using 2D keypoints and silhouette constraints, resulting in anatomically plausible pose parameters $\theta$. From this fitted mesh, we compute a baseline Gaussian position map in UV space, denoted $P_0$, where each valid texel corresponds to a Gaussian point.
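As a rough illustration of the fitting objective, here is a sketch combining 2D keypoint reprojection with silhouette overlap across views; `model.skin`, `cam.project`, and `cam.render_silhouette` are hypothetical placeholders for an LBS mesh model, a camera projection, and a differentiable silhouette renderer, and the loss weights are illustrative.

```python
import torch

def fitting_loss(pose, model, kpts_2d, sils, cams, w_kpt=1.0, w_sil=0.1):
    """Hypothetical sketch of the multi-view fitting objective:
    2D keypoint reprojection + silhouette overlap, summed over views."""
    verts, joints = model.skin(pose)          # assumed LBS skeletal mesh model
    loss = 0.0
    for cam, kpt, sil in zip(cams, kpts_2d, sils):
        proj = cam.project(joints)            # (J, 2) projected keypoints
        loss = loss + w_kpt * ((proj - kpt) ** 2).sum()
        rendered = cam.render_silhouette(verts, model.faces)  # soft mask
        loss = loss + w_sil * ((rendered - sil) ** 2).mean()
    return loss
```

Minimizing this loss over `pose` with a gradient-based optimizer yields the pose parameters $\theta$ used downstream.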
To enhance this representation, a convolutional neural network (CNN) predicts a per-texel position offset map $\Delta P$ in UV space. The final deformed Gaussian positions are computed as
$$\hat{P} = \mathcal{W}(P_0 + \Delta P, \theta),$$
where $P_0$ is the UV position map of the neutral pose and $\mathcal{W}$ is the LBS-based deformation function.
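A minimal sketch of this deformation step, assuming the offset CNN outputs a map with the same layout as $P_0$ and that LBS is expressed with per-point blended joint transforms (all tensor layouts are our assumptions):

```python
import torch

def deform_gaussians(P0, offset_net, feat, lbs_weights, joint_transforms):
    """Sketch of the equation above: add CNN-predicted offsets to the
    neutral-pose UV position map, then pose the points with linear blend
    skinning (LBS). Tensor layouts are assumptions for illustration."""
    # P0: (3, H, W) neutral-pose positions; delta: CNN output of same shape
    delta = offset_net(feat)                        # Delta P, (3, H, W)
    pts = (P0 + delta).flatten(1).t()               # (N, 3), N = H * W texels
    pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=1)  # homogeneous
    # lbs_weights: (N, J) skinning weights; joint_transforms: (J, 4, 4) from theta
    T = torch.einsum('nj,jab->nab', lbs_weights, joint_transforms)  # per-point
    posed = torch.einsum('nab,nb->na', T, pts_h)[:, :3]             # W(., theta)
    return posed.t().reshape(P0.shape)
```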
Finally, we render the avatar using Gaussian Splatting, producing diverse and photorealistic synthetic images for training monocular 3D reconstruction models.
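Schematically, the data-generation loop poses the avatar and splats it from many cameras; the `rasterize` callable below is a hypothetical stand-in for a Gaussian Splatting backend, not a specific library API.

```python
def render_dataset(avatar, poses, cameras, rasterize):
    """Schematic synthetic-data loop: pose the Gaussian avatar and splat it
    from many viewpoints. `rasterize` is a hypothetical splatting backend."""
    frames = []
    for pose in poses:
        means = avatar.deform(pose)           # posed Gaussian centers
        for cam in cameras:
            rgb = rasterize(means, avatar.rotations, avatar.scales,
                            avatar.opacities, avatar.colors, cam)
            frames.append((rgb, pose, cam))   # image + ground-truth labels
    return frames
```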