Model Card for SO101-GR00T-N1 Vials V2.1

This model is a task-specific fine-tuning of NVIDIA's GR00T-N1.6-3B policy for robotic manipulation. The policy was trained to detect, grasp, and place laboratory-style vials into a yellow vial rack using an SO-101 robotic arm.

The training pipeline follows NVIDIA's Sim-to-Real workflow and combines teleoperated demonstrations collected in both Isaac Sim and the real world.

Model Details

Model Description

Data was collected following NVIDIA's official Sim-to-Real Workshop, which combines demonstrations collected in simulation (Isaac Sim) with real-world teleoperation data.

To reproduce this training workflow, an adjusted fork of the original repository was used:

The final training dataset is available at:

You can visualize the V3.0 dataset using LeRobot's Dataset Visualizer.

Additional datasets generated during development:

Uses

This model was trained as a benchmark task for evaluating robotic foundation models on a constrained pick-and-place problem.

The task consists of locating a vial, grasping it, and placing it into an empty position within a yellow rack.

Downstream Use

Potential downstream uses include:

  • Further sim2real experiments
  • Alternative rack configurations
  • Robotic manipulation research
  • Fine-tuning for laboratory automation tasks
  • Evaluation of dataset quality and teleoperation strategies

Bias, Risks, and Limitations

The training dataset contains demonstrations from a single robotic platform, a single object category (vials), and a highly structured environment.

Observed limitations include:

  • Preference for a subset of rack positions during placement.
  • Hovering behavior before releasing the vial.
  • Reduced performance outside the training distribution.
  • Sensitivity to camera placement and environmental configuration.

Recommendations

For best performance:

  • Use an SO-101 robotic arm.
  • Maintain camera placements similar to the training setup.
  • Use controlled lighting conditions.
  • Keep the workspace dimensions similar to those used during training.
  • Fine-tune further before deploying to significantly different environments.

How to Get Started with the Model

Follow NVIDIA's Sim-to-Real workflow:

https://docs.nvidia.com/learning/physical-ai/sim-to-real-so-101/latest/index.html

For deployment and evaluation, use the modified workshop repository:

https://github.com/CursedRock17/Sim-to-Real-SO-101-Workshop/tree/gb10_current

Training Details

Training Data

Training data consists of teleoperated demonstrations collected in both simulation and the real world.

Task objective:

Pick up randomly positioned vials and place them into empty holes in a yellow rack.

Data collection characteristics:

  • Approximately 125 simulation episodes
  • 15 additional real-world teleoperation episodes
  • Domain randomization enabled during simulation
  • Poor-quality episodes removed before training using LeRobot Doctor
  • Dual-camera observations (external + gripper camera)
  • Consistent grasping strategy across demonstrations

Dataset:

https://huggingface.co/datasets/CursedRock17/so101_teleop_vials_sim_and_real_v21

Training Procedure

The model was fine-tuned from NVIDIA's GR00T-N1.6-3B policy using NVIDIA's standard fine-tuning pipeline.

Preprocessing

Data collection emphasized:

  • Smooth teleoperation trajectories
  • Consistent grasping motions
  • Removal of failed demonstrations
  • Low cross-episode action variance
  • Short task horizons

Problematic episodes were manually removed prior to training.
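
One way to put a number on "low cross-episode action variance" is to time-normalize each episode and measure how far it sits from the average trajectory. The sketch below is a hypothetical illustration on synthetic single-joint data; it is not the method used by LeRobot's "Action Insights" or "Doctor" tabs, and the function names `resample` and `episode_deviation` are made up for this example.

```python
import math

def resample(traj, n):
    """Linearly resample a 1-D action trajectory to n points (n >= 2)."""
    if len(traj) == 1:
        return [traj[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(traj) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(traj) - 1)
        frac = pos - lo
        out.append(traj[lo] * (1 - frac) + traj[hi] * frac)
    return out

def episode_deviation(episodes, n=50):
    """RMS distance of each resampled episode from the cross-episode mean."""
    resampled = [resample(ep, n) for ep in episodes]
    mean = [sum(col) / len(col) for col in zip(*resampled)]
    return [
        math.sqrt(sum((a - m) ** 2 for a, m in zip(ep, mean)) / n)
        for ep in resampled
    ]

# Synthetic trajectories of different lengths (assumption: not real data).
smooth_a = [math.sin(math.pi * t / 59) for t in range(60)]
smooth_b = [math.sin(math.pi * t / 79) for t in range(80)]
jerky = [
    math.sin(math.pi * t / 69) + (0.5 if t % 10 < 5 else -0.5)
    for t in range(70)
]

devs = episode_deviation([smooth_a, smooth_b, jerky])
worst = max(range(len(devs)), key=devs.__getitem__)
print(f"deviations: {[round(d, 3) for d in devs]}")
print(f"most atypical episode index: {worst}")
```

An episode with a much larger deviation than the rest is a candidate for manual review, not an automatic rejection.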

Desired Movement

Lift the head of the arm and open the gripper at the same time, then begin a straight trajectory towards the vial. Grasp the vial just underneath the cap, and get a full grasp on it; otherwise you end up with a weird drop orientation. After grasping the vial, I would pull back towards the base while keeping the gripper elevated, pan to the left (visible as a consistent pattern in the dataset visualizer), arch back slightly, and then drop the vial once it is directly over the hole. Stabilization at this point can be tricky since you don't have accurate depth perception. Try not to bump the arm on anything, including the rack. Use only one of your hands on the teleoperator arm and stabilize it with something else.

Training Hyperparameters

Hyperparameters were taken from the base fine-tuning file:

  • Base Model: GR00T-N1.6-3B
  • Training Steps: 30,000
  • Action Horizon: 16
  • Control Rate: 30 Hz
  • Training Regime: bf16 mixed precision
  • Final Loss: < 0.01
  • warmup_ratio: 0.05
  • weight_decay: 1e-5
  • learning_rate: 1e-4
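
The listed values imply a couple of derived quantities that are handy when reproducing the run. The sketch below is a minimal illustration under two assumptions: that warmup_ratio is applied to the total number of training steps, and that one predicted action chunk is played back at the stated control rate. The dictionary keys are illustrative names, not the fields of NVIDIA's fine-tuning script.

```python
# Values copied from the hyperparameter list; key names are illustrative only.
hparams = {
    "base_model": "GR00T-N1.6-3B",
    "max_steps": 30_000,
    "action_horizon": 16,
    "control_rate_hz": 30,
    "warmup_ratio": 0.05,
    "weight_decay": 1e-5,
    "learning_rate": 1e-4,
}

# Assumption: warmup length = warmup_ratio * total training steps.
warmup_steps = round(hparams["warmup_ratio"] * hparams["max_steps"])

# Assumption: each action chunk is executed at the control rate.
chunk_seconds = hparams["action_horizon"] / hparams["control_rate_hz"]

print(f"warmup steps: {warmup_steps}")
print(f"seconds of motion per action chunk: {chunk_seconds:.3f}")
```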

Speeds, Sizes, Times

Training hardware:

  • Dell Pro Max with NVIDIA GB10

Training durations:

  • 5,000 Steps: 3h 43m
  • 30,000 Steps: 21h 40m
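
The two durations can be converted into an average time per training step, which helps when budgeting other step counts on similar hardware. This is a small arithmetic sketch using only the numbers above; the helper name `to_seconds` is made up for this example.

```python
def to_seconds(hours: int, minutes: int) -> int:
    """Convert an 'Xh Ym' duration to seconds."""
    return hours * 3600 + minutes * 60

# Wall-clock durations reported above, keyed by number of training steps.
runs = {5_000: to_seconds(3, 43), 30_000: to_seconds(21, 40)}

# Average seconds per step for each run.
sec_per_step = {steps: secs / steps for steps, secs in runs.items()}

for steps, s in sec_per_step.items():
    print(f"{steps} steps: {s:.2f} s/step")
```

Both runs come out at a little over 2.6 seconds per step, so the wall-clock time scales close to linearly with step count.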

Evaluation

Testing Data, Factors & Metrics

External Tools

To check the loss, I used the Weights & Biases page. To check for viable episodes, I used both the "Action Insights" and "Doctor" tabs of the LeRobot Dataset Visualizer. To check attention, I started using the lerobot_attention_visualizer. Note that the scripts can be found in the helper_scripts section of my repo and are still a work in progress.

Testing Data

Evaluation was performed using held-out task executions in the same physical environment.

Factors

The following factors were evaluated:

  • Object localization success
  • Grasp success
  • Placement success
  • Out-of-distribution lighting robustness

Metrics

Primary metrics:

  • Vials Located
  • Vials Grasped
  • Vials Successfully Placed

Results

30K Step Checkpoint

Evaluation Episodes: 10

  • Vials Located: 10/10
  • Vials Grasped: 9/10
  • Vials Placed: 8/10

Placement Success Rate: 80% :)

5K Step Checkpoint

Evaluation Episodes: 10

  • Vials Located: 7/10
  • Vials Grasped: 2/10
  • Vials Placed: 1/10

Placement Success Rate: 10% :(

OOD Lighting Evaluation

Lighting Conditions:

  • 0%
  • 25%
  • 75%
  • 100%

Evaluation Episodes: 10

  • Vials Located: 10/10
  • Vials Grasped: 9/10
  • Vials Placed: 7/10

Placement Success Rate: 70% :)
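
With only 10 episodes per condition, the success rates carry wide uncertainty. The sketch below attaches a 95% Wilson score interval to each reported placement count. The choice of the Wilson interval is an assumption of this example and not part of the original evaluation, and the dictionary labels are illustrative.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# (vials placed, evaluation episodes) copied from the results above.
placed = {
    "30K steps": (8, 10),
    "5K steps": (1, 10),
    "OOD lighting": (7, 10),
}

for label, (k, n) in placed.items():
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} placed, 95% interval {lo:.0%} to {hi:.0%}")
```

For example, 8/10 corresponds to an interval of roughly 49% to 94%, so the gap between the 5K and 30K checkpoints is clear while the gap between 8/10 and 7/10 is not.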

Summary

Increasing training duration from 5,000 to 30,000 steps raised grasp success from 2/10 to 9/10 and placement success from 1/10 to 8/10.

The model performs well in the constrained training environment and still placed 7/10 vials under the out-of-distribution lighting conditions.

Model Examination

Qualitative observations:

  • The policy often hovers over the rack before releasing a vial.
  • The model occasionally recovers from failed grasp attempts.
  • Placement behavior tends to favor a subset of rack positions.

Future work may include attention visualization and policy interpretability analysis.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact Calculator:

https://mlco2.github.io/impact

  • Hardware Type: Dell Pro Max with GB10
  • Hours Used: 21.67
  • Carbon Emitted: Not estimated

Technical Specifications

Model Architecture and Objective

The model is based on NVIDIA's GR00T-N1.6-3B Vision-Language-Action architecture.

Objective:

  • Locate vial
  • Grasp vial
  • Transport vial
  • Place vial into rack

Compute Infrastructure

Hardware

  • Dell Pro Max with NVIDIA GB10
  • SO-101 Robot Arm

Software

  • Isaac Sim 5.1.0
  • LeRobot v0.4.3 (Actual commit: e670ac5daf9b76)
  • Python 3.10

Model Card Authors

Lucas Wendland

University of Maryland MATRIX Lab

Model Card Contact

Please open an issue on the Hugging Face Hub repository for questions, bug reports, or reproduction issues.
