Model Card for SO101-GR00T-N1 Vials V2.1
This model is a task-specific fine-tuning of NVIDIA's GR00T-N1.6-3B policy for robotic manipulation. The policy was trained to detect, grasp, and place laboratory-style vials into a yellow vial rack using an SO-101 robotic arm.
The training pipeline follows NVIDIA's Sim-to-Real workflow and combines teleoperated demonstrations collected in both Isaac Sim and the real world.
Model Details
Model Description
Data was collected following NVIDIA's official Sim-to-Real Workshop, which combines demonstrations collected in simulation (Isaac Sim) with real-world teleoperation data.
To reproduce this training workflow, an adjusted fork of the original repository was used:
The final training dataset is available at:
You can visualize the V3.0 dataset using LeRobot's Dataset Visualizer.
Additional datasets generated during development:
Simulation Only: https://huggingface.co/datasets/CursedRock17/so101_teleop_vials_sim_dr_train_trimmed
Real Only: https://huggingface.co/datasets/CursedRock17/so101_teleop_vials_rack_real
Combined Sim + Real: https://huggingface.co/datasets/CursedRock17/so101_teleop_vials_sim_and_real
Developed by: UMD MATRIX Lab
License: Apache-2.0
Finetuned from model: https://huggingface.co/nvidia/GR00T-N1.6-3B
Uses
This model was trained as a benchmark task for evaluating robotic foundation models on a constrained pick-and-place problem.
The task consists of locating a vial, grasping it, and placing it into an empty position within a yellow rack.
Downstream Use
Potential downstream uses include:
- Further sim2real experiments
- Alternative rack configurations
- Robotic manipulation research
- Fine-tuning for laboratory automation tasks
- Evaluation of dataset quality and teleoperation strategies
Bias, Risks, and Limitations
The training dataset contains demonstrations from a single robotic platform, a single object category (vials), and a highly structured environment.
Observed limitations include:
- Preference for a subset of rack positions during placement.
- Hovering behavior before releasing the vial.
- Reduced performance outside the training distribution.
- Sensitivity to camera placement and environmental configuration.
Recommendations
For best performance:
- Use an SO-101 robotic arm.
- Maintain camera placements similar to the training setup.
- Use controlled lighting conditions.
- Keep the workspace dimensions similar to those used during training.
- Fine-tune further before deploying to significantly different environments.
How to Get Started with the Model
Follow NVIDIA's Sim-to-Real workflow:
https://docs.nvidia.com/learning/physical-ai/sim-to-real-so-101/latest/index.html
For deployment and evaluation, use the modified workshop repository:
https://github.com/CursedRock17/Sim-to-Real-SO-101-Workshop/tree/gb10_current
Training Details
Training Data
Training data consists of teleoperated demonstrations collected in both simulation and the real world.
Task objective:
Pick up randomly positioned vials and place them into empty holes in a yellow rack.
Data collection characteristics:
- Approximately 125 simulation episodes
- Additional 15 real-world teleop episodes
- Domain randomization enabled during simulation
- Poor-quality episodes removed before training using LeRobot Doctor
- Dual-camera observations (external + gripper camera)
- Consistent grasping strategy across demonstrations
Dataset:
https://huggingface.co/datasets/CursedRock17/so101_teleop_vials_sim_and_real_v21
Training Procedure
The model was fine-tuned from NVIDIA's GR00T-N1.6-3B policy using NVIDIA's standard fine-tuning pipeline.
Preprocessing
Data collection emphasized:
- Smooth teleoperation trajectories
- Consistent grasping motions
- Removal of failed demonstrations
- Low cross-episode action variance
- Short task horizons
Problematic episodes were manually removed prior to training.
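One way to put a number on "low cross-episode action variance" is sketched below. It is a hypothetical metric for illustration only, not the computation used by the LeRobot Dataset Visualizer or by this training pipeline. It assumes each episode is a list of commanded positions for a single joint, resamples every episode to the same normalized length, and averages the per-timestep variance across episodes. The function names are made up for this sketch.

```python
from statistics import pvariance


def resample(trajectory, num_points):
    """Linearly resample a 1-D trajectory (list of floats) to num_points samples."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    if len(trajectory) == 1:
        return [trajectory[0]] * num_points
    out = []
    for i in range(num_points):
        # Fractional index into the original trajectory.
        pos = i * (len(trajectory) - 1) / (num_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(trajectory) - 1)
        frac = pos - lo
        out.append(trajectory[lo] * (1 - frac) + trajectory[hi] * frac)
    return out


def cross_episode_variance(episodes, num_points=50):
    """Average, over normalized time, of the variance across episodes."""
    resampled = [resample(ep, num_points) for ep in episodes]
    # zip(*resampled) groups the values of all episodes at the same time index.
    per_step = [pvariance(values) for values in zip(*resampled)]
    return sum(per_step) / len(per_step)


# Toy data: two similar demonstrations and one outlier (hypothetical values).
consistent = [[0.0, 0.5, 1.0], [0.0, 0.25, 0.5, 0.75, 1.0]]
with_outlier = consistent + [[1.0, 0.5, 0.0]]
print(cross_episode_variance(consistent))
print(cross_episode_variance(with_outlier))
```

A lower value means the demonstrations trace more similar paths for that joint. Episodes that push the value up sharply are candidates for review, although the decision to remove them remains a manual one.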
Desired Movement
The following teleoperation pattern was used to keep demonstrations consistent:
- Raise the head of the arm and open the gripper at the same time. As the arm lifts, begin a straight trajectory toward the vial.
- Grasp the vial just underneath the cap. Get a full grasp; otherwise the vial ends up in an awkward orientation when dropped.
- After grasping, pull back toward the base while keeping the gripper elevated, pan to the left (this consistency is visible in the dataset visualizer), arch back slightly, then drop the vial in.
- Release only once the vial is directly over the hole. Stabilizing at this point can be tricky because the operator has no accurate depth information.
- Try not to bump the arm on anything, including the rack.
- Stabilize the teleoperator arm against something else and operate it with one hand only.
Training Hyperparameters
Hyperparameters were taken from the base fine-tuning file:
- Base Model: GR00T-N1.6-3B
- Training Steps: 30,000
- Action Horizon: 16
- Control Rate: 30 Hz
- Training Regime: bf16 mixed precision
- Final Loss: < 0.01
- warmup_ratio: 0.05
- weight_decay: 1e-5
- learning_rate: 1e-4
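Two quantities follow directly from the values listed above, and the short sketch below works them out. It rests on two assumptions that the card does not confirm: that `warmup_ratio` is a fraction of the total training steps (the common convention), and that a full action chunk is executed at the control rate before the next prediction.

```python
# Values copied from the hyperparameter list above.
ACTION_HORIZON = 16      # actions predicted per chunk
CONTROL_RATE_HZ = 30     # control rate in Hz
TRAINING_STEPS = 30_000
WARMUP_RATIO = 0.05      # ASSUMPTION: fraction of total training steps

# Time covered by one action chunk if every predicted action is executed.
chunk_duration_s = ACTION_HORIZON / CONTROL_RATE_HZ

# Number of learning-rate warmup steps under the assumption above.
warmup_steps = round(TRAINING_STEPS * WARMUP_RATIO)

print(f"{chunk_duration_s:.3f} s per action chunk")
print(f"{warmup_steps} warmup steps")
```

Under these assumptions each chunk covers roughly half a second of motion, and the learning rate ramps up over the first 1,500 steps.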
Speeds, Sizes, Times
Training hardware:
- Dell Pro Max with NVIDIA GB10
Training durations:
- 5,000 Steps: 3h 43m
- 30,000 Steps: 21h 40m
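The durations above imply a per-step cost, which may help when budgeting a run on similar hardware. This is plain arithmetic on the reported numbers; the helper name `to_seconds` is introduced here for illustration.

```python
def to_seconds(hours, minutes):
    """Convert an 'Xh Ym' duration to seconds."""
    return hours * 3600 + minutes * 60


# Reported wall-clock durations, keyed by number of training steps.
runs = {5_000: to_seconds(3, 43), 30_000: to_seconds(21, 40)}

for steps, elapsed in runs.items():
    print(f"{steps} steps: {elapsed / steps:.2f} s/step")
# prints "5000 steps: 2.68 s/step" and "30000 steps: 2.60 s/step"
```

Both runs come out at roughly 2.6 to 2.7 seconds per step, so training time scaled close to linearly with step count.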
Evaluation
Testing Data, Factors & Metrics
External Tools
To monitor the loss, I used the Weights & Biases page.
To check for viable episodes, I used both the "Action Insights" and "Doctor" tabs of the LeRobot Dataset Visualizer.
To inspect attention, I started using the lerobot_attention_visualizer.
Note: the scripts can be found in the helper_scripts section of my repo and are still a work in progress.
Testing Data
Evaluation was performed using held-out task executions in the same physical environment.
Factors
The following factors were evaluated:
- Object localization success
- Grasp success
- Placement success
- Out-of-distribution lighting robustness
Metrics
Primary metrics:
- Vials Located
- Vials Grasped
- Vials Successfully Placed
Results
30K Step Checkpoint
Evaluation Episodes: 10
- Vials Located: 10/10
- Vials Grasped: 9/10
- Vials Placed: 8/10
Placement Success Rate: 80% :)
5K Step Checkpoint
Evaluation Episodes: 10
- Vials Located: 7/10
- Vials Grasped: 2/10
- Vials Placed: 1/10
Placement Success Rate: 10% :(
OOD Lighting Evaluation
Lighting Conditions:
- 0%
- 25%
- 75%
- 100%
Evaluation Episodes: 10
- Vials Located: 10/10
- Vials Grasped: 9/10
- Vials Placed: 7/10
Placement Success Rate: 70% :)
Summary
Increasing training duration from 5,000 to 30,000 steps raised placement success from 1/10 to 8/10.
The model performs well in the constrained training environment and retains most of that performance under lighting variation (7/10 placements).
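The reported counts can also be broken down by stage to show where episodes were lost. The sketch below copies the counts from the Results section. The conditional rates assume that every grasped vial was first located and every placed vial was first grasped, which the card implies but does not state. The function names are illustrative.

```python
# Counts copied from the Results section; each evaluation used 10 episodes.
RESULTS = {
    "5K checkpoint": {"episodes": 10, "located": 7, "grasped": 2, "placed": 1},
    "30K checkpoint": {"episodes": 10, "located": 10, "grasped": 9, "placed": 8},
    "OOD lighting": {"episodes": 10, "located": 10, "grasped": 9, "placed": 7},
}


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def stage_rates(counts):
    """Per-stage rates; conditional rates assume the stages are nested."""
    return {
        "located": ratio(counts["located"], counts["episodes"]),
        "grasped_given_located": ratio(counts["grasped"], counts["located"]),
        "placed_given_grasped": ratio(counts["placed"], counts["grasped"]),
        "placed_overall": ratio(counts["placed"], counts["episodes"]),
    }


for name, counts in RESULTS.items():
    rates = stage_rates(counts)
    print(name, {k: round(v, 2) for k, v in rates.items() if v is not None})
```

Read this way, the 5K checkpoint lost most episodes at the grasp stage (2 of 7 located vials grasped), while the 30K checkpoint lost one episode at grasping and one at placement.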
Model Examination
Qualitative observations:
- The policy often hovers over the rack before releasing a vial.
- The model occasionally recovers from failed grasp attempts.
- Placement behavior tends to favor a subset of rack positions.
Future work may include attention visualization and policy interpretability analysis.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact Calculator:
https://mlco2.github.io/impact
- Hardware Type: Dell Pro Max with GB10
- Hours Used: 21.67
- Carbon Emitted: Not estimated
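For readers who want a rough figure, emissions estimates of this kind usually take the form power × time × grid carbon intensity. The sketch below applies that formula to the 21.67 hours reported above. The average power draw and the grid intensity are hypothetical placeholders, not measurements of the GB10 or of any particular grid, so the printed number is illustrative only and should be replaced with measured values.

```python
def estimate_emissions_kg(avg_power_watts, hours, intensity_kg_per_kwh):
    """Estimate CO2-equivalent emissions as energy used times grid intensity."""
    energy_kwh = avg_power_watts / 1000.0 * hours
    return energy_kwh * intensity_kg_per_kwh


HOURS_USED = 21.67             # from this model card
AVG_POWER_WATTS = 100.0        # HYPOTHETICAL placeholder, not a measured value
INTENSITY_KG_PER_KWH = 0.4     # HYPOTHETICAL placeholder for the local grid

estimate = estimate_emissions_kg(AVG_POWER_WATTS, HOURS_USED, INTENSITY_KG_PER_KWH)
print(f"Illustrative estimate: {estimate:.2f} kg CO2eq")
```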
Technical Specifications
Model Architecture and Objective
The model is based on NVIDIA's GR00T-N1.6-3B Vision-Language-Action architecture.
Objective:
- Locate vial
- Grasp vial
- Transport vial
- Place vial into rack
Compute Infrastructure
Hardware
- Dell Pro Max with NVIDIA GB10
- SO-101 Robot Arm
Software
- Isaac Sim 5.1.0
- LeRobot v0.4.3 (Actual commit: e670ac5daf9b76)
- Python 3.10
Model Card Authors
Lucas Wendland
University of Maryland MATRIX Lab
Model Card Contact
Please open an issue on the Hugging Face Hub repository for questions, bug reports, or reproduction issues.