MedGemma 1.5 RSNA Abdominal Trauma Adapter

This is a LoRA adapter, trained with parameter-efficient fine-tuning (PEFT), for google/medgemma-1.5-4b-it. It specializes the base medical vision-language model for trauma radiology: it analyzes abdominal CT angiogram volumes to detect and classify solid organ injuries and hemorrhages, and returns its findings as a structured JSON report.

Model Details

Model Description

The base MedGemma 1.5 (4B parameters) model has been fine-tuned using LoRA on the RSNA 2023 Abdominal Trauma Detection dataset. Instead of open-ended conversational text, this adapter strictly aligns the model to evaluate multi-slice CT volumes and generate a structured JSON output detailing the injury pattern, specific organs involved (liver, spleen, kidney, bowel), bleeding description, severity estimation, and differential diagnoses.

  • Developed by: Jayant Som
  • Funded by [optional]: N/A
  • Shared by [optional]: N/A
  • Model type: Multimodal Vision-Language Model (VLM) Adapter
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: google/medgemma-1.5-4b-it

Model Sources [optional]

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

Direct Use

This adapter is intended to be loaded on top of the base medgemma-1.5-4b-it model. It is designed to take an interleaved sequence of 2.5D CT slice images (NIfTI/DICOM converted to RGB via soft-tissue windowing) and output a precise JSON schema. It is a core component of the HAI-DEF multi-model trauma analysis pipeline.
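
As a rough illustration of the input described above, the sketch below assembles several slice images and one instruction into a single chat message, using the same message layout as the getting-started code in this card. The helper name `build_trauma_messages`, the plural wording of the prompt, and the images-first ordering are assumptions for illustration; the card does not specify how slices and text are interleaved.

```python
from typing import Any, Dict, List

# Prompt adapted from the single-slice example in this card (assumed wording).
PROMPT = (
    "You are a trauma radiologist. Analyze these abdominal CT angiogram slices "
    "for hemorrhage and solid organ injury. Respond in JSON with keys: "
    "injury_pattern, organs_involved, bleeding_description, severity_estimate, "
    "differential_diagnosis."
)


def build_trauma_messages(slices: List[Any], prompt: str = PROMPT) -> List[Dict[str, Any]]:
    """Build a single-turn chat message holding every slice followed by the prompt.

    `slices` would normally be PIL images; any object is accepted here so the
    structure can be inspected without loading real data.
    """
    if not slices:
        raise ValueError("at least one CT slice is required")
    content = [{"type": "image", "image": s} for s in slices]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Hypothetical usage with placeholder strings standing in for PIL images.
messages = build_trauma_messages(["slice_a", "slice_b", "slice_c"])
```

The resulting list can be passed to `processor.apply_chat_template` in the same way as the single-image example.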

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

  • This model is a research prototype and is not intended for direct clinical decision making or unsupervised patient diagnosis.
  • The model is specialized for abdominal trauma CT scans and will likely perform poorly on MRIs, X-rays, or CTs of other anatomical regions (e.g., cranial or thoracic).

Bias, Risks, and Limitations

  • Dataset Limitation: The model was fine-tuned on a highly curated subset of 200 cases from the RSNA 2023 challenge. It may inherit biases present in that specific sample distribution (e.g., underrepresentation of rare bowel injuries).
  • Hallucinations: Like all LLMs/VLMs, the model can confidently hallucinate clinical findings. Outputs must always be reviewed by a qualified radiologist.
  • Image Windowing: The model expects CT images formatted with soft-tissue windowing (Center: 50 HU, Width: 400 HU). Providing incorrectly windowed images will degrade performance.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
from peft import PeftModel

# 1. Load base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base_model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-1.5-4b-it", 
    quantization_config=bnb_config, 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("google/medgemma-1.5-4b-it")

# 2. Load this LoRA adapter
model = PeftModel.from_pretrained(base_model, "jayantsom/medgemma-1v5-4b-it-rsna23-abd-ct-peft-lora-r16-a32-ep3-lr2e4-v1")

# 3. Prepare Image and Prompt
image = Image.open("path_to_ct_slice.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "You are a trauma radiologist. Analyze this abdominal CT angiogram slice for hemorrhage and solid organ injury. Respond in JSON with keys: injury_pattern, organs_involved, bleeding_description, severity_estimate, differential_diagnosis."}
    ]
}]

# Move inputs to wherever device_map="auto" placed the model instead of hard-coding "cuda"
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)

# 4. Generate JSON Output
outputs = model.generate(**inputs, max_new_tokens=800, do_sample=False)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
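
Generated text is not guaranteed to be clean JSON (a model may add a preamble or wrap the object in extra text), so downstream code usually parses defensively. The sketch below is one way to do that with the standard library. `parse_report` and its extraction heuristic are illustrative assumptions; the key names come from the prompt above.

```python
import json
import re
from typing import Any, Dict

REQUIRED_KEYS = {
    "injury_pattern",
    "organs_involved",
    "bleeding_description",
    "severity_estimate",
    "differential_diagnosis",
}


def parse_report(text: str) -> Dict[str, Any]:
    """Parse a generated report and check that every expected key is present."""
    # Keep only the outermost {...} span, which drops any preamble or wrapper text.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    report = json.loads(match.group(0))
    missing = REQUIRED_KEYS - set(report)
    if missing:
        raise ValueError(f"report is missing keys: {sorted(missing)}")
    return report


# Hand-written stand-in for a model response, for illustration only.
sample = "Report:\n" + json.dumps(
    {
        "injury_pattern": "blunt abdominal trauma",
        "organs_involved": ["spleen"],
        "bleeding_description": "no active extravasation",
        "severity_estimate": "low",
        "differential_diagnosis": [],
    }
)
report = parse_report(sample)
```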

Training Details

Training Data

The model was fine-tuned on a curated 200-sample subset of the RSNA 2023 Abdominal Trauma Detection dataset (jherng/rsna-2023-abdominal-trauma-detection). The multi-slice 3D NIfTI volumes were streamed and loaded lazily rather than held in memory all at once.

Training Procedure

Preprocessing

CT NIfTI volumes were loaded and dynamically sliced using a 2.5D multi-slice approach. Slices were extracted from the middle 60% of the volume, and soft-tissue windowing (Center: 50 HU, Width: 400 HU) was applied to map the data into 3-channel RGB PIL Images.
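
The two preprocessing steps described above can be sketched in plain Python. This is an illustrative approximation, not the author's preprocessing code: the function names are hypothetical, the rounding and boundary handling are assumptions, and a real pipeline would apply the same mapping to whole arrays (for example with NumPy) before building PIL images.

```python
def window_hu(hu: float, center: float = 50.0, width: float = 400.0) -> int:
    """Map a Hounsfield value to an 8-bit intensity using a linear window."""
    low = center - width / 2.0   # -150 HU for the soft-tissue window
    high = center + width / 2.0  # +250 HU for the soft-tissue window
    clipped = min(max(hu, low), high)
    return round((clipped - low) / (high - low) * 255)


def middle_slice_range(num_slices: int, fraction: float = 0.6) -> range:
    """Return the indices covering the central `fraction` of a volume."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = round(num_slices * fraction)
    start = (num_slices - keep) // 2
    return range(start, start + keep)


# Air (-1000 HU) and dense bone (+1000 HU) saturate at the window edges.
edge_values = [window_hu(-1000), window_hu(250), window_hu(1000)]
central = middle_slice_range(100)  # slices 20..79 of a 100-slice volume
```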

Training Hyperparameters

  • Training regime: bfloat16 mixed precision, with the base model loaded in 4-bit NF4 quantization via BitsAndBytes.
  • Attention Mechanism: sdpa (Scaled Dot Product Attention)
  • Epochs: 3
  • Learning Rate: 2e-4 (Cosine scheduler, 50 warmup steps)
  • Batch Size: 1 per device (Gradient Accumulation: 8) -> Effective Batch Size: 8
  • Optimizer: paged_adamw_8bit
  • LoRA Parameters: Rank = 16, Alpha = 32
  • LoRA Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Trainable Parameters: 32,788,480 (0.7567% of base model)
  • Regularization: NEFTune Noise Alpha = 5.0
  • Memory Optimization: Gradient Checkpointing enabled
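
The trainable-parameter figure can be cross-checked with a little arithmetic: a LoRA adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below applies that formula to the listed target modules. The layer shapes are assumptions (they correspond to commonly published Gemma 3 4B and SigLIP configuration values and are not stated in this card), and it further assumes that the q_proj, k_proj, and v_proj names also match the vision encoder's attention layers.

```python
RANK = 16


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """Weights added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed decoder shapes: hidden size 2560, 8 query heads and 4 KV heads of
# size 256, MLP width 10240, 34 layers.
decoder_modules = {
    "q_proj": (2560, 2048),
    "k_proj": (2560, 1024),
    "v_proj": (2560, 1024),
    "o_proj": (2048, 2560),
    "gate_proj": (2560, 10240),
    "up_proj": (2560, 10240),
    "down_proj": (10240, 2560),
}
decoder_total = 34 * sum(lora_params(i, o) for i, o in decoder_modules.values())

# Assumed vision-encoder shapes: hidden size 1152, 27 layers; only the
# q_proj/k_proj/v_proj names match there.
vision_total = 27 * 3 * lora_params(1152, 1152)

total = decoder_total + vision_total
print(f"{total:,}")
```

Under these assumptions the total comes to 32,788,480, which agrees with the trainable-parameter count listed above.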

Speeds, Sizes, Times

  • Hardware: NVIDIA A100-SXM4-40GB
  • Training Runtime: 784.1 seconds (~13 minutes)
  • Throughput: 0.765 samples/second
  • Final Training Loss: 1.618
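
The reported throughput is consistent with the other figures on this card, as the short check below shows. It assumes that all 200 training samples were seen in each of the 3 epochs and that optimizer steps are counted as samples divided by the effective batch size of 8; neither detail is stated explicitly.

```python
TRAIN_SAMPLES = 200        # size of the curated subset
EPOCHS = 3
EFFECTIVE_BATCH = 1 * 8    # per-device batch size x gradient accumulation
RUNTIME_SECONDS = 784.1

samples_seen = TRAIN_SAMPLES * EPOCHS
throughput = samples_seen / RUNTIME_SECONDS                    # samples per second
optimizer_steps = (TRAIN_SAMPLES // EFFECTIVE_BATCH) * EPOCHS  # assumed step count
runtime_minutes = RUNTIME_SECONDS / 60

print(f"throughput: {throughput:.3f} samples/s")
print(f"optimizer steps: {optimizer_steps}")
print(f"runtime: {runtime_minutes:.1f} min")
```

If the step count is right, the 50 warmup steps would span roughly two thirds of the 75-step run, which may be worth revisiting when reproducing the training.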

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was performed on a held-out validation split of 20 samples from the RSNA 2023 Abdominal Trauma dataset, none of which were seen during training.

Factors

Evaluation disaggregates performance based on:

  1. Organ Type: Liver, Spleen, Kidney, and Bowel.
  2. Injury Severity: Low-grade vs. High-grade lacerations.
  3. Artifacts: Presence of medical devices or imaging artifacts (e.g., motion blur).

Metrics

  • Format Adherence: Percentage of outputs that successfully parsed as valid JSON.
  • Clinical Recall: Accuracy of correctly identifying specific damaged solid organs.
  • Severity Match: Exact match rate for the severity_estimate field compared to radiologist ground-truth labels.
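
A minimal sketch of how the first and third metrics could be computed from raw generations follows. The function names, the record layout, and the choice to count unparseable outputs as severity mismatches are assumptions; the card does not describe the actual evaluation script.

```python
import json
from typing import List, Optional


def try_parse(text: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the text is not a JSON object."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def format_adherence(outputs: List[str]) -> float:
    """Fraction of outputs that parse as a JSON object."""
    return sum(try_parse(o) is not None for o in outputs) / len(outputs)


def severity_match(outputs: List[str], labels: List[str]) -> float:
    """Exact-match rate of `severity_estimate` against ground-truth labels."""
    hits = 0
    for text, label in zip(outputs, labels):
        parsed = try_parse(text)
        if parsed is not None and parsed.get("severity_estimate") == label:
            hits += 1
    return hits / len(labels)


# Hand-written toy data for illustration; these are not real model outputs.
outputs = [
    '{"severity_estimate": "severe"}',
    '{"severity_estimate": "mild"}',
    "not json",
    '{"severity_estimate": "mild"}',
]
labels = ["severe", "severe", "mild", "mild"]
adherence = format_adherence(outputs)
match_rate = severity_match(outputs, labels)
```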

Results

  • Format Adherence: 100% (every output on the 20-sample validation split parsed as valid JSON).
  • Clinical Recall (organ identification): ~88% accuracy in identifying liver/spleen trauma.
  • Severity Match: ~82% alignment with clinical ground truth on mild vs. severe differentiation.

Summary

The LoRA fine-tuning successfully shifted MedGemma from a conversational assistant into a strict clinical parser. It demonstrates high capability in recognizing massive hemorrhage and solid organ damage, though it occasionally struggles with subtle, low-grade bowel injuries due to the limited 200-sample dataset size.

Model Examination

Early qualitative examination indicates that the model heavily relies on the interleaved 2.5D visual tokens to identify active extravasation (bright contrast pooling). It demonstrates strong cross-modal alignment between the visual presence of hemoperitoneum and the generated bleeding_description text.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 1x NVIDIA A100-SXM4-40GB
  • Hours used: ~0.25 hours
  • Cloud Provider: Google Colab
  • Compute Region: US-East (estimated default)
  • Carbon Emitted: Estimated at < 0.05 kg CO2 eq., owing to the short single-GPU LoRA training run.
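
The emissions estimate can be sanity-checked with a back-of-the-envelope calculation: energy is power times time, and emissions are energy times the grid's carbon intensity. The sketch assumes the GPU ran at a 400 W rated board power for the full 0.25 hours and uses an illustrative grid intensity of 0.4 kg CO2 eq. per kWh. Both values are assumptions; the actual power draw and the carbon intensity of the Colab region are unknown.

```python
GPU_POWER_KW = 0.4       # assumed: A100-SXM4 rated board power (400 W)
HOURS_USED = 0.25        # from the list above
CARBON_INTENSITY = 0.4   # assumed: kg CO2 eq. per kWh, illustrative grid value

energy_kwh = GPU_POWER_KW * HOURS_USED
emissions_kg = energy_kwh * CARBON_INTENSITY
print(f"{energy_kwh:.2f} kWh, {emissions_kg:.3f} kg CO2 eq.")
```

This lands at about 0.04 kg CO2 eq., in line with the figure above, though it ignores CPU, memory, and datacenter overhead.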

Technical Specifications

Model Architecture and Objective

The underlying architecture is MedGemma 1.5, a multimodal vision-language model built on Gemma 3 that pairs a SigLIP-based image encoder with a decoder-only language model. The objective is autoregressive causal language modeling conditioned on interleaved image and text tokens; only the Low-Rank Adaptation (LoRA) matrices applied to the attention and MLP layers are trained.

Compute Infrastructure

Hardware

  • GPU: NVIDIA A100-SXM4-40GB
  • System Memory: 83.5 GB (Colab High-RAM instance)

Software

  • PEFT: 0.19.1
  • Transformers: 4.47+
  • TRL: 0.12+
  • Datasets: <3.0.0

Citation

BibTeX:

@misc{rsna2023trauma,
  author = {RSNA},
  title = {RSNA 2023 Abdominal Trauma Detection},
  year = {2023},
  publisher = {Kaggle},
  url = {https://www.kaggle.com/competitions/rsna-2023-abdominal-trauma-detection}
}

APA:

Radiological Society of North America (RSNA). (2023). RSNA 2023 abdominal trauma detection. Kaggle. https://www.kaggle.com/competitions/rsna-2023-abdominal-trauma-detection

Glossary

  • LoRA: Parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer.
  • NIfTI: Neuroimaging Informatics Technology Initiative. A popular file format for storing 3D medical imaging data (e.g., CT volumes).
  • HU (Hounsfield Units): A quantitative scale for describing radiodensity in medical CT. Soft-tissue windowing (center: 50 HU, width: 400 HU) is used here.
  • SDPA: Scaled Dot Product Attention. Here it refers to PyTorch's built-in torch.nn.functional.scaled_dot_product_attention, which can dispatch to fused, memory-efficient kernels, including a FlashAttention-based one.
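
For readers who prefer formulas, the LoRA and SDPA entries above can be written compactly. These are the standard textbook forms, assuming the default LoRA scaling of α/r (not a variant such as rsLoRA); with the rank and alpha used here the scale factor is 32/16 = 2.

```latex
% LoRA: frozen weight W_0 plus a scaled low-rank update
W = W_0 + \frac{\alpha}{r} B A, \qquad
B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}}

% Scaled dot-product attention
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```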

More Information

This adapter is released for educational and research purposes. I encourage the community to:

  • Experiment with and improve upon this model
  • Share your results, insights, and adaptations
  • Collaborate on advancing HAI-DEF clinical screening and related medical imaging tasks

I am open to collaboration, questions, and feedback. Feel free to reach out via Hugging Face or email below.

Model Card Authors

Jayant Som

Model Card Contact

Jayant Som
(Reach out via Hugging Face profile or jayant2025ms@gmail.com)
