---
license: apache-2.0
datasets:
- MedVLSynther/MedSynVQA-2K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# MedVLSynther-3B-RL_2K

- Code: https://github.com/UCSC-VLAA/MedVLSynther
- Project Page: https://ucsc-vlaa.github.io/MedVLSynther/

## Model Description

MedVLSynther-3B-RL_2K is a 3B-parameter medical vision-language model based on Qwen2.5-VL-3B-Instruct.
It was trained with reinforcement learning on the MedSynVQA-2K dataset.

## Model Details

- Base Model: Qwen/Qwen2.5-VL-3B-Instruct
- Model Size: 3B parameters
- Training Method: Reinforcement Learning
- Training Data: MedSynVQA-2K dataset

## Usage

Demo images for the example below are linked from the repository's quick-start section: https://github.com/UCSC-VLAA/MedVLSynther?tab=readme-ov-file#-quick-start

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the model and its processor
model_name = "MedVLSynther/MedVLSynther-3B-RL_2K"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

# System prompt shared by both examples: reasoning goes inside <think> tags,
# the final answer inside <answer> tags.
SYSTEM_PROMPT = (
    "You will solve a problem/request. You should provide your thoughts within "
    "<think> </think> tags before providing the answer.\n"
    "Write your final answer within <answer> </answer> tags."
)

# Example 1: multiple-choice question about a histology image
messages_1 = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/7bMMMU.png"},
            {
                "type": "text",
                "text": "This line of myelinated axons in layer IV of visual cortex represents the axons of cells in the Choices: (A) Superior colliculus. (B) Lateral geniculate. (C) Retina. (D) Medial geniculate.",
            },
        ],
    },
]

# Example 2: yes/no question about a radiology image
messages_2 = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/7bslake.png"},
            {"type": "text", "text": "Does the picture contain kidney? Choices: (A) Yes (B) No"},
        ],
    },
]

# Preparation for inference
messages = messages_2

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model
inputs = inputs.to(model.device)

# Inference: sample a response, then strip the prompt tokens from each sequence
generated_ids = model.generate(
    **inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95, do_sample=True
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
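
The system prompt asks the model to put its reasoning in `<think>` tags and its final answer in `<answer>` tags, so downstream code usually needs to separate the two. The following is a minimal sketch of one way to do that with regular expressions. `parse_response` is a hypothetical helper, not part of the MedVLSynther codebase, and the sample string is hand-written for illustration rather than real model output.

```python
import re


def parse_response(text: str) -> dict:
    """Split a response into reasoning and final answer (hypothetical helper)."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return {
        # Each field is None when the model omitted the corresponding tags.
        "thinking": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
    }


# Hand-written sample in the expected format, not actual model output.
sample = "<think>The slice shows abdominal organs.</think>\n<answer>(A) Yes</answer>"
parsed = parse_response(sample)
print(parsed["answer"])
```

With the usage example above, the same helper could be applied to each element of `output_text`. Because generation is sampled, a response may occasionally lack the tags, which is why the sketch returns `None` instead of raising.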

## Citation

```bibtex
@article{MedVLSynther,
  title={MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs},
  author={Huang, Xiaoke and Wang, Ningsen and Liu, Hui and Tang, Xianfeng and Zhou, Yuyin},
  journal={arXiv preprint arXiv:2510.25867},
  year={2025}
}
@article{MedVLThinker,
  title={MedVLThinker: Simple baselines for multimodal medical reasoning},
  author={Huang, Xiaoke and Wu, Juncheng and Liu, Hui and Tang, Xianfeng and Zhou, Yuyin},
  journal={arXiv preprint arXiv:2508.02669},
  year={2025}
}
```

## License

This model is released under the Apache 2.0 license.