---
base_model: Qwen/Qwen2.5-32B-Instruct
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- en
- ko
tags:
- lora
- sft
- transformers
- trl
- protein
- bioinformatics
- uniprot
datasets:
- im-sangwoon/protein-sft-uniprot
---

# chatprot-qwen2.5-32b-lora

A LoRA adapter for protein research, built on **Qwen2.5-32B-Instruct**. It was trained with supervised fine-tuning (SFT) on about 1.55 million protein Q&A pairs built from the UniProt database, and answers specialist questions about protein names, functions, families, subcellular locations, and related topics.

## Model Details

- **Base Model:** [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **Language:** English, Korean
- **License:** Apache 2.0

## How to Use

### With Transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen2.5-32B-Instruct"
adapter_id = "im-sangwoon/chatprot-qwen2.5-32b-lora"

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter_id)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)

# Inference
messages = [
    {"role": "system", "content": "You are a helpful assistant for protein analysis."},
    {"role": "user", "content": "What is the function of Hemoglobin subunit alpha?"}
]

# Render the chat template to a prompt string, then tokenize it
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

### With vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --enable-lora \
    --lora-modules chatprot=im-sangwoon/chatprot-qwen2.5-32b-lora \
    --dtype bfloat16 \
    --trust-remote-code \
    --max-model-len 8192 \
    --max-lora-rank 64
```

## Training Details

### Training Data

[im-sangwoon/protein-sft-uniprot](https://huggingface.co/datasets/im-sangwoon/protein-sft-uniprot): a protein Q&A dataset extracted from the UniProt database (about **1,551,711 examples**).

Question types:

- Official protein names
- Protein function
- Protein family classification
- Subcellular location

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 16 |
| Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Task Type | CAUSAL_LM |

### Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size (per device) | 1 |
| Gradient accumulation steps | 8 |
| Effective batch size | 8 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Weight decay | 0.001 |
| Max grad norm | 0.3 |
| Optimizer | AdamW (fused) |
| Precision | bf16 |
| Gradient checkpointing | Enabled |

### Hardware

- **GPU:** NVIDIA A6000 x 8

### Framework Versions

- PEFT 0.18.0
- TRL (SFTTrainer)
- Transformers
- PyTorch
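
## Example: Querying the vLLM Server

Once the vLLM server from the "With vLLM" section is running, requests must name the adapter (`chatprot`, as registered with `--lora-modules`) rather than the base model, otherwise the LoRA weights are not applied. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8000`, which is vLLM's default port; the helper names `build_chat_request` and `ask` are illustrative and not part of this repository.

```python
import json
import urllib.request

# Assumption: the server runs locally on vLLM's default port (8000).
API_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(question: str, max_tokens: int = 512, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat payload that targets the LoRA adapter."""
    return {
        # "chatprot" is the adapter name registered via --lora-modules;
        # passing the base model ID here would bypass the adapter.
        "model": "chatprot",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant for protein analysis."},
            {"role": "user", "content": question},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def ask(question: str, url: str = API_URL) -> str:
    """POST the question to the server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(question)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,  # supplying a body makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("What is the function of Hemoglobin subunit alpha?"))` sends the same question used in the Transformers example. The system prompt and sampling settings mirror that example; adjust them as needed.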