---
base_model: mistralai/Mistral-7B-v0.1
library_name: peft
pipeline_tag: text-generation
license: mit
language:
- en
tags:
- lora
- transformers
- summarization
- cross-modal
- video-summarization
- curriculum-learning
- catastrophic-forgetting
datasets:
- ccdv/arxiv-summarization
- jamescalam/ai-arxiv
---

# Hybrid-Summariser Cross-Modal LoRA (Phase 3)

LoRA adapter for Mistral-7B-v0.1, trained with a 3-phase curriculum framework that prevents catastrophic forgetting during cross-modal summarization. It produces domain-aware summaries of CS lecture videos without academic style contamination.

**Paper:** Follow-up to "Cross-Modal Transfer Learning in Domain-Adaptive Video Summarization" (IMPACT 2025, Springer)

**Repo:** [github.com/Tushar-9802/Hybrid-Dataset-Summariser](https://github.com/Tushar-9802/Hybrid-Dataset-Summariser)

## Results (vs zero-shot Mistral-7B, n=75 videos, p < 0.005)

| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| Video ROUGE-1 | 0.263 | **0.417** | +58% |
| Video ROUGE-2 | 0.032 | **0.119** | +272% |
| Video BERTScore | -0.032 | **+0.151** | flipped positive |
| Passive Voice % | 9.9% | **14.1%** | controlled (vs 31.4% catastrophic) |
| Cross-Modal Consistency | 0.356 | **0.531** | +49% |

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model, then attach the LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "Tushar9802/hybrid-summariser-crossmodal-lora")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

your_text = "<text to summarize>"  # paper text or lecture transcript
prompt = f"Summarize the following text.\n\nText: {your_text}\n\nSummary:"

# Move inputs to the device chosen by device_map instead of hard-coding "cuda"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=4, no_repeat_ngram_size=3)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Training

Three-phase curriculum on an RTX 5070 Ti (16 GB VRAM):

| Phase | Data Mix | Methods Active | Epochs |
|-------|----------|----------------|--------|
| 1 | 100% papers | LoRA+ (ratio 8x) | 3 |
| 2 | 50P/40V/10Pr | +OPLoRA (k=16), +EWC (lambda=200) | 1 |
| 3 | 30P/60V/10Pr | +CrossCLR (tau=0.03), EWC (lambda=400) | 1 |

### Dataset

4,324 CS samples: 2,368 arXiv papers + 738 YouTube lectures + 1,218 SBERT-mined cross-modal pairs. Available on [Kaggle](https://www.kaggle.com/datasets/tusharjaju/hybrid-dataset-summariser-crossmodal).

### Hyperparameters

- LoRA: r=32, alpha=64, dropout=0.1, targets=q,k,v,o,gate,up,down
- Trainable: 83.9M params (1.16% of 7.24B total)
- Optimizer: 8-bit AdamW, effective batch size 24
- Quantization: 4-bit NF4, double quant, bfloat16 compute
- Peak VRAM: 11.3 GB (Phase 3)

## Training Procedure

**Phase 1** trains on papers only to acquire domain vocabulary. At the end of the phase, Fisher Information (83.9M entries across 448 parameter matrices) and an SVD (top-16 singular directions per module) are computed for use in the later phases.

**Phase 2** introduces videos and cross-modal pairs. OPLoRA orthogonal projection prevents subspace contamination, EWC preserves the parameters critical to paper summarization, and a 10% replay buffer carries over data from Phase 1.

**Phase 3** shifts to video-heavy training with contrastive cross-modal alignment (CrossCLR) and stronger elastic regularization (EWC lambda raised from 200 to 400).

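
The 83.9M trainable parameters and the 448 parameter matrices quoted above follow directly from the LoRA configuration. The sketch below reproduces that arithmetic. The layer dimensions are assumptions taken from the published Mistral-7B-v0.1 configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); they are not stated in this card, and the helper name `lora_param_count` is purely illustrative.

```python
# Mistral-7B-v0.1 dimensions (assumed from the public model config, not from this card).
HIDDEN = 4096          # hidden_size
KV_DIM = 8 * 128       # num_key_value_heads * head_dim (grouped-query attention)
INTERMEDIATE = 14336   # intermediate_size of the MLP
NUM_LAYERS = 32
RANK = 32              # LoRA r, from the Hyperparameters list

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int = RANK) -> tuple[int, int]:
    """Return (trainable LoRA parameters, number of LoRA matrices)."""
    # Each adapted projection adds A (rank x d_in) and B (d_out x rank).
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGETS.values())
    matrices = NUM_LAYERS * len(TARGETS) * 2
    return NUM_LAYERS * per_layer, matrices


params, matrices = lora_param_count()
print(f"{params / 1e6:.1f}M trainable parameters in {matrices} matrices")
print(f"{100 * params / 7.24e9:.2f}% of 7.24B")
```

Under these assumptions the count comes to 83,886,080 parameters in 448 matrices, about 1.16% of 7.24B, which agrees with the figures reported in this card. A diagonal Fisher estimate stores one entry per trainable parameter, which would explain why the Fisher Information has the same 83.9M entries.
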
## Limitations

- CrossCLR produced NaN values on some pair batches, which partially limited contrastive alignment
- Single model (Mistral-7B) and single domain (CS/engineering); generalization is untested
- Video references were generated by GPT-4o-mini, not human annotators
- No human evaluation was conducted

## Citation

```bibtex
@inproceedings{jaju2025crossmodal,
  title={Cross-Modal Transfer Learning in Domain-Adaptive Video Summarization},
  author={Jaju, Tushar and Saharawat, Tanishka and Bhatia, Shruti and Rastogi, Shivansh},
  booktitle={Proc. IMPACT 2025},
  publisher={Springer},
  year={2025}
}
```

## Authors

Tushar Jaju (training infrastructure, implementation, experiments), Tanishka Saharawat, Shruti Bhatia, Shivansh Rastogi

Guide: Dr. Neha Yadav, ABES Engineering College, Ghaziabad (AKTU)

## Framework Versions

- PEFT: 0.18.1
- Transformers: 5.0.0rc3
- PyTorch: 2.11.0.dev20260214+cu128
- bitsandbytes: 0.49.2
- Python: 3.11