---
base_model: mistralai/Mistral-7B-v0.1
library_name: peft
pipeline_tag: text-generation
license: mit
language:
  - en
tags:
  - lora
  - transformers
  - summarization
  - cross-modal
  - video-summarization
  - curriculum-learning
  - catastrophic-forgetting
datasets:
  - ccdv/arxiv-summarization
  - jamescalam/ai-arxiv
---

# Hybrid-Summariser Cross-Modal LoRA (Phase 3)

LoRA adapter for Mistral-7B-v0.1, trained with a three-phase curriculum designed to limit catastrophic forgetting during cross-modal summarization. It produces domain-aware summaries of CS lecture videos while keeping academic-style contamination low (passive-voice rate of 14.1%, against 31.4% in the catastrophic-forgetting case).

**Paper:** Follow-up to "Cross-Modal Transfer Learning in Domain-Adaptive Video Summarization" (IMPACT 2025, Springer)

**Repo:** [github.com/Tushar-9802/Hybrid-Dataset-Summariser](https://github.com/Tushar-9802/Hybrid-Dataset-Summariser)

## Results (vs zero-shot Mistral-7B, n=75 videos, p < 0.005)

| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| Video ROUGE-1 | 0.263 | **0.417** | +58% |
| Video ROUGE-2 | 0.032 | **0.119** | +272% |
| Video BERTScore | -0.032 | **+0.151** | flipped positive |
| Passive Voice % | 9.9% | **14.1%** | controlled (vs 31.4% catastrophic) |
| Cross-Modal Consistency | 0.356 | **0.531** | +49% |

## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "Tushar9802/hybrid-summariser-crossmodal-lora")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

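text = "Paste the lecture transcript or paper text here."  # input to summarize
prompt = f"Summarize the following text.\n\nText: {text}\n\nSummary:"
# Use model.device rather than a hard-coded "cuda": device_map="auto" decides the placement.
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)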
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=4, no_repeat_ngram_size=3)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Training

Three-phase curriculum on an RTX 5070 Ti (16 GB VRAM):

| Phase | Data Mix | Methods Active | Epochs |
|-------|----------|----------------|--------|
| 1 | 100% papers | LoRA+ (ratio 8x) | 3 |
| 2 | 50P/40V/10Pr | +OPLoRA (k=16), +EWC (lambda=200) | 1 |
| 3 | 30P/60V/10Pr | +CrossCLR (tau=0.03), EWC (lambda=400) | 1 |

### Dataset

4,324 CS samples: 2,368 arXiv papers, 738 YouTube lectures, and 1,218 SBERT-mined cross-modal pairs. The dataset is available on [Kaggle](https://www.kaggle.com/datasets/tusharjaju/hybrid-dataset-summariser-crossmodal).

### Hyperparameters

- LoRA: r=32, alpha=64, dropout=0.1, targets=q,k,v,o,gate,up,down
- Trainable: 83.9M params (1.16% of 7.24B total)
- Optimizer: 8-bit AdamW, effective batch size 24
- Quantization: 4-bit NF4, double quant, bfloat16 compute
- Peak VRAM: 11.3 GB (Phase 3)
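
The trainable-parameter figure can be checked by hand. The sketch below assumes the public Mistral-7B-v0.1 dimensions (hidden size 4096, grouped-query key/value width 1024, MLP width 14336, 32 layers) and counts one `A` and one `B` matrix per targeted module; the helper name `lora_param_count` is illustrative and not part of PEFT.

```python
# Mistral-7B-v0.1 dimensions, taken from the public model config (assumed here).
HIDDEN = 4096          # hidden_size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP intermediate_size
NUM_LAYERS = 32
RANK = 32              # LoRA r from the hyperparameter list

# (in_features, out_features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
num_matrices = len(TARGET_SHAPES) * NUM_LAYERS * 2  # one A and one B per module

print(f"trainable LoRA params: {trainable:,}")
print(f"LoRA matrices: {num_matrices}")
```

Under these assumptions the count rounds to the 83.9M reported above.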

## Training Procedure

**Phase 1** trains on papers only to acquire domain vocabulary. When the phase ends, the Fisher Information (83.9M entries across 448 parameter matrices) and an SVD (top-16 singular directions per module) are computed for use in the later phases.
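
The Fisher entry count matches the number of trainable parameters, which suggests a diagonal estimate (one value per parameter). As a minimal sketch of how such an estimate is typically used by EWC, the toy example below assumes the textbook penalty `(lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2`; the function names and toy numbers are illustrative, and the exact form used in this project may differ.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean squared gradient per parameter."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(params, anchor_params, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


# Toy setup: only parameter 0 received gradient during "Phase 1".
grads = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(grads)

anchor = [0.5, -0.2]             # parameter values saved at the end of Phase 1
moved_important = [0.6, -0.2]    # small drift on the high-Fisher parameter
moved_unimportant = [0.5, 0.8]   # large drift on the zero-Fisher parameter

penalty_important = ewc_penalty(moved_important, anchor, fisher, lam=200)
penalty_unimportant = ewc_penalty(moved_unimportant, anchor, fisher, lam=200)
print(penalty_important, penalty_unimportant)
```

Drift on a parameter with high Fisher value is penalized, while a much larger drift on a parameter with zero Fisher value costs nothing.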

**Phase 2** introduces videos and cross-modal pairs with OPLoRA orthogonal projection preventing subspace contamination, EWC preserving critical paper summarization parameters, and a 10% replay buffer from Phase 1.
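
To illustrate the orthogonal-projection idea, the sketch below removes from a low-rank update any component that lies in the top-k left or right singular subspace of a reference matrix. It is a NumPy toy under stated assumptions: the reference matrix is a random stand-in (the card does not state which matrix is decomposed), the double-sided projection is one common formulation, and the function names are illustrative.

```python
import numpy as np


def orthogonal_projectors(weight, k):
    """Projectors onto the complement of the top-k left/right singular subspaces."""
    u, _, vt = np.linalg.svd(weight, full_matrices=False)
    u_k = u[:, :k]        # (out_features, k)
    v_k = vt[:k, :].T     # (in_features, k)
    p_left = np.eye(weight.shape[0]) - u_k @ u_k.T
    p_right = np.eye(weight.shape[1]) - v_k @ v_k.T
    return p_left, p_right, u_k, v_k


def project_update(delta, p_left, p_right):
    """Strip the protected directions from an update on both sides."""
    return p_left @ delta @ p_right


rng = np.random.default_rng(0)
weight = rng.normal(size=(12, 10))   # stand-in for the reference matrix
lora_b = rng.normal(size=(12, 4))    # toy LoRA factors, rank 4
lora_a = rng.normal(size=(4, 10))
delta = lora_b @ lora_a

p_left, p_right, u_k, v_k = orthogonal_projectors(weight, k=3)
safe_delta = project_update(delta, p_left, p_right)

# Both values should be at floating-point noise level.
print(np.abs(u_k.T @ safe_delta).max())
print(np.abs(safe_delta @ v_k).max())
```

After projection the update has no overlap with the protected singular directions, which is the property that keeps new-modality training from overwriting them.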

**Phase 3** shifts to video-heavy training with contrastive cross-modal alignment and increased elastic regularization.
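
As a rough illustration of temperature-scaled contrastive alignment, the sketch below implements a plain symmetric InfoNCE loss over paired video and paper embeddings. This is a simplified stand-in, not CrossCLR itself, and the function names and toy embeddings are hypothetical. Dividing cosine similarities by `tau = 0.03` scales them by roughly 33, so the sketch uses a max-shifted log-sum-exp as a general numerical-stability practice.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _mean_diag_nll(logits):
    """Mean of -log softmax(row_i)[i], using a max-shifted log-sum-exp."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def info_nce(video_embs, paper_embs, tau=0.03):
    """Symmetric InfoNCE: pair i in one modality should match pair i in the other."""
    v = [_normalize(e) for e in video_embs]
    p = [_normalize(e) for e in paper_embs]
    logits = [[sum(a * b for a, b in zip(vi, pj)) / tau for pj in p] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # paper-to-video direction
    return 0.5 * (_mean_diag_nll(logits) + _mean_diag_nll(logits_t))


videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each paper close to its own video
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each paper close to the wrong video

loss_aligned = info_nce(videos, aligned)
loss_swapped = info_nce(videos, swapped)
print(loss_aligned, loss_swapped)
```

Correctly aligned pairs give a loss near zero, while mismatched pairs are penalized heavily at this temperature.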

## Limitations

- CrossCLR produced NaN values on some pair batches, which partially limited contrastive alignment
- Single model (Mistral-7B) and domain (CS/engineering) — generalization untested
- Video references generated by GPT-4o-mini, not human annotators
- No human evaluation conducted

## Citation
```bibtex
@inproceedings{jaju2025crossmodal,
  title={Cross-Modal Transfer Learning in Domain-Adaptive Video Summarization},
  author={Jaju, Tushar and Saharawat, Tanishka and Bhatia, Shruti and Rastogi, Shivansh},
  booktitle={Proc. IMPACT 2025},
  publisher={Springer},
  year={2025}
}
```

## Authors

Tushar Jaju (training infrastructure, implementation, experiments), Tanishka Saharawat, Shruti Bhatia, Shivansh Rastogi

Guide: Dr. Neha Yadav — ABES Engineering College, Ghaziabad (AKTU)

## Framework Versions

- PEFT: 0.18.1
- Transformers: 5.0.0rc3
- PyTorch: 2.11.0.dev20260214+cu128
- bitsandbytes: 0.49.2
- Python: 3.11