	---
	license: cc-by-4.0
	language:
	- en
	tags:
	- proteomics
	- mass-spectrometry
	- peptide-sequencing
	- calibration
	- fdr
	- biology
	- de-novo-peptide-sequencing
	---

	# Winnow HeLa Single Shot Probability Calibrator

	[Winnow](https://github.com/instadeepai/winnow) recalibrates confidence scores and provides FDR control for de novo peptide sequencing (DNS) workflows.
	This repository hosts a calibrator trained on the HeLa Single Shot dataset as referenced in our paper: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952).

	- Intended inputs: spectrum data and the corresponding MS/MS PSM results produced by InstaNovo.
	- Outputs: rescored and calibrated per-PSM probabilities in `calibrated_confidence`, with de novo FDR control.

	## What’s inside

	- `model.safetensors`: trained classifier
	- `config.json`: classifier hyperparameter settings

	---

	## How to use

	### Python

	```python
	from pathlib import Path
	from huggingface_hub import snapshot_download
	from winnow.calibration.calibrator import ProbabilityCalibrator
	from winnow.datasets.data_loaders import InstaNovoDatasetLoader
	from winnow.scripts.main import filter_dataset
	from winnow.fdr.nonparametric import NonParametricFDRControl

	# 1) Download model files into a local directory
	helaqc_model = Path("helaqc_model")
	snapshot_download(
	    repo_id="InstaDeepAI/winnow-helaqc-model",
	    repo_type="model",
	    local_dir=helaqc_model,
	)

	# 2) Load calibrator
	calibrator = ProbabilityCalibrator.load(pretrained_model_name_or_path=helaqc_model)

	# 3) Load your dataset (InstaNovo-style config)
	dataset = InstaNovoDatasetLoader().load(
	    data_path="path_to_spectrum_data.parquet",
	    predictions_path="path_to_instanovo_predictions.csv",
	)
	dataset = filter_dataset(dataset) # standard Winnow filtering

	# 4) Predict calibrated confidences
	calibrator.predict(dataset) # adds dataset.metadata["calibrated_confidence"]

	# 5) Optional: FDR control on calibrated confidence
	fdr = NonParametricFDRControl()
	fdr.fit(dataset.metadata["calibrated_confidence"])
	cutoff = fdr.get_confidence_cutoff(0.05) # 5% FDR cutoff
	dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff
	```
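
To build intuition for step 5, the sketch below estimates FDR directly from calibrated probabilities. It assumes that each `calibrated_confidence` is the probability that a PSM is correct, so the expected error rate among accepted PSMs is the mean of `1 - confidence`. This is an illustrative assumption rather than Winnow's actual implementation of `NonParametricFDRControl`; the helper names and the example scores are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def estimate_fdr_curve(confidences: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (confidence, estimated FDR) pairs, sorted by descending confidence.

    Assumes each confidence is a calibrated probability of correctness, so
    (1 - confidence) is the expected error contributed by that PSM.
    """
    ranked = sorted(confidences, reverse=True)
    curve = []
    expected_errors = 0.0
    for n_accepted, conf in enumerate(ranked, start=1):
        expected_errors += 1.0 - conf
        curve.append((conf, expected_errors / n_accepted))
    return curve


def confidence_cutoff(
    confidences: Sequence[float], fdr_threshold: float
) -> Optional[float]:
    """Lowest confidence whose accepted set has estimated FDR <= fdr_threshold."""
    curve = estimate_fdr_curve(confidences)
    cutoff = None
    for i, (conf, fdr) in enumerate(curve):
        # Only evaluate at the end of a run of tied scores, because
        # filtering with `>= cutoff` accepts every tied PSM at once.
        is_last_of_tie = i + 1 == len(curve) or curve[i + 1][0] < conf
        if is_last_of_tie and fdr <= fdr_threshold:
            cutoff = conf
    return cutoff


# Hypothetical calibrated confidences, for illustration only.
scores = [0.99, 0.98, 0.95, 0.90, 0.60, 0.30]
cutoff = confidence_cutoff(scores, fdr_threshold=0.05)
kept = [s for s in scores if cutoff is not None and s >= cutoff]
print(cutoff, kept)
```

If no score satisfies the threshold, the sketch returns `None`, so no PSM is kept.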

	### CLI

	```bash
	# After `pip install winnow`
	winnow predict \
	  data_loader=instanovo \
	  dataset.spectrum_path_or_directory=my_data.parquet \
	  dataset.predictions_path=my_preds.csv \
	  calibrator.pretrained_model_name_or_path=InstaDeepAI/winnow-helaqc-model \
	  fdr_control.fdr_threshold=0.05 \
	  output_folder=outputs
	```

	---

	## Inputs and outputs

	Required columns for calibration:

	- Spectrum data (parquet/ipc/mgf)
	  - `spectrum_id` (string): unique spectrum identifier
	  - `experiment_name` (string): MS run identifier
	  - `retention_time` (float): retention time (seconds)
	  - `precursor_charge` (float): charge of the precursor ion (from MS1)
	  - `precursor_mz` (float): mass-to-charge of the precursor ion (from MS1)
	  - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum
	  - `intensity_array` (list[float]): intensity values of the MS2 spectrum

	- Beam predictions (csv)
	  - `spectrum_id` (string)
	  - `predictions` (string): top prediction, untokenised sequence
	  - `predictions_tokenised` (string): comma-separated tokens for the top prediction
	  - `log_probability` (float): top prediction log probability
	  - `token_log_probabilities` (list[float]): per-token log-probabilities for the top prediction
	  - `predictions_beam_k` (string): untokenised sequence for beam k (k≥0)
	  - `log_probability_beam_k` (float)
	  - `token_log_probabilities_k` (string/list-encoded)
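
Before running calibration, it can help to check that both input tables carry the columns listed above. The following is a minimal sketch that works on plain lists of column names (for example `list(df.columns)` from a DataFrame). The helper names `missing_columns` and `beam_indices` are hypothetical and not part of Winnow, and the example column list is made up.

```python
import re
from typing import Iterable, List

# Column names copied from the lists above.
REQUIRED_SPECTRUM_COLUMNS = [
    "spectrum_id",
    "experiment_name",
    "retention_time",
    "precursor_charge",
    "precursor_mz",
    "mz_array",
    "intensity_array",
]
REQUIRED_PREDICTION_COLUMNS = [
    "spectrum_id",
    "predictions",
    "predictions_tokenised",
    "log_probability",
    "token_log_probabilities",
]


def missing_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in required if name not in present]


def beam_indices(columns: Iterable[str]) -> List[int]:
    """Return the sorted beam indices k found in `predictions_beam_k` columns."""
    pattern = re.compile(r"^predictions_beam_(\d+)$")
    matches = (pattern.match(name) for name in columns)
    return sorted(int(m.group(1)) for m in matches if m)


# Hypothetical header of a predictions CSV, for illustration only.
example_columns = [
    "spectrum_id",
    "predictions",
    "log_probability",
    "predictions_beam_0",
    "predictions_beam_1",
]
print(missing_columns(example_columns, REQUIRED_PREDICTION_COLUMNS))
print(beam_indices(example_columns))
```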

	Outputs:

	- `metadata.csv`: spectrum metadata and computed features. Contains everything except the prediction and FDR columns, i.e.:
	  - `spectrum_id`, `experiment_name`, `precursor_mz`, `precursor_charge`, `retention_time`, etc. (all pass-through spectrum columns)
	  - All computed feature columns, including intermediate results (`mass_error_da`, `irt_error`, `ion_matches`, `margin`, etc.)
	- `preds_and_fdr_metrics.csv`: predictions and FDR results. Always contains:
	  - `spectrum_id`
	  - `prediction`
	  - `calibrated_confidence`: calibrated probability
	  - `psm_fdr`
	  - `psm_q_value`
	  - Optional: `psm_pep`
	---

	## Training data

	- The model was trained on the HeLa single-shot dataset (PXD044934).
	- This model uses fragment match features, iRT features, beam features, token score features and the mass error feature.
	- Predictions were obtained using InstaNovo v1.2.0 with beam search set to 5 beams.

	---

	## Citation

	If you use `winnow` in your research, please cite our preprint: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952)

	```bibtex
	@article{mabona2025novopeptidesequencingrescoring,
	title = {De novo peptide sequencing rescoring and FDR estimation with Winnow},
	author = {Amandla Mabona and Jemma Daniel and Henrik Servais Janssen Knudsen and
	Rachel Catzel and Kevin Michael Eloff and Erwin M. Schoof and Nicolas
	Lopez Carranza and Timothy P. Jenkins and Jeroen Van Goey and
	Konstantinos Kalogeropoulos},
	year = {2025},
	eprint = {2509.24952},
	archivePrefix = {arXiv},
	primaryClass = {q-bio.QM},
	url = {https://arxiv.org/abs/2509.24952},
	}
	```

	If you use this calibrator, please cite:

	```bibtex
	@misc{instadeep_ltd_2025,
	author = { InstaDeep Ltd },
	title = { winnow-helaqc-model (Revision aa4465fde73f384468b50aaa40fc5bb445216763) },
	year = 2026,
	url = { https://huggingface.co/InstaDeepAI/winnow-helaqc-model },
	doi = { 10.57967/hf/6611 },
	publisher = { Hugging Face }
	}
	```

	If you use the `InstaNovo` model to generate predictions, please also cite: [InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments](https://doi.org/10.1038/s42256-025-01019-5)

	```bibtex
	@article{eloff_kalogeropoulos_2025_instanovo,
	title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
	proteomics experiments},
	author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
	Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
	Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
	and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
	Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
	Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
	year = 2025,
	month = {Mar},
	day = 31,
	journal = {Nature Machine Intelligence},
	doi = {10.1038/s42256-025-01019-5},
	issn = {2522-5839},
	url = {https://doi.org/10.1038/s42256-025-01019-5}
	}
	```

	## Contact

	For issues with this pretrained model or usage in Winnow, please open an issue on the Winnow GitHub: https://github.com/instadeepai/winnow