---
license: cc-by-4.0
language:
- en
tags:
- proteomics
- mass-spectrometry
- peptide-sequencing
- calibration
- fdr
- biology
- de-novo-peptide-sequencing
---

# Winnow HeLa Single Shot Probability Calibrator

[**Winnow**](https://github.com/instadeepai/winnow) recalibrates confidence scores and provides FDR control for *de novo* peptide sequencing (DNS) workflows.
This repository hosts a calibrator trained on the HeLa Single Shot dataset, as referenced in our paper: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952).

- Intended inputs: spectrum data and the corresponding MS/MS PSM results produced by InstaNovo.
- Outputs: rescored and calibrated per-PSM probabilities in `calibrated_confidence`, with *de novo* FDR control.

## What’s inside

- `model.safetensors`: trained classifier
- `config.json`: classifier hyperparameter settings

---

## How to use

### Python

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from winnow.calibration.calibrator import ProbabilityCalibrator
from winnow.datasets.data_loaders import InstaNovoDatasetLoader
from winnow.fdr.nonparametric import NonParametricFDRControl
from winnow.scripts.main import filter_dataset

# 1) Download model files into a local directory
helaqc_model = Path("helaqc_model")
snapshot_download(
    repo_id="InstaDeepAI/winnow-helaqc-model",
    repo_type="model",
    local_dir=helaqc_model,
)

# 2) Load calibrator from the downloaded files
calibrator = ProbabilityCalibrator.load(pretrained_model_name_or_path=helaqc_model)

# 3) Load your dataset (InstaNovo-style config)
dataset = InstaNovoDatasetLoader().load(
    data_path="path_to_spectrum_data.parquet",
    predictions_path="path_to_instanovo_predictions.csv",
)
dataset = filter_dataset(dataset)  # standard Winnow filtering

# 4) Predict calibrated confidences
calibrator.predict(dataset)  # adds dataset.metadata["calibrated_confidence"]

# 5) Optional: FDR control on calibrated confidence
fdr = NonParametricFDRControl()
fdr.fit(dataset.metadata["calibrated_confidence"])
cutoff = fdr.get_confidence_cutoff(0.05)  # 5% FDR cutoff
dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff
```

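
Step 5 relies on the calibrated confidences being interpretable as the probability that each PSM is correct. As a rough illustration of that idea, and not a copy of Winnow's actual implementation, the sketch below estimates the FDR of the top-k PSMs as the mean of `1 - confidence` over those PSMs and returns the lowest confidence that still keeps the estimate within the target. The helper name `confidence_cutoff` and the toy scores are hypothetical.

```python
import numpy as np


def confidence_cutoff(confidences: np.ndarray, fdr_threshold: float) -> float:
    """Return the lowest cutoff whose estimated FDR is <= fdr_threshold.

    Hypothetical helper: assumes each confidence is a calibrated probability
    that the PSM is correct, so 1 - confidence is its expected error.
    """
    ranked = np.sort(confidences)[::-1]  # most confident first
    # Estimated FDR when accepting the top-k PSMs: mean expected error so far.
    fdr = np.cumsum(1.0 - ranked) / np.arange(1, len(ranked) + 1)
    accepted = np.nonzero(fdr <= fdr_threshold)[0]
    if accepted.size == 0:
        return float("inf")  # nothing can be accepted at this threshold
    return float(ranked[accepted[-1]])


# Toy calibrated confidences (illustrative values only).
scores = np.array([0.99, 0.98, 0.95, 0.90, 0.60, 0.30])
cutoff = confidence_cutoff(scores, fdr_threshold=0.05)
keep = scores >= cutoff  # boolean mask of PSMs retained at 5% estimated FDR
```

With these toy scores the four most confident PSMs are kept, because adding the fifth would push the estimated FDR above 5%.
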
### CLI

```bash
# After `pip install winnow`
winnow predict \
  data_loader=instanovo \
  dataset.spectrum_path_or_directory=my_data.parquet \
  dataset.predictions_path=my_preds.csv \
  calibrator.pretrained_model_name_or_path=InstaDeepAI/winnow-helaqc-model \
  fdr_control.fdr_threshold=0.05 \
  output_folder=outputs
```

---

## Inputs and outputs

**Required columns for calibration:**

- Spectrum data (parquet/ipc/mgf)
  - `spectrum_id` (string): unique spectrum identifier
  - `experiment_name` (string): MS run identifier
  - `retention_time` (float): retention time (seconds)
  - `precursor_charge` (float): charge of the precursor ion (from MS1)
  - `precursor_mz` (float): mass-to-charge of the precursor ion (from MS1)
  - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum
  - `intensity_array` (list[float]): intensity values of the MS2 spectrum

- Beam predictions (csv)
  - `spectrum_id` (string): identifier matching the spectrum data
  - `predictions` (string): top prediction, untokenised sequence
  - `predictions_tokenised` (string): comma-separated tokens for the top prediction
  - `log_probability` (float): top prediction log probability
  - `token_log_probabilities` (list[float]): per-token log-probabilities for the top prediction
  - `predictions_beam_k` (string): untokenised sequence for beam k (k≥0)
  - `log_probability_beam_k` (float): log probability for beam k
  - `token_log_probabilities_k` (string/list-encoded): per-token log-probabilities for beam k

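
Before running the calibrator it can help to confirm that both inputs carry the columns listed above. The sketch below does this with pandas on two tiny in-memory frames. The `missing_columns` helper and the example rows are hypothetical, and the per-beam columns are left out of the check because their number depends on the beam width.

```python
import pandas as pd

SPECTRUM_COLUMNS = [
    "spectrum_id", "experiment_name", "retention_time", "precursor_charge",
    "precursor_mz", "mz_array", "intensity_array",
]
PREDICTION_COLUMNS = [
    "spectrum_id", "predictions", "predictions_tokenised",
    "log_probability", "token_log_probabilities",
]


def missing_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]


# Toy stand-ins for pd.read_parquet(...) and pd.read_csv(...).
spectra = pd.DataFrame({
    "spectrum_id": ["run1:0"],
    "experiment_name": ["run1"],
    "retention_time": [1234.5],
    "precursor_charge": [2.0],
    "precursor_mz": [523.28],
    "mz_array": [[175.12, 304.16]],
    "intensity_array": [[1.0, 0.4]],
})
predictions = pd.DataFrame({
    "spectrum_id": ["run1:0"],
    "predictions": ["PEPTIDEK"],
    "log_probability": [-0.12],
})

print(missing_columns(spectra, SPECTRUM_COLUMNS))  # []
print(missing_columns(predictions, PREDICTION_COLUMNS))
# ['predictions_tokenised', 'token_log_probabilities']
```
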
**Outputs:**

- `metadata.csv`: spectrum metadata and computed features. Contains everything *except* the prediction and FDR columns, i.e.:
  - `spectrum_id`, `experiment_name`, `precursor_mz`, `precursor_charge`, `retention_time`, etc. (all pass-through spectrum columns)
  - All computed feature columns, including intermediate results (`mass_error_da`, `irt_error`, `ion_matches`, `margin`, etc.)
- `preds_and_fdr_metrics.csv`: predictions and FDR results. Always contains:
  - `spectrum_id`
  - `prediction`
  - `calibrated_confidence`: calibrated probability
  - `psm_fdr`
  - `psm_q_value`
  - Optional: `psm_pep`

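
Because both output files share `spectrum_id`, they can be joined for downstream analysis. The following is a minimal pandas sketch that uses toy frames in place of reading `metadata.csv` and `preds_and_fdr_metrics.csv` from the output folder. The values and the 5% q-value threshold are illustrative assumptions only.

```python
import pandas as pd

# Toy stand-in for metadata.csv (pass-through and feature columns).
metadata = pd.DataFrame({
    "spectrum_id": ["s1", "s2", "s3"],
    "precursor_charge": [2, 3, 2],
})
# Toy stand-in for preds_and_fdr_metrics.csv.
results = pd.DataFrame({
    "spectrum_id": ["s1", "s2", "s3"],
    "prediction": ["PEPTIDEK", "ELVISLIVESK", "SAMPLER"],
    "calibrated_confidence": [0.99, 0.97, 0.40],
    "psm_q_value": [0.01, 0.02, 0.21],
})

# One row per spectrum: predictions and FDR metrics next to their metadata.
merged = results.merge(metadata, on="spectrum_id", how="left", validate="one_to_one")

# Keep PSMs whose q-value is within the chosen threshold.
accepted = merged[merged["psm_q_value"] <= 0.05]
print(accepted["spectrum_id"].tolist())  # ['s1', 's2']
```
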
---

## Training data

- The model was trained on the HeLa single-shot dataset (PXD044934).
- This model uses fragment match features, iRT features, beam features, token score features and the mass error feature.
- Predictions were obtained using InstaNovo v1.2.0 with beam search set to 5 beams.

---

## Citation

If you use `winnow` in your research, please cite our preprint: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952)

```bibtex
@article{mabona2025novopeptidesequencingrescoring,
  title = {De novo peptide sequencing rescoring and FDR estimation with Winnow},
  author = {Amandla Mabona and Jemma Daniel and Henrik Servais Janssen Knudsen and
    Rachel Catzel and Kevin Michael Eloff and Erwin M. Schoof and Nicolas
    Lopez Carranza and Timothy P. Jenkins and Jeroen Van Goey and
    Konstantinos Kalogeropoulos},
  year = {2025},
  eprint = {2509.24952},
  archivePrefix = {arXiv},
  primaryClass = {q-bio.QM},
  url = {https://arxiv.org/abs/2509.24952},
}
```

If you use this calibrator, please cite:

```bibtex
@misc{instadeep_ltd_2025,
  author = { InstaDeep Ltd },
  title = { winnow-helaqc-model (Revision aa4465fde73f384468b50aaa40fc5bb445216763) },
  year = 2026,
  url = { https://huggingface.co/InstaDeepAI/winnow-helaqc-model },
  doi = { 10.57967/hf/6611 },
  publisher = { Hugging Face }
}
```

If you use the `InstaNovo` model to generate predictions, please also cite: [InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments](https://doi.org/10.1038/s42256-025-01019-5)

```bibtex
@article{eloff_kalogeropoulos_2025_instanovo,
  title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
    proteomics experiments},
  author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
    Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
    Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
    and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
    Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
    Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
  year = 2025,
  month = {Mar},
  day = 31,
  journal = {Nature Machine Intelligence},
  doi = {10.1038/s42256-025-01019-5},
  issn = {2522-5839},
  url = {https://doi.org/10.1038/s42256-025-01019-5}
}
```

## Contact

For issues with this pretrained model or usage in Winnow, please open an issue on the Winnow GitHub: https://github.com/instadeepai/winnow