| pipeline_tag: zero-shot-classification | |
| license: apache-2.0 | |
| language: | |
| - en | |
| tags: | |
| - zero-shot | |
| - text-classification | |
| - science | |
| - mag | |
| widget: | |
| - text: Leo Messi is the best player ever | |
| candidate_labels: politics, science, sports, environment | |
| multi_class: true | |
| # SCIroShot | |
| ## Overview | |
| <details> | |
| <summary>Click to expand</summary> | |
| - **Model type:** Language Model | |
| - **Architecture:** RoBERTa-large | |
| - **Language:** English | |
| - **License:** Apache 2.0 | |
| - **Task:** Zero-Shot Text Classification | |
| - **Data:** Microsoft Academic Graph | |
| - **Additional Resources:** | |
| - [Paper](https://aclanthology.org/2023.eacl-main.22/) | |
| - [GitHub](https://github.com/bsc-langtech/sciroshot) | |
| </details> | |
| ## Model description | |
| SCIroShot is an entailment-based Zero-Shot Text Classification (ZSTC) model | |
| fine-tuned on a custom-built dataset of scientific articles | |
| from [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/) | |
| (MAG). The resulting model achieves state-of-the-art (SOTA) | |
| performance in the scientific domain and very competitive results in other areas. | |
| ## Intended Usage | |
| This model is intended to be used for zero-shot text classification in English. | |
| ## How to use | |
| ```python | |
| from transformers import pipeline | |
| zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot") | |
| sentence = "Leo Messi is the best player ever." | |
| candidate_labels = ["politics", "science", "sports", "environment"] | |
| template = "This example is {}" | |
| output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False) | |
| print(output) | |
| print(f'Predicted class: {output["labels"][0]}') | |
| ``` | |
| ## Limitations and bias | |
| No measures have been taken to estimate the bias and toxicity embedded in the model. | |
| Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias). | |
| ## Training | |
| ### Training data | |
| Our data builds on top of scientific-domain | |
| annotated data from Microsoft Academic Graph (MAG). | |
| This database consists of a heterogeneous | |
| graph with billions of records from both scientific | |
| publications and patents, in addition to metadata | |
| information such as the authors, institutions, journals, | |
| conferences and their citation relationships. | |
| The documents are organized in a six-level hierarchical | |
| structure of scientific concepts, where the two | |
| top-most levels are manually curated in order to | |
| guarantee a high level of accuracy. | |
| To create the training corpus, a random sample of | |
| scientific articles published between | |
| 2000 and 2021 was retrieved from MAG, together with their | |
| titles and abstracts in English. This results in over 2M documents, | |
| each with its corresponding Field of Study, obtained from | |
| level 1 of the MAG taxonomy (292 possible classes, such as "Computational biology" | |
| or "Transport Engineering"). | |
| The fine-tuning dataset was constructed in a weakly supervised | |
| manner by converting text classification data to the entailment format. | |
| Using the relationship between scientific texts | |
| and their matching concepts in level 1 of the MAG | |
| taxonomy, we generate the premise-hypothesis pairs | |
| corresponding to the entailment label. | |
| Conversely, we generate the pairs for the | |
| neutral label by discarding the actual relationship | |
| between the texts and their scientific concepts and | |
| pairing each text with concepts to which | |
| it is not matched. | |
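The conversion described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the hypothesis template is borrowed from the usage example in this card, the helper name `build_entailment_pairs` is hypothetical, one negative per document is an assumed ratio, and "Organic chemistry" is a made-up third label added only so that negative sampling has more than one option.

```python
import random

# Assumed template, borrowed from the usage example in this card.
HYPOTHESIS_TEMPLATE = "This example is {}"


def build_entailment_pairs(documents, all_labels, rng):
    """Turn (text, label) records into premise-hypothesis pairs.

    Each record yields one "entailment" pair built from its true label and
    one "neutral" pair built from a label sampled among the other classes.
    """
    pairs = []
    for text, true_label in documents:
        pairs.append({
            "premise": text,
            "hypothesis": HYPOTHESIS_TEMPLATE.format(true_label),
            "label": "entailment",
        })
        # Pair the same text with a concept it is NOT matched to.
        other_labels = [label for label in all_labels if label != true_label]
        wrong_label = rng.choice(other_labels)
        pairs.append({
            "premise": text,
            "hypothesis": HYPOTHESIS_TEMPLATE.format(wrong_label),
            "label": "neutral",
        })
    return pairs


# Toy data; "Organic chemistry" is a hypothetical label for illustration.
labels = ["Computational biology", "Transport Engineering", "Organic chemistry"]
docs = [
    ("We align protein sequences with a new dynamic programming method.",
     "Computational biology"),
    ("We model traffic flow at urban intersections.",
     "Transport Engineering"),
]
pairs = build_entailment_pairs(docs, labels, random.Random(0))
for pair in pairs:
    print(pair["label"], "|", pair["hypothesis"])
```

Passing an explicit `random.Random` instance keeps the negative sampling reproducible; the actual sampling strategy and template are described in the paper.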
| ### Training procedure | |
| The newly created scientific dataset described in the previous section | |
| was used to fine-tune a 355M-parameter RoBERTa model on the entailment task. | |
| At inference time, the model computes the entailment score between every text that | |
| is fed to it and all candidate labels. The final prediction is the | |
| highest-scoring class in a single-label classification setup, or the N classes | |
| above a certain threshold in a multi-label scenario. | |
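The decision rule in the paragraph above can be sketched in a few lines. The function name `predict_labels`, the default threshold of 0.5, and the example scores are all assumptions made for illustration; they are not values taken from the paper.

```python
def predict_labels(scores, multi_label=False, threshold=0.5):
    """Select labels from a mapping of candidate label -> entailment score.

    Single-label: return the highest-scoring class.
    Multi-label: return every class scoring above `threshold` (assumed value),
    ordered from highest to lowest score.
    """
    if not scores:
        raise ValueError("scores must contain at least one candidate label")
    if multi_label:
        selected = [label for label, score in scores.items() if score > threshold]
        return sorted(selected, key=scores.get, reverse=True)
    return [max(scores, key=scores.get)]


# Made-up entailment scores for one input text.
example_scores = {"politics": 0.02, "science": 0.10, "sports": 0.95, "environment": 0.60}
print(predict_labels(example_scores))
print(predict_labels(example_scores, multi_label=True))
```

In the multi-label case the number of returned classes depends on the input, so the result can be empty when no score clears the threshold.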
| A subset of 52 labels from the training data was held out so that it | |
| could be used as a development set of fully unseen classes. | |
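One way to hold out a set of labels so that the development classes never appear in training is sketched below. How the 52 labels were actually chosen is not stated here, so the random selection, the seed, and the helper name `split_by_label` are assumptions.

```python
import random


def split_by_label(records, labels, n_held_out, seed=0):
    """Split (text, label) records so that held-out labels are unseen in training.

    Labels are picked at random here, which is an assumption; the selection
    criterion used for the real development set may differ.
    """
    rng = random.Random(seed)
    held_out = set(rng.sample(sorted(labels), n_held_out))
    train = [record for record in records if record[1] not in held_out]
    dev = [record for record in records if record[1] in held_out]
    return train, dev, held_out


# Toy example: 5 labels, 2 of them held out (the card reports 52 held out).
toy_labels = ["A", "B", "C", "D", "E"]
toy_records = [(f"document {i}", toy_labels[i % 5]) for i in range(20)]
train, dev, held_out = split_by_label(toy_records, toy_labels, n_held_out=2)
print(sorted(held_out), len(train), len(dev))
```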
| As a novelty, validation was not performed on the entailment task (which is used as a proxy) | |
| but directly on the target text classification task. This allows us to stop training at the right | |
| time via early stopping, which prevents the model from "overfitting" to the training task. This method | |
| counteracts an effect discovered empirically during experimentation: | |
| after a certain point, the model can start to worsen on the target task (ZSTC) while still | |
| improving on the training task (recognizing textual entailment, RTE). Simply shortening the training time led to a boost in performance. | |
| Read the paper for more details on the methodology and the analysis of RTE/ZSTC correlation. | |
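The idea of validating on the target task rather than on the proxy task can be sketched as a generic early-stopping loop. Everything below is illustrative: the function names, the patience value, and the simulated scores are assumptions, and a real run would also save a checkpoint whenever the target-task score improves.

```python
def train_with_early_stopping(train_one_epoch, evaluate_target_task,
                              max_epochs=10, patience=2):
    """Stop training when the target-task (ZSTC) score stops improving.

    `train_one_epoch(epoch)` trains on the proxy entailment task;
    `evaluate_target_task(epoch)` returns the dev-set score on the
    classification task with unseen labels.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = evaluate_target_task(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0  # a real run would checkpoint here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_score


# Simulated target-task scores that peak and then degrade (made-up numbers).
simulated_zstc_scores = [0.40, 0.52, 0.58, 0.55, 0.53, 0.50]
epochs_run = []
best_epoch, best_score = train_with_early_stopping(
    train_one_epoch=epochs_run.append,
    evaluate_target_task=lambda epoch: simulated_zstc_scores[epoch],
    max_epochs=len(simulated_zstc_scores),
)
print(best_epoch, best_score, len(epochs_run))
```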
| ## Evaluation | |
| ### Evaluation data | |
| The model's performance was evaluated on a collection of disciplinary-labeled textual datasets, both from the scientific domain (closer to training data) and the general domain (to assess generalizability). | |
| The following table provides an overview of the number of examples and labels for each dataset: | |
| | Dataset | Labels | Size | | |
| |------------------|--------|--------| | |
| | arXiv | 11 | 3,838 | | |
| | SciDocs-MeSH | 11 | 16,433 | | |
| | SciDocs-MAG | 19 | 17,501 | | |
| | Konstanz | 24 | 10,000 | | |
| | Elsevier | 26 | 14,738 | | |
| | PubMed | 109 | 5,000 | | |
| | Topic Categorization (Yahoo! Answers) | 10 | 60,000 | | |
| | Emotion Detection (UnifyEmotion) | 10 | 15,689 | | |
| | Situation Frame Detection (Situation Typing) | 12 | 3,311 | | |
| Please refer to the paper for further details on each particular dataset. | |
| ### Evaluation results | |
| These are the official results reported in the paper: | |
| #### Scientific domain benchmark | |
| | Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed | | |
| |-------|-------|--------------|-------------|----------|----------|--------| | |
| | [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18**🔥 | 51.77 | 54.62 | 28.41 | **31.59**🔥 | | |
| | SCIroShot | **42.22**🔥 | 59.34 | **69.86**🔥 | **66.07**🔥 | **54.42**🔥 | 27.93 | | |
| #### General domain benchmark | |
| | Model | Topic | Emotion | Situation | | |
| |-------|-------|---------|-----------| | |
| | RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2**🔥 | | |
| | FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 | | |
| | MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 | | |
| | NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 | | |
| | NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 | | |
| | SCIroShot | **59.08**🔥 | **24.94**🔥 | 27.42 | | |
| All the numbers reported above represent **label-wise weighted F1** except for the Topic classification dataset, which is evaluated in terms of **accuracy** following the notation from [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf). | |
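For readers unfamiliar with the two metrics, here is a small dependency-free sketch. It assumes that "label-wise weighted F1" means the per-label F1 averaged with weights proportional to each label's support in the gold annotations (the same convention as a "weighted" average in common toolkits); the toy labels are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-label F1, averaged with weights equal to each label's gold support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += f1 * count
    return total / len(y_true)


# Toy gold labels and predictions.
gold = ["science", "science", "sports", "sports"]
pred = ["science", "sports", "sports", "sports"]
print(f"accuracy={accuracy(gold, pred):.4f}  weighted_f1={weighted_f1(gold, pred):.4f}")
```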
| ## Additional information | |
| ### Authors | |
| - SIRIS Lab, Research Division of SIRIS Academic. | |
| - Language Technologies Unit, Barcelona Supercomputing Center. | |
| ### Contact | |
| For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>. | |
| ### License | |
| This work is distributed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). | |
| ### Funding | |
| This work was partially funded by two projects under the EU’s H2020 Research and Innovation Programme: | |
| - INODE (grant agreement No 863410). | |
| - IntelComp (grant agreement No 101004870). | |
| ### Citation | |
| ```bibtex | |
| @inproceedings{pamies2023weakly, | |
| title={A weakly supervised textual entailment approach to zero-shot text classification}, | |
| author={P{\`a}mies, Marc and Llop, Joan and Multari, Francesco and Duran-Silva, Nicolau and Parra-Rojas, C{\'e}sar and Gonz{\'a}lez-Agirre, Aitor and Massucci, Francesco Alessandro and Villegas, Marta}, | |
| booktitle={Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics}, | |
| pages={286--296}, | |
| year={2023} | |
| } | |
| ``` | |
| ### Disclaimer | |
| <details> | |
| <summary>Click to expand</summary> | |
| The model published in this repository is intended for a generalist purpose | |
| and is made available to third parties under an Apache v2.0 License. | |
| Please keep in mind that the model may have bias and/or any other undesirable distortions. | |
| When third parties deploy or provide systems and/or services to other parties using this model | |
| (or a system based on it) or become users of the model itself, they should note that it is under | |
| their responsibility to mitigate the risks arising from its use and, in any event, to comply with | |
| applicable regulations, including regulations regarding the use of Artificial Intelligence. | |
| In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties. | |
| </details> | |