How to use chelleboyer/llm-mm-good-eb8e3f60-56f2-4729-8934-2428ca568d27 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("chelleboyer/llm-mm-good-eb8e3f60-56f2-4729-8934-2428ca568d27")
sentences = [
"What are the key contributions of Shen and Wan (2023) in the field of reference-free evaluation?",
"may be constrained by the quality and variety of the reference data.",
"Springer.\n\n\n\n\n\n\nTyen et al. (2023)\n\nGladys Tyen, Hassan Mansoor, Peter Chen, Tony Mak, and Victor Cărbune. 2023.\n\n\nLLMs cannot find reasoning errors, but can correct them!\n\n\narXiv preprint arXiv:2311.08516 (2023).\n\n\n\n\n\n\nValmeekam et al. (2023)\n\nKarthik Valmeekam, Matthew Marquez, and Subbarao Kambhampati. 2023.\n\n\nCan large language models really improve by self-critiquing their own plans?\n\n\narXiv preprint arXiv:2310.08118 (2023).\n\n\n\n\n\n\nVerga et al. (2024)\n\nPat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024.",
"Reference-Free Evaluation (Shen and Wan, 2023; Zheng et al., 2023a; He et al., 2023b):"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
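The architecture above uses CLS-token pooling followed by a Normalize() layer. The following is a minimal sketch of those two steps on toy data, assuming token embeddings arrive as plain lists of floats; the helper names (`cls_pool`, `l2_normalize`, `dot`) and the 4-dimensional vectors are illustrative and not part of the library. It also shows why, after normalization, a dot product equals the cosine similarity.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit length, as a Normalize() step would."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy "token embeddings": a few tokens with 4 dimensions each (the real model uses 1024).
tokens_a = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0], [1.0, 0.0, 0.0, 0.0]]
tokens_b = [[4.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# Both embeddings have unit length, so their dot product is their cosine similarity.
print(emb_a, emb_b)
print(dot(emb_a, emb_b))
```

Only the first token contributes to the pooled vector here; the other tokens influence it in the real model through self-attention, which this sketch does not simulate.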
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("chelleboyer/llm-mm-good-eb8e3f60-56f2-4729-8934-2428ca568d27")
# Run inference
sentences = [
'How do Dong et al. (2022) contribute to the understanding of in-context learning in their survey?',
'Dong et\xa0al. (2024a)\n\nQingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. 2024a.\n\n\nSelf-Boosting Large Language Models with Synthetic Preference Data.\n\n\narXiv preprint arXiv:2410.06961 (2024).\n\n\n\n\n\n\nDong et\xa0al. (2022)\n\nQingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et\xa0al. 2022.\n\n\nA survey on in-context learning.\n\n\narXiv preprint arXiv:2301.00234 (2022).\n\n\n\n\n\n\nDong et\xa0al. (2024b)\n\nYijiang\xa0River Dong, Tiancheng Hu, and Nigel Collier. 2024b.\n\n\nCan LLM be a Personalized Judge?\n\n\narXiv preprint arXiv:2406.11657 (2024).\n\n\n\n\n\n\nDorner et\xa0al. (2024)\n\nFlorian\xa0E. Dorner, Vivian\xa0Y. Nastl, and Moritz Hardt. 2024.',
'Additionally, the LLMAAA\xa0(Zhang et\xa0al., 2023a) framework incorporates an active learning strategy to efficiently select high-information samples for annotation, thereby mitigating the effects of noisy labels and reducing the reliance on costly human annotation. These approach not only enhance the performance of task-specific models but also offer new perspectives on the efficient application of LLMs in annotation workflows.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
InformationRetrievalEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.92 |
| cosine_accuracy@3 | 0.99 |
| cosine_accuracy@5 | 1.0 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.92 |
| cosine_precision@3 | 0.33 |
| cosine_precision@5 | 0.2 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.92 |
| cosine_recall@3 | 0.99 |
| cosine_recall@5 | 1.0 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9667 |
| cosine_mrr@10 | 0.9553 |
| cosine_map@100 | 0.9553 |
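To make the relationship between these metrics concrete, here is a small sketch that computes them from the rank at which each query's relevant passage was retrieved. It assumes exactly one relevant passage per query, and the rank distribution below (100 queries) is hypothetical: it was chosen because it is consistent with the reported values, not taken from the actual evaluation set. The function name `ir_metrics` is illustrative.

```python
import math


def ir_metrics(ranks, k):
    """Metrics at cutoff k, given the 1-based rank of the single relevant
    passage for each query (None if it was not retrieved at all)."""
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    hits = len(hit_ranks)
    return {
        "accuracy": hits / n,          # fraction of queries with a hit in the top k
        "precision": hits / (n * k),   # at most 1 of the k retrieved items is relevant
        "recall": hits / n,            # equals accuracy when there is one relevant item
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
        # With one relevant item the ideal DCG is 1, so NDCG is just the mean DCG.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in hit_ranks) / n,
    }


# Hypothetical ranks: 92 queries hit at rank 1, 6 at rank 2, 1 at rank 3, 1 at rank 5.
ranks = [1] * 92 + [2] * 6 + [3] + [5]

for k in (1, 3, 5, 10):
    m = ir_metrics(ranks, k)
    print(k, round(m["accuracy"], 2), round(m["precision"], 2), round(m["recall"], 2))

m10 = ir_metrics(ranks, 10)
print(round(m10["mrr"], 4), round(m10["ndcg"], 4))
```

Under this single-relevant-passage assumption, precision@k cannot exceed 1/k, which is why precision@3 is about 0.33 even though accuracy@3 is 0.99.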
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |

| sentence_0 | sentence_1 |
|---|---|
| What are the key components of the evaluation function ( E ) as described in the preliminaries section? | LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods |
| How do LLMs contribute to model enhancement according to the functionalities outlined in the survey? | LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods |
| What are the different approaches discussed under the Single-LLM System methodology? | 4 Methodology |

MatryoshkaLoss with these parameters:
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
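The configuration above wraps MultipleNegativesRankingLoss in MatryoshkaLoss. As a rough sketch of the idea (not the library's implementation), the base loss is evaluated on embeddings truncated to each listed dimension and the results are combined with the given weights. The pure-Python version below assumes a similarity scale of 20 and uses toy 4-dimensional vectors with dimensions [4, 2] in place of the real [768, 512, 256, 128, 64]; all function names are illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss: for anchor i, positives[i] is the correct
    match and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over embeddings truncated to each dimension."""
    return sum(
        w * mnr_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )


# Toy batch of two (question, passage) embedding pairs; values are made up.
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
positives = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]
dims, weights = [4, 2], [1, 1]

print(matryoshka_loss(anchors, positives, dims, weights))        # correctly paired batch
print(matryoshka_loss(anchors, positives[::-1], dims, weights))  # deliberately mismatched
```

Because every truncated prefix is trained to rank the correct passage first, embeddings from a model trained this way can usually be shortened to one of the listed dimensions (and re-normalized) with only a modest loss in retrieval quality.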
Non-default hyperparameters:

- eval_strategy: steps
- per_device_train_batch_size: 50
- per_device_eval_batch_size: 50
- num_train_epochs: 10
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 50
- per_device_eval_batch_size: 50
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin

| Epoch | Step | cosine_ndcg@10 |
|---|---|---|
| 1.0 | 27 | 0.9647 |
| 1.8519 | 50 | 0.9685 |
| 2.0 | 54 | 0.9717 |
| 3.0 | 81 | 0.9717 |
| 3.7037 | 100 | 0.9778 |
| 4.0 | 108 | 0.9754 |
| 5.0 | 135 | 0.9699 |
| 5.5556 | 150 | 0.9699 |
| 6.0 | 162 | 0.9664 |
| 7.0 | 189 | 0.9630 |
| 7.4074 | 200 | 0.9667 |
| 8.0 | 216 | 0.9667 |
| 9.0 | 243 | 0.9667 |
| 9.2593 | 250 | 0.9667 |
| 10.0 | 270 | 0.9667 |
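The log above can be read with a little arithmetic. The sketch below copies the (step, cosine_ndcg@10) pairs from the table, finds the best-scoring step, and derives what the step counts imply about the training set size. The size range is an inference that assumes a single training device, together with the listed batch size of 50, gradient_accumulation_steps of 1, and dataloader_drop_last set to False; the variable names are illustrative.

```python
# (step, cosine_ndcg@10) pairs copied from the training log table.
log = [
    (27, 0.9647), (50, 0.9685), (54, 0.9717), (81, 0.9717), (100, 0.9778),
    (108, 0.9754), (135, 0.9699), (150, 0.9699), (162, 0.9664), (189, 0.9630),
    (200, 0.9667), (216, 0.9667), (243, 0.9667), (250, 0.9667), (270, 0.9667),
]

steps_per_epoch = 27  # the log reaches epoch 1.0 at step 27
batch_size = 50       # per_device_train_batch_size

# Best evaluation score and the fractional epoch at which it occurred.
best_step, best_ndcg = max(log, key=lambda row: row[1])
best_epoch = round(best_step / steps_per_epoch, 4)

# Score at the last logged step.
final_step, final_ndcg = log[-1]

# With the last partial batch kept, 27 steps per epoch means
# ceil(n / 50) == 27, so the number of training pairs n lies in this range.
min_samples = (steps_per_epoch - 1) * batch_size + 1
max_samples = steps_per_epoch * batch_size

print(best_step, best_epoch, best_ndcg)
print(final_step, final_ndcg, round(final_ndcg - best_ndcg, 4))
print(min_samples, max_samples)
```

The score peaks at step 100 and settles slightly lower by the end of training. Since load_best_model_at_end is False and the final logged value matches the cosine_ndcg@10 in the evaluation table, the reported metrics appear to describe the final checkpoint rather than the best one.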
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model: Snowflake/snowflake-arctic-embed-l