Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Paper • 1908.10084 • Published • 13
How to use jebish7/stella-MNSR-2 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("jebish7/stella-MNSR-2", trust_remote_code=True)
sentences = [
"In the context of the risk-based assessment of customers and business relationships, how should the overlap between customer risk assessment and CDD be managed to ensure both are completed effectively and in compliance with ADGM regulations?",
"DocumentID: 36 | PassageID: D.7. | Passage: Principle 7 – Scenario analysis of climate-related financial risks. Where appropriate, relevant financial firms should develop and implement climate-related scenario analysis frameworks, including stress testing, in a manner commensurate with their size, complexity, risk profile and nature of activities.\n",
"DocumentID: 1 | PassageID: 7.Guidance.4. | Passage: The risk-based assessment of the customer and the proposed business relationship, Transaction or product required under this Chapter is required to be undertaken prior to the establishment of a business relationship with a customer. Because the risk rating assigned to a customer resulting from this assessment determines the level of CDD that must be undertaken for that customer, this process must be completed before the CDD is completed for the customer. The Regulator is aware that in practice there will often be some degree of overlap between the customer risk assessment and CDD. For example, a Relevant Person may undertake some aspects of CDD, such as identifying Beneficial Owners, when it performs a risk assessment of the customer. Conversely, a Relevant Person may also obtain relevant information as part of CDD which has an impact on its customer risk assessment. Where information obtained as part of CDD of a customer affects the risk rating of a customer, the change in risk rating should be reflected in the degree of CDD undertaken.",
"DocumentID: 1 | PassageID: 9.1.2.Guidance.4. | Passage: Where the legislative framework of a jurisdiction (such as secrecy or data protection legislation) prevents a Relevant Person from having access to CDD information upon request without delay as referred to in Rule 9.1.1(3)(b), the Relevant Person should undertake the relevant CDD itself and should not seek to rely on the relevant third party."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]This is a sentence-transformers model finetuned from dunzhang/stella_en_400M_v5 on the csv dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("jebish7/stella-MNSR-2")
# Run inference
sentences = [
"Could you provide examples of circumstances that, when changed, would necessitate the reevaluation of a customer's risk assessment and the application of updated CDD measures?",
'DocumentID: 1 | PassageID: 8.1.2.(1) | Passage: A Relevant Person must also apply CDD measures to each existing customer under Rules \u200e8.3.1, \u200e8.4.1 or \u200e8.5.1 as applicable:\n(a)\twith a frequency appropriate to the outcome of the risk-based approach taken in relation to each customer; and\n(b)\twhen the Relevant Person becomes aware that any circumstances relevant to its risk assessment for a customer have changed.',
"DocumentID: 13 | PassageID: 9.2.1.Guidance.1. | Passage: The Regulator expects that an Authorised Person's Liquidity Risk strategy will set out the approach that the Authorised Person will take to Liquidity Risk management, including various quantitative and qualitative targets. It should be communicated to all relevant functions and staff within the organisation and be set out in the Authorised Person's Liquidity Risk policy.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
anchor and positive| anchor | positive | |
|---|---|---|
| type | string | string |
| details |
|
|
| anchor | positive |
|---|---|
Could you outline the expected procedures for a Trade Repository to notify relevant authorities of any significant errors or omissions in previously submitted data? |
DocumentID: 7 |
In the context of a non-binding MPO, how are commodities held by an Authorised Person treated for the purpose of determining the Commodities Risk Capital Requirement? |
DocumentID: 9 |
Can the FSRA provide case studies or examples of best practices for RIEs operating MTFs or OTFs using spot commodities in line with the Spot Commodities Framework? |
DocumentID: 34 |
MultipleNegativesSymmetricRankingLoss with these parameters:{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
per_device_train_batch_size: 16num_train_epochs: 1warmup_ratio: 0.1batch_sampler: no_duplicatesoverwrite_output_dir: Falsedo_predict: Falseeval_strategy: noprediction_loss_only: Trueper_device_train_batch_size: 16per_device_eval_batch_size: 8per_gpu_train_batch_size: Noneper_gpu_eval_batch_size: Nonegradient_accumulation_steps: 1eval_accumulation_steps: Nonetorch_empty_cache_steps: Nonelearning_rate: 5e-05weight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08max_grad_norm: 1.0num_train_epochs: 1max_steps: -1lr_scheduler_type: linearlr_scheduler_kwargs: {}warmup_ratio: 0.1warmup_steps: 0log_level: passivelog_level_replica: warninglog_on_each_node: Truelogging_nan_inf_filter: Truesave_safetensors: Truesave_on_each_node: Falsesave_only_model: Falserestore_callback_states_from_checkpoint: Falseno_cuda: Falseuse_cpu: Falseuse_mps_device: Falseseed: 42data_seed: Nonejit_mode_eval: Falseuse_ipex: Falsebf16: Falsefp16: Falsefp16_opt_level: O1half_precision_backend: autobf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonelocal_rank: 0ddp_backend: Nonetpu_num_cores: Nonetpu_metrics_debug: Falsedebug: []dataloader_drop_last: Falsedataloader_num_workers: 0dataloader_prefetch_factor: Nonepast_index: -1disable_tqdm: Falseremove_unused_columns: Truelabel_names: Noneload_best_model_at_end: Falseignore_data_skip: Falsefsdp: []fsdp_min_num_params: 0fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap: Noneaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed: Nonelabel_smoothing_factor: 0.0optim: adamw_torchoptim_args: Noneadafactor: Falsegroup_by_length: Falselength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falsedataloader_pin_memory: Truedataloader_persistent_workers: Falseskip_memory_metrics: Trueuse_legacy_prediction_loop: Falsepush_to_hub: Falseresume_from_checkpoint: Nonehub_model_id: Nonehub_strategy: every_savehub_private_repo: Falsehub_always_push: Falsegradient_checkpointing: Falsegradient_checkpointing_kwargs: Noneinclude_inputs_for_metrics: Falseeval_do_concat_batches: Truefp16_backend: autopush_to_hub_model_id: Nonepush_to_hub_organization: Nonemp_parameters: auto_find_batch_size: Falsefull_determinism: Falsetorchdynamo: Noneray_scope: lastddp_timeout: 1800torch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Nonedispatch_batches: Nonesplit_batches: Noneinclude_tokens_per_second: Falseinclude_num_input_tokens_seen: Falseneftune_noise_alpha: Noneoptim_target_modules: Nonebatch_eval_metrics: Falseeval_on_start: Falseuse_liger_kernel: Falseeval_use_gather_object: Falsebatch_sampler: no_duplicatesmulti_dataset_batch_sampler: proportional| Epoch | Step | Training Loss |
|---|---|---|
| 0.0541 | 100 | 0.4442 |
| 0.1083 | 200 | 0.4793 |
| 0.1624 | 300 | 0.4395 |
| 0.2166 | 400 | 0.4783 |
| 0.2707 | 500 | 0.4573 |
| 0.3249 | 600 | 0.4235 |
| 0.3790 | 700 | 0.4029 |
| 0.4331 | 800 | 0.3951 |
| 0.4873 | 900 | 0.438 |
| 0.5414 | 1000 | 0.364 |
| 0.5956 | 1100 | 0.3732 |
| 0.6497 | 1200 | 0.3932 |
| 0.7038 | 1300 | 0.3387 |
| 0.7580 | 1400 | 0.2956 |
| 0.8121 | 1500 | 0.3612 |
| 0.8663 | 1600 | 0.3333 |
| 0.9204 | 1700 | 0.2837 |
| 0.9746 | 1800 | 0.2785 |
| 0.0541 | 100 | 0.2263 |
| 0.1083 | 200 | 0.2085 |
| 0.1624 | 300 | 0.1638 |
| 0.2166 | 400 | 0.2085 |
| 0.2707 | 500 | 0.2442 |
| 0.3249 | 600 | 0.1965 |
| 0.3790 | 700 | 0.2548 |
| 0.4331 | 800 | 0.2504 |
| 0.4873 | 900 | 0.2358 |
| 0.5414 | 1000 | 0.2083 |
| 0.5956 | 1100 | 0.2117 |
| 0.6497 | 1200 | 0.248 |
| 0.7038 | 1300 | 0.221 |
| 0.7580 | 1400 | 0.1886 |
| 0.8121 | 1500 | 0.2653 |
| 0.8663 | 1600 | 0.2651 |
| 0.9204 | 1700 | 0.2349 |
| 0.9746 | 1800 | 0.2435 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
Base model
NovaSearch/stella_en_400M_v5
from sentence_transformers import SentenceTransformer model = SentenceTransformer("jebish7/stella-MNSR-2", trust_remote_code=True) sentences = [ "In the context of the risk-based assessment of customers and business relationships, how should the overlap between customer risk assessment and CDD be managed to ensure both are completed effectively and in compliance with ADGM regulations?", "DocumentID: 36 | PassageID: D.7. | Passage: Principle 7 – Scenario analysis of climate-related financial risks. Where appropriate, relevant financial firms should develop and implement climate-related scenario analysis frameworks, including stress testing, in a manner commensurate with their size, complexity, risk profile and nature of activities.\n", "DocumentID: 1 | PassageID: 7.Guidance.4. | Passage: The risk-based assessment of the customer and the proposed business relationship, Transaction or product required under this Chapter is required to be undertaken prior to the establishment of a business relationship with a customer. Because the risk rating assigned to a customer resulting from this assessment determines the level of CDD that must be undertaken for that customer, this process must be completed before the CDD is completed for the customer. The Regulator is aware that in practice there will often be some degree of overlap between the customer risk assessment and CDD. For example, a Relevant Person may undertake some aspects of CDD, such as identifying Beneficial Owners, when it performs a risk assessment of the customer. Conversely, a Relevant Person may also obtain relevant information as part of CDD which has an impact on its customer risk assessment. Where information obtained as part of CDD of a customer affects the risk rating of a customer, the change in risk rating should be reflected in the degree of CDD undertaken.", "DocumentID: 1 | PassageID: 9.1.2.Guidance.4. | Passage: Where the legislative framework of a jurisdiction (such as secrecy or data protection legislation) prevents a Relevant Person from having access to CDD information upon request without delay as referred to in Rule 9.1.1(3)(b), the Relevant Person should undertake the relevant CDD itself and should not seek to rely on the relevant third party." ] embeddings = model.encode(sentences) similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [4, 4]