Initial version

Files changed (8)

README.md +66 -0
config.json +34 -0
eval/CrossEncoderCorrelationEvaluator_results.csv +4 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +58 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,66 @@

+---
+pipeline_tag: text-ranking
+tags:
+- sentence-transformers
+- cross-encoder
+- reranker
+- sentence-similarity
+- transformers
+base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+language: en
+license: apache-2.0
+---
+# BiomedBERT Reranker
+This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
+The training dataset was generated using a random sample of [PubMed](https://pubmed.ncbi.nlm.nih.gov/) title-abstract pairs along with similar title pairs.
+## Usage (txtai)
+This model scores a list of text pairs, which makes it useful as a reranking step after an initial semantic search.
+```python
+from txtai.pipeline import Similarity
+ranker = Similarity(path="neuml/biomedbert-base-reranker", crossencode=True)
+ranker("query", ["document1", "document2"])
+```
+## Usage (Sentence-Transformers)
+Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).
+```python
+from sentence_transformers import CrossEncoder
+model = CrossEncoder("neuml/biomedbert-base-reranker")
+model.predict([["query", "document1"], ["query", "document2"]])
+```
+## Evaluation Results
+Performance of this model is compared to previously released models trained on medical literature.
+The following datasets were used to evaluate model performance.
+- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
+  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
+- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
+  - Split: test, Pair: (title, text)
+- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
+  - Subset: pubmed, Split: validation, Pair: (article, abstract)
+Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient), scaled by 100, is used as the evaluation metric.
+| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+| ----- | --------- | ------------- | -------------- | ------- |
+| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
+| [bioclinical-modernbert-base-embeddings](https://hf.co/neuml/bioclinical-modernbert-base-embeddings) | 92.49 | 97.10 | 97.04 | 95.54 |
+| [biomedbert-base-colbert](https://hf.co/neuml/biomedbert-base-colbert) | 94.59 | 97.18 | 96.21 | 95.99 |
+| [**biomedbert-base-reranker**](https://hf.co/neuml/biomedbert-base-reranker) | **97.66** | **99.76** | **98.81** | **98.74** |
+| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
+| [pubmedbert-base-embeddings-8M](https://hf.co/neuml/pubmedbert-base-embeddings-8M) | 90.05 | 94.29 | 94.15 | 92.83 |
+As expected, this cross-encoder model scores higher than the bi-encoder and late interaction models on all three datasets. The tradeoff is cost: every query-document pair needs its own forward pass and document representations cannot be precomputed, so it is only practical for scoring a small candidate set. That makes it a good fit for re-ranking medical literature after an initial retrieval step.
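
The retrieve-then-rerank pattern the README describes can be sketched as below. This is a minimal illustration under stated assumptions, not the author's pipeline: `rerank` and `toy_score` are hypothetical names, and `toy_score` is a word-overlap stand-in so the snippet runs without downloading model weights. In practice the scoring callable would wrap the cross-encoder, for example `model.predict` from the `CrossEncoder` usage in the README.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and sort candidates by descending score."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k] if top_k is not None else ranked


def toy_score(pairs: List[Pair]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    scores = []
    for query, doc in pairs:
        query_words = set(query.lower().split())
        doc_words = set(doc.lower().split())
        scores.append(len(query_words & doc_words) / max(len(query_words), 1))
    return scores


# Candidates would normally come from a first-stage search (bi-encoder, BM25, ...).
candidates = [
    "aspirin reduces fever",
    "insulin regulates blood glucose",
    "statins lower cholesterol",
]
top = rerank("insulin and glucose", candidates, toy_score, top_k=2)
```

Keeping the scorer as a plain callable means the first-stage retriever bounds how many pairs the expensive model has to score.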

config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "sentence_transformers": {
+    "activation_fn": "torch.nn.modules.activation.Sigmoid",
+    "version": "5.1.1"
+  },
+  "transformers_version": "4.56.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
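
The `activation_fn` entry, together with the single `LABEL_0` output, suggests that the classification head emits one logit per text pair and that the logit is passed through a sigmoid, so scores land in the open interval (0, 1). The arithmetic is sketched below in plain Python; the logit values are made up for illustration and do not come from the model.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative logits only: strongly negative, neutral, strongly positive.
example_logits = [-4.0, 0.0, 4.0]
scores = [sigmoid(x) for x in example_logits]
```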

eval/CrossEncoderCorrelationEvaluator_results.csv ADDED

	@@ -0,0 +1,4 @@

+epoch,steps,Pearson_Correlation,Spearman_Correlation
+0.26666666666666666,1000,0.9972800398511398,0.8660153599776242
+0.5333333333333333,2000,0.9979565710463103,0.8660227070761196
+0.8,3000,0.9980340500871159,0.8662180223716152
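
The gap between the two correlation columns may look surprising. One possible explanation, which is an assumption and not something the source states, is that the evaluation labels are binary and roughly balanced. In that case tied label ranks cap Spearman near sqrt(3)/2 ≈ 0.866 even when the scores separate the two classes perfectly, while Pearson can still approach 1. The sketch below reproduces that effect on synthetic data; the labels and scores are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic data: balanced binary labels, scores that separate them perfectly.
n = 1000
labels = [0.0] * (n // 2) + [1.0] * (n // 2)
scores = [i * 1e-5 for i in range(n // 2)] + [1.0 - i * 1e-5 for i in range(n // 2)]

p = pearson(scores, labels)
s = spearman(scores, labels)
```

On this synthetic data Pearson is close to 1 while Spearman sits close to 0.866, a pattern similar to the one in the CSV above.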

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4bdb4b2f3d97f229724126c25fe833c47a93d2cb4b5bdb2b6c0fb66ecc4ab5a1
+size 437955572

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,58 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff