LNTTushar committed (verified)
Commit 12ab953 · Parent(s): 7fab2c5

Update README.md

Files changed (1): README.md (+171, βˆ’96)

README.md:
---
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- pytorch
- semantic-search
- custom-architecture
- automated-tokenizer
datasets:
- mteb/stsbenchmark-sts
- synthetic-similarity-data
metrics:
- spearman_correlation
- pearson_correlation
model-index:
- name: Sentence Embedding Model
  results:
  - task:
      type: STS
    dataset:
      type: mteb/stsbenchmark-sts
      name: MTEB STSBenchmark
      config: default
      split: test
    metrics:
    - type: cos_sim_spearman
      value: 67.74
    - type: cos_sim_pearson
      value: 67.21
---

# Sentence Embedding Model - Production Release

## πŸ“Š Model Performance
- **Semantic Understanding**: 67.74 Spearman / 67.21 Pearson (cosine similarity) on the MTEB STS Benchmark test split
- **Model Parameters**: 3,299,584
- **Model Size**: 12.6MB
- **Vocabulary Size**: 164 tokens (automatically built from stopwords + domain words)
- **Max Sequence Length**: 128 tokens
- **Embedding Dimensions**: 384

## πŸš€ Quick Start

### Installation
```bash
pip install -r api/requirements.txt
```

### Basic Usage
```python
from api.inference_api import SentenceEmbeddingInference

# Initialize the model from the package root
model = SentenceEmbeddingInference("./")

# Generate embeddings
texts = ["Your text here", "Another text"]
embeddings = model.get_embeddings(texts)

# Compute cosine similarity between two texts
similarity = model.compute_similarity("Text 1", "Text 2")

# Find the most similar candidate texts for a query
query = "Search query"
candidates = ["Text A", "Text B", "Text C"]
results = model.find_similar_texts(query, candidates, top_k=3)
```

### Alternative Usage with Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hub
model = SentenceTransformer('LNTTushar/sentence-embedding-model-production-release')

# Generate embeddings
sentences = ["Machine learning is transforming AI", "AI includes machine learning"]
embeddings = model.encode(sentences)

# Compute similarity (similarity() takes embeddings, not raw strings)
similarity = model.similarity(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

## πŸ”§ Automatic Tokenizer Features
- **Stopwords Integration**: Uses comprehensive English stopwords
- **Technical Vocabulary**: Includes ML/AI domain-specific terms
- **Character Fallback**: Handles unknown words with character-level encoding (see the sketch after this list)
- **Dynamic Building**: Automatically extracts vocabulary from training data
- **No Manual Lists**: Eliminates need for manual word curation
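
The package does not document the tokenizer's internals, so the following is only a minimal sketch of the character-fallback idea; `vocab`, `encode`, and `UNK_ID` are hypothetical names, not the shipped implementation:

```python
# Hypothetical sketch: known words map to single token ids; unknown words
# fall back to per-character ids so no input is ever dropped.
UNK_ID = 1

def encode(text, vocab):
    ids = []
    for word in text.lower().split():
        if word in vocab:
            ids.append(vocab[word])                           # known word: one token
        else:
            ids.extend(vocab.get(ch, UNK_ID) for ch in word)  # char fallback
    return ids

vocab = {"machine": 2, "learning": 3}
vocab.update({ch: 10 + i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")})
print(encode("machine learning rocks", vocab))  # "rocks" becomes 5 char ids
```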

## πŸ“ Package Structure
```
β”œβ”€β”€ models/         # Model weights and configuration
β”œβ”€β”€ tokenizer/      # Auto-generated vocabulary and mappings
β”œβ”€β”€ exports/        # Optimized model exports (TorchScript)
β”œβ”€β”€ api/            # Python inference API
β”‚   β”œβ”€β”€ inference_api.py
β”‚   └── requirements.txt
└── README.md       # This file
```
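
The TorchScript export under `exports/` can be loaded without the Python API; a sketch, assuming a file name (the README does not list the exact artifact path):

```python
import torch

# File name is an assumption; check exports/ for the actual artifact.
scripted = torch.jit.load("exports/model_torchscript.pt")
scripted.eval()

# A TorchScript module is called like a regular nn.Module; the expected
# inputs (token ids, attention mask, ...) depend on how it was traced.
```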

## ⚑ Performance Benchmarks
- **Inference Speed**: ~500-1000 sentences/second (CPU); see the timing sketch after this list
- **Memory Usage**: ~13MB base model
- **Vocabulary**: Auto-built with 164 tokens
- **Export Formats**: PyTorch, TorchScript (optimized)
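
A number like this can be reproduced with a simple wall-clock measurement; a minimal sketch using the Quick Start API (batch size and text count are arbitrary choices):

```python
import time

from api.inference_api import SentenceEmbeddingInference

model = SentenceEmbeddingInference("./")
texts = ["An example sentence for throughput measurement."] * 1000

start = time.perf_counter()
model.get_embeddings(texts, batch_size=8)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.0f} sentences/second")
```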

## 🎯 Development Highlights
This model was developed entirely from scratch:
1. βœ… Automated tokenizer with stopwords + technical terms
2. βœ… No manual vocabulary curation required
3. βœ… Dynamic vocabulary building from training data
4. βœ… Comprehensive fallback mechanisms
5. βœ… Production-ready deployment package

## πŸ“ž API Reference

### SentenceEmbeddingInference Class

#### Methods:
- `get_embeddings(texts, batch_size=8)`: Generate sentence embeddings
- `compute_similarity(text1, text2)`: Calculate cosine similarity (see the sketch after this list)
- `find_similar_texts(query, candidates, top_k=5)`: Find most similar texts
- `benchmark_performance(num_texts=100)`: Run performance benchmarks
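
For reference, the cosine similarity that `compute_similarity` is documented to return can also be computed directly from the embeddings with NumPy; this is a sketch of the quantity, not the library's internal code:

```python
import numpy as np

from api.inference_api import SentenceEmbeddingInference

def cosine_similarity(a, b):
    # Cosine of the angle between two 1-D embedding vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model = SentenceEmbeddingInference("./")
emb = model.get_embeddings(["Text 1", "Text 2"])
print(cosine_similarity(emb[0], emb[1]))  # should match model.compute_similarity
```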

## πŸ“‹ System Requirements
- **Python**: 3.7+
- **PyTorch**: 1.9.0+
- **NumPy**: 1.20.0+
- **Memory**: ~512MB RAM recommended
- **Storage**: ~50MB for model files

## 🏷️ Version Information
- **Model Version**: 1.0
- **Export Date**: 2025-07-22
- **Tokenizer**: Auto-generated with stopwords
- **Status**: Production-ready

## πŸ”¬ Technical Details

### Architecture
- **Custom Transformer**: Built from scratch with 3.3M parameters
- **Embedding Dimension**: 384
- **Attention Heads**: 6 per layer
- **Transformer Layers**: 4 layers optimized for sentence embeddings
- **Pooling Strategy**: Mean pooling for sentence-level representations (see the sketch after this list)
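
Mean pooling averages token embeddings while ignoring padding positions; a standard PyTorch sketch (tensor names are illustrative, not the model's actual internals):

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, 384); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid division by zero
    return summed / counts                         # (batch, 384)
```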

### Training
- **Dataset**: STS Benchmark + synthetic similarity pairs
- **Loss Function**: Multi-objective (MSE + ranking + contrastive); see the sketch after this list
- **Optimization**: Custom training pipeline with advanced techniques
- **Vocabulary Building**: Automated from training corpus + stopwords
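
The loss terms are not defined precisely here; one plausible reading of "MSE + ranking + contrastive" is a weighted sum like the sketch below, where the weights, margin, and exact term formulations are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(pred_sim, gold_sim, margin=0.5, weights=(1.0, 1.0, 1.0)):
    # pred_sim, gold_sim: (batch,) predicted and gold similarity scores in [0, 1].
    mse = F.mse_loss(pred_sim, gold_sim)

    # Ranking: penalize pairs whose predicted order disagrees with the gold order.
    diff_pred = pred_sim.unsqueeze(0) - pred_sim.unsqueeze(1)
    diff_gold = gold_sim.unsqueeze(0) - gold_sim.unsqueeze(1)
    ranking = F.relu(-diff_pred * torch.sign(diff_gold)).mean()

    # Contrastive: pull similar pairs toward 1, push dissimilar pairs below the margin.
    contrastive = (gold_sim * (1 - pred_sim)
                   + (1 - gold_sim) * F.relu(pred_sim - margin)).mean()

    return weights[0] * mse + weights[1] * ranking + weights[2] * contrastive
```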

### Performance Metrics
- **Spearman Correlation**: 67.74 (cosine) on the MTEB STS Benchmark test split
- **Processing Speed**: 500-1000 sentences/second on CPU
- **Memory Efficiency**: 13MB model size vs 90MB+ for comparable models
- **Deployment Ready**: Optimized for production environments

---

**Built with an automated tokenizer using comprehensive stopwords and domain vocabulary**

πŸŽ‰ **No more manual word lists - fully automated vocabulary building!**