Kirim-V1-7B-Chat
A conversational bilingual language model optimized for natural dialogue in Chinese and English
Introduction
Kirim-V1-7B-Chat is the instruction-tuned variant of Kirim-V1-base, specifically optimized for conversational interactions. With 7 billion parameters, this model strikes an excellent balance between performance and efficiency, making it ideal for:
- Natural conversational AI
- Bilingual customer support
- Educational assistance
- Coding help and explanations
- General question answering
Key Improvements Over Base Model:
- ✅ Instruction Following: Fine-tuned on 150K high-quality instruction-response pairs
- ✅ Safety Aligned: Enhanced with safety guidelines and ethical considerations
- ✅ Conversational: Maintains context across multi-turn dialogues
- ✅ Helpful & Harmless: Balanced to be both useful and safe
- ✅ Efficient: 7B parameters run smoothly on consumer GPUs (16GB VRAM)
Model Details
| Specification | Value | Comparison to Base |
|---|---|---|
| Parameters | ~7B | 46% smaller |
| Hidden Size | 3,584 | Reduced for efficiency |
| Layers | 28 | 4 layers fewer |
| Attention Heads | 28 | Optimized ratio |
| KV Heads | 7 (GQA) | 4:1 efficiency |
| Context Length | 16,384 tokens | Focused on dialogue |
| Vocabulary | 102,400 | Same bilingual coverage |
| Precision | BFloat16 | Same quality |
Architecture Optimizations:
- Smaller Hidden Size: 3,584 (vs 4,096 in base) for faster inference
- Reduced Layers: 28 layers optimized for chat interactions
- Shorter Context: 16K tokens, sufficient for most conversations
- Lower RoPE Scaling: 1.5x factor, sized for typical chat lengths
- Grouped Query Attention: 7 KV heads for memory efficiency
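These values can be checked directly against the published configuration. A minimal sketch, assuming the config exposes the standard Hugging Face attribute names (hidden_size, num_hidden_layers, etc.); a custom config class loaded with trust_remote_code=True may use different field names:

from transformers import AutoConfig

# Attribute names below are the common Hugging Face conventions (an assumption,
# not confirmed by this card).
config = AutoConfig.from_pretrained("Kirim-ai/Kirim-V1-7B-Chat", trust_remote_code=True)

print(config.hidden_size)              # expected: 3584
print(config.num_hidden_layers)        # expected: 28
print(config.num_attention_heads)      # expected: 28
print(config.num_key_value_heads)      # expected: 7 (GQA, 4:1 query-to-KV ratio)
print(config.max_position_embeddings)  # expected: 16384
print(config.vocab_size)               # expected: 102400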
Quick Start
Installation
pip install transformers torch accelerate
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained(
"Kirim-ai/Kirim-V1-7B-Chat",
torch_dtype="auto",
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
"Kirim-ai/Kirim-V1-7B-Chat",
trust_remote_code=True
)
# Single turn conversation
messages = [
{"role": "user", "content": "你好,能帮我解释一下机器学习吗?"}
]
inputs = tokenizer.apply_chat_template(
messages,
return_tensors="pt",
add_generation_prompt=True
).to(model.device)
outputs = model.generate(
inputs,
max_new_tokens=512,
temperature=0.8,
top_p=0.9,
do_sample=True
)
# Decode only the newly generated tokens (the raw output also contains the prompt)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
Multi-turn Conversation
conversation = [
{"role": "system", "content": "You are Kirim, a helpful bilingual AI assistant."},
{"role": "user", "content": "What's the weather like today?"},
{"role": "assistant", "content": "I don't have access to real-time weather data. Could you tell me your location?"},
{"role": "user", "content": "I'm in Beijing"}
]
# Generate response
inputs = tokenizer.apply_chat_template(
    conversation,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.8, do_sample=True)
# Decode only the new assistant reply
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
Deployment Options
Full Precision (BF16) - Best Quality
Requirements: 16GB VRAM (RTX 4080, A10)
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
8-bit Quantization - Balanced
Requirements: 10GB VRAM (RTX 3080, RTX 4070)
# Requires the bitsandbytes package (pip install bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True
)
4-bit Quantization - Maximum Efficiency
Requirements: 6GB VRAM (RTX 3060, RTX 4060)
# Requires the bitsandbytes package (pip install bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    load_in_4bit=True,
    device_map="auto",
    trust_remote_code=True
)
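On recent transformers versions, the load_in_8bit / load_in_4bit shortcuts are deprecated in favor of an explicit BitsAndBytesConfig. A minimal 4-bit sketch; the NF4 settings below are common defaults, not values recommended by this card:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires: pip install bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)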
Performance Benchmarks
Chat & Instruction Following
| Benchmark | Kirim-7B-Chat | Baseline 7B |
|---|---|---|
| MT-Bench (Chinese) | 7.2 | 6.8 |
| MT-Bench (English) | 6.9 | 6.7 |
| AlpacaEval | 73.5% | 71.2% |
| Chinese Safety | 92.1% | 88.4% |
Domain Performance
| Task | Score |
|---|---|
| Conversational Response | 8.1/10 |
| Code Explanation | 7.8/10 |
| Math Problem Solving | 7.2/10 |
| Creative Writing | 8.3/10 |
| Multilingual Translation | 7.9/10 |
Efficiency Metrics
| Configuration | VRAM | Tokens/sec | Latency |
|---|---|---|---|
| BF16 | 16GB | 55 tok/s | Low |
| 8-bit | 10GB | 48 tok/s | Low |
| 4-bit | 6GB | 38 tok/s | Medium |
Benchmarked on NVIDIA RTX 4090
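The exact benchmarking methodology is not specified here; a rough sketch of how a comparable tokens/sec figure could be measured, reusing the model and tokenizer from Quick Start (single request, greedy decoding; results vary with prompt length and hardware):

import time

# Build a short chat prompt and time the generation of 256 new tokens
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain attention in one paragraph."}],
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

start = time.time()
outputs = model.generate(prompt, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start

new_tokens = outputs.shape[-1] - prompt.shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/s")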
Use Cases
1. Customer Support Chatbot
messages = [
{"role": "system", "content": "You are a customer support agent for an e-commerce platform."},
{"role": "user", "content": "我的订单还没有发货,什么时候能收到?"}
]
# Model provides helpful, empathetic response
2. Programming Assistant
messages = [
{"role": "user", "content": "Write a Python function to find the longest palindrome in a string"}
]
# Model generates code with explanations
3. Educational Tutor
messages = [
{"role": "user", "content": "用简单的话解释什么是递归"}
]
# Model explains concepts clearly in the user's language
4. Content Creation
messages = [
{"role": "user", "content": "Write a short story about a robot learning to paint"}
]
# Model generates creative content
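The four snippets above share the same generation pattern. A small helper makes them runnable end to end, reusing the model and tokenizer from Quick Start; the function name chat_once is hypothetical, not part of the model's API:

def chat_once(messages, max_new_tokens=512):
    """Generate a single assistant reply for a list of chat messages."""
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)
    outputs = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.8,
        top_p=0.9,
        do_sample=True
    )
    # Return only the newly generated tokens, not the prompt
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(chat_once(messages))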
Safety & Limitations
Safety Features
- Trained with safety-focused instruction data
- Reduced tendency for harmful outputs
- Cultural sensitivity for both Chinese and English contexts
- Appropriate refusal of dangerous requests
Known Limitations
- No Real-time Data: Cannot access current information or browse the web
- Text Only: Cannot process or generate images
- Knowledge Cutoff: Training data through October 2024
- Hallucinations: May occasionally generate plausible but incorrect information
- Shorter Context: 16K tokens (vs 32K in base model)
- Reduced Capacity: 7B parameters support less complex reasoning than the 13B base model
Comparison: 7B-Chat vs Base Model
| Feature | Kirim-V1-7B-Chat | Kirim-V1-Base |
|---|---|---|
| Size | 7B parameters | 13B parameters |
| Purpose | Conversational AI | General purpose |
| Context | 16K tokens | 32K tokens |
| Training | Instruction-tuned | Pre-trained only |
| Use Case | Chat, Q&A, dialogue | Foundation for fine-tuning |
| Safety | Safety-aligned | Neutral base |
| VRAM | 6-16GB | 12-24GB |
| Speed | Faster (55 tok/s) | Slower (45 tok/s) |
| Best For | Production chatbots | Research & customization |
System Requirements
Minimum (4-bit Quantization)
- GPU: 6GB VRAM (RTX 3060, RTX 4060)
- RAM: 16GB
- Storage: 10GB
Recommended (BF16)
- GPU: 16GB VRAM (RTX 4080, RTX 4090, A10)
- RAM: 32GB
- Storage: 20GB
Optimal (Production)
- GPU: 24GB+ VRAM (RTX 4090, A100)
- RAM: 64GB
- Storage: 50GB SSD
Docker Deployment
# Pull and run
docker run -it --gpus all \
-v $(pwd)/models:/app/models \
-p 8000:8000 \
kirim-ai/kirim-v1-7b-chat:latest
# Or build from Dockerfile
docker build -t kirim-7b-chat .
docker run -it --gpus all kirim-7b-chat
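The container listens on port 8000, but the serving API it exposes is not documented in this card. Assuming an OpenAI-compatible chat completions endpoint (an assumption; verify against the image's documentation), a client request could look like:

import requests

# Hypothetical request: the endpoint path and payload format are assumptions,
# not confirmed by this card.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Kirim-ai/Kirim-V1-7B-Chat",
        "messages": [{"role": "user", "content": "Hello! Who are you?"}],
        "max_tokens": 256,
    },
    timeout=60,
)
print(resp.json())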
Fine-tuning
This model can be further fine-tuned for specific domains:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./kirim-7b-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=1e-5,
    bf16=True,  # the model is distributed in BFloat16
)
trainer = Trainer(
model=model,
args=training_args,
    train_dataset=your_dataset,  # a pre-tokenized dataset of chat examples
)
trainer.train()
Model Series
| Model | Parameters | Context | Best For |
|---|---|---|---|
| Kirim-V1-Base | 13B | 32K | Research, fine-tuning foundation |
| Kirim-V1-7B-Chat ⭐ | 7B | 16K | Production chatbots, Q&A |
License
This model is released under the Apache License 2.0. You are free to use, modify, and distribute this model for both commercial and non-commercial purposes.
Citation
@misc{kirim2025v1chat,
title={Kirim-V1-7B-Chat: Bilingual Conversational AI},
author={Kirim AI Team},
year={2025},
publisher={Kirim AI},
url={https://huggingface.co/Kirim-ai/Kirim-V1-7B-Chat}
}