Kirim-V1-7B-Chat

A conversational bilingual language model optimized for natural dialogue in Chinese and English

Base Model · Quick Start


Introduction

Kirim-V1-7B-Chat is the instruction-tuned variant of Kirim-V1-Base, optimized for conversational interactions. At 7 billion parameters, the model strikes a strong balance between performance and efficiency, making it well suited for:

  • Natural conversational AI
  • Bilingual customer support
  • Educational assistance
  • Coding help and explanations
  • General question answering

Key Improvements Over Base Model:

  • Instruction Following: Fine-tuned on 150K high-quality instruction-response pairs
  • Safety Aligned: Enhanced with safety guidelines and ethical considerations
  • Conversational: Maintains context across multi-turn dialogues
  • Helpful & Harmless: Balanced to be both useful and safe
  • Efficient: 7B parameters run smoothly on consumer GPUs (16GB VRAM)

Model Details

| Specification | Value | Comparison to Base |
|---|---|---|
| Parameters | ~7B | 46% smaller |
| Hidden Size | 3,584 | Reduced for efficiency |
| Layers | 28 | 4 fewer layers |
| Attention Heads | 28 | Optimized ratio |
| KV Heads | 7 (GQA) | 4:1 query-to-KV head ratio |
| Context Length | 16,384 tokens | Focused on dialogue |
| Vocabulary | 102,400 | Same bilingual coverage |
| Precision | BFloat16 | Same quality |

Architecture Optimizations:

  • Smaller Hidden Size: 3,584 (vs 4,096 in base) for faster inference
  • Reduced Layers: 28 layers, optimized for chat interactions
  • Shorter Context: 16K tokens, sufficient for most conversations
  • Lower RoPE Scaling: 1.5x factor, tuned for typical chat lengths
  • Grouped Query Attention: 7 KV heads for memory efficiency (see the sketch below)
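
To put the GQA numbers in perspective, the back-of-the-envelope sketch below estimates the per-token KV-cache footprint from the spec table above. The formula 2 x layers x kv_heads x head_dim x bytes is the standard KV-cache estimate for transformer decoders; head_dim = hidden_size / attention_heads is our inference here, not a published figure.

# KV-cache size per token: 2 (K and V) x layers x kv_heads x head_dim x bytes.
# Values come from the spec table above; head_dim is inferred, not documented.
LAYERS, HIDDEN, N_HEADS, KV_HEADS = 28, 3584, 28, 7
HEAD_DIM = HIDDEN // N_HEADS   # 128
BYTES = 2                      # bfloat16

def kv_cache_bytes_per_token(kv_heads: int) -> int:
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES

mha = kv_cache_bytes_per_token(N_HEADS)   # full multi-head attention
gqa = kv_cache_bytes_per_token(KV_HEADS)  # GQA with 7 KV heads
print(f"MHA: {mha} B/token, GQA: {gqa} B/token ({mha / gqa:.0f}x smaller)")
print(f"GQA cache at the full 16,384-token context: {gqa * 16384 / 2**20:.0f} MiB")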

Quick Start

Installation

pip install transformers torch accelerate

# Optional, for the 8-bit / 4-bit deployment options below:
pip install bitsandbytes

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    trust_remote_code=True
)

# Single-turn conversation ("Hello, can you explain machine learning to me?")
messages = [
    {"role": "user", "content": "你好,能帮我解释一下机器学习吗?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.8,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Multi-turn Conversation

conversation = [
    {"role": "system", "content": "You are Kirim, a helpful bilingual AI assistant."},
    {"role": "user", "content": "What's the weather like today?"},
    {"role": "assistant", "content": "I don't have access to real-time weather data. Could you tell me your location?"},
    {"role": "user", "content": "I'm in Beijing"}
]

# Generate the next assistant turn
inputs = tokenizer.apply_chat_template(
    conversation,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
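
For longer dialogues it helps to wrap this pattern in a small helper that appends each reply back into the history. A minimal sketch using the model and tokenizer loaded above (the chat_turn name is ours, not part of any Kirim API):

def chat_turn(history, user_message, max_new_tokens=256):
    """Append a user message, generate a reply, and record it in the history."""
    history.append({"role": "user", "content": user_message})
    inputs = tokenizer.apply_chat_template(
        history, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    outputs = model.generate(
        inputs, max_new_tokens=max_new_tokens,
        temperature=0.8, top_p=0.9, do_sample=True
    )
    # Decode only the new tokens and keep them in the history for the next turn
    reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    history.append({"role": "assistant", "content": reply})
    return reply

history = [{"role": "system", "content": "You are Kirim, a helpful bilingual AI assistant."}]
print(chat_turn(history, "你好!"))  # "Hello!"
print(chat_turn(history, "Tell me a fun fact about Beijing."))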

Deployment Options

Full Precision (BF16) - Best Quality

Requirements: 16GB VRAM (RTX 4080, A10)

import torch

model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

8-bit Quantization - Balanced

Requirements: 10GB VRAM (RTX 3080, RTX 4070) and the bitsandbytes package

from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

4-bit Quantization - Maximum Efficiency

Requirements: 6GB VRAM (RTX 3060, RTX 4060) and the bitsandbytes package

# BitsAndBytesConfig imported above
model = AutoModelForCausalLM.from_pretrained(
    "Kirim-ai/Kirim-V1-7B-Chat",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)
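
Whichever option you choose, you can verify the resulting footprint directly; get_memory_footprint() is a standard transformers method on loaded models:

# Report how much memory the loaded weights actually occupy
print(f"Model footprint: {model.get_memory_footprint() / 2**30:.1f} GiB")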

Performance Benchmarks

Chat & Instruction Following

| Benchmark | Kirim-7B-Chat | Baseline 7B |
|---|---|---|
| MT-Bench (Chinese) | 7.2 | 6.8 |
| MT-Bench (English) | 6.9 | 6.7 |
| AlpacaEval | 73.5% | 71.2% |
| Chinese Safety | 92.1% | 88.4% |

Domain Performance

| Task | Score |
|---|---|
| Conversational Response | 8.1/10 |
| Code Explanation | 7.8/10 |
| Math Problem Solving | 7.2/10 |
| Creative Writing | 8.3/10 |
| Multilingual Translation | 7.9/10 |

Efficiency Metrics

| Configuration | VRAM | Tokens/sec | Latency |
|---|---|---|---|
| BF16 | 16GB | 55 tok/s | Low |
| 8-bit | 10GB | 48 tok/s | Low |
| 4-bit | 6GB | 38 tok/s | Medium |

Benchmarked on an NVIDIA RTX 4090.
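
To reproduce the tokens/sec figures on your own hardware, a simple timing sketch (the prompt and settings are arbitrary, and this reuses the model and tokenizer from Quick Start):

import time

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain attention in transformers."}],
    return_tensors="pt", add_generation_prompt=True
).to(model.device)

start = time.perf_counter()
out = model.generate(prompt, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start  # for strict GPU timing, also call torch.cuda.synchronize()

new_tokens = out.shape[-1] - prompt.shape[-1]  # generation may stop early at EOS
print(f"{new_tokens / elapsed:.1f} tok/s")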


Use Cases

1. Customer Support Chatbot

messages = [
    {"role": "system", "content": "You are a customer support agent for an e-commerce platform."},
    # "My order hasn't shipped yet; when will I receive it?"
    {"role": "user", "content": "我的订单还没有发货,什么时候能收到?"}
]
# Model provides a helpful, empathetic response

2. Programming Assistant

messages = [
    {"role": "user", "content": "Write a Python function to find the longest palindrome in a string"}
]
# Model generates code with explanations

3. Educational Tutor

messages = [
    # "Explain what recursion is in simple terms"
    {"role": "user", "content": "用简单的话解释什么是递归"}
]
# Model explains concepts clearly in the user's language

4. Content Creation

messages = [
    {"role": "user", "content": "Write a short story about a robot learning to paint"}
]
# Model generates creative content
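
These use cases respond well to different sampling settings. The presets below are reasonable starting points rather than tuned recommendations, and they reuse the inputs tensor and model from Quick Start:

# Suggested starting points; tune per application
SAMPLING_PRESETS = {
    "support":  {"temperature": 0.3, "top_p": 0.9},   # consistent, factual tone
    "code":     {"temperature": 0.2, "top_p": 0.95},  # near-deterministic output
    "tutoring": {"temperature": 0.7, "top_p": 0.9},   # some variety in phrasing
    "creative": {"temperature": 1.0, "top_p": 0.95},  # maximum diversity
}

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    **SAMPLING_PRESETS["code"],
)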

Safety & Limitations

Safety Features

  • Trained with safety-focused instruction data
  • Reduced tendency toward harmful outputs
  • Cultural sensitivity for both Chinese and English contexts
  • Appropriate refusal of dangerous requests

Known Limitations

  • No Real-time Data: Cannot access current information or browse the web
  • Text Only: Cannot process or generate images
  • Knowledge Cutoff: Training data through October 2024
  • Hallucinations: May occasionally generate plausible but incorrect information
  • Shorter Context: 16K tokens (vs 32K in the base model)
  • Reduced Capacity: 7B parameters allow less complex reasoning than the 13B base


Comparison: 7B-Chat vs Base Model

| Feature | Kirim-V1-7B-Chat | Kirim-V1-Base |
|---|---|---|
| Size | 7B parameters | 13B parameters |
| Purpose | Conversational AI | General purpose |
| Context | 16K tokens | 32K tokens |
| Training | Instruction-tuned | Pre-trained only |
| Use Case | Chat, Q&A, dialogue | Foundation for fine-tuning |
| Safety | Safety-aligned | Neutral base |
| VRAM | 6-16GB | 12-24GB |
| Speed | Faster (55 tok/s) | Slower (45 tok/s) |
| Best For | Production chatbots | Research & customization |

System Requirements

Minimum (4-bit Quantization)

  • GPU: 6GB VRAM (RTX 3060, RTX 4060)
  • RAM: 16GB
  • Storage: 10GB

Recommended (BF16)

  • GPU: 16GB VRAM (RTX 4080, RTX 4090, A10)
  • RAM: 32GB
  • Storage: 20GB

Optimal (Production)

  • GPU: 24GB+ VRAM (RTX 4090, A100)
  • RAM: 64GB
  • Storage: 50GB SSD

Docker Deployment

# Pull and run
docker run -it --gpus all \
  -v $(pwd)/models:/app/models \
  -p 8000:8000 \
  kirim-ai/kirim-v1-7b-chat:latest

# Or build from Dockerfile
docker build -t kirim-7b-chat .
docker run -it --gpus all kirim-7b-chat

Fine-tuning

This model can be further fine-tuned for specific domains:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./kirim-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    bf16=True,  # match the model's native BFloat16 precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # a tokenized dataset you supply
)

trainer.train()
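
Full fine-tuning of 7B parameters exceeds most consumer GPUs. A parameter-efficient alternative with LoRA is sketched below, assuming the peft package is installed and that the model's attention projections follow the common q_proj/v_proj naming (unverified for this architecture):

from peft import LoraConfig, get_peft_model

# LoRA trains small adapter matrices instead of the full weights
lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed module names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total

# The Trainer setup above works unchanged with the adapted model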

Model Series

| Model | Parameters | Context | Best For |
|---|---|---|---|
| Kirim-V1-Base | 13B | 32K | Research, fine-tuning foundation |
| Kirim-V1-7B-Chat | 7B | 16K | Production chatbots, Q&A |

License

This model is released under the Apache License 2.0. You are free to use, modify, and distribute this model for both commercial and non-commercial purposes.


Citation

@misc{kirim2025v1chat,
  title={Kirim-V1-7B-Chat: Bilingual Conversational AI},
  author={Kirim AI Team},
  year={2025},
  publisher={Kirim AI},
  url={https://huggingface.co/Kirim-ai/Kirim-V1-7B-Chat}
}