Instructions to use bigscience/bloom with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigscience/bloom with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigscience/bloom")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigscience/bloom with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigscience/bloom"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloom",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/bigscience/bloom
```
- SGLang
How to use bigscience/bloom with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigscience/bloom" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloom",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "bigscience/bloom" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloom",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use bigscience/bloom with Docker Model Runner:
```shell
docker model run hf.co/bigscience/bloom
```
Suggest a cloud GPU service to fine-tune BLOOM.
Could you explain more clearly what your issue is? Specifically, what problem are you facing? Basically, any provider with enough A100s would work. In our case we used 384 80GB A100s to train it, with 48 GPUs per replica. Depending on how you want to fine-tune, you might need fewer than 48 A100s.
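The numbers above imply a simple data-parallel layout, which can be checked with back-of-the-envelope arithmetic (the split of each replica into tensor/pipeline parallelism is not specified here):

```python
# Training layout described above: 384 A100s total, 48 GPUs per model replica.
total_gpus = 384
gpus_per_replica = 48

# Each replica holds one full copy of the model, so the remaining
# dimension is data parallelism across replicas.
replicas = total_gpus // gpus_per_replica
print(replicas)  # 8 data-parallel replicas
```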
Also, you can batch your questions into a single discussion.
Well, I would like to fine-tune BLOOM for two things: one is paraphrasing and two is summarization. However, the cost of A100 rental is a lot, because I assume it would take at least a month to fine-tune for each purpose. What is the cheapest approach? How long does one epoch take on a reasonable number of GPUs? My calculations come to at least $25k per month per purpose from Lambda, the cheapest cloud GPU service.
Not sure what the cheapest approach is. I do think you first need to figure out how to optimize your software before worrying about the hardware. Typically we use Megatron-DeepSpeed to train, as it allowed us to get the fastest training we could at the time.
> How long does one epoch take on a reasonable number of GPUs?
That question isn't very well formulated, as an epoch is dataset dependent. During pretraining we only do one epoch, but it lasted 3 months on 382 GPUs. I'm guessing your fine-tuning setup will probably be shorter.
I would usually advise against fine-tuning such big models, as you could just use prompting to solve your task. The only fine-tuning I would really consider is T0-style, where we fine-tune on a bunch of tasks.
ty
What is T0-style?
Is prompting just giving it one example and then starting the sequence for the next, and having BLOOM complete it?
Yes, you are right, it was badly formulated. If the original BLOOM training set was 1.6 TB and all my training data is 1 GB, will it be 1600 times faster, require 1600 times less GPU power, or some combination of both?
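As a rough sanity check, not a definitive answer: in the crude approximation that total compute scales linearly with the amount of data processed (ignoring warm-up, evaluation, and efficiency differences), 1600x less data means roughly 1600x less total GPU-time, which you can trade between fewer GPUs and less wall-clock time. Using the figures quoted earlier in the thread (~3 months on ~384 GPUs):

```python
# Back-of-the-envelope only: assumes compute is proportional to data size.
pretrain_gpu_months = 384 * 3        # ~3 months on ~384 GPUs (figures above)
data_ratio = 1.6e12 / 1e9            # 1.6 TB pretraining data vs 1 GB

finetune_gpu_months = pretrain_gpu_months / data_ratio
# On, say, 8 GPUs the wall-clock estimate would be:
months_on_8_gpus = finetune_gpu_months / 8
print(finetune_gpu_months, months_on_8_gpus)
```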
What is pretraining? Does it mean before validation?
How many words is 2048 tokens in BLOOM, approximately?
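As a rough rule of thumb for English text, subword tokenizers average somewhere around 0.75 words per token; this is a common approximation, not a measured BLOOM statistic, and BLOOM's multilingual tokenizer can deviate from it significantly depending on the language:

```python
# Rough estimate only: ~0.75 words per token is a common approximation
# for English text, not a measured figure for BLOOM's tokenizer.
tokens = 2048
words_per_token = 0.75

approx_words = int(tokens * words_per_token)
print(approx_words)  # roughly 1536 words
```

For an exact count on your own text, tokenize it with `AutoTokenizer.from_pretrained("bigscience/bloom")` and compare `len(tokenizer(text)["input_ids"])` against the word count.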