Announcing the Launch of Empower

Empower is a serverless LLM hosting platform for fine-tuned models that delivers performance on par with dedicated instances at a fraction of the cost.

Daiyi Yang
February 15, 2024
3 min read

Today, we're excited to announce the public beta launch of Empower, a fast and cost-effective platform for serving fine-tuned LLMs. Empower is a serverless hosting platform for fine-tuned models that delivers performance on par with dedicated instances at a fraction of the cost. Get started and deploy your model for free today.

Problem

Fine-tuned open-source LLMs can achieve superior performance, sometimes even outperforming larger models like GPT-4 on specific tasks. As a result, an increasing number of companies have started to fine-tune models.

However, deploying and hosting fine-tuned LLMs is challenging, primarily because of the associated costs. Most LLM hosting platforms require merging the fine-tuned LoRA weights into the base model and renting dedicated GPUs to run the merged model. As a result, a single fine-tuned model can cost a few thousand dollars a month to run, making fine-tuned models economically impractical for most businesses.

Solution

Empower is a serverless fine-tuned LLM hosting platform. Unlike platforms that require reserving GPUs to run fine-tuned models, Empower lets users serverlessly deploy fine-tuned LoRAs and pay for token usage, not GPU time. With Empower, you can cut your model-serving cost by more than 80% with no compromise on performance.

Key benefits of using Empower compared to alternatives (e.g., AWS, GCP, Together.ai, Modal, RunPod):

- Cost Effective: Empower adopts the pay-by-token-usage model ($0.4 per 1 million tokens for 7B models), significantly reducing costs compared to the traditional GPU time-based billing, which can cost thousands of dollars monthly per fine-tuned model.

- Minimal Cold Start Time: Empower’s cold start time is more than 80% lower than other alternatives’, ensuring immediate responses for an enhanced user experience.

- Flexibility: Empower supports any PEFT-based LoRA model. Users are free to train their models on any platform and deploy them with Empower, serving through an OpenAI-compatible API with no vendor lock-in.

Empower is 80%–90% cheaper than the alternatives. This saving does not even factor in the additional cost of ensuring high availability, which could double other options’ cost.
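To make the savings concrete, here is a rough back-of-the-envelope comparison. The per-token price is taken from this post; the dedicated-GPU hourly rate is an illustrative assumption, not a quote from any specific provider:

```python
# Rough cost comparison: pay-per-token (Empower's 7B pricing from this post)
# vs. a dedicated GPU instance. The $2.50/hour A100 rate is an illustrative
# assumption; actual cloud prices vary.
EMPOWER_PRICE_PER_M_TOKENS = 0.40   # $0.4 per 1M tokens for 7B models
DEDICATED_GPU_PER_HOUR = 2.50       # assumed hourly rate for one A100
HOURS_PER_MONTH = 730

def empower_monthly_cost(tokens_per_month: float) -> float:
    return tokens_per_month / 1_000_000 * EMPOWER_PRICE_PER_M_TOKENS

def dedicated_monthly_cost() -> float:
    # A reserved GPU bills for every hour, whether or not it serves traffic.
    return DEDICATED_GPU_PER_HOUR * HOURS_PER_MONTH

# e.g. 100M tokens/month:
print(f"Empower:   ${empower_monthly_cost(100_000_000):,.2f}")   # $40.00
print(f"Dedicated: ${dedicated_monthly_cost():,.2f}")            # $1,825.00
```

The gap only widens at low or bursty traffic, since the dedicated GPU bills around the clock regardless of usage.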

How It Works

Deploying a LoRA on Empower is straightforward. We support importing models from Hugging Face: simply go to the LoRAs page of Empower and enter the Hugging Face model ID. The OpenAI-compatible LoRA inference endpoint is ready to use right after deployment. The screen recording below shows the flow of deploying a LoRA to production.
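Once deployed, the endpoint speaks the standard OpenAI chat-completions protocol, so any OpenAI-compatible client works. A minimal stdlib-only sketch is below; the base URL and model name are placeholders, not actual Empower values — substitute the ones shown for your deployment:

```python
# Sketch of calling a deployed LoRA through an OpenAI-compatible
# chat-completions endpoint. The URL and model ID below are placeholder
# assumptions; any OpenAI-compatible client (e.g. the official `openai`
# package) works the same way.
import json
from urllib import request

EMPOWER_BASE_URL = "https://app.empower.dev/api/v1"  # assumed; check your dashboard
API_KEY = "YOUR_EMPOWER_API_KEY"

def build_chat_request(model: str, user_message: str) -> request.Request:
    """Build a standard OpenAI-style /chat/completions request."""
    payload = {
        "model": model,  # the Hugging Face model ID you deployed
        "messages": [{"role": "user", "content": user_message}],
    }
    return request.Request(
        f"{EMPOWER_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("your-username/your-lora", "Hello!")
# request.urlopen(req) would send it; the response follows the OpenAI schema.
```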

Under the Hood

At the core of Empower’s platform is a high-performance LoRA inference engine built in Rust atop existing open-source projects. Empower’s engine allows us to run multiple LoRAs on the same GPU with low latency and high throughput, outperforming other leading open-source inference engines such as vLLM (developed at UC Berkeley) and LoRAX (built on top of Hugging Face’s Text Generation Inference).

We conducted two benchmarks to test Empower’s performance:

- Cold start latency benchmark against AnyScale and Modal

- Latency / throughput benchmark against vLLM and LoRAX

with the following setup:

- Model: llama2-7b-chat

- LoRA: Llama-7b-chinese-chat-lora

- Hardware: one A100-40G GPU

Cold Start Latency Benchmark:

We sent 5 cold requests (the first request after service start, or the first request after a long idle period) to each platform. The average cold-request response times are shown below:

The result shows that Empower has the lowest cold start time, which is crucial for applications aiming to provide a consistent experience for customers.

Latency / Throughput Benchmark:

Below is the setup of the latency / throughput benchmark:

- Total of 1000 requests, varying QPS to simulate different load scenarios.

- Varied input length, with an average of 233 input tokens per request

- Varied output length, with an average of 245 output tokens per request
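The load pattern above can be sketched as an open-loop generator: requests are dispatched at a fixed QPS regardless of how quickly responses come back, which is what stresses the engine at high load. The `send_request` stub below stands in for the real HTTP call to the inference server; the numbers here are illustrative, not the benchmark's actual parameters:

```python
# Minimal open-loop load-generator sketch: fire requests at a fixed QPS
# (rather than waiting for each response) and record per-request latency.
# `send_request` is a stand-in for the real POST to the inference endpoint.
import asyncio
import time

async def send_request() -> None:
    # Placeholder for the actual call to the inference server.
    await asyncio.sleep(0.01)

async def run_load(total_requests: int, qps: float) -> list[float]:
    latencies: list[float] = []

    async def timed() -> None:
        start = time.perf_counter()
        await send_request()
        latencies.append(time.perf_counter() - start)

    tasks = []
    for _ in range(total_requests):
        tasks.append(asyncio.create_task(timed()))
        await asyncio.sleep(1 / qps)  # open-loop: dispatch independent of completion
    await asyncio.gather(*tasks)
    return latencies

latencies = asyncio.run(run_load(total_requests=20, qps=50))
print(f"p50 latency: {sorted(latencies)[len(latencies) // 2]:.3f}s")
```

Varying `qps` across runs reproduces the low- and high-load scenarios described above.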

Empower’s engine consistently delivers higher throughput and lower latency in both high- and low-QPS scenarios, particularly under high load, where Empower’s latency is more than 20% lower than that of vLLM and LoRAX.

This is just the initial version of the Empower engine; we will continue to optimize it to perform even better!

Ready to start?

Deploy and serve your first fine-tuned LLM in 1 minute for free!
