The LLM Dilemma: Self-Hosted vs. Public API Solutions

How to choose the best LLM for your use case: an in-house vs. public API comparison.

Johan Leduc

In the field of artificial intelligence, Large Language Models (LLMs) are driving innovation and efficiency across various industries. With their growing importance, businesses and developers face a major decision: deploying self-hosted, fine-tuned LLMs or leveraging established public APIs. This article delves into this decision, emphasizing cost, performance, privacy, and computational considerations.

Privacy Considerations

Using a public API limits the control over data privacy. Third-party use of data sent to these APIs can’t be fully controlled. On the other hand, deploying a personally fine-tuned model allows for the application of Differential Privacy (DP) training techniques and the use of trusted infrastructure, offering a higher degree of data security and privacy.
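The core of DP training is the DP-SGD pattern: each example's gradient is clipped to a fixed norm, the clipped gradients are averaged, and Gaussian noise is added before the parameter update. Here is a minimal, library-free sketch of that step (in practice one would use a framework such as Opacus on top of PyTorch; the function name and all numbers here are illustrative):

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, seed=None):
    """Clip each example's gradient to `clip_norm`, sum, add Gaussian noise,
    and average — the core update of DP-SGD.

    `per_example_grads` is a list of gradient vectors (lists of floats).
    """
    rng = random.Random(seed)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    n = len(per_example_grads)
    sigma = noise_multiplier * clip_norm  # noise scales with the clipping bound
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]
```

The clipping bound limits any single example's influence on the update, which is what makes the added noise sufficient to give a formal privacy guarantee.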

Accuracy Trade-off

Deploying self-hosted models often means using smaller models than those available through public APIs. For instance, self-hosting a model on the scale of GPT-4, with its reported 1.7 trillion parameters, is out of reach for the average company. However, it's noteworthy that fine-tuning a more compact, open-source model such as Llama2 can achieve performance on par with these larger models, particularly for specialized tasks. This highlights a key advantage of self-hosted models: they can deliver robust performance in specific domains despite their smaller footprint.

Latency Trade-off

In-production system latency is a significant factor. Systems demanding low latency may suffer when relying on public APIs. Instances like recent ChatGPT outages due to DDoS attacks illustrate the vulnerability of external dependencies. Conversely, self-hosted models on specialized infrastructure can minimize such dependencies and offer enhanced performance through advanced hardware, crucial for businesses where quick response times are paramount.
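A common mitigation is to treat the public API as the primary backend with a strict timeout and fall back to a self-hosted endpoint when the API is slow or down. A minimal sketch (the `primary` and `fallback` callables are placeholders for real API and self-hosted clients):

```python
from concurrent.futures import ThreadPoolExecutor

def call_with_fallback(primary, fallback, prompt, timeout_s=2.0):
    """Try `primary` (e.g. a public API client) within `timeout_s` seconds;
    on timeout or error, return `fallback(prompt)` (e.g. a self-hosted model)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, prompt)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        return fallback(prompt)
    finally:
        pool.shutdown(wait=False)  # don't block on a hung primary call
```

This keeps the larger API model in the loop for quality while bounding worst-case latency with the self-hosted model.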

Cost Analysis

Integrating LLMs into operations involves various cost considerations, often a deciding factor. While the initial expense of fine-tuning and deploying LLMs requires iterations and specialized expertise, these costs can be competitive when compared to the long-term ongoing use of pre-deployed model APIs like OpenAI’s ChatGPT.

Comparing public APIs with self-hosted solutions is not straightforward: public APIs like OpenAI's ChatGPT are exposed as Function as a Service (FaaS) and billed by token usage, while self-hosted solutions typically incur hardware rental costs on Infrastructure as a Service (IaaS) platforms.

To illustrate, let’s compare generating 500 responses from the Alpaca dataset using various models:

  • OpenAI’s GPT models: We consider GPT-4, GPT-4-turbo, and GPT-3.5-turbo, focusing on their cost, runtime, and absence of direct hardware costs.
  • Open-source models: As proxies for self-hosted models, we examine Vicuna-7B and Llama2–13B, deployed on Huggingface’s inference endpoints, including their cost, runtime, and associated hardware rental costs.
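The FaaS side of the comparison is easy to estimate from token counts alone. The sketch below computes per-batch API cost; the per-1k-token prices are illustrative of OpenAI's pricing at the time of writing and should be checked against the current pricing page:

```python
def api_cost_usd(prompt_tokens, completion_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one API call given token counts and per-1k-token prices (USD)."""
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k

# Illustrative (input, output) prices in USD per 1k tokens — check the
# provider's pricing page for current values.
PRICES = {
    "gpt-4":         (0.03, 0.06),
    "gpt-4-turbo":   (0.01, 0.03),
    "gpt-3.5-turbo": (0.001, 0.002),
}

def batch_cost(model, n_requests, avg_prompt_toks, avg_completion_toks):
    """Total API cost for a batch of similar requests, e.g. 500 Alpaca prompts."""
    price_in, price_out = PRICES[model]
    return n_requests * api_cost_usd(avg_prompt_toks, avg_completion_toks,
                                     price_in, price_out)
```

The IaaS side, by contrast, is a flat hourly rental charge independent of token volume, which is why the two pricing models only become comparable once utilization is taken into account.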

The cost analysis reveals significant savings when deploying models on self-hosted infrastructure (e.g. Huggingface inference endpoints) compared to using OpenAI's ChatGPT. Specifically, hosting a 7B model on Huggingface at full capacity costs roughly half as much as serving the same volume with GPT-3.5. In other words, self-hosting becomes more economical than ChatGPT 3.5 once the self-hosted model is used at or above 50% of its capacity.

Similarly, deploying a 13B model on Huggingface offers even greater cost advantages: it's about nine times less costly than using GPT-4-turbo and twenty-six times cheaper than GPT-4. Therefore, for a fine-tuned 13B model that matches GPT-4 in performance, self-deployment proves cost-effective if the anticipated usage exceeds 10% of its capacity.
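The break-even reasoning above can be framed in one line: self-hosting wins once utilization exceeds the ratio of the hourly hardware cost to what the same hourly throughput would cost through the API. A sketch with purely illustrative numbers:

```python
def break_even_utilization(gpu_cost_per_hour, tokens_per_hour_at_capacity,
                           api_price_per_1k_tokens):
    """Fraction of capacity above which renting the GPU beats paying the API."""
    api_cost_at_capacity = (tokens_per_hour_at_capacity / 1000) * api_price_per_1k_tokens
    return gpu_cost_per_hour / api_cost_at_capacity

# Illustrative: if the API would charge $2 for an hour's worth of the
# endpoint's full throughput and the GPU rents for $1/h, the endpoint
# pays for itself at 50% utilization.
threshold = break_even_utilization(1.0, 1_000_000, 0.002)
```

The same formula reproduces the pattern in the text: the cheaper self-hosting is relative to the API at full capacity, the lower the utilization needed to justify it.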


In conclusion, the decision between self-hosted Large Language Models (LLMs) and public API solutions like OpenAI's ChatGPT is multifaceted. Public APIs offer ease of access and often larger, more powerful models, but they come with potential latency issues and less control over data privacy. Self-hosted models require more initial investment and deployment expertise, yet they can deliver on-par performance, especially on specialized tasks, along with greater control over data privacy. The cost analysis shows that while public APIs have straightforward pricing, self-hosted solutions can offer competitive long-term savings at substantial workloads, especially when the control and customization they allow are factored in. Ultimately, the choice depends on the specific needs and capabilities of the business or developer, weighing the trade-offs in performance, privacy, and cost to arrive at the most suitable solution.

About the author

Johan Leduc

Data Science Researcher @ Sarus

