LLMs to Prod

Deploying Large Language Models (LLMs) into production involves choosing the right tools and considering costs. Here’s a breakdown comparing TGI, vLLM, and SGlang, deploying Llama 3.2 70B (most likely Llama 3.1 70B, since the 3.2 family has no 70B variant), scaling on Kubernetes, and deciding between hosting your own model and using third-party APIs.

Comparing TGI, vLLM, and SGlang

  • TGI (Text Generation Inference) from Hugging Face is great for integrating with their ecosystem, supporting models like Llama, with tensor parallelism for performance.
  • vLLM, developed at UC Berkeley, offers high throughput with Paged Attention, ideal for fast inference on GPUs.
  • SGlang, by LMSYS Org, focuses on efficiency with low-latency serving, suitable for real-time applications.

Each can deploy Llama 3.1 70B, requiring multiple GPUs (e.g., 8 with 40 GB each). TGI is user-friendly for Hugging Face users, vLLM excels in speed, and SGlang offers scalability.

Deploying Llama 3.2 70B

Given Llama 3.2 models are 1B, 3B, 11B, and 90B, "Llama 3.2 70B" likely means Llama 3.1 70B. Deployment steps:

  • TGI: Use Hugging Face’s launcher, containerize, and distribute across GPUs.
  • vLLM: Leverage Paged Attention, deploy via container images, and scale with Kubernetes.
  • SGlang: Use its runtime, optimize for GPUs, and deploy on clusters.

Deploying on Kubernetes at Scale

All three can be containerized and deployed on Kubernetes:

  • Use Kubernetes for scaling, monitoring, and managing resources across nodes.
  • TGI has Kubernetes deployment examples in the Hugging Face TGI documentation.
  • vLLM and SGlang also support containerization, with scaling via Kubernetes pods.

Price Comparison and Hosting vs. Third-Party APIs

  • Hosting Your Own: Costs include GPU hardware (e.g., AWS A100s at ~$3/hour each, with 8 needed for Llama 3.1 70B, so ~$24/hour) plus maintenance. For high volume, it’s cheaper long-term.
  • Third-Party APIs: OpenAI charges per token (e.g., $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens for GPT-4-class models). For low volume, this is easier and more cost-effective.
  • Research suggests hosting is better for privacy and high usage, while APIs suit quick setups; a rough cost sketch follows this list.
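
For a rough feel of how these options compare, the sketch below computes monthly cost for both under assumed figures (8 GPUs at $3/hour for self-hosting and illustrative per-token API rates); replace the assumptions with your own workload numbers.

python

# Rough monthly cost comparison: self-hosting Llama 3.1 70B vs. a per-token API.
# Every figure below is an illustrative assumption, not a quote.

GPU_HOURLY_RATE = 3.00            # assumed $/hour per A100-class GPU
NUM_GPUS = 8                      # typical for Llama 3.1 70B in fp16
HOURS_PER_MONTH = 720             # 30-day month, running continuously

API_INPUT_PRICE = 0.01 / 1000     # assumed $ per input token
API_OUTPUT_PRICE = 0.03 / 1000    # assumed $ per output token

def self_host_monthly() -> float:
    """Fixed cost of keeping the GPU fleet up all month."""
    return GPU_HOURLY_RATE * NUM_GPUS * HOURS_PER_MONTH

def api_monthly(requests_per_day: int, in_tok: int = 500, out_tok: int = 250) -> float:
    """Variable cost of sending the same traffic to a token-priced API."""
    per_request = in_tok * API_INPUT_PRICE + out_tok * API_OUTPUT_PRICE
    return per_request * requests_per_day * 30

if __name__ == "__main__":
    print(f"Self-hosting: ~${self_host_monthly():,.0f}/month (fixed)")
    for rpd in (100, 1_000, 10_000, 100_000):
        # Results swing widely with tokens per request and model pricing.
        print(f"API at {rpd:>7,} req/day: ~${api_monthly(rpd):,.0f}/month")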

This guide helps decide based on your needs, with Kubernetes offering scalability for all options.


Survey Note: Comprehensive Analysis of Deploying LLMs into Production

Deploying Large Language Models (LLMs) into production is a complex task, requiring careful selection of tools, infrastructure, and cost management. This analysis compares Text Generation Inference (TGI), vLLM, and SGlang for deploying models like Llama 3.2 70B (likely Llama 3.1 70B given current offerings), discusses deployment on Kubernetes at scale, provides a rough price comparison, and evaluates hosting versus using third-party APIs. The analysis is based on current research and documentation as of February 24, 2025.

Introduction to LLM Deployment

Deploying LLMs in production involves setting up the model to handle real-world requests efficiently, managing resources, ensuring reliability, and maintaining performance. Given the computational demands of LLMs, specialized frameworks are essential for optimizing inference speed and memory usage. This survey explores three prominent tools: TGI, vLLM, and SGlang, focusing on their capabilities for deploying a large model like Llama 3.1 70B, scaling on Kubernetes, and cost implications.

Overview of Deployment Tools

  1. Text Generation Inference (TGI):
     • TGI, developed by Hugging Face, is a toolkit designed for deploying and serving Large Language Models (LLMs). It supports various open-source models, including Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX, making it versatile for production environments.
     • Key features include tensor parallelism for faster inference across multiple GPUs, optimized transformers code using Flash Attention and Paged Attention, and a simple API compatible with Hugging Face models. It is already in use by organizations like IBM and Grammarly, indicating robust production readiness.
     • TGI offers distributed tracing via OpenTelemetry and Prometheus metrics for monitoring, enhancing operational visibility.
  2. vLLM:
     • vLLM, originating from the Sky Computing Lab at UC Berkeley, is an open-source library for fast LLM inference and serving, with over 200k monthly downloads and an Apache 2.0 license. It is designed for high-throughput, memory-efficient serving, leveraging PagedAttention and continuous batching.
     • It supports distributed inference across multiple GPUs, quantization formats such as GPTQ, AWQ, INT4, INT8, and FP8, and optimized CUDA kernels including FlashAttention and FlashInfer. vLLM is particularly noted for up to 24x higher throughput than Hugging Face Transformers without model changes (a batch-generation sketch follows this list).
     • It is suitable for applications requiring parallel processing and streaming output, with integrations for platforms like SageMaker and LangChain.
  3. SGlang:
     • SGlang, developed by LMSYS Org, is a fast serving framework for LLMs and vision-language models, focusing on efficient execution of complex language model programs. It offers a flexible frontend language for programming LLM applications, including chained generation calls, advanced prompting, and multi-modal inputs.
     • It supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models, and reward models, with easy extensibility. SGlang introduces optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding, achieving up to 6.4x higher throughput compared to state-of-the-art systems.
     • Backed by an active community and supported by industry players like NVIDIA and xAI, SGlang is designed for scalability and low-latency inference, suitable for real-time applications.
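
To make the vLLM item above concrete, here is a minimal offline batch-generation sketch using its Python API. It deliberately uses the smaller Llama 3.1 8B Instruct checkpoint so it fits on a single GPU; serving the 70B model this way would require tensor_parallel_size=8 and matching hardware, and the model choice here is an assumption.

python

# Minimal vLLM offline batch inference sketch (continuous batching and
# PagedAttention happen under the hood). Model and sampling settings are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "Explain tensor parallelism in one sentence.",
    "What is a KV cache?",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# For Llama 3.1 70B you would pass tensor_parallel_size=8 on an 8-GPU node;
# a small model is used here so the sketch runs on a single GPU.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)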

Deploying Llama 3.2 70B: Clarification and Approach

The user query mentions "Llama 3.2 70B," but current documentation as of February 2025 indicates that Llama 3.2 models are available in sizes 1B, 3B, 11B, and 90B, with multimodal capabilities for 11B and 90B, and no 70B variant (Meta Llama). It is therefore likely the user intended Llama 3.1 70B, which is part of the Llama 3.1 collection with sizes ranging from 8B to 405B, released in July 2024 (Meta AI Blog). This analysis proceeds with Llama 3.1 70B as the target model.

  • System Requirements:
    • Llama 3.1 70B, with 70 billion parameters, requires significant computational resources. In float16 precision, each parameter consumes 2 bytes, totaling approximately 140 GB for model weights alone. During inference, additional memory is needed for the key-value (KV) cache, potentially requiring 5-10x more memory in high-throughput scenarios (a rough estimate is sketched after this list).
    • Deployment typically requires multiple GPUs, with recommendations including 8 GPUs with at least 40 GB VRAM each (e.g., NVIDIA A100 or H100). For example, deploying on AWS might use inf2.48xlarge instances with 12 Inferentia2 accelerators, or GPU instances like g5.48xlarge on EC2 (Deploy Llama 3 70B on AWS Inferentia2).
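
To put rough numbers on the requirements above, the sketch below estimates weight and KV-cache memory, assuming Llama-3-70B-style architecture values (80 layers, 8 KV heads via grouped-query attention, head dimension 128); adjust these assumptions to your exact model and precision.

python

# Back-of-the-envelope memory estimate for serving Llama 3.1 70B in fp16.
# Architecture numbers (layers, KV heads, head dim) are assumed from the
# published Llama-3-70B config; adjust if your model differs.

BYTES_FP16 = 2
PARAMS = 70e9

# Assumed Llama-3-70B-style architecture (grouped-query attention).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128

def weights_gb() -> float:
    """Memory for model weights alone."""
    return PARAMS * BYTES_FP16 / 1e9

def kv_cache_gb(batch_size: int, context_len: int) -> float:
    """KV cache: 2 tensors (K and V) per layer per token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16
    return batch_size * context_len * per_token / 1e9

if __name__ == "__main__":
    print(f"Weights: ~{weights_gb():.0f} GB")                      # ~140 GB
    print(f"KV cache (64 x 4096 tokens): ~{kv_cache_gb(64, 4096):.0f} GB")
    # Both have to fit across the GPUs, which is why 8 x 40-80 GB cards
    # (plus paging tricks like PagedAttention) are typically needed.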

Comparison of TGI, vLLM, and SGlang for Llama 3.1 70B

To compare these frameworks, we evaluate performance, ease of use, and specific support for deploying Llama 3.1 70B:

  • Performance:
    • TGI leverages tensor parallelism and optimized attention mechanisms for high-performance text generation. Benchmarks suggest it handles Llama 2 70B with 8 GPUs of 40 GB each, which is suitable for production, though latency may be higher for very large models (Deploying LLM Powered Applications with HuggingFace TGI).
    • vLLM, with PagedAttention and continuous batching, offers up to 24x higher throughput than Hugging Face Transformers, making it ideal for high-throughput scenarios. For Llama 3.1 70B, it supports CPU offloading on NVIDIA GH200 instances, expanding available memory (Serving Llama 3.1 8B and 70B using vLLM on an NVIDIA GH200 instance).
    • SGLang achieves up to 6.4x higher throughput compared to vLLM and TensorRT-LLM on tasks like agent control and JSON decoding, with optimizations like RadixAttention for KV cache reuse (SGLang: Efficient Execution of Structured Language Model Programs). It is designed for low-latency, real-time applications. A simple throughput-measurement sketch follows this list.
  • Ease of Use:
    • TGI offers a simple launcher and integration with the Hugging Face Hub, making it user-friendly for developers familiar with that ecosystem, though distributed setups may require additional configuration.
    • vLLM is noted for its ease of deployment, with pre-configured environments on platforms like AWS and support for Hugging Face models, requiring minimal setup for inference (Deploy Large Language Model (LLM) with vLLM on K8s).
    • SGLang provides an intuitive interface for programming LLM applications, but being relatively new, it has fewer community resources than TGI and vLLM (SGLang GitHub).
  • Specific Support for Llama 3.1 70B:
    • All three frameworks support Llama models, with TGI explicitly documented for Llama 2 and 3 deployments (Deploy Llama 2 70B on AWS Inferentia2 with Hugging Face Optimum). For Llama 3.1 70B, TGI’s tensor parallelism is effective.
    • vLLM has detailed guides for deploying Llama 3.1 70B, including on NVIDIA GH200 with CPU offloading, making it suitable for memory-constrained environments (Run Llama 3 on Dell PowerEdge XE9680 and AMD MI300x with vLLM).
    • SGLang supports Llama 3.1 70B, with benchmarks showing superior throughput compared to vLLM and TensorRT-LLM, particularly for complex tasks (Achieving Faster Open-Source Llama3 Serving with SGLang Runtime).

Deploying Llama 3.1 70B with Each Framework

As clarified above, "Llama 3.2 70B" is taken to mean Llama 3.1 70B, part of the Llama 3.1 collection released in July 2024 (Meta AI Blog). Here’s how to deploy it with each framework:

TGI Deployment

  1. Prerequisites: Install TGI (or use its Docker image) and ensure 8 GPUs with at least 40 GB VRAM each.
  2. Steps:
     • Download the model: huggingface-cli download meta-llama/Llama-3.1-70B-Instruct.
     • Start the server: text-generation-launcher --model-id meta-llama/Llama-3.1-70B-Instruct --num-shard 8.
     • A minimal client sketch for querying the running server follows these steps.
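
If the launcher starts successfully, the server can be queried over HTTP. Below is a minimal client sketch for TGI's /generate route; the host and port (http://localhost:8080) are assumptions that depend on how you expose the server.

python

# Minimal client for TGI's native /generate route. Host and port are
# assumptions; adjust to wherever your TGI server is exposed.
import requests

TGI_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "What is tensor parallelism?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

resp = requests.post(TGI_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])
# TGI also exposes /generate_stream for token streaming and, in recent
# versions, an OpenAI-style /v1/chat/completions route.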

vLLM Deployment

  1. Prerequisites: Install vLLM (pip install vllm) and ensure GPU compatibility.
  2. Steps:
     • Download the model weights (or let vLLM pull them from the Hugging Face Hub on first start).
     • Start the OpenAI-compatible server: vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8.
     • A client sketch against the running server follows these steps.
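
vllm serve exposes an OpenAI-compatible API (port 8000 by default), so the standard openai Python client works against it. The following streaming sketch assumes the server from the steps above is running locally.

python

# Streaming chat completion against the vLLM OpenAI-compatible server
# started above (default port 8000). Model name mirrors the served model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Give me three LLM serving tips."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Final chunks may carry no content, so guard before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()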

SGlang Deployment

  1. Prerequisites: Install SGlang (pip install "sglang[all]") and ensure GPU support.
  2. Steps:
     • Start the server: python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --tp 8.
     • A short frontend-language sketch against the running server follows these steps.
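
With the SGlang runtime started (port 30000 by default), its frontend language can chain generation calls against the server. The sketch below is a minimal example; the prompts and variable names are illustrative.

python

# Minimal SGLang frontend-language sketch: two chained generation calls
# served by the runtime started above (default port 30000 is assumed).
import sglang as sgl

@sgl.function
def qa_then_summary(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
    s += sgl.user("Summarize your answer in one sentence.")
    s += sgl.assistant(sgl.gen("summary", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = qa_then_summary.run(question="Why does KV cache reuse speed up serving?")
print(state["answer"])
print(state["summary"])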

Deploying on Kubernetes at Scale

All three can be containerized and deployed on Kubernetes for scalability:

TGI on Kubernetes

  • Create a cluster with GPU support using gcloud container clusters create.
  • Define a Deployment:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-server
  template:
    metadata:
      labels:
        app: tgi-server
    spec:
      containers:
        - name: tgi-server
          # Official TGI container image
          image: ghcr.io/huggingface/text-generation-inference:latest
          args:
            - --model-id=meta-llama/Llama-3.1-70B-Instruct
            - --num-shard=8
          # Gated Llama weights typically also require a HUGGING_FACE_HUB_TOKEN
          # environment variable, e.g. from a Kubernetes Secret.
          resources:
            limits:
              nvidia.com/gpu: 8
  • Apply with kubectl apply -f deployment.yaml.

vLLM on Kubernetes

  • Create a cluster with GPU support.
  • Define a Deployment:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm-server
          # Official vLLM OpenAI-compatible server image
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-3.1-70B-Instruct
            - --tensor-parallel-size=8
          resources:
            limits:
              nvidia.com/gpu: 8
  • Apply with kubectl apply -f deployment.yaml.

SGlang on Kubernetes

  • Create a cluster with GPU support.
  • Define a Deployment:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang-server
  template:
    metadata:
      labels:
        app: sglang-server
    spec:
      containers:
        - name: sglang-server
          # SGLang's published container image (LMSYS)
          image: lmsysorg/sglang:latest
          command: ["python3", "-m", "sglang.launch_server"]
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --tp=8
            - --host=0.0.0.0
          resources:
            limits:
              nvidia.com/gpu: 8
  • Apply with kubectl apply -f deployment.yaml. A smoke-test sketch for any of these Deployments follows.
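
After applying any of the Deployments above, a quick smoke test confirms the pod is actually serving before you wire up a Service or Ingress. The sketch below assumes a kubectl port-forward to the pod (e.g., kubectl port-forward deployment/vllm-server 8000:8000; TGI and SGLang use different container ports) and polls the /health route that all three frameworks expose.

python

# Simple smoke test for a port-forwarded serving pod. The local URL is an
# assumption, e.g. after `kubectl port-forward deployment/vllm-server 8000:8000`.
import sys
import time
import requests

BASE_URL = "http://localhost:8000"

def wait_healthy(timeout_s: int = 300) -> bool:
    """Poll the /health route (exposed by TGI, vLLM, and SGLang) until it returns 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass
        time.sleep(10)  # 70B weights can take minutes to download and shard
    return False

if __name__ == "__main__":
    if wait_healthy():
        print("Server is healthy and ready for traffic.")
    else:
        sys.exit("Server did not become healthy in time.")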

Scaling Considerations on Kubernetes

Kubernetes is the standard container orchestration system for managing applications at scale, and all three frameworks can be deployed on it using containerization. Beyond the basic Deployment manifests shown above, consider the following when scaling:

  • TGI on Kubernetes:
    • Hugging Face provides examples for deploying TGI on Kubernetes, using container images and managing with Kubernetes pods. For Llama 3.1 70B, distribute across multiple GPUs with tensor parallelism and ensure sufficient GPU resources per node (Deploy Meta Llama 3 8B with TGI DLC on GKE).
    • Scaling involves adjusting pod replicas based on load, with monitoring via Kubernetes metrics and OpenTelemetry for distributed tracing.
  • vLLM on Kubernetes:
    • vLLM supports deployment on Kubernetes, with guides for using GPU operators and scaling on platforms like Azure Kubernetes Service (AKS). For Llama 3.1 70B, use container images and configure for distributed inference, leveraging PagedAttention for efficiency (Deploy Large Language Model (LLM) with vLLM on K8s).
    • Scaling can be automated with the Horizontal Pod Autoscaler, monitoring GPU utilization and throughput (a replica-count sketch follows this list).
  • SGlang on Kubernetes:
    • SGlang can be containerized and deployed on Kubernetes, with its runtime optimized for GPU utilization. For Llama 3.1 70B, configure for high throughput, using Kubernetes for dynamic workload distribution (SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD Instinct GPUs).
    • Scaling involves managing pod distribution across nodes, ensuring low-latency inference for real-time applications.
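
The Horizontal Pod Autoscaler's core formula is simple enough to reason about directly: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies it to a throughput-style custom metric such as requests per second per pod; the target value is an assumption to tune per framework and model.

python

# HPA-style replica calculation for an LLM serving Deployment, using a
# custom metric such as requests/sec per pod. Target values are assumptions.
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Kubernetes HPA formula: ceil(current * current_metric / target_metric),
    clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# Example: 2 replicas currently serving 9.5 req/s each, target 4 req/s per pod.
print(desired_replicas(current_replicas=2, current_metric=9.5, target_metric=4.0))
# -> 5 replicas; note that each replica of a 70B deployment consumes 8 GPUs,
# so max_replicas must respect the GPU capacity of the cluster.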

Rough Price Comparison and Hosting vs. Third-Party APIs

To compare the cost of hosting your own LLM versus using a third-party API, consider the following factors:

  • Hosting Your Own LLM:
    • Hardware Costs: Deploying Llama 3.1 70B requires significant GPU resources, typically 8 GPUs with 40 GB VRAM each (e.g., NVIDIA A100 or H100). On AWS, A100 capacity might cost approximately $3 per GPU-hour, totaling ~$24/hour for 8 GPUs, or roughly $17,280/month for continuous operation (Self-Hosting LLaMA 3.1 70B (or any ~70B LLM) Affordably). On-premises, initial hardware costs (e.g., ~$25,000 per H100 GPU) and power consumption add to the expense.
    • Operational Costs: Include maintenance, cooling, and power, which can be significant for on-premises setups. Cloud hosting reduces some operational overhead but increases hourly costs.
    • Break-even Analysis: For high-volume usage (e.g., 1000+ requests/day), hosting can be cost-effective long-term, especially with optimizations like quantization reducing memory needs (a rough break-even sketch follows this list).
  • Third-Party APIs (e.g., OpenAI):
    • OpenAI’s pricing is token-based, with GPT-4 Turbo at $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens as of February 2025 (OpenAI Pricing). For Llama 3.1 70B, API providers like Replicate or Groq offer competitive rates, with some at around $0.88 per 1M blended tokens (Llama 3 70B - Intelligence, Performance & Price Analysis).
    • For low to medium usage (e.g., 100-1000 requests/day), APIs are more cost-effective, with costs below $100/month for small volumes; scaling up (e.g., 2000 requests/day of heavy usage) can reach $2000/month, at which point hosting becomes more viable (Is Hosting Your Own LLM Cheaper than OpenAI?).
  • Comparison Table (rough estimates):

Aspect                   | Hosting Your Own (Llama 3.1 70B)       | Third-Party API (e.g., OpenAI)
Initial Cost             | High (hardware ~$200,000 for 8 H100s)  | Low (no hardware needed)
Monthly Cost (Low Use)   | ~$17,280 (cloud, continuous)           | ~$100 (1000 requests/day)
Monthly Cost (High Use)  | ~$17,280 (fixed)                       | ~$2000 (2000 requests/day)
Privacy                  | High (on-premises control)             | Low (data sent to provider)
Scalability              | High (Kubernetes, custom scaling)      | Medium (API limits)
Ease of Deployment       | Medium (setup complexity)              | High (quick integration)
  • Decision Factors:
  • Host your own LLM for high data privacy needs (e.g., finance, healthcare), large-scale usage where API costs escalate, and when customization is critical. It’s also suitable for long-term cost savings at high volumes.
  • Use third-party APIs for quick deployment, low to medium usage, and when infrastructure management is a burden. They offer ease of use but may compromise on privacy and cost at scale.
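
Using the rough figures above (about $17,280/month for a continuously running 8-GPU cloud deployment versus roughly $0.88 per 1M blended tokens from an API provider), the break-even point can be estimated directly; both prices are assumptions to replace with your own quotes.

python

# Break-even estimate: at what monthly token volume does self-hosting
# Llama 3.1 70B beat a per-token API? Both prices are rough assumptions.

SELF_HOST_MONTHLY = 24 * 720        # ~$24/hour for 8 GPUs over a 30-day month (~$17,280)
API_PRICE_PER_MILLION = 0.88        # blended $/1M tokens (provider-dependent)

break_even_tokens = SELF_HOST_MONTHLY / API_PRICE_PER_MILLION * 1_000_000

print(f"Self-hosting fixed cost: ~${SELF_HOST_MONTHLY:,.0f}/month")
print(f"Break-even volume: ~{break_even_tokens / 1e9:.1f} billion tokens/month")
# Roughly 20B tokens/month: below that, the API is cheaper on raw cost;
# above it (or when privacy and customization matter), self-hosting wins.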

Conclusion

Research suggests that TGI, vLLM, and SGlang are all viable for deploying LLMs like Llama 3.1 70B, each with distinct strengths: TGI for ecosystem integration, vLLM for throughput, and SGlang for efficient, low-latency serving. Deploying on Kubernetes at scale is feasible for all three, with containerization and autoscaling options. The evidence leans toward hosting your own LLM being cost-effective for high-volume use, while third-party APIs suit lower volumes. One detail worth noting is that no Llama 3.2 70B exists; the intended model is almost certainly Llama 3.1 70B, which underscores the importance of being precise about model versions. Choose based on your usage volume, privacy needs, and infrastructure capabilities.
