
Open Source Inference at Full Throttle: Exploring TGI and vLLM

Ian Kelk · Developer Relations · 11 min read

Large language models (LLMs) have received enormous attention ever since ChatGPT first appeared at the end of 2022. ChatGPT represented a breakthrough in AI language models, surprising everyone with its ability to generate human-like text. However, it came with a notable limitation: the model could only be accessed via OpenAI’s servers. Users could interact with ChatGPT through a web interface, but they had no access to the underlying architecture or model weights. Although OpenAI added the underlying GPT-3.5 model to its API a few months later, the models still resided on remote servers, and their weights couldn’t be changed. While this was necessary given the models’ enormous computational requirements, it naturally raised questions about privacy and access: all data could be read by OpenAI, and an external internet connection was required.

Two years later, the situation has changed dramatically. With the rising availability of open-weights alternatives like Meta’s Llama models, we now have multiple options for running LLMs locally on our own hardware. Access is no longer tethered to cloud-based infrastructure.

Some key points I'll address here are:
  • The transition from server-based LLMs like ChatGPT to locally runnable models, enabling customization and offline usage.
  • The role of inference engines, which execute a neural network using its learned parameters so that models can run locally.
  • Introduction to PagedAttention in vLLM, improving memory efficiency through better key-value cache management.
  • A comparison of TGI and vLLM, highlighting shared features such as tensor parallelism and batching, and distinct features like speculative decoding and structured output guidance.
  • Explanation of latency and throughput, including how these performance metrics influence LLM deployments.
  • Advice on selecting between TGI and vLLM based on specific enterprise needs, focusing on use case experimentation and benchmarking.
  • An overview of licensing differences, discussing TGI's shift back to Apache 2.0, aligning with vLLM’s license.
  • Practical code examples showing how to deploy models using both TGI and vLLM for real-world applications (a first taste is sketched just after this list).
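
To make that last point concrete before diving in, here is a minimal client-side sketch of what talking to each engine looks like once a server is up. It assumes a TGI server listening on localhost:8080 and a vLLM OpenAI-compatible server on localhost:8000; the ports, model ID, and generation parameters are illustrative placeholders, not prescriptions.

```python
# Minimal sketch: querying locally running TGI and vLLM servers.
# Assumes TGI on localhost:8080 and vLLM's OpenAI-compatible server on
# localhost:8000; model ID, ports, and parameters are illustrative.
import requests

PROMPT = "Explain the difference between latency and throughput in one sentence."

# TGI exposes a /generate endpoint taking "inputs" plus "parameters".
tgi_response = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": PROMPT, "parameters": {"max_new_tokens": 64}},
    timeout=60,
)
print("TGI:", tgi_response.json()["generated_text"])

# vLLM serves an OpenAI-compatible /v1/completions endpoint.
vllm_response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # whatever model the server was launched with
        "prompt": PROMPT,
        "max_tokens": 64,
    },
    timeout=60,
)
print("vLLM:", vllm_response.json()["choices"][0]["text"])
```

The commands for actually launching the servers behind these two endpoints are covered in the deployment examples later in the post.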