
Open Source Inference at Full Throttle: Exploring TGI and vLLM

· 11 min read
Ian Kelk
Developer Relations

Large language models (LLMs) have received a huge amount of attention ever since ChatGPT first appeared at the end of 2022. ChatGPT represented a major breakthrough in AI language models, surprising everyone with its ability to generate human-like text. However, it came with a notable limitation: the model could only be accessed via OpenAI’s servers. Users could interact with ChatGPT through a web interface, but they had no access to the underlying architecture or model weights. Although OpenAI added API access to the underlying GPT-3.5 model a few months later, the models still resided on remote servers, and their weights couldn’t be changed. While this was necessary given the model's enormous computational requirements, it raised questions about privacy and access: all data passed through OpenAI's servers, and an internet connection was required.

Two years later, the situation has changed dramatically. Thanks to the rising availability of open-weights alternatives like Meta’s Llama models, we now have multiple options for running LLMs locally on our own hardware. Access is no longer tethered to cloud-based infrastructure.

Some key points I'll address here are:
  • The transition from server-based LLMs like ChatGPT to locally runnable models, enabling customization and offline usage.
  • The role of inference engines in executing neural networks using learned parameters for local model inference.
  • Introduction to PagedAttention in vLLM, improving memory efficiency through better key-value cache management.
  • A comparison of TGI and vLLM, highlighting shared features such as tensor parallelism and batching, and distinct features like speculative decoding and structured output guidance.
  • Explanation of latency and throughput, including how these performance metrics influence LLM deployments.
  • Advice on selecting between TGI and vLLM based on specific enterprise needs, focusing on use case experimentation and benchmarking.
  • An overview of licensing differences, discussing TGI's shift back to Apache 2.0, aligning with vLLM’s license.
  • Practical code examples showing how to deploy models using both TGI and vLLM for real-world applications.
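
As a quick preview of that last point, here is a minimal sketch of local inference with vLLM's offline Python API. The model name is only an example; substitute any open-weights checkpoint you have access to, and expect argument details to vary between vLLM releases.

    from vllm import LLM, SamplingParams

    # Load an open-weights model locally (example model name; weights download on first use).
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

    outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
    print(outputs[0].outputs[0].text)

TGI, by contrast, is usually run as a standalone server (for example via its Docker image) and queried over HTTP. One way to do that from Python is the huggingface_hub client, assuming a TGI server is already listening on localhost port 8080:

    from huggingface_hub import InferenceClient

    client = InferenceClient("http://localhost:8080")  # address of a running TGI server (assumed)
    print(client.text_generation("Explain PagedAttention in one paragraph.", max_new_tokens=128))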

Secret LLM chickens II: Tuning the chicken

· 17 min read
Ian Kelk
Developer Relations

When working with an LLM, sometimes it doesn't generate responses in the way you want. Maybe it's being too creative and weird when tasked with serious prompts ("Write me a cover letter for a programming job" → "I am a coding wizard and I always min-max my character!"), or it's being too serious when you want to do some creative writing ("Write me a story" → "You are tired and you lie down and go to sleep. The end."). This can be tweaked by making certain adjustments to the sampling mechanism—aka "the chicken."

This blog post continues from my previous article, The secret chickens that run LLMs, and you should read that first to understand what a "stochastic chicken" is.

Some key points I'll address here are:
  • The "chicken" can be tuned using inference hyperparameters like temperature, top-k, and top-p. These serve as dials to fine-tune the randomness introduced by the stochastic process, balancing creativity and coherence in the text the model generates (a toy sketch of these dials follows this list).
  • Adjusting the temperature parameter can make the model's outputs more predictable and less random at lower values, or more diverse and less deterministic at higher values.
  • Modifying the top-k and top-p parameters fine-tunes the sampling process by limiting the set of possible next words.
  • Top-k restricts the model to choose from the k most likely next words, while top-p uses a probability threshold to create a dynamic set of options. These tweaks help balance creativity with coherence, allowing the LLM to better meet specific needs or experimental conditions.
  • Even when using top-k, the astronomical number of possible text sequences challenges the idea of detecting originality and plagiarism. It's nearly impossible to prove the source of any specific piece of text generated by these models, although LLM-generated text can be recognizable due to the language and style used.
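
To make these dials concrete, here is a toy sketch (mine, not from the article) of how temperature, top-k, and top-p reshape a next-token probability distribution before the random draw. The function name and default values are illustrative, not any particular library's API.

    import numpy as np

    rng = np.random.default_rng()

    def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
        # Temperature: lower values sharpen the distribution, higher values flatten it.
        logits = np.asarray(logits, dtype=np.float64) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                              # softmax

        order = np.argsort(probs)[::-1][:top_k]           # top-k: keep only the k most likely tokens
        cum = np.cumsum(probs[order])
        order = order[: np.searchsorted(cum, top_p) + 1]  # top-p: smallest set whose mass reaches top_p

        p = probs[order] / probs[order].sum()             # renormalize the surviving candidates
        return int(rng.choice(order, p=p))                # the random draw (the "chicken" pecks here)

Setting temperature close to zero with top_k=1 makes the choice effectively deterministic; raising temperature and loosening top-k/top-p widens the set of plausible next words.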

The secret chickens that run LLMs

· 16 min read
Ian Kelk
Developer Relations

Humans often organize large, skilled groups to undertake complex projects and then bizarrely place incompetent people in charge. Large language models (LLMs) such as OpenAI GPT-4, Anthropic Claude, and Google Gemini carry on this proud tradition with my new favorite metaphor of who has the final say in writing the text they generate—a chicken.

There is now a sequel to this article, Secret LLM chickens II: Tuning the chicken, if you'd like to learn how and why the "chicken" can be customized.

Some key points I'll address here are:
  • Modern LLMs are huge and incredibly sophisticated. However, for every word they generate, they have to hand their predictions over to a simple, random function to pick the actual word.
  • This is because neural networks are deterministic, and without the inclusion of randomness, they would always produce the same output for any given prompt.
  • These random functions are no smarter than a chicken pecking at differently-sized piles of feed to choose the next word.
  • Without these "stochastic chickens," large language models wouldn't work due to problems with repetitiveness, lack of creativity, and contextual inappropriateness.
  • It's nearly impossible to prove the originality or source of any specific piece of text generated by these models.
  • The reliance on these "chickens" for text generation illustrates a fundamental difference between artificial intelligence and human cognition.
  • LLMs can be considered either deterministic or stochastic, depending on your point of view.
  • The "stochastic chicken" isn't the same as the paradigm of the "stochastic parrot."
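
A tiny, made-up illustration of that deterministic/stochastic split: the scores the network assigns to candidate next words never change for a given prompt, so always taking the top-scoring word yields the same text every time, and it is only the random draw that introduces variety. The word list and probabilities below are invented.

    import numpy as np

    words = ["lays", "clucks", "pecks", "sleeps"]
    probs = np.array([0.55, 0.25, 0.15, 0.05])    # made-up next-word probabilities for "The chicken ..."

    print(words[int(np.argmax(probs))])           # deterministic: always picks "lays"

    rng = np.random.default_rng()
    print([words[i] for i in rng.choice(len(words), size=5, p=probs)])  # stochastic: varies run to run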

LLMs are forward thinkers, and that's a bit of a problem

· 12 min read
Ian Kelk
Developer Relations

This is going to be a weird post. And we're going to start with a thought experiment about a shark and an octopus.

Some key points I'll address here are:
  • Human brains are able to invent ideas without relying on a strictly linear train of thought.
  • LLMs like ChatGPT are autoregressive and are unable to continue a dialogue if they haven't already generated everything up to that point. This is because they don't "think" per se, but progressively generate a response using the parts of the response they previously created.
  • If you try to get an LLM to write text in the middle of a dialogue without previous context, it will give near-identical answers and attempt to conclude the conversation.
  • Prompting for "ridiculous" answers can spark creativity that helps break this pattern.
  • The reliance on a linear train of thought is a limitation for general intelligence. LLMs are ineffective if you ask them to generate the second part of a response without allowing them to generate the first part.

As I mentioned, this is going to sound a bit silly, but I promise there is a point!
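
To make the autoregressive point above concrete, here is a bare-bones sketch of the generation loop; generate_next_probs and pick_token are stand-ins for the model and the sampler, not real library calls.

    def generate(generate_next_probs, pick_token, prompt_tokens, max_new_tokens=50):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            probs = generate_next_probs(tokens)   # conditioned only on what exists so far
            tokens.append(pick_token(probs))      # new tokens are appended, never inserted mid-stream
        return tokens

There is no step at which the model can jump ahead and write the second half of an answer first; every new token is appended to, and conditioned on, what has already been written.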

How ChatGPT fools us into thinking we're having a conversation

· 9 min read
Ian Kelk
Developer Relations

Remember the first time you used ChatGPT and how amazed you were to find yourself having what appeared to be a full-on conversation with an artificial intelligence? While ChatGPT was (and still is) mind-blowing, it uses a few tricks to make things appear more familiar.

While the title of this article is a bit tongue-in-cheek, it isn't clickbait. ChatGPT does indeed use two notable hidden techniques to simulate human conversation, and the more you know about how they work, the more effectively you can use the technology.

Some key points I'll address here are:
  • ChatGPT has no idea who you are and has no memory of talking to you at any point in the conversation.
  • It simulates conversations by "reading" the whole chat from the start each time.
  • As a conversation gets longer, ChatGPT starts removing pieces of the conversation from the start, creating a rolling window of context.
  • Because of this, very long chats will forget what was mentioned at the beginning.
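
A simplified sketch of that rolling window: the token budget, the count_tokens helper, and the policy of always dropping the oldest turn first are assumptions for illustration, not ChatGPT's actual implementation (real systems often pin the system prompt, for example).

    def fit_to_context(messages, count_tokens, max_tokens=4096):
        kept = list(messages)
        # Drop whole turns from the start until the conversation fits the budget;
        # this is why very long chats "forget" what was said at the beginning.
        while kept and sum(count_tokens(m) for m in kept) > max_tokens:
            kept.pop(0)
        return kept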