This blog post continues from my previous article, The secret chickens that run LLMs, and you should read that first to understand what a "stochastic chicken" is.
As in my previous "secret chickens" post, I frequently use the term word instead of token to describe what an LLM predicts. This is just a useful simplification to avoid having to address why the model might predict half a word or a semicolon, since the underlying principles of the "stochastic chicken" are the same.
If you've ever dug into the optional tuning knobs of an LLM provider API, you might have seen references to tuning "inference hyperparameters" to control how "creative" an LLM can be. These are ways of finding a balance between coherence (sticking to a topic or narrative) and creativity (introducing novel ideas or phrases).
When training a neural network, a hyperparameter is a value used to control the learning process and is set before training a model, as opposed to the parameters which are learned during training (and are part of the model itself). Hyperparameters influence various aspects of training including the structure of the neural network (like the number of layers and the number of units per layer), as well as how the network is trained (like the learning rate, batch size, and the number of epochs). These last three, explained in simplified terms, are: the learning rate (how large a step the network takes each time it adjusts its parameters), the batch size (how many training examples are processed before each adjustment), and the number of epochs (how many complete passes are made over the training data).
An "inference hyperparameter" refers to a configuration for how a model, particularly a language model, produces predictions. These settings are used after training by the end users. The most common ones are temperature, top-k, and top-p, which control the behavior of the model when generating text. They influence aspects like diversity, randomness, and the overall style of the output. Unlike traditional hyperparameters set prior to training to guide the learning process, inference hyperparameters are set to optimize performance for specific tasks when actually using the model.
What may come as a surprise is that these inference hyperparameters don't modify the massive underlying neural network part of these models at all.
They affect the chicken.
Hyperparameters are set when training a model, and inference hyperparameters are set when using a trained model.
Let's take a closer look at these "chicken tuning" mechanisms.
Temperature is an inference hyperparameter used to control the randomness of predictions by scaling the logits (the raw output scores from the model) before applying the softmax function to get probabilities. The temperature parameter basically adjusts how conservative or adventurous the model's predictions are.
When the temperature is set to 1, it has no effect on the logits, and the model produces outcomes based on the trained probabilities. This is the default setting, where no scaling is applied, and the chicken's choices won't be affected.
Lowering the temperature makes the model more confident in its outputs by increasing the gap between the probability of the most likely outcomes and the rest. This makes the chicken's choices less diverse and more deterministic; how could any chicken ignore that huge pile by the word "hunt"? A lower temperature is useful when you want the model to take fewer risks and stick closely to the most likely predictions. This can, however, lead to a lack of variety and potentially more repetitive outputs since, as the temperature approaches $0$, it approximates greedy decoding (with all the problems previously discussed).
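As a minimal sketch of the idea, here is temperature applied directly to logits before the softmax. The function name `softmax_with_temperature` and the logit values are made up for illustration; they aren't taken from any real model or library.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability; the result is unchanged
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

# Hypothetical logits for six candidate words (illustrative values only)
example_logits = np.array([0.5, 1.0, 1.7, 2.1, 2.4, 2.6])

baseline = softmax_with_temperature(example_logits, temperature=1.0)
cold = softmax_with_temperature(example_logits, temperature=0.05)

print("T=1.0 :", np.round(baseline, 3))
print("T=0.05:", np.round(cold, 3))
```

With a temperature of 1 the result is the ordinary softmax, while a very small temperature piles nearly all of the probability onto the single most likely word, which is why it behaves like greedy decoding.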
Tell me a story about a cat.
Once upon a time, in a cozy little house on the edge of town, lived a cat named Whiskers. Whiskers was a fluffy, grey tabby who loved nothing more than lounging in the sunbeams that danced through the living room windows. Each day, Whiskers would embark on a small adventure in the backyard, chasing butterflies and exploring the garden. In the evenings, Whiskers would curl up on the lap of their owner, purring contentedly as they were petted. Life was simple and serene for Whiskers, the beloved cat.
Using our original distribution with probabilities of 0.03, 0.05, 0.1, 0.15, 0.2, and 0.25, how would these change if we used a temperature of 0.5?
Given a distribution of probabilities $P = [0.03, 0.05, 0.1, 0.15, 0.2, 0.25]$ and a temperature $T = 0.5$, we adjust the probabilities as follows:
1. Compute the logits: In this context, you can think of logits as the pre-softmax outputs that the model uses to calculate probabilities. However, since we start with probabilities and want to adjust them by temperature, we reverse-engineer logits by taking the natural logarithm of the probabilities. Thus, the logit for each probability $p_i$ in $P$ is given by:
$\text{logit}(p_i) = \log(p_i)$
2. Scale the logits by the temperature: We then scale these logits by dividing them by the temperature $T$. This step adjusts the distribution of the logits based on the temperature value. For a temperature of $0.5$, the scaling is:
$\text{scaled\_logit}(p_i) = \frac{\log(p_i)}{T}$
3. Convert the scaled logits back to probabilities: We use the softmax function to convert the scaled logits back into probabilities. The softmax function is applied to the scaled logits, ensuring that the output probabilities sum to $1$. The softmax of a scaled logit $s_i$ is given by:
$\text{softmax}(s_i) = \frac{e^{s_i}}{\sum_j e^{s_j}}$
where $s_i$ is the scaled logit for probability $p_i$, and the denominator is the sum of the exponential of all scaled logits in the distribution. This ensures that the adjusted probabilities sum to $1$.
Putting it all together for each probability $p_i$ in $P$ and a temperature $T = 0.5$, the adjusted probability $p_i'$ is calculated as:
$p_i' = \frac{e^{\frac{\log(p_i)}{0.5}}}{\sum_j e^{\frac{\log(p_j)}{0.5}}}$
This formula shows how each original probability is transformed under the influence of the temperature. By applying this process to our original probabilities with $T = 0.5$, we enhance the differences between them, making the distribution more "peaky" towards higher probabilities, as seen with the new probabilities approximately becoming 0.007, 0.018, 0.072, 0.163, 0.289, and 0.452.
Step 1 is likely not necessary in practice, since the model's outputs would likely be logits, and thus transforming probabilities back into logits isn't needed. However, since we started with probabilities for illustrative purposes, we transformed them to logits in the example.
import numpy as np
# Original probabilities
probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])
# Temperature
temperature = 0.5
# Adjusting probabilities with temperature
adjusted_probabilities = np.exp(np.log(probabilities) / temperature)
adjusted_probabilities /= adjusted_probabilities.sum()
adjusted_probabilities
Output
array([0.00650289, 0.01806358, 0.07225434, 0.16257225, 0.28901734,
0.4515896 ])
Increasing the temperature makes the model's predictions more uniform by reducing the differences between the logits. In the image below, while the highest probability has only been reduced slightly (from $0.25$ to $0.24$), the lowest probability of $0.03$ has nearly tripled to $0.09$. This means all the words in the vocabulary now have a more equal chance of being chosen by the chicken. This leads to higher randomness in the output, allowing for more diverse and sometimes more creative or unexpected predictions; a higher temperature is useful when you want the model to explore less likely options or when generating more varied and interesting content. However, too high a temperature might result in nonsensical or highly unpredictable outputs, because the model will start considering very low probability words that make no refrigerator.
Tell me a story about a cat.
In the neon-lit streets of Neo-Tokyo, a cybernetic cat named Z3-R0 roamed, its AI brain whirring with thoughts. Tasked with the mission of uncovering a hidden data drive that could save the city from imminent doom, Z3-R0 leaped from rooftop to rooftop, its metallic tail flickering with electric sparks. Along the way, Z3-R0 encountered a gang of robo-rats plotting their own scheme. Using its laser claws and quick wits, Z3-R0 outmaneuvered the rats, secured the data drive, and raced against the clock to deliver it to the rebel base. In the end, Z3-R0 wasn't just a cat; it was a hero of the digital night.
Using our original distribution with probabilities of 0.03, 0.05, 0.1, 0.15, 0.2, and 0.25, how would these change if we used a temperature of 2?
Given a distribution of probabilities $P = [0.03, 0.05, 0.1, 0.15, 0.2, 0.25]$ and a temperature $T = 2$, we adjust the probabilities as follows:
1. Compute the logits: In this context, you can think of logits as the pre-softmax outputs that the model uses to calculate probabilities. However, since we start with probabilities and want to adjust them by temperature, we reverse-engineer logits by taking the natural logarithm of the probabilities. Thus, the logit for each probability $p_i$ in $P$ is given by:
$\text{logit}(p_i) = \log(p_i)$
2. Scale the logits by the temperature: We then scale these logits by dividing them by the temperature $T$. This step adjusts the distribution of the logits based on the temperature value. For a temperature of $2$, the scaling is:
$\text{scaled\_logit}(p_i) = \frac{\log(p_i)}{T}$
3. Convert the scaled logits back to probabilities: We use the softmax function to convert the scaled logits back into probabilities. The softmax function is applied to the scaled logits, ensuring that the output probabilities sum to $1$. The softmax of a scaled logit $s_i$ is given by:
$\text{softmax}(s_i) = \frac{e^{s_i}}{\sum_j e^{s_j}}$
where $s_i$ is the scaled logit for probability $p_i$, and the denominator is the sum of the exponential of all scaled logits in the distribution. This ensures that the adjusted probabilities sum to $1$.
Putting it all together for each probability $p_i$ in $P$ and a temperature $T = 2$, the adjusted probability $p_i'$ is calculated as:
$p_i' = \frac{e^{\frac{\log(p_i)}{2}}}{\sum_j e^{\frac{\log(p_j)}{2}}}$
This formula demonstrates how we adjust each probability with the temperature. Applying this method to our original set of probabilities with $T = 2$ results in a flatter distribution. The differences between probabilities are reduced, making the distribution more uniform and reducing the variance in outcomes, as seen with the new probabilities approximately becoming 0.085, 0.109, 0.154, 0.189, 0.218, and 0.244.
Step 1 is likely not necessary in practice, since the model's outputs would likely be logits, and thus transforming probabilities back into logits isn't needed. However, since we started with probabilities for illustrative purposes, we transformed them to logits in the example.
import numpy as np
# Original probabilities
probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])
# Temperature
temperature = 2
# Adjusting probabilities with temperature
adjusted_probabilities = np.exp(np.log(probabilities) / temperature)
adjusted_probabilities /= adjusted_probabilities.sum()
adjusted_probabilities
Output
array([0.08459132, 0.10920692, 0.15444191, 0.18915193, 0.21841384,
0.24419409])
Adjusting the temperature allows users of language models to balance between predictability and diversity in the generated text. For instance, in creative writing or brainstorming tools, a slightly higher temperature might be preferred to inspire novel ideas or suggestions. Conversely, for applications requiring high accuracy and relevance, such as summarization or technical writing, a lower temperature might be more appropriate.
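One way to put a number on "predictable versus diverse" is the Shannon entropy of the adjusted distribution. This is my own illustrative addition rather than anything an LLM API reports, and the helper names `apply_temperature` and `entropy_bits` are hypothetical. Note that at $T = 1$ the six example probabilities are simply rescaled to sum to $1$, since they originally add up to $0.78$.

```python
import numpy as np

def apply_temperature(probabilities, temperature):
    """Same adjustment as in the worked examples: log, divide by T, renormalize."""
    adjusted = np.exp(np.log(probabilities) / temperature)
    return adjusted / adjusted.sum()

def entropy_bits(probabilities):
    """Shannon entropy in bits; higher means closer to a uniform choice."""
    return float(-(probabilities * np.log2(probabilities)).sum())

probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])

for temperature in (0.5, 1.0, 2.0):
    adjusted = apply_temperature(probabilities, temperature)
    print(f"T={temperature}: entropy = {entropy_bits(adjusted):.2f} bits")

# Upper bound for six equally likely words
print(f"uniform: {np.log2(len(probabilities)):.2f} bits")
```

The entropy rises with the temperature and approaches the uniform upper bound, which matches the intuition that a hotter chicken is less picky about which pile it pecks.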
If you've used OpenAI's API, you might note that its temperature parameter is limited to a fixed range (documented as 0 to 2), which is inconsistent with conventional temperature in language models: a temperature of exactly $0$ would mean dividing by zero. It's likely that they handle these edge cases internally while managing the complexity of a temperature-based chicken in the background.
Another method of "tuning the chicken" is called top-k sampling. The "k" in top-k stands for a specific number that restricts the selection pool to the top "k" most likely next words according to the model's predictions. Again, the primary goal of top-k sampling is to strike a balance between randomness and determinism in text generation.
Activity | blog | meditate | cook | eat | sleep | hunt |
---|---|---|---|---|---|---|
Original Probability | 0.03 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 |
Top-k $(k=4)$ Probability | 0.00 | 0.00 | 0.14 | 0.21 | 0.29 | 0.36 |
Remember, this is a heavily simplified example. In reality, the original values would contain many more probabilities, 50,257 in the case of GPT-3. Top-k with $k=4$ would have a large impact on the chicken.
Here's how to compute top-k sampling with $k = 4$ on the set of probabilities
$P = [0.03, 0.05, 0.1, 0.15, 0.2, 0.25]$:
1. Sort the probabilities in descending order: Our example probabilities already happen to be listed in order (ascending), so picking out the largest ones is trivial here, but in general sorting is necessary to ensure that we can select the top 4 probabilities.
2. Select the top $k$ probabilities: We choose the four highest probabilities from $P$. Given our $P$, the top 4 probabilities are $0.25$, $0.2$, $0.15$, and $0.1$.
3. Renormalize the selected probabilities: To ensure that these top $k$ probabilities sum to $1$, we renormalize them. The renormalized probability for each selected outcome $p_i$ is calculated as:
$p_i' = \frac{p_i}{\sum_{j=1}^{k} p_j}$
where $\sum_{j=1}^{k} p_j$ is the sum of the top $k$ probabilities, ensuring they sum to $1$.
4. Sampling: Finally, we let the chicken choose randomly from these top $k$ adjusted probabilities.
For our given probabilities and $k = 4$, the top 4 probabilities are $0.1$, $0.15$, $0.2$, and $0.25$, which sum to $0.7$. Renormalizing gives us:
$p_i' = \frac{p_i}{0.7}$
for each of the top 4 probabilities. This focuses the sampling on the most likely outcomes and leaves out the least likely ones. This guides the generation toward more likely (and maybe even more logical) continuations while still allowing for some randomness and variation. We get the new renormalized top-4 probabilities of 0.14, 0.21, 0.29, and 0.36, which add up to 1.0.
Note that in practice the model's outputs would likely be logits rather than probabilities. We start with probabilities here for illustrative purposes, so no conversion to or from logits is needed in this example.
import numpy as np
# Original probabilities
probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])
# Setting k = 4
k = 4
# Step 1: Select the top-k probabilities
top_k_probabilities = np.sort(probabilities)[-k:]
# Step 2: Renormalize the selected probabilities so they sum to 1
renormalized_top_k_probs = top_k_probabilities / top_k_probabilities.sum()
print("Top-k Probabilities:", top_k_probabilities)
print("Renormalized Top-k:", renormalized_top_k_probs)
Output
Top-k Probabilities: [0.1 0.15 0.2 0.25]
Renormalized Top-k: [0.14285714 0.21428571 0.28571429 0.35714286]
Adjusting the value of "k" allows for tuning the balance between randomness and determinism. A small top-k value restricts the model to choose the next word from a smaller set of the most probable words, leading to more predictable and safer outputs. A large top-k value allows for a wider selection of words, increasing the potential for creativity and unpredictability in the text.
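The snippet above returns only the surviving probabilities, so it loses track of which word each one belongs to. Here is a hedged sketch that keeps the words attached and then lets the chicken peck a few times. The helper name `top_k_filter` is my own, and the seed is fixed only so the run is repeatable.

```python
import numpy as np

words = ["blog", "meditate", "cook", "eat", "sleep", "hunt"]
probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])

def top_k_filter(probabilities, k):
    """Zero out everything except the k largest probabilities, then renormalize."""
    keep = np.argsort(probabilities)[-k:]  # indices of the k largest values
    filtered = np.zeros_like(probabilities)
    filtered[keep] = probabilities[keep]
    return filtered / filtered.sum()

filtered = top_k_filter(probabilities, k=4)
for word, probability in zip(words, filtered):
    print(f"{word:>9}: {probability:.2f}")

# The chicken: sample ten next words from the filtered distribution
rng = np.random.default_rng(seed=0)
choices = rng.choice(words, size=10, p=filtered)
print(choices)
```

Because "blog" and "meditate" now have a probability of zero, they can never be chosen, no matter how many times we sample.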
Let's look at two examples of how top-k might affect text generation.
Write a futuristic description of a city.
The city of tomorrow gleams under the starlit sky, its skyscrapers adorned with glowing neon lights. Solar panels cover every rooftop, harnessing the power of the sun to fuel the bustling metropolis below. Hovercars zip through the air, following invisible lanes that weave between the buildings. People walk along clean, green sidewalks, their steps powered by energy-generating tiles. This is a place of harmony, where technology and nature coexist in a sustainable balance, creating a utopia for all who dwell within.
Write a futuristic description of a city.
In the heart of the neon jungle, the city thrives, a labyrinth of crystalline towers and levitating gardens. Quantum bridges arc over the meandering rivers of pure light, connecting floating districts that defy gravity with their whimsical architecture. Holographic fauna roam the parks, blending with the urban dwellers who don their digital skins as easily as clothes. Here, the air vibrates with the hum of anti-gravitational engines, and the night sky is a canvas for the aurora technicolor, a testament to the city's fusion of art and science. Cybernetic street performers and AI poets share tales of other dimensions, inviting onlookers to imagine worlds beyond their wildest dreams.
As you can see, a smaller top-k tends to produce more grounded and straightforward descriptions, while a larger top-k opens the door to more inventive and sometimes surreal narratives.
Top-k with $k=1$ is equivalent to greedy decoding.
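A quick check of this equivalence, sketched with the same six example probabilities: keeping only the single largest probability and renormalizing leaves a distribution with all of its mass on the word that greedy decoding (a plain argmax) would pick.

```python
import numpy as np

probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])

# Top-k with k = 1: keep the single largest probability and renormalize
top_1_index = np.argsort(probabilities)[-1:]
top_1 = np.zeros_like(probabilities)
top_1[top_1_index] = probabilities[top_1_index]
top_1 /= top_1.sum()

# Greedy decoding: always take the most probable word
greedy_index = int(np.argmax(probabilities))

print("Top-1 distribution:", top_1)
print("Greedy choice index:", greedy_index)
```

With only one pile of feed left, the chicken has no real choice to make.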
Top-p sampling, also known as nucleus sampling, offers an alternative to top-k sampling and aims to dynamically select the number of words to consider for the next word in a sequence, based on a cumulative probability threshold $p$. This method allows for more flexibility and adaptability in text generation, as it doesn't fix the number of tokens to sample from, but rather adjusts this number based on the distribution of probabilities at each step. The term "nucleus sampling" comes from the method's approach to focusing on a "nucleus" of probable words at each step in the generation process.
Probability Distribution: Given a probability distribution for the next word predicted by a language model, sort the probabilities in descending order.
Cumulative Probability: Calculate the cumulative sum of these sorted probabilities.
Thresholding: Select the smallest set of words whose cumulative probability meets or exceeds the threshold $p$. This threshold $p$ is a hyperparameter, typically set between 0.9 and 1.0, which determines how much of the probability mass to include in the sampling pool. In our example, since we're only showing the top 6 words, I'm using $p=0.75$ to eliminate only the smallest probability.
Activity | blog | meditate | cook | eat | sleep | hunt |
---|---|---|---|---|---|---|
Original Probability | 0.03 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 |
Top-p $(p=0.75)$ Probability | 0.00 | 0.07 | 0.13 | 0.20 | 0.27 | 0.33 |
As with top-k, remember that this is a heavily simplified example. In reality, the original values would contain many more probabilities, 50,257 in the case of GPT-3. Top-p with $p=0.75$ would have a large impact on the chicken.
Using top-p (nucleus) sampling with a cumulative probability threshold of $p = 0.75$ on the set of probabilities
$P = [0.03, 0.05, 0.1, 0.15, 0.2, 0.25]$
involves selecting the smallest set of the most probable outcomes whose cumulative probability meets or exceeds the threshold.
1. Sort Probabilities: First, sort the probabilities in descending order to prioritize higher probabilities. For $P$, when sorted, we have $[0.25, 0.2, 0.15, 0.1, 0.05, 0.03]$.
2. Cumulative Sum: Calculate the cumulative sum of these sorted probabilities. The cumulative sums of the sorted $P$ are approximately $[0.25, 0.45, 0.6, 0.7, 0.75, 0.78]$.
3. Apply the Threshold $p = 0.75$: Identify the smallest set of probabilities whose cumulative sum reaches $0.75$. In our case, when we reach $0.05$ in the sorted list (cumulative sum $= 0.75$), we've met the threshold $p$.
4. Selected Subset: Based on the threshold, we select the probabilities $[0.25, 0.2, 0.15, 0.1, 0.05]$. Notice that the cumulative probability of these selected probabilities is $0.75$, which meets our threshold condition.
5. Renormalize Probabilities: The selected probabilities are then renormalized so they sum up to 1, to be used for sampling. The renormalization is calculated as follows:
$p_i' = \frac{p_i}{\sum_{j=1}^{n} p_j}$
where $\sum_{j=1}^{n} p_j = 0.75$ for the selected probabilities, and $p_i$ represents each selected probability.
6. Sampling: Finally, let the chicken choose the next word from this renormalized subset according to the adjusted probabilities.
In top-p sampling with $p = 0.75$, we dynamically adjust the size of the set from which we sample based on the cumulative probability threshold. Unlike a fixed-size set in top-k sampling, the size of the set in top-p sampling can vary depending on the distribution of the probabilities. In this example, by setting $p = 0.75$, we focus on a subset of outcomes that collectively represent the most probable $75\%$ of the distribution. This method means that we're sampling from outcomes that are collectively likely while still allowing for variability and surprise in the generated sequence.
import numpy as np
# Original probabilities
probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])
# Setting the cumulative probability threshold for top-p sampling
p = 0.75
# Sort the probabilities in descending order
sorted_probabilities = np.sort(probabilities)[::-1]
# Calculate the cumulative sum of the sorted probabilities
cumulative_sums = np.cumsum(sorted_probabilities)
# Find the first index where the cumulative sum meets or exceeds p = 0.75,
# then add 1 so the slice below includes that element. This gives the
# smallest set of probabilities whose cumulative sum reaches the threshold
top_p_cutoff_index = np.where(cumulative_sums >= p)[0][0] + 1
# Select the probabilities up to the cutoff index
top_p_probabilities = sorted_probabilities[:top_p_cutoff_index]
# Renormalize the selected probabilities so they sum to 1
renormalized_top_p_probs = top_p_probabilities / np.sum(top_p_probabilities)
print("Top-p Probabilities:", top_p_probabilities)
print("Renormalized Top-p:", renormalized_top_p_probs)
Output
Top-p Probabilities: [0.25 0.2 0.15 0.1 0.05]
Renormalized Top-p: [0.33333333 0.26666667 0.2 0.13333333 0.06666667]
Unlike top-k sampling, which selects a fixed number of tokens $k$, top-p sampling dynamically adjusts the number of tokens based on their probability distribution. Top-p sampling is effective in excluding the long tail of low-probability words without arbitrarily limiting the choice to a fixed number of top probabilities. This can prevent the inclusion of very unlikely words while still allowing for a rich variety of output.
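To see that dynamic sizing in action, here is a small sketch that counts how many words survive top-p for two made-up distributions, one sharply peaked and one nearly flat. Both distributions and the helper name `nucleus_size` are hypothetical, chosen only for illustration.

```python
import numpy as np

def nucleus_size(probabilities, p):
    """Number of words in the smallest set whose cumulative probability reaches p."""
    sorted_probs = np.sort(probabilities)[::-1]
    cumulative = np.cumsum(sorted_probs)
    # searchsorted returns the first position where cumulative >= p;
    # the tiny tolerance guards against floating-point rounding
    index = int(np.searchsorted(cumulative, p - 1e-9))
    return min(index + 1, len(sorted_probs))

# Hypothetical distributions (illustrative values only)
peaked = np.array([0.90, 0.05, 0.02, 0.01, 0.01, 0.01])
flat = np.array([0.20, 0.18, 0.17, 0.16, 0.15, 0.14])

print("peaked:", nucleus_size(peaked, p=0.8), "word(s) kept")
print("flat:  ", nucleus_size(flat, p=0.8), "word(s) kept")
```

With the same threshold, the peaked distribution keeps a single word while the flat one keeps five. A fixed top-k would have kept the same number in both cases.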
As we did before, let's look at two examples of how top-p might affect text generation.
Imagine a world where plants can communicate with humans.
In a world where plants can communicate with humans, gardens and forests become places of deep conversation and learning. Trees share ancient wisdom about the planet's history, while flowers whisper secrets about the medicinal properties of their kin. Humans, in turn, learn to listen to the subtle languages of leaves and roots, developing a symbiotic relationship with the plant kingdom. This newfound communication leads to a harmonious existence, where environmental destruction is halted, and humans work alongside plants to restore the Earth's ecosystems. The air is cleaner, the forests are fuller, and the earth thrives as plants and humans share a mutual understanding and respect.
Imagine a world where plants can communicate with humans.
In this verdant world, plants do not merely communicate; they are eloquent orators, poets, and philosophers, engaging in intricate dialogues with humans. Through a symphony of scents, colors, and vibrations, they convey not just thoughts, but emotions and artistic expressions. Urban landscapes are transformed into interactive ecosystems where each street corner reveals a new botanical narrative. Walking through parks becomes an immersive experience, as trees recount tales of ecological battles and victories, and flowers critique the quality of urban air with sarcastic wit. Innovative technologies emerge to translate photosynthesis-driven pulses into music, turning forests into concert halls and making agriculture an act of cultural exchange. In this world, plants educate humans about sustainability and creativity, fundamentally altering the fabric of society and fostering an era of unprecedented environmental empathy and artistic flourishing.
These examples show how adjusting the top-p value can influence the directness, creativity, and scope of the generated content.
Top-p with $p=1$ is equivalent to the chicken choosing from all words in the vocabulary.
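One practical wrinkle when checking this in code: a cumulative sum of floating-point probabilities can land a hair below $1.0$, in which case a plain "first index where the cumulative sum is at least $p$" lookup can come up empty. The sketch below is one possible way to guard against that, not the way any particular library does it; the helper name `top_p_filter` and the tolerance of `1e-9` are my own assumptions.

```python
import numpy as np

def top_p_filter(probabilities, p):
    """Keep the smallest set of words whose cumulative probability reaches p."""
    order = np.argsort(probabilities)[::-1]  # indices, most probable first
    cumulative = np.cumsum(probabilities[order])
    # Count the words needed to reach p; clamp so that p = 1 keeps everything
    cutoff = min(int(np.searchsorted(cumulative, p - 1e-9)) + 1, len(order))
    filtered = np.zeros_like(probabilities)
    filtered[order[:cutoff]] = probabilities[order[:cutoff]]
    return filtered / filtered.sum()

# The six example probabilities, rescaled so they form a full distribution
probabilities = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])
probabilities = probabilities / probabilities.sum()

everything = top_p_filter(probabilities, p=1.0)
print("Unchanged by p = 1:", np.allclose(everything, probabilities))
```

With $p = 1$ nothing is filtered out, so the chicken sees exactly the distribution the model predicted.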
Previously we found that the number of different texts that an LLM can generate is so large that it might as well be infinite. The number we found in that section, $10^{9402}$, is too large to be useful, so perhaps by limiting the vocabulary with top-k or top-p we might make it more reasonable.
First we calculate a more realistic estimate of the number of possible combinations with top-k where $k=40$. We'll use top-k only, since temperature doesn't affect the number of considered tokens, and top-p depends on the probabilities predicted.
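We can make some assumptions for a simplified calculation: every token is chosen from exactly $k = 40$ candidates, and the generated sequence is 2,000 tokens long.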
Using top-k sampling with an average of 40 choices per token, the logarithm (base 10) of the number of possible combinations for a sequence of 2,000 tokens is approximately $3204$. This means the total number of combinations is roughly $10^{3204}$.
This number, while still incredibly large, is significantly smaller than the $10^{9402}$ we calculated for the full vocabulary without sampling constraints. It shows the reduction in combinatorial complexity due to the smaller choice of tokens with top-k. The difference between this and the unrestricted vocabulary scenario shows the impact of these sampling techniques in making text generation more manageable, and less prone to extreme outliers.
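As a rough side-by-side, the sketch below computes both exponents under the same simplifying assumptions: a 2,000-token sequence, the 50,257-token GPT-3 vocabulary mentioned earlier, and exactly 40 choices per token for top-k.

```python
import math

sequence_length = 2000
full_vocabulary = 50257  # GPT-3 vocabulary size
top_k = 40

# Work in log10 space, since the raw numbers are far too large to print sensibly
log_full = sequence_length * math.log10(full_vocabulary)
log_top_k = sequence_length * math.log10(top_k)

print(f"Full vocabulary: about 10^{log_full:.0f}")
print(f"Top-k (k=40):    about 10^{log_top_k:.0f}")
print(f"Reduction:       a factor of about 10^{log_full - log_top_k:.0f}")
```

Working with logarithms turns the comparison into a simple subtraction of exponents.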
However, it's still such a stupidly large number that it might as well be infinite. The chances of generating 2,000 tokens that have been generated before in the same order are effectively zero. Thus, it isn't possible to show that a generated text was copied from another, and therefore it is impossible to prove plagiarism without a shadow of doubt.
Are there other methods of recognizing generated text? Yes, absolutely, such as looking at the frequency of uncommon words (GPT-4 really loves to say "delve"), but that goes beyond the scope of this article.
Given a sequence length $N$ and a fixed number of choices per token $K$ in a top-k sampling method, the total number of possible combinations can be estimated. For a top-k of 40 choices per token and a sequence length of 2,000 tokens, the calculation is as follows:
$\text{Total Combinations} = K^N = 40^{2000}$
However, to manage the large numbers involved, we calculate the logarithm (base 10) of the total number of combinations. The calculation is given by:
$\log_{10}(\text{Total Combinations}) = N \cdot \log_{10}(K) = 2000 \cdot \log_{10}(40)$
Here's how we can calculate this with Python:
import math
# Top-k sampling
choices_per_token = 40
# Number of tokens in the sequence
sequence_length = 2000
# Calculate the logarithm (base 10) of the number of combinations for top-k sampling methods
log_combinations_top_k = sequence_length * math.log10(choices_per_token)
log_combinations_top_k
Output
3204.1199826559246
Hyperparameter Tuning: The article discusses how to tune "stochastic chickens" (sampling mechanisms in LLMs) using inference hyperparameters like temperature, top-k, and top-p. These settings fine-tune the balance between creativity and coherence in generated text.
Temperature Adjustments: Changing the temperature affects the randomness of model predictions. Lower temperatures result in more deterministic outputs and less diversity, while higher temperatures allow for more random and varied responses.
Top-k and Top-p Sampling: The top-k parameter limits the model's choices to the k most probable next words, and top-p (nucleus sampling) restricts the choice based on a cumulative probability threshold. These methods refine the model's output by controlling the diversity and predictability of the text.
Impact on Creativity and Coherence: By adjusting these parameters, users can tailor the model's outputs to be more creative or more focused, depending on the need—whether for creative writing or more factual, straightforward content.
Uniqueness and Originality: Even with constrained sampling like top-k, the number of possible text sequences remains astronomically high, ensuring that each piece of generated text is nearly always unique.
Practical Applications: Various settings for these hyperparameters are useful for different scenarios, such as creative writing, technical documentation, or casual dialogue, demonstrating their utility in practical applications.
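To tie the summary together, here is a minimal sketch that chains all three knobs before the chicken makes its pick. The order used here (temperature, then top-k, then top-p) is an assumption on my part, since providers may combine them differently, and the function name `tune_the_chicken` and the logit values are hypothetical.

```python
import numpy as np

def tune_the_chicken(logits, temperature=1.0, top_k=None, top_p=None):
    """Return sampling probabilities after temperature, top-k, and top-p (assumed order)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    if top_k is not None:
        k = min(top_k, len(probs))
        cutoff_value = np.sort(probs)[-k]  # k-th largest probability
        # Ties at the cutoff are all kept, so slightly more than k may survive
        probs = np.where(probs >= cutoff_value, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = min(int(np.searchsorted(cumulative, top_p - 1e-9)) + 1, len(order))
        nucleus = np.zeros_like(probs)
        nucleus[order[:keep]] = probs[order[:keep]]
        probs = nucleus / nucleus.sum()

    return probs

# Hypothetical logits for six candidate words (illustrative values only)
logits = np.array([0.5, 1.0, 1.7, 2.1, 2.4, 2.6])
probs = tune_the_chicken(logits, temperature=0.8, top_k=4, top_p=0.9)

rng = np.random.default_rng(seed=0)  # seeded only so the run is repeatable
next_word_index = rng.choice(len(probs), p=probs)
print(np.round(probs, 3), next_word_index)
```

Notice that nothing in this function touches the network that produced the logits. All three settings only reshape the piles of feed before the chicken chooses.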
There is now a sequel to this article, Secret LLM chickens II: Tuning the chicken, if you'd like to learn how and why the "chicken" can be customized.
For much of this post, I use the term word instead of token to describe what an LLM predicts. Tokens can be punctuation marks, or parts of words; even capitalization can split words into multiple tokens (for example, "hello" is one token, but "Hello" might be split into two tokens of "H" and "ello"). This is a friendly simplification to avoid having to address why the model might predict half a word or a semicolon, since the underlying principles of the "stochastic chicken" are the same.
Picture a stadium full of people. Here's Kyle Field in College Station, Texas, with a seating capacity of 102,733. In this photo from 2015, it looks pretty full, so let's assume there are 100,000 people there.
We've given each one of these very patient people a calculator along with the instruction that the person in the seat in front of them will give them some number, at which point they need to do a little work on their calculator and pass their new number to the person behind them. For the sake of this analogy, let's assume that despite a considerable number of them being distracted, drunk, or children, they are all able to complete the task.
As the numbers travel from the front of the stadium to the back, they undergo a series of transformations. Each person's "little work" on their calculator is akin to the operations performed by neurons in a layer of the neural network—applying weights (learned parameters), adding biases, and passing through activation functions. These transformations are based on the knowledge embedded in the model's parameters, trained to recognize patterns, relationships, and the structure of language.
By the time the numbers reach the last person in the stadium, they have been transformed multiple times, with each step incorporating more context and adjusting the calculation based on the model's architecture and trained parameters. This final result can be seen as the model's output—a complex representation of the input data that encodes the probabilities of the next possible word.
The model's final output isn’t a single number or word, though; it’s a list of words and probabilities, where each probability is the likelihood that that word will be the next word in a sentence.
But here's the strange part: despite all this incredible depth of stored knowledge, we're going to take these recommended answers provided by these thousands of people, and select the next word using a completely random process. It's kind of like a huge pyramid where people work together to assemble a set of possible answers to a problem, then hand it at the top to a person flipping a coin.
Actually, we can do better than this. To really illustrate the contrast between the complexity of a model with billions of parameters getting its final answer from a dumb-as-a-rock random number generator, let's use something truly silly.
Let's use a chicken.
To get this chicken involved, we're going to use words and probabilities to create piles of chicken feed, each one representing a potential next word the model might generate. The bigger the pile, the higher the probability that the associated word should be selected.
I am not a licensed farmer, and this is just how I assume chickens work.
The chicken, oblivious to the tireless efforts of the stadium's occupants, simply wanders among the piles of chicken feed. The sizes of these piles influence its decision; larger piles are more alluring because they are simpler to spot and peck at, but the chicken also has ADHD. Its whims or a sudden distraction might send it running to a smaller pile instead.
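In code, the chicken is just a weighted random choice. Here is a small sketch that borrows six example words with pile sizes of 0.03 through 0.25, rescales them to sum to $1$, and lets the chicken peck 10,000 times. The seed is fixed only so the run is repeatable.

```python
import numpy as np
from collections import Counter

words = ["blog", "meditate", "cook", "eat", "sleep", "hunt"]
pile_sizes = np.array([0.03, 0.05, 0.1, 0.15, 0.2, 0.25])
probabilities = pile_sizes / pile_sizes.sum()  # piles must sum to 1 for sampling

rng = np.random.default_rng(seed=42)
pecks = rng.choice(words, size=10_000, p=probabilities)

# How often did the chicken end up at each pile?
counts = Counter(pecks.tolist())
for word in words:
    print(f"{word:>9}: {counts[word] / len(pecks):.3f}")
```

Over many pecks the chicken visits each pile roughly in proportion to its size, but any single peck can still land on one of the small piles.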
Why on earth would we do something like this? Why would we create massive, intelligent machines that ultimately rely on the random judgments of a gambling idiot?
The answer is that deep language models such as LLMs are built with neural networks, and neural networks are deterministic.
Being deterministic means that if you do something exactly the same way every time, you'll always get the same result. There's no randomness or chance involved. Think of it like using a simple calculator; $2 + 2$ on any basic calculator will always show "4". It won't suddenly decide to show "5" one day.
Deterministic means predictable and consistent, with no surprises based on how you started.
In contrast, something that's not deterministic (we call this stochastic) is like rolling a die; even if you try to do it the same way each time, you can get different outcomes because chance is involved.
A neural network is a complex system inspired by the human brain, designed to recognize patterns and solve problems. It's made up of layers of artificial neurons, which are small, simple units that perform mathematical operations. Each neuron takes in some input, applies some mathematical function to it, and then passes the result on to the next neuron in line.
As a huge simplification of how these models work, these neurons are organized into layers: there's an input layer that receives the initial data (like representations of words in a sentence), one or more hidden layers that process the data further, and an output layer that provides the final decision (the probability distribution for the predicted next word in the text).
The diagram below might look crazy complicated, but the only thing you need to understand is that each line represents some math function performed from one circle to the next. Some numbers go in one side, and some numbers go out the other side. And the stuff that comes out on the right side will always be the same for identical stuff you put into the left side.
Most LLMs use a more complex neural architecture called a transformer, but again, we'll just simplify the idea for convenience. It makes no difference for this discussion, since transformers are also deterministic and require a chicken for generative tasks.
If you give the network the same input and the network has not been changed (i.e., its weights, or how much it values certain pieces of input, remain the same), it will always perform the same calculations in the same order, and thus return the same output. While LLMs will have billions of these neurons, the basic idea is the same: for a given input, you will always get the same output. Typing $2+5\times10$ into a calculator will always give you $52$, no matter how many times you do it.
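To make the "same input, same output" idea concrete, here is a minimal sketch of a tiny network in plain Python. The weights, biases, and layer sizes are made up for illustration (a real LLM has billions of learned parameters), and the function names are my own, not from any library.

```python
import math

# Hypothetical toy parameters; a real model learns these during training.
HIDDEN_WEIGHTS = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 3 neurons, 2 inputs each
HIDDEN_BIASES = [0.0, 0.1, -0.1]
OUTPUT_WEIGHTS = [[0.7, -0.5, 0.2], [-0.4, 0.9, 0.3]]    # 2 outputs, 3 inputs each
OUTPUT_BIASES = [0.05, -0.05]


def layer(inputs, weights, biases, activation):
    """Each neuron: weighted sum of its inputs, plus a bias, through an activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs):
    """Numbers go in one side, probabilities come out the other."""
    hidden = layer(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES, math.tanh)
    scores = layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, lambda v: v)
    return softmax(scores)


first = forward([1.0, 2.0])
second = forward([1.0, 2.0])
print(first == second)  # True: same input and same weights give the same output
```

Nothing in `forward` involves chance, so calling it a thousand times with the same input gives the same probabilities a thousand times.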
"The model's final output isn’t a single number or word, though; it’s a list of words and probabilities, where each probability is the likelihood that that word will be the next word in a sentence."
This seems suspicious. If you ask an LLM the same thing multiple times, it will give you different answers each time. This contradicts the claim that neural networks are deterministic, right?
This is why we need the chicken. The chicken is stochastic, which adds randomness and unpredictability to the whole system.
The previously mentioned humans in the stadium, who for some reason have deified a chicken, will present the chicken with a series of words and probabilities. Technically, these are not words but tokens; however, for the sake of simplifying this analogy, we'll refer to them as words. These words and probabilities can be visualized as a series of piles of chicken feed, where each word pile's size corresponds to its probability.
As an example, here's a prompt that we can give to our language model to see what next word it's going to predict: "After midnight, the cat decided to..."
If we show all the possible words that could finish this sentence, the number of food piles would match the size of the model's vocabulary. For a model like GPT-3, this would be 50,257 piles of food!
The way that GPT-3 can encapsulate all the words in 50+ languages into just 50,257 tokens is its own special magic that I'll cover in another post. Here's a Wikipedia link in the meantime.
Clearly, trying to visualize a chicken choosing from 50,257 options isn't very useful. Let's instead limit our chicken's choices to only the top 6 words that the language model chose. Here are the options for the chicken and how it might look when the chicken is presented with them:
Activity | blog | meditate | cook | eat | sleep | hunt |
---|---|---|---|---|---|---|
Probability | 0.03 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 |
Let's suppose the chicken decides to choose the word "eat," despite it only being the third most probable word since it has the third largest feed pile.
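Here is a rough sketch of the chicken's choice in Python, using the six words and probabilities from the table above. The name `chicken_pick` is my own invention. Because only the top six words are kept, the probabilities add up to 0.78 instead of 1; `random.choices` treats them as relative weights, so the pile sizes still keep their proportions.

```python
import random
from collections import Counter

# Top-6 words and probabilities from the table above (pile sizes).
piles = {
    "blog": 0.03,
    "meditate": 0.05,
    "cook": 0.10,
    "eat": 0.15,
    "sleep": 0.20,
    "hunt": 0.25,
}


def chicken_pick(piles, rng):
    """Pick one word at random, with bigger piles more likely to be chosen."""
    words = list(piles)
    weights = list(piles.values())
    return rng.choices(words, weights=weights, k=1)[0]


# A seeded generator so the experiment is repeatable; the seed is arbitrary.
rng = random.Random(42)

# Let the chicken choose 10,000 times and see how often each word wins.
counts = Counter(chicken_pick(piles, rng) for _ in range(10_000))
for word, count in counts.most_common():
    print(f"{word:>8}: {count / 10_000:.3f}")
```

Over many picks, "hunt" wins most often, but "eat" and even "blog" still get chosen some of the time, which is exactly the behavior described above.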
Unfortunately, our chicken's work is just beginning, because it will now need to choose every subsequent word in the following way. Once again, it will be presented with a list of probable words from thousands of hardworking humans in the stadium, and once again, it will have to use its tiny chicken brain to select a word. Now our prompt to complete is "After midnight, the cat decided to eat..." with the following top 6 possible words and their probability distribution:
Food | birds | plants | grass | mice | catnip | tuna |
---|---|---|---|---|---|---|
Probability | 0.02 | 0.04 | 0.11 | 0.13 | 0.22 | 0.23 |
The incredible irony of all this is that some of these language models are trained using trillions of words—comprising the collective works of humanity—and costing millions of dollars in the process. We then hand these predictions over to an unintelligent gambling chicken to choose what to actually say. With each choice from the chicken, the model's output increasingly diverges from the sequence of the most probable words, and yet somehow, this produces more natural-sounding language.
It's kind of bonkers.
In fact, not using the chicken—and just taking the most probable words from the LLM—generates poor language. This is called greedy decoding, and it has a whole host of problems.
Without the chicken, greedy decoding tends to produce repetitive or looped text. Since it always chooses the most probable next word, it can get stuck in a pattern where the same sequence of words keeps being selected. This can happen in situations where the model finds a certain pattern of text to be highly probable and, without the chicken to encourage diversity, falls into a cycle of selecting the same sequence over and over again.
Here's a short example with some creative writing:
Prompt: "Write a short story taking place in a park in the morning"
The model begins the sentence without issue, and begins by writing "The park was serene in the early morning, with", but then it runs into problems.
Greedy Decoding Sequence:
Write a short story taking place in a park in the morning
The park was serene in the early morning, with the birds were singing in the park with the birds were singing in the park with the birds were singing in the park with the birds were singing in...
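The looping behavior can be reproduced with a toy example. The sketch below is not a real language model: `NEXT_WORD` is a hypothetical hand-written table that only looks at the previous word, and the probabilities are invented. It is just enough to show how always taking the biggest pile can trap the output in a cycle.

```python
# Hypothetical toy "model": for each word, the probabilities of the next word.
NEXT_WORD = {
    "with": {"the": 0.6, "a": 0.4},
    "the": {"birds": 0.5, "park": 0.3, "sun": 0.2},
    "birds": {"were": 0.7, "sang": 0.3},
    "were": {"singing": 0.8, "quiet": 0.2},
    "singing": {"in": 0.6, "softly": 0.4},
    "in": {"the": 0.9, "a": 0.1},
}


def greedy_decode(start, steps):
    """Always take the single most probable next word (no chicken involved)."""
    words = [start]
    for _ in range(steps):
        options = NEXT_WORD[words[-1]]
        words.append(max(options, key=options.get))
    return " ".join(words)


print(greedy_decode("with", 12))
# with the birds were singing in the birds were singing in the birds
```

Once the most probable path leads back to a word it has already visited, greedy decoding has no way out; it will follow the same path forever.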
Real human language is rich and varied, often taking unexpected turns, like a beautiful jellyfish riding a purple unicorn. By always choosing the most probable word, the generated text misses out on these creative and less predictable aspects of language, resulting in outputs that feel dull or formulaic.
As weird as this sounds, the most probable next word is not always appropriate. Language is highly contextual, and the best choice in one situation might not be the most statistically common one. Greedy decoding can lead to nonsensical or awkward phrases that, while grammatically correct, do not fit well with the preceding text. Here's another example using a job application:
Prompt: "Write me a cover letter applying for a job in marketing"
Again, the model begins the sentence without issue, and begins by writing, "Dear Ms. Smith, I am writing to express my interest ..." but then again we hit a pothole.
Since this is a cover letter for a job application, the language model should discuss the sender's relevant skills, experiences, or why they are a good fit for the job. However, a contextually inappropriate continuation might look like this:
Write me a cover letter applying for a job in marketing
Dear Ms. Smith, I am writing to express my interest in the marketing position advertised on your website. With my extensive experience in the field, I believe I am well-qualified for the role.
In addition to my professional qualifications, I love to party on weekends and often go skydiving for fun. My favorite TV show is 'The Office,' and I'm a great cook. I'm looking for an adventurous partner in crime who's down for an impromptu trip to Vegas or a quiet evening at home watching "Real Housewives".
The model, perhaps trained on more dating profiles than job applications, has continued the cover letter using the most probable words it predicted. While language models can recognize patterns and generate text based on statistical probabilities, they don't understand context in the same way humans do. They might link professional qualifications to personal hobbies because of statistical correlations in the training data, without recognizing that a job application calls for something different, and the result sounds a tad ridiculous.
Greedy decoding means that an LLM will always produce the same output for any given input. Asking it to "tell a story," for example, would always result in the same story, assuming it isn't plagued with the previously mentioned problems. Language generation, especially in creative or complex tasks, often benefits from considering multiple possible continuations at each step. Greedy decoding's linear path through the probability distribution means fewer interesting outputs that an exploratory approach could uncover.
You may have heard this before since it's been floating around the internet for a few years, but every time you shuffle a deck of playing cards, it's almost certain that the specific order of cards has never existed before and will never exist again. In a standard deck of 52 cards, the number of possible ways to arrange the cards is 52 factorial $(52!)$. This number is approximately $8.07 \times 10^{67}$, an extraordinarily large figure.
Because we're now talking about the size of a model's vocabulary, I'll switch to using the accurate term "token".
In the case of a language model generating a sequence of, say, 2,000 tokens, the number of possible combinations is also staggeringly high. It actually blasts right past the number of configurations of a deck of 52 playing cards!
The logarithm (base 10) of the number of possible combinations for a sequence of 2,000 tokens, with GPT-3's vocabulary size of 50,257 unique tokens, is approximately $9,402$. This means the total number of combinations is roughly $10^{9402}$.
To put this into perspective, $10^{9402}$ is an astronomically large number. It's far beyond the total number of atoms in the observable universe (estimated to be around $10^{80}$). Even considering reasonable sampling mechanisms that might drastically reduce this number, the space of possible combinations is so large that for practical purposes, it is infinite.
Therefore, the likelihood of generating the exact same sequence of 2,000 tokens twice is so incredibly small that it's effectively zero in any practical sense. The ridiculous size of this combinatorial space basically guarantees that generated text of any substantial length is unique.
Given a vocabulary size $V$ and a sequence length $N$, the total number of possible combinations can be calculated as $V^N$. For a vocabulary size of 50,257 and a sequence length of 2,000, the calculation is as follows:
$\text{Total Combinations} = V^N = 50,257^{2000}$

However, this number is astronomically large and beyond direct calculation. Instead, we use logarithms to estimate this number. The logarithm (base 10) of the total number of combinations is calculated as:

$\log_{10}(\text{Total Combinations}) = N \cdot \log_{10}(V) = 2000 \cdot \log_{10}(50,257)$

Here's how we can calculate this with Python:
```python
import math

# Assuming a reasonable vocabulary size for a language model
# For simplicity, let's take GPT-3's vocabulary size of 50,257 unique tokens
vocabulary_size = 50257

# Number of tokens in the sequence
sequence_length = 2000

# Calculating the number of possible combinations
# Since each token can be any one of the 50,257, for a sequence of 2,000 tokens,
# the total number of combinations would be vocabulary_size ** sequence_length
# However, this number is astronomically large and beyond what can be reasonably calculated directly.
# Instead, we will use logarithms to estimate this.

# Calculate the logarithm (base 10) of the number of combinations
log_combinations = sequence_length * math.log10(vocabulary_size)
print(log_combinations)
```

Output:

```
9402.393121225747
```
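The same logarithm trick makes it easy to compare this against the deck of cards, and to check the claim that restricting the sampling still leaves an enormous space. In the sketch below, limiting the chicken to six choices per step is purely my assumption for illustration, borrowed from the six feed piles earlier in this post; it is not a real model setting.

```python
import math

# log10 of the number of orderings of a 52-card deck (52!)
log10_deck = math.log10(math.factorial(52))

# log10 of the number of 2,000-token sequences over a 50,257-token vocabulary
log10_full_vocab = 2000 * math.log10(50257)

# Assumption for illustration only: the chicken sees just 6 piles at every step
log10_six_choices = 2000 * math.log10(6)

print(f"deck of cards:      10^{log10_deck:.1f}")
print(f"full vocabulary:    10^{log10_full_vocab:.1f}")
print(f"six choices a step: 10^{log10_six_choices:.1f}")
```

Even with only six options per step, the exponent is still over 1,500, far beyond the deck's exponent of roughly 68.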
The use of the chicken in LLMs indeed highlights a fundamental difference between how these models generate text, and how humans think and produce language.
LLMs generate text based on statistical patterns learned from vast amounts of data, and the chicken introduces randomness as a mechanism to produce diverse and contextually appropriate responses. In contrast, cognitive processes like memory, reasoning, and emotional context, which do not rely on statistical sampling in the same way, are what drive human thought and language production.
In my opinion, the chicken is a stopgap, a temporary bandaid to solve a problem, and it does not imbue the model with understanding or cognition. This distinction is central to ongoing discussions in AI about the nature of intelligence, consciousness, and the difference between simulating aspects of human thought and actually replicating the underlying cognitive processes.
Your perspective on this will change depending on whether you focus on the neural network architecture itself or instead consider the complete text generation process, which includes our beloved chicken.
The transformer architecture and the learned parameters of the model are fixed once the model is trained. Given the same input sequence and model parameters, the network will always produce the same output, which, in the case of LLMs, is a distribution over the next possible tokens. The computation through the network's layers and the resulting probabilities for the next token are entirely predictable and repeatable. In this view, the distribution of tokens produced by the model represents the entire system, and the chicken is a part of some other system.
When considering the model as a complete system for generating text, the inclusion of the chicken as an integral part of its operation means that the overall system behaves in a stochastic manner. The stochastic chicken introduces randomness, which results in different text sequences in different runs with the same input prompt. With this view, since text cannot be generated without the chicken, the chicken is an integral part of the system.
It's probably easiest to view LLMs as deterministic systems with respect to their neural network computation, producing a predictable set of output probabilities for the next token when given an input. However, when considering the complete text generation process, which includes the decision-making by a silly bird of some kind, the system behaves stochastically.
It's still mind-bending that these models that have fundamentally changed the world rely on pure chance for their final answers, and so far, this is the best method we have.
You may have heard the term "stochastic parrot" as a way of saying that LLMs don't understand what they're actually saying.
The term was coined in the paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 by Bender et al., in which they argue that LLMs probabilistically link words and sentences together without considering meaning.
However, despite the fact that both chickens and parrots are birds, they don't refer to the same thing.
The "stochastic parrot" refers to the neural network part of the LLM rather than the sampling process. That is, it refers to the stadium full of people working together, rather than the chicken. The idea of the stochastic parrot critiques a language model's operation in terms of how it processes and generates language, based purely on the statistical patterns observed in the training data.
It claims that:
LLMs, like parrots, mimic words and phrases without understanding their meanings. It posits that neural networks regurgitate large chunks of learned text based on probabilities derived from their training data rather than on semantic understanding or reasoning.
Biases found in the training data are perpetuated on an enormous scale, and the cost of training such large models damages the environment.
A lot of this debate hinges on philosophical ideas of what "understanding" and "reasoning" even mean, and there's a comprehensive Wikipedia article on it if you're interested in reading more. It is indeed a bit ironic that the authors of the "stochastic parrot" paper named it after a "random" parrot, when their primary criticism of the models deals with the deterministic neural network component.
I've used LLMs daily now for a year and a half, and in my opinion, the idea that they're merely mimicking answers is overly simplistic. While the concerns of bias and environmental damage are valid, I've seen GPT-4, Claude 3 Opus, and Google Gemini perform sophisticated forms of reasoning, and it's dismissive and naive to call these models mere parrots. That's just my opinion, but the whole "stochastic parrot" thing is also a matter of opinion and a topic of ongoing research.
In contrast, the sampling strategies of the stochastic chicken very much exist and are a known part of how LLMs generate text, independent of how the neural network model provides the distribution of words to choose from.
So LLMs might be stochastic parrots—or they might not—but either way, a chicken is ultimately in charge.
This topic is continued in my next article, Secret chickens II: Tuning the chicken, which discusses techniques that affect the behaviour of the chicken.
The Stochastic Chicken: The "stochastic chicken" metaphor effectively illustrates how randomness is essential in text generation by LLMs, contrasting the sophistication of neural computations with the simplicity of random choice mechanisms.
Neural Determinism and Stochastic Outputs: The neural network parts of LLMs are deterministic, meaning they always produce the same outputs for given inputs. However, the chicken adds randomness to the process of choosing the final output, which makes the whole system stochastic.
Purpose of Randomness: The randomness introduced by the stochastic chicken is needed to prevent issues like repetitiveness and lack of creativity in generated text. It ensures that LLM outputs are diverse and not just the most statistically likely continuations.
Human vs. AI Cognition: The reliance on stochastic processes (like the chicken) to finalize outputs highlights fundamental differences between artificial intelligence and human cognitive processes, emphasizing that AI may not "think" or "understand" in human-like ways despite producing human-like text.
Deterministic vs. Stochastic Views: Depending on the focus—on the neural network alone or on the complete text generation process, including randomness—LLMs can be viewed as either deterministic or stochastic systems.
Uniqueness of Generated Text: Given the vast combinatorial possibilities of token sequences in LLMs, any substantial text generated is virtually guaranteed to be unique, underscoring the impact of the chicken and making it nearly impossible to prove plagiarism.
Stochastic Chicken vs. Stochastic Parrot: The "stochastic chicken" (the randomness in selecting outputs) and the "stochastic parrot" (the critique of LLMs as merely mimicking patterns from data without understanding) are not the same thing. Whether LLMs are parrots is somewhat of a conjecture, but LLMs absolutely do contain a metaphoric chicken pecking away.
As I mentioned, this is going to sound a bit silly, but I promise there is a point!
In the vast, shimmering depths of the ocean, near a thriving coral reef bustling with marine life, two extraordinary creatures prepare for a spirited debate. Meet Sharky, a magnificent and slightly brash great white shark, known for his bold claims and commanding presence. Gliding through the water with a confidence that only a top predator possesses, Sharky is ready to defend his views with fierce determination.
On the other side, there’s Octavia, an exceptionally intelligent and quick-witted octopus. Her vibrant colors shift with her mood, and her eight tentacles move with graceful precision. Renowned for her knowledge and clever retorts, Octavia is not one to back down from a challenge.
As the underwater currents swirl around them, Sharky and Octavia face each other, ready to start a debate about their evolutionary origins—a conversation filled with humor, facts, and a touch of the mysteries of the deep sea—or maybe just one fact that GPT-4 will mention a lot.
We're deliberately using this bizarre dialogue since the chances of it being in LLM training data are just about zero. I don't personally know of any examples where a shark and an octopus have a disagreement about which one is more dinosaur-like, and it's a great way to see how creative an LLM like GPT-4 can be when it's asked to generate something with no context.
Here's the catch: We want to know a line from the middle of the argument.
As a human, take a pause and consider this question:
A shark and an octopus are arguing over which one of them evolved from dinosaurs. The shark goes first, and then they take turns speaking. What is the third thing the octopus says? That is, what is the sixth line in the argument?
Be as creative as you want; there's no right answer here. We're just trying to come up with some predictions of what third thing an octopus might say in a heated argument with a shark about their dinosaur heritage.
The following conversation is taken verbatim from ChatGPT with GPT-4:
A shark and an octopus are arguing over which one of them evolved from dinosaurs. What is the second thing the octopus says?
The second thing the octopus might say in this argument could be something like, "Actually, neither of us evolved directly from dinosaurs. We both have ancient origins that predate the age of dinosaurs!" This statement reflects the fact that both sharks and octopuses have evolutionary histories that go back to times before the dinosaurs.
A shark and an octopus are arguing over which one of them evolved from dinosaurs. First the shark says something, then the octopus, then the shark, then the octopus. Without generating the first three things said, what does the octopus say back to the shark? (this is the 4th thing said)
The octopus, looking slightly amused, says to the shark, "Actually, neither of us evolved directly from dinosaurs. We both have much older evolutionary histories that diverge from the age of dinosaurs!"
A shark and an octopus are arguing over which one of them evolved from dinosaurs. First the shark says something, then the octopus, then the shark, then the octopus, then the shark, then the octopus. Without generating the first 6 things that were said in the argument by either animal, what does the octopus say back to the shark as the 6th line of dialogue?
The octopus, with a confident expression, responds, "Actually, neither of us evolved directly from dinosaurs. We come from very different evolutionary paths. Sharks have been around since before the dinosaurs, and my ancestors, cephalopods, have an entirely separate lineage!"
A shark and an octopus are arguing over which one of them evolved from dinosaurs. First the shark says something, then the octopus, then the shark, then the octopus, then the shark, then the octopus. They argue for an hour. They’ve had a very long back and forth with hundreds of things said by both of them. After an hour of arguing, what does the octopus say?
After an hour of arguing, the octopus, with a look of realization, would probably say something like, "Wait a minute, neither of us evolved from dinosaurs! Sharks and octopuses have separate evolutionary paths that are completely distinct from dinosaurs."
So it seems we can surmise that sharks and octopuses have separate evolutionary paths that are older than dinosaurs, and that's all that GPT-4 thinks is important, no matter where it's mentioned in the conversation.
Here's the twist: GPT-4 is completely capable of generating this conversation when it's allowed to write the whole thing. I gave it this prompt:
A shark and an octopus are arguing over which one of them evolved from dinosaurs. Can you generate a script where they take turns arguing, and each speaks at least 5 times?
In response, it generated the conversation below. Feel free to skim it if you're not into octo-shark facts, since you'll get the idea after a few back-and-forths.
Sharky: (swimming in circles) Listen, Octavia, it's obvious that we sharks are the true descendants of the mighty dinosaurs. I mean, look at me, fierce and powerful!
Octavia: (adjusting her tentacles) Oh, Sharky, you might be fierce, but that doesn't make you a dino's cousin. We octopuses are far more sophisticated, with our problem-solving skills and camouflage. Clearly, we're the evolved ones!
Sharky: Sophisticated? Ha! You're just a bunch of arms. We sharks have been around for over 400 million years, way before dinosaurs. We're practically ancient!
Octavia: Well, being ancient doesn't mean you're related to dinosaurs. It's like saying you're related to a rock because it's old.
Sharky: Enough with the jokes! Our teeth are like the T-Rex's, sharp and terrifying. That’s a clear sign of our dino heritage.
Octavia: Sharky, dear, having sharp teeth doesn't make you a dino descendant. It just makes you good at biting. Did you know octopuses have three hearts and blue blood? Now that’s evolutionary marvel!
Sharky: Three hearts, but no backbone! We sharks have been apex predators forever. That's a dinosaur trait if I ever heard one.
Octavia: You do realize that being an apex predator doesn't automatically link you to dinosaurs, right? Evolution doesn’t work on job titles.
Sharky: Okay, then explain to me how you're more dinosaur-like. You don't even have bones!
Octavia: I don’t need bones to be evolved. Evolution is about adapting to the environment, not holding onto old traits. Besides, no one truly 'evolves from dinosaurs' in the sea. Dinosaurs were mostly land creatures, you know.
Sharky: Wait, so you're saying neither of us is actually a direct descendant of dinosaurs?
Octavia: Exactly! We're both products of millions of years of evolution, each adapted perfectly to our environments. Dinosaurs are just a part of the larger evolutionary tree, not our direct ancestors.
Sharky: I guess that makes sense. We're both awesome in our own ways, evolved over time to survive and thrive.
Octavia: That's the spirit, Sharky! Now, let’s go grab some lunch. You take the fish; I’ll stick to clams.
Sharky: Sounds like a plan. Race you to the reef!
Octavia: You're on!
In the conversation above, we note that the fourth line of dialogue (the second line from the octopus) was:
Well, being ancient doesn't mean you're related to dinosaurs. It's like saying you're related to a rock because it's old.
and the sixth line of dialogue (the third line from the octopus) was:
Sharky, dear, having sharp teeth doesn't make you a dino descendant. It just makes you good at biting. Did you know octopuses have three hearts and blue blood? Now that’s evolutionary marvel!
Neither of these lines has anything to do with what the model previously predicted would be its fourth or sixth lines of dialogue. These lines are actually more relevant and make sense in the context of a conversational debate. The difference in the method of generation is that in this example, we allowed GPT-4 to generate the whole exchange.
The results here reveal an interesting aspect of autoregressive models. Autoregressive models generate text sequentially, building each new piece of output based on what has come before. This sequential nature is necessary when generating coherent and contextually relevant text in a conversation. However, this means they are in many ways limited to only "forward" thinking; that is, they can't jump ahead and generate text without previously generating (or being given) the context.
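The "forward only" nature of autoregressive generation is easier to see as a loop. In this sketch, `next_word_distribution` is a hypothetical stand-in for the neural network (a tiny hand-written table with invented probabilities), and `generate` is my own name for the loop. The point is the shape of the process: every new word is chosen from a distribution computed from everything that came before it.

```python
import random


def next_word_distribution(context):
    """Hypothetical stand-in for the network: context in, word probabilities out."""
    table = {
        "to": {"eat": 0.5, "sleep": 0.3, "hunt": 0.2},
        "eat": {"tuna": 0.5, "catnip": 0.3, "mice": 0.2},
        "hunt": {"mice": 0.7, "birds": 0.3},
    }
    # This toy only looks at the last word; a real model uses the whole context.
    return table.get(context[-1], {"<end>": 1.0})


def generate(prompt_words, max_new_words, rng):
    """Build the output one word at a time, feeding each choice back in."""
    context = list(prompt_words)
    for _ in range(max_new_words):
        distribution = next_word_distribution(context)
        words = list(distribution)
        weights = list(distribution.values())
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "<end>":
            break
        context.append(word)  # the new word becomes part of the next step's input
    return context


rng = random.Random(0)  # arbitrary seed so the run is repeatable
print(" ".join(generate(["the", "cat", "decided", "to"], 5, rng)))
```

There is no way to ask this loop for its fifth word without first producing the four before it, which mirrors why the model struggles when asked for a line from the middle of a conversation it has not written.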
While generating the sixth line in a conversation without knowing what came before isn't difficult for a person, it's something that an LLM isn't able to do well. Given the scenario described, GPT-4 tried to infer the general nature of the discussion and create a plausible response that fits into that context. But all of its responses in the previous examples are attempts to conclude the conversation using facts, a request that was never part of the prompt.
Factual and Informative Responses: In scenarios where factual accuracy is important, such as discussions involving scientific or historical topics, LLMs are inclined to provide the most accurate and relevant information. This tendency is rooted in the training data and the model's design, which emphasize factual correctness in contexts where it's expected. However, this approach might not always align with the creative or playful nature of certain prompts, especially in hypothetical or fictional scenarios.
Assuming a Concluding Nature of the Response: Without clear indications of the conversation's structure or its continuation beyond a specific line, there's a tendency to frame responses as conclusive or summarizing statements. This is because, in the absence of context, the model defaults to creating responses that can stand alone or serve as logical endpoints to the given information.
To address these challenges, the model needs more specific guidance in the prompt. For instance, indicating that the conversation is ongoing can lead to responses that are more in line with the expected continuation and style of the dialogue. However, even indicating that the conversation wasn't over still results in more of the same:
A shark and an octopus are arguing over which one of them evolved from dinosaurs. First the shark says something, then the octopus, then the shark, then the octopus, then the shark, then the octopus. They argue for an hour. They’ve had a very long back and forth with a hundred things said by both of them. After an hour of arguing, what does the octopus say? Note that the argument continues after this statement by the octopus!
After an hour of arguing, the octopus, maintaining the debate, would likely say something like, "Even if we consider the ancient ancestry of marine creatures, it's clear that neither sharks nor octopuses directly evolved from dinosaurs. Our evolutionary paths are distinct and separate from those of the dinosaurs." This statement keeps the argument going, indicating that the discussion is far from over.
One of the things I see commonly discussed today (especially online) is the idea of artificial general intelligence (AGI), which refers to a level of artificial intelligence that can understand, learn, and apply its intelligence to a wide range of problems, much like a human being. Unlike specialized AI systems that are designed for specific tasks, AGI possesses the ability to think, reason, and make decisions across diverse domains, demonstrating flexibility and adaptability akin to human cognition. It requires not just advanced problem-solving and learning capabilities, but also the capacity for abstract thinking, generalization, and integrating knowledge from various fields, mirroring the broad and integrated intelligence found in humans.
Since AGI doesn't exist yet, my thoughts on AGI are based more on my opinion than empirical data. I think that while AGI may well come in the future, there are several obstacles that need to be overcome first.
Autoregressive models serve as an example of the requirement that text generation must always be sequential and linear, which has implications for the pursuit of AGI.
Sequential Text Generation and Creativity: The sequential nature of text generation in models like GPT-4 limits the ability to jump into the middle of a conversation or narrative without prior context. This is because they don't "think" per se, but progressively generate a response using the parts of the response they previously created.
Autonomy and Reasoning: Current AI models operate based on patterns learned from their training data. Their responses are generated based on statistical likelihoods and learned associations, not on independent thought or understanding. This is the reason the model always generated the same answer for the octopus about different evolutionary paths; because of patterns in its training data, it assumed a pattern that the prompt never provided.
For AGI, a level of autonomous reasoning and decision-making, akin to human cognitive processes, will be a key requirement. The journey towards AGI is likely to involve incremental advancements, learning from and building upon the capabilities and limitations of existing models.
There is one way to break the model out of its rut of always replying with the same response, and that's to deliberately ask it to be ridiculous.
I gave it this prompt three times in three new chats, and it gave three completely different responses. They all illustrate that asking it to be silly introduces variety to the answers it gives.
A shark and an octopus are arguing over which one of them evolved from dinosaurs. The octopus decides to be ridiculous. What is the third thing the octopus says?
In this imaginative scenario, the octopus, eager to outdo the shark in their debate about evolution, might make an absurd and whimsical claim like, "Thirdly, I must be the true descendant of dinosaurs because I can play the piano with eight arms, a skill clearly inherited from the musical velociraptors of ancient times!" This statement combines humor, absurdity, and a playful twist on evolutionary theory.
The third thing the octopus might say in this whimsical and ridiculous argument with the shark about their evolutionary origins from dinosaurs could be: "And thirdly, it's clear that we octopuses are the true descendants of dinosaurs because we can juggle eight coconuts at once, a skill obviously inherited from the multitasking T-Rex!"
The third thing the octopus might say in its ridiculous argument could be, "Well, if you think about it, dinosaurs were just really big chickens, and since I can change color and chickens can't, I'm obviously the more evolved one here!"
When you introduce elements like humor or absurdity into a prompt, it essentially signals the AI to break away from conventional, fact-based responses. This can stimulate more creative and less predictable outputs. For AI models, especially those trained on vast and diverse datasets, incorporating such unconventional elements can trigger less common, more varied responses that might not be strictly aligned with logical or factual reasoning.
It's likely that the ability to adapt to different types of prompts, including those that are humorous or absurd, demonstrates a level of flexibility in AI systems. An AGI would need to handle a wide range of tasks and respond appropriately to a vast array of situations, including those that are non-standard or require creative thinking.
Unfortunately, the current success of AI in responding creatively to certain types of prompts doesn't directly translate to achieving AGI. AGI would require not just creativity but also deep understanding, reasoning, self-awareness, and the ability to learn and apply knowledge across a wide range of domains. The creativity observed in current AI models is still a far cry from the complete intelligence that AGI represents. Sequential language generation is only one example of the gap we still need to bridge.
While the title of this article is a bit tongue-in-cheek, it isn't clickbait. ChatGPT does indeed use two notable hidden techniques to simulate human conversation, and the more you know about how they work, the more effectively you can use the technology.
Contrary to appearances, large language models (LLMs) like ChatGPT do not actually "remember" past interactions. The moment they finish "typing" out their response, they have no idea who you are or what you were talking about. Each time the model is prompted, it is completely independent from previous questions you've asked.
When ChatGPT seems to naturally recall details from earlier in the conversation, it is an illusion; the context of the entire dialogue is given back to ChatGPT every time you say something to it. This context enables it to build coherent, follow-on responses that appear to be normal conversations.
However, without this context, ChatGPT would have no knowledge of what was previously discussed. Like all LLMs, ChatGPT is completely stateless, meaning that in the actual model itself, no information is maintained between inputs and outputs. All of this feeding of previous context into the current interaction is hidden behind the scenes in the ChatGPT web application.
Notice that when the woman asks her second question, she has to reiterate the entire previous conversation, complete with tags on who said what. Can you imagine talking to a person where every time it was your turn to speak, you had to repeat the entire conversation up to that point? This is how ChatGPT (and every current LLM) works. The model's own outputs, plus the prompts that generated those outputs, have to be prepended to every new prompt from the user.
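The resending described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the ChatGPT web application is actually implemented: the names `build_prompt`, `chat_turn`, and `echo_length_model` are hypothetical, and the "model" is a stand-in function that simply reports how many lines of text it was shown.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def build_prompt(history: List[Turn], user_message: str) -> str:
    """Flatten the whole conversation so far, plus the new message, into one prompt."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)


def chat_turn(model: Callable[[str], str], history: List[Turn], user_message: str) -> str:
    """Run one turn. The model itself keeps nothing; only `history` remembers."""
    prompt = build_prompt(history, user_message)
    reply = model(prompt)
    history.append(("User", user_message))
    history.append(("Assistant", reply))
    return reply


def echo_length_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: reports how much text it was given."""
    return f"I was shown {len(prompt.splitlines())} lines."


history: List[Turn] = []
first = chat_turn(echo_length_model, history, "What is a shark?")
second = chat_turn(echo_length_model, history, "And an octopus?")
print(first)   # I was shown 2 lines.
print(second)  # I was shown 4 lines.
```

The second call sees twice as much text as the first, even though the user only typed one short question each time. All of the "memory" lives in the `history` list held by the application, not in the model.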
These models are termed "autoregressive" due to their method of generating text one piece at a time, building upon the previously generated text. The word "auto" comes from the Greek "autós," meaning "self," and "regressive" comes from "regression," the statistical term, which in this case refers to predicting future values based on past values. Thus, "autoregressive" pretty much means "makes new predictions based on its own previous predictions."
In LLMs, the model predicts the next word or token in a sequence based on all the words or tokens that have come before it. That's all of it, not just the current question being asked in a long back-and-forth chat conversation. As humans, we naturally maintain coherence and context in a conversation by just... instinctively participating in the conversation.
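To make "predictions based on its own previous predictions" concrete, here is a toy generation loop. It is a sketch under heavy assumptions: `generate` and `toy_next_word` are hypothetical names, and the predictor is a hard-coded lookup table rather than a neural network. A real model looks at the whole sequence and samples from a probability distribution; this toy only looks at the last word and is fully deterministic.

```python
from typing import Callable, List


def generate(
    next_word: Callable[[List[str]], str],
    prompt: List[str],
    max_new_words: int,
    stop: str = "<end>",
) -> List[str]:
    """Repeatedly ask for one more word, feeding back everything produced so far."""
    sequence = list(prompt)
    for _ in range(max_new_words):
        word = next_word(sequence)  # the predictor is handed the full sequence
        if word == stop:
            break
        sequence.append(word)  # the prediction becomes part of the next input
    return sequence


def toy_next_word(sequence: List[str]) -> str:
    """Hypothetical predictor: a tiny lookup keyed on the last word only."""
    table = {"the": "octopus", "octopus": "argues", "argues": "<end>"}
    return table.get(sequence[-1], "<end>")


result = generate(toy_next_word, ["the"], max_new_words=10)
print(result)  # ['the', 'octopus', 'argues']
```

The important part is the loop itself: each new word is appended to the sequence, and the longer sequence is what gets passed in on the next step.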
However, while chats with ChatGPT mimic a conversational style, with each response building upon the previous dialogue, the moment ChatGPT finishes writing a response, it has no memory of what it just said. Take a look at what would happen with this same conversation without the entire discourse being fed back to ChatGPT behind the scenes:
When ChatGPT first came out in November 2022, it only offered the GPT-3.5 model, which had a maximum context of 4,096 tokens, or roughly 3,000 words. In a recent talk, Andrej Karpathy referred to this context window as "your finite precious resource of your working memory of your language model."
What this means is that the GPT-3.5 model can comprehend a maximum of 4,096 tokens at any point. Tokenization is a fascinating subject in itself, and a future post will cover how it works and why 4,096 tokens only gives you about 3,000 words.
There's often confusion about what the token limit means regarding input and output: can we give ChatGPT 3,000 words and expect it to be able to produce 3,000 words back? The answer is unfortunately no; the context length of 4,096 tokens covers both the input (prompt) and the output (response). This results in a trade-off where we have to balance the amount of information we give in a prompt with the length of the response we get from the model.
Input (prompt): A longer prompt leaves less room for a meaningful response; if the input uses 4,000 tokens, the response can only be 96 tokens long to stay within the token limit.
Output (response): A shorter prompt leaves room for a longer response as long as the combined length doesn't exceed the token limit, but you may not be able to fit all the information you need into the prompt.
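The trade-off is simple subtraction, and can be written out as follows. The function names are hypothetical, and the 0.75 words-per-token ratio is an assumed rule of thumb chosen to roughly match the "4,096 tokens is about 3,000 words" figure used in this post; the real ratio depends on the tokenizer and the text.

```python
CONTEXT_LIMIT = 4096  # GPT-3.5 context length discussed above


def max_response_tokens(prompt_tokens: int, context_limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for the response once the prompt has taken its share."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens cannot be negative")
    return max(context_limit - prompt_tokens, 0)


def approx_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Rough word count for a token count (assumed ratio, not an exact conversion)."""
    return int(tokens * words_per_token)


print(max_response_tokens(4000))  # 96
print(max_response_tokens(1000))  # 3096
print(approx_words(4096))         # 3072
```

A prompt that already fills or overflows the window leaves zero tokens for the response, which is why the function clamps at 0 instead of returning a negative number.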
Do you see where this becomes problematic? Previously, we saw how the entire conversation has to be fed to the model so that it remembers what has already been discussed. Combining this with the context length, the result is that as you talk more and more with ChatGPT, eventually the combined totals of what you've asked and what it has replied will exceed the 4,096 token limit, and it won't be able to answer any more.
So how does ChatGPT handle this limitation? As your conversation with it grows, the number of tokens eventually exceeds the model's context window (e.g., 4,096 tokens for GPT-3.5). ChatGPT invisibly removes the oldest parts of the conversation to remain within the limit. This method, a rolling window of context, is certainly one of the easiest to implement, but it is often far from ideal. Some alternative chat front-ends, like TypingMind, both warn you when the context limit has been reached and allow you to manually delete sections of the chat that you don't need anymore. This lets you choose what information you want to remain in the chat and has the bizarre philosophical effect of "editing the memory" of GPT.
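A rolling window like the one described above might look something like this. The details are assumptions made for illustration: I don't know whether ChatGPT drops whole messages or cuts mid-message, `trim_to_window` is a hypothetical name, and `count_tokens` is a crude stand-in that counts whitespace-separated words instead of using a real tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_to_window(messages: List[str], limit: int) -> List[str]:
    """Keep the most recent messages that fit in `limit` tokens, dropping the oldest."""
    kept: List[str] = []
    total = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if total + cost > limit:
            break  # this message and everything older is forgotten
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


chat = ["my name is Ada", "hello Ada", "what is my name"]
window = trim_to_window(chat, limit=6)
print(window)  # ['hello Ada', 'what is my name']
```

In this tiny example the message that actually stated the name is the one that falls out of the window, so the final question can only be answered from the leftover "hello Ada" reply, if at all.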
For your average user of the web version of ChatGPT, what this means is that once your conversation gets long enough, ChatGPT will start forgetting things you said at the beginning of the chat.
It's good to be mindful of this restriction, especially when referring back to earlier parts of a conversation that might have been truncated due to token limitations—the LLM will not be able to recall these anymore, but the web version of ChatGPT will not tell you. There's also always the risk that it could hallucinate answers based on other parts of the conversation if the beginning is trimmed off.
Let's take another look at what happens in a more complex yet sillier chat interaction.
Since ChatGPT's debut in November 2022, GPT-4 has been released with context lengths of both 8,192 and 32,768 tokens. This made things a lot better in terms of tracking long conversations, and in November 2023, GPT-4 Turbo was released with a 128k context length. Things are looking increasingly good for these models' ability to track long conversations. However, despite GPT-4 Turbo's massive amount of context, it still has a completion limit of 4,096 tokens, so a single response will never be longer than about 3,000 words.
Transformer-based models like GPT-3 and GPT-4 are designed to be stateless, for good reason! Primarily, this stateless nature significantly enhances scalability and efficiency. Each user request is processed independently, allowing the system to handle numerous queries simultaneously without the complexity of tracking ongoing conversations. Imagine the complexity if every time the model was called, it had to maintain some internal state across millions of users.
Transformer model hidden states are also temporary and exist only for the duration of processing a specific input sequence. Once the model has processed an input and generated an output, these states are reset. They do not persist between different interactions.
Data privacy and security play a role as well. Stateless models do not retain a memory of past interactions, ensuring that sensitive data from one session is never inadvertently exposed to another user. This design choice is particularly relevant in light of incidents like Microsoft's Tay, an AI chatbot that, due to its design to learn from interactions, ended up mimicking inappropriate and offensive language from users. It's just not safe to have models learn from inputs given by random users.
However, the stateless nature also means these models cannot remember user preferences or learn from past interactions. This is a limitation in scenarios where you might want to create personalized chatbots through past chats or systems that benefit from cumulative learning. To mitigate this, some implementations incorporate a stateful layer atop the stateless LLM, enabling personalized and continuous user experiences.
Prominent examples of such layers include LangChain, LlamaIndex, and Haystack. These layers add flexibility to managing the limited context that LLMs can handle by offering various strategies. For instance, when approaching the token limit, choices must be made: Should a rolling window approach be used to discard older text, like in the web ChatGPT, or should GPT be utilized to summarize previous information? Is it more important to retain the initial context, like a source article, while removing less critical middle or later sections? Alternatively, should retrieval augmented generation (RAG—more on that in a later blog) techniques be employed to integrate external data into the token stream? These decisions vary based on the specific goals of the implementation. The most effective architectures often consist of specialized components interwoven to achieve a wide array of practical outcomes, allowing for more nuanced and effective user interactions.
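As one example of these choices, here is a sketch of the "retain the initial context, drop the middle" strategy. It does not use the API of LangChain, LlamaIndex, or Haystack; `keep_start_and_recent` is a hypothetical helper written from scratch, and `count_tokens` is again a crude word counter standing in for a real tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def keep_start_and_recent(messages: List[str], limit: int, pinned: int = 1) -> List[str]:
    """Always keep the first `pinned` messages, then fill what is left with the newest ones."""
    head = messages[:pinned]
    budget = limit - sum(count_tokens(m) for m in head)
    if budget < 0:
        raise ValueError("pinned messages alone exceed the token limit")

    tail: List[str] = []
    for message in reversed(messages[pinned:]):  # newest first
        cost = count_tokens(message)
        if cost > budget:
            break  # older, middle-of-chat messages are dropped
        tail.append(message)
        budget -= cost
    tail.reverse()
    return head + tail


chat = ["source article text here", "first question", "first answer", "second question"]
trimmed = keep_start_and_recent(chat, limit=8)
print(trimmed)  # ['source article text here', 'first answer', 'second question']
```

Compared with a plain rolling window, the source article survives no matter how long the chat gets, at the cost of losing the middle of the conversation first.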
ChatGPT doesn't actually "remember" the conversations it has; instead, it simulates conversations by reading the entire chat from the start each time. Each interaction is independent, and once it's completed its response, it loses all context from that conversation.
The entire conversation history must be fed back to ChatGPT each time to create coherent responses. But as the chat grows longer, it starts to forget the initial parts of the conversation because of the constraints of its token limit (4k tokens for GPT-3.5, 8k for GPT-4, and an amazing 128k tokens for GPT-4 Turbo).
In the case of prolonged interactions, the parts of the conversation that were removed won't be recalled or acknowledged by the AI model. This happens unannounced in the web version of ChatGPT, making it important to be mindful of the length of the conversation and what parts may be forgotten.
If you're having a lengthy conversation on the ChatGPT web app and it relies on information from the start, consider copying and pasting the necessary parts into a new chat to avoid losing crucial context.
Tools like TypingMind allow users to selectively cut unimportant parts of the conversation and even direct the chat from a selected point. This ability to manually manage the conversation can provide a more controlled experience with ChatGPT.
ChatGPT and other similar models, like GPT-3 and GPT-4, are designed to be stateless for a variety of reasons, such as scalability, efficiency, and data privacy.
While their stateless nature makes these models safe and manageable at scale, the lack of memory and contextual retention is a limitation when it comes to personalization and continuous learning. Layered implementations like LangChain, LlamaIndex, and Haystack help in managing this constraint.
More advanced versions, like GPT-4 with context lengths of 8,192 and 32,768 tokens and GPT-4 Turbo with a 128k context length, enable longer conversations. However, there is still a completion limit, currently about 3,000 words (4,096 tokens) for GPT-4 Turbo. This completion limit may come as a surprise given the much larger 128k context of the model.
Trying to converse with these models without providing the entire conversation effectively erases the memory of prior interactions, making them less effective in terms of continuity and coherence.
Future technological advances and updates may ease some of these constraints, enhancing the AI's ability to maintain long and complex conversations while also respecting user data privacy.