From Bigram to Infinigram

I have been thinking more and more about how n-grams are great tools for showing some of the fundamentals of language models. With the new infini-gram paper, I think they are becoming relevant again, so I decided to write an explanation of them.

Statistical language models represent the probability of a sequence as the probability of the first token times the probability of each following token given the tokens that came before it1. More formally:

$$P(w_1, w_2, \ldots, w_n) = P(w_1) \times \prod_{i=2}^n P(w_i | w_1, \ldots, w_{i-1})$$
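
For example, a three-word sentence decomposes as:

$$P(\text{the}, \text{quick}, \text{fox}) = P(\text{the}) \times P(\text{quick} \mid \text{the}) \times P(\text{fox} \mid \text{the}, \text{quick})$$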

Bigrams

Bigrams are the special case of an n-gram in which only the previous token is used to predict the next one.

To train a bigram model, you first build a vocabulary of unique words from the corpus you are using and then count every pair of adjacent words that occurs in it. Finally, you estimate the bigram probabilities by dividing the count of each bigram by the count of its first word.

$$P(w_i | w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$$

In practice, you work with log probabilities to avoid numerical underflow when multiplying many small probabilities, and you smooth the counts so that unseen pairs don't end up with a probability of exactly zero.
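
As a minimal sketch of this training procedure (the toy corpus and variable names are just for illustration), counting the pairs and turning them into log probabilities could look like this:

    import math
    from collections import Counter, defaultdict

    # toy corpus; in practice this would be a large collection of text
    corpus = "the quick brown fox jumps over the lazy dog . the dog sleeps ."
    tokens = corpus.split()

    vocab = sorted(set(tokens))                       # unique words
    unigram_counts = Counter(tokens)                  # count(w_{i-1})
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(w_{i-1}, w_i)

    # P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}),
    # stored as log probabilities to avoid underflow when multiplying many of them
    log_probs = defaultdict(dict)
    for (prev, cur), pair_count in bigram_counts.items():
        log_probs[prev][cur] = math.log(pair_count / unigram_counts[prev])

    print(math.exp(log_probs["the"]["quick"]))  # P(quick | the) = 1/3 on this toy corpus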

If you build the bigram count matrix for a real corpus, you will see that it is full of zeros: most word pairs simply never occur. And as you scale n-grams up to higher orders like pentagrams2, the matrices become extremely sparse and take up enormous amounts of space. Let's not forget that the curse of dimensionality in language bites really hard, and pretty fast, for these statistical models.

Infinigram

A way to handle this problem when you don't have enough data for a specific n-gram is backing off: dropping to n − 1, then n − 2, and so on, until you find a probability for the next token. Infinigram takes this to the extreme: it starts from the largest possible n (the whole context) and backs off until it finds the longest suffix of your sentence that actually appears in the corpus; a minimal sketch of this loop follows the example below.

For example:

the quick brown fox jumps over the lazy -> NOT FOUND.
quick brown fox jumps over the lazy -> NOT FOUND.
brown fox jumps over the lazy -> NOT FOUND.
fox jumps over the lazy -> NOT FOUND.
over the lazy -> dog.
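
A minimal sketch of that backoff loop, assuming a hypothetical `lookup` function that returns the next-token candidates seen after a given sequence in the corpus (empty if the sequence never appears):

    def longest_suffix_prediction(context_tokens, lookup):
        # try the full context first, then keep dropping the leftmost word
        suffix = list(context_tokens)
        while suffix:
            candidates = lookup(suffix)    # hypothetical corpus lookup
            if candidates:
                return suffix, candidates  # longest suffix found in the corpus
            suffix = suffix[1:]            # NOT FOUND -> back off by one word
        return [], {}                      # no suffix of the context was found

    # On the example above, this would end up at ["over", "the", "lazy"] -> "dog".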

I really like the n-gram approach to language because intuitively it feels like you are talking to a collective: the n-grams are just snippets of sentences from the training data that try to predict the next token. And even in their current state (not winning against LLMs), there are still good reasons to use them, for things like query search and fact checking.

In the paper, they never evaluate infinigram on its own; instead they mix it with a Large Language Model (LLM) to improve the LLM's outputs, getting perplexity reductions of up to 40% on small models like GPT-2 and around 20% on models like Llama-2.
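
I won't reproduce the paper's exact setup here, but the basic idea of mixing the two next-token distributions can be sketched as a simple linear interpolation (the weight `lam` and the function name are mine, not the paper's):

    def mix_distributions(p_llm, p_infinigram, lam=0.5):
        # both inputs map token -> probability over the same vocabulary;
        # lam is an illustrative mixing weight, not a value from the paper
        tokens = set(p_llm) | set(p_infinigram)
        return {t: (1 - lam) * p_llm.get(t, 0.0) + lam * p_infinigram.get(t, 0.0)
                for t in tokens}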

Although the paper quite clearly mentions that infinigram is not yet ready to generate open-ended text, there is a public API (the endpoint used below is https://api.infini-gram.io/), so let's try it out.

Infinigram generation

I created a quick and dirty Python script that calls the infinigram API; it can be found in my monorepo for teMporary ideas, m.


    import requests

    prompt = "the best movie of all time is"  # example prompt
    state = prompt
    final_output = prompt

    while True:
        payload = {
            'corpus': 'v4_dolma-v1_6_llama',
            'query_type': 'ntd',
            'query': state,
        }
        result = requests.post('https://api.infini-gram.io/', json=payload).json()

        # back off: if no continuation was found, drop the first word and retry
        if not result['result_by_token_id']:
            state = ' '.join(state.split()[1:])
            print("Current state:", state)
            if len(state) == 0:  # the whole context was exhausted
                break
            continue

        # take the first token of the returned next-token distribution
        token_id = next(iter(result['result_by_token_id']))
        token_info = result['result_by_token_id'][token_id]
        token = token_info['token']

        # append the new token
        # ... there is more code here but abbreviated to make it look nicer.
        state += " " + token
        final_output += " " + token

        if len(final_output.split()) >= 100:
            break
        if '<0x0A>' in state or len(state) == 0:
            break

        print(final_output, end='\r')
    

The results feel like a blast from the past for language models, or like a language model that has not been aligned yet. After experimenting a little with the model, here are some prompts with their responses.

PROMPT: “The circumference of the sun”

RESPONSE:

"in miles is 2,730,000. An odd coincidence as the mile is not a standard based on any calculation other than itself. Perhaps the unit of length called the foot did not originate from a part of anatomy of an ancient king or pharaoh but from circumference of our planet?"

PROMPT: “Do you dream?”

RESPONSE:

"Consider what it might mean to be a literalist of the imagination, and where it might lead you—to what places?"

PROMPT: “the best movie of all time is”

RESPONSE:

"Schindler’s list. The 1993 winner of the Academy Award for Best Picture stars Liam Nelson as Schindler. It is regarded as the greatest holocaust movie of all time and is not suitable for viewers under the ages of 18. This is because of its scenes of sexual content."

Conclusion

Even if these models are not the best at generating text, I genuinely believe that old ideas are making a comeback, now at a much bigger scale3. Maybe I'll run some perplexity measurements for infinigram. For now, I'll keep my nostalgia glasses on and enjoy this LM generating some cute text.

Footnotes
