Paper Review: Compression Represents Intelligence Linearly

"There is a belief that learning to compress well will lead to intelligence. Recently, language modeling has been shown to be equivalent to compression, which offers a compelling rationale for the success of large language models (LLMs): the development of more advanced language models is essentially enhancing compression which facilitates intelligence. Despite such appealing discussions, little empirical evidence is present for the interplay between compression and intelligence. In this work, we examine their relationship in the context of LLMs, treating LLMs as data compressors."
Source

This paper tries to correlate two metrics:

  1. Average benchmark scores, taken as "intelligence";
  2. Ability to compress language.

The first metric is self-explanatory. The second is much more difficult to understand. How do we measure an LLM's ability to compress language?

  • The authors propose measuring bits per character (BPC), and motivate it as follows:

"Intuitively, we seek to utilize pmodel(x) to encode rare sequences with more bits and the frequent ones with fewer bits."

"Due to the variety of tokenizers employed by different LLMs, the average bits per token are not directly comparable. Therefore, we utilize the average bits per character (BPC) as the metric."
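The idea above can be made concrete with a minimal sketch. This is not the paper's code; `bits_per_character` and its inputs are hypothetical names for illustration. The model's code length for each token is -log2 p(token), and we divide by the number of characters rather than tokens so that models with different tokenizers stay comparable.

```python
import math

def bits_per_character(token_log_probs, text):
    """Average bits per character (BPC) for a piece of text.

    token_log_probs: natural-log probabilities the model assigned to
    each actual next token (hypothetical inputs for illustration).
    """
    # Total code length in bits: -log2 p(token), summed over all tokens.
    total_bits = -sum(lp / math.log(2) for lp in token_log_probs)
    # Normalize by characters, not tokens, so different tokenizers
    # remain comparable.
    return total_bits / len(text)

# Toy example: three tokens covering a 12-character string.
print(bits_per_character([-0.1, -2.3, -0.7], "hello world!"))
```

Rare (surprising) tokens get low probability and therefore cost many bits; frequent ones cost few, which is exactly the intuition the authors describe.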

But the question remains: how do we know the "bits" per character for an LLM? It's not something you can read off the generated text. What are the authors talking about?

We need a bit more context here. Let's look at an earlier paper that developed this idea more:

Language Modeling Is Compression

from Google DeepMind and Meta AI

The core idea behind this paper is to combine language models with a statistical compression algorithm.

LLMZip: Lossless Text Compression using Large Language Models

on arXiv

This paper gives a better description:

  • We start with a piece of text like:
  • "My first attempt at writing a book"
  • We use the first 5 words (or tokens) as context and ask the LLM to rank candidates for the next word.
  • If the correct word is the model's top choice, its index is 0, and we store that value = 0.
  • Then we move forward one word and repeat.
  • Again the correct word is ranked first, so we store 0; otherwise we would store its rank.

After we repeat this process over and over, we end up with a string of indexes, e.g. [1, 0, 0, 0, 2, 5, 1, ...] as a simple example. Those sequences are fed into a compression algorithm.
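The steps above can be sketched in a few lines. This is a toy stand-in, not LLMZip's code: `predict_ranked` here is a hypothetical lookup table playing the role of an LLM that returns candidate next words sorted by probability.

```python
def ranks_for_text(words, predict_ranked):
    """Turn text into a list of rank indexes, as LLMZip does.

    predict_ranked(context) must return candidate next words, best
    first (here a stand-in for an LLM's sorted vocabulary).
    """
    ranks = []
    for i in range(1, len(words)):
        candidates = predict_ranked(words[:i])
        # Store where the true next word appeared in the ranking;
        # a good model puts it at or near index 0.
        ranks.append(candidates.index(words[i]))
    return ranks

# Hypothetical predictor for illustration: maps the last word seen
# to a ranked list of guesses for the next word.
table = {"my": ["first", "own"], "first": ["attempt", "try"],
         "attempt": ["at", "to"], "at": ["writing", "reading"],
         "writing": ["a", "the"], "a": ["book", "novel"]}
predict = lambda ctx: table[ctx[-1]]

print(ranks_for_text("my first attempt at writing a book".split(), predict))
# → [0, 0, 0, 0, 0, 0]: this toy model guesses every word correctly
```

The resulting index stream, not the text itself, is what gets handed to the compressor.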

What does this mean?

  • A perfect LLM will always predict the next word with perfect accuracy
  • The indexes will always be 0
  • [0, 0, 0, 0, 0, 0, .... ]
  • This array of indexes can be compressed maximally, giving the smallest possible output.

Let's say that we have 10 words, all predicted perfectly:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We can compress that by saying 10 × [0], giving us an incredibly small compression. Bits < Words.
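The "10 × [0]" trick is just run-length encoding. A minimal sketch (my own illustration, not the paper's coder) shows how a stream of perfect predictions collapses to almost nothing:

```python
def run_length_encode(indexes):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for v in indexes:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1  # extend the current run
        else:
            encoded.append([v, 1])  # start a new run
    return encoded

print(run_length_encode([0] * 10))         # → [[0, 10]]
print(run_length_encode([1, 0, 0, 0, 2]))  # → [[1, 1], [0, 3], [2, 1]]
```

Ten perfectly-predicted words become a single [0, 10] pair, while a noisier stream keeps most of its length.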

On the other hand

  • A terrible LLM will do a terrible job predicting, producing indexes like [1001, 3425, 83847, 923838, etc.]
  • These values will be incredibly hard to compress without loss
  • The compressed output will not be much smaller than the original string itself. Bits ≈ Words
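You can see the gap directly by compressing both kinds of index stream with an off-the-shelf compressor. This sketch uses zlib as a stand-in for the entropy coders used in the papers; the streams themselves are made up for illustration:

```python
import json
import random
import zlib

# A perfect model's stream: all ranks are 0.
perfect = [0] * 1000

# A terrible model's stream: essentially random large ranks.
random.seed(0)
terrible = [random.randrange(100000) for _ in range(1000)]

# Serialize both the same way, then compress with zlib.
size = lambda xs: len(zlib.compress(json.dumps(xs).encode()))

print(size(perfect), size(terrible))  # the all-zero stream is far smaller
```

The all-zero stream shrinks to a handful of bytes, while the random stream stays close to its original size: the better the model predicts, the better it compresses.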

Who cares about any of this?

  • There is an old idea in AI that compression is intelligence, or, at least, highly correlates with intelligence.
  • Claude Shannon introduced Information Theory and published foundational work on the entropy of the English language and how efficiently it can be compressed.
  • Kolmogorov Complexity. In the 1960s, Andrei Kolmogorov and Gregory Chaitin developed concepts around the complexity of a string defined as the length of the shortest possible description of that string within a fixed computational model. This idea suggests that the more you can compress a dataset (i.e., the shorter the description), the more you understand the underlying patterns and structure of the data.

Consider the following two strings of 32 lowercase letters and digits:

abababababababababababababababab
4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

function GenerateString1()
   return "ab" × 16

function GenerateString2()
   return "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"

The first string has a short description ("ab" repeated 16 times); the second has no obvious description shorter than the string itself. In Kolmogorov's terms, the first is far more compressible because it has far more structure.

Why this may or may not be a sound argument.

  • As a thought experiment, let's assume that we have a giant database of every sentence ever written.
  • Compressing any sentence now amounts to storing a single number: its index in the database.


There is no intelligence, only memorization. But awesome compression.
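The thought experiment fits in a few lines. Everything here is hypothetical: a tiny `corpus` stands in for the imagined database of every sentence ever written.

```python
# Hypothetical database of every sentence ever written (tiny toy version).
corpus = ["my first attempt at writing a book",
          "compression represents intelligence linearly",
          "there is no intelligence, only memorization"]

# Build the lookup once: sentence -> index.
index_of = {sentence: i for i, sentence in enumerate(corpus)}

# "Compression" is now just a lookup: any known sentence becomes one number.
compressed = index_of["compression represents intelligence linearly"]
print(compressed)  # → 1
```

The lookup table understands nothing about language, yet it "compresses" known sentences perfectly, which is why pure compression ratio alone may not capture intelligence.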
