The "best" LLM: Musical Chairs

  • Writer: jeredb
  • Aug 8, 2024
  • 4 min read

The world of Large Language Models is evolving at a breakneck pace. It seems like every few days, a new model is released that's supposedly better than the last. This rapid progress is happening across the board, from frontier models like *GPT-4o*, *Claude 3.5 Sonnet*, and *Gemini 1.5 Pro* to open-source models like *Llama 3.1 405B* and *Mistral Large*.


For example, the largest and most capable open-source model to date, *Llama 3.1 405B*, was recently released, and the very next day the (semi-open-source) *Mistral Large* came out packing similar benchmarks.


This constant stream of "groundbreaking" advancements is overwhelming, even for AI enthusiasts. And for the average person who uses AI occasionally, it becomes a blur of identical announcements: "new model better." To be fair, though, this pace of advancement in Artificial Intelligence is unlike anything we've seen in other technological fields.


Model Comparisons: A Moving Target


Every model has its strengths and weaknesses, and each new release attempts to outperform its predecessors and competitors in specific ways. Let's look at some recent examples:


GPT-4o:


Released in May with much fanfare, it boasted improved reasoning capabilities, better benchmarks, and a new "advanced voice chat" feature, all at a more attractive price point. It immediately became the top-of-the-line frontier model.


Drawbacks:


- Most of the features announced haven't yet made it to the general public.

- In my experience, it seems to have a shorter effective context window than GPT-4.


Claude 3.5 Sonnet:


One month later, in June, Claude 3.5 Sonnet was released, easily overtaking the reigning champion, GPT-4o. Since its release, I've found myself using ChatGPT far less, barely at all.


What made Claude 3.5 Sonnet better?


- Improvements in output quality, reasoning, and coding capabilities.

- In terms of benchmarks, it appears to be on par with or better than GPT-4o.


They also introduced a new feature called *Artifacts*, which I love. This feature essentially splits your screen, allowing you to converse with the AI on the left while viewing its output (like code or document analysis) on the right.


However, Claude has its limitations too:


- There's a cap on the number of messages per day (or every few hours).

- Unlike GPT-4o, it doesn't have a voice mode.

- It can't search the internet or run code directly.


Feature Replication and Market Dynamics


Besides the head-to-head race in upgrading benchmarks, expanding context windows, and increasing speed, we're seeing a familiar pattern where companies quickly replicate each other's innovations. For example, OpenAI released custom GPTs, which let users create specialized versions of ChatGPT without actual fine-tuning (awesome); a few months later, Anthropic followed with a similar feature called Projects (awesome+).


While there are differences in implementation (such as how they use knowledge bases: retrieval-augmented generation (RAG) vs. stuffing the documents directly into the context window), the core functionality is remarkably similar.
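
To make the distinction concrete, here's a toy sketch of the two approaches. Everything in it is made up for illustration: the documents, the prompt format, and the naive keyword-overlap retriever (a stand-in for the embedding search a real RAG system would use). Neither vendor publishes exactly how their feature works.

```python
# Toy comparison: "context stuffing" vs. RAG over a tiny knowledge base.

documents = {
    "pricing.md": "Pro plan costs $20/month and includes priority access.",
    "limits.md": "Free tier is capped at 40 messages every few hours.",
    "features.md": "Artifacts splits the screen: chat left, output right.",
}

def build_prompt_stuffing(question: str) -> str:
    """Context stuffing: paste the ENTIRE knowledge base into the prompt.
    Dead simple, but burns context window and scales poorly with corpus size."""
    corpus = "\n\n".join(f"## {name}\n{text}" for name, text in documents.items())
    return f"Use these documents:\n\n{corpus}\n\nQuestion: {question}"

def build_prompt_rag(question: str, top_k: int = 1) -> str:
    """RAG: retrieve only the most relevant chunks, then prompt with those.
    Here keyword overlap stands in for real embedding similarity search."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    snippets = "\n\n".join(f"## {name}\n{text}" for name, text in scored[:top_k])
    return f"Use these documents:\n\n{snippets}\n\nQuestion: {question}"

question = "How many messages does the free tier allow?"
print(len(build_prompt_stuffing(question)))  # whole corpus, every time
print(len(build_prompt_rag(question)))       # only the relevant chunk
```

The trade-off in a sentence: stuffing guarantees the model sees everything but wastes tokens on irrelevant text, while RAG sends only what's likely relevant at the cost of a retrieval step that can occasionally miss.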


With companies offering comparable models and replicating each other's features, how does this affect us as users? Whether we're paying customers or using free versions, this competition provides us with a constantly improving array of AI tools to choose from.


The Ease of Switching and Lack of Loyalty


It's crazy how quickly our loyalty to these models can shift. This is a stark contrast to what we've seen in other tech sectors. For example, iPhone vs Android, Mac vs PC, and even gaming consoles like Xbox vs PlayStation inspire (or require) strong loyalty.


With those other platforms, switching is not so easy. It often involves purchasing expensive hardware and accessories, learning a new interface, and transferring all your data. As a result, people become entrenched in their chosen ecosystem, developing an almost cult-like devotion to their preferred platform.


But when it comes to AI models, it's a completely different story. The barrier to switching is incredibly low. Most offer similarly priced subscription plans or API usage-based pricing. Switching often requires nothing more than cancelling a subscription or changing an API key. This ease of transition, combined with the rapid pace of improvement, leads to a pendulum-like effect when it comes to user preference and market share. We're quick to drop the previous LLM for the latest LLM with better capabilities.
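
Here's roughly what that "migration" looks like in practice: a minimal sketch against the public OpenAI and Anthropic HTTP APIs (request and response shapes as documented in mid-2024; check the current docs before copying anything). Switching providers is one argument and a new API key.

```python
# Sketch: the entire cost of switching LLM providers is one small adapter.
import os
import requests

def ask(provider: str, prompt: str) -> str:
    if provider == "openai":
        resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={"model": "gpt-4o",
                  "messages": [{"role": "user", "content": prompt}]},
        )
        return resp.json()["choices"][0]["message"]["content"]
    elif provider == "anthropic":
        resp = requests.post(
            "https://api.anthropic.com/v1/messages",
            headers={"x-api-key": os.environ["ANTHROPIC_API_KEY"],
                     "anthropic-version": "2023-06-01"},
            json={"model": "claude-3-5-sonnet-20240620",
                  "max_tokens": 1024,
                  "messages": [{"role": "user", "content": prompt}]},
        )
        return resp.json()["content"][0]["text"]
    raise ValueError(f"unknown provider: {provider}")

# Yesterday's favorite vs. today's: the "migration" is one string.
print(ask("anthropic", "Summarize why LLM switching costs are low."))
```

Compare that to moving your photo library and purchased apps from iOS to Android.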


The Race to AGI vs. the Battle for Market Share


The relentless pace of advancement is driven by two main factors:


1. The race to achieve Artificial General Intelligence (AGI)

2. The competition to capture market share


While the race to AGI (which we'll discuss in another post) is more of a long-term goal, capturing market share is a goal for both the short term and the long term (more users, more money, more innovation). Therefore, companies push hard to make headlines with each new advancement, knowing that user loyalty is tenuous at best.


Looking Ahead: The Future of AI Competition


As the AI landscape continues to evolve, I believe we will see companies develop unique, hard-to-replicate features or create ecosystem lock-in (similar to Apple) to combat the loyalty problem. Personally, I like the current competitive landscape and how it drives innovation. But I doubt it will stay like this forever.


Given that Claude 3.5 Sonnet is currently considered the best frontier model, and with recent releases like Llama 3.1 and Gemini (not to mention the financial pressures at OpenAI), it seems very likely that we'll see the elusive GPT-5 soon. Then again, as we've seen with features announced but still unreleased, like "Sora", "advanced voice chat", and "SearchGPT", timelines in AI development (like all software development) can be unpredictable.


What's certain is that every few weeks, new models are released that not only surpass previous benchmarks but also add new features. The back-and-forth nature of these advancements, with companies leapfrogging each other, keeps the field exciting and rapidly evolving.


In the next part of this post, we'll delve into how models are actually compared and evaluated, so we can understand how "the best model" is determined. We'll explore both the benchmarks and community-based approaches, like the Chatbot Arena by LMSys. This will give us a clearer picture of how the industry measures progress in this fast-paced field.
