⚗️ Learning from Agents

LLMs round 2 - lessons learned from working with agents

  1. Less is more
  2. Context is everything
  3. Every token matters
  4. Over-Anthropomorphization
  5. Best practices still apply
  6. Hoping for better models

It’s been some time since my initial impression post on LLMs, and since I’ve recently had an opportunity to gather my thoughts, I wanted to summarize them here as well, so that I can come back later and see how they change over time.

The following are rough notes of lessons learned from working with LLMs and the Agent concept.

Less is more

  • To improve quality, we usually take something away rather than add more
  • In practice, this means distilling your prompts and systems as much as possible

As you distill your ideas, they naturally improve, because when you drop the merely good parts, the great parts can shine more brightly.

  • It’s easy to overdo it with premature planning, fine-tuning, and RAG
  • Until there is proven value from such efforts, better to keep it simple

Context is everything

  • Agents are tool-driven. Without tools they cannot do anything; even the most advanced orchestration loop won’t help without them.
  • High quality tools are the bread and butter of effective agents
  • Emulate a real chat. All context interactions have to account for the fact that models are trained on chat and work best with realistic chat messages.
  • As an example, some tools might produce no output, which is an open invitation to hallucinate. Handle such special cases so that they still produce a coherent chat history (see the sketch below).
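
A minimal sketch of how the empty-output case could be handled. The helper name and the generic role/content message dict are assumptions for illustration, not tied to any specific chat API:

    def tool_result_message(tool_name: str, output: str) -> dict:
        """Wrap a tool result as a chat message the model cannot misread.

        An empty result becomes an explicit statement instead of a blank
        message, which would invite the model to invent an answer.
        """
        if not output.strip():
            output = f"The tool '{tool_name}' completed but returned no results."
        return {"role": "tool", "content": output}

    # Example: a search tool that found nothing still yields a clear message.
    messages = [
        {"role": "user", "content": "Find open invoices for Contoso."},
        tool_result_message("invoice_search", ""),
    ]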

Every token matters

  • Some time ago, we expected that as models got better, they would be less sensitive to minor token variations in prompts.
  • That problem has not solved itself yet. We still encounter cases where a single token makes or breaks a scenario. For example, the difference between singular and plural (Contact -> Contacts) is enough to produce the wrong table name in tests.
  • Reduce the noise - Any token can potentially throw off the LLM
    • As before, you will want to triple-check every single token that makes it into your context, and only include those that add value.
    • Make sure your prompts are as distilled and concise as possible. Adopt a zero-tolerance stance towards grammar mistakes and typos.
  • LLMs increasingly depend on free-text data like names and descriptions. Ensure this data is of high semantic quality, and not too complex.
  • Rule of thumb: If it doesn’t make sense to you, it won’t work for the LLM.
  • This may also apply to localizations!
  • Allow the LLM to focus on what it’s good at
    • If possible, define data at design time rather than relying on the LLM to re-generate GUIDs, etc. (see the sketch below)
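
A hedged illustration of that last point: keep identifiers in a design-time lookup and let the model output only a short, human-readable key. The names (TOOL_IDS, resolve_tool_id) and the example values are made up for this sketch:

    import uuid

    # Defined once at design time; the LLM never sees or generates these GUIDs.
    TOOL_IDS = {
        "create_contact": uuid.UUID("0f8fad5b-d9cb-469f-a165-70867728950e"),
        "create_invoice": uuid.UUID("7c9e6679-7425-40de-944b-e07fc1f90ae7"),
    }

    def resolve_tool_id(llm_choice: str) -> uuid.UUID:
        """Map the short name chosen by the LLM to the real identifier."""
        return TOOL_IDS[llm_choice]  # raises KeyError on an invalid choice

    # The model only has to output "create_contact"; the GUID is looked up here.
    print(resolve_tool_id("create_contact"))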

Over-Anthropomorphization

  • The language used to describe LLM-backed products is too close to human semantics
  • This leads to expectations that go beyond merely predicting the next token
  • We tend to use human verbs like “think” and “know”.
  • “Agent” is a bad term because it inspires human analogies. An alternative is to consistently use “bot” or “LLM-bot” instead, as this more clearly communicates expectations to users.
  • There is no brain / memory / thought / knowledge.
  • Simply put, if critical data is missing from the context, then it is very unlikely that the LLM will predict the correct tokens.
  • The Intelligence Paradox - by Jurgen Gravestein (substack.com)

Best practices still apply

  • This is still just software. All previous best practices apply.
  • Most importantly: Write tests / eval / benchmark.
  • LLM features without tests are like black boxes (same as any software really)
  • You cannot improve what you cannot measure
  • I recommend the usual approach:
    • Iterate with unit tests, incrementally improve with feedback
    • Some tests are better than none
    • Be smart about what to test. Find test cases that provide relatively stable expectations, but represent a distinct behavior.
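
A minimal example of what such a test might look like, mirroring the Contact -> Contacts case above. generate_table_name is a hypothetical wrapper around the actual LLM call, stubbed here so the sketch runs as-is:

    # test_table_names.py - an LLM eval written as a plain unit test (pytest).

    def generate_table_name(prompt: str) -> str:
        """Hypothetical wrapper around the LLM call; stubbed so this sketch is self-contained."""
        return "Contacts" if "contact" in prompt.lower() else "Invoices"

    def test_contact_maps_to_contacts_table():
        # Stable expectation: the plural table name, regardless of phrasing.
        assert generate_table_name("Add a new contact for Jane Doe") == "Contacts"

    def test_invoice_maps_to_invoices_table():
        assert generate_table_name("Create an invoice for order 4711") == "Invoices"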

Hoping for better models

  • Or: The one model to rule them all
  • Some time ago a belief emerged that future models would solve prevalent issues with LLMs
  • Holistic Evaluation of Language Models (HELM) (stanford.edu)
  • While GPT-4o does seem to be the best model on average, it is not significantly outperforming GPT-4 (0613), which at this point is over a year old.
  • Think back one year and what expectations we had back then for where we would be by now. Given how quickly GPT-3.5 and GPT-4 dropped, people were led to believe that we would have some truly amazing next version by mid-2024.
  • New models are different, not just better
    • While we did get many new models (GPT-4 Turbo and now GPT-4o), it’s not entirely clear whether they are improvements across the board
    • Ultimately, there hasn’t been any groundbreaking improvement in LLM output quality since GPT-4.
  • Eventually GPT-5. But at what cost?

Those are my current thoughts. Excited to see how they will have changed in six months to a year!

