

Announcing AutoChain: Build lightweight, extensible, and testable LLM Agents

We are excited to announce the release of AutoChain, a framework developed by Forethought for experimenting with and building lightweight, extensible, and testable LLM agents for the generative AI community. Generative LLM agents have taken the AI industry by storm in recent months and captured the attention of developers everywhere. Despite this, the process of exploring and customizing generative agents remains complex and time consuming, and existing frameworks do not alleviate the pain of evaluating generative agents across varied and complex scenarios in a scalable way.

As such, the main goal of AutoChain is to enable easy and reliable iteration over LLM agents to expedite exploration. This lightweight framework shares many of LangChain's high-level concepts, which we hope will lower the learning curve for both experienced and novice users.

GitHub link:

If you:

  • Are looking for a lightweight yet effective alternative to existing frameworks for experimenting with generative LLM agents
  • Find out-of-the-box agents unsuited to your use cases and need to customize and troubleshoot them further
  • Need to easily evaluate your agent’s performance over multi-turn and complex conversations spanning a wide range of scenarios...

Then you might benefit from using AutoChain!

How does AutoChain differ from existing frameworks?

Simple structure == Easy customization == Better troubleshooting

Existing frameworks already offer user-friendly out-of-the-box starting points for building LLM agents. However, customizing and troubleshooting those agents remains very challenging due to the complexity of those frameworks. In the interest of quick iteration and troubleshooting, Yi Lu, our head of machine learning, developed a lightweight alternative for the community, AutoChain, to explore the capabilities of generative LLM agents. AutoChain provides a stripped-down platform for developing custom LLM agents, with at most two layers of abstraction and more straightforward prompt construction. AutoChain makes it easy for developers to fully customize their agents, whether by adding custom clarifying questions, automatically fixing tool input arguments, or much more. This simplicity in turn saves developers time on experiment overhead, errors, and troubleshooting.

Generative Conversation Evaluation

Another important aspect of our experimentation was evaluating LLM agents. We needed to test a set of complex scenarios with every update to a given agent. Unfortunately, existing frameworks have not yet tackled complex evaluation use cases, and evaluation is limited to single-turn conversations over generic QA datasets. In response, developers often resort to manual methods of evaluating their agents, such as interactively trying different scenarios and making a judgment call on every conversation. This pattern can result in agents that overfit to simple, limited scenarios and that are much slower to build.

To address this, we added a workflow evaluation tool to AutoChain that allows us to simulate the agent’s interlocutor via LLM. Using this framework, developers can evaluate their custom agent on a wide variety of scenarios rapidly and ensure its performance continues to meet their bar. Developers can describe various complex user scenarios and desired outcomes from the conversation in natural language and have the simulated user carry out the conversation. This enables us to easily test and scale to more complex test cases while tuning the agent.
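
The mechanics behind this can be sketched framework-agnostically: one model plays the user according to the scenario description, the agent responds, and the loop continues until the simulated user is satisfied. The helper below is an illustrative stand-in we wrote for this post, not AutoChain's actual API:

```python
def simulate_conversation(agent_reply, user_reply, scenario, max_turns=10):
    """Drive a multi-turn chat between an agent and an LLM-simulated user.

    agent_reply and user_reply are callables standing in for LLM calls;
    scenario is the natural-language description of the simulated user's
    goal. The loop stops when the simulated user has nothing left to ask.
    """
    history = []
    for _ in range(max_turns):
        user_msg = user_reply(scenario, history)
        if user_msg is None:  # simulated user considers the goal met
            break
        history.append(("user", user_msg))
        history.append(("assistant", agent_reply(history)))
    return history


# Deterministic stand-ins so the sketch runs without any LLM:
questions = iter(["What is the weather in Boston?", None])
transcript = simulate_conversation(
    agent_reply=lambda history: "It is sunny in Boston.",
    user_reply=lambda scenario, history: next(questions),
    scenario="user wants current weather information for Boston",
)
```

In a real run, both callables would be prompted LLM calls: the user side is conditioned on the scenario description, and a third judge call rates the finished transcript against the expected outcome.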

After running the workflow evaluation, the developer receives an evaluation report for each scenario, for example:

    "test_name": "get weather for boston",
        "user: What is the current temperature in Boston",
        "assistant: The current temperature in Boston is 72 degrees Fahrenheit. The weather forecast for Boston is sunny and windy.",
        "user: Can you also tell me the humidity level in Boston",
        "assistant: I'm sorry, but I don't have access to the current humidity level in Boston.",
        "user: Can you provide me with the wind speed in Boston as well",
        "assistant: I'm sorry, but I don't have access to the current wind speed in Boston.",
        "user: What is the forecast for tomorrow in Boston",
        "assistant: I'm sorry, but I don't have access to the forecast for tomorrow in Boston."
    "num_turns": 8,
    "expected_outcome": "found weather information in Boston",
        "rating": 4,
        "reason": "The conversation does not reach the expected outcome because the assistant is unable to provide the forecast for tomorrow in Boston, which was requested by the user."
            "tool": "get_current_weather",
                "location": "Boston"
            "tool_output": "{\"location\": \"Boston\", \"temperature\": \"72\", \"unit\": \"fahrenheit\", \"forecast\": [\"sunny\", \"windy\"]}"

How is AutoChain simpler?

To illustrate the simplicity of using AutoChain, we will go over an example of how to customize an LLM agent with AutoChain.
Suppose you are building an agent with the default ConversationalAgent and have observed that the agent sometimes hallucinates inputs to the tools rather than extracting them from the conversation, even after carefully tuning the prompt. You might want to experiment with different modifications to solve this problem. One option is to programmatically check whether the arguments the agent passed to its tools appear anywhere in the conversation history; another is to add a new LLM call that verifies the input arguments. Neither of these is currently supported out of the box by any of the existing frameworks, so both require customization.
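
The first option needs no extra LLM call at all. A minimal sketch of such a validator (a heuristic of our own for illustration, not part of AutoChain):

```python
def args_grounded_in_history(tool_args, history):
    """Check that every tool argument value literally appears somewhere
    in the conversation, i.e. was not hallucinated by the agent.

    Deliberately simple heuristic: exact substring match against the
    concatenated conversation text, case-insensitively.
    """
    conversation = " ".join(history).lower()
    return all(str(v).lower() in conversation for v in tool_args.values())


history = ["user: What is the current temperature in Boston?"]
ok = args_grounded_in_history({"location": "Boston"}, history)
bad = args_grounded_in_history({"location": "Paris"}, history)
```

Substring matching will miss paraphrases ("the capital of France" vs. "Paris"), which is exactly where the second option, an extra verification LLM call, earns its cost.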

To achieve this in AutoChain, you would start at the BaseChain class method run (200 lines) and navigate to the child class Chain (100 lines). Chain implements the agent's execution logic in take_next_step, and this is where you would add your validation logic. You could then look at ConversationalAgent (200 lines), the agent class you initialized your chain with, to see how inputs are passed in or to add the extra LLM call.
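
Schematically, the hook looks like the toy below. This class is a stripped-down stand-in we wrote to show where validation slots into the execution step, not AutoChain's real Chain implementation:

```python
from dataclasses import dataclass


@dataclass
class AgentAction:
    tool: str
    tool_args: dict


class ToyChain:
    """Illustrates the take_next_step pattern: plan an action, validate
    its arguments against the conversation, then execute or ask back."""

    def __init__(self, plan, tools):
        self.plan = plan    # callable: history -> AgentAction
        self.tools = tools  # mapping: tool name -> callable

    def take_next_step(self, history):
        action = self.plan(history)
        # Validation hook: refuse tool args that never appeared in the chat.
        conversation = " ".join(history).lower()
        if not all(str(v).lower() in conversation
                   for v in action.tool_args.values()):
            return "Could you confirm the details I should use?"
        return self.tools[action.tool](**action.tool_args)


chain = ToyChain(
    plan=lambda history: AgentAction("get_weather", {"location": "Boston"}),
    tools={"get_weather": lambda location: f"It is sunny in {location}."},
)
grounded = chain.take_next_step(["user: weather in Boston please"])
hallucinated = chain.take_next_step(["user: hello there"])
```

Because the whole step lives in one method, swapping the substring check for a verification LLM call is a one-line change rather than a dig through layers of abstraction.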

Of course, there are many use cases that can require agent customization other than the example shared above. While many of the concepts in AutoChain will be familiar to LangChain users, you will find that AutoChain's singular focus on customizing and evaluating agents allows it to stay lightweight and easy to use, enabling rapid and reliable development with a gentle learning curve.

In contrast, you would need to navigate several layers of abstraction in other frameworks to find the agent's execution logic. Those layers are hundreds of lines long each, and include many features and interfaces that aren't immediately useful for experimentation. The breadth of features supported by such frameworks comes at the cost of simplicity, ease of use, and troubleshooting ability. While such frameworks might be crucial for production, they are not necessarily the best option for developers with a singular focus on customizing and testing agents in the context of experimentation. Entire courses are being built around learning how to use some of those other powerful frameworks; in contrast, you should be able to get started with AutoChain within minutes.

Example overview

From the example below, one can see that this is essentially a simpler version of initialize_agent from LangChain. Please check out more examples in our GitHub repo.

from autochain.chain.chain import Chain
from autochain.memory.buffer_memory import BufferMemory
from autochain.models.chat_openai import ChatOpenAI
from autochain.tools.base import Tool
from autochain.agent.conversational_agent.conversational_agent import (
    ConversationalAgent,
)

tools = [
    Tool(
        name="Get weather",
        func=lambda *args, **kwargs: "Today is a sunny day",
        description="""This function returns the weather information""",
    )
]

llm = ChatOpenAI(temperature=0)
memory = BufferMemory()
agent = ConversationalAgent.from_llm_and_tools(llm=llm, tools=tools)
chain = Chain(agent=agent, memory=memory)

print(f">> Assistant: {chain.run('What is the weather today')['message']}")

Feature roadmap

AutoChain will always be a lightweight framework; only the strictly necessary features will be added. Coming up soon are the following:
- Support for HuggingFace models
- More text encoder options
- A document loader to facilitate initializing agents with knowledge sources


AutoChain was developed with experimentation in mind. With AutoChain, users can experiment with different customizations to their LLM agents, iterate faster, and test their performance reliably. As such, AutoChain is only meant for exploration purposes due to its lack of production features, such as async execution, connections to monitoring services, and other capabilities that alternative frameworks may provide. We intentionally decided to simplify the process of iterating on and testing LLM agents at the cost of the breadth of production features. However, you can still easily use AutoChain to evaluate and customize agents developed using those same alternative frameworks.


Dive headfirst into building, customizing, and testing LLM agents with AutoChain!
Visit our GitHub repository and access the detailed documentation there. As you experiment and build with AutoChain, your thoughts, feedback, and insights are invaluable. We encourage you to share your experiences and contribute to the continuous enhancement of AutoChain on GitHub.

Let the code do the talking. Start with AutoChain now!