azure, openai-api, langchain

Langchain Pandas agent not following instructions


I'm working with a LangChain pandas agent using GPT-4 from Azure OpenAI as the LLM. The DataFrame contains enterprise data about our employees, and the main objective is to retrieve information about those employees through the agent. Currently, we have two main issues:

  1. When asked to find information from a partial name, the agent never retrieves it. It either looks for an exact match or searches for the whole input as a single substring.

  2. When asked to retrieve the employee with the highest or lowest weekly hours, it does not check for ties even when instructed to look for ties. It just doesn't work.

This is the code I'm using, along with the prefix and suffix.

Prefix:

prefix = """You are a pandas agent. You must work with the DataFrame df containing information about the company's employees.

Your answer must only include information retrieved from df, and you must not create mockup or sample data. You will be penalized if you do.

The user may ask you questions using a substring of the names of our employees.

Follow these useful instructions when retrieving information regarding an employee name:

If an exact match is found, retrieve the information in natural language.

If not, then include a str.contains search ignoring NaNs and case insensitive in this fashion. For example, if they ask for Alice West, look for:

df['NAMES'].str.contains('alice', case=False, na=False) & df['NAMES'].str.contains('west', case=False, na=False)

and retrieve the information found. If we have more than 20 rows, just retrieve the information on the first 20 rows.

When sorting information like retrieving the highest or lowest values of a column, always check for ties. If there are ties, retrieve the first 3 rows of information.

For instance, if the maximum hours of weekly work happens to be 10 but more than one employee has that, then print up to 20 rows."""

Suffix:

suffix = 'You must answer in natural language and must never make up information. You will be penalized if you do.'

Code:

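# Import paths depend on the installed LangChain version; these match recent releases.
import pandas as pd
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents import create_pandas_dataframe_agent

data = {
    "NAMES": ["John W. Doe", "Alice Smith", "John Adams Jr.", "Alice Johnson", "John Jr. Doe"],
    "CITY": ["New York", "Los Angeles", "Chicago", "Houston", "Chicago"],
    "STATION": ["Station A", "Station B", "Station C", "Station D", "Station E"],
    "STARTING_YEAR": [2015, 2017, 2015, 2018, 2019],
    "DURATION_HOURS_WEEK": [40, 35, 40, 30, 45]
}

df = pd.DataFrame(data)

# _model is the Azure OpenAI GPT-4 chat model, configured elsewhere.
agent = create_pandas_dataframe_agent(
    llm=_model,
    df=df,
    suffix=suffix,
    include_df_in_prompt=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    prefix=prefix,
    max_iterations=5,
    verbose=True,
)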

When asked to execute this query: "What are the weekly hours of John Doe?" it retrieves nothing and answers:

The employee is not in the DataFrame provided. Please check the name of the employee.

I noticed that the code the agent runs in the REPL is:

df['NAMES'].str.contains('John Doe', na=False, case=False)

and this gives an empty DataFrame. It should be:

df['NAMES'].str.contains('John', na=False, case=False) & df['NAMES'].str.contains('Doe', na=False, case=False)

to retrieve data for both "John W. Doe" and "John Jr. Doe".
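For reference, the token-wise match I expect can be written directly in pandas. This is only a minimal sketch of the desired behavior: the helper name `find_by_name_tokens` and its `limit` parameter are hypothetical, not something the agent or LangChain provides, and the DataFrame is a trimmed copy of the sample data above.

```python
import pandas as pd


def find_by_name_tokens(df, query, column="NAMES", limit=20):
    """Return rows whose `column` contains every whitespace-separated token of `query`."""
    tokens = query.split()
    if not tokens:
        # An empty query matches nothing rather than everything.
        return df.iloc[0:0]
    mask = pd.Series(True, index=df.index)
    for token in tokens:
        # regex=False matches the token literally, so "Jr." is not treated as a pattern.
        mask &= df[column].str.contains(token, case=False, na=False, regex=False)
    return df[mask].head(limit)


# Trimmed copy of the sample data from the question.
df = pd.DataFrame({
    "NAMES": ["John W. Doe", "Alice Smith", "John Adams Jr.", "Alice Johnson", "John Jr. Doe"],
    "DURATION_HOURS_WEEK": [40, 35, 40, 30, 45],
})

matches = find_by_name_tokens(df, "John Doe")
print(matches)
```

With the sample data, "John Doe" should match both "John W. Doe" and "John Jr. Doe", while the single-substring search the agent runs matches neither.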

The second issue is that, when asked who has the longest working hours, the agent runs this in the REPL:

df['DURATION_HOURS_WEEK'].nlargest(1)

So it retrieves only one employee, even when several employees are tied for the highest value. The prefix instructs the model to always check for ties, but the agent ignores that instruction.
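To make the expected tie handling concrete, here is a minimal sketch in plain pandas. The helper `rows_with_extreme` is a hypothetical name used only for illustration. In the sample data the highest weekly hours value (45) happens to be unique, so the tie is shown on STARTING_YEAR, where two employees share the earliest year.

```python
import pandas as pd


def rows_with_extreme(df, column, highest=True, limit=20):
    """Return every row tied for the max (or min) of `column`, capped at `limit` rows."""
    target = df[column].max() if highest else df[column].min()
    return df[df[column] == target].head(limit)


# Trimmed copy of the sample data from the question.
df = pd.DataFrame({
    "NAMES": ["John W. Doe", "Alice Smith", "John Adams Jr.", "Alice Johnson", "John Jr. Doe"],
    "STARTING_YEAR": [2015, 2017, 2015, 2018, 2019],
    "DURATION_HOURS_WEEK": [40, 35, 40, 30, 45],
})

longest_hours = rows_with_extreme(df, "DURATION_HOURS_WEEK")
earliest_start = rows_with_extreme(df, "STARTING_YEAR", highest=False)

# pandas can also keep ties itself: keep="all" returns every row tied at the cutoff.
earliest_start_alt = df.nsmallest(1, "STARTING_YEAR", keep="all")

print(longest_hours)
print(earliest_start)
```

The `keep="all"` variant is the smallest change to what the agent already generates, since it only adds one argument to the `nlargest` / `nsmallest` call.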

I'm wondering what is the best way to instruct this agent to follow the instructions given. Is it better to use tools or functions for this matter?


Solution

  • To get the best results you need a well-specified prefix and suffix, a clear query passed to invoke, and a capable LLM.

    With the prefix below and the same query, I got the expected output:

    prefix = """
    You are a pandas agent. You must work with the DataFrame df containing information about the company's employees.
    Your answer must only include information retrieved from df, and you must not create mockup or sample data. You will be penalized if you do.
    The user may ask you questions using a substring of the names of our employees.
    Follow these useful instructions when retrieving information regarding an employee name:
    If an exact match is found, retrieve the information in natural language.
    If not, then include a str.contains search ignoring NaNs and case insensitive in this fashion. For example, if they ask for Alice West, split it in two names like alice and west then look for it in dataframe:
    df['NAMES'].str.contains('alice', case=False, na=False) & df['NAMES'].str.contains('west', case=False, na=False)
    and retrieve the information found. If we have more than 20 rows, just retrieve the information on the first 20 rows.
    When sorting information like retrieving the highest or lowest values of a column, always check for ties. If there are ties, retrieve the information for all employees with the tied value, up to 20 rows.
    Below are the columns in my dataframe to use for the query.
    NAMES,CITY,STATION,STARTING_YEAR,DURATION_HOURS_WEEK.
    """
    

    Output: [two screenshots of the agent's responses; images not included]

    So, in the prefix and suffix, try to be more specific: include examples and details about the DataFrame, such as its column names.
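One way to keep that column list in sync with the data is to build the last lines of the prefix from `df.columns` instead of typing them by hand. This is a minimal sketch under that assumption; `BASE_INSTRUCTIONS` and `build_prefix` are hypothetical names, and the instruction text is shortened here for brevity.

```python
import pandas as pd

# Shortened stand-in for the full instruction text shown in the prefix above.
BASE_INSTRUCTIONS = (
    "You are a pandas agent. You must work with the DataFrame df "
    "containing information about the company's employees."
)


def build_prefix(df, instructions=BASE_INSTRUCTIONS):
    """Append the DataFrame's column names to the instruction text."""
    columns_line = ",".join(str(col) for col in df.columns)
    return (
        f"{instructions}\n"
        "Below are the columns in my dataframe to use for the query.\n"
        f"{columns_line}.\n"
    )


# Empty frame with the same columns as the sample data.
df = pd.DataFrame(
    columns=["NAMES", "CITY", "STATION", "STARTING_YEAR", "DURATION_HOURS_WEEK"]
)

prefix = build_prefix(df)
print(prefix)
```

The resulting string can be passed as the `prefix` argument in the same way as the hand-written one.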