SG Data Analyst: Question-answering over a Dataset Repository with Reasoning Agents
Leveraging Langchain agents, OpenAI functions API, and vector stores (FAISS) to implement a dataset repository querying agent
Table of Contents
· Key Issues with Dataset Repositories from an Analyst’s Perspective
· Wishful Thinking with LLMs as Reasoning Agents
· Background on the Dataset Repository
· The Proposed Workflow
∘ 1. Getting the Metadata of all Tabular Datasets
∘ 2. Picking Candidate Datasets Given our Query
∘ 3. Retrieving Candidate Datasets in Full
∘ 4. Execute Code to Address Our Query
· SG Data Analyst is Born… Well Not Really…
· Demo: SG Data Analyst
· Next Steps
· Conclusion
Key Issues with Dataset Repositories from an Analyst’s Perspective
With an increasing emphasis on data-centricity, data practitioners such as data analysts and data scientists are taught that data comes first. For example, the starting point for building data analytics applications such as interactive visualisations, dashboards, and powerful machine learning models is the data. This is where public datasets, or dataset repositories, come into play.
However, from an analyst’s point-of-view, I could identify a few issues that come with dealing with these large dataset repositories:
1. Given a well-defined problem that I wish to solve with data, finding the most appropriate datasets can be tough
- For example, if I would like to perform analysis on the Singapore real estate market, I would probably first look for datasets related to public and private housing.
- This process involves scrolling through datasets to pinpoint the exact ones that best suit my analysis. This could take a long time.
2. Performing exploratory data analysis can seem repetitive to some extent
- Common analyses such as getting the counts of categorical variables through bar charts, or plotting histograms and line charts to observe common trends and patterns, can be arduous — especially when the same kind of preliminary analysis is performed on multiple candidate datasets.
- Arguably, such analyses are nuanced and really depend on the dataset, but most of the time when dealing with public datasets in tabular form, there are certain common analyses that an analyst can perform.
Wishful Thinking with LLMs as Reasoning Agents
With the rapidly evolving landscape of LLMs and their related applications, especially that of LLMs as reasoning agents, I couldn’t help but wonder if agents can be used to explore large dataset repositories to address the mentioned issues.
What if we could provide an agent with a problem statement and the descriptions of these datasets, and the agent could choose the best datasets and perform preliminary analyses on them to help address the problem?
Before we propose a workflow for the agent, we provide some background on the target dataset repository for the agent — data.gov.sg.
Note: We will only target tabular data for now (i.e. files that are exported as .csv files). Stay tuned for more updates!
Background on the Dataset Repository
data.gov.sg is the central repository for Singapore datasets, curated by various government agencies. Singapore-based analysts and scientists alike would have accessed this site for various datasets to perform analysis and data mining — be it for personal or school projects.
Disclaimer: the API used in this project is the v1 API, which is considered legacy — though there is a v2 API. Stay tuned for more updates!
The Proposed Workflow
Before diving into the workflow and implementation: the source code can be found here for reference. Contributions are welcome!
We propose a sequence of steps for our agent:
1. Get the metadata of all tabular datasets
2. Given all metadata, pick candidate datasets that could answer the given query
3. Retrieve the candidate datasets in full
4. Given the candidate datasets, execute Python code iteratively to address the given query
In steps 1 and 3, we discuss straightforward wrappers for the data.gov.sg API. For steps 2 and 4, things get pretty interesting as we bring in the OpenAI functions API, vector stores, and Langchain agents to finalise our agent implementation.
We briefly elaborate on the steps below, the issues encountered along the way, and how we solve them.
1. Getting the Metadata of all Tabular Datasets
We can utilise data.gov.sg’s action/package_list endpoint to get a list of all dataset slugs.
from typing import List

import requests

# assumed v1 endpoint; see the repo for the exact value
package_list_url = "https://data.gov.sg/api/action/package_list"


def get_dataset_names(package_list_url: str = package_list_url) -> List[str]:
    """Returns a list of dataset names.
    :param package_list_url: the url to get the list of datasets :class `str`
    :return: a list of dataset names :class `List[str]`
    """
    res = requests.get(package_list_url)
    result = res.json()['result']
    return result
Then, the action/package_show endpoint can help us get dataset metadata when we provide the dataset slug as a query string.
# assumed v1 endpoint; see the repo for the exact value
package_show_url = "https://data.gov.sg/api/action/package_show?id="


def get_dataset_metadata(dataset_name: str, package_show_url: str = package_show_url) -> dict:
    """Returns the metadata of a dataset.
    :param dataset_name: the name of the dataset :class `str`
    :param package_show_url: the url to get the metadata of a dataset :class `str`
    :return: the metadata of a dataset :class `dict`
    """
    res = requests.get(f"{package_show_url}{dataset_name}")
    result = res.json()['result']
    return result
For ease of use, we define a container class to contain the metadata of each dataset.
from dataclasses import dataclass


@dataclass
class Dataset:
    """Dataset class to store dataset information"""
    id: str
    name: str
    description: str
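Later on, SGDataAnalystAgent.run() (shown at the end of this article) calls dataset.to_document() to convert each Dataset into a Langchain Document for indexing. A minimal sketch of what that method might look like, extending the class above (exactly which fields get embedded is an assumption; see the repository for the actual version):

from dataclasses import dataclass
from langchain.docstore.document import Document


@dataclass
class Dataset:
    """Dataset class to store dataset information"""
    id: str
    name: str
    description: str

    def to_document(self) -> Document:
        # embed the name and description; keep the id in metadata so that
        # search hits can be mapped back to datasets later
        return Document(
            page_content=f"{self.name}: {self.description}",
            metadata={"id": self.id},
        )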
We can call get_dataset_metadata() for all dataset slugs obtained from action/package_list, storing the relevant metadata in Dataset instances. We also filter for CSV datasets, which are the datasets that we are interested in.
def get_csv_datasets(package_list_url: str = package_list_url, package_show_url: str = package_show_url) -> List[Dataset]:
    """Returns the metadata of datasets that are in CSV format.
    :param package_list_url: the url to get the list of datasets :class `str`
    :param package_show_url: the url to get the metadata of a dataset :class `str`
    :return: a list of datasets that are in CSV format :class `List[Dataset]`
    """
    datasets = get_dataset_names(package_list_url)
    csv_datasets = []
    for dataset in datasets:
        metadata = get_dataset_metadata(dataset, package_show_url)
        resources = metadata['resources']
        for resource in resources:
            if resource['format'] == 'CSV':
                csv_datasets.append(Dataset(
                    id=resource['id'],
                    name=dataset,
                    description=metadata['description'],
                ))
    return csv_datasets
Unsurprisingly, this function is pretty slow due to the sequential API calls. The actual implementation of get_csv_datasets is done asynchronously, and can be found in the main repository.
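As a rough illustration, here is a minimal sketch of how an asynchronous version might look using asyncio and aiohttp, building on the helpers above (the repository’s actual implementation may differ):

import asyncio
from typing import List

import aiohttp


async def aget_dataset_metadata(session: aiohttp.ClientSession, dataset_name: str) -> dict:
    # fetch one dataset's metadata without blocking the event loop
    async with session.get(f"{package_show_url}{dataset_name}") as res:
        return (await res.json())['result']


async def aget_csv_datasets() -> List[Dataset]:
    names = get_dataset_names(package_list_url)
    async with aiohttp.ClientSession() as session:
        # fire off all metadata requests concurrently instead of one-by-one
        # (in practice, consider bounding concurrency with a semaphore)
        metadatas = await asyncio.gather(
            *(aget_dataset_metadata(session, name) for name in names)
        )
    csv_datasets = []
    for name, metadata in zip(names, metadatas):
        for resource in metadata['resources']:
            if resource['format'] == 'CSV':
                csv_datasets.append(Dataset(
                    id=resource['id'],
                    name=name,
                    description=metadata['description'],
                ))
    return csv_datasets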
2. Picking Candidate Datasets Given our Query
Initially, the idea was to rely solely on the language model to generate dataset choices, with all metadata provided as context. Essentially, this strategy entails dumping all the metadata into the prompt, followed by asking the model to choose datasets. The example prompt below highlights this strategy:
"""
Here are all dataset metadata enclosed in triple backticks:```
Dataset(name="…", description="…")
…
…
…
Dataset(name="…", description="…")
```
Choose the best datasets that can be analysed to address the query enclosed in triple backticks:
```
What are some observations in university education in Singapore?
```
"""
Well… that didn’t exactly work very well, for the following reasons:
Problem 1: The output is not guaranteed to assume a fixed structure
The language model can potentially return differently formatted outputs at each call, if not properly steered. This means that extracting the chosen dataset IDs from the output can be a chore.
Solving the problem with OpenAI functions API
However, this problem can be solved with the OpenAI functions API — which allows better steerability of language models in outputting structured data. A short summary of how it works:
- Provide one or more function specifications in the call to the OpenAI chat completions endpoint, along with your messages
- 0613 models will output function arguments that fit the defined function specifications, if any
- We can take these outputs to execute defined functions, or to process them further
In this specific example, we can define a get_best_dataset_ids() function specification that returns best_ids: List[str].
For more information, check out the update by OpenAI on 0613 models and their cookbook for some really helpful example code.
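As a preview, a hypothetical specification for get_best_dataset_ids could look like the following (the model never executes the function; it only emits arguments matching the schema — the full request/response handling is shown in step 2 below):

get_best_dataset_ids_spec = {
    "name": "get_best_dataset_ids",
    "description": "Return the ids of the datasets best suited to the query",
    "parameters": {
        "type": "object",
        "properties": {
            "best_ids": {
                "type": "array",
                "items": {"type": "string"},
                "description": "The ids of the best datasets",
            },
        },
        "required": ["best_ids"],
    },
}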
Problem 2: Limited context window of GPT models
Dumping all metadata of the 2,000+ datasets from data.gov.sg is likely to result in a maximum token limit error.
Inherently, this issue has little to do with language models in general, since the trend of newer models accepting ever-larger context windows is likely to continue in the near future. For example, Anthropic’s Claude (whose API I unfortunately do not have access to) has a context window of 100,000 tokens! Nevertheless, there are existing methods that can tackle this problem for the time being.
Solving the problem with good ol’ Retrieval Augmented Generation (RAG)
RAG is one of the most commonly used techniques in LLM application development. Essentially, it involves the following steps:
1. A document collection, or a knowledge base, is encoded into embeddings, and the embeddings are indexed
2. The query is compared against the indexed embeddings through a similarity search
3. The query is ‘augmented’ with the relevant context from the document collection or knowledge base before the final call to the language model
All credits to Sam Witteveen for the diagrams.
We could apply this technique to our problem:
- Convert the collection of metadata into embeddings and index the embeddings
- Get the most relevant datasets based on the ‘closeness’ or similarity of metadata with respect to our query
- Insert the most relevant datasets into the prompt as context for the LLM to generate the final response
For an introduction to vector indexing for similarity search, I highly recommend giving this a read.
Langchain provides an easy way to split and index documents. It also provides easy-to-use wrappers for vector indexes such as FAISS, and vector databases such as Chroma and Pinecone. For this project, I used FAISS, a library for efficient similarity search, with a flat index — which emphasises high search quality. The trade-off, lower search speed, matters less since we are not performing frequent searches.
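For intuition, here is a minimal, self-contained sketch of flat indexing with FAISS directly (illustrative only; random vectors stand in for real embeddings, and Langchain’s wrapper handles all of this for us):

import faiss
import numpy as np

d = 1536  # embedding dimension (e.g. OpenAI's text-embedding-ada-002)
doc_vectors = np.random.rand(1000, d).astype('float32')  # stand-in document embeddings
query_vector = np.random.rand(1, d).astype('float32')    # stand-in query embedding

index = faiss.IndexFlatL2(d)  # flat index: exact, exhaustive L2 search
index.add(doc_vectors)        # index every document vector
distances, ids = index.search(query_vector, 5)  # top-5 nearest documents

Because a flat index compares the query against every stored vector, search is exact but scales linearly with the collection size, which is an acceptable trade-off here. In our implementation, splitting and indexing go through Langchain: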
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS


def split_and_store(dataset_docs: List[Document]) -> FAISS:
    """Splits the documents into chunks and stores them into the vector index.
    :param dataset_docs: the dataset documents :class `List[Document]`
    :return: the vector store :class `FAISS`
    """
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    dataset_split_docs = text_splitter.split_documents(dataset_docs)
    # index the *split* documents, not the originals
    vector_store = FAISS.from_documents(dataset_split_docs, OpenAIEmbeddings())
    return vector_store
After initialising the vector store with our documents, we can retrieve the relevant documents:
db = split_and_store(dataset_docs)
fetched_docs = similarity_search(db, query)
Our similarity_search helper retrieves the top 20 most similar documents. To narrow down the candidate datasets even further, we use the OpenAI functions API to generate the final candidate dataset choices.
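Here, similarity_search is a thin wrapper around Langchain’s FAISS similarity search; a minimal sketch under that assumption:

def similarity_search(db: FAISS, query: str, k: int = 20) -> List[Document]:
    # delegate to Langchain's FAISS similarity search, fetching the top-k documents
    return db.similarity_search(query, k=k)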
The prompt I used can be found below:
def get_dataset_ids_prompt(fetched_docs, query):
    return f"""Given that you know the following datasets with information given in triple backticks:
```
{fetched_docs}
```
Retrieve the relevant datasets that can answer the following query given in triple backticks. If there are no relevant datasets, return an empty list:
```
{query}
```
"""
Narrowing down candidate datasets even further using the OpenAI functions API (with gpt-4-0613 as the model):
from ast import literal_eval
from typing import Dict

# chat_with_retry and GPT_MODEL are a helper and constant defined in the repo


def generate_id_choices(fetched_docs: List[Dict[str, str]], query: str) -> List[str]:
    """Generates the choices for the ids of the datasets.
    :param fetched_docs: the fetched documents :class `List[Dict[str, str]]`
    :param query: the query :class `str`
    :return: the ids of the datasets :class `List[str]`
    """
    messages: List[Dict[str, str]] = [
        {"role": "user", "content": get_dataset_ids_prompt(fetched_docs, query)}]
    functions = [
        {
            "name": "get_datasets_from_ids",
            "description": "Get the datasets from the ids",
            "parameters": {
                "type": "object",
                "properties": {
                    "ids": {
                        "type": "array",
                        "items": {
                            "type": "string",
                            "description": "The ids of the datasets",
                        },
                    },
                },
                "required": ["ids"],
            },
        },
    ]
    # get best ids using openai once, forcing the function call
    response = chat_with_retry(
        model=GPT_MODEL,
        messages=messages,
        functions=functions,
        function_call={"name": "get_datasets_from_ids"},
        temperature=0,
    )
    try:
        ids_raw = response["choices"][0]["message"]["function_call"]["arguments"]
        # then parse the arguments
        ids_dict = literal_eval(ids_raw)
        ids = ids_dict["ids"]
    except KeyError:
        print("Could not retrieve ids.")
        ids = []
    return ids
Overall, the process of picking candidate datasets given our query is detailed in the diagram below:
3. Retrieving Candidate Datasets in Full
Now that we have obtained the IDs of the most relevant datasets, we can retrieve them in full:
def get_datasets_from_ids(
    ids: List[str],
    base_url: str = base_url,
    dataset_search_url: str = dataset_search_url
) -> List[List[Dict[str, Any]]]:
    """Returns the datasets from the given ids.
    :param ids: the ids of the datasets :class `List[str]`
    :param base_url: the base url of the API :class `str`
    :param dataset_search_url: the url to get the dataset :class `str`
    :return: the datasets :class `List[List[Dict[str, Any]]]`
    """
    datasets = []
    for id in ids:
        result = get_dataset_from_id(id, base_url, dataset_search_url)
        datasets.append(result)
    return datasets
The implementation of get_dataset_from_id() is simply a call to the action/datastore_search endpoint, passing in the ID as the query string. Do note that we also have to handle the pagination of results.
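A minimal sketch of get_dataset_from_id(), assuming the endpoint follows the usual CKAN datastore_search conventions (a resource_id query parameter plus offset-based pagination against a total count; base_url handling is omitted, and the repository’s actual implementation may differ):

def get_dataset_from_id(
    id: str,
    dataset_search_url: str = dataset_search_url,
) -> List[Dict[str, Any]]:
    """Fetches all records of a dataset, page by page."""
    records: List[Dict[str, Any]] = []
    offset = 0
    while True:
        res = requests.get(
            dataset_search_url,
            params={"resource_id": id, "offset": offset, "limit": 100},
        )
        result = res.json()["result"]
        page = result["records"]
        records.extend(page)
        offset += len(page)
        # stop once every available record has been fetched
        if not page or offset >= result["total"]:
            break
    return records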
4. Execute Code to Address Our Query
Now that we have the most relevant datasets given a query, how can we analyse these datasets to answer the query?
Enter the Langchain Pandas DataFrame Agent — an agent that wrangles Pandas DataFrames to achieve tasks. Using it is as simple as:
agent = create_pandas_dataframe_agent(llm, dfs)
agent.run(query)
🤯🤯🤯
Libraries like Langchain make it easy (almost too easy) to instantiate agents — but I thought it might be worthwhile to dive into what happens under the hood.
How Langchain Pandas DataFrame Agent works in 1 min
At its core, the Langchain Pandas DataFrame Agent is designed to interact with language models by sequencing tool actions. In this case, the Python REPL and the Pandas library are supplied as tools to this agent.
- During execution, the pandas library is imported and the language model decides what pandas code to run given the query and supplied DataFrame instances
- Code snippets generated by the language model are then run with the Python REPL
- This process repeats until the query can be sufficiently addressed, or up to the maximum number of iterations set by the user (see the toy sketch below)
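A toy, hand-rolled version of this act-observe cycle might look like the following (illustrative only; Langchain’s actual agent wraps this idea with prompting, output parsing, and error handling):

import contextlib
import io

import pandas as pd

df = pd.DataFrame({"year": [2020, 2021], "graduates": [100, 120]})


def run_snippet(code: str, env: dict) -> str:
    """Executes LLM-generated code and captures its printed output."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)  # the agent's Python REPL tool does roughly this
    return buf.getvalue()


env = {"df": df, "pd": pd}
# suppose the language model proposed this snippet for our query:
observation = run_snippet("print(df['graduates'].mean())", env)
print(observation)  # 110.0, fed back to the model as the next observation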
The Pandas DataFrame Agent can choose between two methods in how it interacts with language models:
- Using the ReAct Framework
- We can set this by passing agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION in create_pandas_dataframe_agent()
- The ReAct Framework uses language models to combine the generation of reasoning traces (“reasoning”) with task-specific actions (“acting”)
- Reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources to gather additional information
2. Using OpenAI Functions
- We can set this by passing agent_type=AgentType.OPENAI_FUNCTIONS in create_pandas_dataframe_agent()
- The key idea is the same as what we have done to retrieve the best datasets — by passing in function specifications, or tools to the agent, 0613 models can choose from the tools provided and output the next “actions” to take
In my implementation, I used the OpenAI Functions agent type because I found it to be more stable and performant than using the ReAct framework.
SG Data Analyst is Born… Well Not Really…
With that, we are ready to go! Presenting the SGDataAnalystAgent:
class SGDataAnalystAgent:
    @staticmethod
    def run(query: str) -> str:
        """Runs the agent on a query.
        :param query: the query :class `str`
        :return: the response :class `str`
        """
        csv_datasets: List[Dataset] = asyncio.run(aget_csv_datasets())
        dataset_docs = [dataset.to_document() for dataset in csv_datasets]
        # put them into a dictionary as well for easy access
        dataset_docs_dict = {dataset.id: dataset for dataset in csv_datasets}
        db = split_and_store(dataset_docs)
        fetched_docs = similarity_search(db, query)
        # get the most relevant ids from the fetched docs
        ids = generate_id_choices(fetched_docs, query)
        # identify which datasets were chosen
        chosen_dataset_names = [dataset_docs_dict[id].name for id in ids]
        # get the datasets from the ids
        chosen_datasets_full = get_datasets_from_ids(ids)
        chosen_dataset_dfs = _datasets_to_dataframes(chosen_datasets_full)
        llm = ChatOpenAI(temperature=0, model=GPT_MODEL)
        agent = create_pandas_dataframe_agent(
            llm,
            chosen_dataset_dfs,
            verbose=True,
            agent_type=AgentType.OPENAI_FUNCTIONS,
        )
        return agent.run(query)
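Running it end-to-end is then a one-liner (mirroring the demo query below):

response = SGDataAnalystAgent.run("What are some observations in university education?")
print(response)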
Admittedly, this implementation is not the most optimised and is pretty scrappy — but it sort of works for now… Stay tuned!
Demo: SG Data Analyst
For a simple chat UI, I drew inspiration from this Streamlit app — a huge shoutout to Dr. Chanin!
As a demo, the query I provided was:
“What are some observations in university education?”
The final output was:
“Some observations in university education include:
- There are 16 different fields of study represented in the data.
- The data includes both male and female students.
- The data covers a period of about 39 years, from 1982 to 2021.
- There are some courses with no graduates in certain years.
- The number of graduates from different universities varies from year to year.”
A pretty decent response for a prototype, and improvements can definitely be made, which I will discuss later on.
A more detailed execution trace can be found below, which shows the “thought” process of the Langchain Pandas DataFrame Agent.
Next Steps
The existing implementation is far from complete. From the demo, it is clear that although the response is decent, it is lacking in many areas. In terms of the response, areas of improvement include:
- Improving the speed of the SGDataAnalystAgent: there are a few tweaks that can be made to significantly improve the speed of the agent
- Adding more detail to the responses: based on the output, it is unclear how the agent arrived at the final answer. Displaying information such as which datasets were utilised and what actions were performed would greatly improve the user experience
Other areas of improvement discovered during testing include:
- The Pandas DataFrame Agent runs into max token errors; more work can be done to customise the Pandas DataFrame Agent to improve its performance
- The Pandas DataFrame Agent gets stuck in a loop during execution; this issue is mostly due to the usage of the GPT-3.5-0613 model, which is less performant than the GPT-4 series of models at code generation (more on that here)
Conclusion
Thanks for reading my first article! Overall, it was fun playing around with LLM application libraries such as Langchain and OpenAI API. As the space progresses, I’m excited to see how language models can be used for more advanced downstream applications. Stay tuned for more!
If you are interested to connect, feel free to say hi on LinkedIn, or visit my website.
Open Source: ivankqw/sg-data-analyst: LLMs as Data Analysts over Singapore Datasets 🤖 (github.com)
Here are all the resources from the article that you might be interested in reading:
LLM Powered Autonomous Agents by Lilian Weng
OpenAI’s update on function calling (0613 models)
OpenAI’s cookbook showing examples of how to call functions with chat models
Anthropic’s update on Claude’s 100k token window
A brief introduction to Retrieval Augmented Generation