KAI-GPT: The First Large Language Model Purpose-Built for Banking

By Keelan Evanini, Senior Vice President of Engineering and AI at Kasisto

We have all read a lot about the emergence of Large Language Models (LLMs) such as OpenAI’s GPT series of models and Google’s BARD. These models trained on massive corpora have demonstrated an amazing ability to generate fluent and appropriate conversational responses on a wide variety of topics.

However, LLMs of this type come with many drawbacks, such as cost, inefficiency, lack of data provenance control, privacy, and other factors. Recent findings also suggest that smaller models can deliver competitive performance, especially when the LLM only needs to handle tasks in a restricted, targeted domain.

Kasisto was guided by these considerations when we began investigating how generative AI can be used to improve conversational AI for the banking domain, and, as a result, we decided to create our own LLM.

The table provides a comparison between large industry-agnostic LLMs and Kasisto’s smaller banking-specific LLM, KAI-GPT.

Consideration	Large, Commercially Available Industry-Agnostic LLMs	KAI-GPT
Cost	Training a very large LLM from scratch requires: – millions of dollars of cloud computing expenses – associated staffing costs	Starting with a smaller LLM and fine-tuning it for our specific needs is much more cost-effective.
Efficiency	Hundreds of billions of parameters: – require vast system memory – generate text predictions slowly unless large and expensive machines host models in production	Fewer parameters leads to: – faster response times – lower hosting costs
Data Provenance	Large corporations that train LLMs typically do not disclose all of the data sources used to train the model.	The KAI-GPT model: – Explicitly lists all data sources used for training – Has been fine-tuned with data we compiled – Ensures that we know the provenance of all data included to avoid potential copyright issues
Frequency of Model Updates	Due to high costs and long training time frames, content is often months (or years) out-of-date.	Using a fine-tuned model for a specific domain instead of a generic, larger model means that the model can be updated frequently as the underlying content changes.
Hallucination	Because they are designed to handle conversations about any arbitrary topic, these LLMs have been shown to hallucinate, or generate factually incorrect responses.	Using an LLM fine-tuned for a particular domain helps mitigate hallucinations. The LLM can be more easily trained to respond to questions within its domain of expertise.

Consideration:	Cost
Large, Commercially Available Industry-Agnostic LLMs	Training a very large LLM from scratch requires: millions of dollars of cloud computing expenses associated staffing costs
KAI-GPT	Hundreds of billions of parameters:require vast system memory generate text predictions slowly unless large and expensive machines host models in production
Consideration:	Efficiency
Large, Commercially Available Industry-Agnostic LLMs	Hundreds of billions of parameters:require vast system memory generate text predictions slowly unless large and expensive machines host models in production
KAI-GPT	Fewer parameters leads to:faster response times lower hosting costs
Consideration:	Data Provenance
Large, Commercially Available Industry-Agnostic LLMs	Large corporations that train LLMs typically do not disclose all of the data sources used to train the model.
KAI-GPT	The KAI-GPT model:Explicitly lists all data sources used for training Has been fine-tuned with data we compiled Ensures that we know the provenance of all data included to avoid potential copyright issues
Consideration:	Frequency of Model Updates
Large, Commercially Available Industry-Agnostic LLMs	Due to high costs and long training time frames, content is often months (or years) out-of-date.
KAI-GPT	Using a fine-tuned model for a specific domain instead of a generic, larger model means that the model can be updated frequently as the underlying content changes.
Consideration:	Hallucination
Large, Commercially Available Industry-Agnostic LLMs	Because they are designed to handle conversations about any arbitrary topic, these LLMs have been shown to hallucinate, or generate factually incorrect responses.
KAI-GPT	Using an LLM fine-tuned for a particular domain helps mitigate hallucinations. The LLM can be more easily trained to respond to questions within its domain of expertise.

In this blog post, we explain how KAI-GPT is trained to handle conversational AI tasks, specifically in the banking domain. While KAI-GPT is smaller than the largest LLMs currently available in the market, it has all of the benefits listed above.

More importantly, KAI-GPT performs competitively for a banking-related question-answering task when compared to massive LLMs that are available commercially, while also meeting the requirements for accuracy, transparency, trust, and customization.

In the sections below, we describe how we trained and evaluated KAI-GPT.

Training KAI-GPT

Our goal was to train an LLM specifically tailored to the banking domain. We wanted the LLM to be large enough to produce fluent, coherent, and accurate conversational responses but small enough to be hosted and operated efficiently and cost-effectively.

In addition, we want to rapidly test out different models trained using different data sets targeted for particular domains and applications. Based on these considerations, we decided to start with a smaller LLM as the base model and then fine-tune it using additional data that we compiled, including our own conversational data, collected over the past decade of deploying KAI at some of the largest banks and small community financial institutions (FIs) around the world.

Many smaller LLMs that perform competitively with massive, proprietary LLMs on specific tasks have recently become available. We selected the Pythia-Chat-Base-7B model as our base model because of its promising performance in initial evaluations and because it has less demanding hardware requirements compared to other models.

Pythia-Chat-Base-7B Details:

Based on an auto-regressive, transformer-based LLM with approximately 7B parameters.
Trained by EleutherAI on a data set consisting of 207B tokens and 211M documents drawn from the web, book repositories, and several other publicly available text corpora.
Further technical details about the model’s architecture and training data are available in this paper.

This LLM was then subsequently fine-tuned using instructions drawn from dialog interactions in order to make it more appropriate for conversational AI applications. The OIG dataset from which these instructions were drawn consists of 30 separate corpora containing a total of over 43M instructions across a range of tasks, including question answering, classification, extraction, and summarization. (Further details about the OIG dataset are available here.)

We further fine-tuned this base Pythia-Chat-Base-7B model to make it more knowledgeable about the banking domain using the following additional data sets:

A set of 24K instructions in the form of question/answer pairs about banking content based on web pages from the Common Crawl corpus. The specific pages were sampled by first extracting Common Crawl pages that contain banking-related content and then filtering for English-language pages with high-quality text.
A set of 18K additional instructions with questions sampled from Kasisto’s rich archive of millions of KAI conversations. The answers were again extracted from web pages in the Common Crawl corpus. (In this case, many of the answers could not be found in Common Crawl documents, which was the intended outcome, since these instructions without an answer can help prevent the LLM from hallucinating.)
A data set containing 245M words from 44K documents crawled from a variety of web sources that have banking-related content.
A set of 15K instructions from the databricks-dolly-15k data set. While this data set isn’t specifically related to banking, it was also included in the fine-tuning set in order to improve the model’s ability to answer questions.

These data sets were used to fine-tune the baseline Pythia-Chat-Base-7B model to produce the KAI-GPT model. We conducted fine-tuning for 5K steps, which took approximately 120 hours on a dual A6000 GPU machine.

Evaluating KAI-GPT

Since the initial targeted application for the KAI-GPT LLM is a retrieval-based question-answering system for the banking domain, we compiled two evaluation data sets that can be used to quantitatively measure an LLM’s performance at generating appropriate answers in this context.

Evaluation Set #1

We leveraged the expert knowledge that our internal team has about the banking domain and asked them to imagine they are customers of a large North American bank and to formulate realistic questions that they might have about the bank’s credit card products based on their knowledge of typical scenarios (e.g., opening a new credit card account, looking for particular types of rewards, traveling to foreign countries, etc.).
After generating these questions, the team members conducted Internet searches and scoured the bank’s website to find pages containing the answers to the questions. Then, they extracted the specific span of text on the web page that answers the question. These extracted answers, coupled with the pages they were taken from and the initial questions, are the data for the first evaluation set.

Evaluation Set #2

We used external crowdsourcing to obtain a larger evaluation set, starting with questions about credit cards and other personal banking products from the same North American bank that had been posted to a Reddit forum about that bank.

We then instructed the crowdsourced workers to find answers to these questions on the bank’s website using the same procedure that was followed for the first evaluation set and compiled a set of questions, web pages that contain the answers, and spans of text containing the answers on each page.

To evaluate KAI-GPT using these data sets, we used it as the LLM in the KAI Answers application which is designed to answer questions about a particular domain using a pre-defined document repository.

For this application, the questions and documents are converted to embeddings, and a semantic search is performed to select the document that is the closest match to the question. Then, the question and document are passed to the LLM in a zero-shot learning framework with a prompt that instructs the LLM to answer the question based on the document.

We can then compare the answers generated by the LLM to the gold-standard answers that were extracted manually. The following table presents the results of this comparison for both evaluation sets using the standard ROUGE-1 Score for text-to-text similarity applied to the output of three different LLMS: the baseline Pythia-Chat-Base-7B model, KAI-GPT, and a model from OpenAI (specifically, the text-davinci-003 GPT-3 model).

Data Set	Model	ROUGE-1 Score
Evaluation Set #1	Pythia-Chat-Base-7B	0.141
	OpenAI GPT-3	0.227
	KAI-GPT	0.215
Evaluation Set #2	Pythia-Chat-Base-7B	0.122
	OpenAI GPT-3	0.191
	KAI-GPT	0.212

As the table shows, KAI-GPT substantially outperforms the baseline Pythia-Chat-Base-7B model for the document-based question-answering task on both evaluation sets.

This indicates that the addition of the banking-related content to the model during the fine-tuning process enabled it to provide more accurate answers to questions from the personal banking domain.

The table also shows that the KAI-GPT results are on par with results using the OpenAI GPT-3 model, even though it is substantially smaller in comparison (7B parameters for KAI-GPT vs. 175B parameters for GPT-3).

Next Steps

Now that we have trained KAI-GPT and seen how it provides a performance boost for the KAI Answers application compared to a base, industry-agnostic LLM, we are excited to see how it can be used next!

The Kasisto team is currently conducting R&D that leverages the millions of conversations banking customers have had with KAI in order to further strengthen KAI-GPT so that it can assist with personal banking conversational tasks that go beyond document-based question answering, such as providing information about transactions and balances.

In addition, we are exploring ways to use KAI-GPT as a conversational orchestrator that knows how to call on external tools, such as APIs and knowledge bases, when needed, in order to seamlessly answer a wide variety of questions.

KAI-GPT now powers our latest product, KAI Answers, and you can expect to see it soon enabling generative AI enhancements in other products. Stay tuned for further updates!