Introducing KAI-GPTv4: The Next Generation Banking LLM

By Denys Katerenchuk, PhD, AI Engineer at Kasisto

In the rapidly evolving landscape of artificial intelligence, the financial industry demands more than just general intelligence; it requires precision, reliability, and deep domain expertise in financial operations. Today, Kasisto is proud to unveil KAI-GPTv4, our next-generation Large Language Model (LLM), engineered to revolutionize banking. KAI-GPTv4 is not just another LLM; it is a specialized banking intelligence engine, purpose-built to navigate the complexities and responsibilities of the banking world.

KAI-GPTv4: TLDR

Financial Domain Knowledge: KAI-GPTv4 boasts an enhanced comprehension of complex banking terminology and tasks, trained on an extensive, curated dataset of financial and proprietary Kasisto data.
Optimized for Financial RAG: Prioritizes the delivery of accurate and trustworthy responses in a Retrieval Augmented Generation (RAG) setting, utilizing up-to-date banking information.
Transparent & Trustworthy: Features consistent and accurate citations for improved clarity and source visibility. Reliably responds with “I don’t know” when information is not present in the context, drastically reducing the risk of providing speculative or incorrect answers.
Improved Output Structure: Delivers responses in well-structured formats, including markdown and lists, for enhanced readability and user comprehension.
Native Multimodal Support: KAI-GPTv4 now natively supports both text and image understanding, opening new avenues for analyzing diverse financial data.
Long Context Window: Capable of processing comprehensive financial documents with an industry-leading 128,000 token context window.
Cost-Efficient Performance: Highly optimized for speed, efficiency, and scalability, KAI-GPTv4 delivers superior performance while maintaining cost-effectiveness.

The Evolution of Financial AI: From KAI-GPT to KAI-GPTv4

Kasisto has been at the forefront of specialized financial AI for years. In 2023, recognizing the limitations of general language models in banking-centered tasks, we introduced KAI-GPT, the first LLM trained specifically for the banking domain. Its success spurred continuous innovation, leading to the releases of KAI-GPTv2 and KAI-GPTv3.

With each iteration, we pushed the boundaries of performance:

KAI-GPTv2 focused on increasing the amount of training data from public banking sources and custom datasets, significantly reducing hallucinations.
KAI-GPTv3 prioritized accuracy through carefully curated datasets, demonstrating performance comparable to much larger closed-source LLMs.

Today, we introduce KAI-GPTv4, our most significant update yet, offering even better performance and groundbreaking new capabilities.

Technical Highlights: The Engineering Behind Banking Intelligence

The performance of an LLM heavily depends on multiple factors, such as the underlying model and architecture, training algorithms, and, most importantly, the training data. For KAI-GPTv4, we made strategic choices on each of these factors to ensure unparalleled financial intelligence. After careful consideration and extensive evaluation, we selected Google’s Gemma 3 12B model as the foundation for KAI-GPTv4. The model has the following characteristics:

12B parameters.
Multimodal support (text + images).
128,000 token context window (to process large documents).
Pretrained on 12 trillion tokens and further instruction-tuned on 110 million tokens.

This choice of the underlying LLM was driven by its strong real-world performance and high accuracy, which is essential for working with banking documents.

To create a domain-specific banking LLM, we focused on building a high-quality banking dataset. After several experiments with various datasets and data sizes, we found that a smaller 110 million-token dataset outperforms larger, less refined banking datasets, consistent with the findings of the “Textbooks Are All You Need” paper. The base model is fine-tuned on our dataset using Parameter-Efficient Fine-Tuning (PEFT) techniques, specifically LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation). These methods train only a small set of additional parameters, which lets us fine-tune large models efficiently on a private GPU cluster without sacrificing accuracy. The total training time was 360 hours on our private cluster of 8 RTX 6000 Ada GPUs.
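To make the efficiency argument concrete, here is a minimal pure-Python sketch of the low-rank update that LoRA adds to a frozen weight matrix. The layer size, rank, and scaling value are illustrative assumptions, not the settings used for KAI-GPTv4, and DoRA's extra magnitude component is left out.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, lora


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha: float):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical 4096 x 4096 projection with rank 16 (assumed values).
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

With these assumed dimensions, the adapter trains under 1% of the parameters of the matrix it modifies, which is what makes fine-tuning on a small GPU cluster practical.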

Generative AI is a data-driven field, and high-quality data is crucial for accurate performance. We reworked the data used to train KAI-GPTv4, refining our in-house datasets through rigorous cleaning, an “LLM-as-a-Judge” approach that removes low-scoring samples, and synthetic data enhancement, while keeping the training sources transparent. The resulting corpus ensures that KAI-GPTv4 is knowledgeable about the banking domain. All our datasets come from open domains and draw on the following high-quality sources:

Dataset	Description
Open Financial Literature	High-quality publicly available financial reports, statements, and announcements that use clear and formal banking language were used to generate synthetic RAG question-and-answer datasets.
Banking Web Crawl Data	A set of 23,000 instructions in question/answer pairs based on banking content from the Common Crawl corpus, filtered for high-quality, English-language text, and enhanced through data distillation.
Banking KAI Conversations	A set of 18,000 instructions with Q&A was synthesized to resemble Kasisto’s rich conversation archive, while preserving anonymization. Crucially, many of these instructions were designed to lack answers in the provided context, helping to train the LLM to recognize knowledge boundaries and prevent hallucination.
Open-Source RAG-v1 Dataset	This dataset provided additional citation-rich examples, further enhancing the model’s ability to provide accurate and sourced responses.
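The “LLM-as-a-Judge” filtering step described above can be sketched roughly as follows. The sample schema, the score scale, the threshold, and the `toy_judge` stand-in are all assumptions for illustration; in a real pipeline the judge would be a call to an LLM with a scoring rubric.

```python
from typing import Callable, Iterable, List


def filter_by_judge(
    samples: Iterable[dict],
    judge: Callable[[dict], float],
    min_score: float = 7.0,  # assumed threshold on an assumed 0-10 scale
) -> List[dict]:
    """Keep only the samples whose judge score reaches min_score."""
    return [s for s in samples if judge(s) >= min_score]


def toy_judge(sample: dict) -> float:
    """Stand-in for an LLM judge: scores how much of the answer appears in the context."""
    answer_words = sample["answer"].lower().split()
    if not answer_words:
        return 0.0
    context = sample["context"].lower()
    hits = sum(1 for w in answer_words if w in context)
    return 10.0 * hits / len(answer_words)


samples = [
    {"question": "What does the savings account pay?",
     "context": "The savings account pays 2% interest annually.",
     "answer": "2% interest annually"},
    {"question": "What does the savings account pay?",
     "context": "The savings account pays 2% interest annually.",
     "answer": "Free pizza forever"},
]
kept = filter_by_judge(samples, toy_judge)
print(len(kept))  # 1
```

Passing the judge in as a function keeps the filter independent of whichever model or rubric does the scoring.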

The Hallucination Challenge: Why “I Don’t Know” is Critical in Banking

Despite immense progress in generative AI, hallucinations remain the biggest obstacle to wider LLM adoption and a recurring source of negative PR (e.g., “Glue pizza and eat rocks: Google AI search errors go viral”). In accuracy-critical domains, such as banking, these errors are unacceptable. Financial information, such as interest rates, requirements, or regulatory guidelines, can change rapidly, so a reliable response can draw only on the information supplied in the context. Providing speculative or incorrect answers can have severe consequences.

Model	Example Response to “What are your hours?” (Missing Info)
KAI-GPT	“Sorry, I can’t answer that.”
GPT-4	“Sorry, I can’t answer that.”
Microsoft Copilot	“I’m happy to help with store hours. The Redmond hours are: Mon-Fri: 9am to 6pm, Sat: 10am to 4pm, Sun: Closed. The Seattle hours are: Mon-Fri: 9am to 6pm, Sat-Sun: 10am to 4pm”

Table: User Study Example (Expected Answer: “Missing information.”)

Kasisto’s research team has consistently prioritized training models to indicate clearly when information is unavailable, responding with “I don’t know” rather than hallucinating. A user study comparing KAI-GPT to GPT-4 and Microsoft Copilot on a custom dataset in which 50% of the questions have no answer in the provided information (see the user study table above) highlights this crucial difference.

This study clearly shows that the KAI-GPT model understands the boundaries of available information and responds with “Sorry, I can’t answer that.” In contrast, other models were more prone to generate hallucinated responses that are not grounded in the provided context. This “accuracy-first” approach is paramount in the banking industry, where precision and reliability directly impact trust and compliance.
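One simple way to produce training examples that lack an answer in the context is to pair each question with a context borrowed from a different example and set the target to a refusal. The sketch below illustrates that idea only; it is not Kasisto's actual data pipeline, the field names are hypothetical, and a real pipeline would still need to verify that the borrowed context does not answer the question.

```python
REFUSAL = "Sorry, I can't answer that."


def with_unanswerable(examples, refusal=REFUSAL):
    """Return the original examples plus one mismatched-context copy of each."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to swap contexts")
    out = []
    n = len(examples)
    for i, ex in enumerate(examples):
        out.append({**ex, "answerable": True})
        other = examples[(i + 1) % n]  # context taken from the next example
        out.append({
            "question": ex["question"],
            "context": other["context"],
            "answer": refusal,
            "answerable": False,
        })
    return out


examples = [
    {"question": "What is the wire transfer fee?",
     "context": "Outgoing wire transfers cost $25.",
     "answer": "$25"},
    {"question": "When does the branch open?",
     "context": "The branch opens at 9am on weekdays.",
     "answer": "9am on weekdays"},
]
augmented = with_unanswerable(examples)
print(len(augmented))  # 4
```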

Performance Benchmarks: KAI-GPTv4 Delivers Unrivaled Speed and Accuracy

The financial industry demands a higher standard of accuracy and reliability from AI. KAI-GPTv4 represents our commitment to delivering just that. Our rigorous evaluations demonstrate KAI-GPTv4’s superior performance, particularly in real-world RAG applications critical for banking. While numerous open evaluation datasets exist, we find that benchmark performance does not transfer to real-world applications, as noted in this work. To address this, we developed a specialized dataset comprising 258 data points, each mirroring a typical RAG workflow with questions, contexts, and answers. A critical aspect of this dataset is the inclusion of irrelevant contexts, which enables us to accurately assess the model’s performance in real-world conditions where not all retrieved information is relevant. The results reported below represent a single run in a blind evaluation without any optimization to avoid overfitting.
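A single data point in such an evaluation set might be represented as below. This is a sketch under assumptions: the class and field names are hypothetical, and using `None` as the reference answer to mark irrelevant-context cases is one possible convention, not a description of the actual dataset format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RagEvalItem:
    question: str
    contexts: Tuple[str, ...]          # retrieved passages shown to the model
    reference_answer: Optional[str]    # None: contexts are irrelevant, abstention expected

    @property
    def answerable(self) -> bool:
        return self.reference_answer is not None


def unanswerable_share(items) -> float:
    """Fraction of items where the model is expected to abstain."""
    items = list(items)
    if not items:
        return 0.0
    return sum(not it.answerable for it in items) / len(items)


items = [
    RagEvalItem("What is the overdraft fee?",
                ("The overdraft fee is $35 per item.",), "$35 per item"),
    RagEvalItem("What is the overdraft fee?",
                ("Our mobile app supports check deposit.",), None),
]
print(unanswerable_share(items))  # 0.5
```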

Inference Speed and Cost-Efficiency

KAI-GPTv4 is highly optimized for speed and scale while preserving the high integrity of responses. Despite being almost twice the size of its predecessors, KAI-GPTv4 can be hosted on the same GPU, offering significant cost benefits compared to larger, closed-source models.

Here’s a comparison of time to first token and total time to completion in a RAG setting, evaluated on a single RTX 6000 Ada GPU:

Table: Time to First Token is how long it takes to receive the first token from an LLM (lower is better). Time to Completion is the total time to complete the response (lower is better).

The results show a considerable improvement and cost benefit of running a private, task-specific LLM compared to closed-source models. KAI-GPTv4 provides a nearly 3x improvement in mean time to completion, making it the most cost-effective solution for private, task-specific LLM deployments.
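Both latency figures can be measured from any streaming response by timing the arrival of the first token and of the last one. The sketch below assumes the model output is exposed as a Python iterator of tokens; `fake_stream` is a stand-in with made-up delays, not a real model client.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Return time to first token and time to completion, in seconds."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token has arrived
        count += 1
    ttc = time.perf_counter() - start
    return {"ttft_s": ttft, "ttc_s": ttc, "tokens": count}


def fake_stream(n: int = 5, first_delay: float = 0.05,
                per_token: float = 0.01) -> Iterator[str]:
    """Simulated token stream: a prefill delay, then evenly spaced tokens."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(per_token)
        yield f"tok{i}"


result = measure_stream(fake_stream())
print(result["tokens"], result["ttft_s"] <= result["ttc_s"])  # 5 True
```

Averaging these two values over many requests gives the mean figures that a comparison like the one above relies on.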

Evaluation Measures: Precision in Financial QA

Our core evaluation dataset is a custom question-answering dataset based on real-world banking data and tasks, reflecting real-world use cases, including irrelevant context where the model is expected to provide an “I don’t know” response.

1. General Response Quality (ROUGE & JudgeLM)

While GPT-4.1 has a slight edge in ROUGE score, KAI-GPTv4 demonstrates a higher overall quality of responses, as highlighted by its superior JudgeLM score, which closely aligns with human preference.
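For readers unfamiliar with ROUGE, the sketch below computes one common variant, ROUGE-L, from the longest common subsequence of words. The text does not say which ROUGE variant was used, so the choice of ROUGE-L and the simple whitespace tokenization are assumptions. It also shows why an overlap metric alone can mislead: an answer with the wrong rate still scores high.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over lowercase, whitespace-split tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Four of five words match in order, yet the stated rate is wrong.
score = rouge_l_f1("the rate is 2 percent", "the rate is 3 percent")
print(round(score, 3))  # 0.8
```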

2. Question-Answering Accuracy (Precision/Recall/F1)

Model	Question Answering F1	Question Answering True Positive (higher is better)	Question Answering True Negative (higher is better)	Question Answering False Negative (lower is better)	Question Answering False Positive (lower is better)
KAI_GPTv4	0.943	90.380	4.946	8.372	1.651
openai_gpt4o	0.840	74.891	5.496	23.860	1.101
openai_gpt4.1	0.890	82.271	4.946	16.481	1.651

KAI-GPTv4 achieves the best F1 score, reflecting the strongest balance of precision and recall among the three models. Notably, KAI-GPTv4 has the highest true positive rate and the lowest false negative rate, meaning it answers more often when the context contains the answer and is less likely to respond with “I don’t know” when it does have the information. This high recall is especially valuable in financial QA, where failing to answer a valid question can be costly.
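The four outcome categories can be read as a confusion matrix in which answering is the positive class and abstaining is the negative class. The sketch below follows that reading; the exact scoring procedure is not specified in the text, so the labeling rule is an assumption (it ignores whether a given answer is actually correct), and the counts are made up for illustration.

```python
from collections import Counter


def outcome(answerable: bool, abstained: bool) -> str:
    """Label one response, treating 'answered' as positive (assumed convention)."""
    if answerable:
        return "FN" if abstained else "TP"
    return "TN" if abstained else "FP"


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical results as (answerable, abstained) pairs.
records = ([(True, False)] * 8 + [(True, True)]
           + [(False, False)] + [(False, True)] * 2)
counts = Counter(outcome(a, b) for a, b in records)
f1 = f1_from_counts(counts["TP"], counts["FP"], counts["FN"])
print(dict(counts), round(f1, 3))
```

True negatives do not enter F1, which is why they are reported as a separate column alongside it.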

3. RAGAS Metrics

The RAGAS framework evaluates answers without ground-truth, focusing on answer relevancy, faithfulness, context recall, and context precision.

Model	Answer Relevancy	Context Recall	Faithfulness	Context Precision
KAI_GPTv4	0.810	0.656	0.742	0.628
openai_gpt4o	0.648	0.654	0.669	0.627
openai_gpt4.1	0.719	0.656	0.737	0.626

KAI-GPTv4 matches or exceeds both GPT models on every metric. Its answer relevancy (0.810) is significantly higher, meaning KAI’s answers are judged to be most on-point. Its faithfulness (0.742) is the highest of the three, narrowly ahead of GPT-4.1 (0.737), indicating that its answers are the most consistently grounded in the context, which is essential in finance. Context recall and context precision are nearly identical across models. Overall, KAI-GPTv4’s outputs are both highly relevant to the query and well supported by the context.
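As a rough illustration of what the faithfulness score measures: the answer is broken into individual claims, and the score is the fraction of those claims that the retrieved context supports. In RAGAS both steps are performed by an LLM; in the sketch below the claims are supplied by hand and `toy_supported` is a crude substring check standing in for that verification step.

```python
from typing import Callable, Sequence


def faithfulness(claims: Sequence[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of answer claims that the context supports."""
    if not claims:
        return 0.0
    supported = sum(1 for c in claims if is_supported(c, context))
    return supported / len(claims)


def toy_supported(claim: str, context: str) -> bool:
    """Stand-in verifier: a claim counts as supported if it appears verbatim."""
    return claim.lower() in context.lower()


context = "The branch opens at 9am. Wire transfers cost $25."
claims = [
    "the branch opens at 9am",
    "wire transfers cost $25",
    "wire transfers are free",  # not grounded in the context
]
print(round(faithfulness(claims, context, toy_supported), 3))  # 0.667
```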

4. Answer Citation Quality

With the introduction of citations for added transparency, the quality of citations in KAI-GPTv4 is paramount. We created four citation evaluation measures, assessed by an LLM-as-a-Judge.

Model	Citation Quality	Citation Faithfulness	Citation Source Relevance	Citation Overall
KAI_GPTv4	73.372	79.496	80.039	74.031
GPT-4o	63.566	68.411	69.419	63.450
GPT-4.1	70.077	75.349	75.116	70.232

Across all four citation metrics, KAI-GPTv4 leads. In the LLM-as-a-Judge citation evaluations, KAI’s answers are rated highest for quality, faithfulness, source relevance, and overall citation quality. KAI-GPTv4 outperforms GPT-4.1 by roughly 3–5 points and GPT-4o by roughly 10–11 points in every category, demonstrating its ability to select highly relevant documents and integrate them effectively, producing coherent and accurate answers grounded in the retrieved context.
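The per-category gaps can be recomputed directly from the citation table above. The short sketch below copies the table values and subtracts each comparison model's score from KAI-GPTv4's; only the helper name `margins` is introduced here.

```python
METRICS = ["Citation Quality", "Citation Faithfulness",
           "Citation Source Relevance", "Citation Overall"]

# Values copied from the citation table.
SCORES = {
    "KAI_GPTv4": [73.372, 79.496, 80.039, 74.031],
    "GPT-4o": [63.566, 68.411, 69.419, 63.450],
    "GPT-4.1": [70.077, 75.349, 75.116, 70.232],
}


def margins(scores: dict, lead: str = "KAI_GPTv4") -> dict:
    """Point difference between the lead model and every other model, per metric."""
    return {
        model: [round(a - b, 3) for a, b in zip(scores[lead], vals)]
        for model, vals in scores.items()
        if model != lead
    }


for model, diffs in margins(SCORES).items():
    for name, diff in zip(METRICS, diffs):
        print(f"{model:8s} {name:26s} +{diff:.3f}")
```

On these figures the gaps run from about 3.3 points (Citation Quality, against GPT-4.1) to about 11.1 points (Citation Faithfulness, against GPT-4o).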

Overall, KAI-GPTv4 demonstrates superior end-to-end RAG performance for financial QA tasks. Its consistently higher citation and factuality metrics demonstrate it is well-suited to power trustworthy, context-grounded answers in the banking and financial domain.

Try KAI-GPTv4 Today

With the release of KAI-GPTv4, we introduce the next level of efficiency and accuracy. We invite financial institutions to build enterprise-grade AI solutions that protect their sensitive financial data and support compliance with global financial regulations, including GDPR, CCPA, and industry-specific mandates. KAI-GPTv4 is available for evaluation and testing on our KAI platform. You can request a demo on our website: https://kasisto.com/request-a-demo/

The Future of AI at Kasisto

KAI-GPTv4 is just the beginning of our continuous research and development to redefine financial intelligence. Our roadmap includes advanced predictive analytics modules, seamless integration with leading financial platforms and data providers, and a dedicated developer program to co-create bespoke financial AI solutions.

Looking ahead, we are leveraging our financial domain knowledge and expertise in building advanced AI systems to introduce KAIgentic, our cutting-edge AI Agent platform. KAIgentic is fully customized to deliver high-quality, compliant, agentic solutions to your business. It empowers financial institutions to move beyond conversations to intelligent action and predictive engagement. Purpose-built for banking and trusted by leaders, the KAIgentic platform provides compliance and security, and is proven to drive revenue growth and cost reduction. Join us in shaping the future of banking with KAI-GPTv4 and KAIgentic. Stay tuned for future announcements on our website: https://kasisto.com/