Prompting the Market? A Large-Scale Meta-Analysis of
GenAI in Finance NLP (2022–2025)
Paolo Pedinotti∗, Peter Baumann, Nathan Jessurun
Leslie Barrett, Enrico Santus∗
Bloomberg {ppedinotti, pbaumann25, njessurun, lbarrett4, esantus}@bloomberg.net
Abstract

Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling new tasks and driving a proliferation of datasets and diversification of data sources. Yet, this transformation has outpaced traditional surveys. In this paper, we present MetaGraph, a generalizable methodology for extracting knowledge graphs from scientific literature and analyzing them to obtain a structured, queryable view of research trends. We define an ontology for financial NLP research and apply an LLM-based extraction pipeline to 681 papers (2022–2025), enabling large-scale, data-driven analysis. MetaGraph reveals three key phases: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. This structured view offers both practitioners and researchers a clear understanding of how financial NLP has evolved—highlighting emerging trends, shifting priorities, and methodological shifts—while also demonstrating a reusable approach for mapping scientific progress in other domains.
1 Introduction
The release of ChatGPT in late 2022 triggered a structural shift in NLP, rapidly accelerating adoption in high-stakes domains such as finance. LLMs expanded what was possible, shifting the field from supervised learning to zero-shot reasoning, and from structured tasks to flexible document parsing. Financial NLP is undergoing rapid reinvention, after a long period of dominance by sentiment analysis and structured extraction (e.g., NER).
∗Equal contribution.

Figure 1: Example of paper subgraph.

This fast-moving evolution has outpaced traditional surveys, which lack a quantitative grasp of generative AI’s impact on Financial NLP. We introduce MetaGraph, a methodology for automated Knowledge Graph (KG) construction from research papers using LLMs, to address this gap. MetaGraph involves the manual definition of an ontology of information that is relevant to tracking research evolution (such as paper metadata, motivations and limitations, tasks approached, techniques, models, and datasets), and the use of LLMs in a human-in-the-loop approach to extract this information and gather it into a unified, queryable KG, enabling large-scale, data-driven insights about the evolution of the field.
We applied MetaGraph to 681 Financial NLP papers (2022–2025), uncovering a landscape that has undergone substantial transformation. LLMs have forcefully entered the domain, reshaping both the task and data landscape. In these three years, we noticed the rise of financial QA and an explosion of datasets powered by synthetic generation, followed by a growing awareness of LLMs’ limitations. The growth in model size has slowed down and has been accompanied by both more sophisticated architectures and the integration of LLMs with peripheral technologies (such as retrieval) to build integrated systems.
Our contributions are three-fold: i) methodology, as we provide a generalizable pipeline for extraction of quantitative insights from scientific literature. This pipeline combines ontology definition, human-in-the-loop graph extraction, and taxonomy construction, and it is designed to be reused in other domains (e.g., biology, legal, etc.); ii) field synthesis, as we map the shift from model adoption in pre-existing structured pipelines to the redesign of modular, generative systems; iii) open resources, as we made the graph available.1

Table 1: Comparison with Financial NLP surveys.

Survey GenAI Quantit. Taxon. KG
Xing et al. (2018) ✗ ✗ ✗ ✗
Li et al. (2023) ✓ ✗ ✓ ✗
Nie et al. (2024) ✓ ✗ ✓ ✗
Du et al. (2025) ✓ ✗ ✓ ✗
Ours (2025) ✓ ✓ ✓ ✓
2 Related Work
Financial NLP Surveys. While earlier surveys such as Xing et al. (2018) covered traditional tasks such as classification, more recent surveys (Li et al., 2023b; Nie et al., 2024; Du et al., 2025) have focused on the transformative impact LLMs have had on financial applications. Yet, these works rely on a traditional narrative review methodology, qualitatively summarizing the applications in the literature.
Our approach diverges fundamentally by employing a bibliometric and holistic analysis of the field. We uncover structural shifts and data-driven trends by quantitatively mapping the research landscape into a knowledge graph, offering a comprehensive view of the impact of LLMs on its evolution.
LLM-Assisted Knowledge Graph Construction. Carta et al. (2023) construct domain-specific knowledge graphs through stepwise prompting strategies, while Funk et al. (2023) and Babaei Giglou et al. (2023) use LLMs to learn hierarchical relations among concepts. We take inspiration from GraphRAG (Edge et al., 2025): we extract a knowledge graph and enrich it with information at different levels of granularity.
1 58 of the 681 papers used in our analysis are not included in the published dataset. These papers were posted to arXiv under a CC-BY-NC-ND 4.0 license, which prevents reusers from distributing any derivative, adapted form of the original material. As such, we will not be distributing them as part of the MetaGraph knowledge graph.
3 Methodology
We introduce MetaGraph, a methodology for automatically constructing knowledge graphs to extract quantitative insights from large scientific corpora. Graphs support structured and scalable representations of complex information, enabling analysis beyond frequency queries and uncovering relational patterns otherwise dispersed across text. MetaGraph is designed to be generalizable and reusable, although in this work we apply it specifically to NLP in Finance.
3.1 Method and Implementation

Ontology definition. This stage involves finding the graph structure that allows us to attain the most actionable insights. The expressiveness of this ontology directly shapes the analytical power of the resulting graph. In our case, we focused on NLP entities, attributes, and relationships. Whenever possible, we constrain attributes by either controlled vocabularies or existing frameworks. The graph structure, with the entity types and their relationships, is described in Figure 13.
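As a concrete illustration, the sketch below shows one way such an ontology could be encoded as typed nodes and relations with a controlled vocabulary on an attribute. The entity and relation names (Paper, Task, approaches_task, etc.) and the vocabulary values are illustrative assumptions, not the actual ontology of Figure 13.

from dataclasses import dataclass, field

@dataclass
class Entity:
    id: str
    type: str                       # e.g., "Paper", "Task", "Dataset", "Model"
    attributes: dict = field(default_factory=dict)

@dataclass
class Relation:
    source: str                     # id of the source entity
    target: str                     # id of the target entity
    type: str                       # e.g., "approaches_task", "uses_dataset"

# A controlled vocabulary constraining one attribute, as described above.
TASK_CATEGORIES = {"question_answering", "sentiment_analysis",
                   "information_extraction", "stock_prediction"}

def validate(entity: Entity) -> bool:
    """Reject entities whose constrained attributes fall outside the vocabulary."""
    if entity.type == "Task":
        return entity.attributes.get("category") in TASK_CATEGORIES
    return True

if __name__ == "__main__":
    paper = Entity("p1", "Paper", {"title": "Example paper", "year": 2024})
    task = Entity("t1", "Task", {"category": "question_answering"})
    edge = Relation(paper.id, task.id, "approaches_task")
    print(validate(task), edge.type)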
Corpus acquisition. We curated a corpus of 681 financial NLP papers (2022–2025) from the ACL Anthology and arXiv (cs.CL, cs.AI, q-fin) using keyword filters.2,3 The geographical distribution of the corpus is shown in Figure 12. To support trend analysis, we grouped papers into time periods of varying granularity (see partition details in Section F), ensuring each contained an equal number of papers. We also mapped each paper to its Semantic Scholar ID to capture citation relationships between papers.

LLM-based extraction. We used Gemini 2.5 Flash4 for its trade-off between cost and performance.5 We adopted a human-in-the-loop process with sample audits, error analysis, and prompt refinement. Key design choices include: i) crafting separate prompts for each information type (e.g., motivations, limitations, tasks) to improve extraction precision; ii) allowing abstention and fallback options to minimize hallucinations; iii) using CoT prompting, where models are required to explain their answers. The full set of prompts can be seen in Section C. For a subset of entity types (Models, Tasks, and Datasets), we manually compared the extracted ones with a gold set on 12 papers. We observed almost perfect performance, with the exception of two minor tasks from one paper, and one model from another.

2 The set of keywords is financial, fintech, fraud, stock, portfolio, finance.
3 Documents were processed via Mistral OCR: https://mistral.ai/news/mistral-ocr
4 https://deepmind.google/models/gemini/flash/
5 https://lmarena.ai/leaderboard
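To make the extraction step concrete, here is a minimal sketch of a per-type prompt with abstention and chain-of-thought, under the assumption that the model ends its output with a JSON answer line. The prompt wording, the call_llm placeholder, and the output convention are illustrative; they are not the actual prompts of Section C or the Gemini client used in our pipeline.

import json

def call_llm(prompt: str) -> str:
    # Placeholder for the underlying model call (Gemini 2.5 Flash in our setup);
    # plug in a real client here.
    raise NotImplementedError("connect an LLM client")

# One prompt template per information type; the wording is a hypothetical example.
PROMPT_TEMPLATES = {
    "tasks": (
        "Read the paper text below and list the NLP tasks it approaches.\n"
        "Think step by step, then end with a JSON list of task names on the last line.\n"
        "If no task is clearly stated, end with the single word ABSTAIN.\n\n"
        "Paper:\n{paper_text}"
    ),
    "limitations": (
        "Read the paper text below and list the limitations the authors report.\n"
        "Think step by step, then end with a JSON list of short phrases on the last line.\n"
        "If none are reported, end with the single word ABSTAIN.\n\n"
        "Paper:\n{paper_text}"
    ),
}

def extract(paper_text: str, info_type: str) -> list[str]:
    """Run the prompt for one information type; abstention maps to an empty list."""
    prompt = PROMPT_TEMPLATES[info_type].format(paper_text=paper_text)
    raw = call_llm(prompt)
    lines = raw.strip().splitlines()
    answer = lines[-1] if lines else ""      # final line holds the answer, after the CoT
    if "ABSTAIN" in answer:
        return []
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        return []                            # fallback: treat unparsable output as abstention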
Entity Resolution. Term inconsistencies, such as name variants (e.g., Finqa and FinQA), were addressed by clustering over text embeddings using OpenAI’s text-embedding-small, and merging semantically equivalent mentions based on a cosine similarity threshold of ≥ 0.93 (tuned empirically); see Zeakis et al. (2023) for a similar methodology.
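A minimal sketch of this clustering step, assuming the OpenAI embeddings endpoint (with "text-embedding-3-small" as our guess for the exact model identifier) and a simple greedy merge; the actual implementation may cluster differently.

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(mentions: list[str]) -> np.ndarray:
    # "text-embedding-3-small" is our assumption for the model identifier.
    resp = client.embeddings.create(model="text-embedding-3-small", input=mentions)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)   # unit-normalize

def resolve(mentions: list[str], threshold: float = 0.93) -> dict[str, str]:
    """Greedy single-pass clustering: map each mention to the first canonical
    name whose embedding has cosine similarity >= threshold."""
    vecs = embed(mentions)
    canonical: list[tuple[str, np.ndarray]] = []
    mapping: dict[str, str] = {}
    for mention, vec in zip(mentions, vecs):
        for name, cvec in canonical:
            if float(vec @ cvec) >= threshold:   # cosine similarity of unit vectors
                mapping[mention] = name
                break
        else:
            canonical.append((mention, vec))
            mapping[mention] = mention
    return mapping

# Example: variants such as "Finqa" and "FinQA" should map to one canonical name.
# print(resolve(["FinQA", "Finqa", "FPB"]))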
Taxonomy Induction. We organized selected entity types into taxonomies. We applied zero-shot LLM-based categorization by iteratively prompting the model on subsets of entities and manually merging the resulting hierarchies. Each entity is annotated with the taxonomy categories it is an example of.
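A sketch of the iterative batching idea, reusing the call_llm placeholder from the extraction sketch; the prompt text and batch size are assumptions, and the resulting hierarchies would still be merged and corrected by hand as described above.

import json

def call_llm(prompt: str) -> str:   # placeholder LLM call, as in the extraction sketch
    raise NotImplementedError

def induce_taxonomy(entities: list[str], batch_size: int = 50) -> dict[str, list[str]]:
    """Zero-shot categorization over batches of entity names, reusing categories
    proposed for earlier batches so the hierarchy grows incrementally."""
    taxonomy: dict[str, list[str]] = {}     # category -> entity names
    for start in range(0, len(entities), batch_size):
        batch = entities[start:start + batch_size]
        prompt = (
            "Group the following financial NLP entity names into a small set of "
            f"categories, reusing these existing categories when they fit: {sorted(taxonomy)}.\n"
            "Answer with a JSON object mapping each category to a list of names.\n\n"
            + "\n".join(batch)
        )
        for category, names in json.loads(call_llm(prompt)).items():
            taxonomy.setdefault(category, []).extend(names)
    return taxonomy   # hierarchies from different runs are then merged manually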
Relevance Scoring. We define a paper relevance score based on three factors: i) institutional centrality, i.e., the PageRank of affiliated institutions in the co-authorship graph; ii) productivity, i.e., the number of papers published by the institution; iii) citation normalization, i.e., paper citations normalized by year-average citations. We have used these scores to capture emerging trends, by paying more attention to the most relevant papers.
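The sketch below shows one way these three factors could be computed and combined, using networkx for the institutional PageRank; the equal weighting and the per-paper aggregation by maximum over institutions are our assumptions, not choices stated in the paper.

from collections import Counter, defaultdict
import networkx as nx

def relevance_scores(papers: list[dict]) -> dict[str, float]:
    """papers: dicts with keys 'id', 'institutions', 'citations', 'year'."""
    # i) Institutional centrality: PageRank on a co-authorship graph of institutions.
    g = nx.Graph()
    for p in papers:
        insts = p["institutions"]
        g.add_nodes_from(insts)
        g.add_edges_from((a, b) for i, a in enumerate(insts) for b in insts[i + 1:])
    centrality = nx.pagerank(g) if g.number_of_nodes() else {}

    # ii) Productivity: number of papers per institution.
    productivity = Counter(i for p in papers for i in p["institutions"])

    # iii) Citation normalization: citations divided by that year's average.
    by_year = defaultdict(list)
    for p in papers:
        by_year[p["year"]].append(p["citations"])
    year_avg = {y: sum(v) / len(v) for y, v in by_year.items()}

    scores = {}
    for p in papers:
        cent = max((centrality.get(i, 0.0) for i in p["institutions"]), default=0.0)
        prod = max((productivity[i] for i in p["institutions"]), default=0) / max(len(papers), 1)
        cites = p["citations"] / year_avg[p["year"]] if year_avg[p["year"]] else 0.0
        scores[p["id"]] = cent + prod + cites   # equal weighting is an assumption
    return scores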
4 Findings and Insights
In this section, we showcase the types of analyses that MetaGraph can enable. We focus on how ChatGPT’s release in late 2022 marked a turning point for Financial NLP. Until then, the field focused mainly on sentiment analysis, information extraction, and stock prediction. Up until April 2023, these tasks constituted 90% of published work, and the most widely used datasets (Table 2a) reflected this focus.

LLMs dramatically altered this landscape, unlocking new applications and pushing research toward more complex tasks. Financial QA has become the leading focus, rising from 10% to 33% of tasks by 2025, while traditional tasks have steadily declined (see Table 6 for the detailed distribution of tasks over time).

Figure 2: Increasing Focus on Financial QA. Task frequency here is the number of papers with an instance of the task category.
LLMs have transformed the way researchers approach financial problems. This shift moves from narrow, task-specific pipelines to flexible, generative systems that bridge previously isolated tasks. Between April 2023 and February 2024, the average number of tasks per paper rose from 1.36 to 1.9. Traditional tasks such as sentiment analysis and information extraction are now often used as intermediate steps in broader systems, such as RAG and financial agents.
Data Sources and Datasets. Datasets changed too. On the one hand, QA benchmarks now lead the field, overtaking traditional datasets (Table 2). On the other hand, we witnessed an expansion and diversification of the data sources used to generate QA datasets. Recent papers increasingly mention multimodal and structured inputs—such as tables, charts, audio, and analyst commentaries—alongside core sources such as news and company reports (Table 3). This expansion has been supported by synthetic data generation, which reduced the need for expert annotations. The share of synthetic or human-in-the-loop datasets nearly tripled as LLMs became data generators, from 5% in April 2023 to almost 15% by November 2024 (see references in Table 10).

Data trends for tasks are plotted in Figure 3. Most new datasets target QA tasks, while the development of datasets for other tasks, such as sentiment analysis, has slowed. An exception is stock prediction, which continues to see new benchmarks due to proprietary constraints that make the relevant data unshareable.
Dataset Task Freq.
FPB (Malo et al., 2014) SA 29
FinQA (Chen et al., 2021) QA 19
FIQA-SA (Maia et al., 2018) SA 15
ConvFinQA (Chen et al., 2022) QA 13
REFinD (Kaur et al., 2023) RE 7
(a) Nov 2022 – Feb 2024

Dataset Task Freq.
ConvFinQA (Chen et al., 2022) QA 13
FPB (Malo et al., 2014) SA 13
FinQA (Chen et al., 2021) QA 12
FIQA-SA (Maia et al., 2018) SA 7
FinanceBench (Islam et al., 2023) QA 7
(b) Feb 2024 – Apr 2025

Table 2: Top datasets in Financial NLP by usage across two time periods (QA: Question Answering, SA: Sentiment Analysis, RE: Relation Extraction). We can note lower dataset usage frequencies in the second bin (a sign of fragmentation of the dataset landscape), with the relative proportion of QA vs. non-QA shifting to favor QA.
Table 3: Distribution of data sources and signal types across time periods (T1: Jan ’22–Aug ’23, T2: Sep ’23–Jul ’24, T3: Aug ’24–Apr ’25).

Sources T1 (%) T2 (%) T3 (%)
News 27.48 29.14 25.35
Social Media / Forums 21.85 14.20 14.43
Company Reports 28.15 27.37 28.99
Company Fundamentals & Indicators 11.04 15.09 15.13
Earnings Calls 5.18 7.10 6.58
Analyst Reports 4.73 3.25 3.92
University Textbooks 1.13 0.89 2.38
Financial Analyst Exams 0.45 2.96 3.22

Signals T1 (%) T2 (%) T3 (%)
Text 80.42 73.72 70.44
Tables 16.08 19.87 20.13
Image 1.40 2.56 5.03
Audio 0.70 1.92 1.89
Other 2.10 1.92 2.52
Figure 3: New datasets by period.
4.1 A Growing Awareness

LLMs have lowered key barriers to both adoption and data processing. On one hand, they remove data format constraints—enabling the processing of unstructured data. On the other, they support synthetic data generation, helping mitigate challenges such as cost, scarcity, and domain bias (Table 4). We show how limitations have changed over time in Figure 4. As data constraints eased, research attention increasingly shifted toward model-level challenges—particularly reasoning, interpretability, efficiency, and safety. We observed growing concerns around bias, privacy risks, and potential misuse (Table 7).

Figure 4: Reported limitations by period. Synthetic data share complements data scarcity concerns.
This shift toward critical reflection is evident in the evolution of research motivations, which increasingly convey a more cautious stance toward LLMs. By 2024, critical themes such as robustness, efficiency, reasoning, and RAG appeared in nearly 18% of papers—twice the share observed in early 2023 (see Table 5). This marks a shift from earlier studies, which primarily focused on leveraging LLMs through zero-shot learning and fine-tuning.
4.2 From Models to Systems
In the wake of GenAI’s rise, researchers initially focused on adapting general-purpose LLMs to financial NLP tasks through prompt engineering—especially zero-shot and in-context learning—which quickly gained momentum across applications. This was often complemented by post-training methods such as instruction tuning to further specialize models for the financial domain (Table 9).

Researchers began to move beyond model-centric approaches as the limitations of reasoning, safety, interpretability, and scalability became more apparent.
Table 4: Distribution of reported limitations across time periods.

Limitation T1 (%) T2 (%) T3 (%)
Data-related Limitations
Costly Human Judging 3.20 3.22 2.83
Insufficient Data Scale/Coverage 12.45 13.00 10.59
Skewed/Imbalanced Classes 3.08 2.29 1.47
Domain/Language Bias 12.03 11.69 9.59
LLM Limitations
Interpretability Gaps 1.63 2.29 2.04
Weak Reasoning 4.11 4.59 5.24
Cost & Environmental Footprint 2.66 2.79 3.46
Hallucination & Bias 2.18 2.95 3.72
Prompt Sensitivity 1.75 2.84 2.78
Latency / Scalability 2.06 2.18 2.57
Synthetic Data / Label Issues 7.38 4.86 5.14
Capacity Constraints 9.07 9.07 9.70
Gaps: Lab vs. Live 9.43 8.68 10.27
Other (see appendix) 28.81 29.55 30.06
Table 5: Distribution of future research directions across time periods.

Direction T1 (%) T2 (%) T3 (%)
Data
Data Scarcity & Annotation Cost 32.25 28.82 23
Other 37.12 35.81 35.48
Exploiting LLMs
Zero/Few-Shot Evaluation 4.41 4.37 2.53
Domain-Specific LLM Training 10.44 8.95 11.89
Solving LLM Limitations
Quantitative Reasoning Gaps 5.10 5.02 6.43
Interpretability & Explainability 3.25 3.71 5.07
Efficiency Constraints 3.02 5.24 5.65
Safety, Robustness, & Fairness 2.78 5.90 7.21
RAG & Retrieval Bottlenecks 1.62 2.18 2.73
Over time, these techniques were increasingly complemented by system-level innovations that integrate LLMs into broader frameworks (Figure 5).

The most prominent of these is RAG, which has become a cornerstone of the field. Zooming in on RAG’s evolution (Table 8), we find it mirrors the dataset trend: the spectrum of source types and data formats has widened, knowledge bases have grown, and the size of retrieved context has expanded from single sentences to large document chunks.
Figure 5: Technique evolution over time.
Figure 6: Share of papers using open-source models
by task and timeframe.
This marks a move from standalone LLMs to system-oriented design. Prompting strategies have evolved as well (Table 9): the progression from in-context learning to augmented methods such as chain-of-thought, retrieval-based prompts, and self-criticism reflects a move away from relying solely on the model’s few-shot capabilities toward more deliberate prompt enrichment aimed at reducing errors.

4.3 Towards Maturity
As the field matured, researchers began prioritizing shared resources over creating new datasets, increasingly relying on established, literature-backed benchmarks (Figure 14), with notable growth in datasets covering different tasks.

A similar trend emerged on the modeling side. The community increasingly turned to open-source models, valued for their transparency, controllability, and adaptability, as attention shifted from rapid expansion and adaptation to critical evaluation (Figure 6). Figure 7 illustrates three key phases: the early dominance of GPT models, the emergence of LLaMA (Touvron et al., 2023), and the current diversification toward a mix of open models—such as Qwen (Bai et al., 2023) and DeepSeek (DeepSeek-AI et al., 2025)—and proprietary ones.

Figure 8 shows how model sizes have also changed over time. The field is revisiting cost-performance tradeoffs, driven by the financial and computational cost of large models. This shift is reflected in the observed peak and a recent inflection in model size.
Figure 7: LLMs’ usage distribution over time.

Figure 8: Open-source LLMs’ sizes over time.

One Revolution, Two Speeds. GenAI reshaped industry and academia at different paces (Figure 9). We took all the instances of tasks, models, and datasets in our corpus, and computed the relative proportion of financial QA instances, open model instances, new datasets (datasets created after 2022), and created datasets (the dataset has been created by the same authors who are using it). Industry moved faster—dominating financial QA and driving dataset innovation to stay competitive. Academia responded more cautiously, focusing on established tasks and open-source models, with a stronger emphasis on transparency and reproducibility. This is likely due to academia’s structural constraints, which prioritize transparency, reproducibility, and the use of publicly available data and models—factors that inherently slow down the adoption of cutting-edge approaches. In contrast, industry has largely traded off transparency in favor of rapid experimentation, leveraging proprietary data and closed-source LLMs to push forward advanced use cases such as Financial QA.
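As a rough illustration of how these shares can be derived from the graph, the sketch below assumes a flat table with one boolean column per indicator and a sector label per paper; the column names and example rows are hypothetical, not the actual MetaGraph schema.

import pandas as pd

# Hypothetical rows; in practice these would be produced by querying MetaGraph.
papers = pd.DataFrame([
    {"sector": "industry", "financial_qa": True,  "open_model": False, "new_dataset": True,  "created_dataset": True},
    {"sector": "academia", "financial_qa": False, "open_model": True,  "new_dataset": False, "created_dataset": False},
    {"sector": "industry", "financial_qa": True,  "open_model": True,  "new_dataset": True,  "created_dataset": False},
])

indicators = ["financial_qa", "open_model", "new_dataset", "created_dataset"]
shares = papers.groupby("sector")[indicators].mean()   # relative proportion per sector
print(shares)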
4.4 Looking Ahead
Financial NLP is entering a new phase, driven not just by LLMs’ impact but by a deeper understanding of their strengths and limitations. As techniques such as RAG and open-source fine-tuning become standard (grey line in Figure 11), multimodal and small language models (green line) are gaining ground.
Figure 9: Academia vs Industry.
Figure 10: Latest trends in financial NLP
New trends are also emerging (blue line in Figure 11), most notably multi-agent systems. These range from simple expert-critic setups to more complex designs. Hierarchical agents simulate organizational roles and top-down flows, while collaborative systems divide tasks among peer agents (see references in Table 10).

Finally, reinforcement learning is re-entering the conversation: not only as a training objective for trading policies, but also as a tool for improving LLM reasoning itself by aligning it with a broader system, ensuring more controlled and reliable output.

The gap between academic research and real-world financial practice remains an open debate, as the focus shifts from QA to reasoning systems.
5 Conclusion
MetaGraph is a scalable, general-purpose framework for analyzing scientific literature via LLM-powered knowledge graphs. Applied to 681 Financial NLP papers, it traces the field’s evolution, from rapid LLM adoption and multimodal expansion to a growing emphasis on system-level integration. Recent work shifts toward architectural solutions such as RAG, agent-based workflows, and reinforcement learning. MetaGraph offers a structured lens on the field’s changing priorities and a reusable toolkit for data-driven meta-analysis.

Figure 11: Latest trends in financial NLP.
6 Limitations
• Our approach relies on a manually defined ontology, which introduces an inductive bias in how entities and relations are categorized. While this provides structure and interpretability, it may also limit flexibility and overlook alternative or emergent conceptualizations.

• Despite the use of a human-in-the-loop approach with continuous validation, the entity extraction and taxonomy induction processes remain based on LLMs, which are inherently susceptible to hallucinations, inaccuracies, and bias. These limitations may affect both the precision and completeness of the extracted knowledge.
References
Hamed Babaei Giglou, Jennifer D’Souza, and Sören Auer. 2023. LLMs4OL: Large language models for ontology learning. In The Semantic Web – ISWC 2023, pages 408–427, Cham. Springer Nature Switzerland.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, and 29 others. 2023. Qwen technical report. Preprint, arXiv:2309.16609.
Gagan Bhatia, El Moatez Billah Nagoudi, Hasan Cavusoglu, and Muhammad Abdul-Mageed. 2024. FinTral: A family of GPT-4 level multimodal financial large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13064–13087, Bangkok, Thailand. Association for Computational Linguistics.

Salvatore Carta, Alessandro Giuliani, Leonardo Piano, Alessandro Sebastian Podda, Livio Pompianu, and Sandro Gabriele Tiddia. 2023. Iterative zero-shot LLM prompting for knowledge graph construction. Preprint, arXiv:2307.01128.

Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, and Yusuke Miyao. 2024a. Hierarchical organization simulacra in the investment sector. Preprint, arXiv:2410.00354.

Wei Chen, Qiushi Wang, Zefei Long, Xianyin Zhang, Zhongtian Lu, Bingxuan Li, Siyuan Wang, Jiarong Xu, Xiang Bai, Xuanjing Huang, and Zhongyu Wei. 2023. DISC-FinLLM: A Chinese financial large language model based on multiple experts fine-tuning. Preprint, arXiv:2310.15205.

Yuemin Chen, Feifan Wu, Jingwei Wang, Hao Qian, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, and Meng Wang. 2024b. Knowledge-augmented financial market analysis and report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1207–1217, Miami, Florida, US. Association for Computational Linguistics.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6279–6292, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Nicole Cho, Nishan Srishankar, Lucas Cecchi, and William Watson. 2024. FISHNET: Financial intelligence from sub-querying, harmonizing, neural-conditioning, expert swarms, and task planning. In Proceedings of the 5th ACM International Conference on AI in Finance, ICAIF ’24, page 591–599, New York, NY, USA. Association for Computing Machinery.

Raul Salles de Padua, Imran Qureshi, and Mustafa U. Karakaplan. 2023. GPT-3 models are few-shot financial reasoners. Preprint, arXiv:2307.13617.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 181 others. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Preprint, arXiv:2501.12948.

Xiang Deng, Vasilisa Bashlovkina, Feng Han, Simon Baumgartner, and Michael Bendersky. 2022. What do LLMs know about financial markets? A case study on Reddit market sentiment analysis. Preprint, arXiv:2212.11311.
Kelvin Du, Yazhi Zhao, Rui Mao, Frank Xing, and Erik Cambria. 2025. Natural language processing in finance: A survey. Information Fusion, 115:102755.

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2025. From local to global: A graph RAG approach to query-focused summarization. Preprint, arXiv:2404.16130.

Sorouralsadat Fatemi and Yuheng Hu. 2024. Enhancing financial question answering with a multi-agent reflection framework. In Proceedings of the 5th ACM International Conference on AI in Finance, ICAIF ’24, page 530–537. ACM.

George Fatouros, Kostas Metaxas, John Soldatos, and Dimosthenis Kyriazis. 2024. Can large language models beat Wall Street? Evaluating GPT-4’s impact on financial decision-making with MarketSenseAI. Neural Computing and Applications.
Maurice Funk, Simon Hosemann, Jean Christoph Jung, and Carsten Lutz. 2023. Towards ontology construction with language models. In Joint Proceedings of the 1st Workshop on Knowledge Base Construction from Pre-Trained Language Models (KBC-LM) and the 2nd Challenge on Language Models for Knowledge Base Construction (LM-KBC), co-located with the 22nd International Semantic Web Conference (ISWC 2023), Athens, Greece, November 6, 2023, volume 3577 of CEUR Workshop Proceedings. CEUR-WS.org.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. ArXiv, abs/2312.10997.
Jiayu Guo, Yu Guo, Martha Li, and Songtao Tan. 2025. FLAME: Financial large-language model assessment and metrics evaluation. Preprint, arXiv:2501.06211.

Yue Guo and Yi Yang. 2024. EconNLI: Evaluating large language models on economics reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 982–994, Bangkok, Thailand. Association for Computational Linguistics.
Xuewen Han, Neng Wang, Shangkun Che, Hongyang Yang, Kunpeng Zhang, and Sean Xin Xu. 2024. Enhancing Investment Analysis: Optimizing AI-Agent Collaboration in Financial Research. Papers 2411.04788, arXiv.org.

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. FinanceBench: A new benchmark for financial question answering. Preprint, arXiv:2311.11944.

Simerjot Kaur, Charese Smiley, Akshat Gupta, Joy Sain, Dongsheng Wang, Suchetha Siddagangappa, Toyin Aguda, and Sameena Shah. 2023. REFinD: Relation extraction financial dataset. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, page 3054–3063, New York, NY, USA. Association for Computing Machinery.

Rik Koncel-Kedziorski, Michael Krumdick, Viet Lai, Varshini Reddy, Charles Lovering, and Chris Tanner. 2024. BizBench: A quantitative reasoning benchmark for business and finance. Preprint, arXiv:2311.06602.

Michael Krumdick, Rik Koncel-Kedziorski, Viet Dac Lai, Varshini Reddy, Charles Lovering, and Chris Tanner. 2024. BizBench: A quantitative reasoning benchmark for business and finance. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8309–8332, Bangkok, Thailand. Association for Computational Linguistics.

Viet Dac Lai, Michael Krumdick, Charles Lovering, Varshini Reddy, Craig Schmidt, and Chris Tanner. 2024. SEC-QA: A systematic evaluation corpus for financial QA. Preprint, arXiv:2406.14394.

Jiangtong Li, Yuxuan Bian, Guoxuan Wang, Yang Lei, Dawei Cheng, Zhijun Ding, and Changjun Jiang. 2023a. CFGPT: Chinese financial assistant with large language model. Preprint, arXiv:2309.10654.

Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, and Jun Huang. 2024a. AlphaFin: Benchmarking financial analysis with retrieval-augmented stock-chain framework. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 773–783, Torino, Italia. ELRA and ICCL.

Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. 2023b. Large language models in finance: A survey. arXiv preprint arXiv:2311.10723. Accepted at the 4th ACM International Conference on AI in Finance (ICAIF-23).
Yuan Li, Bingqiao Luo, Qian Wang, Nuo Chen, Xu Liu, and Bingsheng He. 2024b. CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1094–1106, Miami, Florida, USA. Association for Computational Linguistics.

Jiaxin Liu, Yi Yang, and Kar Yan Tam. 2025a. Evaluating and aligning human economic risk preferences in LLMs. Preprint, arXiv:2503.06646.
Xiao-Yang Liu, Guoxuan Wang, Hongyang Yang, and Daochen Zha. 2023a. FinGPT: Democratizing internet-scale data for financial large language models. Preprint, arXiv:2307.10485.

Xiao-Yang Liu, Guoxuan Wang, Hongyang Yang, and Daochen Zha. 2023b. FinGPT: Democratizing Internet-scale Data for Financial Large Language Models. Papers 2307.10485, arXiv.org.

Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, and Liwen Zhang. 2025b. Fin-R1: A large language model for financial reasoning through reinforcement learning. Preprint, arXiv:2503.16252.
Zhiwei Liu, Xin Zhang, Kailai Yang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2025c. FMDLlama: Financial misinformation detection based on large language models. In Companion Proceedings of the ACM on Web Conference 2025, WWW ’25, page 1153–1157, New York, NY, USA. Association for Computing Machinery.

Dakuan Lu, Hengkui Wu, Jiaqing Liang, Yipei Xu, Qianyu He, Yipeng Geng, Mengkun Han, Yingsi Xin, and Yanghua Xiao. 2023. BBT-Fin: Comprehensive construction of Chinese financial domain pre-trained language model, corpus and benchmark. Preprint, arXiv:2302.09432.
Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW’18 open challenge: Financial opinion mining and question answering. In Companion Proceedings of the The Web Conference 2018, WWW ’18, page 1941–1942, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.

Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4):782–796.

Yuqi Nie, Yaxuan Kong, Xiaowen Dong, John M. Mulvey, H. Vincent Poor, Qingsong Wen, and Stefan Zohren. 2024. A survey of large language models for financial applications: Progress, prospects and challenges. Preprint, arXiv:2406.11903.
Sohini Roychowdhury. 2024. Journey of hallucination-minimized generative AI solutions for financial decision makers. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM ’24, page 1180–1181, New York, NY, USA. Association for Computing Machinery.

Manish Sanwal. 2025. Layered chain-of-thought prompting for multi-agent LLM systems: A comprehensive approach to explainable large language models. Preprint, arXiv:2501.18645.

Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, and 12 others. 2025. The prompt report: A systematic survey of prompt engineering techniques. Preprint, arXiv:2406.06608.

Raj Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, and Diyi Yang. 2022. When FLUE meets FLANG: Benchmarks and large pretrained language model for financial domain. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2322–2335, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Jiashuo Sun, Hang Zhang, Chen Lin, Xiangdong Su, Yeyun Gong, and Jian Guo. 2024. APOLLO: An optimized training approach for long-form numerical reasoning. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1370–1382, Torino, Italia. ELRA and ICCL.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. Preprint, arXiv:2302.13971.

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A large language model for finance. Preprint, arXiv:2303.17564.

Zengqing Wu, Run Peng, Shuyuan Zheng, Qianying Liu, Xu Han, Brian I. Kwon, Makoto Onizuka, Shaojie Tang, and Chuan Xiao. 2024. Shall we team up: Exploring spontaneous cooperation of competing LLM agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5163–5186, Miami, Florida, USA. Association for Computational Linguistics.
Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, and 15 others. 2024. FinBen: A holistic financial benchmark for large language models. In Advances in Neural Information Processing Systems, volume 37, pages 95716–95743. Curran Associates, Inc.

Qianqian Xie, Weiguang Han, Yanzhao Lai, Min Peng, and Jimin Huang. 2023a. The Wall Street neophyte: A zero-shot analysis of ChatGPT over multimodal stock movement prediction challenges. Preprint, arXiv:2304.05351.

Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. 2023b. PIXIU: A large language model, instruction data and evaluation benchmark for finance. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.

Frank Z. Xing, Erik Cambria, and Roy E. Welsch. 2018. Natural language based financial forecasting: A survey. Artificial Intelligence Review, 50(1):49–73.
Siqiao Xue, Fan Zhou, Yi Xu, Ming Jin, Qingsong Wen, Hongyan Hao, Qingyang Dai, Caigao Jiang, Hongyu Zhao, Shuo Xie, Jianshan He, James Zhang, and Hongyuan Mei. 2024. WeaverBird: Empowering financial decision-making with large language model, knowledge base, and search engine. Preprint, arXiv:2308.05361.

Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023a. FinGPT: Open-source financial large language models. Preprint, arXiv:2306.06031.

Hongyang Yang, Boyu Zhang, Neng Wang, Cheng Guo, Xiaoli Zhang, Likun Lin, Junlin Wang, Tianyu Zhou, Mao Guan, Runjia Zhang, and Christina Dan Wang. 2024. FinRobot: An open-source AI agent platform for financial applications using large language models. Preprint, arXiv:2405.14767.

Yi Yang, Yixuan Tang, and Kar Yan Tam. 2023b. InvestLM: A large language model for investment using financial domain instruction tuning. Preprint, arXiv:2309.13064.
Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Renyu Li. 2024. Financial report chunking for effective retrieval augmented generation. Preprint, arXiv:2402.05131.

Xinli Yu, Zheng Chen, and Yanbin Lu. 2023. Harnessing LLMs for temporal data - a study on explainable financial time series forecasting. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 739–753, Singapore. Association for Computational Linguistics.

Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Zhenyu Cui, Rong Liu, Zhaozhuo Xu, Denghui Zhang, Koduvayur Subbalakshmi, Guojun Xiong, Yueru He, Jimin Huang, Dong Li, and Qianqian Xie. 2024. FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. In Advances in Neural Information Processing Systems, volume 37, pages 137010–137045. Curran Associates, Inc.

Alexandros Zeakis, George Papadakis, Dimitrios Skoutas, and Manolis Koubarakis. 2023. Pre-trained embeddings for entity resolution: An experimental analysis. Proc. VLDB Endow., 16(9):2225–2238.

Boyu Zhang, Hongyang Yang, Tianyu Zhou, Muhammad Ali Babar, and Xiao-Yang Liu. 2023. Enhancing financial sentiment analysis via retrieval augmented large language models. In Proceedings of the Fourth ACM International Conference on AI in Finance, ICAIF ’23, page 349–356, New York, NY, USA. Association for Computing Machinery.

Yiyun Zhao, Prateek Singh, Hanoz Bhathena, Bernardo Ramos, Aviral Joshi, Swaroop Gadiyaram, and Saket Sharma. 2024. Optimizing LLM based retrieval augmented generation pipelines in the financial domain. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 279–294, Mexico City, Mexico. Association for Computational Linguistics.
A The Geography of Financial NLP
The map depicted in Figure 12 illustrates the geographical distribution of institutions represented in our corpus. It clearly highlights that financial NLP research predominantly clusters around three major global hubs. In the United States, research activity is highly concentrated along the Atlantic Coast, with a distinct epicenter in New York City. In East Asia, significant research centers have emerged in major economic and technological hubs, notably within China, Korea, Japan,