Computation and Language 67
☆ Observational Scaling Laws and the Predictability of Language Model Performance
Understanding how language model performance varies with scale is critical to
benchmark and algorithm development. Scaling laws are one approach to building
this understanding, but the requirement of training models across many
different scales has limited their use. We propose an alternative,
observational approach that bypasses model training and instead builds scaling
laws from ~80 publicly available models. Building a single scaling law from
multiple model families is challenging due to large variations in their
training compute efficiencies and capabilities. However, we show that these
variations are consistent with a simple, generalized scaling law where language
model performance is a function of a low-dimensional capability space, and
model families only vary in their efficiency in converting training compute to
capabilities. Using this approach, we show the surprising predictability of
complex scaling phenomena: we show that several emergent phenomena follow a
smooth, sigmoidal behavior and are predictable from small models; we show that
the agent performance of models such as GPT-4 can be precisely predicted from
simpler non-agentic benchmarks; and we show how to predict the impact of
post-training interventions like Chain-of-Thought and Self-Consistency as
language model capabilities continue to improve.
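As a rough illustration of the sigmoidal behavior described in the abstract above, the sketch below maps training compute to a scalar capability score and then to a benchmark accuracy. The functional form, the function names, and every numeric value are assumptions chosen for illustration; they are not the fitted law or the data from the paper.

```python
import math


def capability(log_compute: float, efficiency: float, offset: float = 0.0) -> float:
    """Hypothetical capability score: linear in log-compute, with a per-family efficiency."""
    return efficiency * log_compute + offset


def predicted_accuracy(cap: float, slope: float = 1.0, midpoint: float = 0.0,
                       floor: float = 0.0) -> float:
    """Sigmoid link from capability to accuracy, with an optional chance-level floor."""
    return floor + (1.0 - floor) / (1.0 + math.exp(-slope * (cap - midpoint)))


# Two made-up model families that differ only in compute efficiency.
for family, eff in [("family_a", 1.0), ("family_b", 0.7)]:
    for log_c in (20.0, 22.0, 24.0):
        cap = capability(log_c, eff)
        acc = predicted_accuracy(cap, slope=0.8, midpoint=18.0, floor=0.25)
        print(f"{family} log10(FLOPs)={log_c:.0f} capability={cap:.1f} accuracy={acc:.3f}")
```

Under these assumptions, the less efficient family reaches the same accuracy only at a higher compute budget, which is the sense in which families "only vary in their efficiency".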
☆ A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers
Kaiyu Huang, Fengran Mo, Hongliang Li, You Li, Yuanchi Zhang, Weijian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, Jinan Xu, Jian-Yun Nie, Yang Liu
The rapid development of Large Language Models (LLMs) demonstrates remarkable
multilingual capabilities in natural language processing, attracting global
attention in both academia and industry. Developing language-fair technology
is important for mitigating potential discrimination and enhancing the overall
usability and accessibility for diverse language user groups.
Despite the breakthroughs of LLMs, the investigation into the multilingual
scenario remains insufficient, where a comprehensive survey to summarize recent
approaches, developments, limitations, and potential solutions is desirable. To
this end, we provide a survey with multiple perspectives on the utilization of
LLMs in the multilingual scenario. We first rethink the transitions between
previous and current research on pre-trained language models. Then we introduce
several perspectives on the multilingualism of LLMs, including training and
inference methods, model security, multi-domain with language culture, and
usage of datasets. We also discuss the major challenges that arise in these
aspects, along with possible solutions. Besides, we highlight future research
directions that aim at further enhancing LLMs with multilingualism. The survey
aims to help the research community address multilingual problems and provide a
comprehensive understanding of the core concepts, key techniques, and latest
developments in multilingual natural language processing based on LLMs.
comment: 54 pages, Work in Progress
☆ GenToC: Leveraging Partially-Labeled Data for Product Attribute-Value Identification
In the e-commerce domain, the accurate extraction of attribute-value pairs
from product listings (e.g., Brand: Apple) is crucial for enhancing search and
recommendation systems. The automation of this extraction process is
challenging due to the vast diversity of product categories and their
respective attributes, compounded by the lack of extensive, accurately
annotated training datasets and the demand for low latency to meet the
real-time needs of e-commerce platforms. To address these challenges, we
introduce GenToC, a novel two-stage model for extracting attribute-value pairs
from product titles. GenToC is designed to train with partially-labeled data,
leveraging incomplete attribute-value pairs and obviating the need for a fully
annotated dataset. Moreover, we introduce a bootstrapping method that enables
GenToC to progressively refine and expand its training dataset. This
enhancement substantially improves the quality of data available for training
other neural network models that are typically faster but are inherently less
capable than GenToC in terms of their capacity to handle partially-labeled
data. By supplying an enriched dataset for training, GenToC significantly
advances the performance of these alternative models, making them more suitable
for real-time deployment. Our results highlight the unique capability of GenToC
to learn from a limited set of labeled data and to contribute to the training
of more efficient models, marking a significant leap forward in the automated
extraction of attribute-value pairs from product titles. GenToC has been
successfully integrated into India's largest B2B e-commerce platform,
IndiaMART.com, achieving a significant increase of 21.1% in recall over the
existing deployed system while maintaining a high precision of 89.5% in this
challenging task.
☆ COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain
Dimitrios P. Panagoulias, Persephone Papatheodosiou, Anastasios P. Palamidas, Mattheos Sanoudos, Evridiki Tsoureli-Nikita, Maria Virvou, George A. Tsihrintzis
Large Language Models (LLMs) constitute a breakthrough state-of-the-art
Artificial Intelligence (AI) technology which is rapidly evolving and promises
to aid in medical diagnosis either by assisting doctors or by simulating a
doctor's workflow in more advanced and complex implementations. In this
technical paper, we outline Cognitive Network Evaluation Toolkit for Medical
Domains (COGNET-MD), which constitutes a novel benchmark for LLM evaluation in
the medical domain. Specifically, we propose a scoring-framework with increased
difficulty to assess the ability of LLMs in interpreting medical text. The
proposed framework is accompanied by a database of Multiple Choice Quizzes
(MCQs). To ensure alignment with current medical trends and enhance safety,
usefulness, and applicability, these MCQs have been constructed in
collaboration with several associated medical experts in various medical
domains and are characterized by varying degrees of difficulty. The current
(first) version of the database includes the medical domains of Psychiatry,
Dentistry, Pulmonology, Dermatology and Endocrinology, but it will be
continuously extended and expanded to include additional medical domains.
comment: Technical Paper
☆ Tailoring Vaccine Messaging with Common-Ground Opinions NAACL
Rickard Stureborg, Sanxing Chen, Ruoyu Xie, Aayushi Patel, Christopher Li, Chloe Qinyu Zhu, Tingnan Hu, Jun Yang, Bhuwan Dhingra
One way to personalize chatbot interactions is by establishing common ground
with the intended reader. A domain where establishing mutual understanding
could be particularly impactful is vaccine concerns and misinformation. Vaccine
interventions are forms of messaging which aim to answer concerns expressed
about vaccination. Tailoring responses in this domain is difficult, since
opinions often have seemingly little ideological overlap. We define the task of
tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring
responses to a CGO involves meaningfully improving the answer by relating it to
an opinion or belief the reader holds. In this paper we introduce TAILOR-CGO, a
dataset for evaluating how well responses are tailored to provided CGOs. We
benchmark several major LLMs on this task, finding that GPT-4-Turbo performs
significantly better than others. We also build automatic evaluation metrics,
including an efficient and accurate BERT model that outperforms finetuned LLMs,
investigate how to successfully tailor vaccine messaging to CGOs, and provide
actionable recommendations from this investigation.
Code and model weights: https://github.com/rickardstureborg/tailor-cgo
Dataset: https://huggingface.co/datasets/DukeNLP/tailor-cgo
comment: NAACL Findings 2024
☆ ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains IJCAI 2024
Understanding the process of emotion generation is crucial for analyzing the
causes behind emotions. Causal Emotion Entailment (CEE), an
emotion-understanding task, aims to identify the causal utterances in a
conversation that stimulate the emotions expressed in a target utterance.
However, current works in CEE mainly focus on modeling semantic and emotional
interactions in conversations, neglecting the exploration of the
emotion-generation process. This hinders the models from deeply understanding
emotions, restricting their ability to produce explainable predictions. In this
work, inspired by the emotion generation process of
"stimulus-appraisal-emotion" in the cognitive appraisal theory, we introduce a
step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to
infer the stimulus from the target emotional expressions in conversations.
Specifically, we first introduce the ECR-Chain to ChatGPT via few-shot
prompting, which significantly improves its performance on the CEE task. We
further propose an automated construction process to utilize ChatGPT in
building an ECR-Chain set, which can enhance the reasoning abilities of smaller
models through supervised training and assist the Vicuna-7B model in achieving
state-of-the-art CEE performance. Moreover, our methods can enable these
generative language models to effectively perform emotion-cause reasoning in an
explainable manner. Our code, data and more details are at
https://github.com/hzp3517/ECR-Chain.
comment: Accepted by IJCAI 2024
☆ ActiveLLM: Large Language Model-based Active Learning for Textual Few-Shot Scenarios
Active learning is designed to minimize annotation efforts by prioritizing
instances that most enhance learning. However, many active learning strategies
struggle with a 'cold start' problem, needing substantial initial data to be
effective. This limitation often reduces their utility for pre-trained models,
which already perform well in few-shot scenarios. To address this, we introduce
ActiveLLM, a novel active learning approach that leverages large language
models such as GPT-4, Llama 3, and Mistral Large for selecting instances. We
demonstrate that ActiveLLM significantly enhances the classification
performance of BERT classifiers in few-shot scenarios, outperforming both
traditional active learning methods and the few-shot learning method SetFit.
Additionally, ActiveLLM can be extended to non-few-shot scenarios, allowing for
iterative selections. In this way, ActiveLLM can even help other active
learning strategies to overcome their cold start problem. Our results suggest
that ActiveLLM offers a promising solution for improving model performance
across various learning setups.
comment: 18 pages, 7 figures, 4 tables
☆ Empowering Small-Scale Knowledge Graphs: A Strategy of Leveraging General-Purpose Knowledge Graphs for Enriched Embeddings LREC
Knowledge-intensive tasks pose a significant challenge for Machine Learning
(ML) techniques. Commonly adopted methods, such as Large Language Models
(LLMs), often exhibit limitations when applied to such tasks. Nevertheless,
there have been notable endeavours to mitigate these challenges, with a
significant emphasis on augmenting LLMs through Knowledge Graphs (KGs). While
KGs provide many advantages for representing knowledge, their development costs
can deter extensive research and applications. Addressing this limitation, we
introduce a framework for enriching embeddings of small-scale domain-specific
Knowledge Graphs with well-established general-purpose KGs. Adopting our
method, a modest domain-specific KG can benefit from a performance boost in
downstream tasks when linked to a substantial general-purpose KG. Experimental
evaluations demonstrate a notable enhancement, with up to a 44% increase
observed in the Hits@10 metric. This relatively unexplored research direction
can catalyze more frequent incorporation of KGs in knowledge-intensive tasks,
resulting in more robust, reliable ML implementations that hallucinate less
than prevalent LLM solutions.
Keywords: knowledge graph, knowledge graph completion, entity alignment,
representation learning, machine learning
comment: Accepted for LREC-COLING 2024
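For readers unfamiliar with the Hits@10 metric reported in the abstract above, here is a minimal sketch of how it is commonly computed for knowledge graph completion. The function name and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of test queries whose correct entity is ranked in the top k (1-based ranks)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up ranks of the correct entity for eight link-prediction queries.
example_ranks = [1, 3, 12, 7, 48, 10, 2, 150]
print(hits_at_k(example_ranks, k=10))
```

Five of the eight example ranks fall within the top 10, so the sketch reports 0.625.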
☆ SBAAM! Eliminating Transcript Dependency in Automatic Subtitling ACL 2024
Subtitling plays a crucial role in enhancing the accessibility of audiovisual
content and encompasses three primary subtasks: translating spoken dialogue,
segmenting translations into concise textual units, and estimating timestamps
that govern their on-screen duration. Past attempts to automate this process
rely, to varying degrees, on automatic transcripts, employed diversely for the
three subtasks. In response to the acknowledged limitations associated with
this reliance on transcripts, recent research has shifted towards
transcription-free solutions for translation and segmentation, leaving the
direct generation of timestamps as uncharted territory. To fill this gap, we
introduce the first direct model capable of producing automatic subtitles,
entirely eliminating any dependence on intermediate transcripts, including for
timestamp prediction. Experimental results, backed by manual evaluation,
showcase our solution's new state-of-the-art performance across multiple
language pairs and diverse conditions.
comment: Accepted to ACL 2024 main conference
☆ Feature-Adaptive and Data-Scalable In-Context Learning ACL 2024
In-context learning (ICL), which promotes inference with several
demonstrations, has become a widespread paradigm to stimulate LLM capabilities
for downstream tasks. Due to context length constraints, it cannot be further
improved in spite of more training data, and general features directly from
LLMs in ICL are not adaptive to the specific downstream task. In this paper, we
propose a feature-adaptive and data-scalable in-context learning framework
(FADS-ICL), which can leverage task-adaptive features to promote inference on
the downstream task, with the supervision of beyond-context samples.
Specifically, it first extracts general features of beyond-context samples via
the LLM with ICL input form one by one, and introduces a task-specific
modulator to perform feature refinement and prediction after fitting a specific
downstream task. We conduct extensive experiments on FADS-ICL under varying
data settings (4 to 128 shots) and LLM scales (0.8B to 70B).
Experimental results show that FADS-ICL consistently outperforms previous
state-of-the-art methods by a significant margin under all settings, verifying
the effectiveness and superiority of FADS-ICL. For example, in the 1.5B,
32-shot setting, FADS-ICL can achieve +14.3 average accuracy from
feature adaptation over vanilla ICL on 10 datasets, with +6.2 average
accuracy over the previous state-of-the-art method, and the performance can
further improve with increasing training data. Code and data are publicly
available at https://github.com/jiahaozhenbang/FADS-ICL.
comment: Accepted at ACL 2024 main conference
☆ INDUS: Effective and Efficient Language Models for Scientific Applications
Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kayleen Bugbee, Mike Little, Elizabeth Fancher, Lauren Sanders, Sylvain Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grazes, Megan Ansdel, Alberto Accomazzi, Yousef El-Kurdi, Davis Wertheimer, Birgit Pfitzmann, Cesar Berrospi Ramis, Michele Dolfi, Rafael Teixeira de Lima, Panos Vegenas, S. Karthik Mukkavilli, Peter Staar, Sanaz Vahidinia, Ryan McGranaghan, Armin Mehrabian, Tsendgar Lee
Large language models (LLMs) trained on general domain corpora showed
remarkable results on natural language processing (NLP) tasks. However,
previous research demonstrated LLMs trained using domain-focused corpora
perform better on specialized tasks. Inspired by this pivotal insight, we
developed INDUS, a comprehensive suite of LLMs tailored for the Earth science,
biology, physics, heliophysics, planetary sciences and astrophysics domains and
trained using curated scientific corpora drawn from diverse data sources. The
suite of models includes: (1) an encoder model trained using domain-specific
vocabulary and corpora to address natural language understanding tasks, (2) a
contrastive-learning-based general text embedding model trained using a diverse
set of datasets drawn from multiple sources to address information retrieval
tasks and (3) smaller versions of these models created using knowledge
distillation techniques to address applications which have latency or resource
constraints. We also created three new scientific benchmark datasets, namely
CLIMATE-CHANGE-NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR
(IR) to accelerate research in these multi-disciplinary fields. Finally, we
show that our models outperform both general-purpose encoders (RoBERTa) and
existing domain-specific encoders (SciBERT) on these new tasks as well as
existing benchmark tasks in the domains of interest.
☆ SignLLM: Sign Languages Production Large Language Models
In this paper, we introduce the first comprehensive multilingual sign
language dataset named Prompt2Sign, which builds from public data including
American Sign Language (ASL) and seven others. Our dataset transforms a vast
array of videos into a streamlined, model-friendly format, optimized for
training with translation models like seq2seq and text2text. Building on this
new dataset, we propose SignLLM, the first multilingual Sign Language
Production (SLP) model, which includes two novel multilingual SLP modes that
allow for the generation of sign language gestures from input text or prompt.
Both of the modes can use a new loss and a module based on reinforcement
learning, which accelerates the training by enhancing the model's capability to
autonomously sample high-quality data. We present benchmark results of SignLLM,
which demonstrate that our model achieves state-of-the-art performance on SLP
tasks across eight sign languages.
comment: 33 pages, website at https://signllm.github.io/
☆ Persian Pronoun Resolution: Leveraging Neural Networks and Language Models
Coreference resolution, critical for identifying textual entities referencing
the same entity, faces challenges in pronoun resolution, particularly
identifying pronoun antecedents. Existing methods often treat pronoun
resolution as a separate task from mention detection, potentially missing
valuable information. This study proposes the first end-to-end neural network
system for Persian pronoun resolution, leveraging pre-trained Transformer
models like ParsBERT. Our system jointly optimizes both mention detection and
antecedent linking, achieving a 3.37 F1 score improvement over the previous
state-of-the-art system (which relied on rule-based and statistical methods) on
the Mehr corpus. This significant improvement demonstrates the effectiveness of
combining neural networks with linguistic models, potentially marking a
significant advancement in Persian pronoun resolution and paving the way for
further research in this under-explored area.
☆ Empowering Prior to Court Legal Analysis: A Transparent and Accessible Dataset for Defensive Statement Classification and Interpretation
The classification of statements provided by individuals during police
interviews is a complex and significant task within the domain of natural
language processing (NLP) and legal informatics. The lack of extensive
domain-specific datasets raises challenges to the advancement of NLP methods in
the field. This paper aims to address some of the present challenges by
introducing a novel dataset tailored for classification of statements made
during police interviews, prior to court proceedings. Utilising the curated
dataset for training and evaluation, we introduce a fine-tuned DistilBERT model
that achieves state-of-the-art performance in distinguishing truthful from
deceptive statements. To enhance interpretability, we employ explainable
artificial intelligence (XAI) methods to offer explainability through saliency
maps that interpret the model's decision-making process. Lastly, we present an
XAI interface that empowers both legal professionals and non-specialists to
interact with and benefit from our system. Our model achieves an accuracy of
86%, and is shown to outperform a custom transformer architecture in a
comparative study. This holistic approach advances the accessibility,
transparency, and effectiveness of statement analysis, with promising
implications for both legal practice and research.
☆ SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks
Diaspora communities are disproportionately impacted by off-the-radar
misinformation and often neglected by mainstream fact-checking efforts,
creating a critical need to scale up the efforts of nascent fact-checking
initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic
Dataset Generation to leverage the capabilities of the largest frontier Large
Language Models (LLMs) to train local, specialized language models. To the best
of our knowledge, SynDy is the first paper utilizing LLMs to create
fine-grained synthetic labels for tasks of direct relevance to misinformation
mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship
Classification. SynDy utilizes LLMs and social media queries to automatically
generate distantly-supervised, topically-focused datasets with synthetic labels
on these three tasks, providing essential tools to scale up human-led
fact-checking at a fraction of the cost of human-annotated data. Training on
SynDy's generated labels shows improvement over a standard baseline and is not
significantly worse compared to training on human labels (which may be
infeasible to acquire). SynDy is being integrated into Meedan's chatbot
tiplines that are used by over 50 organizations, serve over 230K users
annually, and automatically distribute human-written fact-checks via messaging
apps such as WhatsApp. SynDy will also be integrated into our deployed
Co-Insights toolkit, enabling low-resource organizations to launch tiplines for
their communities. Finally, we envision SynDy enabling additional fact-checking
tools such as matching new misinformation claims to high-quality explainers on
common misinformation topics.
☆ Revolutionizing Process Mining: A Novel Architecture for ChatGPT Integration and Enhanced User Experience through Optimized Prompt Engineering
In the rapidly evolving field of business process management, there is a
growing need for analytical tools that can transform complex data into
actionable insights. This research introduces a novel approach by integrating
Large Language Models (LLMs), such as ChatGPT, into process mining tools,
making process analytics more accessible to a wider audience. The study aims to
investigate how ChatGPT enhances analytical capabilities, improves user
experience, increases accessibility, and optimizes the architectural frameworks
of process mining tools. The key innovation of this research lies in developing
a tailored prompt engineering strategy for each process mining submodule,
ensuring that the AI-generated outputs are accurate and relevant to the
context. The integration architecture follows an Extract, Transform, Load (ETL)
process, which includes various process mining engine modules and utilizes
zero-shot and optimized prompt engineering techniques. ChatGPT is connected via
APIs and receives structured outputs from the process mining modules, enabling
conversational interactions. To validate the effectiveness of this approach,
the researchers used data from 17 companies that employ BehfaLab's Process
Mining Tool. The results showed significant improvements in user experience,
with an expert panel rating 72% of the results as "Good". This research
contributes to the advancement of business process analysis methodologies by
combining process mining with artificial intelligence. Future research
directions include further optimization of prompt engineering, exploration of
integration with other AI technologies, and assessment of scalability across
various business environments. This study paves the way for continuous
innovation at the intersection of process mining and artificial intelligence,
promising to revolutionize the way businesses analyze and optimize their
processes.
☆ Realistic Evaluation of Toxicity in Large Language Models
Large language models (LLMs) have become integral to our professional
workflows and daily lives. Nevertheless, these machine companions of ours have
a critical flaw: the huge amount of data that endows them with vast and
diverse knowledge also exposes them to inevitable toxicity and bias. While
most LLMs incorporate defense mechanisms to prevent the generation of harmful
content, these safeguards can be easily bypassed with minimal prompt
engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity
(TET) dataset, comprising manually crafted prompts designed to nullify the
protective layers of such models. Through extensive evaluations, we demonstrate
the pivotal role of TET in providing a rigorous benchmark for evaluation of
toxicity awareness in several popular LLMs: it highlights the toxicity in the
LLMs that might remain hidden when using normal prompts, thus revealing subtler
issues in their behavior.
☆ SPOR: A Comprehensive and Practical Evaluation Method for Compositional Generalization in Data-to-Text Generation
Compositional generalization is an important ability of language models and
has many different manifestations. For data-to-text generation, previous
research on this ability is limited to a single manifestation called
Systematicity and lacks consideration of large language models (LLMs), which
cannot fully cover practical application scenarios. In this work, we propose
SPOR, a comprehensive and practical evaluation method for compositional
generalization in data-to-text generation. SPOR includes four aspects of
manifestations (Systematicity, Productivity, Order invariance, and Rule
learnability) and allows high-quality evaluation without additional manual
annotations based on existing datasets. We demonstrate SPOR on two different
datasets and evaluate some existing language models including LLMs. We find
that the models are deficient in various aspects of the evaluation and need
further improvement. Our work shows the necessity for comprehensive research on
different manifestations of compositional generalization in data-to-text
generation and provides a framework for evaluation.
☆ Layer-Condensed KV Cache for Efficient Inference of Large Language Models ACL2024
Huge memory consumption has been a major bottleneck for deploying
high-throughput large language models in real-world applications. In addition
to the large number of parameters, the key-value (KV) cache for the attention
mechanism in the transformer architecture consumes a significant amount of
memory, especially when the number of layers is large for deep language models.
In this paper, we propose a novel method that only computes and caches the KVs
of a small number of layers, thus significantly saving memory consumption and
improving inference throughput. Our experiments on large language models show
that our method achieves up to 26× higher throughput than standard
transformers and competitive performance in language modeling and downstream
tasks. In addition, our method is orthogonal to existing transformer
memory-saving techniques, so it is straightforward to integrate them with our
model, achieving further improvement in inference efficiency. Our code is
available at https://github.com/whyNLP/LCKV.
comment: Accepted to ACL2024 main conference
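To make the memory argument in the abstract above concrete, the following back-of-the-envelope sketch estimates KV cache size as a function of how many layers are cached. The model configuration and the choice of caching 2 layers are hypothetical; the abstract does not state these numbers, and the estimate ignores weights and activations.

```python
def kv_cache_bytes(cached_layers: int, batch_size: int, seq_len: int,
                   num_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value vector per token, head, and cached layer."""
    return 2 * cached_layers * batch_size * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical 32-layer model, 16-bit values, batch of 8 sequences of 4096 tokens.
full = kv_cache_bytes(cached_layers=32, batch_size=8, seq_len=4096,
                      num_kv_heads=32, head_dim=128)
condensed = kv_cache_bytes(cached_layers=2, batch_size=8, seq_len=4096,
                           num_kv_heads=32, head_dim=128)

print(f"all layers cached: {full / 2**30:.1f} GiB")
print(f"2 layers cached:   {condensed / 2**30:.1f} GiB")
```

The cache shrinks in proportion to the number of cached layers; the throughput gain reported in the abstract is a separate measured quantity and does not follow directly from this ratio.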
☆ Medical Dialogue: A Survey of Categories, Methods, Evaluation and Challenges
Xiaoming Shi, Zeming Liu, Li Du, Yuxuan Wang, Hongru Wang, Yuhang Guo, Tong Ruan, Jie Xu, Shaoting Zhang
This paper surveys and organizes research works on medical dialog systems,
which is an important yet challenging task. Although these systems have been
surveyed in the medical community from an application perspective, a systematic
review from a rigorous technical perspective has to date remained noticeably
absent. As a result, an overview of the categories, methods, and evaluation of
medical dialogue systems remains limited and underspecified, hindering
further progress in this area. To fill this gap, we investigate an initial
pool of 325 papers from well-known computer science and natural language
processing conferences and journals, and provide an overview. Recently, large
language models have shown strong capacity on downstream tasks, which has
also reshaped the foundation of medical dialog systems. Despite the alluring
practical application value, current medical dialogue systems still suffer from
problems. To this end, this paper lists the grand challenges of medical dialog
systems, especially of large language models.
☆ DeepPavlov at SemEval-2024 Task 8: Leveraging Transfer Learning for Detecting Boundaries of Machine-Generated Texts SemEval-2024
The Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated
Text Detection shared task in the SemEval-2024 competition aims to tackle the
problem of misusing collaborative human-AI writing. Although there are a lot of
existing detectors of AI content, they are often designed to give a binary
answer and thus may not be suitable for the more nuanced problem of finding the
boundaries between human-written and machine-generated texts, while hybrid
human-AI writing becomes more and more popular. In this paper, we address the
boundary detection problem. Particularly, we present a pipeline for augmenting
data for supervised fine-tuning of DeBERTaV3. We achieve a new best MAE score,
according to the leaderboard of the competition, with this pipeline.
comment: New best score from the leaderboard, to appear in SemEval-2024
Workshop proceedings
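The MAE score mentioned in the entry above can be sketched as follows, assuming each text has a single human-to-machine boundary expressed as a word index. That representation, the function name, and the example values are assumptions for illustration, not details from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Average absolute distance between predicted and gold boundary positions."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


# Made-up boundary positions (word indices) for four texts.
predicted_boundaries = [12, 40, 7, 100]
gold_boundaries = [10, 40, 9, 90]
print(mean_absolute_error(predicted_boundaries, gold_boundaries))
```

Lower values are better: an MAE of 0 would mean every boundary was located exactly.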
☆ Dynamic data sampler for cross-language transfer learning in large language models ICASSP 2024
Large Language Models (LLMs) have gained significant attention in the field
of natural language processing (NLP) due to their wide range of applications.
However, training LLMs for languages other than English poses significant
challenges, due to the difficulty in acquiring large-scale corpora and the
requisite computing resources. In this paper, we propose ChatFlow, a
cross-language transfer-based LLM, to address these challenges and train large
Chinese language models in a cost-effective manner. We employ a mix of Chinese,
English, and parallel corpora to continuously train the LLaMA2 model, aiming to
align cross-language representations and facilitate the knowledge transfer
specifically to the Chinese language model. In addition, we use a dynamic data
sampler to progressively transition the model from unsupervised pre-training to
supervised fine-tuning. Experimental results demonstrate that our approach
accelerates model convergence and achieves superior performance. We evaluate
ChatFlow on popular Chinese and English benchmarks; the results indicate that
it outperforms other Chinese models post-trained on LLaMA-2-7B.
comment: Accepted by ICASSP 2024
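One way to picture the dynamic data sampler described in the abstract above is a sampling probability that shifts from unsupervised to supervised data as training proceeds. The linear schedule, the function names, and the step counts below are assumptions made for illustration; the abstract does not specify the actual schedule.

```python
import random


def sft_probability(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: chance of drawing a supervised example at this step."""
    return min(1.0, max(0.0, step / total_steps))


def sample_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Pick which data pool the next training example is drawn from."""
    if rng.random() < sft_probability(step, total_steps):
        return "supervised"
    return "unsupervised"


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for step in (0, 250, 500, 750, 1000):
    draws = [sample_source(step, 1000, rng) for _ in range(1000)]
    share = draws.count("supervised") / len(draws)
    print(f"step {step}: supervised share = {share:.2f}")
```

With this schedule, early batches are drawn almost entirely from the unsupervised pool and late batches almost entirely from the supervised pool.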
☆ Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction
Transformer-based encoder-decoder models have demonstrated impressive results
in chemical reaction prediction tasks. However, these models typically rely on
pretraining using tens of millions of unlabelled molecules, which can be
time-consuming and GPU-intensive. One of the central questions we aim to answer
in this work is: Can FlanT5 and ByT5, the encoder-decoder models pretrained
solely on language data, be effectively specialised for organic reaction
prediction through task-specific fine-tuning? We conduct a systematic empirical
study on several key issues of the process, including tokenisation, the impact
of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding
algorithms at inference. Our key findings indicate that, despite being
pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation
to fine-tune for reaction prediction, and thus become `chemistry domain
compatible' in the process. This suggests that GPU-intensive and expensive
pretraining on a large dataset of unlabelled molecules may be useful yet not
essential to leverage the power of language models for chemistry. All our
models achieve comparable Top-1 and Top-5 accuracy although some variation
across different models does exist. Notably, tokenisation and vocabulary
trimming slightly affect final performance but can speed up training and
inference; the most efficient greedy decoding strategy is very competitive
while only marginal gains can be achieved from more sophisticated decoding
algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions
and benchmark their impact on organic reaction prediction, which may guide more
effective use of these state-of-the-art language models for chemistry-related
tasks in the future.
comment: Preprint
☆ Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization ACL
In recent years, large language models (LLMs) have driven advances in natural
language processing. Still, their growing scale has increased the computational
burden, necessitating a balance between efficiency and performance. Low-rank
compression, a promising technique, reduces non-essential parameters by
decomposing weight matrices into products of two low-rank matrices. Yet, its
application in LLMs has not been extensively studied. The key to low-rank
compression lies in low-rank factorization and low-rank dimension allocation.
To address the challenges of low-rank compression in LLMs, we conduct empirical
research on the low-rank characteristics of large models. We propose a low-rank
compression method suitable for LLMs. This approach involves precise estimation
of feature distributions through pooled covariance matrices and a Bayesian
optimization strategy for allocating low-rank dimensions. Experiments on the
LLaMA-2 models demonstrate that our method outperforms existing strong
structured pruning and low-rank compression techniques in maintaining model
performance at the same compression ratio.
comment: Accepted by 2024 ACL findings
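As a rough illustration of the basic idea only (not the feature-based estimation or the Bayesian allocation proposed in the paper), the sketch below shows how replacing an m x n weight matrix with the product of an m x r and an r x n factor changes the parameter count, and which rank fits a given parameter budget. All function names and the example dimensions are hypothetical.

```python
import math


def dense_params(m: int, n: int) -> int:
    """Parameters in a full m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters in the two factors A (m x r) and B (r x n)."""
    return r * (m + n)


def max_rank_for_ratio(m: int, n: int, keep_ratio: float) -> int:
    """Largest rank r with r * (m + n) <= keep_ratio * m * n."""
    return math.floor(keep_ratio * m * n / (m + n))


def matmul(a, b):
    """Plain-Python matrix product, used to rebuild W from its factors."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


if __name__ == "__main__":
    m = n = 4096  # hypothetical layer size
    r = max_rank_for_ratio(m, n, keep_ratio=0.5)
    print(r, dense_params(m, n), low_rank_params(m, n, r))

    # A rank-1 example: 3 + 4 = 7 stored numbers stand in for 12.
    A = [[1], [2], [3]]
    B = [[1, 0, 2, 1]]
    print(matmul(A, B))
```

The factorization only saves parameters when r is below m * n / (m + n), which is why the choice of rank per layer matters.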
☆ UniCL: A Universal Contrastive Learning Framework for Large Time Series Models
Time-series analysis plays a pivotal role across a range of critical
applications, from finance to healthcare, which involves various tasks, such as
forecasting and classification. To handle the inherent complexities of
time-series data, such as high dimensionality and noise, traditional supervised
learning methods first annotate extensive labels for time-series data in each
task, which is very costly and impractical in real-world applications. In
contrast, pre-trained foundation models offer a promising alternative by
leveraging unlabeled data to capture general time series patterns, which can
then be fine-tuned for specific tasks. However, existing approaches to
pre-training such models typically suffer from high-bias and low-generality
issues due to the use of predefined and rigid augmentation operations and
domain-specific data training. To overcome these limitations, this paper
introduces UniCL, a universal and scalable contrastive learning framework
designed for pretraining time-series foundation models across cross-domain
datasets. Specifically, we propose a unified and trainable time-series
augmentation operation to generate pattern-preserved, diverse, and low-bias
time-series data by leveraging spectral information. Besides, we introduce a
scalable augmentation algorithm capable of handling datasets with varying
lengths, facilitating cross-domain pretraining. Extensive experiments on two
benchmark datasets across eleven domains validate the effectiveness of UniCL,
demonstrating its high generalization on time-series analysis across various
fields.
☆ RDRec: Rationale Distillation for LLM-based Recommendation ACL 2024
Large language model (LLM)-based recommender models that bridge users and
items through textual prompts for effective semantic reasoning have gained
considerable attention. However, few methods consider the underlying rationales
behind interactions, such as user preferences and item attributes, limiting the
reasoning capability of LLMs for recommendations. This paper proposes a
rationale distillation recommender (RDRec), a compact model designed to learn
rationales generated by a larger language model (LM). By leveraging rationales
from reviews related to users and items, RDRec remarkably specifies their
profiles for recommendations. Experiments show that RDRec achieves
state-of-the-art (SOTA) performance in both top-N and sequential
recommendations. Our source code is released at
https://github.com/WangXFng/RDRec.
comment: 10 pages. Accepted to ACL 2024 Main as a short paper
☆ A Hybrid Deep Learning Framework for Stock Price Prediction Considering the Investor Sentiment of Online Forum Enhanced by Popularity
Stock price prediction has always been a difficult task for forecasters.
Using cutting-edge deep learning techniques, stock price prediction based on
investor sentiment extracted from online forums has become feasible. We propose
a novel hybrid deep learning framework for predicting stock prices. The
framework leverages the XLNET model to analyze the sentiment conveyed in user
posts on online forums, combines these sentiments with the post popularity
factor to compute daily group sentiments, and integrates this information with
stock technical indicators into an improved BiLSTM-highway model for stock
price prediction. Through a series of comparative experiments involving four
stocks on the Chinese stock market, it is demonstrated that the hybrid
framework effectively predicts stock prices. This study reveals the necessity
of analyzing investors' textual views for stock price prediction.
☆ A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models
In this work, we explore idiomatic language processing with Large Language
Models (LLMs). We introduce the Idiomatic language Test Suite IdioTS, a new
dataset of difficult examples specifically designed by language experts to
assess the capabilities of LLMs to process figurative language at sentence
level. We propose a comprehensive evaluation methodology based on an idiom
detection task, where LLMs are prompted to detect an idiomatic expression
in a given English sentence. We present a thorough automatic and manual
evaluation of the results and an extensive error analysis.
☆ Language Models can Exploit Cross-Task In-context Learning for Data-Scarce Novel Tasks ACL 2024
Large Language Models (LLMs) have transformed NLP with their remarkable
In-context Learning (ICL) capabilities. Automated assistants based on LLMs are
gaining popularity; however, adapting them to novel tasks is still challenging.
While colossal models excel in zero-shot performance, their computational
demands limit widespread use, and smaller language models struggle without
context. This paper investigates whether LLMs can generalize from labeled
examples of predefined tasks to novel tasks. Drawing inspiration from
biological neurons and the mechanistic interpretation of the Transformer
architecture, we explore the potential for information sharing across tasks. We
design a cross-task prompting setup with three LLMs and show that LLMs achieve
significant performance improvements despite no examples from the target task
in the context. Cross-task prompting leads to a remarkable performance boost of
107% for LLaMA-2 7B, 18.6% for LLaMA-2 13B, and 3.2% for GPT 3.5 on average
over zero-shot prompting, and performs comparably to standard in-context
learning. The effectiveness of generating pseudo-labels for in-task examples is
demonstrated, and our analyses reveal a strong correlation between the effect
of cross-task examples and model activation similarities in source and target
input tokens. This paper offers a first-of-its-kind exploration of LLMs'
ability to solve novel tasks based on contextual signals from different task
examples.
comment: Accepted at ACL 2024 Main
☆ Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset ACL 2024
In light of recent breakthroughs in large language models (LLMs) that have
revolutionized natural language processing (NLP), there is an urgent need for
new benchmarks to keep pace with the fast development of LLMs. In this paper,
we propose CFLUE, the Chinese Financial Language Understanding Evaluation
benchmark, designed to assess the capability of LLMs across various dimensions.
Specifically, CFLUE provides datasets tailored for both knowledge assessment
and application assessment. In knowledge assessment, it consists of 38K+
multiple-choice questions with associated solution explanations. These
questions serve dual purposes: answer prediction and question reasoning. In
application assessment, CFLUE features 16K+ test instances across distinct
groups of NLP tasks such as text classification, machine translation, relation
extraction, reading comprehension, and text generation. Upon CFLUE, we conduct
a thorough evaluation of representative LLMs. The results reveal that only
GPT-4 and GPT-4-turbo achieve an accuracy exceeding 60% in answer prediction
for knowledge assessment, suggesting that there is still substantial room for
improvement in current LLMs. In application assessment, although GPT-4 and
GPT-4-turbo are the top two performers, their considerable advantage over
lightweight LLMs is noticeably diminished. The datasets and scripts associated
with CFLUE are openly accessible at https://github.com/aliyun/cflue.
comment: Accepted by ACL 2024
☆ Smart Expert System: Large Language Models as Text Classifiers
Text classification is a fundamental task in Natural Language Processing
(NLP), and the advent of Large Language Models (LLMs) has revolutionized the
field. This paper introduces the Smart Expert System, a novel approach that
leverages LLMs as text classifiers. The system simplifies the traditional text
classification workflow, eliminating the need for extensive preprocessing and
domain expertise. The performance of several LLMs, machine learning (ML)
algorithms, and neural network (NN) based structures is evaluated on four
datasets. Results demonstrate that certain LLMs surpass traditional methods in
sentiment analysis, spam SMS detection and multi-label classification.
Furthermore, it is shown that the system's performance can be further enhanced
through few-shot or fine-tuning strategies, making the fine-tuned model the top
performer across all datasets. Source code and datasets are available in this
GitHub repository: https://github.com/yeyimilk/llm-zero-shot-classifiers.
comment: 11 pages, 3 figures, and 8 tables
☆ Towards Better Question Generation in QA-Based Event Extraction ACL2024
Event Extraction (EE) is an essential information extraction task that aims
to extract event-related information from unstructured texts. The paradigm of
this task has shifted from conventional classification-based methods to more
contemporary question-answering (QA)-based approaches. However, in QA-based EE,
the questions' quality dramatically affects the extraction accuracy, and how to
generate high-quality questions for QA-based EE still remains a challenge. In
this work, to tackle this challenge, we suggest four criteria to evaluate the
quality of a question and propose a reinforcement learning method for QA-Based
EE that can generate fluent, generalizable, and context-dependent questions and
provides clear guidance to QA models. The extensive experiments conducted on
ACE and RAMS datasets have strongly validated our approach's effectiveness,
which also demonstrates its robustness in scenarios with limited training data.
comment: Accepted to ACL2024
☆ Language Models can Evaluate Themselves via Probability Discrepancy ACL 2024
In this paper, we initiate our discussion by demonstrating how Large Language
Models (LLMs), when tasked with responding to queries, display a more even
probability distribution in their answers if they are more adept, as opposed to
their less skilled counterparts. Expanding on this foundational insight, we
propose a new self-evaluation method ProbDiff for assessing the efficacy of
various LLMs. This approach obviates the necessity for an additional evaluation
model or the dependence on external, proprietary models like GPT-4 for
judgment. It uniquely utilizes the LLMs being tested to compute the probability
discrepancy between the initial response and its revised versions. A higher
discrepancy for a given query between two LLMs indicates a relatively weaker
capability. Our findings reveal that ProbDiff achieves results on par with
those obtained from evaluations based on GPT-4, spanning a range of scenarios
that include natural language generation (NLG) tasks such as translation,
summarization, and our proposed Xiaohongshu blog writing task, and benchmarks
for LLM evaluation like AlignBench, MT-Bench, and AlpacaEval, across LLMs of
varying magnitudes.
comment: ACL 2024 Findings
☆ Automatic News Generation and Fact-Checking System Based on Language Processing
Xirui Peng, Qiming Xu, Zheng Feng, Haopeng Zhao, Lianghao Tan, Yan Zhou, Zecheng Zhang, Chenwei Gong, Yingqiao Zheng
This paper explores an automatic news generation and fact-checking system
based on language processing, aimed at enhancing the efficiency and quality of
news production while ensuring the authenticity and reliability of the news
content. With the rapid development of Natural Language Processing (NLP) and
deep learning technologies, automatic news generation systems are capable of
extracting key information from massive data and generating well-structured,
fluent news articles. Meanwhile, by integrating fact-checking technology, the
system can effectively prevent the spread of false news and improve the
accuracy and credibility of news. This study details the key technologies
involved in automatic news generation and fact-checking, including text
generation, information extraction, and the application of knowledge graphs,
and validates the effectiveness of these technologies through experiments.
Additionally, the paper discusses the future development directions of
automatic news generation and fact-checking systems, emphasizing the importance
of further integration and innovation of technologies. The results show that
with continuous technological optimization and practical application, these
systems will play an increasingly important role in the future news industry,
providing more efficient and reliable news services.
☆ CNER: A tool Classifier of Named-Entity Relationships
We introduce CNER, an ensemble of capable tools for extraction of semantic
relationships between named entities in the Spanish language. Built upon a
container-based architecture, CNER integrates different named entity
recognition and relation extraction tools with a user-friendly interface that
allows users to input free text or files effortlessly, facilitating streamlined
analysis. Developed as a prototype version for the Natural Language Processing
(NLP) Group at Universidad del Valle, CNER serves as a practical educational
resource, illustrating how machine learning techniques can effectively tackle
diverse NLP tasks in Spanish. Our preliminary results reveal the promising
potential of CNER in advancing the understanding and development of NLP tools,
particularly within Spanish-language contexts.
☆ Rethinking ChatGPT's Success: Usability and Cognitive Behaviors Enabled by Auto-regressive LLMs' Prompting
Over the last decade, a wide range of training and deployment strategies for
Large Language Models (LLMs) have emerged. Among these, the prompting paradigms
of Auto-regressive LLMs (AR-LLMs) have catalyzed a significant surge in
Artificial Intelligence (AI). This paper aims to emphasize the significance of
utilizing free-form modalities (forms of input and output) and verbal free-form
contexts as user-directed channels (methods for transforming modalities) for
downstream deployment. Specifically, we analyze the structure of modalities
within two types of LLMs and six task-specific channels during deployment.
From the perspective of users, our analysis introduces and applies the
analytical metrics of task customizability, transparency, and complexity to
gauge their usability, highlighting the superior nature of AR-LLMs' prompting
paradigms. Moreover, we examine the stimulation of diverse cognitive behaviors
in LLMs through the adoption of free-form text and verbal contexts, mirroring
human linguistic expressions of such behaviors. We then detail four common
cognitive behaviors to underscore how AR-LLMs' prompting successfully imitates
human-like behaviors using this free-form modality and channel. Lastly, the
potential for improving LLM deployment, both as autonomous agents and within
multi-agent systems, is identified via cognitive behavior concepts and
principles.
♻ ☆ ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
In our work, we explore the synergistic capabilities of pre-trained
vision-and-language models (VLMs) and large language models (LLMs) on visual
commonsense reasoning (VCR) problems. We find that VLM- and LLM-based decision
pipelines are good at different kinds of VCR problems. Pre-trained VLMs exhibit
strong performance for problems involving understanding the literal visual
content, which we term visual commonsense understanding (VCU). For problems
where the goal is to infer conclusions beyond image content, which we term
visual commonsense inference (VCI), VLMs face difficulties, while LLMs, given
sufficient visual evidence, can use commonsense to infer the answer well. We
empirically validate this by letting LLMs classify VCR problems into these two
categories and show the significant difference between VLM and LLM with image
caption decision pipelines on two subproblems. Moreover, we identify a
challenge with VLMs' passive perception, which may miss crucial context
information, leading to incorrect reasoning by LLMs. Based on these, we suggest
a collaborative approach, named ViCor, where pre-trained LLMs serve as problem
classifiers to analyze the problem category, then either use VLMs to answer the
question directly or actively instruct VLMs to concentrate on and gather
relevant visual elements to support potential commonsense inferences. We
evaluate our framework on two VCR benchmark datasets and outperform all other
methods that do not require in-domain fine-tuning.
♻ ☆ Identifying the Risks of LM Agents with an LM-Emulated Sandbox
Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
Recent advances in Language Model (LM) agents and tool use, exemplified by
applications like ChatGPT Plugins, enable a rich set of capabilities but also
amplify potential risks - such as leaking private data or causing financial
losses. Identifying these risks is labor-intensive, necessitating implementing
the tools, setting up the environment for each test scenario manually, and
finding risky cases. As tools and agents become more complex, the high cost of
testing these agents will make it increasingly difficult to find high-stakes,
long-tailed risks. To address these challenges, we introduce ToolEmu: a
framework that uses an LM to emulate tool execution and enables the testing of
LM agents against a diverse range of tools and scenarios, without manual
instantiation. Alongside the emulator, we develop an LM-based automatic safety
evaluator that examines agent failures and quantifies associated risks. We test
both the tool emulator and evaluator through human evaluation and find that
68.8% of failures identified with ToolEmu would be valid real-world agent
failures. Using our curated initial benchmark consisting of 36 high-stakes
tools and 144 test cases, we provide a quantitative risk analysis of current LM
agents and identify numerous failures with potentially severe outcomes.
Notably, even the safest LM agent exhibits such failures 23.9% of the time
according to our evaluator, underscoring the need to develop safer LM agents
for real-world deployment.
♻ ☆ Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models
While recent research endeavors have focused on developing Large Language
Models (LLMs) with robust long-context capabilities, due to the lack of
long-context benchmarks, relatively little is known about how well
long-context LLMs perform. To address this gap, we propose a
multi-evidence, position-aware, and scalable benchmark for evaluating
long-context LLMs, named Counting-Stars, which evaluates long-context LLMs by
using two tasks: multi-evidence acquisition and multi-evidence reasoning. Based
on the Counting-Stars test, we conduct experiments to evaluate long-context
LLMs (i.e., GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1).
Experimental results demonstrate that Gemini 1.5 Pro achieves the best overall
results, while the performance of GPT-4 Turbo is the most stable across various
tasks. Furthermore, our analysis of these LLMs, which are extended to handle
long-context scenarios, indicates that there is potential for improvement as
the length of the input context and the intricacy of the tasks are increasing.
comment: work in progress
♻ ☆ Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features ICML2024
Understanding the reasons behind the exceptional success of transformers
requires a better analysis of why attention layers are suitable for NLP tasks.
In particular, such tasks require predictive models to capture contextual
meaning which often depends on one or few words, even if the sentence is long.
Our work studies this key property, dubbed word sensitivity (WS), in the
prototypical setting of random features. We show that attention layers enjoy
high WS, namely, there exists a vector in the space of embeddings that largely
perturbs the random attention features map. The argument critically exploits
the role of the softmax in the attention layer, highlighting its benefit
compared to other activations (e.g., ReLU). In contrast, the WS of standard
random features is of order $1/\sqrt{n}$, $n$ being the number of words in the
textual sample, and thus it decays with the length of the context. We then
translate these results on the word sensitivity into generalization bounds: due
to their low WS, random features provably cannot learn to distinguish between
two sentences that differ only in a single word; in contrast, due to their high
WS, random attention features have higher generalization capabilities. We
validate our theoretical results with experimental evidence over the BERT-Base
word embeddings of the imdb review dataset.
comment: Revision after ICML2024 reviews
♻ ☆ FOLIO: Natural Language Reasoning with First-Order Logic
Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wojciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, Dragomir Radev
Large language models (LLMs) have achieved remarkable performance on a
variety of natural language understanding tasks. However, existing benchmarks
are inadequate in measuring the complex logical reasoning capabilities of a
model. We present FOLIO, a human-annotated, logically complex and diverse
dataset for reasoning in natural language (NL), equipped with first-order logic
(FOL) annotations. FOLIO consists of 1,430 examples (unique conclusions), each
paired with one of 487 sets of premises used to deductively reason for the
validity of each conclusion. The logical correctness of the premises and
conclusions is ensured by their FOL annotations, which are automatically
verified by an FOL inference engine. In addition to the main NL reasoning task,
NL-FOL pairs in FOLIO constitute a new NL-FOL translation dataset. Our
experiments on FOLIO systematically evaluate the FOL reasoning ability of
supervised fine-tuning on medium-sized language models. For both NL reasoning
and NL-FOL translation, we benchmark multiple state-of-the-art language models.
Our results show that a subset of FOLIO presents a challenge for one of the
most capable Large Language Models (LLMs) publicly available, GPT-4.
♻ ☆ Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? ACL 2024
The field of natural language processing (NLP) has recently witnessed a
transformative shift with the emergence of foundation models, particularly
Large Language Models (LLMs) that have revolutionized text-based NLP. This
paradigm has extended to other modalities, including speech, where researchers
are actively exploring the combination of Speech Foundation Models (SFMs) and
LLMs into single, unified models capable of addressing multimodal tasks. Among
such tasks, this paper focuses on speech-to-text translation (ST). By examining
the published papers on the topic, we propose a unified view of the
architectural solutions and training strategies presented so far, highlighting
similarities and differences among them. Based on this examination, we not only
organize the lessons learned but also show how diverse settings and evaluation
approaches hinder the identification of the best-performing solution for each
architectural building block and training choice. Lastly, we outline
recommendations for future works on the topic aimed at better understanding the
strengths and weaknesses of the SFM+LLM solutions for ST.
comment: Accepted to the ACL 2024 main conference
♻ ☆ ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale ACL
Multi-task learning (MTL) has shown considerable practical benefits,
particularly when using language models (LMs). While this is commonly achieved
by learning $n$ tasks under a joint optimization procedure, some methods, such
as AdapterFusion, divide the problem into two stages: (i) task learning, where
knowledge specific to a task is encapsulated within sets of parameters (e.g.,
adapters), and (ii) transfer, where this already learned knowledge is leveraged
for a target task. This separation of concerns provides numerous benefits
(e.g., promoting reusability). However, current two-stage MTL introduces a
substantial number of additional parameters. We address this issue by
leveraging the usefulness of linearly scaling the output representations of
source adapters for transfer learning. We introduce ScaLearn, a simple and
highly parameter-efficient two-stage MTL method that capitalizes on the
knowledge of the source tasks by learning a minimal set of scaling parameters
that enable effective transfer to a target task. Our experiments on three
benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn
consistently outperforms strong baselines with a small number of transfer
parameters (~ $0.35$% of those of AdapterFusion). Remarkably, we observe that
ScaLearn maintains its strong abilities even when further reducing parameters,
achieving competitive results with only $8$ transfer parameters per target
task. Our proposed approach thus demonstrates the power of simple scaling as a
promise for more efficient task transfer.
comment: Accepted to Findings of the ACL: ACL 2024
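To make the idea of linearly scaling source adapter outputs concrete, here is a minimal sketch that combines the output vectors of several source adapters with one scalar per source. It assumes scalar coefficients that have already been learned; how ScaLearn parameterizes and trains its scaling parameters is not specified in the abstract, and the function name `scale_and_combine` is hypothetical.

```python
def scale_and_combine(source_outputs, scales):
    """Weighted sum of source adapter outputs.

    source_outputs: list of equally sized vectors, one per source adapter.
    scales: one scalar coefficient per source adapter (assumed learned).
    """
    if len(source_outputs) != len(scales):
        raise ValueError("need exactly one scale per source adapter")
    dim = len(source_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(scales, source_outputs))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two hypothetical source adapters with 2-dimensional outputs.
    outputs = [[1.0, 2.0], [3.0, 4.0]]
    print(scale_and_combine(outputs, [0.5, 0.25]))
```

With one scalar per source adapter, the number of transfer parameters equals the number of source tasks, which gives a sense of how a budget as small as a handful of parameters per target task is possible.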
♻ ☆ Two-Stage Stance Labeling: User-Hashtag Heuristics with Graph Neural Networks
The high volume and rapid evolution of content on social media present major
challenges for studying the stance of social media users. In this work, we
develop a two-stage stance labeling method that utilizes the user-hashtag
bipartite graph and the user-user interaction graph. In the first stage, a
simple and efficient heuristic for stance labeling uses the user-hashtag
bipartite graph to iteratively update the stance association of user and
hashtag nodes via a label propagation mechanism. This set of soft labels is
then integrated with the user-user interaction graph to train a graph neural
network (GNN) model using semi-supervised learning. We evaluate this method on
two large-scale datasets containing tweets related to climate change from June
2021 to June 2022 and gun control from January 2022 to January 2023. Our
experiments demonstrate that enriching text-based embeddings of users with
network information from the user interaction graph using our semi-supervised
GNN method outperforms both classifiers trained on user textual embeddings and
zero-shot classification using LLMs such as GPT4. We discuss the need for
integrating nuanced understanding from social science with the scalability of
computational methods to better understand how polarization on social media
occurs for divisive issues such as climate change and gun control.
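The first-stage heuristic can be pictured with a small label propagation sketch on a user-hashtag bipartite graph. This is an assumption-laden illustration rather than the authors' update rule: stance scores live in [-1, 1], a few seed hashtags are fixed by hand, users take the mean score of the scored hashtags they used, and unseeded hashtags take the mean score of their users. The function name and toy data are hypothetical.

```python
def propagate(user_tags, seed_tags, n_iters=10):
    """Alternate user and hashtag updates on a bipartite graph.

    user_tags: dict mapping user -> set of hashtags the user posted.
    seed_tags: dict mapping hashtag -> fixed stance score in [-1, 1].
    Returns (user_score, tag_score) as soft labels.
    """
    tag_score = dict(seed_tags)
    user_score = {}
    for _ in range(n_iters):
        # Users: average over the hashtags that already carry a score.
        for user, tags in user_tags.items():
            known = [tag_score[t] for t in tags if t in tag_score]
            if known:
                user_score[user] = sum(known) / len(known)
        # Hashtags: average over scored users; seed hashtags stay fixed.
        collected = {}
        for user, tags in user_tags.items():
            if user in user_score:
                for t in tags:
                    collected.setdefault(t, []).append(user_score[user])
        for t, scores in collected.items():
            if t not in seed_tags:
                tag_score[t] = sum(scores) / len(scores)
    return user_score, tag_score


if __name__ == "__main__":
    users = {"u1": {"#a", "#x"}, "u2": {"#x"}, "u3": {"#b"}}
    seeds = {"#a": 1.0, "#b": -1.0}
    print(propagate(users, seeds))
```

In the toy graph, user u2 never uses a seed hashtag but still receives a label through the shared hashtag #x; soft labels of this kind are what the second stage would feed into the GNN.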
♻ ☆ Multi-modal Stance Detection: New Datasets and Model
Stance detection is a challenging task that aims to identify public opinion
from social media platforms with respect to specific targets. Previous work on
stance detection largely focused on pure texts. In this paper, we study
multi-modal stance detection for tweets consisting of texts and images, which
are prevalent in today's fast-growing social media platforms where people often
post multi-modal messages. To this end, we create five new multi-modal stance
detection datasets of different domains based on Twitter, in which each example
consists of a text and an image. In addition, we propose a simple yet effective
Targeted Multi-modal Prompt Tuning framework (TMPT), where target information
is leveraged to learn multi-modal stance features from textual and visual
modalities. Experimental results on our three benchmark datasets show that the
proposed TMPT achieves state-of-the-art performance in multi-modal stance
detection.
♻ ☆ Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks LREC
Attention mechanisms play a crucial role in the neural revolution of Natural
Language Processing (NLP). With the growth of attention-based models, several
pruning techniques have been developed to identify and exploit sparseness,
making these models more efficient. Most efforts focus on hard-coding attention
patterns or pruning attention weights based on training data. We propose
Attention Pruning (AP), a framework that observes attention patterns in a fixed
dataset and generates a global sparseness mask. AP saves 90% of attention
computation for language modeling and about 50% for machine translation and
GLUE tasks, maintaining result quality. Our method reveals important
distinctions between self- and cross-attention patterns, guiding future NLP
research. Our framework can reduce both latency and memory requirements for any
attention-based model, aiding in the development of improved models for
existing or new NLP applications. We have demonstrated this with encoder and
autoregressive transformer models using Triton GPU kernels and make our code
publicly available at https://github.com/irugina/AP.
comment: Presented at LREC-COLING 2024: 12 pages, 4 figures, 11 tables
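One simple way to derive a global sparseness mask from observed attention patterns is to average the attention maps collected over a fixed dataset and keep only the strongest positions. The sketch below does exactly that; the averaging-plus-threshold criterion, the `keep_fraction` parameter, and the function name are assumptions for illustration, since the abstract does not state how AP builds its mask.

```python
def global_mask(attention_maps, keep_fraction):
    """Build one boolean mask shared by all inputs.

    attention_maps: list of equally shaped 2-D lists of attention weights,
        one per observed example.
    keep_fraction: fraction of positions to keep (ties at the threshold
        may keep slightly more).
    """
    n = len(attention_maps)
    rows, cols = len(attention_maps[0]), len(attention_maps[0][0])
    # Average attention weight at each (query, key) position.
    mean = [
        [sum(a[i][j] for a in attention_maps) / n for j in range(cols)]
        for i in range(rows)
    ]
    # Threshold at the k-th largest average value.
    flat = sorted((v for row in mean for v in row), reverse=True)
    k = max(1, round(keep_fraction * rows * cols))
    threshold = flat[k - 1]
    return [[mean[i][j] >= threshold for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    maps = [
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.7, 0.3], [0.4, 0.6]],
    ]
    print(global_mask(maps, keep_fraction=0.5))
```

Because the mask is fixed after this step, positions marked False can be skipped entirely at inference time, which is where the compute savings would come from.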
♻ ☆ TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese
Large language models (LLMs) have significantly advanced natural language
processing, but their progress has yet to be equal across languages. While most
LLMs are trained in high-resource languages like English, multilingual models
generally underperform monolingual ones. Additionally, aspects of their
multilingual foundation sometimes restrict the byproducts they produce, like
computational demands and licensing regimes. In this study, we document the
development of open-foundation models tailored for use in low-resource
settings, their limitations, and their benefits. This is the TeenyTinyLlama
pair: two compact models for Brazilian Portuguese text generation. We release
them under the permissive Apache 2.0 license on GitHub and Hugging Face for
community use and further development. See
https://github.com/Nkluge-correa/TeenyTinyLlama
comment: 21 pages, 5 figures
♻ ☆ GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving ACL 2024
Recent advancements in large language models (LLMs) and multi-modal models
(MMs) have demonstrated their remarkable capabilities in problem-solving. Yet,
their proficiency in tackling geometry math problems, which necessitates an
integrated understanding of both textual and visual information, has not been
thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark,
a comprehensive collection that includes a main subset of 2,000 problems, a 750
problems subset focusing on backward reasoning, an augmented subset of 2,000
problems, and a hard subset of 300 problems. This benchmark facilitates a
deeper investigation into the performance of LLMs and MMs in solving geometry
math problems. Our evaluation of ten LLMs and MMs across these varied subsets
reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on
the main subset but only a 6.00% accuracy on the hard subset. This highlights
the critical need for testing models against datasets on which they have not
been pre-trained. Additionally, our findings indicate that GPT-series models
perform more effectively on problems they have rephrased, suggesting a
promising method for enhancing model capabilities.
comment: Accepted in ACL 2024 Findings
♻ ☆ Pose2Gest: A Few-Shot Model-Free Approach Applied In South Indian Classical Dance Gesture Recognition
The classical dances from India utilize a set of hand gestures known as
Mudras, serving as the foundational elements of its posture vocabulary.
Identifying these mudras represents a primary task in digitizing the dance
performances. With Kathakali, a dance-drama, as the focus, this work addresses
mudra recognition by framing it as a 24-class classification problem and
proposes a novel vector-similarity-based approach leveraging pose estimation
techniques. This method obviates the need for extensive training or
fine-tuning, thus mitigating the issue of limited data availability common in
similar AI applications. Achieving an accuracy rate of 92%, our approach
demonstrates comparable or superior performance to existing
model-training-based methodologies in this domain. Notably, it remains
effective even with small datasets comprising just 1 or 5 samples, albeit with
a slightly diminished performance. Furthermore, our system supports processing
images, videos, and real-time streams, accommodating both hand-cropped and
full-body images. As part of this research, we have curated and released a
publicly accessible Hasta Mudra dataset, which applies to multiple South Indian
art forms including Kathakali. The implementation of the proposed method is
also made available as a web application.
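A training-free, vector-similarity classifier of this kind can be sketched as
below. The sketch assumes that each image has already been reduced to a
feature vector by a pose estimator; the cosine-similarity choice, the function
names, and the labels `mudra_a` and `mudra_b` are hypothetical placeholders
rather than details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query, references):
    """Return the label of the most similar reference vector.

    `references` maps a label to a list of example vectors; in a few-shot
    setting each list might hold only 1 or 5 samples.
    """
    best_label, best_score = None, -2.0
    for label, vectors in references.items():
        for vector in vectors:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_label, best_score = label, score
    return best_label, best_score


# Hypothetical one-shot reference set with made-up pose features.
references = {
    "mudra_a": [[1.0, 0.0, 0.0]],
    "mudra_b": [[0.0, 1.0, 0.0]],
}
label, score = classify([0.9, 0.1, 0.0], references)
```

No parameters are fitted here: adding a new class only requires adding its
reference vectors, which is why such an approach can work with very few
samples.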
♻ ☆ Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Large Language Models (LLMs) have demonstrated exceptional proficiency in
language-related tasks. However, their deployment presents significant
challenges due to their substantial memory and storage requirements. To address
this challenge, weight-only quantization has emerged as a promising solution.
Previous research has indicated that fine-tuning through up and down rounding
can enhance performance. In this study, we introduce SignRound, a method that
utilizes signed gradient descent (SignSGD) to optimize rounding values and
weight clipping within just 200 steps, combining the strengths of both
Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).
SignRound achieves outstanding results compared to recent methods across 2 to 4
bits, while maintaining low tuning costs and without introducing any additional
inference overhead. For instance, SignRound led to absolute average accuracy
improvements ranging from 6.91% to 33.22% at 2 bits. Furthermore, it
demonstrates robust generalization to various recent models and achieves
near-lossless quantization in most scenarios at 4 bits. The source code is
publicly available at https://github.com/intel/auto-round.
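To make the idea of tuning rounding with signed gradient descent concrete,
here is a toy sketch for a single linear output. It is not the auto-round
implementation: the per-weight offsets in [-0.5, 0.5], the straight-through
gradient, the squared output error, the learning rate, and all function names
are assumptions made for illustration, and weight clipping is left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize(weights, offsets, scale, qmin, qmax):
    """Fake-quantize: round(w / scale + v), clip to the integer grid, rescale."""
    return [max(qmin, min(qmax, round(w / scale + v))) * scale
            for w, v in zip(weights, offsets)]


def output_loss(weights, offsets, inputs, targets, scale, qmin, qmax):
    """Squared error between quantized and full-precision layer outputs."""
    w_hat = quantize(weights, offsets, scale, qmin, qmax)
    return sum((dot(x, w_hat) - t) ** 2 for x, t in zip(inputs, targets))


def sign(value):
    return (value > 0) - (value < 0)


def tune_rounding(weights, inputs, scale, qmin=-8, qmax=7, steps=200, lr=0.0025):
    targets = [dot(x, weights) for x in inputs]  # full-precision outputs
    offsets = [0.0] * len(weights)               # zero offsets = round-to-nearest
    best = list(offsets)
    best_loss = output_loss(weights, best, inputs, targets, scale, qmin, qmax)
    for _ in range(steps):
        w_hat = quantize(weights, offsets, scale, qmin, qmax)
        errors = [dot(x, w_hat) - t for x, t in zip(inputs, targets)]
        # Straight-through estimator: treat round() as identity, so
        # d(loss)/d(v_j) = sum_i 2 * error_i * x_ij * scale.
        grads = [sum(2 * e * x[j] * scale for e, x in zip(errors, inputs))
                 for j in range(len(weights))]
        # Signed step: use only the gradient's sign, then clamp to [-0.5, 0.5].
        offsets = [max(-0.5, min(0.5, v - lr * sign(g)))
                   for v, g in zip(offsets, grads)]
        loss = output_loss(weights, offsets, inputs, targets, scale, qmin, qmax)
        if loss < best_loss:
            best, best_loss = list(offsets), loss
    return best, best_loss


# Made-up weights and calibration inputs for a 4-bit signed grid.
weights = [0.26, -0.74, 0.49, 0.12]
inputs = [[1.0, 0.5, -0.3, 0.8], [0.2, -1.0, 0.7, 0.4], [-0.6, 0.3, 0.9, -0.2]]
targets = [dot(x, weights) for x in inputs]
baseline = output_loss(weights, [0.0] * 4, inputs, targets, 0.1, -8, 7)
offsets, tuned = tune_rounding(weights, inputs, scale=0.1)
```

Since the best offsets seen so far are tracked starting from plain
round-to-nearest, the tuned loss in this sketch can never be worse than the
baseline. The offsets only influence the quantized weights, so nothing extra
is computed at inference time.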
♻ ☆ Biomedical Entity Linking as Multiple Choice Question Answering COLING 2024
Although biomedical entity linking (BioEL) has made significant progress with
pre-trained language models, challenges still exist for fine-grained and
long-tailed entities. To address these challenges, we present BioELQA, a novel
model that treats Biomedical Entity Linking as Multiple Choice Question
Answering. BioELQA first obtains candidate entities with a fast retriever,
jointly presents the mention and candidate entities to a generator, and then
outputs the predicted symbol associated with its chosen entity. This
formulation enables explicit comparison of different candidate entities, thus
capturing fine-grained interactions between mentions and entities, as well as
among entities themselves. To improve generalization for long-tailed entities,
we retrieve similar labeled training instances as clues and concatenate the
input with retrieved instances for the generator. Extensive experimental
results show that BioELQA outperforms state-of-the-art baselines on several
datasets.
comment: Accepted by COLING 2024
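The multiple-choice formulation can be illustrated with a small prompt
builder. This is only a sketch: the prompt template, the function names, and
the example mention and candidates are hypothetical, and the retriever and
generator themselves are not shown.

```python
import string


def build_prompt(mention, context, candidates):
    """Format a mention and its retrieved candidates as a multiple-choice question.

    Returns the prompt text and a mapping from option symbol to entity.
    """
    symbols = string.ascii_uppercase[:len(candidates)]
    lines = [
        f"Context: {context}",
        f"Mention: {mention}",
        "Which entity does the mention refer to?",
    ]
    for symbol, candidate in zip(symbols, candidates):
        lines.append(f"({symbol}) {candidate}")
    lines.append("Answer:")
    return "\n".join(lines), dict(zip(symbols, candidates))


def resolve(predicted_symbol, symbol_to_entity):
    """Map the generator's predicted symbol back to an entity (None if invalid)."""
    key = predicted_symbol.strip().strip("()").upper()
    return symbol_to_entity.get(key)


# Hypothetical example; a real system would obtain candidates from a retriever.
prompt, options = build_prompt(
    mention="heart attack",
    context="The patient was admitted after a heart attack.",
    candidates=["Myocardial infarction", "Cardiac arrest"],
)
entity = resolve("(a)", options)
```

Presenting all candidates in one prompt is what lets the generator compare
them against each other instead of scoring each one in isolation.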
♻ ☆ EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs ACL 2024
We present EasyGen, an efficient model designed to enhance multimodal
understanding and generation by harnessing the capabilities of diffusion models
and large language models (LLMs). Unlike existing multimodal models that
predominantly depend on encoders like CLIP or ImageBind and need ample amounts
of training data to bridge modalities, EasyGen leverages BiDiffuser, a
bidirectional conditional diffusion model, to foster more efficient modality
interactions. EasyGen achieves text generation by training a projection layer
linking BiDiffuser and an LLM, and facilitates image generation by training an
adapter to align the LLM's text space with the BiDiffuser's image space.
Comprehensive quantitative and qualitative experiments show that EasyGen excels
in data-efficient training, high-quality image generation, and extendibility,
effectively addressing the challenges in multimodal generation. The source code
is available at https://github.com/zxy556677/EasyGen.
comment: Accepted by ACL 2024, main conference
♻ ☆ OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
In recent years, Large Language Models (LLMs) have achieved almost human-like
performance on various tasks. While some LLMs have been trained on multilingual
data, most of the training data is in English. Hence, their performance in
English greatly exceeds their performance in other languages. This document
presents our approach to training and evaluating the first foundational and
chat LLM specialized for Romanian.
♻ ☆ ANALOGYKB: Unlocking Analogical Reasoning of Language Models with A Million-scale Knowledge Base ACL 2024
Analogical reasoning is a fundamental cognitive ability of humans. However,
current language models (LMs) still struggle to achieve human-like performance
in analogical reasoning tasks due to a lack of resources for model training. In
this work, we address this gap by proposing ANALOGYKB, a million-scale analogy
knowledge base (KB) derived from existing knowledge graphs (KGs). ANALOGYKB
identifies two types of analogies from the KGs: 1) analogies of the same
relations, which can be directly extracted from the KGs, and 2) analogies of
analogous relations, which are identified with a selection and filtering
pipeline enabled by large language models (LLMs), followed by minor human
efforts for data quality control. Evaluations on a series of datasets of two
analogical reasoning tasks (analogy recognition and generation) demonstrate
that ANALOGYKB successfully enables both smaller LMs and LLMs to gain better
analogical reasoning capabilities.
comment: Accepted to ACL 2024
♻ ☆ Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning NAACL 2024
Recently, large language models (LLMs) have emerged as a groundbreaking
technology and their unparalleled text generation capabilities have sparked
interest in their application to the fundamental sentence representation
learning task. Existing methods have explored utilizing LLMs as data annotators
to generate synthesized data for training contrastive learning based sentence
embedding models such as SimCSE. However, since contrastive learning models are
sensitive to the quality of sentence pairs, the effectiveness of these methods
is largely influenced by the content generated from LLMs, highlighting the need
for more refined generation in the context of sentence representation learning.
Building upon this premise, we propose MultiCSR, a multi-level contrastive
sentence representation learning framework that decomposes the process of
prompting LLMs to generate a corpus for training base sentence embedding models
into three stages (i.e., sentence generation, sentence pair construction,
in-batch training) and refines the generated content at these three distinct
stages, ensuring only high-quality sentence pairs are utilized to train a base
contrastive learning model. Our extensive experiments reveal that MultiCSR
enables a less advanced LLM to surpass the performance of ChatGPT, while
applying it to ChatGPT achieves better state-of-the-art results. Comprehensive
analyses further underscore the potential of our framework in various
application scenarios and achieving better sentence representation learning
with LLMs.
comment: NAACL 2024
♻ ☆ RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
Retrieval-augmented generation (RAG) has become a main technique for
alleviating hallucinations in large language models (LLMs). Despite the
integration of RAG, LLMs may still present claims that are unsupported by, or
contradictory to, the retrieved contents. In order to develop effective hallucination
prevention strategies under RAG, it is important to create benchmark datasets
that can measure the extent of hallucination. This paper presents RAGTruth, a
corpus tailored for analyzing word-level hallucinations in various domains and
tasks within the standard RAG frameworks for LLM applications. RAGTruth
comprises nearly 18,000 naturally generated responses from diverse LLMs using
RAG. These responses have undergone meticulous manual annotation at both the
individual case and word levels, incorporating evaluations of hallucination
intensity. We not only benchmark hallucination frequencies across different
LLMs, but also critically assess the effectiveness of several existing
hallucination detection methodologies. Furthermore, we show that using a
high-quality dataset such as RAGTruth, it is possible to finetune a relatively
small LLM and achieve a competitive level of performance in hallucination
detection when compared to the existing prompt-based approaches using
state-of-the-art large language models such as GPT-4.
♻ ☆ Bypassing the Safety Training of Open-Source LLMs with Priming Attacks ICLR
With the recent surge in popularity of LLMs has come an ever-increasing need
for LLM safety training. In this paper, we investigate the fragility of SOTA
open-source LLMs under simple, optimization-free attacks we refer to as
"priming attacks", which are easy to execute and effectively bypass
alignment from safety training. Our proposed attack improves the Attack Success
Rate on Harmful Behaviors, as measured by Llama Guard, by up to 3.3x
compared to baselines. Source code and data are available at
https://github.com/uiuc-focal-lab/llm-priming-attacks.
comment: ICLR Tiny Paper camera ready version
♻ ☆ IDGenRec: LLM-RecSys Alignment with Textual ID Learning SIGIR 2024
Generative recommendation based on Large Language Models (LLMs) has
transformed the traditional ranking-based recommendation style into a
text-to-text generation paradigm. However, in contrast to standard NLP tasks
that inherently operate on human vocabulary, current research in generative
recommendations struggles to effectively encode recommendation items within the
text-to-text framework using concise yet meaningful ID representations. To
better align LLMs with recommendation needs, we propose IDGen, representing
each item as a unique, concise, semantically rich, platform-agnostic textual ID
using human language tokens. This is achieved by training a textual ID
generator alongside the LLM-based recommender, enabling seamless integration of
personalized recommendations into natural language generation. Notably, as user
history is expressed in natural language and decoupled from the original
dataset, our approach suggests the potential for a foundational generative
recommendation model. Experiments show that our framework consistently
surpasses existing models in sequential recommendation under the standard
experimental setting. Then, we explore the possibility of training a foundation
recommendation model with the proposed method on data collected from 19
different datasets and test its recommendation performance on 6 unseen
datasets across different platforms under a completely zero-shot setting. The
results show that the zero-shot performance of the pre-trained foundation model
is comparable to or even better than some traditional recommendation models
based on supervised training, showing the potential of the IDGen paradigm
serving as the foundation model for generative recommendation. Code and data
are open-sourced at https://github.com/agiresearch/IDGenRec.
comment: Accepted in SIGIR 2024
♻ ★ Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
Large vision-language models (VLMs) fine-tuned on specialized visual
instruction-following data have exhibited impressive language reasoning
capabilities across various scenarios. However, this fine-tuning paradigm may
not be able to efficiently learn optimal decision-making agents in multi-step
goal-directed tasks from interactive environments. To address this challenge,
we propose an algorithmic framework that fine-tunes VLMs with reinforcement
learning (RL). Specifically, our framework provides a task description and then
prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM
to efficiently explore intermediate reasoning steps that lead to the final
text-based action. Next, the open-ended text output is parsed into an
executable action to interact with the environment to obtain goal-directed task
rewards. Finally, our framework uses these task rewards to fine-tune the entire
VLM with RL. Empirically, we demonstrate that our proposed framework enhances
the decision-making capabilities of VLM agents across various tasks, enabling
7B models to outperform commercial models such as GPT-4V or Gemini.
Furthermore, we find that CoT reasoning is a crucial component for performance
improvement, as removing the CoT reasoning results in a significant decrease in
the overall performance of our method.
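The step that turns open-ended text into an executable action might look like
the sketch below. The JSON-like output format with an "action" field, the
action names, and the fallback behaviour are all assumptions made for
illustration; the abstract does not specify them.

```python
import re


def parse_action(text, legal_actions, default=None):
    """Extract an action from model output and validate it.

    Assumes (hypothetically) that the model ends its chain-of-thought with a
    field of the form "action": "<name>". Returns `default` when no legal
    action can be found.
    """
    match = re.search(r'"action"\s*:\s*"([^"]+)"', text)
    if match is None:
        return default
    action = match.group(1).strip().lower()
    return action if action in legal_actions else default


# Made-up model output for a card-game style task.
output = '{"thoughts": "I hold 12 and the dealer shows 6.", "action": "Stand"}'
action = parse_action(output, legal_actions=("stand", "hit"))
```

Whatever the exact format, some deterministic parser of this kind is needed
so that the environment receives a valid action and can return a reward for
the RL update.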
♻ ☆ Temporal Knowledge Question Answering via Abstract Reasoning Induction ACL 2024
In this study, we address the challenge of enhancing temporal knowledge
reasoning in Large Language Models (LLMs). LLMs often struggle with this task,
leading to the generation of inaccurate or misleading responses. This issue
mainly arises from their limited ability to handle evolving factual knowledge
and complex temporal logic. To overcome these limitations, we propose the Abstract
Reasoning Induction (ARI) framework, which divides temporal reasoning into two
distinct phases: Knowledge-agnostic and Knowledge-based. This framework offers
factual knowledge support to LLMs while minimizing the incorporation of
extraneous noisy data. Concurrently, informed by the principles of
constructivism, ARI provides LLMs the capability to engage in proactive,
self-directed learning from both correct and incorrect historical reasoning
samples. By teaching LLMs to actively construct knowledge and methods, ARI can
significantly boost their temporal reasoning abilities. Our approach
achieves remarkable improvements, with relative gains of 29.7% and 9.27% on two
temporal QA datasets, underscoring its efficacy in advancing temporal reasoning
in LLMs. The code can be found at https://github.com/czy1999/ARI-QA
comment: Accepted by ACL 2024. 17 pages, 10 figures
♻ ☆ How Can Large Language Models Understand Spatial-Temporal Data?
While Large Language Models (LLMs) dominate tasks like natural language
processing and computer vision, harnessing their power for spatial-temporal
forecasting remains challenging. The disparity between sequential text and
complex spatial-temporal data hinders this application. To address this issue,
this paper introduces STG-LLM, an innovative approach empowering LLMs for
spatial-temporal forecasting. We tackle the data mismatch by proposing: 1)
STG-Tokenizer: This spatial-temporal graph tokenizer transforms intricate graph
data into concise tokens capturing both spatial and temporal relationships; 2)
STG-Adapter: This minimalistic adapter, consisting of linear encoding and
decoding layers, bridges the gap between tokenized data and LLM comprehension.
By fine-tuning only a small set of parameters, it can effectively grasp the
semantics of tokens generated by STG-Tokenizer, while preserving the original
natural language understanding capabilities of LLMs. Extensive experiments on
diverse spatial-temporal benchmark datasets show that STG-LLM successfully
unlocks LLM potential for spatial-temporal forecasting. Remarkably, our
approach achieves competitive performance on par with dedicated SOTA methods.
♻ ☆ ConspEmoLLM: Conspiracy Theory Detection Using an Emotion-Based Large Language Model
The internet has brought both benefits and harms to society. A prime example
of the latter is misinformation, including conspiracy theories, which flood the
web. Recent advances in natural language processing, particularly the emergence
of large language models (LLMs), have improved the prospects of accurate
misinformation detection. However, most LLM-based approaches to conspiracy
theory detection focus only on binary classification and fail to account for
the important relationship between misinformation and affective features (i.e.,
sentiment and emotions). Driven by a comprehensive analysis of conspiracy text
that reveals its distinctive affective features, we propose ConspEmoLLM, the
first open-source LLM that integrates affective information and is able to
perform diverse tasks relating to conspiracy theories. These tasks include not
only conspiracy theory detection, but also classification of theory type and
detection of related discussion (e.g., opinions towards theories). ConspEmoLLM
is fine-tuned based on an emotion-oriented LLM using our novel ConDID dataset,
which includes five tasks to support LLM instruction tuning and evaluation. We
demonstrate that when applied to these tasks, ConspEmoLLM largely outperforms
several open-source general domain LLMs and ChatGPT, as well as an LLM that has
been fine-tuned using ConDID, but which does not use affective features. This
project will be released on https://github.com/lzw108/ConspEmoLLM/.
comment: Work in progress
♻ ☆ SCI 3.0: A Web-based Schema Curation Interface for Graphical Event Representations
To understand the complexity of global events, one must navigate a web of
interwoven sub-events, identifying the most impactful elements within the
larger, abstract macro-event framework at play. This concept can be extended to
the field of natural language processing (NLP) through the creation of
structured event schemas which can serve as representations of these abstract
events. Central to our approach is the Schema Curation Interface 3.0 (SCI 3.0),
a web application that facilitates real-time editing of event schema properties
within a generated graph, e.g., adding, removing, or editing sub-events,
entities, and relations directly through an interface.
♻ ☆ An Analysis of Sentential Neighbors in Implicit Discourse Relation Prediction
Discourse relation classification is an especially difficult task without
explicit context markers (Prasad et al., 2008). Current approaches to implicit
relation prediction solely rely on two neighboring sentences being targeted,
ignoring the broader context of their surrounding environments (Atwell et al.,
2021). In this research, we propose three new methods for incorporating
context into the task of sentence relation prediction: (1) Direct Neighbors
(DNs), (2) Expanded Window Neighbors (EWNs), and (3) Part-Smart Random
Neighbors (PSRNs). Our findings indicate that the inclusion of context beyond
one discourse unit is harmful in the task of discourse relation classification.
♻ ☆ SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge CVPR
Learning commonsense reasoning from visual contexts and scenes in the real world
is a crucial step toward advanced artificial intelligence. However, existing
video reasoning benchmarks are still inadequate since they were mainly designed
for factual or situated reasoning and rarely involve broader knowledge in the
real world. Our work aims to delve deeper into reasoning evaluations,
specifically within dynamic, open-world, and structured context knowledge. We
propose a new benchmark (SOK-Bench), consisting of 44K questions and 10K
situations with instance-level annotations depicted in the videos. The
reasoning process is required to understand and apply situated knowledge and
general knowledge for problem-solving. To create such a dataset, we propose an
automatic and scalable generation method to generate question-answer pairs,
knowledge graphs, and rationales by instructing the combinations of LLMs and
MLLMs. Concretely, we first extract observable situated entities, relations,
and processes from videos for situated knowledge and then extend to open-world
knowledge beyond the visible content. The task generation is facilitated
through multiple dialogues as iterations and subsequently corrected and refined
by our designed self-promptings and demonstrations. With a corpus of both
explicit situated facts and implicit commonsense, we generate associated
question-answer pairs and reasoning processes, finally followed by manual
reviews for quality assurance. We evaluated recent mainstream large
vision-language models on the benchmark and found several insightful
conclusions. For more information, please refer to our benchmark at
www.bobbywu.com/SOKBench.
comment: CVPR
♻ ☆ Red Teaming Language Models for Contradictory Dialogues
Most language models currently available are prone to self-contradiction
during dialogues. To mitigate this issue, this study explores a novel
contradictory dialogue processing task that aims to detect and modify
contradictory statements in a conversation. This task is inspired by research
on context faithfulness and dialogue comprehension, which have demonstrated
that the detection and understanding of contradictions often necessitate
detailed explanations. We develop a dataset comprising contradictory dialogues,
in which one side of the conversation contradicts itself. Each dialogue is
accompanied by an explanatory label that highlights the location and details of
the contradiction. With this dataset, we present a Red Teaming framework for
contradictory dialogue processing. The framework detects and attempts to
explain the dialogue, then modifies the existing contradictory content using
the explanation. Our experiments demonstrate that the framework improves the
ability to detect contradictory dialogues and provides valid explanations.
Additionally, it showcases distinct capabilities for modifying such dialogues.
Our study highlights the importance of the logical inconsistency problem in
conversational AI.
comment: 18 pages, 5 figures
♻ ☆ PipeNet: Question Answering with Semantic Pruning over Knowledge Graphs
It is well acknowledged that incorporating explicit knowledge graphs (KGs)
can benefit question answering. Existing approaches typically follow a
grounding-reasoning pipeline in which entity nodes are first grounded for the
query (question and candidate answers), and then a reasoning module reasons
over the matched multi-hop subgraph for answer prediction. Although the
pipeline largely alleviates the issue of extracting essential information from
giant KGs, efficiency is still an open challenge when scaling up hops in
grounding the subgraphs. In this paper, we aim to find semantically
related entity nodes in the subgraph to improve the efficiency of graph
reasoning with KG. We propose a grounding-pruning-reasoning pipeline to prune
noisy nodes, remarkably reducing the computation cost and memory usage while
also obtaining a decent subgraph representation. In detail, the pruning module
first scores concept nodes based on the dependency distance between matched
spans and then prunes the nodes according to score ranks. To facilitate the
evaluation of pruned subgraphs, we also propose a graph attention network (GAT)
based module to reason with the subgraph data. Experimental results on
CommonsenseQA and OpenBookQA demonstrate the effectiveness of our method.
comment: 8 pages, 4 figures, accepted to *SEM 2024
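The score-then-prune step can be sketched as follows. This is a simplified
illustration under stated assumptions: each concept node is assumed to come
with a precomputed dependency distance, smaller distances are assumed to rank
higher, and `keep_ratio` and the function name are hypothetical.

```python
def prune_by_rank(node_distances, keep_ratio):
    """Keep the best-ranked fraction of concept nodes.

    `node_distances` maps a node to its dependency distance from the matched
    spans; ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(node_distances, key=lambda node: (node_distances[node], node))
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


# Made-up distances for four grounded concept nodes.
distances = {"bird": 1, "fly": 1, "sky": 3, "bank": 5}
kept = prune_by_rank(distances, keep_ratio=0.5)
```

The reasoning module then only sees the kept nodes, so its cost scales with
the pruned subgraph rather than with the full multi-hop neighbourhood.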