TY - JOUR
T1 - Investigating the effect of different fine-tuning configuration scenarios on agricultural term extraction using BERT
AU - Panoutsopoulos, Hercules
AU - Espejo-Garcia, Borja
AU - Raaijmakers, Stephan
AU - Wang, Xu
AU - Fountas, Spyros
AU - Brewster, Christopher
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2024/10
Y1 - 2024/10
AB - This paper compares different transformer-based language models for automatic term extraction from agriculture-related texts. Agriculture is an important economic sector faced with severe environmental and societal challenges. The collection, annotation and sharing of agricultural scientific knowledge is key to enabling the agricultural sector to address its challenges. Automatic term extraction is a Natural Language Processing task that can support text tagging and annotation for better knowledge and information exchange. It is concerned with the identification, in text, of terms pertaining to a domain, or area of expertise, and is an important step in knowledge base creation and update pipelines. Transformer-based language modeling technologies like BERT have become popular for automatic term extraction, but limited work has been undertaken so far in applying these methods to agriculture. This paper systematically compares Agriculture-BERT to Sci-BERT, RoBERTa, and vanilla BERT, all fine-tuned for the automatic extraction of agricultural terms from English texts. The greatest challenge faced in our research was the scarcity of agriculture-related gold standard corpora for measuring automatic term extraction performance. Our results show that, with a few exceptions, Agriculture-BERT performs better than the other models considered in our research. The main contribution and novelty of the presented research is the investigation of the impact that different language model fine-tuning configuration scenarios have on the term extraction task. More specifically, we tested different scenarios in which model layers are kept frozen, or updated, during training, to measure the impact they may have on Agriculture-BERT's performance in automatic term extraction.
Our results show that the best performance was achieved by: (i) the “embedding layer updated + all encoder layers updated” scenario for the identification of terms also seen during training; (ii) the “embedding layer frozen + all encoder layers updated” scenario for the identification of terms that are synonyms of those seen during training; and (iii) the “embedding layer updated + top 4 encoder layers updated” scenario for identifying terms neither seen during training nor synonymous with those seen during training (novel terms).
KW - Agriculture
KW - Agriculture-BERT
KW - Automatic term extraction
KW - Fine-tuning configuration scenarios
KW - Silver standard corpus
U2 - 10.1016/j.compag.2024.109268
DO - 10.1016/j.compag.2024.109268
M3 - Article
SN - 0168-1699
VL - 225
JO - Computers and Electronics in Agriculture
JF - Computers and Electronics in Agriculture
M1 - 109268
ER -