Comparative Study of Multiclass Text Classification in Research Proposals Using Pretrained Language Models

Eunchan Lee, Changhyeon Lee, Sangtae Ahn

Research output: Contribution to journal › Article › peer-review


Abstract

Recently, transformer-based pretrained language models have demonstrated stellar performance in natural language understanding (NLU) tasks. For example, bidirectional encoder representations from transformers (BERT) has achieved outstanding performance through masked self-supervised pretraining and transformer-based modeling. However, the original BERT may be effective only for English NLU tasks, and its effectiveness for other languages, such as Korean, is limited. Thus, the applicability of BERT-based language models pretrained in languages other than English to NLU tasks in those languages must be investigated. In this study, we comparatively evaluated seven BERT-based pretrained language models and their expected applicability to Korean NLU tasks. We used the climate technology dataset, a large Korean text classification dataset of research proposals spanning 45 classes. We found that the BERT-based model pretrained on the most recent Korean corpus performed best in Korean multiclass text classification. This finding suggests the necessity of optimal pretraining for specific NLU tasks, particularly those in languages other than English.
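
The comparison described in the abstract rests on a standard fine-tuning setup: each pretrained encoder is topped with a classification head and trained on the 45-class research-proposal labels. The sketch below illustrates that setup with the Hugging Face transformers library; the checkpoint name (klue/bert-base), hyperparameters, and toy batch are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch (not the authors' code) of fine-tuning a BERT-style model
# for multiclass text classification, as in the study described above.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "klue/bert-base"   # assumed Korean BERT checkpoint; swap per comparison
NUM_CLASSES = 45                # number of climate-technology classes in the dataset

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES
)

# Toy batch standing in for Korean research-proposal texts and their class ids.
texts = ["예시 연구 제안서 본문입니다.", "또 다른 연구 제안서 예시입니다."]
labels = torch.tensor([3, 17])

inputs = tokenizer(
    texts, padding=True, truncation=True, max_length=256, return_tensors="pt"
)

# One optimization step: cross-entropy loss over the 45 classes.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()

# Inference: pick the highest-scoring class for each proposal.
model.eval()
with torch.no_grad():
    preds = model(**inputs).logits.argmax(dim=-1)
print(preds.tolist())
```

In the paper's setting, the same fine-tuning procedure would be repeated for each of the seven pretrained checkpoints, so that differences in downstream accuracy reflect the pretraining corpus rather than the classification head.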

Original language: English
Article number: 4522
Journal: Applied Sciences (Switzerland)
Volume: 12
Issue number: 9
DOIs
State: Published - 1 May 2022

Keywords

  • bidirectional encoder representations from transformers
  • cross-lingual representation learning
  • multiclass text classification
  • multilingual representation learning
  • natural language understanding
  • transfer learning
