TY - JOUR
T1 - Comparative Study of Multiclass Text Classification in Research Proposals Using Pretrained Language Models
AU - Lee, Eunchan
AU - Lee, Changhyeon
AU - Ahn, Sangtae
N1 - Publisher Copyright:
© 2022 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2022/5/1
Y1 - 2022/5/1
N2 - Recently, transformer-based pretrained language models have demonstrated stellar performance in natural language understanding (NLU) tasks. For example, bidirectional encoder representations from transformers (BERT) have achieved outstanding performance through masked self-supervised pretraining and transformer-based modeling. However, the original BERT may only be effective for English-based NLU tasks, whereas its effectiveness for other languages, such as Korean, is limited. Thus, the applicability of BERT-based language models pretrained in languages other than English to NLU tasks in those languages must be investigated. In this study, we comparatively evaluated seven BERT-based pretrained language models and their expected applicability to Korean NLU tasks. We used the climate technology dataset, a large Korean text classification dataset of research proposals involving 45 classes. We found that the BERT-based model pretrained on the most recent Korean corpus performed the best in Korean-based multiclass text classification. This suggests the necessity of optimal pretraining for specific NLU tasks, particularly those in languages other than English.
AB - Recently, transformer-based pretrained language models have demonstrated stellar performance in natural language understanding (NLU) tasks. For example, bidirectional encoder representations from transformers (BERT) have achieved outstanding performance through masked self-supervised pretraining and transformer-based modeling. However, the original BERT may only be effective for English-based NLU tasks, whereas its effectiveness for other languages, such as Korean, is limited. Thus, the applicability of BERT-based language models pretrained in languages other than English to NLU tasks in those languages must be investigated. In this study, we comparatively evaluated seven BERT-based pretrained language models and their expected applicability to Korean NLU tasks. We used the climate technology dataset, a large Korean text classification dataset of research proposals involving 45 classes. We found that the BERT-based model pretrained on the most recent Korean corpus performed the best in Korean-based multiclass text classification. This suggests the necessity of optimal pretraining for specific NLU tasks, particularly those in languages other than English.
KW - bidirectional encoder representations from transformers
KW - cross-lingual representation learning
KW - multiclass text classification
KW - multilingual representation learning
KW - natural language understanding
KW - transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85129772255&partnerID=8YFLogxK
U2 - 10.3390/app12094522
DO - 10.3390/app12094522
M3 - Article
AN - SCOPUS:85129772255
SN - 2076-3417
VL - 12
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 9
M1 - 4522
ER -