TY - JOUR
T1 - Linearization Weight Compression and In-Situ Hardware-Based Decompression for Attention-Based Neural Machine Translation
AU - Go, Mijin
AU - Kong, Joonho
AU - Munir, Arslan
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
AB - Since recent machine translation models are mostly based on attention-based neural machine translation (NMT), many well-known models, such as the Transformer and bidirectional encoder representations from Transformers (BERT), have been proposed. Along with algorithmic advancements, hardware acceleration methods for these attention-based NMT models have also been introduced. However, the parameter size of attention-based NMT models keeps growing to guarantee satisfactory translation quality. Among the various weights, the linearization weights ($W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$) account for a non-negligible portion (up to 30%) of the entire parameters in modern NMT models. In this paper, we propose a linearization weight compression method and a near-memory hardware decoder for fast, in-situ weight decompression. Our compression method exploits fixed-point quantization along with Huffman coding, which is selectively applied depending on the weight value distribution. Our hardware decoder decompresses the Huffman-coded weights near memory to minimize the weight decoding latency. Our compression method achieves a 4.9-10.0 compression ratio with small NMT score drops across five widely used attention-based NMT models (Transformer, Transformer-XL-base, Transformer-XL-large, BERT-base, and BERT-large). In addition, owing to the reduced linearization weight size, our proposed method with near-memory decoding reduces multi-head attention (MHA) execution latency by 11.8%, on average, compared to the baseline when weight loading and initialization are considered. In terms of memory data transfer energy consumption, our proposed method yields a memory energy saving of 16.1%, on average, compared to the baseline.
KW - hardware-based near-memory decoding
KW - Huffman coding
KW - multi-head attention
KW - Neural machine translation
KW - quantization
UR - http://www.scopus.com/inward/record.url?scp=85159650050&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3271640
DO - 10.1109/ACCESS.2023.3271640
M3 - Article
AN - SCOPUS:85159650050
SN - 2169-3536
VL - 11
SP - 42751
EP - 42763
JO - IEEE Access
JF - IEEE Access
ER -
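
The abstract above describes a two-step compression scheme (fixed-point quantization followed by selectively applied Huffman coding) plus a near-memory hardware decoder. The following is a minimal software sketch of the quantize-then-Huffman-code idea only; it is not the authors' implementation. The bit width (6 fractional bits), the toy Gaussian weight values, and the FP32 baseline used for the ratio are illustrative assumptions rather than values from the paper, and the near-memory hardware decoder is not modeled here.

# Minimal sketch (not the authors' implementation): fixed-point quantization of a
# linearization weight matrix followed by Huffman coding, illustrating the
# compression idea described in the abstract. Bit widths, the toy weight values,
# and the FP32 baseline are illustrative assumptions, not values from the paper.
import heapq
import random
from collections import Counter

def quantize_fixed_point(weights, frac_bits=6):
    # Map each float weight to a signed fixed-point integer with frac_bits
    # fractional bits (assumed bit width; the paper's choice may differ).
    scale = 1 << frac_bits
    return [int(round(w * scale)) for w in weights]

def build_huffman_code(symbols):
    # Standard Huffman code construction over symbol frequencies.
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    heap = [[count, idx, {sym: ""}] for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_idx = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_idx, merged])
        next_idx += 1
    return heap[0][2]

def compress(weights, frac_bits=6):
    # Quantize, then Huffman-code the quantized symbols. Huffman coding pays off
    # when the value distribution is concentrated around a few symbols, which is
    # why the abstract notes it is applied selectively per weight matrix.
    quantized = quantize_fixed_point(weights, frac_bits)
    code = build_huffman_code(quantized)
    bitstream = "".join(code[s] for s in quantized)
    ratio = (len(weights) * 32) / max(len(bitstream), 1)  # vs. 32-bit floats
    return bitstream, code, ratio

if __name__ == "__main__":
    random.seed(0)
    # Toy stand-in for a W^Q-style linearization weight matrix (bell-shaped values).
    toy_weights = [random.gauss(0.0, 0.05) for _ in range(4096)]
    _, _, ratio = compress(toy_weights)
    print(f"compression ratio vs. FP32: {ratio:.1f}x")

Running the script prints the compression ratio achieved on the toy matrix; because the values are concentrated around zero, Huffman coding after quantization compresses well, which mirrors the abstract's point that the coding step should be applied selectively depending on the weight value distribution.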