TY - GEN
T1 - Data Allocation Rearrangement on CNN Accelerator Based on Reshaping Systolic Tile Array Using Planarized Matrix Reordering Techniques
AU - Kim, Hoseong
AU - Park, Daejin
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Demand for and use of artificial intelligence (AI) continue to grow. A variety of electronic computing devices, such as mobile devices, are now built with AI technology. However, the computational load of AI operations is very heavy. AI data processing, typically a convolutional neural network (CNN), consists mainly of matrix convolution and matrix multiplication; the two are effectively the same, depending on how the operation sequence is arranged. Matrix multiplication is not well suited to sequential data processing. During matrix convolution or matrix multiplication, the memory access pattern is non-linear. This makes parallelized data processing difficult to apply, and a conventional sequential processor has limited hardware resources for parallelized data processing. Even in sequential processing, the non-linear access pattern requires conditional branch tests and jump instructions for address calculation. As a result, these factors lead to high power consumption and poor performance during AI operations. Efficient AI computation therefore requires a hardware-software system that planarizes the matrix structure for linear access and uses parallel data processing based on multiply-accumulate (MAC) tiles. In this paper, a parallel data processing structure based on a systolic tile array is used to shorten operation time and use ALU resources efficiently in matrix multiplication. To increase data processing throughput, the proposed accelerator is equipped with a heavier processing element (PE) tile than the original systolic tile PE, making it possible to calculate multi-state partial sums at once. Reordering the multidimensional matrix into a linear, planarized matrix array is also proposed at the micro-architecture level to reduce non-linear access address calculation. The accelerator's memory system allows multiple elements to be committed and reduces data accesses to DRAM during operation. In conclusion, the proposed accelerator performs CNN operations much faster, with low power consumption.
AB - Demand for and use of artificial intelligence (AI) continue to grow. A variety of electronic computing devices, such as mobile devices, are now built with AI technology. However, the computational load of AI operations is very heavy. AI data processing, typically a convolutional neural network (CNN), consists mainly of matrix convolution and matrix multiplication; the two are effectively the same, depending on how the operation sequence is arranged. Matrix multiplication is not well suited to sequential data processing. During matrix convolution or matrix multiplication, the memory access pattern is non-linear. This makes parallelized data processing difficult to apply, and a conventional sequential processor has limited hardware resources for parallelized data processing. Even in sequential processing, the non-linear access pattern requires conditional branch tests and jump instructions for address calculation. As a result, these factors lead to high power consumption and poor performance during AI operations. Efficient AI computation therefore requires a hardware-software system that planarizes the matrix structure for linear access and uses parallel data processing based on multiply-accumulate (MAC) tiles. In this paper, a parallel data processing structure based on a systolic tile array is used to shorten operation time and use ALU resources efficiently in matrix multiplication. To increase data processing throughput, the proposed accelerator is equipped with a heavier processing element (PE) tile than the original systolic tile PE, making it possible to calculate multi-state partial sums at once. Reordering the multidimensional matrix into a linear, planarized matrix array is also proposed at the micro-architecture level to reduce non-linear access address calculation. The accelerator's memory system allows multiple elements to be committed and reduces data accesses to DRAM during operation. In conclusion, the proposed accelerator performs CNN operations much faster, with low power consumption.
KW - Artificial intelligence accelerator
KW - Convolutional neural network
KW - Matrix or Tensor data processing
KW - multicore processor
KW - parallelized processing
KW - Systolic Array
KW - Verilog
UR - https://www.scopus.com/pages/publications/105032464924
U2 - 10.1109/MCSoC67473.2025.00020
DO - 10.1109/MCSoC67473.2025.00020
M3 - Conference contribution
AN - SCOPUS:105032464924
T3 - Proceedings - 2025 IEEE 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2025
SP - 60
EP - 63
BT - Proceedings - 2025 IEEE 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2025
Y2 - 15 December 2025 through 18 December 2025
ER -