Abstract
Deep learning (DL)-based recommendation models play an important role in many real-world applications. However, the embedding layer, a key component of DL-based recommendation models, requires sparse memory accesses to a very large memory space followed by pooling (i.e., reduction) operations. This forces the system to overprovision memory capacity for model deployment. Moreover, with a conventional CPU-based architecture, it is difficult to exploit locality, which places a heavy burden on data transfer between the CPU and memory. To address this problem, we propose an embedding vector element quantization and compression method that reduces the memory footprint (capacity) required by the embedding tables. In addition, to reduce the amount of data transfer and the number of memory accesses, we propose near-memory acceleration hardware with an SRAM buffer that stores frequently accessed embedding vectors. Our quantization and compression method achieves compression ratios of 3.95–4.14 for embedding tables in widely used datasets while negligibly affecting inference accuracy. Our acceleration technique with 3D-stacked DRAM, which enables near-memory processing in the logic die with high DRAM bandwidth, speeds up the embedding layer by 4.9×–5.4× compared to 8-core CPU-based execution while reducing memory energy consumption by 5.9×–12.1× on average.
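For readers unfamiliar with the access pattern described above, the sketch below illustrates, in plain NumPy, a sparse embedding-table lookup followed by sum pooling, together with a generic per-table 8-bit uniform quantization. The table size, the quantization scheme, and all function names are illustrative assumptions; this is not the element quantization/compression method or the near-memory hardware proposed in the paper.

```python
import numpy as np

# Hypothetical sizes for illustration only (not taken from the paper).
NUM_ROWS, DIM = 100_000, 64                      # embedding table: rows x vector dim
table = np.random.randn(NUM_ROWS, DIM).astype(np.float32)

def embedding_lookup_pool(indices):
    """Sparse gather of embedding vectors followed by sum pooling (reduction)."""
    return table[indices].sum(axis=0)            # gather rows, reduce to one vector

def quantize_uint8(vecs):
    """Generic uniform 8-bit quantization of embedding elements (illustrative)."""
    lo, hi = vecs.min(), vecs.max()
    scale = (hi - lo) / 255.0
    q = np.round((vecs - lo) / scale).astype(np.uint8)   # 4x smaller than fp32
    return q, lo, scale

def dequantize_uint8(q, lo, scale):
    """Recover approximate fp32 values from the quantized representation."""
    return q.astype(np.float32) * scale + lo

# One inference-time lookup: a few sparse indices hit a very large table.
pooled = embedding_lookup_pool(np.array([3, 17, 4_200, 99_999]))
q_table, lo, scale = quantize_uint8(table)
```

Because each lookup touches only a handful of rows scattered across a large table, such gathers exhibit poor cache locality on a CPU, which is the bottleneck the paper's quantization/compression and near-memory acceleration aim to relieve.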
| Original language | English |
| --- | --- |
| Pages (from-to) | 1-15 |
| Number of pages | 15 |
| Journal | IEEE Transactions on Emerging Topics in Computing |
| DOIs | |
| State | Accepted/In press - 2023 |
Keywords
- Compression
- Data transfer
- embedding table
- Energy consumption
- Hardware
- inference
- Memory management
- near-memory processing
- personalized recommendation model
- Quantization (signal)
- Random access memory
- Table lookup