Intel Spoken Language Technologies Summit (iSLTS) 2019 Keynote


Recurrent neural networks (RNNs), including workloads like recommender systems, machine translation, speech synthesis and speech transcription, form a significant proportion of data center deep learning inference. Productionized versions of these models typically contain tens to hundreds of millions of parameters but some have been scaled to billions of parameters given enough data. Increasing the size of a model also increases its compute and memory requirements. Reducing the computational cost of these models translates directly to cost and energy savings for service operators.

In this presentation and associated tutorial/demonstration, will describe how the use of two optimization techniques, sparsity and quantization, can be applied to a speech transcription model, DeepSpeech. This model is representative of today’s deployed recurrent neural networks. Inducing sparsity within a model decreases the total effective number of parameters by explicitly setting some to zero. With suitable hardware support this significantly improves the effective arithmetic intensity of the system as computations involving parameters with zero values can be implicitly executed without a memory access or multiplication. Furthermore, these parameters do not need to be stored, reducing the overall memory requirements. For an FPGA platform, this enables the storage of weights entirely in on-chip RAM. Pruning techniques are an effective class of methods that induce sparsity within a model. Let the importance, or saliency, of a parameter be the size of the increase in the loss or error function if that parameter were to be set to zero. Parameters with higher saliency will cause the error to increase more when removed. Pruning techniques remove the parameters with lowest saliency. However, finding these parameters is a non-trivial problem. will also cover the use of magnitude-based pruning. This is a specific pruning technique that has been successfully used to induce high-levels of sparsity in a variety of neural networks. Quantizing a model reduces the number of bits used to represent each parameter and/or each activation during inference. quantizes both the IEEE 754 single-precision floating-point weights and activations of the original model to 8-bit integers. This reduces the size of the model by a factor of 4 resulting in an immediate 4x improvement in arithmetic intensity and a 4x decrease in data bandwidth requirement.We show how these optimizations enable the model to be deployed on an Intel® Stratix® 10 FPGA using a high performance sparse linear algebra accelerator, achieving very high levels of sparsity (greater than 95%) with minimal loss of accuracy (less than 0.23%). Finally we will compare and contrast measured numbers for this FPGA implementation including performance/watt, latency and raw performance benchmarks with GPUs and CPUs.