Large Language Models (LLMs) have demonstrated remarkable efficacy across numerous real-world applications, including intelligent assistants, text summarization, translation, and multi-modality tasks on mobile devices. However, current methodologies for on-device deployment of LLMs are hampered by slow inference speeds, resulting in subpar user experiences.
In a new paper, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, researchers from the OPPO AI Center introduce a solution. They present four optimization techniques and a novel mobile inference engine dubbed Transformer-Lite. This engine outperforms CPU-based FastLLM and GPU-based MLC-LLM, achieving a remarkable over-10x acceleration for prefill speed and 2~3x for decoding speed.
To streamline LLM deployment on device GPUs while ensuring efficiency, the team proposes combining the strengths of generic mobile inference engines with LLM-specific ones. To address this challenge effectively, they introduce four optimization techniques:
Symbolic Expression-based Approach: Supports dynamic shape model inference through dynamic shape derivation, memory reuse, and execution scheduling.
Operator Optimizations and Execution Priority Setting: These optimizations aim to enhance performance and reduce phone lagging.
FP4 Quantization Method (M0E4): This method minimizes the performance overhead of dequantization, enabling more efficient matrix multiplication.
Sub-tensor-based Approach: This technique avoids copying the KV cache from model outputs to model inputs after each LLM inference iteration (see the sketch after this list).
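To make the fourth technique concrete, here is a minimal, illustrative sketch (not the paper's engine code) of the underlying idea: keep the KV cache in one pre-allocated buffer and hand the model sub-tensor views of it, so nothing is copied from outputs back to inputs between decoding steps. The shapes and the `decode_step` helper below are assumptions made for illustration.

```python
import torch

# Illustrative shapes only; real models use their own layer/head/dim sizes.
num_layers, num_heads, head_dim = 2, 4, 64
max_seq_len = 1024

# One persistent buffer per layer holding keys and values: [2, heads, max_len, dim]
kv_cache = [torch.zeros(2, num_heads, max_seq_len, head_dim) for _ in range(num_layers)]

def decode_step(new_k, new_v, layer, pos):
    """Write this step's K/V into the shared buffer in place (no copy-back),
    then return a sub-tensor view covering the valid prefix for attention."""
    kv_cache[layer][0, :, pos] = new_k          # keys for the token at position `pos`
    kv_cache[layer][1, :, pos] = new_v          # values for the same token
    return kv_cache[layer][:, :, : pos + 1]     # slice is a view, no data copy

# Usage: decoding position 5 of layer 0
k = torch.randn(num_heads, head_dim)
v = torch.randn(num_heads, head_dim)
kv_view = decode_step(k, v, layer=0, pos=5)
print(kv_view.shape)  # torch.Size([2, 4, 6, 64])
```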
Furthermore, the researchers develop the Transformer-Lite engine with these optimizations integrated. The engine deploys LLMs from ONNX models exported by training frameworks such as PyTorch, ensuring convenient deployment and easy support for new model types.
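As a rough illustration of this kind of workflow, the snippet below sketches exporting a PyTorch module to ONNX with dynamic batch and sequence axes, which is what allows an inference engine to accept variable prompt lengths. The tiny model and the axis names are placeholders, not taken from the paper.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """A stand-in module; a real LLM would be exported the same way."""
    def __init__(self, hidden=256):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, x):
        return torch.relu(self.proj(x))

model = TinyBlock().eval()
dummy = torch.randn(1, 8, 256)  # [batch, seq_len, hidden]

torch.onnx.export(
    model,
    dummy,
    "tiny_block.onnx",
    input_names=["hidden_states"],
    output_names=["out"],
    # Mark batch and sequence dimensions as dynamic so the exported graph
    # can serve prompts of varying length at inference time.
    dynamic_axes={
        "hidden_states": {0: "batch", 1: "seq_len"},
        "out": {0: "batch", 1: "seq_len"},
    },
)
```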
In their empirical analysis, the team evaluates the performance of the proposed engine on two phones: the OPPO Find X7 24GB memory version and the OPPO Find X7 Ultra 12GB memory version. They select five LLMs with varying structures and parameter sizes: Gemma 2B, Qwen1.5 4B, ChatGLM2 6B, Llama2 7B, and Qwen1.5 14B. By comparing against GPU inference with MLC-LLM and CPU inference with FastLLM, they demonstrate the superiority of the Transformer-Lite engine.
Specifically, the Transformer-Lite engine achieves prefill and decoding speeds of 121 tokens/s and 14 tokens/s for ChatGLM2 6B, and 330 tokens/s and 30 tokens/s for the smaller Gemma 2B, respectively. This amounts to over a 10x speedup in prefill speed and a 2~3x speedup in decoding speed compared with both CPU-based FastLLM and GPU-based MLC-LLM.
The paper Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs is on arXiv.
Author: Hecate He | Editor: Chain Zhang
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.