圖片 TensorFlow團隊提供

主題圖片:Designed by Freepik

翻譯 宗諭 審閱 阿吉老師



We are excited to introduce a new optimization toolkit in TensorFlow: a suite of techniques that developers, both novice and advanced, can use to optimize machine learning models for deployment and execution.



While we expect that these techniques will be useful for optimizing any TensorFlow model for deployment, they are particularly important for TensorFlow Lite developers who are serving models on devices with tight memory, power constraints, and storage limitations. If you haven’t tried out TensorFlow Lite yet, you can find out more about it here.

我們期待這個工具包對於部署任何TensorFlow模型,都能做到最佳化,然而,這個工具包對TensorFlow Lite的開發者來說尤其重要!因為,TensorFlow Lite的開發者需要在裝置上運作模型,這些裝置不論在記憶體、電源與儲存容量上都更為吃緊。如果您尚未試用過TensorFlow Lite,可在這裡了解更多相關資訊。


The first technique that we are adding support for is post-training quantization to the TensorFlow Lite conversion tool. This can result in up to 4x compression and up to 3x faster execution for relevant machine learning models.

這個工具包加入的第一項技術是針對TensorFlow Lite轉換工具的訓練後量化(post-training quantization)功能。針對相關的機器學習模型,這項技術最高可達到4倍的壓縮率,並使執行速度最高加快3倍。


By quantizing their models, developers will also gain the additional benefit of reduced power consumption. This can be useful for deployment in edge devices, beyond mobile phones.



Enabling post-training quantization

The post-training quantization technique is integrated into the TensorFlow Lite conversion tool. Getting started is easy: after building their TensorFlow model, developers can simply enable the ‘post_training_quantize’ flag in the TensorFlow Lite conversion tool. Assuming that the saved model is stored in saved_model_dir, the quantized tflite flatbuffer can be generated:


訓練後量化技術已被整合至TensorFlow Lite轉換工具中。如何開始使用呢?很簡單!在建立TensorFlow模型後,開發者只要在TensorFlow Lite轉換工具中啟用「post_training_quantize」旗標就可以了。假設我們把模型放在saved_model_dir 這個資料夾中,就能產生量化後的「tflite flatbuffer」:

open(“quantized_model.tflite”, “wb”).write(tflite_quantized_model)


Our tutorial walks you through how to do this in depth. In the future, we aim to incorporate this technique into general TensorFlow tooling as well, so that it can be used for deployment on platforms not currently supported by TensorFlow Lite.

我們的教學文件會詳細告訴您關於「訓練後量化」這項技術。未來,我們希望把這項技術也納入一般TensorFlow工具中,使它能部署於一些目前TensorFlow Lite尚未支援的平台上。


Benefits of post-training quantization

  • 4x reduction in model sizes

  • Models, which consist primarily of convolutional layers, get 10–50% faster execution

  • RNN-based models get up to 3x speed-up

  • Due to reduced memory and computation requirements, we expect that most models will also have lower power consumption


  • 模型大小最高可減少4倍

  • 主要由卷積層所組成的模型,執行速度可加快10%-50%。

  • 以遞迴神經網路RNN為基礎的模型,最高可加速3倍。

  • 由於減低記憶體和低運算的要求日增,我們期待大多數的模型也能做到低功耗。


See graphs below for model size reduction and execution time speed-ups for a few models (measurements done on Android Pixel 2 phone using a single core).

關於降低模型大小與提高執行速度,請參考下面幾張圖表(使用Android Pixel 2 單核心手機)。

圖1 Model Size Comparison: Optimized models are almost 4x smaller 模型大小比較:最佳化後,模型大小幾乎小了4倍。


圖2 Latency Comparison: Optimized models are 1.2 to 1.4x faster 延遲比較:經過最佳化的模型加速1.2至1.4倍。


These speed-ups and model size reductions occur with little impact to accuracy. In general, models that are already small for the task at hand (for example, mobilenet v1 for image classification) may experience more accuracy loss. For many of these models we provide pre-trained fully-quantized models.

執行加速及降低模型大小對準確性影響不大。總體來說,原本就輕量化、專門針對行動裝置任務的模型(例如用於影像分類的mobilenet v1),準確性可能會損失較多。針對前述這些模型,我們提供預先訓練好的全量化模型

圖3 Accuracy Comparison: Optimized models show negligible accuracy drop, except for mobilenets. 準確性比較:除了mobilenets外,其它最佳化後的模型,準確性幾乎沒有降低。


We expect to continue improving our results in the future, so please see the model optimization guide for the latest measurements.



How post-training quantization works

Under the hood, we are running optimizations (otherwise referred to as quantization) by lowering the precision of the parameters (i.e. neural network weights) from their training-time 32-bit floating-point representations into much smaller and efficient 8-bit integer ones. See the post-training quantization guide for more details.




These optimizations will make sure to pair the reduced-precision operation definitions in the resulting model with kernel implementations that use a mix of fixed- and floating-point math. This will execute the heaviest computations fast in lower precision, but the most sensitive ones with higher precision, thus typically resulting in little to no final accuracy losses for the task, yet a significant speed-up over pure floating-point execution. For operations where there isn’t a matching “hybrid” kernel, or where the Toolkit deems it necessary, it will reconvert the parameters to the higher floating point precision for execution. Please see the post-training quantization page for a list of supported hybrid operations.

這樣的最佳化,將確保在結果模型中精確性降低後的操作定義能符合內核實作,後者在實作混用了定點和浮點數學。這樣可在精確度較低的情況下做到最多的運算,但其中最敏感且精確性較高的模型,通常不會 (或僅微量)因此導致精確度降低,反之如果是純浮點數執行的話,執行速度則有顯著提升。針對非「混合型」內核的運算,或工具包認為有必要,「訓練後量化」會再將參數轉換為較高精度的浮點數。關於「訓練後量化」支援哪些混合運算,請參考「訓練後量化」網頁。


Future work

We will continue to improve post-training quantization as well as work on other techniques which make it easier to optimize models. These will be integrated into relevant TensorFlow workflows to make them easy to use.




Post-training quantization is the first offering under the umbrella of the optimization toolkit that we are developing. We look forward to getting developer feedback on it.



Please file issues at GitHub and ask questions at Stack Overflow.


請在GitHub提出你們的建議,並請在Stack Overflow詢問問題。






發佈留言必須填寫的電子郵件地址不會公開。 必填欄位標示為 *