Optimizing model accuracy and latency using Bayesian multi-objective neural architecture search

What the research is:

We propose a method for sample-efficiently optimizing the tradeoff between model accuracy and on-device prediction latency in deep neural networks.

Neural Architecture Search (NAS) aims to provide an automated framework that identifies the optimal architecture of a deep neural network based on an evaluation criterion such as model accuracy. The ongoing trend toward deploying models on end-user devices such as mobile phones has led to increased interest in optimizing several competing objectives in order to strike the right balance between predictive performance and computational complexity (e.g., total number of FLOPs), memory footprint, and latency of the model.

Existing NAS methods based on reinforcement learning and/or evolutionary strategies can incur prohibitively high computational costs because they require training and evaluating a large number of architectures. Many other approaches require integrating the optimization framework into the training and evaluation workflows, which makes it difficult to generalize to different production use cases. In our work, we close these gaps by providing a NAS methodology that requires no code changes to a user's training process and can therefore easily leverage existing large-scale training infrastructure, while offering highly sample-efficient optimization of multiple competing objectives.

We take advantage of recent advances in multi-objective and high-dimensional Bayesian optimization (BO), a popular method for black-box optimization of computationally expensive functions. We demonstrate the usefulness of our method by optimizing the architecture and hyperparameters of a real-world natural language understanding model used at Facebook.

How it works:

NLU use case

We focus on the specific problem of tuning the architecture and hyperparameters of an on-device Natural Language Understanding (NLU) model that is commonly used by conversational assistants on mobile devices and smart speakers. The primary goal of the NLU model is to understand the user's semantic expression and convert it into a structured decoupled representation that can be processed by downstream programs. The NLU model shown in Figure 1 is a non-autoregressive (NAR) encoder-decoder architecture based on the state-of-the-art span pointer formulation.

Figure 1: The non-autoregressive model architecture for NLU semantic parsing

The NLU model serves as the first stage in conversational assistants, and high accuracy is critical to a positive user experience. Conversational assistants operate on the user's language, possibly in privacy-sensitive situations such as sending a message. Because of this, they generally run on-device, at the cost of limited computational resources. It is also important that the model achieve low inference time (latency) on the device to ensure a responsive user experience. While we generally expect a complex NLU model with a large number of parameters to achieve better accuracy, complex models tend to have high latency. We are therefore interested in exploring the tradeoffs between accuracy and latency by optimizing a total of 24 hyperparameters, so that we can choose a model that provides an overall positive user experience by balancing quality and latency. Specifically, we optimize the 99th percentile of latency over repeated measurements and the accuracy on a held-out dataset.
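The p99 latency objective above can be estimated from repeated on-device measurements. A minimal sketch (function name and the simulated sample data are ours, purely for illustration):

```python
import statistics

def p99_latency(measurements_ms):
    """Return the 99th percentile of repeated latency measurements (ms)."""
    # statistics.quantiles with n=100 yields the 1st..99th percentile cut
    # points; the last entry is the 99th percentile.
    return statistics.quantiles(measurements_ms, n=100)[-1]

# Simulated measurements: 1,000 latency samples spread between 30 and 35 ms.
samples = [30.0 + (i % 100) * 0.05 for i in range(1000)]
print(round(p99_latency(samples), 2))
```

Because latency is noisy, optimizing a tail statistic like p99 rather than the mean better reflects the worst-case experience a user sees.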


BO is typically most effective in search spaces with fewer than 10 to 15 dimensions. To scale to the 24-dimensional search space in this work, we leverage recent work on high-dimensional BO [1]. Figure 2 shows that the model proposed in [1], which uses a sparse axis-aligned subspace (SAAS) prior and fully Bayesian inference, is critical to achieving good model fits, outperforming a standard Gaussian process (GP) model with maximum a posteriori (MAP) inference on both the accuracy and latency objectives.
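In rough terms (our paraphrase of [1], with simplified notation), the SAAS model places a half-Cauchy hyperprior on a global shrinkage parameter $\tau$ and half-Cauchy priors on the inverse squared lengthscales $\rho_d$ of the GP kernel, so that most of the $D = 24$ input dimensions are effectively switched off unless the data provides evidence that they matter:

$$
\tau \sim \mathcal{HC}(\alpha), \qquad \rho_d \mid \tau \sim \mathcal{HC}(\tau), \qquad
k(x, x') = \sigma^2 \exp\!\Big(-\tfrac{1}{2}\sum_{d=1}^{D} \rho_d\,(x_d - x'_d)^2\Big).
$$

This sparsity-inducing structure is what lets the GP fit well from relatively few evaluations even in a 24-dimensional space.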

Figure 2: Leave-one-out cross-validation performance for the accuracy and latency objectives. The SAAS model provides a considerably better fit than a standard GP with MAP inference.

To efficiently explore the tradeoffs between multiple objectives, we use the parallel Noisy Expected Hypervolume Improvement (qNEHVI) acquisition function [2], which enables evaluating many architectures in parallel (we use a batch size of 16 in this work) and naturally handles the observation noise present in both the latency and accuracy metrics: prediction latency is subject to measurement error, and accuracy is subject to the randomness of training the network parameters with stochastic gradient methods.
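qNEHVI builds on the hypervolume indicator: the volume of objective space dominated by the current Pareto front, bounded above by a reference point. For two minimized objectives the dominated area, and the improvement a new candidate would contribute, can be computed with a simple sweep. This is an illustrative sketch only, not the Ax/BoTorch implementation, which additionally integrates over posterior samples to handle noise:

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` w.r.t. reference `ref` (both objectives minimized)."""
    # Sweep points in ascending order of the first objective; each
    # non-dominated point adds a rectangle below the previous best.
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1]):
        if y < prev_y:
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

def hypervolume_improvement(candidate, points, ref):
    """Gain in dominated hypervolume from adding `candidate` to `points`."""
    return hypervolume_2d(list(points) + [candidate], ref) - hypervolume_2d(points, ref)

# Toy Pareto front of (error, latency) pairs and a reference point:
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, (4.0, 4.0)))                       # 6.0
print(hypervolume_improvement((1.5, 1.5), front, (4.0, 4.0)))  # 1.25
```

A candidate that pushes the front outward in a region not yet covered yields a large improvement, which is exactly what the acquisition function rewards.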


We compare the optimization performance of BO against quasi-random Sobol search. Figure 3 shows the results with the objectives normalized to the production model, so the reference point equals (1, 1). In 240 evaluations, Sobol was able to find only two configurations that beat the reference point. In contrast, our BO method was able to explore the tradeoffs between the objectives and improved latency by more than 25% while simultaneously improving model accuracy.
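Reading results like these amounts to filtering evaluated configurations against the production reference point and extracting the non-dominated set. A toy sketch (data is hypothetical, and both objectives are assumed normalized so that lower is better):

```python
def pareto_front(points):
    """Return the non-dominated subset (both objectives minimized)."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]

def beats_reference(points, ref=(1.0, 1.0)):
    """Configurations strictly better than the reference on both objectives."""
    return [p for p in points if p[0] < ref[0] and p[1] < ref[1]]

# Hypothetical evaluations as (normalized error, normalized latency) pairs,
# where the production model sits at the reference point (1.0, 1.0):
evals = [(0.9, 0.7), (1.1, 0.6), (0.95, 0.95), (0.9, 0.8)]
print(pareto_front(evals))       # non-dominated configurations
print(beats_reference(evals))    # configurations beating production on both
```

Note that a configuration can lie on the Pareto front without beating the reference point (and vice versa); the deployed model is chosen from the candidates that do both.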

Figure 3: On the left, we see that quasi-random Sobol search is an inefficient approach that finds only two configurations better than the reference point (1, 1). On the right, our BO method is much more sample-efficient and is able to explore the tradeoffs between accuracy and latency.

Why it matters:

This new method has unlocked on-device deployment for this natural language understanding model as well as several other models at Facebook. Our method requires no code changes to existing training and evaluation workflows, which makes it easy to generalize to a variety of architecture search use cases. We hope that machine learning researchers, practitioners, and engineers will find this method useful in their applications and foundational to future research on NAS.

Read the full paper:



[1] Eriksson, David and Martin Jankowiak. “High-dimensional Bayesian optimization with sparse axis-aligned subspaces.” Conference on Uncertainty in Artificial Intelligence (UAI), 2021.

[2] Daulton, Samuel, Maximilian Balandat, and Eytan Bakshy. "Parallel Bayesian Optimization of Multiple Noisy Objectives with Expected Hypervolume Improvement." arXiv preprint arXiv:2105.08195, 2021.

Try it yourself:

Check out our tutorial in Ax that shows how to use the open source implementation of qNEHVI together with GPs with SAAS priors to optimize two synthetic objectives.

Watch tutorial
