Linear programming to optimize ML models
Whether we’re iterating on Facebook’s News Feed ranking algorithm or delivering the most relevant ads to people, we’re constantly exploring new features to improve our machine learning (ML) models. Every new feature creates a challenging data engineering problem and requires us to think strategically about the tradeoffs we make: more complex features and more sophisticated techniques require additional storage, and even for a company the size of Facebook, capacity is not infinite. Left unchecked, accepting every feature would quickly overwhelm our capacity, slow our iteration speed, and reduce how efficiently we run our models.
To better understand the relationship between these features and the infrastructure capacity that must support them, we can describe the system as a linear programming problem. Framed this way, we can maximize a model’s performance, examine how sensitive that performance is to various infrastructure constraints, and study the relationships between different services. This work, carried out by data scientists embedded in our engineering team, demonstrates the value of analytics and data science in ML.
Supporting feature development
To keep improving our models, it is important to continually introduce features that make the best use of new data; new features are responsible for most of our incremental model improvements. These ML models power our ads delivery system, where they work together to predict how likely a person is to take a specific action on an ad. We continuously improve our models so that our systems deliver only the ads that are relevant to each person.
As our techniques become more sophisticated, we develop more complex features that demand more from our infrastructure. A feature can use different services depending on its purpose: some features carry a higher RAM cost, while others require additional CPU or take up more storage. To maximize the performance of our models, we have to use our infrastructure wisely, allocating resources intelligently and quantifying the tradeoffs between different scenarios.
To address these problems, we frame our system as a linear programming problem that maximizes our models’ metrics. We use this framework to better understand the interactions between our features and services. With this knowledge, we can automatically select the best features, identify which infrastructure services to invest in, and maintain the health of both our models and our services.
Formulating our problem
To get a grip on our framework, we first introduce a toy problem. Suppose we have a number of features, each of which takes up some space (the height of the rectangles) and adds some gain to our models (the teal squares), and we are unable to fit them all into our limited capacity.
A naive solution would be to simply select the features with the most gain (the most teal squares) until we run out of capacity. However, prioritizing gain alone may not get the most out of our resources. For example, selecting a large feature with a large gain can take up space that two smaller features with individually smaller gains could have used instead. Together, those two smaller features would give us more bang for our buck than the single big feature.
If we got a little less naive, we could instead look for the features that are the most efficient – those with the greatest gain per unit of space. However, selecting features from this perspective alone can still leave gain on the table, because we could end up excluding some less efficient features that we nonetheless have room for.
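To make this concrete, here is a minimal sketch in Python with made-up numbers (the feature names, gains, and costs are purely illustrative, not taken from any real workload). Both greedy heuristics pick the big feature and stop at a gain of 8, while the true optimum – the two small features, for a gain of 12 – fits in the same capacity:

features = [
    {"name": "big", "gain": 8, "cost": 6},
    {"name": "small_1", "gain": 6, "cost": 5},
    {"name": "small_2", "gain": 6, "cost": 5},
]
capacity = 10

def greedy_select(features, capacity, key):
    # Take features in descending order of `key` until nothing else fits
    used = gain = 0
    for f in sorted(features, key=key, reverse=True):
        if used + f["cost"] <= capacity:
            used += f["cost"]
            gain += f["gain"]
    return gain

print(greedy_select(features, capacity, key=lambda f: f["gain"]))              # 8 (big only)
print(greedy_select(features, capacity, key=lambda f: f["gain"] / f["cost"]))  # 8 (big only)
# The optimum, small_1 + small_2, reaches a gain of 12 at a cost of 10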
So far we have looked at a very simplified view of the infrastructure, but reality is a little more complex. For example, features often occupy not just one resource but many – such as RAM, CPU, or storage in other services. We can make our example a little more sophisticated by adding Service B and saying that the orange features take up space in both Service A and Service B.
Choosing which features to use isn’t the only lever we have over how our infrastructure is used. We can also apply various techniques to make our feature storage more efficient. These sometimes come at a cost, either to the feature itself or to the capacity of a service. In this case, let’s assume we can cut the storage cost of some features (outlined in pink) in half, but only at the price of reducing the feature’s gain and using some of the limited capacity in Service B.
We’ll end the example here, but it’s enough to get the general message across: infrastructure can be a complicated, interconnected system with various constraints. In reality, our capacity is not set in stone; we can move resources around when it is justified. And features aren’t the only thing we work on – many other projects and workflows compete for the same resources. So we not only need to choose the features that maximize gain, but we also need to be able to answer questions about how our system responds to change:
- Which features do we select to maximize the gain?
- Is feature compression worth it? More importantly, is it worth an engineer’s time to implement?
- How does the gain change if we add more capacity to Service A?
- How do service dependencies interact? If we increase the capacity of Service B, can we get by with less of Service A?
Formalizing the problem
Let’s take a step back and review the conditions of our toy problem:
- We want to maximize our gain.
- We are limited by the capacity of Service A.
- We are also limited by the capacity of Service B, to which only some features contribute.
- Some features may be compressed, but:
  - Their gain is reduced.
  - Some of Service B’s capacity must be used.
We can express all of these constraints as a system of linear equations.
Let x be a vector of 1s and 0s indicating whether each feature is selected, and let g be a vector holding each feature’s gain. The subscripts f and c denote whether we are referring to a full-cost or a compressed feature. For example, x_f denotes the full, uncompressed features we select for inclusion, and g_c represents the gain of the compressed features.
Given these definitions, our goal is to maximize the total gain of the selected features:

maximize g_f · x_f + g_c · x_c
We can now add the constraints that model the limits of our infrastructure:
- Features are either selected and compressed, selected and uncompressed, or unselected; we should never choose both the compressed and the uncompressed version of the same feature: x_f + x_c ≤ 1 (elementwise).
- Let s be the storage cost of a feature, with the subscripts A and B denoting Service A and Service B, respectively. For example, s_{A,c} represents the storage cost of the compressed features in Service A. We are limited by the capacity of the two services: s_{A,f} · x_f + s_{A,c} · x_c ≤ C_A and s_{B,f} · x_f + s_{B,c} · x_c ≤ C_B.
- Some of Service B’s capacity must be used to enable compression. We can represent this as a set of features, marked by an indicator vector m, that must be selected: x_f ≥ m. The complete program is collected below.
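Putting the pieces together, and writing C_A and C_B for the service capacities (names of our choosing, matching the notation above), the complete program reads:

\begin{aligned}
\text{maximize} \quad & g_f \cdot x_f + g_c \cdot x_c \\
\text{subject to} \quad & x_f + x_c \le \mathbf{1} \\
& s_{A,f} \cdot x_f + s_{A,c} \cdot x_c \le C_A \\
& s_{B,f} \cdot x_f + s_{B,c} \cdot x_c \le C_B \\
& x_f \ge m \\
& x_f, x_c \in \{0, 1\}^n
\end{aligned}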
We have now fully specified our problem in a few equations and can solve it with linear programming techniques. And since we are interested in automating and productionizing this, the problem can easily be specified in code. For this example, we’ll do so in Python with the excellent NumPy and CVXPY packages.
import cvxpy as cp
import numpy as np
import pandas as pd

# Assuming data is a Pandas DataFrame that contains the relevant feature data
data = pd.DataFrame(...)

# These variables hold the maximum capacity of the various services
service_a = ...
service_b = ...

selected_full_features = cp.Variable(data.shape[0], boolean=True)
selected_compressed_features = cp.Variable(data.shape[0], boolean=True)

# Maximize the feature gain
feature_gain = (
    data.uncompressed_feature_gain.to_numpy() @ selected_full_features
    + data.compressed_feature_gain.to_numpy() @ selected_compressed_features
)

constraints = [
    # 1. We should not select both the compressed and the uncompressed
    #    version of the same feature
    selected_full_features + selected_compressed_features
    <= np.ones(data.shape[0]),
    # 2. Features are limited by the maximum capacity of the services
    data.full_storage_cost.to_numpy() @ selected_full_features
    + data.compressed_storage_cost.to_numpy() @ selected_compressed_features
    <= service_a,
    data.full_memory_cost.to_numpy() @ selected_full_features
    + data.compressed_memory_cost.to_numpy() @ selected_compressed_features
    <= service_b,
    # 3. Some features must be selected to enable compression
    selected_full_features >= data.special_features.to_numpy(),
]
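The snippet above ends with the constraint list. As a minimal sketch of the remaining step (our addition, assuming a mixed-integer-capable solver such as GLPK_MI is installed), we can hand the objective and constraints to CVXPY and read the selections back:

problem = cp.Problem(cp.Maximize(feature_gain), constraints)
problem.solve()  # requires a mixed-integer solver, e.g., GLPK_MI

# The selections come back as 0/1 vectors on the variables
chosen_full = selected_full_features.value.round().astype(bool)
chosen_compressed = selected_compressed_features.value.round().astype(bool)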
Using the framework
We now have a framework we can use to express our questions and hypotheses. If we want to find out how an increase in Service A’s capacity translates into feature gain, we can run the optimization problem above with different values for the capacity of Service A and graph the gain. This lets us directly quantify the rate of return for each incremental increase in capacity. We can use this as a strong signal for which services to invest in going forward, and directly compare the return on investment of more feature storage, compute, or memory.
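As a sketch of what that sweep might look like on top of the CVXPY formulation above (our own addition; the grid bounds are arbitrary), we can replace the Service A constant with a cvxpy Parameter and re-solve across a range of capacities:

# Service A's capacity becomes a tunable parameter
service_a_cap = cp.Parameter(nonneg=True)
sweep_constraints = [
    selected_full_features + selected_compressed_features
    <= np.ones(data.shape[0]),
    data.full_storage_cost.to_numpy() @ selected_full_features
    + data.compressed_storage_cost.to_numpy() @ selected_compressed_features
    <= service_a_cap,
    data.full_memory_cost.to_numpy() @ selected_full_features
    + data.compressed_memory_cost.to_numpy() @ selected_compressed_features
    <= service_b,
    selected_full_features >= data.special_features.to_numpy(),
]
sweep_problem = cp.Problem(cp.Maximize(feature_gain), sweep_constraints)

capacities = np.linspace(0.5 * service_a, 1.5 * service_a, 11)
gains = []
for cap in capacities:
    service_a_cap.value = cap
    gains.append(sweep_problem.solve())  # optimal gain at this capacity
# Plotting gains against capacities shows the marginal return of each
# increment of Service A capacity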
We can also look at the relationships between the services by varying the capacities of Services A and B while keeping the gain constant. As the capacity of Service B increases, less of Service A is needed to produce the same gain. This is useful when one service is overused relative to another; one way to compute the tradeoff is sketched below.
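To trace that tradeoff curve (again our own sketch, with a hypothetical gain_target), we can flip the problem around: fix the gain we must achieve, treat Service B’s capacity as a parameter, and minimize how much of Service A we consume:

gain_target = ...  # the gain level to hold constant
service_b_cap = cp.Parameter(nonneg=True)
service_a_usage = (
    data.full_storage_cost.to_numpy() @ selected_full_features
    + data.compressed_storage_cost.to_numpy() @ selected_compressed_features
)
tradeoff_problem = cp.Problem(
    cp.Minimize(service_a_usage),
    [
        selected_full_features + selected_compressed_features
        <= np.ones(data.shape[0]),
        feature_gain >= gain_target,  # hold the gain constant
        data.full_memory_cost.to_numpy() @ selected_full_features
        + data.compressed_memory_cost.to_numpy() @ selected_compressed_features
        <= service_b_cap,
        selected_full_features >= data.special_features.to_numpy(),
    ],
)
for cap in np.linspace(0.5 * service_b, 1.5 * service_b, 11):
    service_b_cap.value = cap
    print(cap, tradeoff_problem.solve())  # minimum Service A usage needed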
Linear programming as a framework for automating decisions
Feature approval used to be a manual process, with teams spending valuable time calculating how many features we could support and analyzing the return on investment of increasing our services’ capacity. At a company like Facebook, where multiple models are iterated on continuously, that approach doesn’t scale. By framing our services as a system of linear equations, we take a complex, interconnected system and distill it into simple relationships that are easy to communicate. This enables us to make smarter decisions about the features we ship and the infrastructure we invest in.