5 things I learned about working on content quality at Instagram | by Brunno Attorre
Recommended content that surfaces in places like Explore or hashtags is a central part of people’s experience on Instagram. When people browse this “disconnected content” from accounts they are not yet linked to on Instagram, it is extremely important to identify and deal with content that violates our Community Guidelines or that viewers might consider objectionable or inappropriate. Last year we formed a team dedicated to finding both harmful and potentially offensive content on these disconnected surfaces of Instagram and taking steps to protect our community.
This work differs from traditional platform work. Platform teams at Facebook traditionally focus on solving a problem across a number of surfaces, such as News Feed and Stories. Explore and hashtags, however, are particularly complicated ecosystems. We decided to develop a tailor-made solution that builds on the work of our platform teams and applies it to these complex surfaces.
Now, a year later, we share the lessons we learned from these efforts. These changes are critical to our ongoing commitment to keeping people safe on Instagram. We hope they can also help shape the strategies of other teams thinking about how to improve the quality of content in their products.
One of the biggest challenges this year was figuring out how to accurately measure the quality of content. There is no industry standard when it comes to deterministically measuring quality.
When measuring quality across experiments and A/B tests from multiple engineering teams, attempting to hand-label a subset of each test group was time-consuming and unlikely to produce statistically significant results. Overall, this was not a scalable solution.
We cycled through many different types of metrics, from using deterministic user signals to rating both test and control groups for every experiment. This churn in metrics across experiments required significant effort and cost us many iteration cycles spent trying to understand the results of our experiments.
Trying to manually label each experiment just wasn’t scalable. We often saw results with large overlapping confidence intervals and no directional signal about the experiment’s effect.
In the end, we decided to combine manual calibration labels and software-generated scores to get the best of both worlds. By relying on both human labels and a classifier for calibration, we were able to extend the calibrated classifier score (in other words, the probability of a content violation at a given score) to the entire experiment. This gave us a more statistically significant approximation of the effects than either human labels or classifiers alone.
Conclusion: Don’t try to solve quality without operationalizing your metrics, and make sure your engineers have a reliable online metric to refer to in their experiments. When thinking about quality, you should also think about how you can rely on classifier scores and manually labeled data to approximate the direction and magnitude of your launches.
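To make the calibration idea above more concrete, here is a minimal sketch, not our production implementation. It assumes a small human-labeled sample of `(score, is_violating)` pairs and a classifier score for every impression in an experiment group. The bucket edges, function names, and toy data are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical score bucket edges; the article does not specify any.
BUCKET_EDGES = [0.2, 0.4, 0.6, 0.8]


def bucket_of(score):
    """Map a classifier score in [0, 1] to a bucket index."""
    return bisect_right(BUCKET_EDGES, score)


def fit_calibration(labeled):
    """Return the observed violation rate per score bucket.

    labeled: iterable of (score, is_violating) pairs from human review.
    Buckets with no labeled samples fall back to 0.0 in this sketch.
    """
    n_buckets = len(BUCKET_EDGES) + 1
    totals = [0] * n_buckets
    bad = [0] * n_buckets
    for score, is_violating in labeled:
        b = bucket_of(score)
        totals[b] += 1
        bad[b] += int(is_violating)
    return [bad[b] / totals[b] if totals[b] else 0.0 for b in range(n_buckets)]


def estimated_prevalence(scores, calibration):
    """Average calibrated violation probability over a group's impressions."""
    return sum(calibration[bucket_of(s)] for s in scores) / len(scores)


# Toy human-labeled calibration sample (assumed data).
labeled = [(0.1, False), (0.15, False), (0.5, False),
           (0.55, True), (0.9, True), (0.95, True)]
calibration = fit_calibration(labeled)

# Classifier scores for every impression in each experiment group (assumed data).
control_scores = [0.1, 0.5, 0.9, 0.1]
test_scores = [0.1, 0.1, 0.5, 0.1]

control_estimate = estimated_prevalence(control_scores, calibration)
test_estimate = estimated_prevalence(test_scores, calibration)
```

Because every impression contributes a calibrated probability, the estimate uses the whole experiment group rather than only the hand-labeled subset, which is what narrows the confidence intervals.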
In the past we’ve always used classifiers that predict whether a piece of content is good or bad at the time of upload. These are called “write path classifiers”. A write path classifier has the advantage of being efficient, but it has one major drawback: it can only look at the content itself (i.e., pixels and captions). It cannot incorporate real-time features that carry a lot of information about whether a piece of media is good or bad, such as comments or other engagement signals.
Last year we started working on a “read path model”. This read path model is a real-time, impression-level classifier for detecting unwanted content (photos, videos), which combines upload-time signals with real-time engagement signals at the media and author level. The model runs every time a user makes a request to view a page in Explore, scoring each candidate in real time at the request level.
This model proved extremely successful. By using real-time engagement signals in combination with content features, we could capture and understand bad behaviors associated with violating content.
Our proposal with the Well-being team to use both write path and read path models proved extremely effective in reducing unwanted content in Explore.
Conclusion: If you are considering incorporating quality signals into your ranking model, using a read path model trained with both content-level and engagement-level features can be a more reliable and precise means of getting better results.
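As an illustration of the read path idea, the sketch below scores each candidate at request time from one upload-time signal and two real-time signals. It is only a toy logistic model under assumed inputs: the feature names (`report_rate`, `author_strike_rate`), the weights, and the bias are hypothetical, and a real model would learn its parameters from training data.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    media_id: str
    upload_score: float        # write path classifier score computed at upload
    report_rate: float         # hypothetical real-time signal: reports per impression
    author_strike_rate: float  # hypothetical author-level signal


# Hypothetical parameters, chosen by hand for illustration only.
WEIGHTS = {"upload_score": 3.0, "report_rate": 40.0, "author_strike_rate": 5.0}
BIAS = -4.0


def read_path_score(c):
    """Combine upload-time and real-time signals into a probability-like score."""
    z = (BIAS
         + WEIGHTS["upload_score"] * c.upload_score
         + WEIGHTS["report_rate"] * c.report_rate
         + WEIGHTS["author_strike_rate"] * c.author_strike_rate)
    return 1.0 / (1.0 + math.exp(-z))


def score_request(candidates):
    """Score every candidate for a single page request."""
    return {c.media_id: read_path_score(c) for c in candidates}


# Two candidates that look identical at upload time but differ in engagement.
scores = score_request([
    Candidate("a", upload_score=0.3, report_rate=0.0, author_strike_rate=0.0),
    Candidate("b", upload_score=0.3, report_rate=0.05, author_strike_rate=0.2),
])
```

The two candidates are indistinguishable to a write path classifier, yet candidate `b` receives a higher score once engagement signals are included, which is the gap the read path model is meant to close.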
While read path models are important for filtering offensive and potentially inappropriate content from disconnected surfaces at the ranking level, we found that a basic level of protection at the sourcing level is still required. This is where classifiers at the write path level come into play.
But what do ranking level and sourcing level mean? At Instagram, we have two steps to deliver content to our community on Explore and hashtag pages:
- The sourcing step represents the queries required to find eligible content to show someone, given the context of that person’s interests.
- The ranking step takes eligible content and orders it according to a given algorithm or model.
We learned the following when it came to finding suitable content at the sourcing level:
- You need filters at the sourcing level for low-prevalence problems. Low-prevalence violations represent a very small share of your training data, which means your read path models may miss that content. Using a write path (upload-time) classifier in these cases therefore makes a lot of sense and provides protection against these low-prevalence problems.
- You need high-precision filters to ensure basic protection on all surfaces. If you source mostly “bad” content and leave all the filtering to the ranking step, you will have little content left to rank, which reduces the effectiveness of your ranking algorithms. Hence, it is important to maintain a good standard at sourcing so that most of the content you source is harmless.
Conclusion: The combination of basic protection at sourcing, fine-tuned filtering at ranking, and a read path model enabled us to maintain a high quality standard for content on Explore. However, it is important to note that your protection at sourcing should always be high-precision and low-volume to avoid mistakes.
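The two-step flow described above can be sketched as a small pipeline: a high-precision write path filter at sourcing, followed by a read path filter and ordering at ranking. This is an illustrative sketch only. The `Media` fields, both cutoffs, and the `relevance` ordering key are assumptions, not values from our systems.

```python
from dataclasses import dataclass


@dataclass
class Media:
    media_id: str
    write_path_score: float  # upload-time classifier score
    read_path_score: float   # request-time classifier score
    relevance: float         # hypothetical ranking score


# Hypothetical cutoffs. The sourcing cutoff is deliberately strict so that
# only content the write path classifier is very sure about gets removed.
SOURCING_CUTOFF = 0.95
RANKING_CUTOFF = 0.5


def source(pool):
    """Sourcing step: basic, high-precision protection."""
    return [m for m in pool if m.write_path_score < SOURCING_CUTOFF]


def rank(candidates):
    """Ranking step: finer filtering, then ordering by relevance."""
    eligible = [m for m in candidates if m.read_path_score < RANKING_CUTOFF]
    return sorted(eligible, key=lambda m: m.relevance, reverse=True)


def explore_feed(pool):
    """Run both steps and return the ordered media ids."""
    return [m.media_id for m in rank(source(pool))]


pool = [
    Media("a", write_path_score=0.10, read_path_score=0.10, relevance=0.5),
    Media("b", write_path_score=0.99, read_path_score=0.20, relevance=0.9),
    Media("c", write_path_score=0.30, read_path_score=0.80, relevance=0.7),
    Media("d", write_path_score=0.20, read_path_score=0.30, relevance=0.8),
]
feed = explore_feed(pool)
```

Here `b` is removed at sourcing, `c` is removed at ranking, and the remaining items are ordered by relevance.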
This goes beyond engineering, but it is a key to our work. When working on quality, it is important that you measure the performance of the models you are using in production. There are two reasons:
- By calculating precision and recall on a daily basis, you can quickly determine when your model is decaying or when one of its underlying features has a performance problem. It can also alert you to a sudden change in the ecosystem.
- Knowing how your models perform can help you understand how you can improve. A low-precision model means your users may have bad experiences.
These metrics and the ability to visualize the content marked “bad” were a huge improvement for our team. Using these dashboards, our engineers can quickly spot any movement in metrics and visualize the types of content violations the model needs to handle better, which speeds up feature development and model iteration.
Bottom line: monitor your precision and recall curve daily and make sure you understand the type of content that is being filtered out. This allows you to identify problems and quickly improve your existing models.
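A daily precision and recall calculation like the one described above could look like the following minimal sketch. It assumes a stream of `(day, predicted_bad, actually_bad)` records, where `actually_bad` comes from human review. The record format, function name, and dates are hypothetical.

```python
from collections import defaultdict


def daily_precision_recall(records):
    """Compute (precision, recall) per day.

    records: iterable of (day, predicted_bad, actually_bad) tuples.
    A metric is None when its denominator is zero for that day.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for day, predicted_bad, actually_bad in records:
        c = counts[day]  # touching the key ensures every day is reported
        if predicted_bad and actually_bad:
            c["tp"] += 1
        elif predicted_bad:
            c["fp"] += 1
        elif actually_bad:
            c["fn"] += 1

    result = {}
    for day, c in sorted(counts.items()):
        flagged = c["tp"] + c["fp"]     # everything the model marked bad
        truly_bad = c["tp"] + c["fn"]   # everything reviewers marked bad
        precision = c["tp"] / flagged if flagged else None
        recall = c["tp"] / truly_bad if truly_bad else None
        result[day] = (precision, recall)
    return result


# Toy labeled sample (assumed data).
records = [
    ("2020-01-01", True, True),
    ("2020-01-01", True, False),
    ("2020-01-01", False, True),
    ("2020-01-01", False, False),
    ("2020-01-02", True, True),
    ("2020-01-02", True, True),
    ("2020-01-02", False, False),
]
metrics = daily_precision_recall(records)
```

Plotting these per-day pairs on a dashboard makes a sudden drop in either metric visible the day it happens.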
We learned a lot by using raw thresholds as filters and adjusting accordingly. Facebook is a complex ecosystem, and models have many dependencies that can affect the upstream features of your model. This in turn can affect the score distribution.
Scores can be very volatile, so it is important to be prepared for changes in the distribution.
Overall, the problem with using raw thresholds is that they are too volatile. Any small change can cause unexpected fluctuations on surfaces, especially if you suddenly have a large metric movement from one day to the next.
As a solution, we recommend a calibration data set for the daily calibration of your models or a percentile filter mechanism. We recently switched both our content filtering and ranking frameworks to use percentiles to allow for a more stable infrastructure, and we want to set up a calibration framework in the coming months.
Conclusion: Use a percentile framework instead of raw thresholds or calibrate your scores against a dataset updated daily.
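To illustrate why a percentile filter is more stable than a raw threshold, the sketch below removes a fixed share of the highest-scoring content each day, regardless of where the absolute scores sit. The function names, the 20 percent setting, the 0.75 raw threshold, and the toy scores are all assumptions for illustration.

```python
def filter_by_percentile(scores, drop_percent):
    """Drop roughly the highest-scoring `drop_percent` percent of items.

    scores: dict of media_id -> classifier score (higher means more likely bad).
    Items tied with the cutoff score are dropped together in this sketch.
    """
    ordered = sorted(scores.values())
    keep_count = len(ordered) * (100 - drop_percent) // 100
    if keep_count >= len(ordered):
        return dict(scores)  # nothing to drop
    cutoff = ordered[keep_count]
    return {m: s for m, s in scores.items() if s < cutoff}


def filter_by_threshold(scores, threshold):
    """Raw threshold filter, shown only for comparison."""
    return {m: s for m, s in scores.items() if s < threshold}


# Same content on two days; on day 2 an upstream change shifts every score up.
day1 = {"a": 0.10, "b": 0.20, "c": 0.30, "d": 0.40, "e": 0.90}
day2 = {"a": 0.50, "b": 0.60, "c": 0.70, "d": 0.80, "e": 0.95}

kept_pct_day1 = filter_by_percentile(day1, 20)
kept_pct_day2 = filter_by_percentile(day2, 20)
kept_raw_day1 = filter_by_threshold(day1, 0.75)
kept_raw_day2 = filter_by_threshold(day2, 0.75)
```

With the raw threshold, the shift in scores on day 2 suddenly removes an extra item, while the percentile filter keeps the same set on both days.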
Keeping Instagram safe is essential to our mission as a company, but it’s a tough area in our industry. It is important for us to take new approaches to solving quality problems in our service and not rely on approaches learned in more traditional ML ranking projects. Finally, here are some of our key takeaways:
- Operationalizing a quality metric is important, and you should always think about ways to rely on machine learning to scale your human labels.
- Always think holistically about how to apply quality assurance to your ranking flow and try to incorporate models across multiple levels of your system for the best results.
- Always remember that the experience of those using your service is your number one priority and make sure you have tools to visualize, monitor and calibrate the models you use in production to ensure the best experience possible.
If you would like to learn more about this work or would like to join one of our engineering teams, please visit our careers page or follow us on Facebook or Twitter.