Counterfactual Evaluation for Recommendation Systems

[ recsys eval machinelearning ] · 8 min read

When I first started working on recommendation systems, I thought there was something weird about the way we did offline evaluation. First, we split customer interaction data into training and validation sets. Then, we train our recommenders on the training set before evaluating them on the validation set, usually on metrics such as recall, precision, and NDCG. This is similar to how we evaluate supervised machine learning models and doesn’t seem unusual at first glance.

But don’t our recommendations change how customers click or purchase? If customers can only interact with items shown to them, why do we perform offline evaluation on static historical data?

Observational vs. interventional problems

It took me a while to put a finger on it but I think this is why it felt weird: We’re treating recommendations as an observational problem when it really is an interventional problem.

Problems solved via supervised machine learning are usually observational problems. Given an observation such as product title, description, and image, we try to predict the product category. Our model learns P(category=phone|title=“…”, description=“…”, image=image01.jpeg).

On the other hand, recommendations are an interventional problem. We want to learn how different interventions (i.e., item recommendations) lead to different outcomes (i.e., clicks, purchases). By using logged customer interaction data as labels, the observational offline evaluation approach ignores the interventional nature of recommendations.

As a result, we’re not evaluating if users would click or purchase more due to our new recommendations; we’re evaluating how well the new recommendations fit logged data. Thus, what our model learns is P(view3=iphone|view1=pixel, view2=galaxy) when what we really want is P(click=True|recommend=iphone, view1=pixel, view2=galaxy).

Evaluating recsys as an interventional problem

The straightforward way to evaluate recommendations as an interventional problem is via A/B testing. Our interventions (i.e., new recommendations) are shown to users, we log their behavior attributed to our new recommendations, and then measure how metrics such as click-through rate and conversion change. However, it requires more effort relative to offline evaluation, experiment cycles may be long as we need enough data to make a judgement, and there’s the risk of deploying terrible experiments. Also, we may not have easy access to A/B testing if we’re working on the research side of things.

The less direct approach is counterfactual evaluation. Counterfactual evaluation tries to answer “what would have happened if we had shown users our new recommendations instead of the existing recommendations?” This allows us to estimate the outcomes of potential A/B tests without actually running them.

The most widely known technique for counterfactual evaluation is Inverse Propensity Scoring (IPS). It’s sometimes also referred to as inverse probability weighting/sampling. The intuition behind it is that we can estimate how customer interactions will change—by reweighting how often each interaction will occur—based on how much more (or less) each item is shown by our new recommendation model. Here’s the IPS equation.

Breakdown of the Inverse Propensity Score estimator
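
For readers who can't see the figure, here is the estimator written out. This is the standard IPS form, reconstructed from the definitions in this section; the name of the estimate, the index i, and n (the number of logged observations) are notational assumptions.

```latex
\hat{V}_{\mathrm{IPS}}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_e(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, r_i
```

Reading from the right: r_i is the reward (section 1), the fraction is the importance weight with π0 in the denominator (section 2a) and πe in the numerator (section 2b), the sum divided by n is the average over the data (section 3), and the left-hand side is the IPS estimate (section 4).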

Let’s try to understand it by starting from the right. In section 1, r represents the reward for an observation. This is the number of clicks or purchases or whatever metric is important to you in the logged data.

Next is the importance weight. The denominator (section 2a) represents our existing production recommender’s (π0) probability of making a recommendation (aka action a) given the context x; the numerator (section 2b) represents the same probability but for our new recommender (πe). (π stands for recommendation policy.) For a user-to-item recommender, x is the user; for an item-to-item recommender, x is an item.

With the importance weight, we can compute how often a recommendation is made via the new model relative to the existing model. We can then use the ratio to update our logged rewards. For example, we have an old model (π0) and new model (πe) that recommend iPhone on the Pixel detail page, but with different probabilities:

π0(recommend=iPhone|view=Pixel) = 0.4

πe(recommend=iPhone|view=Pixel) = 0.6

In this scenario, the new model will recommend iPhone 0.6/0.4 = 1.5x as often as the old model. Thus, assuming a non-zero reward (i.e., the user clicked or purchased), we can reweight the logged reward to be worth 1.5x as much.

Finally, we average over our data (section 3) to get the IPS estimate (section 4) for our new recommender. This IPS estimate suggests how much reward (i.e., clicks, purchases) the new recommender would get relative to the production recommender if the new recommender’s recommendations were shown to users.
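
To make the steps concrete, here's a minimal sketch of the IPS computation in plain Python. The function name, the (p_old, p_new, reward) tuple layout, and the toy logs are my own assumptions for illustration; only the first row's 0.4 and 0.6 come from the iPhone example above.

```python
def ips_estimate(logs):
    """Average of importance-weighted rewards over all logged observations.

    Each log row is a hypothetical (p_old, p_new, reward) tuple:
      p_old  = pi_0(a|x), production recommender's probability of the shown item
      p_new  = pi_e(a|x), new recommender's probability of the same item
      reward = logged reward (1 for a click or purchase, 0 otherwise)
    """
    total = 0.0
    for p_old, p_new, reward in logs:
        weight = p_new / p_old  # importance weight
        total += weight * reward  # rows with zero reward contribute nothing
    return total / len(logs)


# Toy logs (made up): one click on an item the new model shows 1.5x as often.
logs = [
    (0.4, 0.6, 1),   # clicked; reweighted to be worth 1.5 clicks
    (0.4, 0.6, 0),
    (0.5, 0.25, 0),
    (0.5, 0.25, 0),
]

logged_average = sum(reward for _, _, reward in logs) / len(logs)
print(f"logged average reward: {logged_average:.3f}")
print(f"IPS estimate for the new recommender: {ips_estimate(logs):.3f}")
```

With these toy rows, the single logged click counts for more under the new recommender, so the IPS estimate comes out higher than the logged average reward.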

But how do we get the probability of making a recommendation (a) given the context (x)? Well, we can normalize the raw scores for each recommendation (via Plackett-Luce) to get each recommendation’s probability. Alternatively, if our recommendations are pre-computed, we can count the frequency of each recommendation in our recommendation store. My preferred approach is to use the impression count for each recommendation—I believe this is the most direct measure of the probability of making a recommendation and best adjusts for the presentation bias.
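
Here's a rough sketch of two of these options. The first treats exp(score) as each item's worth in a Plackett-Luce model, so the probability of an item taking the first slot reduces to a softmax over the scores; the second turns impression counts into empirical probabilities. The function names, item names, scores, and counts are all hypothetical.

```python
import math


def softmax_probs(scores):
    """Turn raw scores {item: score} into probabilities that sum to 1."""
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {item: math.exp(score - max_score) for item, score in scores.items()}
    total = sum(exps.values())
    return {item: value / total for item, value in exps.items()}


def impression_probs(impressions):
    """Turn impression counts {item: count} into empirical probabilities."""
    total = sum(impressions.values())
    return {item: count / total for item, count in impressions.items()}


# Hypothetical candidates for the Pixel detail page.
scores = {"iphone": 2.0, "galaxy": 1.0, "case": 0.0}
impressions = {"iphone": 400, "galaxy": 350, "case": 250}

print(softmax_probs(scores))
print(impression_probs(impressions))
```

Note that the softmax only covers a single slot. For a full ranked list, Plackett-Luce applies the same normalization repeatedly over the items that remain after each pick.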

This dependence on recommendation probabilities or impressions likely explains why counterfactual evaluation isn’t more widely adopted in academic papers—most public datasets don’t include them. One exception is the Open Bandit Dataset which includes the recommendation probability (action_prob) for each recommendation observation.

Sample rows in the Open Bandit Pipeline with the action probabilities

However, IPS has its pitfalls. One challenge is insufficient support. This happens when our new recommender being evaluated (πe) makes a recommendation (a) that our existing production recommender (π0) didn’t make. Thus, π0’s probability of a is zero and we can’t compute the importance weight. We can mitigate this by deliberately showing random samples of non-recommended items on a sliver of traffic to log interactions for potential recommendations. (Spoiler: PMs might not like this.) A more palatable approach is to ensure that all eligible items have a non-zero recommendation probability and then sample based on that probability. This gives all items a chance to be recommended.
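
As a sketch of what this could look like: the first helper flags recommendations that lack support, and the second blends a policy with a small amount of uniform exploration so every eligible item gets a non-zero probability. Mixing with a uniform distribution is just one way to do this and is my assumption, as are the function names, the epsilon value, and the toy probabilities.

```python
def unsupported_actions(p_old, p_new):
    """Items the new policy recommends but the old policy never shows."""
    return [item for item, p in p_new.items() if p > 0 and p_old.get(item, 0.0) == 0.0]


def mix_with_uniform(probs, eligible, epsilon=0.05):
    """Blend a policy with uniform exploration over the eligible items.

    Assumes probs sums to 1 over the eligible items. Every eligible item
    ends up with a probability of at least epsilon / len(eligible).
    """
    uniform = 1.0 / len(eligible)
    return {item: (1 - epsilon) * probs.get(item, 0.0) + epsilon * uniform
            for item in eligible}


# Hypothetical policies for the Pixel detail page.
p_old = {"iphone": 0.4, "galaxy": 0.6}
p_new = {"iphone": 0.5, "galaxy": 0.3, "case": 0.2}

print(unsupported_actions(p_old, p_new))  # items with no support under p_old

p_old_explore = mix_with_uniform(p_old, ["iphone", "galaxy", "case"])
print(unsupported_actions(p_old_explore, p_new))
```

The trade-off is that a larger epsilon gives better support but shows more low-scoring items to users.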

IPS can also suffer from high variance when the new model (πe) recommends very differently from the old model (π0). Suppose π0 makes a recommendation (a) with a probability of 0.001 and we logged a single click. If πe makes the same recommendation (a) with a probability of 0.1, we would reweight that single click by 100x—this is likely a severe overestimation. One solution is to ensure that the new recommenders being evaluated don’t differ too much from the production recommender, thus preventing the importance weight from exploding.

Another solution is Clipped IPS (CIPS). CIPS lets us set a maximum threshold for the importance weight. For example, if our threshold is 10, an importance weight greater than 10 is clipped to it. However, tuning the clipping parameter can be tricky.

Clipped Inverse Propensity Score
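
A minimal sketch of clipping, using the 0.001 vs. 0.1 example from above and a threshold of 10. The function name, the (p_old, p_new, reward) tuple layout, and the other toy rows are assumptions for illustration.

```python
def clipped_ips_estimate(logs, max_weight=10.0):
    """IPS where each importance weight is capped at max_weight."""
    total = 0.0
    for p_old, p_new, reward in logs:
        weight = min(p_new / p_old, max_weight)  # clip exploding weights
        total += weight * reward
    return total / len(logs)


# Toy logs (made up): the first row has an importance weight of about 100.
logs = [
    (0.001, 0.1, 1),
    (0.5, 0.5, 0),
    (0.5, 0.5, 1),
    (0.5, 0.5, 0),
]

unclipped = clipped_ips_estimate(logs, max_weight=float("inf"))  # plain IPS
clipped = clipped_ips_estimate(logs, max_weight=10.0)
print(f"IPS: {unclipped:.3f}, CIPS with threshold 10: {clipped:.3f}")
```

Without clipping, the one rare click dominates the estimate. Clipping trades that variance for some bias, which is why picking the threshold takes care.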

Another approach is Self-Normalized IPS (SNIPS). SNIPS divides the IPS estimate by the average importance weight over the logged data. This rescaling prevents overinflated IPS estimates. Relative to CIPS, SNIPS is simpler and doesn’t require setting a parameter.

Self-normalized Inverse Propensity Score
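
Here's a small sketch of the self-normalized estimator: the sum of weighted rewards divided by the sum of importance weights. The function name, the (p_old, p_new, reward) tuple layout, and the toy rows are assumptions for illustration.

```python
def snips_estimate(logs):
    """Sum of importance-weighted rewards divided by the sum of importance weights."""
    weighted_reward = 0.0
    weight_sum = 0.0
    for p_old, p_new, reward in logs:
        weight = p_new / p_old
        weighted_reward += weight * reward
        weight_sum += weight  # needs the weight of every row, even when reward is 0
    return weighted_reward / weight_sum


# Toy logs (made up): the first row has an importance weight of about 100.
logs = [
    (0.001, 0.1, 1),
    (0.5, 0.5, 0),
    (0.5, 0.5, 1),
    (0.5, 0.5, 0),
]

print(f"SNIPS estimate: {snips_estimate(logs):.3f}")
```

Because the weights appear in both the numerator and denominator, the estimate for a 0/1 reward stays between 0 and 1 even when a single weight is very large.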

Which works better? At a recent RecSys 2021 tutorial, Yuta Saito compared various methods via experiments on synthetic data generated via Open Bandit Pipeline with 10 possible actions. He also assessed the direct method (DM) which we didn’t discuss. In a nutshell, DM trains a model to impute missing rewards. Think of it as similar to building an environment model for reinforcement learning, such as OpenAI gym or Criteo reco-gym, which we can then use to train and evaluate our RL models.
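
To give a feel for DM, here's a deliberately simple sketch. Instead of a trained regression model, it uses the mean logged reward per (context, action) pair as the reward model, then averages the predicted reward under the new policy. The function names, the (context, action, reward) tuple layout, the global-mean fallback, and the toy data are all assumptions; this is not the setup used in the tutorial.

```python
from collections import defaultdict


def fit_reward_model(logs):
    """Mean logged reward per (context, action), a stand-in for a trained model."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for context, action, reward in logs:
        sums[(context, action)] += reward
        counts[(context, action)] += 1
    global_mean = sum(reward for _, _, reward in logs) / len(logs)

    def predict(context, action):
        key = (context, action)
        # Fall back to the global mean for pairs never seen in the logs.
        return sums[key] / counts[key] if counts[key] else global_mean

    return predict


def direct_method_estimate(logs, new_policy, predict):
    """Average, over logged contexts, of the predicted reward under new_policy."""
    total = 0.0
    for context, _, _ in logs:
        for action, prob in new_policy[context].items():
            total += prob * predict(context, action)
    return total / len(logs)


# Toy logs (made up): (context, recommended item, reward).
logs = [
    ("pixel", "iphone", 1),
    ("pixel", "iphone", 0),
    ("pixel", "galaxy", 0),
    ("pixel", "galaxy", 0),
]
new_policy = {"pixel": {"iphone": 0.6, "galaxy": 0.4}}

predict = fit_reward_model(logs)
print(f"DM estimate: {direct_method_estimate(logs, new_policy, predict):.3f}")
```

Unlike IPS, this needs no logged probabilities, but the estimate is only as good as the reward model.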

Comparing various IPS estimators and the Direct Method

He found that IPS outperformed DM as the amount of logged data increased, and that CIPS didn’t perform much better than IPS. Overall, SNIPS performed the best (i.e., had the least error) without the need for any parameter tuning. The tutorial goes on to discuss other estimators such as Doubly Robust (combining DM and IPS) as well as counterfactual learning. I highly recommend checking it out.

Nonetheless, one downside of SNIPS is that it requires computing the importance weight for all observations; in IPS, we only need the importance weight for observations with non-zero reward. If we consider how most recommendations have zero reward (<10% CTR or conversion), SNIPS increases storage requirements of recommendation probabilities and computation of importance weights by 10x or more. That said, the authors of SNIPS found that the increase in computation is made up for via faster convergence.

Another tool in our recsys evaluation toolbox

Let me conclude by clarifying that I’m not suggesting that we stop training and evaluating recsys models via the observational paradigm. Despite its limitations, it has several benefits. First, it’s an established evaluation framework with many public datasets and standard metrics. This makes it easier to compare various techniques. Second, we can collect training and evaluation data even before deploying our first recommender. Customer interaction data is generated organically when customers use our platforms. Thus, the conventional offline evaluation approach is a good place to start.

Nonetheless, if you’re keen to try a new evaluation approach, or find your offline metrics diverging from online A/B testing outcomes, consider counterfactual evaluation via SNIPS. In addition, though I’ve been discussing counterfactual evaluation in the context of recsys, it’s also applicable to other use cases where you want to simulate A/B tests offline.

Thanks to Arnab Bhadury, Vicki Boykis, and Yuta Saito for reading drafts of this.

References

The Central Role of Propensity in Observational Studies for Causal Effects (IPS)

Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization (CIPS)

The Self-Normalized Estimator for Counterfactual Learning (SNIPS)

SIGIR 2016 Tutorial on Counterfactual Evaluation and Learning

RecSys2021 Tutorial Counterfactual Learning and Evaluation

Open Bandit Pipeline: A Research Framework for Off-Policy Evaluation & Learning

Simulating A/B tests offline with counterfactual inference

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Apr 2022). Counterfactual Evaluation for Recommendation Systems. eugeneyan.com. https://eugeneyan.com/writing/counterfactual-evaluation/.

or

@article{yan2022counterfactual,
  title   = {Counterfactual Evaluation for Recommendation Systems},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2022},
  month   = {Apr},
  url     = {https://eugeneyan.com/writing/counterfactual-evaluation/}
}
