Building machine learning models is fairly easy nowadays, but often, making good predictions is not enough. On top of that, we want to make **causal statements** about **interventions**. Knowing with high accuracy that a customer will leave our company is good, but knowing what to do about it (for example, sending a coupon) is much better. This is a bit more involved, and I explained the basics in my other article.

I recommend reading that article before you continue. There, I showed how you can arrive at causal statements whenever your features form a **sufficient adjustment set**, which I will also assume for the rest of this article.

The estimation works using so-called **meta-learners**. Among them are the S- and the T-learner, each with its own set of disadvantages. In this article, I will show you another approach that can be seen as a tradeoff between these two meta-learners and that can give you better results.

Let us assume that you have a dataset (*X*, *t*, *y*), where *X* denotes some features, *t* is a binary treatment indicator, and *y* is the outcome. Let us briefly recap how the S- and T-learners work and when they don’t perform well.
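To make the setup concrete, here is a minimal sketch of such a dataset. The data-generating process is entirely hypothetical (a single informative feature, a constant true treatment effect of 2); it only serves to give us something to fit the learners on:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000
X = rng.normal(size=(n, 3))             # features
t = rng.integers(0, 2, size=n)          # binary treatment
# Outcome: depends on the first feature, plus a true treatment effect of 2.
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)
```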

## S-learner

If you use an S-learner, you fix a single model *M* and train it on the dataset such that *M*(*X*, *t*) ≈ *y*. Then, you compute

Treatment Effects = M(X, 1) – M(X, 0)

and that’s it.
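As a minimal sketch, assuming the synthetic dataset from above (any scikit-learn regressor would work as the single model *M*; the gradient boosting choice here is just an example):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical synthetic data with a known true treatment effect of 2.
n = 1000
X = rng.normal(size=(n, 3))
t = rng.integers(0, 2, size=n)
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)

# S-learner: one model trained on the features plus the treatment flag.
M = GradientBoostingRegressor()
M.fit(np.column_stack([X, t]), y)

# Treatment effects = M(X, 1) - M(X, 0).
effects = (
    M.predict(np.column_stack([X, np.ones(n)]))
    - M.predict(np.column_stack([X, np.zeros(n)]))
)
print(effects.mean())  # should be close to the true effect of 2
```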

The problem with this approach is that the model could choose to ignore the feature *t* completely. This typically happens if you already have hundreds of features in *X*, and *t* drowns in this noise. If this…