Notes on the Elastic Net
Almost two months ago, I submitted my final undergraduate paper, Targeted Direct Mailings: An Application of Elastic Net Regularization to Marketing Strategy. In it, I presented a quantitative approach to marketing strategy for the Paralyzed Veterans of America (PVA) based on the well-known KDD Cup 1998 dataset. While the paper isn’t something I plan to share publicly, there are some ideas in it that I’ve been wanting to post here, if only for my own records.
From the KDD Cup abstract, here’s a five-second rundown of the problem:
The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.
In practice, this meant most competitors focused on training binary classifiers that predicted whether a past donor was likely to respond to a new funding request. With such a model, they could then recommend exactly who should be sent a mailer to maximize expected donation profits (a small up-front cost was associated with mailing each request).
Coming from a business school background, I’ve encountered this kind of churn problem a lot. Fortunately, it’s an awesome exercise. I’ve grown to love how expansive the problem can be and how it always seems to provide an opportunity to try out something new. So without further ado, here’s one of my most recent attempts at it: an experiment in applying elastic net regularization.
Thinking About the Whys
Disclaimer: I’m not a machine learning, statistics, finance, or marketing expert. If you’re here for a lecture on theory, you’ve made a wrong turn. All the same, I hope that by talking about a problem full stack, I can better record some of the more important aspects of how I thought about integrating the elastic net into my data science toolkit.
Ready, set… wait, wait. There are some things I really need to mention before jumping into modeling. The PVA problem exists in the real world, is sponsored by a real organization with real goals, and concerns real people. So we ought to think really hard about what we want to do and why we want to do it with the elastic net.
From my paper:
Identifying a target cohort is an important aspect of any concentrated marketing campaign. This holds especially true for operations that necessitate up-front, per-unit costs to interact with prospective consumers. Furthermore, while anyone can be adversely affected by poor targeting, the budgetary austerity of non-profit groups often behooves even more focused planning.
That seems like a good start. Now there’s a reason to care about the efficiency of a good model, but why the elastic net?
The key focus here is the width of our data set. If we only have ten data points on each consumer, we leave ourselves a lot of options when it comes to building a regression or applying machine learning methods. On the other hand, if we begin with a couple hundred attributes, feature selection becomes a necessary pre-modeling task when building an appropriately complex classifier. The PVA set has 481 columnar variables, so what should we do?
Certainly the headache of feature selection is nothing new. In both machine learning and traditional statistics circles, countless approaches exist and are widely used for pre-pruning wide design matrices. But it just doesn’t always feel right to discard information like that. Isn’t there a way to make feature selection and model training one and the same?
Here’s my summary of how a few researchers answered that call:
The elastic net regularization method enables us to combine the benefits of the lasso penalty and the ridge penalty. Under the $\ell_1$ penalization, $\boldsymbol{\beta}$ becomes effectively sparse as many coefficients converge to zero, while the $\ell_2$ penalization controls for multicollinearity. […] We can expect two key benefits from this approach: automated variable selection and handling of groups of correlated variables.
So in short, we’re going to try out the elastic net because it can help us go from “too many” variables to “just enough” (thanks to the lasso) without making too many assumptions up front (thanks to ridge regression).
And that, I think, responsibly sets the stage for introducing the bread and butter of this post. I’m not going to consider the PVA case from this point on (the data is a bit too unwieldy for a blog post), but I’ll jump back in with another dataset that should be just as “real.”
The Statlog German Credit data set from the UCI Machine Learning Repository is a nice little example that we can use for a simpler test case. The data set classifies debtors of a bank based on their history as either a good or bad credit risk. Inside, it gives us 1,000 independent observations with which to work.
Moving forward, let’s pretend our job is to build a model that will help the bank’s financial advisors decide whether an applicant is at high risk of defaulting on a loan.
As noted in the source abstract, classifying a borrower as low-risk when in fact they will be a bad debtor is costly; we’ll imagine this scenario means the bank loses its entire investment. On the other hand, granting a loan to a good debtor is profitable, so we want to do it as much as possible; let’s say in this scenario the bank earns 15% of the cash advance through interest. For simplicity, we won’t consider the opportunity cost of denying would-be good loans.
The following cost matrix should sum up all of these assumptions:

| | True High-Risk (0) | True Low-Risk (1) |
|---|---|---|
| Predicted High-Risk (0) | - | - |
| Predicted Low-Risk (1) | -1.00 | +0.15 |
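One way to read this matrix (my own back-of-envelope interpretation, not part of the original abstract): if $p$ is an applicant’s probability of truly being high-risk, the expected profit of granting the loan is

$$E[\text{profit} \mid \text{grant}] = 0.15\,(1 - p) - 1.00\,p,$$

which is positive only when $p < 0.15 / 1.15 \approx 0.13$. The asymmetric payoffs mean the bank should be quite conservative about lending.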
Satisfyingly, the Statlog data set is relatively clean, with no `NA` values. Originally, the repo came with 21 attributes, including the binary response `risk`. However, I have made some minor changes to the data; you can check out my mutations in the script here. Similarly, take a look at the data dictionary for a deeper look at what the feature set describes (hint: it’s personal info and past credit history).
To avoid overfitting, I’ve further partitioned the data into training and validation sets. The training partition, representing 80% of the available data, will be used for model learning and cutoff optimization. I’ll save the remaining 20% of data as a held-out partition upon which we will assess the performance of our model. To create these partitions, I’ve used stratified sampling without replacement in order to preserve the original class imbalance and not leak information across samples. Here’s the breakdown:
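For the record, a stratified 80/20 split like this can be sketched in base R. The data frame and column names here (`german`, `risk`) are stand-ins I’ve chosen so the snippet runs on its own; with the real cleaned data you would skip the toy construction and split your actual data frame.

```r
# Toy stand-in for the cleaned Statlog data: 1,000 rows, a binary `risk`
# response with the real set's 300/700 class imbalance, one dummy feature.
german <- data.frame(
  risk = factor(rep(c(0, 1), times = c(300, 700))),
  x1   = rnorm(1000)
)

# Stratified 80/20 split without replacement: sample row indices within
# each class of `risk` so both partitions preserve the class balance.
set.seed(1017)
train_idx <- unlist(lapply(
  split(seq_len(nrow(german)), german$risk),
  function(rows) sample(rows, size = floor(0.8 * length(rows)))
))

train <- german[train_idx, ]
valid <- german[-train_idx, ]

table(train$risk)  # 240 high-risk, 560 low-risk
table(valid$risk)  # 60 high-risk, 140 low-risk
```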
| | High-Risk (0) | Low-Risk (1) |
|---|---|---|
| Training (80%) | 240 | 560 |
| Validation (20%) | 60 | 140 |
To follow along from here, you can grab the clean `train` and `valid` partitions as data frames from the .rds file available in this post’s R project directory.
Modeling with Elastic Nets
To begin our in-sample modeling process, we need to define an initial design matrix for the regression. While logistic methods relax some of linear regression’s major assumptions about variable relationships and error distribution, we should still give careful consideration to multicollinearity. However, as we will soon discover, the elastic net internalizes a sort of feature selection by penalizing correlated predictors.
For rhetorical purposes, let’s then not only avoid pre-modeling feature selection, but in fact expand the dimensionality of our model to include every main effect and two-way interaction term. Our design matrix then expands to over a thousand (in this case, 1,063) covariates.
We can do this in R using:

```r
# `.` adds every main effect, `.^2` every two-way interaction,
# and `- 1` drops the intercept column (glmnet supplies its own)
t_mm <- model.matrix(risk ~ . + .^2 - 1, data = train)
```
But don’t be confused: we can still define this initial regression model as usual with:

$$\mathrm{logit}(p_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$

Or even more simply, using the matrix expression:

$$\mathrm{logit}(\mathbf{p}) = \mathbf{X}\boldsymbol{\beta}$$

Of course, since this is a binary classification problem, we should remember that what the model predicts is really the probability that $y_i = 1$. This is implicit in regressions using the logit link function. In the general form, we can express the probability as:

$$p_i = \Pr(y_i = 1 \mid \mathbf{x}_i) = \frac{1}{1 + e^{-(\beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta})}}$$
From here, a model could be developed as usual by estimating each coefficient via the standard (unpenalized) maximum likelihood objective. However, today we will prefer the following penalized objective function, otherwise known as the elastic net:

$$\hat{\boldsymbol{\beta}} = \underset{\beta_0,\,\boldsymbol{\beta}}{\arg\min}\; -\frac{1}{n}\sum_{i=1}^{n} \log L\!\left(y_i,\; \beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta}\right) + \lambda \left[ (1-\alpha)\,\tfrac{1}{2}\lVert \boldsymbol{\beta} \rVert_2^2 + \alpha \lVert \boldsymbol{\beta} \rVert_1 \right]$$
As the earlier excerpt described, the $\ell_1$ term drives $\boldsymbol{\beta}$ toward sparsity as many coefficients converge to zero, while the $\ell_2$ term controls for multicollinearity. Furthermore, in implementations like glmnet, we can control the effect of each penalty through the mixing parameter $\alpha$: $\alpha = 1$ recovers the lasso, $\alpha = 0$ recovers ridge regression, and values in between blend the two.
I typically implement elastic net models using the `glmnet` R package developed by Trevor Hastie (he and Hui Zou first proposed and described the elastic net in this paper). Using the software, we can k-fold cross-validate a model on our training set. Additionally, the software natively enables us to search for optimal $\lambda$ values.
```r
library(glmnet)

set.seed(1017)
cv_net <- cv.glmnet(
  t_mm,
  y = train$risk,
  alpha = 0.5,
  family = "binomial",
  type.measure = "auc"
)
```
In fact, if processing time is of little concern, we could even search across a grid of candidate $\alpha$ and $\lambda$ pairs. If $p \gg n$ (a scenario in which the elastic net excels), this additional search may reveal striking differences in the value of `type.measure` across $(\alpha, \lambda)$ coordinates.
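That grid search might be sketched as follows. The simulated `X` and `y` are stand-ins so the snippet runs on its own; with the real data you would substitute the `t_mm` matrix and `train$risk` from earlier.

```r
library(glmnet)

# Toy stand-ins for the real design matrix and response.
set.seed(1017)
X <- matrix(rnorm(300 * 20), 300, 20)
y <- factor(rbinom(300, 1, plogis(X[, 1])))

# Fix fold assignments so every alpha is cross-validated on the same
# partitions, keeping the AUC values comparable across the grid.
foldid <- sample(rep(1:10, length.out = nrow(X)))

alphas <- seq(0, 1, by = 0.25)
cv_fits <- lapply(alphas, function(a) {
  cv.glmnet(X, y, alpha = a, foldid = foldid,
            family = "binomial", type.measure = "auc")
})

# Best cross-validated AUC achieved at each alpha (`cvm` holds the
# mean cross-validated measure along the lambda path).
data.frame(alpha = alphas, best_auc = sapply(cv_fits, function(f) max(f$cvm)))
```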
But let’s take a step back and see how our model is evolving behind the scenes. Here’s what the search looks like for our arbitrarily chosen $\alpha = 0.5$ model:
glmnet heuristically chooses a starting $\lambda$, incrementally dialing down the parameter as the program searches for an optimal value (in this case, the point that maximizes AUC). In the plot above, we can follow along with the process by tracing the ribbon of points from right to left, noting the peak around $\lambda = 0.046$.
Okay, so somehow the model gets better as we trace along $\lambda$, but what’s actually changing? To understand this, we can take a look at the following plot of covariates and their associated coefficients $\beta_j$ at each value of $\lambda$:
Note that each line in the plot represents a single covariate in our design matrix. But what is going on? Recall that $\lambda$ weights the regularization term in the objective function. When $\lambda = 0$, there is no penalty and all of our variables are included in the final model. On the other hand, as $\lambda$ grows, coefficients are more heavily penalized and converge toward zero; therefore, for “large” values of $\lambda$ (for example, 0.2), we notice most $\beta_j = 0$, so only a few covariates actually determine the prediction.
To reiterate: first, we construct a design matrix $\mathbf{X}$ that is potentially very wide. Then, we investigate different penalties $\lambda$, searching for a value that optimizes our metric of choice. Finally, using that “best” penalty, we retrieve a model that, in all likelihood, is much sparser. Internalized feature selection!
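To make that sparsity concrete, here’s a small self-contained demonstration with simulated stand-in data: only the first two covariates truly influence the response, and the elastic net should discard most of the rest.

```r
library(glmnet)

# Simulated stand-in: 200 observations, 50 covariates, but only the
# first two actually influence the binary response.
set.seed(1017)
X <- matrix(rnorm(200 * 50), 200, 50)
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))

cv_fit <- cv.glmnet(X, y, alpha = 0.5, family = "binomial")

# Coefficients at the optimal penalty: most entries are exactly zero.
b <- coef(cv_fit, s = "lambda.min")
sum(b != 0)  # count of surviving terms (including the intercept)
```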
A bit more involved than OLS, huh?
While that just about covers the defining characteristics of elastic nets, it should be obvious that a “real” problem can’t stop here. For example, one of our next steps ought to be identifying an optimal cutoff probability, the threshold above which we make a positive prediction. But few of these steps are specific to the elastic net, so I won’t describe them here.
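As a taste of that cutoff step, here is one hedged sketch connecting it back to the cost matrix: sweep candidate cutoffs over predicted probabilities and keep the one maximizing expected profit under the -1.00 / +0.15 assumptions. The simulated `truth` and `p_low` vectors are stand-ins for real model output.

```r
# Toy stand-ins: true classes (1 = low-risk) and predicted probabilities
# of being low-risk, as a fitted model would produce on the training set.
set.seed(1017)
truth <- rbinom(500, 1, 0.7)
p_low <- plogis(rnorm(500, mean = ifelse(truth == 1, 1, -1)))

# Profit of lending to everyone predicted low-risk at a given cutoff:
# +0.15 per good loan granted, -1.00 per bad loan granted.
profit_at <- function(cutoff) {
  grant <- p_low >= cutoff
  sum(ifelse(truth[grant] == 1, 0.15, -1.00))
}

cutoffs <- seq(0.05, 0.95, by = 0.01)
profits <- sapply(cutoffs, profit_at)
cutoffs[which.max(profits)]  # the profit-maximizing threshold
```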
However, I want to concede that this post insufficiently discusses a few statistical assumptions related to the elastic net. Perhaps the most important of these is the interpretation of the elements of $\boldsymbol{\beta}$; while each $\beta_j$ is regular in how it “contributes” a covariate’s information to a prediction, its associated p-value is not so ordinarily interpretable. In fact, debate continues as to the reliability of traditional hypothesis testing on penalized or adaptive models. Another missing note is post-implementation assessment of model multicollinearity; while the elastic net objective function aims to address this, a prudent analysis ought to examine it explicitly.
All the same, I think these small examples (and certainly more rigorous academic applications) demonstrate the usefulness of elastic net regularization in the wild.