Home
Examples Overview
A fake logistic regression example predicting wine sales
A linear model with interactions
Examples
More
This example is based on the training data set found here. We build a model to predict:
The input features are:
These are some parameters controlling the aggregate predictive comparisons:
# We will transitions starting at each of 500 random rows
numForTransitionStart <- 500
# ... going to each of 10,000 other random rows:
numForTransitionEnd <- 10000
# ... keeping only the nearest 100 pairs for each start:
onlyIncludeNearestN = 100
And for the random forest:
# 100 trees for random forest
ntree = 100
The distribution of the inputs (after removing some outliers to make things more manageable):
We'll use a random forest for this example:
set.seed(1)
# Turning the response to type "factor" causes the RF to be build for classification:
credit$SeriousDlqin2yrs <- factor(credit$SeriousDlqin2yrs)
rfFit <- randomForest(SeriousDlqin2yrs ~ ., data=credit, ntree=ntree)
set.seed(1)
apcDF <- GetPredCompsDF(rfFit, credit,
numForTransitionStart = numForTransitionStart,
numForTransitionEnd = numForTransitionEnd,
onlyIncludeNearestN = onlyIncludeNearestN)
This is a table of the aggregate predictive comparisons:
## |Input | PerUnitInput.Signed| PerUnitInput.Absolute| Impact.Signed| Impact.Absolute|
## |:-------------------------------------|--------------------:|----------------------:|--------------:|----------------:|
## |RevolvingUtilizationOfUnsecuredLines | 8.265e-02| 1.590e-01| 0.024470| 0.04707|
## |age | -3.263e-04| 2.496e-03| -0.004750| 0.03633|
## |NumberOfTime30.59DaysPastDueNotWorse | 3.873e-02| 7.057e-02| 0.014864| 0.02708|
## |DebtRatio | 8.705e-03| 1.888e-01| 0.001614| 0.03500|
## |MonthlyIncome | 3.270e-07| 9.794e-06| 0.001104| 0.03307|
## |NumberOfOpenCreditLinesAndLoans | 1.634e-03| 7.746e-03| 0.007147| 0.03387|
## |NumberOfTimes90DaysLate | 1.511e-01| 1.889e-01| 0.021397| 0.02676|
## |NumberRealEstateLoansOrLines | 2.855e-03| 2.294e-02| 0.002085| 0.01675|
## |NumberOfTime60.89DaysPastDueNotWorse | 9.203e-02| 1.853e-01| 0.007781| 0.01567|
## |NumberOfDependents | 4.221e-03| 1.849e-02| 0.004173| 0.01828|
Since the impact values all have the same units, it makes sense to chart them on the same axes:
PlotPredCompsDF(apcDF)
Note (for example) that average absolute value of the change in probability associated with changes in age is much larger than the magnitude of the signed average. This indicates either an interaction effect between age and other inputs, or a non-linear (non-monotonic) relationship between age an probability of default. We'll go into more detail on that in a bit.
It wouldn't make sense to chart the average predictive comparisons (which are per unit change in input) in this way since they don't share units, but we can chart the inputs corresponding to a number of times late for various periods:
PlotPredCompsDF(apcDF[grep("NumberOfTime", apcDF$Input), ],
variant = "PerUnitInput")
As you'd expect, this shows that the effect size (per additional incident) increases with the length of lateness, while the previous chart shows that NumberOfTime30.59DaysPastDueNotWorse
makes more overall difference to the model than NumberOfTime60.89DaysPastDueNotWorse
. This is because its variation is larger (it's non-zero more often). We can see this more clearly looking at the impact chart for just these inputs:
PlotPredCompsDF(apcDF[grep("NumberOfTime", apcDF$Input), ])
The difference between the absolute and signed versions show that there must be some observations where adding an incident of lateness decreases the default probability. This is most pronounced for NumberOfTime30.59DaysPastDueNotWorse
, so we'll look at this input in a bit more detail.
NumberOfTime30.59DaysPastDueNotWorse
in more detailRecall that the data from which the summarized predictive comparisons are computed consists of groups of rows, where with in each group only the input of interest varies (for the point we imagine transitioning to) and the rest are held constant. We can work directly with this data, visualizing it in more detail to better understand our model.
I'll plot the predicted default probability vs. NumberOfTime30.59DaysPastDueNotWorse
, holding the other inputs constant. Each line corresponds to one choice of values for the other inputs:
set.seed(6)
pairs <- GetComparisonDF(rfFit, credit,
u="NumberOfTime30.59DaysPastDueNotWorse",
numForTransitionStart = 20,
numForTransitionEnd = numForTransitionEnd*10,
onlyIncludeNearestN = onlyIncludeNearestN*10)
pairsSummarized <- pairs[c("OriginalRowNumber", "NumberOfTime30.59DaysPastDueNotWorse.B", "yHat2", "Weight")] %.%
group_by(OriginalRowNumber, NumberOfTime30.59DaysPastDueNotWorse.B, yHat2) %.% summarise(Weight = sum(Weight))
ggplot(pairsSummarized, aes(x=NumberOfTime30.59DaysPastDueNotWorse.B, y=yHat2, color=factor(OriginalRowNumber))) +
geom_point(aes(size = Weight)) +
geom_line(size=.2) +
scale_x_continuous(limits=c(0,2)) +
scale_size_area() +
xlab("NumberOfTime30.59DaysPastDueNotWorse") +
ylab("Prediction") +
guides(color = FALSE)
I've made the size of the points proportional to weight that the point receives. The summarized predictive comparisons give more weight to points with more weight, and so should we.
The relationship is mostly as you'd expect, but let's examine the highlighted one in more detail:
Why does the default probability decrease (for this one choice of other input values) when NumberOfTime30.59DaysPastDueNotWorse
increases? Values for all the other inputs are held constant, so let's see what they are:
## | RevolvingUtilizationOfUnsecuredLines| age| DebtRatio| MonthlyIncome| NumberOfOpenCreditLinesAndLoans| NumberOfTimes90DaysLate| NumberRealEstateLoansOrLines| NumberOfTime60.89DaysPastDueNotWorse| NumberOfDependents| NumberOfTime30.59DaysPastDueNotWorse|
## |-------------------------------------:|----:|----------:|--------------:|--------------------------------:|------------------------:|-----------------------------:|-------------------------------------:|-------------------:|-------------------------------------:|
## | 1| 27| 0.1543| 4800| 4| 1| 0| 0| 0| 0|
Hmm, note that NumberOfTimes90DaysLate
(almost always \(0\)) is \(1\) in this case. Looking at the definition of the target variable (SeriousDlqin2yrs
):
SeriousDlqin2yrs
: “Person experienced 90 days past due delinquency or worse.”
As we'd expect, previous instances of 90-days-late are a strong indicator of future ones. But adding a previous 30-days-late (but not more) to someone with a previous 90-days-late seems to decrease the chance of future 90-days-lates. This is sensible – with the 30-days-late (but not more), we have evidence that when you're late, you at least sometimes pay back in under 60 days.
Further exploratory analysis would further improve our understanding:
NumberOfTimes90DaysLate
and NumberOfTime30.59DaysPastDueNotWorse
, or are other inputs involved? We could vary the other inputs and see if we still get this effect.Age
For one more example, let's examine the Age
input in more detail.
set.seed(3)
pairsAge <- GetComparisonDF(rfFit, credit,
u="age",
numForTransitionStart = 20,
numForTransitionEnd = numForTransitionEnd*10,
onlyIncludeNearestN = onlyIncludeNearestN*10)
pairsSummarizedAge <- pairsAge[c("OriginalRowNumber", "age.B", "yHat2", "Weight")] %.%
group_by(OriginalRowNumber, age.B, yHat2) %.%
summarise(Weight = sum(Weight))
ggplot(pairsSummarizedAge, aes(x=age.B, y=yHat2, color=factor(OriginalRowNumber))) +
geom_point(aes(size = Weight)) +
geom_line(size=.2) +
xlab("age") +
ylab("Prediction") +
guides(color = FALSE)
This is a bit of a mess, but we can at least see see that both interaction effects and non-monotonicity are present.
Further exploration would look into which other inputs age is interacting with to determine these differently shaped curves.