Linear association scatter plot

Figure 1: Example of a linear relationship

If two variables have a linear relationship, we can summarise that relationship with a straight line. The line can have either a positive or negative slope but the slope will always remain constant. In this case, we have a positive linear relationship. Another way of looking at this is that an increase in the variable X will result in the same increase in Y regardless of the starting value of X.

On the other hand, for non-linear relationships, the change in variable Y due to a change in variable X would depend on the starting value of X. You can see some examples of these in Figure 2. The age-accident relationship given above could be quadratic. That is, the probability of an accident decreases and then later increases with age. Ultimately, any relationship that cannot be summarised by a straight line is a non-linear relationship. To be precise, these would also include interactions, but we focus on those types of relationships in another article.

Figure 4: scatterplot of linear relationships

Scatterplots like these are a simple way to visualise non-linear relationships but they will not always work. For each chart, we are visualising the relationship between the target variable and only one feature. In reality, the target variable will have relationships with many features. This and the presence of statistical variation mean the points will be spread around the underlying trends. We can already see this in the charts above and, in a real dataset, this will be even worse. Ultimately, to clearly see relationships we need to strip out the effect of other features and statistical variation.

To create a PDP we first have to fit a model to our data. Specifically, we use a random forest with 100 trees. In Table 2, we have two rows in our dataset used to train the model. In the last column, we can see the predicted price of the second-hand car. These are the predictions made by the random forest given the feature values.

These plots provide clearer visualisations of the trends for two reasons. Firstly, by holding the other feature values constant, we can focus on the trend of one feature. That is, how predictions change due to changes in this feature. Secondly, the random forest will model the underlying trends in the data and make predictions using these trends. Hence, as we are plotting predictions, we are able to strip out the effect of statistical variation.

Looking at Figure 10, you can get an idea of the accuracy of the random forest used to create these PDPs. The model is not perfect but it does a fairly good job of predicting car price. In fact, the accuracy of the model is not that important. The goal is to visualise non-linear relationships and not make accurate predictions. However, the better your model the more reliable your analysis will be. An underfitted model may not capture the relationships and an overfitted model may show relationships that are not actually there.

The choice of model is also not that important. This is because PDPs are a model-agnostic technique. In this analysis, we have used a random forest but you can use any non-linear model such as XGBoost or a neural network. Depending on your dataset, different models may be better at capturing the underlying non-linear relationships. Just using PDPs may not be enough to find non-linear relationships.
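
The PDP procedure described in this article (fit a model, hold the other feature values constant, and plot how the predictions change with one feature) can be sketched in a few lines. This is a minimal illustration and not the code used for the article's figures. The synthetic dataset, the feature layout and the helper name partial_dependence_1d are assumptions made for the example; only the random forest with 100 trees comes from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data (not the article's car dataset):
# the target depends quadratically on feature 0 and linearly on
# feature 1, plus noise standing in for statistical variation.
rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(-3, 3, size=(n, 2))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 1, size=n)

# Fit a random forest with 100 trees, as described in the article.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)


def partial_dependence_1d(model, X, feature, grid):
    """Hypothetical helper: average prediction when `feature` is set to
    each grid value while all other features keep their observed values."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite only the feature of interest
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


# Evaluate the partial dependence of the prediction on feature 0.
grid = np.linspace(-3, 3, 13)
pdp_x0 = partial_dependence_1d(model, X, feature=0, grid=grid)
```

Plotting grid against pdp_x0 (for example with matplotlib) gives the PDP for feature 0. Because the curve is built from averaged predictions rather than raw observations, the noise and the effect of the other feature are largely averaged out, which is why the U-shaped trend should be easier to see than in a scatterplot of the same data.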