In this vignette, we present a global variable importance measure based on Partial Dependence Profiles (PDP) for the random forest regression model.
We work with the apartments dataset from the DALEX package.
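The call that produced the preview below is not shown in the text. A minimal sketch, assuming the DALEX package is installed and that the preview is simply the first six rows of the training data:

```r
library("DALEX")

# apartments ships with DALEX; head() shows the first six rows
apartments_preview <- head(apartments)
apartments_preview
```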
#>   m2.price construction.year surface floor no.rooms    district
#> 1     5897              1953      25     3        1 Srodmiescie
#> 2     1818              1992     143     9        5     Bielany
#> 3     3643              1937      56     1        2       Praga
#> 4     3517              1995      93     7        3      Ochota
#> 5     3013              1992     144     6        5     Mokotow
#> 6     5795              1926      61     6        2 Srodmiescie
Next, we fit a random forest regression model and wrap it with the explain() function from DALEX.
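library("DALEX")
library("randomForest")
# Fit the random forest on the training data
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
                                      no.rooms, data = apartments)
# Wrap the model with the test set (columns 2:5 are the four predictors)
explainer_rf <- explain(apartments_rf_model,
                        data = apartmentsTest[, 2:5], y = apartmentsTest$m2.price)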
#> Preparation of a new explainer is initiated
#>   -> model label       : randomForest ( default )
#>   -> data              : 9000 rows 4 cols
#>   -> target variable   : 9000 values
#>   -> predict function  : yhat.randomForest will be used ( default )
#>   -> predicted values  : numerical, min = 2121.14 , mean = 3515.047 , max = 5261.62
#>   -> model_info        : package randomForest , ver. 4.6.14 , task regression ( default )
#>   -> residual function : difference between y and yhat ( default )
#>   -> residuals         : numerical, min = -1227.352 , mean = -3.523581 , max = 2186.873
#> A new explainer has been created!
Let us look at the Partial Dependence Profiles calculated with the DALEX::model_profile() function. PDPs can also be calculated with DALEX::variable_profile() or ingredients::partial_dependence().
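The profile computation itself is not shown in the text. The following is a minimal sketch of how it might look, assuming DALEX and randomForest are installed. The object names `rf_model`, `explainer`, and `profiles` and the seed are illustrative choices, not taken from the original.

```r
library("DALEX")
library("randomForest")

set.seed(2024)  # illustrative seed, not part of the original vignette
rf_model <- randomForest(m2.price ~ construction.year + surface + floor + no.rooms,
                         data = apartments)
explainer <- explain(rf_model,
                     data = apartmentsTest[, 2:5],
                     y = apartmentsTest$m2.price,
                     verbose = FALSE)

# Partial dependence profiles for the numeric predictors in the explainer data
profiles <- model_profile(explainer, type = "partial")

# Aggregated profiles: one row per (variable, grid value) pair
head(profiles$agr_profiles)

# One panel per variable
plot(profiles)
```

In the aggregated profiles, `_x_` holds the grid value of the variable named in `_vname_` and `_yhat_` holds the averaged prediction at that value.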
Next, we calculate a measure of global variable importance based on the oscillation of the PDPs.
The most important variable is surface, followed by no.rooms, floor, and construction.year.
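The text does not spell out how the oscillation measure is computed. The sketch below illustrates one plausible reading: for each variable, build a partial dependence curve and take the mean absolute deviation of that curve from its own average. The helpers `pdp_curve()` and `pdp_oscillation()` are hypothetical, the definition of oscillation is an assumption, and a linear model is used only to keep the example fast and free of extra dependencies.

```r
library("DALEX")  # used here only for the apartments data

# Hypothetical helper: partial dependence curve of `model` for `variable`,
# evaluated on an evenly spaced grid over the observed range
pdp_curve <- function(model, data, variable, grid_size = 20) {
  grid <- seq(min(data[[variable]]), max(data[[variable]]), length.out = grid_size)
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value          # fix the variable at one grid value
    mean(predict(model, newdata = modified))  # average prediction over all rows
  })
}

# Assumed definition of oscillation: mean absolute deviation from the curve average
pdp_oscillation <- function(curve) mean(abs(curve - mean(curve)))

model <- lm(m2.price ~ construction.year + surface + floor + no.rooms,
            data = apartments)
variables <- c("construction.year", "surface", "floor", "no.rooms")

importance <- sapply(variables, function(v) {
  pdp_oscillation(pdp_curve(model, apartmentsTest, v))
})

# Larger values mean the average prediction moves more as the variable changes
sort(importance, decreasing = TRUE)
```

A flat profile gives an oscillation of zero, so under this reading a variable the model ignores receives no importance.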
Next, we create a linear regression model and the corresponding explainer object.
apartments_lm_model <- lm(m2.price ~ construction.year + surface + floor +
                            no.rooms, data = apartments)
explainer_lm <- explain(apartments_lm_model,
                        data = apartmentsTest[, 2:5], y = apartmentsTest$m2.price)
#> Preparation of a new explainer is initiated
#>   -> model label       : lm ( default )
#>   -> data              : 9000 rows 4 cols
#>   -> target variable   : 9000 values
#>   -> predict function  : yhat.lm will be used ( default )
#>   -> predicted values  : numerical, min = 2231.8 , mean = 3507.346 , max = 4769.053
#>   -> model_info        : package stats , ver. 3.6.3 , task regression ( default )
#>   -> residual function : difference between y and yhat ( default )
#>   -> residuals         : numerical, min = -733.2516 , mean = 4.177813 , max = 2107.979
#> A new explainer has been created!
We calculate the Partial Dependence Profiles and the importance measure for the linear model as well.
Now we can compare the order of variable importance across the two models.