The Magic of XGBoost

Understand and Implement XGBoost in Python

Okan Yenigün
Python in Plain English



XGBoost stands for eXtreme Gradient Boosting. It is an ensemble machine learning method in which weak base learners, trained in a chained sequence, learn from each other’s mistakes and achieve good results through small, successive improvements.

Ensemble learning algorithms are very powerful, and XGBoost is the superstar among them. It is widely used because of its ease of use, speed, and performance. It works for both regression and classification problems, and there is a good chance it will turn out to be the best model for your data. That makes it a natural starting choice.

Before continuing, I recommend refreshing your knowledge of the following topics to better understand XGBoost.

Logistic Regression from Scratch

Decision Tree Parameter Explanations

Ridge, Lasso & ElasticNet Regressions From Scratch

AdaBoost from Scratch

Gradient Boosting for Regression from Scratch

Gradient Boosting for Classification

XGBoost is written natively in C++ and provides interfaces for other languages; we will use Python for the demonstration. It is not part of the scikit-learn package: it ships as its own package (with a scikit-learn compatible API).

  • XGBoost avoids over-fitting by using regularization techniques, which helps the model we build generalize well.
  • It automatically handles missing values in the data, learning the most effective way to deal with them.
  • It is fast and efficient, and it can run in parallel.
  • XGBoost stores the gradient statistics (derivatives) in the CPU cache so that it can access them quickly during the computations.
  • It is designed for large and complicated data sets. If the data is too big for the cache and main memory, XGBoost compresses it and minimizes the number of hard-drive accesses required. It is also possible to apply sharding to split the data across different clusters.
  • Cross-validation and early stopping are supported.
  • It can be trained incrementally, meaning you can stop at some point in the iteration, save the model, and continue where you left off later (see the sketch after this list).
  • Tree pruning is optimized: trees are grown deep first and then pruned back in a dedicated pruning step, which yields better results.
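To make the last two bullets concrete, here is a minimal, hedged sketch of early stopping and incremental training with the xgboost Python API; the toy data, parameter values, and file name are my own illustration, not from the original article.

import numpy as np
import xgboost as xgb

# toy regression data, made up purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])
params = {"objective": "reg:squarederror", "eta": 0.3, "max_depth": 6}

# early stopping: stop if the validation metric does not improve for 10 rounds
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dvalid, "validation")], early_stopping_rounds=10)

# incremental training: save the model and continue boosting from where it left off
booster.save_model("model.json")
booster = xgb.train(params, dtrain, num_boost_round=50, xgb_model=booster)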

Let’s explore the use of XGBoost for regression and classification problems.

Regression

Let’s use a dummy data set like the one below, where x is the feature and y is the target.

Dummy Data

As you will remember from other ensemble learning methods, we need to make an initial estimate in the first step. For example, we used the average value in Gradient Boosting. In XGBoost, we always start with 0.5 by default.

Let’s put the initial value on the graph; the residuals will show us how good the prediction is.

A Loss Function is used to quantify how good the prediction is.

Loss Function (squared error, for regression): L(yᵢ, pᵢ) = ½ (yᵢ − pᵢ)²

As in Gradient Boosting, a tree is fit to the residuals. However, instead of regular decision trees, XGBoost uses its own kind of regression tree with a slightly different structure.

After collecting the residuals, a Similarity Score is calculated for each leaf:

Similarity Score = (sum of residuals)² / (number of residuals + λ)

λ is the regularization parameter. Let’s say that our residuals are -10.5, 6.5, 7.5, and -7.5, and λ = 0;

Similarity Score = (-10.5 + 6.5 + 7.5 + (-7.5))² / (4 + 0) = (-4)² / 4 = 4

Now we should find the best split to keep similar residuals together. The first candidate threshold is the average of the two smallest x values.

Split on that threshold and calculate the Similarity Score for each leaf.

Later, we will calculate the gain of the split;

Gain = Left Similarity + Right Similarity - Root Similarity = 120.3

The gain for each candidate split is calculated and compared; the split with the highest gain wins. Let’s assume the split above is the winner.

Now, since the right leaf has three residuals, we can keep splitting it. The left leaf has only one residual, so it can’t be split further. Just as before, we calculate similarity scores for each candidate split and continue with the one that has the highest gain.

We could continue splitting further down, but let’s keep it simple.
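As a quick sanity check on the arithmetic above, here is a minimal Python sketch (plain NumPy rather than the xgboost library; the helper names are my own) that computes similarity scores and the gain of the split that separates the first residual from the other three:

import numpy as np

def similarity(residuals, lam=0.0):
    # Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)
    residuals = np.asarray(residuals, dtype=float)
    return residuals.sum() ** 2 / (len(residuals) + lam)

def gain(left, right, lam=0.0):
    # Gain = Left Similarity + Right Similarity - Root Similarity
    root = np.concatenate([left, right])
    return similarity(left, lam) + similarity(right, lam) - similarity(root, lam)

residuals = [-10.5, 6.5, 7.5, -7.5]
print(similarity(residuals))               # 4.0, as in the worked example
print(gain(residuals[:1], residuals[1:]))  # ~120.33, matching the 120.3 above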

The next step is pruning. Trees are pruned based on their Gain values, using another parameter called gamma (γ). If the difference between the gain and gamma is negative, the branch is removed. Pruning starts from the bottom of the tree.

Let’s pick gamma as 130.

We can’t remove the first (lowest) branch, so we can’t remove the root either, even though the root’s gain is lower than gamma. If gamma were 150, we would remove both the first branch and the root.

Above, we didn’t use any regularization (λ was 0). Now let’s add the regularization effect: λ = 1.

As you may have noticed, regularization lowers the similarity values (and therefore the gains). The fewer residuals a leaf has, the larger this reduction is, so the regularization effect is stronger for small leaves. Because of this decrease, tree pruning becomes more aggressive, which protects against over-fitting.

Note that the gain can be negative. In that case the algorithm will still prune the branch even if gamma is set to 0, so setting gamma to 0 does not turn pruning off.
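The pruning rule itself is just a comparison; a one-line sketch of the decision (my own illustration):

def should_prune(gain_value, gamma):
    # a branch is removed when gain - gamma is negative
    return gain_value - gamma < 0

print(should_prune(120.3, 130))  # True for the root in isolation, but the root
                                 # is only removed if its child branch was pruned first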

Next, we calculate the output value of each leaf:

Output Value = (sum of residuals) / (number of residuals + λ)

Let’s assume λ is 0 again. By calculating the outputs, we complete our first tree. The learning rate defaults to 0.3 in the XGBoost implementation.

Prediction = init_pred + learning_rate * output_from_the_first_tree = 0.5 + 0.3 * (-10.5) = -2.65

The initial prediction was 0.5 in the first round. Now we have taken a small step toward the actual value.

After each iteration,

prediction = init_pred + lr * o1 + lr * o2 + lr * o3 …
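Continuing the sketch above, the leaf output and the prediction update can be written out directly; the numbers are the ones from the example, and the helper names are my own:

def output_value(residuals, lam=0.0):
    # Output Value = (sum of residuals) / (number of residuals + lambda)
    return sum(residuals) / (len(residuals) + lam)

init_pred = 0.5
learning_rate = 0.3

leaf_output = output_value([-10.5])              # the leaf holding the first observation
prediction = init_pred + learning_rate * leaf_output
print(prediction)                                # 0.5 + 0.3 * (-10.5) = -2.65

# with more trees the scaled outputs keep accumulating:
# prediction = init_pred + lr * o1 + lr * o2 + lr * o3 ...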

Classification

Let’s use a binary classification dummy data set for demonstration.

Dummy Data

As in the regression problem, we will start with 0.5 (50%) as our initial prediction.

The loss function for the classification is:

Loss Function (negative log-likelihood): L(yᵢ, pᵢ) = −[yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

Similarity Score:

Similarity Score = (sum of residuals)² / (Σ [previous probability × (1 − previous probability)] + λ)

For classification, the previously predicted probability is taken into account in the denominator.

Our residuals are -0.5, 0.5, 0.5, -0.5.

Again, the candidate thresholds are the averages of adjacent observations, and the split with the higher gain wins.

XGBoost enforces a minimum threshold on the amount of residual weight in each leaf, determined by a quantity called Cover. Cover is the denominator of the Similarity Score without the λ term; for classification it depends on the previously predicted probability of each residual in the leaf.

Cover = Σ [previous probability × (1 − previous probability)] (for classification; for regression, Cover is simply the number of residuals in the leaf)

The default threshold value for Cover is 1. So, if the calculated Cover of a leaf is lower than 1, the algorithm does not allow that leaf.

The child leaves are removed. Let’s go one level up;

Since the Cover value is lower than 1 here too, these leaves are not allowed either. We are left with only the root, but in XGBoost a tree has to be more than a single root.

We can change the Cover threshold (min_child_weight). Let’s set it to 0.
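To make the Cover calculation concrete, here is a tiny sketch (my own illustration): with the default initial probability of 0.5, every residual contributes 0.5 × (1 − 0.5) = 0.25 to Cover, so small leaves fall below the default threshold of 1 unless min_child_weight is lowered.

def cover(prev_probs):
    # classification Cover = sum of p_prev * (1 - p_prev) over the residuals in a leaf
    return sum(p * (1 - p) for p in prev_probs)

print(cover([0.5]))       # 0.25 -> below the default min_child_weight of 1, leaf not allowed
print(cover([0.5, 0.5]))  # 0.5  -> still below 1
print(cover([0.5] * 4))   # 1.0  -> allowed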

In the next step, pruning, the difference between gain and gamma is the decision threshold, as in regression problems.

After pruning, the output value of each leaf is calculated:

Output Value = (sum of residuals) / (Σ [previous probability × (1 − previous probability)] + λ)

Assume that λ is 0;

Before making a new prediction, we should convert the initial probability to odds.

odds = p / (1 − p)

Our initial p is 0.5. So, odds = 0.5 / (1 − 0.5) = 1. Then, log(odds) = 0.

New log(odds) = previous log(odds) + learning_rate * output

New log(odds) = 0 + 0.3 * (-2) = -0.6

Our new prediction, converted back to a probability:

probability = e^log(odds) / (1 + e^log(odds)) = e^(-0.6) / (1 + e^(-0.6)) ≈ 0.35

A small step for XG but a big one for us.
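A minimal sketch of this probability / log(odds) round trip, using the example’s numbers (my own illustration, not library code):

import math

def prob_to_log_odds(p):
    return math.log(p / (1 - p))

def log_odds_to_prob(z):
    return math.exp(z) / (1 + math.exp(z))

init_prob = 0.5
learning_rate = 0.3
leaf_output = -2                       # leaf output value from the example

new_log_odds = prob_to_log_odds(init_prob) + learning_rate * leaf_output
print(new_log_odds)                    # 0 + 0.3 * (-2) = -0.6
print(log_odds_to_prob(new_log_odds))  # ~0.35, a step away from the initial 0.5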

Mathematical Overview

T is the number of terminal nodes (leaves) in a tree, and γ is a user-defined penalty per leaf that encourages pruning. Pruning takes place after the tree is built.

By increasing λ, the optimal output value gets closer to zero. That’s what we expect from regularization.

Second Order Taylor Approximation is used to solve the equation.

g is used to represent the first derivative and h is used for the second derivative.

Since the plain loss term does not involve the output value, we can drop it during the minimization.

To find the output value that minimizes this expression, we take its derivative with respect to the output value and set it equal to 0.

For Regression, the loss function;

After replacing and doing the math operations:

For Classification, the loss function;

XGBoost multiplies the equation by -1 to keep things simple. Doing so flips the parabola about the horizontal axis, turning the minimization into a maximization.

This maximum point is the Similarity Score. We had the optimal output value O above (the sum of the g’s divided by the sum of the h’s plus λ). After substituting this expression for O and doing some algebra, we get:

Replace g and h for Regression case, then it becomes;

for Classification;
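To collect the pieces in one place, here is a compact restatement of the derivation above (a sketch of the standard XGBoost math using the notation already introduced, not a reproduction of the original figures):

For a single leaf with output value O, the objective to minimize is

Σ L(yᵢ, pᵢ + O) + ½ λ O² (plus γT for the tree as a whole)

The second-order Taylor approximation of each loss term is

L(yᵢ, pᵢ + O) ≈ L(yᵢ, pᵢ) + gᵢ O + ½ hᵢ O²

Dropping the constant L(yᵢ, pᵢ) terms, taking the derivative with respect to O, and setting it to zero gives

Output Value = −(Σ gᵢ) / (Σ hᵢ + λ)

and substituting this back in (after multiplying by −1) gives, up to a constant factor that does not affect comparisons,

Similarity Score = (Σ gᵢ)² / (Σ hᵢ + λ)

For regression, gᵢ = −(yᵢ − pᵢ) = −residualᵢ and hᵢ = 1, so both formulas reduce to the (Σ residuals)² / (number of residuals + λ) form used earlier. For classification, gᵢ = pᵢ − yᵢ and hᵢ = pᵢ(1 − pᵢ), which gives (Σ residuals)² / (Σ pᵢ(1 − pᵢ) + λ).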

Approximate Greedy Algorithm

As I mentioned before, XGBoost was developed for large datasets, where evaluating every possible split threshold would take forever. Instead, it uses an approximate greedy algorithm: it makes the locally best decision without looking ahead to check whether it is the best choice in the long term.

Data is divided into quantiles, and each quantile boundary becomes a candidate threshold for a split. Increasing the number of quantiles increases accuracy (not indefinitely, of course) but also increases computation time, so there is a trade-off. By default, about 33 (weighted) quantiles are used in the XGBoost model.

Weighted Quantiles

XGBoost splits the data into small pieces and distributes them across a computer network. The Quantile Sketch algorithm combines the summaries from each computer and builds a histogram. The histogram’s quantiles carry weights, and the sum of the weights is approximately equal in each quantile.

An example of weighted quantiles

These weights are derived from the Cover values, that is, from the 2nd derivative of the loss function. Recall that the 2nd derivative of the regression loss is 1, so for regression all weights are 1 and the quantiles are ordinary quantiles containing an equal number of observations. For classification, the weights are calculated from the previously predicted probabilities.

If our predictions are not confident (around 0.5), their weights become relatively large, and vice versa for confident predictions. With this ingenious idea, the observations we are unsure about end up in their own, smaller bins with more candidate thresholds.

Note that, weighted quantiles are used only when the data is huge.
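As a conceptual sketch of weighted quantiles (a simplified illustration of the idea, not XGBoost’s actual sketch algorithm; all names and numbers are my own), the bin boundaries can be chosen so that each bin carries roughly the same total weight:

import numpy as np

def weighted_quantile_thresholds(x, weights, n_bins):
    # sort the feature values and accumulate their weights
    order = np.argsort(x)
    x_sorted, w_sorted = np.asarray(x)[order], np.asarray(weights)[order]
    cum_w = np.cumsum(w_sorted)
    # cut points where the cumulative weight crosses equal fractions of the total
    targets = cum_w[-1] * np.arange(1, n_bins) / n_bins
    idx = np.searchsorted(cum_w, targets)
    return x_sorted[np.minimum(idx, len(x_sorted) - 1)]

# uncertain predictions (p close to 0.5) get weight p * (1 - p) close to 0.25,
# confident ones get smaller weights, so uncertain regions receive more thresholds
p_prev = np.array([0.5, 0.48, 0.52, 0.9, 0.95, 0.1])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(weighted_quantile_thresholds(x, p_prev * (1 - p_prev), n_bins=3))  # [2.0, 3.0]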

Sparsity-Aware Split Finding

We can still calculate residuals even if we have missing feature values in our data set: we simply take the difference between the actual target values and the initial prediction of 0.5. At this point we haven’t done anything with the features yet.

Once we build the tree and start making splits, we divide the observations into two groups: the ones with a value for the feature and the ones where it is missing.

During the gain calculation, the residuals of the observations with missing values are added to each leaf of the split built from the non-missing data, once to the left leaf and once to the right, for every candidate split. The direction with the maximum gain becomes the default path for all future observations that are missing values for that feature.

Thanks to sparsity-aware split finding, the missing-data problem is handled while training the model. When a new observation arrives with missing values at prediction time, it is simply sent down the default path, and prediction continues without any problems.
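A minimal, hedged example showing that the Python API accepts np.nan as missing values out of the box (the toy data and parameters are my own):

import numpy as np
import xgboost as xgb

# toy data with missing feature values (np.nan)
X = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan], [6.0]])
y = np.array([1.1, 1.9, 1.0, 4.2, 2.8, 6.1])

# np.nan is treated as missing by default; at each split the training procedure
# learns a default direction for missing values (sparsity-aware split finding)
model = xgb.XGBRegressor(n_estimators=20, max_depth=2)
model.fit(X, y)

print(model.predict(np.array([[np.nan]])))  # follows the learned default paths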

Python

pip install xgboost

Let’s take a look at some important parameters before starting to implement the model. You can find the documentation and all the parameters here.

  • objective: default = reg:squarederror

We select the objective the model optimizes, which determines how it evaluates how good it is. For binary classification we can choose binary:logistic (or reg:logistic), for example.

  • base_score: default=0.5

The initial prediction.

  • eval_metric: the default value changes depending on the objective.

This is the evaluation metric. We can choose rmse (root mean square error), or logloss (negative log-likelihood), for example.

  • booster: default=gbtree

You can specify which type of base learner to use: gbtree (default), gblinear, or dart.

  • eta: default=0.3

The learning rate

  • gamma: default=0

This is min_split_loss, a pruning parameter: the minimum gain required to make a further split.

  • max_depth: default=6

The maximum depth a tree is allowed to have.

  • min_child_weight: default = 1

The minimum sum of instance weights (the Cover) a child leaf needs.

  • lambda: default=1

L2 regularization term.

  • alpha: default=0

L1 regularization term.

Code
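The snippet below only shows the evaluation step, so here is a hedged sketch of the kind of training code it could follow; the stand-in dataset, variable names, and parameter values are my own assumptions for illustration, not from the original article.

import xgboost as xgb
from sklearn.datasets import make_regression   # stand-in data; the article's dataset is not shown
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

reg = xgb.XGBRegressor(objective="reg:squarederror", learning_rate=0.3, max_depth=6,
                       gamma=0, min_child_weight=1, reg_lambda=1, reg_alpha=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# for classification, the analogous sketch would be:
# clf = xgb.XGBClassifier(objective="binary:logistic", eval_metric="logloss")
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)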

import numpy as np
from sklearn.metrics import mean_squared_error, confusion_matrix

# Regression error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %f" % rmse)
# RMSE: 76.228877

# Classification report
confusion_matrix(y_test, y_pred)
# array([[1506,    7],
#        [  17, 1470]])

Thank you for your attention so far. To stay updated on my new blog posts, feel free to follow me. If you have any questions or comments, please don’t hesitate to reach out to me!


References

https://xgboost.readthedocs.io/en/stable/

https://en.wikipedia.org/wiki/XGBoost

https://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf

https://www.youtube.com/watch?v=OtD8wVaFm6E&t=2s
