CatBoost: A machine learning library to handle categorical (CAT) data automatically


Introduction

How many of you have seen this error while building your machine learning models using “sklearn”?

I bet most of us! At least in the initial days.

This error occurs when dealing with categorical (string) variables. In sklearn, you are required to convert these categories into a numerical format.
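The original screenshot of the error is not reproduced here. Assuming it is the usual case of a string column being passed to a numeric estimator, the message is most likely Python's `ValueError: could not convert string to float`. The sketch below triggers that same failure in plain Python, without sklearn; `to_numeric_row` is a hypothetical helper that stands in for the numeric conversion an estimator applies to its input.

```python
def to_numeric_row(row):
    """Cast every value in a feature row to float, as a numeric model must."""
    return [float(value) for value in row]


# A row with only numeric features converts cleanly.
numeric_row = to_numeric_row([35, 52000.0])

# A row containing a string category cannot be converted.
try:
    to_numeric_row([35, "Bangalore"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(numeric_row)
print(error_message)
```

The second call fails because there is no meaningful float value for a city name, which is why the categories have to be encoded first.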

To do this conversion, we use pre-processing methods such as “label encoding”, “one-hot encoding” and others.
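To make the two encodings concrete, here is a minimal sketch in plain Python. The helper names `label_encode` and `one_hot_encode` and the sample city list are made up for illustration; in practice you would normally reach for the encoders that sklearn or pandas provide.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each value into a 0/1 indicator row, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


cities = ["Delhi", "Mumbai", "Delhi", "Chennai"]

codes, mapping = label_encode(cities)
onehot, columns = one_hot_encode(cities)

print(codes)    # [1, 2, 1, 0]
print(columns)  # ['Chennai', 'Delhi', 'Mumbai']
print(onehot)   # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Label encoding keeps a single column but imposes an arbitrary order on the categories. One-hot encoding avoids that order but adds one column per distinct category, which gets expensive when a feature has many levels.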

In this article, I will discuss a recently open-sourced library, “CatBoost”, developed and contributed by Yandex. CatBoost can use categorical features directly and is scalable.

“This is the first Russian machine learning technology that’s an open source,” said Mikhail Bilenko, Yandex’s head of machine intelligence and research.

P.S. You can also read an earlier article of mine, “How to deal with categorical variables?”.

Table of Contents

  1. What is CatBoost?
  2. Advantages of CatBoost library
  3. CatBoost in comparison to other boosting algorithms
  4. Installing CatBoost
  5. Solving ML challenge using CatBoost
  6. End Notes

1. What is CatBoost?

CatBoost is a recently open-sourced machine learning algorithm from Yandex. It integrates easily with deep learning frameworks such as Google’s TensorFlow and with Apple’s Core ML. It works with diverse data types, which helps with a wide range of problems that businesses face today. To top it off, Yandex claims best-in-class accuracy for it.

It is especially powerful in two ways:

  • It yields state-of-the-art results without the extensive data training
    typically required by other machine learning methods, and
  • It provides powerful out-of-the-box support for the more descriptive
    data formats that accompany many business problems.

The name “CatBoost” comes from two words: “**Cat**egory” and “**Boost**ing”.

As discussed, the library works well with multiple categories of data, such as audio, text, images and historical data.

“Boost” comes from the gradient boosting machine learning algorithm, on which this library is based. Gradient boosting is a powerful machine learning algorithm that is widely applied, with good results, to many types of business challenges such as fraud detection, item recommendation and forecasting. It can also return very good results with relatively little data, unlike deep learning (DL) models, which need to learn from a massive amount of data.
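For readers new to the idea, here is a rough sketch of gradient boosting for squared-error regression on a single numeric feature, in plain Python. It is a toy illustration of the general technique, not CatBoost's actual implementation: each round fits a one-split tree (a "stump") to the current residuals and adds a shrunken copy of it to the ensemble. The function names, the data and the hyperparameters are all assumptions chosen for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit stumps to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * predict_stump(s, x) for s in stumps)


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]

model = fit_boosting(xs, ys)
mse_before = mse(ys, [sum(ys) / len(ys)] * len(ys))  # predicting the mean
mse_after = mse(ys, [predict(model, x) for x in xs])  # boosted ensemble

print(mse_before, mse_after)
```

Each weak stump only corrects a little of the remaining error, and the learning rate shrinks its contribution further, so the ensemble improves gradually over many rounds.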

Here is a video message from Mikhail Bilenko, Yandex’s head of machine intelligence and research, and Anna Veronika Dorogush, head of Yandex machine learning systems.
