RentHop Kaggle Competition Summary


The work in a Kaggle competition, as others often say, mainly consists of two parts: feature engineering and ensembling. So I would like to organize this summary around these two parts.

As a newcomer to Kaggle, I did little in feature engineering because I had almost no idea how to do it, even after reading some slides from the masters. So most of the features were inspired by the public kernels and other people's ideas.

The features I used are:

Raw features from the dataset, including numerical features and categorical features.

Some basic features from the baseline kernel given by srk (sketched in code below), including:
count of listed features for the house, description length -> describing a feature (or a feature set) simply by counting
created day, hour, month and so on -> splitting the timestamp into different units
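
A minimal pandas sketch of these counting and time-splitting features, assuming the listing data follows the competition's JSON schema (the values below are made up):

```python
import pandas as pd

# Toy frame with made-up values; column names follow the competition's JSON schema.
df = pd.DataFrame({
    "features": [["Elevator", "Dogs Allowed"], [], ["Doorman"]],
    "photos": [["a.jpg"], [], ["b.jpg", "c.jpg"]],
    "description": ["Bright 2BR near the park", "", "Renovated studio"],
    "created": ["2017-04-01 10:30:00", "2017-04-15 22:05:00", "2017-05-02 08:00:00"],
})

# Describe a list/text field simply by counting.
df["num_features"] = df["features"].apply(len)
df["num_photos"] = df["photos"].apply(len)
df["description_len"] = df["description"].str.len()
df["description_words"] = df["description"].str.split().str.len()

# Split the creation timestamp into different units.
created = pd.to_datetime(df["created"])
df["created_month"] = created.dt.month
df["created_day"] = created.dt.day
df["created_hour"] = created.dt.hour
```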

And I did not use tf-idf for the 'features' column, because I had found a better way of using it: mapping the few important features into one-hot encodings. I did this because, after peering at the data, I found that most of the entries appearing in that column occur only a few times, except for a very few important ones. A CV was done to choose the frequency threshold.
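
A sketch of that idea, keeping only the entries of the 'features' column that occur more often than a frequency threshold (the threshold below is a toy number; in practice it was chosen by CV):

```python
import pandas as pd

# Toy data: each listing has a list of feature strings.
df = pd.DataFrame({"features": [
    ["Elevator", "Dogs Allowed"], ["Elevator"], ["Roof Deck"], ["Elevator", "Doorman"],
]})

# Count how often each feature string appears across all listings.
freq = pd.Series([f for feats in df["features"] for f in feats]).value_counts()

THRESHOLD = 2  # toy value; pick by cross-validation
keep = freq[freq >= THRESHOLD].index

# One column per frequent feature, 1 if the listing has it.
for feat in keep:
    col = "feat_" + feat.lower().replace(" ", "_")
    df[col] = df["features"].apply(lambda fs, feat=feat: int(feat in fs))
```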

Besides these basic features, the features below are, I think, what feature engineering should focus on.

The most important feature, manager skill, is given by target encoding. It is an encoding of each manager by their performance on the cases in the training set. The performance is evaluated simply as 2*high + medium. This performance score is used instead of encoding directly by the labels because there is a relationship between the proportion of 'high' and the proportion of 'medium'. The encoding is computed in a CV-style manner to avoid overfitting, as it might also be viewed as a naive Bayes classifier based solely on manager_id.
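
A minimal sketch of this kind of out-of-fold target encoding, assuming a training frame with manager_id and interest_level columns (this is my reconstruction of the idea, not the exact code used in the competition):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy training frame.
train = pd.DataFrame({
    "manager_id": ["a", "a", "b", "b", "b", "c", "a", "c"],
    "interest_level": ["high", "low", "medium", "low", "high",
                       "low", "medium", "low"],
})

# Per-listing skill points: 2*high + 1*medium + 0*low.
train["skill_point"] = train["interest_level"].map({"high": 2, "medium": 1, "low": 0})

# Encode each row using only the other folds, to avoid leaking its own label.
train["manager_skill"] = np.nan
skill_col = train.columns.get_loc("manager_skill")
for fit_idx, enc_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(train):
    means = train.iloc[fit_idx].groupby("manager_id")["skill_point"].mean()
    train.iloc[enc_idx, skill_col] = train.iloc[enc_idx]["manager_id"].map(means).values

# Managers unseen in the fitting folds fall back to the global mean;
# the test set would be encoded with means computed on the full training set.
train["manager_skill"] = train["manager_skill"].fillna(train["skill_point"].mean())
```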

Some statistics grouped by manager, including the mean, max, median and min of the number of bedrooms over a manager's cases, and things like that.
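
For example, a small sketch of such per-manager aggregations with pandas (toy values, hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "manager_id": ["a", "a", "b", "b", "b"],
    "bedrooms": [1, 3, 2, 2, 0],
    "price": [2400, 4100, 3000, 2800, 1900],
})

# Mean/max/median/min of a few numeric columns over each manager's cases.
stats = df.groupby("manager_id")[["bedrooms", "price"]].agg(["mean", "max", "median", "min"])
stats.columns = ["mgr_" + "_".join(c) for c in stats.columns]

# Attach the per-manager statistics back to every listing.
df = df.merge(stats, left_on="manager_id", right_index=True, how="left")
```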

Some constructed categorical features, including spatial clusters over all the cases, the house type, and the street name from the address. This kind of feature can be built by binning/clustering numerical values, or by combining numerical (mostly integer) and categorical features together.
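
A small sketch of both ideas, clustering latitude/longitude into a categorical area id and combining bedrooms/bathrooms into a house type (toy coordinates; the number of clusters would be tuned by CV):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "latitude":  [40.71, 40.73, 40.80, 40.68, 40.75],
    "longitude": [-73.99, -73.98, -73.95, -73.97, -73.96],
    "bedrooms":  [1, 2, 2, 0, 3],
    "bathrooms": [1.0, 1.0, 2.0, 1.0, 1.5],
})

# Spatial cluster id as a constructed categorical feature.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster_id"] = km.fit_predict(df[["latitude", "longitude"]])

# House type by combining two integer-ish features.
df["house_type"] = df["bedrooms"].astype(str) + "_" + df["bathrooms"].astype(str)
```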

I performed the target encoding and the statistics for other categorical features as well, including the constructed ones. Some of them worked and some did not, but none of them performed better than the features based on the managers.

This is highly related to the business. Someone in the group once mentioned that, in an interview, the founder of RentHop said that brokers are very important in the house-renting business. So it is no wonder that describing a manager by all these statistical features matters so much. Thus, when doing feature engineering, it is very important to think in terms of the business. Still, computing statistics for other categorical features is not a bad try. The other categorical features on which such statistical features also had some effect are the house type and the cluster id on the map. These features also make sense in the real business.

Some of the masters' slides mentioned that it is good to try features describing a trend when temporal information is available, so I also tried target encoding over a given time window, say 3 days, a week and so on. This did improve things a bit, but not much.
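
A rough sketch of what such a windowed version could look like, encoding each listing by its manager's activity in the preceding 7 days only (my own reconstruction with made-up data, not the code actually used):

```python
import pandas as pd

df = pd.DataFrame({
    "manager_id": ["a", "a", "a", "b", "b"],
    "created": pd.to_datetime(["2017-04-01", "2017-04-03", "2017-04-20",
                               "2017-04-02", "2017-04-05"]),
    "skill_point": [2, 1, 0, 1, 2],   # e.g. 2*high + medium per listing
})

def trailing_stats(g, window="7D"):
    g = g.sort_values("created").set_index("created")
    past = g["skill_point"].shift(1)          # exclude the current listing itself
    g["mgr_7d_count"] = past.rolling(window).count()
    g["mgr_7d_mean"] = past.rolling(window).mean()
    return g.reset_index()

df = (df.groupby("manager_id", group_keys=False)
        .apply(trailing_stats)
        .sort_values("created"))
```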

In Faron's solution (https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32148) he describes these features:

Clustering & Proxies for Points of Interest: filtering the rows with certain features like 'subway', 'park', etc. and getting their locations by clustering (and also EDA, I suppose).
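
One possible way to implement such a proxy, assuming the feature lists are available (the keyword, coordinates and number of clusters here are all toy assumptions, not Faron's actual setup):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "latitude":  [40.71, 40.73, 40.80, 40.68, 40.75, 40.72],
    "longitude": [-73.99, -73.98, -73.95, -73.97, -73.96, -74.00],
    "features": [["Near Subway"], [], ["Subway", "Park"],
                 ["Park View"], [], ["Close to subway"]],
})

# Listings that mention "subway" act as noisy markers of subway locations.
has_subway = df["features"].apply(lambda fs: any("subway" in f.lower() for f in fs))
coords = df.loc[has_subway, ["latitude", "longitude"]].to_numpy()

# Cluster centers serve as proxy points of interest (k is a toy value).
centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords).cluster_centers_

# Distance from every listing to its nearest proxy (in coordinate degrees).
all_coords = df[["latitude", "longitude"]].to_numpy()
dists = np.linalg.norm(all_coords[:, None, :] - centers[None, :, :], axis=2)
df["dist_to_subway_proxy"] = dists.min(axis=1)
```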

Simple Text Features: uppercase and lowercase ratios and counts of characters like '*', '!', '$', '<'. I did little NLP on the description, though I tried the uppercase ratio once and got no improvement.
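
These are straightforward to compute; a quick sketch on a made-up description column:

```python
import re
import pandas as pd

df = pd.DataFrame({"description": [
    "HUGE 2BR!!! Call NOW * no fee *", "Quiet studio near the park.", "",
]})

desc = df["description"].fillna("")
n_chars = desc.str.len().clip(lower=1)          # avoid division by zero

# Case ratios.
df["upper_ratio"] = desc.apply(lambda s: sum(c.isupper() for c in s)) / n_chars
df["lower_ratio"] = desc.apply(lambda s: sum(c.islower() for c in s)) / n_chars

# Counts of a few special characters.
for ch in ["*", "!", "$", "<"]:
    df["count_" + ch] = desc.str.count(re.escape(ch))
```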

Completeness Score: simply checking whether each field of the original dataset is properly filled. This is a row-wise statistical feature. Really reasonable.
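
A row-wise completeness sketch, counting how many of a listing's fields are non-empty (toy columns and values; what counts as "missing" is an assumption here):

```python
import pandas as pd

df = pd.DataFrame({
    "description": ["Nice flat", "", None],
    "features": [["Elevator"], [], ["Doorman"]],
    "photos": [["a.jpg"], [], []],
    "building_id": ["b1", "0", "b2"],
})

def completeness(row):
    """Fraction of fields in the row that are actually filled in."""
    filled = 0
    for value in row:
        if isinstance(value, (list, tuple)):
            filled += len(value) > 0
        else:
            filled += value not in (None, "", "0")
    return filled / len(row)

df["completeness"] = df.apply(completeness, axis=1)
```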

Embedding for the description: this could be a good method when dealing with text features. Besides xgb embedding, a neural network might also be used.
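
One simple way to get such an embedding (not necessarily what the winners did) is TF-IDF followed by a truncated SVD, which gives a small dense vector per description that tree models or neural nets can consume:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = pd.Series([
    "Bright renovated 2BR with elevator and doorman",
    "Cozy studio, no fee, close to the subway",
    "Spacious loft with roof deck and gym",
    "Sunny one bedroom near the park",
])

# Sparse TF-IDF representation of the text ...
tfidf = TfidfVectorizer(stop_words="english")
X_text = tfidf.fit_transform(descriptions.fillna(""))

# ... compressed into a dense embedding (the dimension is a toy choice).
svd = TruncatedSVD(n_components=3, random_state=0)
embedding = svd.fit_transform(X_text)
print(embedding.shape)   # (4, 3)
```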

In Little Boat's solution (https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32123), he used gridding of the spatial information instead of clustering, and he used the number of bedrooms to represent the house type. This is more consistent with the real business, I suppose.
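
Gridding can be as simple as snapping coordinates to a fixed-size cell, for example (the cell size below is an arbitrary toy value, not the one from his solution):

```python
import pandas as pd

df = pd.DataFrame({
    "latitude":  [40.7128, 40.7306, 40.7484],
    "longitude": [-74.0060, -73.9866, -73.9857],
})

CELL = 0.01  # roughly a 1 km grid around NYC; tune like any other hyperparameter
df["grid_lat"] = (df["latitude"] // CELL).astype(int)
df["grid_lon"] = (df["longitude"] // CELL).astype(int)
df["grid_id"] = df["grid_lat"].astype(str) + "_" + df["grid_lon"].astype(str)
```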

It should be noted that in Little Boat's solution he used two sets of features. In his follow-up posts he also mentioned how to choose features and how to get multiple feature sets. His method of feature engineering is to add new features to the best single model if they improve it, while also keeping all the other generated features. Then he figures out their relations/correlations with the existing features, or, specifically, finds out why including them hurts the model when intuitively they should give better results. So it is necessary to have a quick pipeline to generate new, reasonable features and try them, adding the effective ones to the best feature set while keeping all of them for later diversity in ensembling.

In Plantsgo's solution (https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32163), he used log transformations of the features in order to acquire new feature sets with some diversity for blending.
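
For instance, skewed numeric columns like price can be log-transformed to build an alternative feature set (a minimal sketch with made-up columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [1900, 2400, 3000, 12500],
                   "description_len": [0, 120, 450, 80]})

# log1p handles zeros; keep both versions to feed different base models.
df_log = df.copy()
for col in ["price", "description_len"]:
    df_log[col] = np.log1p(df_log[col])
```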

In Qianqian's solution (https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32156), he used some time-related features such as the gap between a manager's postings and how many listings a manager posts in 24 hours on average.
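
A rough pandas reconstruction of those two ideas (toy timestamps, hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "manager_id": ["a", "a", "a", "b", "b"],
    "created": pd.to_datetime(["2017-04-01 09:00", "2017-04-01 18:00",
                               "2017-04-04 12:00", "2017-04-02 10:00",
                               "2017-04-05 10:00"]),
})
df = df.sort_values(["manager_id", "created"])

# Gap (in hours) since the same manager's previous posting.
df["mgr_gap_hours"] = (df.groupby("manager_id")["created"]
                         .diff().dt.total_seconds() / 3600)

# Average number of postings per 24 hours over the manager's active period.
span = df.groupby("manager_id")["created"].agg(["count", "min", "max"])
active_days = ((span["max"] - span["min"]).dt.total_seconds() / 86400).clip(lower=1)
df = df.merge((span["count"] / active_days).rename("mgr_posts_per_day"),
              left_on="manager_id", right_index=True, how="left")
```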

So what I learned from this competition in feature engineering is mainly this:
1) Keep trying meaningful/reasonable features. Focus on the business.
2) Construct features from statistics. If the label/target is used, then CV-style generation is a must.
3) Construct features by feature interaction, especially the meaningful ones. For example, combining (#bathrooms, #bedrooms) to get the house type, or computing price/#rooms to get the average price per room.
4) Apply clustering/gridding to spatial features.
5) Apply row statistics, which might hint at the situation in which the record was generated.
6) The text part might be embedded by some more complicated models.
7) Apply transformations, including log, sqrt and other non-linear transformations, to get diversity for the ensemble.

The second important part of Kaggle would be model ensembling. The models I used in this competition are:
xgb, my single best model
lgbm, rf and et using the same feature set as xgb
logistic regression, and logistic regression on log-transformed features, using the same feature set as xgb but with most of the features normalized
knn with 4, 8, 16, 32 neighbours, not sure if they helped => seems not helpful but harmful, which I tried after the competition

And the second-layer meta-predictor is an xgboost model.
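
A compact stacking sketch in that spirit, with out-of-fold probabilities of a few first-layer models feeding an xgboost meta-learner (the base models and data below are placeholders, not the actual ones listed above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

# Toy 3-class problem standing in for the low/medium/high target.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=1000),
]

n_classes = 3
kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models) * n_classes))

# Out-of-fold predictions of each base model become the meta-features.
for m_idx, model in enumerate(base_models):
    for tr_idx, va_idx in kf.split(X):
        model.fit(X[tr_idx], y[tr_idx])
        oof[va_idx, m_idx * n_classes:(m_idx + 1) * n_classes] = \
            model.predict_proba(X[va_idx])

# Second-layer meta-predictor: an xgboost classifier on the stacked features.
meta = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
meta.fit(oof, y)
```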

I had also tried another bagging method, similar to Qianqian's. However, my attempt failed. I suppose that's because:

1) I tried to generate diversity by a CV-style method using different splits of the data, while he acquired diversity by using different columns/features;

2) He averaged the bagging result with the single meta-predictor, while I just used the bagging result;

3) He applied his method on the first layer, and I applied it on the second layer, hoping to add a third stacking layer.

And in Plantsgo's and Faron's solutions, they used not only multiclass classification but also regression (as there is an ordinal relationship among the labels) and one-vs-all classification, which provides a lot of diversity among the base models. Besides that, for a classification task, using different metrics or different optimization losses might also be a good way to create diversity, say accuracy, log loss or F1-score. The same applies to regression, using L1 distance, L2 distance or any other metric.

And most of them used multiple kinds of feature sets. In Qianqian's method it is done by extracting different features, Plantsgo added his own features to other people's kernels, and Little Boat kept all the features and constructed extra sets from those not used in the main model. All these methods are interesting and useful.

As for my participation in the competition itself, I also have some reflections.

While doing feature engineering, it is really frustrating when a new feature does not improve the model. (Yes, just like the comic mentioned above.) So I kept trying the same feature again and again by changing the model parameters. This, apparently, is totally useless. The time I spent on this could have been used to try other new features. And it would be wise to save the failed features to create diversity in the ensemble.

And I had also been too eager to see a leap on the LB, which led to too much and too early parameter tuning, and long runs of a single model.

And I did not save the results of the feature transformations, so every time I ran the predictor I had to start all over again. Fortunately, this was corrected in the later stage of the competition. Something about feature management should be done before the next competition.

And I could also have used a small subset of the training data to check whether the feature transformation or the cross-validation pipeline works correctly, and then let the machine run by itself. This would save a lot of meaningless waiting time and also make use of the time while I am doing other things. Thus I suppose it is necessary to build a feature-generation and cross-validation pipeline. It is also important to check whether the features are generated correctly: I ran into coding bugs several times that generated NaN features, which wastes time and is very misleading.
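
One way to catch such problems early is a quick sanity check on a small sample before launching any long run; a hedged sketch (the function name, thresholds and checks are my own invention):

```python
import numpy as np
import pandas as pd

def sanity_check(df, feature_cols, sample_size=1000):
    """Run quick checks on a small sample before a long training job."""
    sample = df.sample(min(sample_size, len(df)), random_state=0)
    feats = sample[feature_cols]

    # Fail fast on NaN / infinite values introduced by buggy transformations.
    nan_cols = feats.columns[feats.isna().any()].tolist()
    inf_cols = [c for c in feats.select_dtypes(include=[np.number]).columns
                if np.isinf(feats[c]).any()]
    assert not nan_cols, f"NaN values in: {nan_cols}"
    assert not inf_cols, f"Infinite values in: {inf_cols}"

    # Constant columns usually mean a join or mapping went wrong.
    constant = [c for c in feats.columns if feats[c].nunique(dropna=False) <= 1]
    if constant:
        print("Warning, constant columns:", constant)

# Example usage (hypothetical frame and columns):
# sanity_check(train, ["price", "manager_skill", "created_hour"])
```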

And in the late days I was too anxious about the competition, which even led to procrastination, acting thoughtlessly while pretending to be hardworking.

It is necessary to follow the discussion board and communicate with others. Many features and ideas were inspired by the kernels, the discussions and other participants.

Although I only got top 6% and a bronze medal in this competition, it still made me very excited. I hope this article will be helpful for those who want to take part in future competitions, including myself.
