Identifying bots in an online auction
Is a bidder in an online auction a bot or a human?
The data consists of a list of bid events (auction id, user id, time, IP, location) and a table, X, containing the bidder IDs, the hashed contact and payment addresses, and a label saying whether each bidder is a robot or a human.
The interesting part of this problem is finding ways to characterize bidding behaviour, that is, generating features that can be added to X. X can then be used to train a classification algorithm to distinguish between bots and humans.
The most useful features I identified were: the median time between a user's bid and that user's previous bid; the mean number of bids a user made per auction; the entropy of the distribution of a user's bids over the days of the week; the maximum number of bids a user placed in any 20-minute span; the total number of bids placed by the user; the average number of bids a user placed per referring URL; the number of bids placed by the user on each of the three weekdays in the data; and the minimum and median times between a user's bid and the previous bid by another user in the same auction.
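Several of the timing features listed above can be sketched in a few lines of plain Python. This is only an illustration, not the script used for the actual predictions: the function names, the `(bidder_id, auction_id, time)` tuple layout and the `window` parameter are assumptions, and "previous bid by another user" is interpreted here as the most recent earlier bid in the same auction made by a different bidder.

```python
from collections import defaultdict
from statistics import median


def timing_features(bids, window):
    """bids: iterable of (bidder_id, auction_id, time) tuples (assumed layout).
    window: sliding-window length in the same units as `time`
    (e.g. 20 minutes once the length of a day is known)."""
    times_by_user = defaultdict(list)
    auctions_by_user = defaultdict(set)
    for bidder_id, auction_id, t in bids:
        times_by_user[bidder_id].append(t)
        auctions_by_user[bidder_id].add(auction_id)

    features = {}
    for bidder_id, times in times_by_user.items():
        times.sort()
        # Gaps between each bid and the same user's previous bid.
        gaps = [b - a for a, b in zip(times, times[1:])]
        # Largest number of bids falling inside any interval of length `window`.
        max_in_window, start = 0, 0
        for end, t in enumerate(times):
            while t - times[start] > window:
                start += 1
            max_in_window = max(max_in_window, end - start + 1)
        features[bidder_id] = {
            "n_bids": len(times),
            "median_gap": median(gaps) if gaps else None,
            "bids_per_auction": len(times) / len(auctions_by_user[bidder_id]),
            "max_bids_in_window": max_in_window,
        }
    return features


def response_time_features(bids):
    """Minimum and median time between a user's bid and the most recent
    earlier bid by a different user in the same auction."""
    by_auction = defaultdict(list)
    for bidder_id, auction_id, t in bids:
        by_auction[auction_id].append((t, bidder_id))

    responses = defaultdict(list)
    for events in by_auction.values():
        events.sort()
        last_time_by_user = {}
        for t, bidder_id in events:
            # Times of the latest bids made so far by everyone else.
            others = [lt for u, lt in last_time_by_user.items() if u != bidder_id]
            if others:
                responses[bidder_id].append(t - max(others))
            last_time_by_user[bidder_id] = t
    return {
        u: {"min_response": min(r), "median_response": median(r)}
        for u, r in responses.items()
    }
```

Users who never bid after someone else in an auction simply do not appear in the output of `response_time_features`; how to fill in such missing values before training is a separate choice.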
Here's the script I used to process the data and generate predictions.
A quick look at the data
- A sample auction
Another useful observation is that users who place bids from multiple countries often place bids from only one country in any given auction. This suggests that multiple people may be placing bids from the same account, but that only one individual on the account usually follows any one auction.
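One way to check this kind of observation is to compare the number of distinct countries a user bids from overall with the number of countries per auction. The sketch below assumes bids are available as `(bidder_id, auction_id, country)` tuples; the function name and output keys are made up for illustration.

```python
from collections import defaultdict


def country_counts(bids):
    """bids: iterable of (bidder_id, auction_id, country) tuples (assumed layout)."""
    # countries[bidder][auction] -> set of countries seen in that auction
    countries = defaultdict(lambda: defaultdict(set))
    for bidder_id, auction_id, country in bids:
        countries[bidder_id][auction_id].add(country)

    result = {}
    for bidder_id, by_auction in countries.items():
        all_countries = set().union(*by_auction.values())
        per_auction = [len(c) for c in by_auction.values()]
        result[bidder_id] = {
            "n_countries": len(all_countries),
            "mean_countries_per_auction": sum(per_auction) / len(per_auction),
        }
    return result
```

A user with a high `n_countries` but a `mean_countries_per_auction` close to 1 matches the pattern described above.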
Time information
- Bidding action over time
In the bids/unit time histogram, each of the three chunks contains about three periods' worth of bidding activity, which suggests that each chunk of data probably spans three days. The chunks all have the same duration (+/- 1 mystery unit), and the gaps between the chunks are also all the same duration, which made it very easy to calculate the length of a day. The duration of the entire data set is 31 times one_day. Had the data chunks not all been so suggestively equal in duration, the positions of the peaks in the autocorrelation function of the bid histogram could have been used to calculate the length of a day instead.
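The autocorrelation fallback was not needed here, but a minimal sketch of it could look like the following. It assumes the bid histogram is a list of counts in equal-width time bins, and the search range `min_lag`..`max_lag` (both hypothetical parameters) has to be chosen so that it contains one day but not a multiple of it.

```python
def autocorrelation(counts, lag):
    """Normalised autocorrelation of a list of bin counts at a given lag."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts)
    cov = sum((counts[i] - mean) * (counts[i + lag] - mean) for i in range(n - lag))
    return cov / var


def estimate_period(counts, min_lag, max_lag):
    """Lag (in bins) with the highest autocorrelation in [min_lag, max_lag]."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(counts, lag))


# Toy histogram: a spike every 24 bins, repeated for 5 periods.
toy = [11 if i % 24 == 0 else 1 for i in range(120)]
period_in_bins = estimate_period(toy, 2, 60)
```

Multiplying the returned lag by the bin width gives the estimated length of a day in the data's own time units.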
- Characterizing the distribution of times at which users place bids
One might expect to have to shift each bidder's bids/time histogram along the time axis according to the time zone the bidder is in. This is tricky, though, since IP is not a reliable indicator of location. The daily spike in human bidding activity also suggested that shifting the bids/time histogram might not actually be very instructive.
- Bidding strategy over the course of an auction
Using entropy to characterize variety in weekday, IP, referring URL
Entropy also turned out to be a useful way to characterize how a user's bidding activity is distributed over the three weekdays in the data, as well as over the referring URLs.
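For reference, here is a minimal sketch of such an entropy feature, assuming the Shannon entropy of the empirical distribution with a base-2 logarithm (the choice of base is an assumption and only rescales the feature). The input is one label per bid, for example the weekday or the referring URL of each of a user's bids.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A user whose bids all fall on the same weekday gets an entropy of 0, while a user who spreads bids evenly over the three weekdays gets log2(3), the maximum possible for three categories.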
Other useful features
- Auction duration
I was really stumped as to why hardly any bids were placed by robots between 11 and 14 days before the end of the auction. I wondered whether this behaviour might help to explain why we only got to see three days out of every two weeks in the data.
Classification
- Choosing the algorithm
- Assessing the quality of predictions
Things I tried that didn't work
- Weekly bidder entropy
- Using a cluster algorithm to group users which use similar sets of IPs
- Feature-weighted linear stacking
- Using payment_account and address data