Online advertising
flourishes as the ideal choice for both small and large
businesses to target their marketing campaigns to the appropriate
customers on the fly. An advertiser provides an advertising
commissioner with its advertisements, plans a budget, and sets a
commission for each customer action. The content publishers, on
the other hand, make a contract with the commissioner to display
advertisements on their websites. However, since publishers earn
revenue based on impressions and clicks they drive to
advertisers, there is an incentive for dishonest publishers to
inflate the number of impressions/clicks their sites generate, a
phenomenon known as click fraud. Click fraud undermines the
reliability of the online advertising system, and in the long term
the market for online advertising will contract. Moreover, it may
lead to expensive litigation from dissatisfied advertisers and a bad
reputation for the commissioner. It is important for the
commissioner to proactively prevent click fraud so as to convince
its advertisers of the fairness of its accounting practices.
Accordingly, a reliable click fraud
detection system is needed to help identify dishonest publishers
and maintain the commissioner’s credibility.
Prize
The primary entries from
all participants will be ranked in decreasing order of their
respective evaluation scores (see Evaluation section), computed
on the test set. The top three entries will receive the following
cash prizes:
First place - SGD 4,000
Second place - SGD 2,000
Third place - SGD 1,000
Workshop
A workshop for the
competition results will be held in conjunction with the Asian
Conference on Machine Learning (ACML) 2012. Registration waivers
will be given to the speakers of five selected/winning
teams. For other attendees, the following registration fees apply
(for a single workshop):
Richard J. OENTARYO, Singapore Management University
Ee-Peng LIM, Singapore Management University
Feida ZHU, Singapore Management University
David LO, Singapore Management University
Kok Fung LAI, BuzzCity Pte. Ltd.
Web Administrators:
Juan DU, Singapore Management University
Philips K. PRASETYO, Singapore Management University
Overview
This competition involves advertisement data
provided by BuzzCity Pte. Ltd. BuzzCity is a global mobile
advertising network that has millions of consumers around the
world on mobile phones and devices. In Q1 2012, over 45 billion ad
banners were delivered across the BuzzCity network consisting of
more than 10,000 publisher sites which reach an average of over
300 million unique users per month. The number of smartphones
active on the network has also grown significantly. Smartphones
now account for more than 32% of the phones that are served
advertisements across the BuzzCity network.
You can download the data sets, including publishers_train.zip,
clicks_train.zip, publishers_validation.zip and
click_validation.zip, from our repository. Click here
to get an email for accessing the repository. The email will be
sent to the address that you registered.
Dataset
The “raw” data used in this competition comes in two types: a publisher
database and a click database, both provided in CSV format.
The publisher database records the publisher’s (aka partner’s)
profile and comprises several fields:
publisherid – Unique identifier of a publisher.
bankaccount – Bank account associated with a publisher
(may be empty)
address – Mailing address of a publisher (obfuscated; may
be empty)
status – Label of a publisher, which can be one of the following:
“OK” – Publishers whom BuzzCity deems as having healthy traffic
(or those who slipped past its detection mechanisms)
“Observation” – Publishers who may have just started their traffic,
or whose traffic statistics deviate from the system-wide average.
BuzzCity does not yet have a conclusive stand on these publishers
“Fraud” – Publishers who are deemed fraudulent with clear
proof. BuzzCity suspends their accounts, and their earnings will
not be paid
On the other hand, the click database
records the click traffic and has several fields:
id – Unique identifier of a particular click
numericip – Public IP address of a clicker/visitor
deviceua – Phone model used by a clicker/visitor
publisherid – Unique identifier of a publisher
adscampaignid – Unique identifier of a given advertisement campaign
usercountry – Country of the clicker/visitor
clicktime – Timestamp of a given click (in YYYY-MM-DD format)
publisherchannel – Publisher's channel type, which can be one of the
following:
ad – Adult sites
co – Community
es – Entertainment and lifestyle
gd – Glamour and dating
in – Information
mc – Mobile content
pp – Premium portal
se – Search, portal, services
referredurl – URL where the ad banners were clicked
(obfuscated; may be empty). More details about the HTTP Referer
header can be found in this article
Note: The raw data provided is encrypted for privacy protection, and the format of some fields (e.g., partner id,
bankaccount, campaign id, referer URL) may change over time due to
different encryption schemes. Predictions should therefore not
depend on a specific field format or encryption method.
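As a rough illustration of working with the click database, the sketch below aggregates a few per-publisher statistics from click records. It assumes the CSV has a header row whose column names match the field names listed above; the sample rows, the chosen statistics, and the helper name summarize_clicks are made up for illustration and are not part of the competition material.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the click database layout described above.
# Assumption: the real files have a header row with these field names.
SAMPLE_CLICKS = """id,numericip,deviceua,publisherid,adscampaignid,usercountry,clicktime,publisherchannel,referredurl
1,1001,PhoneA,pub_a,camp_1,sg,2012-02-09,es,
2,1001,PhoneA,pub_a,camp_1,sg,2012-02-09,es,
3,1002,PhoneB,pub_a,camp_2,id,2012-02-10,es,url_x
4,2001,PhoneC,pub_b,camp_1,in,2012-02-10,mc,url_y
"""


def summarize_clicks(csv_file):
    """Aggregate simple per-publisher statistics from click records."""
    clicks = defaultdict(int)      # total clicks per publisher
    ips = defaultdict(set)         # distinct visitor IPs per publisher
    empty_ref = defaultdict(int)   # clicks with an empty referredurl
    for row in csv.DictReader(csv_file):
        pid = row["publisherid"]
        clicks[pid] += 1
        ips[pid].add(row["numericip"])
        if not row["referredurl"]:
            empty_ref[pid] += 1
    return {
        pid: {
            "clicks": n,
            "distinct_ips": len(ips[pid]),
            "empty_referrer_ratio": empty_ref[pid] / n,
        }
        for pid, n in clicks.items()
    }


summary = summarize_clicks(io.StringIO(SAMPLE_CLICKS))
for pid, stats in sorted(summary.items()):
    print(pid, stats)
```

Because the values are only counted and compared for equality, the sketch does not depend on any particular field format or encryption scheme.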
Task
The goal of this competition is to build a
data-driven methodology for effective detection of fraudulent
publishers. In particular, the task is to detect "Fraud"
publishers (positive cases) and separate them from "OK" and
"Observation" publishers (negative cases), based on their click
traffic and account profiles. This would help shed light on
several key areas:
What is the underlying click fraud scheme?
What sort of concealment strategies are commonly used by fraudulent
parties?
How to interpret data for patterns of dishonest publishers and
websites?
How to build effective fraud prevention/detection plans?
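The labeling described in the task ("Fraud" as the positive class, "OK" and "Observation" as negative) can be sketched as a small mapping. The function name and the sample publisher ids are hypothetical; only the three status values come from the dataset description.

```python
# Status values as listed in the publisher database description.
KNOWN_STATUSES = {"OK", "Observation", "Fraud"}
POSITIVE_STATUS = "Fraud"


def to_binary_label(status):
    """Map a publisher status to 1 (fraud) or 0 (OK / Observation)."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unexpected status: {status!r}")
    return 1 if status == POSITIVE_STATUS else 0


# Made-up publisher ids for illustration only.
publishers = {"pub_a": "Fraud", "pub_b": "OK", "pub_c": "Observation"}
labels = {pid: to_binary_label(status) for pid, status in publishers.items()}
print(labels)
```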
For training and evaluation, we provide
three sets of click and publisher data:
Training set - For building your prediction model
Validation set - For evaluation of your model in the public leaderboard
(see Evaluation). This can be treated as a holdout set
Testing set - For final evaluation of your model and thus determination
of the competition winners
Note: Each dataset comes from a different time window, so
there could be publishers (publisherid) who appear across different
sets and whose status label may change. You can infer the time
windows from the timestamps of the click dataset.
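A minimal sketch of both points in the note: inferring a set's time window from its click timestamps, and finding publishers whose status differs between two sets. It assumes a header row with a clicktime column and timestamps in a single ISO-like format (so that plain string comparison is chronological); all sample values and helper names are invented.

```python
import csv
import io


def click_time_window(csv_file):
    """Return (earliest, latest) clicktime strings, or None if there are no rows.

    Assumes every clicktime uses the same ISO-like format, so comparing
    the strings directly gives chronological order.
    """
    times = [row["clicktime"] for row in csv.DictReader(csv_file)]
    if not times:
        return None
    return min(times), max(times)


def changed_status(status_a, status_b):
    """Return ids of publishers present in both sets with different labels."""
    shared = status_a.keys() & status_b.keys()
    return {pid for pid in shared if status_a[pid] != status_b[pid]}


# Made-up data for illustration only.
sample = "publisherid,clicktime\npub_a,2012-02-10\npub_b,2012-02-09\npub_a,2012-02-11\n"
window = click_time_window(io.StringIO(sample))

train_status = {"pub_a": "OK", "pub_b": "Observation"}
valid_status = {"pub_a": "Fraud", "pub_b": "Observation", "pub_c": "OK"}
changed = changed_status(train_status, valid_status)
print(window, changed)
```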
Participation
You may participate in the competition as an individual or as part
of a team of up to five participants. Only two
submissions per day from each team are allowed. You may only be
involved in one "team" in the competition (i.e. either
as an individual or as a member of a single team, but not both).
For team registrations, the team must select a team leader, who
will provide the team name, corresponding email address, and team
affiliation during registration. Each team is required to register
for a user account and log into the competition site. The e-mail
addresses of the participants will not be publicly displayed
during the competition.
Format
You can download the sample file
(baseline_result.zip) from our repository via SFTP, in the same way
as the data sets.
The submission format consists of a two-column CSV file. The first
column contains the partner id (sorted lexicographically in
ascending order). The second column contains the prediction or
confidence scores for the fraud (positive) class label, typically
normalized within [0, 1] or [-1, 1]. The average precision is then
calculated by ranking the scores and measuring precision at
different cutoffs (see Evaluation).
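A submission file in this two-column layout could be produced along the following lines. This is only a sketch: whether a header row is expected is not stated here, so none is written, and Python's default string ordering is assumed to match the intended lexicographic order. The sample file in baseline_result.zip remains the authoritative reference for the exact layout and row order.

```python
import csv
import io


def write_submission(scores, out_file):
    """Write 'partner id, score' rows sorted by partner id (ascending).

    Assumptions: no header row, and scores already normalized to [0, 1].
    """
    writer = csv.writer(out_file, lineterminator="\n")
    for pid in sorted(scores):
        writer.writerow([pid, f"{scores[pid]:.6f}"])


# Made-up ids and scores for illustration only.
scores = {"pub_b": 0.1, "pub_a": 0.9}
buffer = io.StringIO()
write_submission(scores, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```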
Submission
To submit your new results, please enter your file below (you need
to log in first). After uploading your file, scoring takes a few seconds,
and then you should see your result. Your entry in
the public leaderboard (see Evaluation) shall be updated
if your current result beats your previous best result. Note:
The order of the partner IDs (i.e. the first column) in the
submitted file should match that in the sample file.
Final Submission
You can submit your final results on the test dataset below. Note
that the submission deadline is 1 Oct, 04:00 UTC, and
you are allowed to submit at most twice by then. Any
subsequent submission will be disregarded. Of the two
submissions, the better score will be selected automatically
and used for the selection of winners. You will receive a separate
email showing your team's and the other teams’ final scores after the
competition ends.
Evaluation
Submissions will be
evaluated using the average precision criterion. Full
details of the criterion are given in this article.
Briefly, suppose there are k publishers in a list ordered
in decreasing order of their prediction score. The average
precision (AP) is computed as follows:
$$ AP = \frac{1}{m} \sum_{i=1}^{k} \mathrm{Precision}(i) $$
where Precision(i) denotes the precision at cutoff i
in the publisher list, i.e., the fraction of correct fraud
predictions up to position i, and m is the number
of actual fraud publishers. Note that when the i-th
prediction is incorrect, Precision(i) = 0.
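The criterion as described here (sum Precision(i) over the positions holding an actual fraud publisher, then divide by m) can be sketched as follows. The function names and the example data are illustrative; how the official scorer breaks ties between equal scores is not specified, so this sketch simply keeps the input order for ties.

```python
def average_precision(ranked_labels):
    """Average precision for 0/1 labels listed in ranked order (best score first).

    Precision(i) counts only at positions holding an actual fraud
    publisher (label 1); the sum is divided by m, the number of frauds.
    """
    m = sum(ranked_labels)
    if m == 0:
        return 0.0  # no positives: treat AP as 0 by convention in this sketch
    hits = 0
    total = 0.0
    for i, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            total += hits / i  # precision at cutoff i
    return total / m


def rank_labels(scores, truth):
    """Order true labels by decreasing prediction score."""
    ranked_ids = sorted(scores, key=lambda pid: scores[pid], reverse=True)
    return [truth[pid] for pid in ranked_ids]


# Made-up example: the two frauds end up at ranks 1 and 3.
scores = {"p1": 0.9, "p2": 0.7, "p3": 0.6, "p4": 0.2}
truth = {"p1": 1, "p2": 0, "p3": 1, "p4": 0}
ap = average_precision(rank_labels(scores, truth))
print(ap)
```

In this example Precision(1) = 1 and Precision(3) = 2/3, so AP = (1 + 2/3) / 2 = 5/6.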
Leaderboard
The public leaderboard below displays the best score and
corresponding submission time of each team on the validation
set. Please note that there will be a separate testing set
to be used for final evaluation and determination of the winners.
Hence, the algorithm that performs best on the validation set does
not necessarily perform best on the testing set; this depends on its
ability to generalize to unseen data.
Workshop Program
Venue:
Seminar Room 2.2, School of Accountancy, Singapore Management University, 60 Stamford Road, Singapore 178900
Schedule – Session
13:30-14:00 – Introduction & Prize Giving to Competition Winners
14:00-14:30 – Presentation by Winner 1: Feature Engineering for Click Fraud Detection
Speaker: Dr. Clifton PHUA, Institute for Infocomm Research, Singapore
14:30-15:00 – Presentation by Winner 2: A Novel Approach Based on Ensemble Learning for Fraud Detection in Mobile Advertising
Speaker: Kasun S. PERERA and Bijay NEUPANE, Masdar Institute of Science and Technology, Abu Dhabi, UAE
15:10-15:30 – Coffee Break
15:30-16:00 – Presentation by Winner 3: Hybrid Models for Click Fraud Detection in Mobile Advertising
Speaker: Wei CHEN, National University of Singapore, Singapore
16:00-16:30 – Presentation by Runner-up: Hierarchical Committee Machines for Fraud Detection in Mobile Advertising
Speaker: Manoj Prasanna KUMAR, Ericsson India Global Services Pvt. Ltd., Tamil Nadu, India
16:30-17:30 – Poster presentation (by Top 10 Teams)
Each presentation should take 20 min + 10 min questions and answers. Please go to your room 10-15 mins before your session to meet the session chair, and try out the audio-video system. Rooms will be equipped with standard projectors, as well as a Windows-PC for those who don't have their own laptop.
Each presentation is accompanied by a poster session on the same day as the talk, from 16:30 to 17:30. Posters should follow the A1 portrait format. Please remove your poster after the session. Any posters that remain after 10:00am on the next day will be discarded.
Papers and Posters
Winner I: Feature Engineering for Click Fraud Detection
Clifton Phua, Eng Yeow Cheu, Minh Nhut Nguyen, Ghim Eng Yap, and Kelvin Sim
Institute for Infocomm Research (I2R), Singapore
Abstract: For our winning entry based on the test dataset, we applied Generalized Boosted Regression Models (GBM) to 118 predictive features, which consisted of 67 click behavior, 40 click duplication, and 11 high-risk click behavior features. Our most important data insight, from both domain knowledge and experimentation, is that invalid or fraudulent clicks have particular temporal and spatial characteristics which make them distinguishable from normal clicks. We were surprised that simple statistical features can retain powerful predictive power over time and reduce the chances of overfitting the training data. We used R for feature engineering and classification (GBM and RandomForest), MySQL for storing the data, and WEKA for trying out alternative classification schemes.
Winner II: A Novel Approach Based on Ensemble Learning for Fraud Detection in Mobile Advertising
Kasun S. Perera, Bijay Neupane, Mustafa Amir Faisal, Zeyar Aung, and Wei Lee Woon
Masdar Institute of Science and Technology, United Arab Emirates
Abstract: By diverting funds away from legitimate partners, click fraud represents a serious drain on
advertising budgets and can seriously harm the viability of the internet advertising market.
As such, fraud detection algorithms which can identify fraudulent behavior based on user
click patterns are extremely valuable. In this paper we propose a novel approach for click
fraud detection which is based on a set of new features derived from existing attributes.
The proposed model is evaluated in terms of the resulting precision, recall and the area
under the ROC curve. A final method based on 6 different learning algorithms proved to
be stable with respect to all 3 performance indicators. Our final method shows improved
results on training, validation and test dataset, thus demonstrating its generalizability to
different datasets.
Winner III: Hybrid Models for Click Fraud Detection in Mobile Advertising
Chen Wei and Dhaval Patel
National University of Singapore, Singapore
Abstract: Advertising plays a vital role in supporting free websites and smart phone apps to target
their marketing campaigns to the appropriate customers. Advertisers pay the publisher
(typically a website owner) when the ad is clicked (pay per click). However, malicious
publishers generate clicks that do not have genuine interest in the advertisements, which is
called “Click Fraud”. It results in advertising revenue being misappropriated by click spammers.
It is important to take active measures to block click spam today. BuzzCity
provides a snapshot of their click and publisher database as dataset for the competition.
The goal of this competition is to detect fraudulent publishers. This paper describes the
solution of the National University of Singapore team. We exploited a diverse set of models
(decision tree models, neural network models, models etc). Our final submission is a blend
of different learning algorithms. The algorithms are trained consecutively and they are
blended together to achieve 62.12% and 46.15% in average precision on the validation
and testing datasets, respectively.
Runner Up I: Random Forests for the Detection of Click Fraud in Online Mobile Advertising
Daniel Berrar
Tokyo Institute of Technology, Japan
Abstract: Click fraud is a serious threat to the pay-per-click advertising market. Here, we analyzed the
click patterns associated with 3081 publishers of online mobile advertisements. The status
of these publishers was known to be either fraudulent, under observation, or honest. The
goal was to develop a model to predict the status of a publisher based on its individual click
profile. In our study, the best model was a committee of random forests with imbalanced
bootstrap sampling. The average precision was 49.99% on the blinded validation set and
42.00% on the blinded test set. Our analysis also revealed interesting discrepancies between
the predicted and actual status labels.
Runner Up II: Hierarchical Committee Machines for Fraud Detection in Mobile Advertising
S. Shivashankar and P. Manoj
Ericsson Research, India
Abstract: In recent years, a significant amount of attention has been paid to fraud clicks in online
advertisements. Researchers have started paying as much attention to it as to other
problems such as placing the right ads on a page, personalizing them for a user, etc. In this paper, we elaborate
the method we used to predict fraud clicks in mobile ads. The dataset was provided by BuzzCity as part of
the machine learning contest held in conjunction with the Asian Conference on Machine Learning 2012. As
in the case of any fraud detection problem, this particular challenge also involves class imbalance issues.
More importantly, as repeatedly said by experts, feature engineering is the key to good performance.
We built a lot of derived attributes which played a critical role in improving the performance. We used
hierarchical committee machines to combine a set of diverse cost-sensitive classifiers built using different
sets of attributes (datasets). More details about feature engineering and methods used can be found in later
sections of the paper.
Important Dates
August 15 – Registration opens
September 1 – Competition begins
September 30 – Competition ends
October 7 – Winners notified
October 21 – Paper submission
October 28 – Early registration deadline
November 4 – Late registration deadline
November 4 – Competition workshop
Contacts
Richard J. OENTARYO, Program Chair. Email: roentaryo@smu.edu.sg
Juan DU, Web Administrator. Email: juandu@smu.edu.sg