Reference
@inproceedings{benevenuto@ceas10,
author = {Fabr\'{\i}cio Benevenuto and Gabriel Magno and Tiago
Rodrigues and Virg\'{\i}lio Almeida},
title = {Detecting spammers on Twitter},
booktitle = {Proceedings of the 7th Annual Collaboration, Electronic
messaging, Anti-Abuse and Spam Conference (CEAS)},
year = {2010},
location = {Redmond, USA}
}
If you want to use our dataset, please let us know by email at benevenuto AT gmail.com.
Database
The dataset used in the paper "Detecting Spammers on Twitter", published at the
Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS),
Redmond, Washington, USA, July 2010, is available.
The file is in LibSVM input format. Each line represents a user from our test
collection. The first column is the user class (1 for non-spammers, 2 for
spammers). The subsequent columns are numbered from 1 to 62 and represent the
user characteristics as explained in the list below.
- Number of followers per followees
- Fraction of tweets replied
- Fraction of tweets with spam words
- Fraction of tweets with URLs
- Existence of spam words in the screen name
- Number of hashtags per number of words on each tweet (mean)
- Number of hashtags per number of words on each tweet (median)
- Number of hashtags per number of words on each tweet (min)
- Number of hashtags per number of words on each tweet (max)
- Number of URLs per number of words on each tweet (mean)
- Number of URLs per number of words on each tweet (median)
- Number of URLs per number of words on each tweet (min)
- Number of URLs per number of words on each tweet (max)
- Number of characters per tweet (mean)
- Number of characters per tweet (median)
- Number of characters per tweet (min)
- Number of characters per tweet (max)
- Number of hashtags per tweet (mean)
- Number of hashtags per tweet (median)
- Number of hashtags per tweet (min)
- Number of hashtags per tweet (max)
- Number of mentions per tweet (mean)
- Number of mentions per tweet (median)
- Number of mentions per tweet (min)
- Number of mentions per tweet (max)
- Number of numeric characters per tweet (mean)
- Number of numeric characters per tweet (median)
- Number of numeric characters per tweet (min)
- Number of numeric characters per tweet (max)
- Number of URLs on each tweet (mean)
- Number of URLs on each tweet (median)
- Number of URLs on each tweet (min)
- Number of URLs on each tweet (max)
- Number of words per tweet (mean)
- Number of words per tweet (median)
- Number of words per tweet (min)
- Number of words per tweet (max)
- Number of times the tweet has been retweeted (mean), counted by the presence of "RT @username" in the text
- Number of times the tweet has been retweeted (median), counted by the presence of "RT @username" in the text
- Number of times the tweet has been retweeted (min), counted by the presence of "RT @username" in the text
- Number of times the tweet has been retweeted (max), counted by the presence of "RT @username" in the text
- Number of followees
- Number of followers
- Number of tweets
- Number of followees of a user's followers
- Number of times mentioned
- Number of times the user was replied to
- Number of times the user replied
- Number of tweets of a user's followees
- Time between posts (mean)
- Time between posts (median)
- Time between posts (min)
- Time between posts (max)
- Number of posted tweets per day (mean)
- Number of posted tweets per day (median)
- Number of posted tweets per day (min)
- Number of posted tweets per day (max)
- Number of posted tweets per week (mean)
- Number of posted tweets per week (median)
- Number of posted tweets per week (min)
- Number of posted tweets per week (max)
- Age of the user account
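
The file layout described above (a class label followed by index:value pairs for features 1 to 62) could be read with a short script. The following is a minimal Python sketch, not the authors' own tooling. The function names, the LABELS mapping, and the sample line are hypothetical and introduced only for illustration. It also assumes that any feature index missing from a line stands for zero, as is conventional in sparse LibSVM files; whether this dataset omits zero entries is not stated here.

```python
from typing import Dict, List, Tuple

NUM_FEATURES = 62  # feature indices run from 1 to 62, per the description above
LABELS = {1: "non-spammer", 2: "spammer"}  # class encoding given in the description


def parse_libsvm_line(line: str) -> Tuple[int, Dict[int, float]]:
    """Parse one LibSVM-format line into (class label, {feature index: value})."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    label = int(float(fields[0]))  # tolerate labels written as "2" or "2.0"
    if label not in LABELS:
        raise ValueError(f"unexpected class label: {fields[0]}")
    features: Dict[int, float] = {}
    for field in fields[1:]:
        index_text, value_text = field.split(":", 1)
        index = int(index_text)
        if not 1 <= index <= NUM_FEATURES:
            raise ValueError(f"feature index out of range: {index}")
        features[index] = float(value_text)
    return label, features


def to_dense(features: Dict[int, float]) -> List[float]:
    """Expand a sparse feature dict into a list of 62 values (assumes missing = 0)."""
    return [features.get(i, 0.0) for i in range(1, NUM_FEATURES + 1)]


# Hypothetical line, made up for illustration; it is not taken from the dataset.
sample = "2 1:0.25 4:0.9 45:120 62:30"
label, features = parse_libsvm_line(sample)
dense = to_dense(features)
print(LABELS[label], len(dense), dense[0], dense[61])
```

Position i in the dense list (counting from 1) corresponds to the i-th item of the list above, assuming the list order matches the column numbering. Libraries that read this format directly, such as scikit-learn's load_svmlight_file, may be a more convenient choice for larger experiments.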