Reference

@inproceedings{benevenuto@ceas10,
   author = {Fabr\'{\i}cio Benevenuto and Gabriel Magno and Tiago Rodrigues and Virg\'{\i}lio Almeida},
   title = {Detecting spammers on Twitter},
   booktitle = {Proceedings of the 7th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS)},
   year = {2010},
   location = {Redmond, USA}
}

 

If you want to use our dataset, please let us know by email at benevenuto AT gmail.com.



Database

The dataset used in the paper "Detecting Spammers on Twitter", published at the 7th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS), Redmond, Washington, USA, July 2010, is available.

The file is in LibSVM input file format. Each line represents one user from our test collection. The first column is the user class (1 for non-spammers, 2 for spammers). The subsequent columns are numbered from 1 to 62 and hold the user characteristics explained in the list below.

  1. Number of followers per followee
  2. Fraction of tweets replied
  3. Fraction of tweets with spam words
  4. Fraction of tweets with URLs
  5. Existence of spam words in the screen name
  6. Number of hashtags per number of words on each tweet (mean)
  7. Number of hashtags per number of words on each tweet (median)
  8. Number of hashtags per number of words on each tweet (min)
  9. Number of hashtags per number of words on each tweet (max)
  10. Number of URLs per number of words on each tweet (mean)
  11. Number of URLs per number of words on each tweet (median)
  12. Number of URLs per number of words on each tweet (min)
  13. Number of URLs per number of words on each tweet (max)
  14. Number of characters per tweet (mean)
  15. Number of characters per tweet (median)
  16. Number of characters per tweet (min)
  17. Number of characters per tweet (max)
  18. Number of hashtags per tweet (mean)
  19. Number of hashtags per tweet (median)
  20. Number of hashtags per tweet (min)
  21. Number of hashtags per tweet (max)
  22. Number of mentions per tweet (mean)
  23. Number of mentions per tweet (median)
  24. Number of mentions per tweet (min)
  25. Number of mentions per tweet (max)
  26. Number of numeric characters per tweet (mean)
  27. Number of numeric characters per tweet (median)
  28. Number of numeric characters per tweet (min)
  29. Number of numeric characters per tweet (max)
  30. Number of URLs on each tweet (mean)
  31. Number of URLs on each tweet (median)
  32. Number of URLs on each tweet (min)
  33. Number of URLs on each tweet (max)
  34. Number of words per tweet (mean)
  35. Number of words per tweet (median)
  36. Number of words per tweet (min)
  37. Number of words per tweet (max)
  38. Number of times the tweet has been retweeted (mean), counted by the presence of “RT @username” in the text
  39. Number of times the tweet has been retweeted (median), counted by the presence of “RT @username” in the text
  40. Number of times the tweet has been retweeted (min), counted by the presence of “RT @username” in the text
  41. Number of times the tweet has been retweeted (max), counted by the presence of “RT @username” in the text
  42. Number of followees
  43. Number of followers
  44. Number of tweets
  45. Number of followees of a user’s followers
  46. Number of times mentioned
  47. Number of times the user was replied to
  48. Number of times the user replied
  49. Number of tweets of a user’s followees
  50. Time between posts (mean)
  51. Time between posts (median)
  52. Time between posts (min)
  53. Time between posts (max)
  54. Number of posted tweets per day (mean)
  55. Number of posted tweets per day (median)
  56. Number of posted tweets per day (min)
  57. Number of posted tweets per day (max)
  58. Number of posted tweets per week (mean)
  59. Number of posted tweets per week (median)
  60. Number of posted tweets per week (min)
  61. Number of posted tweets per week (max)
  62. Age of the user account
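
The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the code used in the paper. It assumes the usual LibSVM line layout ("label index:value index:value ...") and that any feature index missing from a line is zero. The names parse_libsvm_line and load_users, the file path passed to load_users, and the sample line are hypothetical and are not part of the dataset.

```python
from typing import List, Tuple

N_FEATURES = 62      # features are numbered 1..62 in the list above
SPAMMER_LABEL = 2    # class 1 = non-spammer, class 2 = spammer


def parse_libsvm_line(line: str, n_features: int = N_FEATURES) -> Tuple[int, List[float]]:
    """Parse one 'label index:value ...' line into (label, dense feature list)."""
    tokens = line.split()
    label = int(float(tokens[0]))
    # Assumption: indices that do not appear on the line are treated as 0.0.
    features = [0.0] * n_features
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index) - 1] = float(value)  # LibSVM indices are 1-based
    return label, features


def load_users(path: str) -> List[Tuple[int, List[float]]]:
    """Read every non-empty line of a LibSVM file (the path is hypothetical)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_libsvm_line(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Made-up example line, not taken from the dataset.
    sample = "2 1:0.5 4:0.9 62:120"
    label, features = parse_libsvm_line(sample)
    print(label == SPAMMER_LABEL, features[3], features[61])
```

With this layout, features[k - 1] holds characteristic k from the list, so features[3] is the fraction of tweets with URLs and features[61] is the age of the user account.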