Reference

@article{costa@eis2014,
  author = {Helen Costa and Luiz Henrique de Campos Merschmann and Fabricio Barth and Fabricio Benevenuto},
  title = {Pollution, Bad-mouthing, and Local Marketing: The Underground of Location-based Social Networks},
  journal = {Elsevier Information Sciences},
  year = {2014}
}

 

If you want to use our dataset, please let us know by email at benevenuto AT gmail.com.



Database

The dataset used in the paper "Pollution, Bad-mouthing, and Local Marketing: The Underground of Location-based Social Networks", to appear in Elsevier Information Sciences in 2014, is available.

The file is in Weka ARFF input file format. Each data line represents one review from our test collection. The attributes are separated by commas and represent the review characteristics, in the order given in the list below. The penultimate attribute is the review class ("spam" or "non-spam"). The last attribute is the review class with additional information about the subclasses of spam ("pollution", "bad-mouthing", "local marketing", or "non-spam").

  1. Clicks on the link "This tip helped me"
  2. Clicks on the link "Report abuse"
  3. Number of places registered by the user
  4. Number of tips posted by the user
  5. Number of photos posted by the user
  6. Number of clicks on the place page
  7. Number of tips on the place
  8. Place rating
  9. Clicks on the link "Thumbs down"
  10. Clicks on the link "Thumbs up"
  11. Similarity score (avg)
  12. Similarity score (max)
  13. Similarity score (min)
  14. Similarity score (median)
  15. Similarity score (sd)
  16. Number of distinct 1-grams
  17. Fraction of 1-grams
  18. Number of distinct 2-grams
  19. Fraction of 2-grams
  20. Number of distinct 3-grams
  21. Fraction of 3-grams
  22. Number of spam words and spam rules
  23. Number of capital letters
  24. Number of numeric characters
  25. Number of phone numbers in the text
  26. Number of email addresses in the text
  27. Number of URLs in the text
  28. Number of contact information items in the text
  29. Number of words
  30. Number of words in capital letters
  31. Distance among all places reviewed by the user (avg)
  32. Distance among all places reviewed by the user (max)
  33. Distance among all places reviewed by the user (min)
  34. Distance among all places reviewed by the user (median)
  35. Distance among all places reviewed by the user (sd)
  36. Number of offensive words in the text
  37. Value of “Has offensive word”
  38. Clustering coefficient
  39. Reciprocity
  40. Number of followers (in-degree)
  41. Number of followees (out-degree)
  42. Ratio of followers to followees
  43. Degree
  44. Betweenness
  45. Assortativity (in-in)
  46. Assortativity (in-out)
  47. Assortativity (out-in)
  48. Assortativity (out-out)
  49. PageRank
  50. Number of different areas where the user posted a tip
  51. Tip focus of a user
  52. Tip entropy of a user
  53. SentiWordNet
  54. Emoticons
  55. PANAS-t
  56. SASA
  57. SenticNet
  58. Happiness Index
  59. SentiStrength
  60. Combined-method
  61. Class 1 (spam or non-spam)
  62. Class 2 (pollution, bad-mouthing, local marketing or non-spam)
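As an illustration of the layout described above, the following minimal Python sketch reads the data section of an ARFF file and separates the 60 feature values from the two class attributes at the end of each row. It uses only the standard library. The function names, the toy sample (which has only two features instead of 60), and the exact spelling and quoting of the class labels are assumptions made for illustration; check them against the header of the real file. The reader is deliberately simple and does not handle sparse ARFF rows.

```python
import csv
import io
from collections import Counter


def read_arff_rows(stream):
    """Yield each data row of an ARFF stream as a list of strings.

    Skips the header (everything up to and including @data), blank
    lines, and % comment lines.
    """
    in_data = False
    for line in stream:
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue
        if not in_data:
            in_data = stripped.lower().startswith("@data")
            continue
        # ARFF allows single-quoted values; csv takes care of the quoting.
        yield next(csv.reader([stripped], quotechar="'", skipinitialspace=True))


def split_row(row):
    """Split a row into (features, class_1, class_2).

    The last two attributes are the class labels; the rest are features.
    """
    return row[:-2], row[-2], row[-1]


def class_counts(stream):
    """Count reviews per (class 1, class 2) pair."""
    counts = Counter()
    for row in read_arff_rows(stream):
        _, class_1, class_2 = split_row(row)
        counts[(class_1, class_2)] += 1
    return counts


# Toy sample with two features only (hypothetical, not taken from the dataset).
SAMPLE = """\
@relation tips
@attribute helped numeric
@attribute abuse numeric
@attribute class1 {spam,non-spam}
@attribute class2 {pollution,bad-mouthing,'local marketing',non-spam}
@data
3,0,non-spam,non-spam
0,2,spam,'local marketing'
1,1,spam,pollution
"""

counts = class_counts(io.StringIO(SAMPLE))
for labels, n in sorted(counts.items()):
    print(labels, n)
```

To run this on the real file, pass an open file handle to `class_counts` in place of the `io.StringIO` sample. Feature values are returned as strings and would need converting to numbers before any analysis.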