Spam is a huge problem on Twitter. In certain areas it can account for the majority of tweets. This can get in the way of delivering quality results when you try collecting tweets for aggregation sites or data mining. To get a good idea of what tweet spam looks like, try running a Twitter search for weight loss. If you watch this stream for a while, you will see bursts of identical spam tweets coming from dozens of accounts at the same time. This is a spambot network.
If you study the accounts in these spam networks, you’ll see an interesting pattern. They alternate between sending out normal, conversational tweets, and spam tweets. But the spam tweets often come in a wave, with all of the accounts sending the same tweet. Obviously this network is run by a central script that cycles through the accounts, sending out the normal and spam tweets on an automated basis.
Because spam comes from networks like this, I have found that the key to blocking it is automating the discovery of spam accounts, and then creating a blacklist to exclude those accounts from any data collection I do for clients. Once an account is identified as a spammer, I ignore all of its tweets. I have seen reductions of spammy-looking tweets by as much as 50% just by blocking a hundred or so accounts. Of course, new spam accounts are being created as fast as Twitter can suspend the old ones, so it is important to add a level of automation to this process. When clients see how much cleaner their tweet data is, they understand that this blacklist is a valuable resource that can be applied to all of their future tweet collection.
Here is a basic list of techniques I typically use for building a spam account blacklist. It is important to realize that this approach will reduce your flow of tweets (that is the goal, after all), and that you will block some non-spam tweets. If you make the following techniques tunable, you can adjust the level of spammy activity required to blacklist someone. Then you can test these settings until you get the highest yield of good tweets while blocking as much spam as possible. In my opinion, getting a tweet stream that has most of the spam blocked, while losing about 10% of the good tweets, is much better than a stream with 50% spam.
Create a list of users who send tweets in your space
The goal of this process is to identify spam accounts, not just to block individual tweets, so you need to build a database table with all users who sent the tweets you are collecting. Then you can collect scores for each user on several blacklist criteria. I typically use a spam score field that counts the number of times they use spam words, and a duplicate count field that records the number of times they send duplicate tweets.
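As a rough illustration, a user table like the one described above might look like the following sketch. It uses SQLite purely for convenience; the table name `tweet_users`, the column names, and the helper functions are all hypothetical, and any database would do.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweet_users (
    user_id     INTEGER PRIMARY KEY,
    screen_name TEXT NOT NULL,
    spam_score  INTEGER NOT NULL DEFAULT 0,  -- times spam words were used
    dup_count   INTEGER NOT NULL DEFAULT 0,  -- times a duplicate tweet was sent
    blacklisted INTEGER NOT NULL DEFAULT 0   -- 0 = allowed, 1 = blocked
)
"""


def open_db(path=":memory:"):
    """Open the database and make sure the user table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_user(conn, user_id, screen_name):
    """Add a user the first time we see them; existing scores are kept."""
    conn.execute(
        "INSERT OR IGNORE INTO tweet_users (user_id, screen_name) VALUES (?, ?)",
        (user_id, screen_name),
    )


def add_scores(conn, user_id, spam_hits=0, dup_hits=0):
    """Accumulate spam-word hits and duplicate hits for one user."""
    conn.execute(
        "UPDATE tweet_users "
        "SET spam_score = spam_score + ?, dup_count = dup_count + ? "
        "WHERE user_id = ?",
        (spam_hits, dup_hits, user_id),
    )
```

Keeping the two counters on the user row, rather than on individual tweets, matches the goal of judging accounts instead of single tweets.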
Hold back tweets on new accounts
I do blacklisting in two levels. First, I block all tweets from accounts that look too new or have a few signatures of a spammer: for example, if the account was created within the past month, or if it still uses the default avatar, often called the “egg” on Twitter. I also block tweets from users who have sent only a dozen or so tweets. A common spam technique is to create an account, tweet for a week or so, and then abandon it. This doesn’t give me enough time to detect their activity, so I just keep their tweets out of the results I deliver. These users are reevaluated every 24 hours. Once they get past this initial block, and if their activity is not spammy, I let their tweets into the tweet stream. The second level of blacklisting is based on their tweets over a longer period, usually about 3 or 4 days.
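The first-level hold-back test could be sketched roughly as follows. The `Account` fields and the exact cutoffs (30 days, 12 tweets) are assumptions chosen to mirror the description above; in practice they would be tunable settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed cutoffs; make these configurable in a real system.
MIN_ACCOUNT_AGE = timedelta(days=30)
MIN_TWEET_COUNT = 12


@dataclass
class Account:
    """Hypothetical subset of the profile data that arrives with a tweet."""
    created_at: datetime
    tweet_count: int
    has_default_avatar: bool


def should_hold_back(account, now=None):
    """Return True if the account's tweets should be held out of the stream."""
    now = now or datetime.now(timezone.utc)
    too_new = now - account.created_at < MIN_ACCOUNT_AGE
    too_quiet = account.tweet_count < MIN_TWEET_COUNT
    return too_new or too_quiet or account.has_default_avatar
```

Running this check again for each held-back user on a daily schedule would give the 24-hour reevaluation.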
Create a list of spam words
These will vary for each subject, but generally words like “free” and “coupon” are a good starting point. Programmers are usually good at spotting trends while skimming through lots of data by hand, but for a really thorough analysis you should collect 20,000 to 40,000 tweets and create a word frequency report. Then you can narrow the list down to the words that appear most often in spam.
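A word frequency report can be produced with very little code. The sketch below is one possible approach, not a prescribed one: the function name, the tokenizing regex, and the choice to count each word once per tweet are all assumptions.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z0-9']+")


def word_frequencies(tweets, min_length=3):
    """Count how many tweets each word appears in."""
    counts = Counter()
    for text in tweets:
        # Drop links first so URL fragments do not pollute the report.
        cleaned = URL_RE.sub(" ", text.lower())
        # Use a set so a word repeated within one tweet is counted once.
        words = {w for w in WORD_RE.findall(cleaned) if len(w) >= min_length}
        counts.update(words)
    return counts
```

Calling `word_frequencies(tweets).most_common(100)` on the collected sample gives a short list to review by hand for spam words.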
Score tweets and accumulate spam scores for each user
With a good set of spam words you can score each tweet as it arrives. Just rejecting a tweet or blacklisting a user for a single use of a word like “free” is too extreme. The best approach is to record the number of times a user has one of these words in their tweets, and then blacklist them when they pass a set threshold.
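Here is a minimal sketch of that accumulate-then-threshold idea. The word list contains only the two examples mentioned earlier, and the `SpamScorer` class and its default threshold of 5 are hypothetical; the right threshold depends on your subject area.

```python
import re
from collections import defaultdict

SPAM_WORDS = {"free", "coupon"}  # starting point only; extend per subject
SPAM_THRESHOLD = 5               # assumed default, meant to be tuned

WORD_RE = re.compile(r"[a-z0-9']+")


def spam_word_hits(text):
    """Count occurrences of spam words in one tweet."""
    return sum(1 for w in WORD_RE.findall(text.lower()) if w in SPAM_WORDS)


class SpamScorer:
    """Keeps a running spam-word total per user."""

    def __init__(self, threshold=SPAM_THRESHOLD):
        self.threshold = threshold
        self.scores = defaultdict(int)

    def score_tweet(self, user_id, text):
        """Add this tweet's hits to the user's total and return the total."""
        self.scores[user_id] += spam_word_hits(text)
        return self.scores[user_id]

    def over_threshold(self, user_id):
        """True once the user's accumulated score reaches the threshold."""
        return self.scores.get(user_id, 0) >= self.threshold
```

A user who mentions “free” once stays well under the threshold, while an account that repeats spam words across many tweets crosses it quickly.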
Test each tweet for duplicates
The simplest test for duplicate tweets is to record a checksum, like an MD5, for the text of each tweet. Then you can compare each new tweet with the MD5s of all the tweets received over the past few days. If you get a match, you can call this a duplicate, and increment the duplicate count for the sending user. Obviously, retweets would be caught by this test, so I skip the duplicate check for retweets. For some reason I hardly ever see spammers retweeting, so this works out well.
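The checksum test might be sketched like this. The `DuplicateDetector` class, the three-day window, and the `is_retweet` flag (which the caller is assumed to supply from the tweet data) are illustrative choices. Checksums are kept in memory here; a real system would more likely store them in a database table.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class DuplicateDetector:
    """Flags tweets whose exact text was already seen within a time window."""

    def __init__(self, window=timedelta(days=3)):
        self.window = window
        self.seen = {}                      # MD5 hex digest -> time last seen
        self.dup_counts = defaultdict(int)  # user_id -> duplicates sent

    def check(self, user_id, text, is_retweet=False, now=None):
        """Return True if this tweet duplicates a recent one."""
        if is_retweet:
            return False  # retweets are duplicates by design, so skip them
        now = now or datetime.now(timezone.utc)
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        last_seen = self.seen.get(digest)
        is_dup = last_seen is not None and now - last_seen <= self.window
        if is_dup:
            self.dup_counts[user_id] += 1
        self.seen[digest] = now
        return is_dup
```

Note that in this sketch the first account to send a given text is not counted; only the later senders of the same text get their duplicate count incremented.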
Evaluate each user for spam score and dup score
After a few days of scoring spam words and duplicates, you can review the user accounts, and mark them as blacklisted if their scores are too high. I find that clients like to have control over this, so I generally add a few features to a spam control system. I provide a report that lets the client review all candidates for blacklisting before any user is blocked. I also set up an admin page that lets clients set the threshold levels used for spam word score and duplicate score blacklisting. Finally, I create a rejected tweet log that records all tweets that were kept out of the final stream based on blacklisting. With this level of control it generally takes only a week or two to refine the blacklisting system to deliver the highest flow of tweets with the minimum amount of spam.
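The candidate report could be generated along these lines. This is only a sketch under assumed names: `UserScores`, `Thresholds`, and the default threshold values are hypothetical, and the thresholds stand in for whatever the client sets on the admin page.

```python
from dataclasses import dataclass


@dataclass
class UserScores:
    """Accumulated scores for one user (hypothetical structure)."""
    screen_name: str
    spam_score: int
    dup_count: int


@dataclass
class Thresholds:
    """Client-adjustable limits; the defaults here are arbitrary."""
    spam_score: int = 5
    dup_count: int = 3


def blacklist_candidates(users, thresholds):
    """List users who meet either threshold, with the reasons, for review."""
    candidates = []
    for user in users:
        reasons = []
        if user.spam_score >= thresholds.spam_score:
            reasons.append("spam words")
        if user.dup_count >= thresholds.dup_count:
            reasons.append("duplicates")
        if reasons:
            candidates.append((user.screen_name, reasons))
    return candidates
```

Because the function only reports candidates and never blocks anyone itself, the client can review the list and adjust the thresholds before anything is blacklisted.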