October 25, 2012

Introducing tweetScroll, a PHP class for archiving Twitter timelines

Full source code can be downloaded from GitHub

For the past week, I’ve been working on a semantic prediction model involving vast amounts of Twitter data. I’ve found Twitter’s API to be cumbersome and discouraging; for whatever reason, Twitter has erected some pretty substantial barriers to downloading Tweet data. In particular, Twitter imposes a rate limit of 150 queries per hour on data access, so downloading a prolific user’s entire timeline can take many access sessions spread across multiple hours. For that reason, I put a lot of time into building a tool that retrieves Tweets robustly.

tweetScroll is the product of this effort: it is a Twitter timeline “archiving” tool which aims to be very simple to implement and execute. tweetScroll was built to run hourly on a cron job to download entire user timelines (and in so doing, compile a large Twitter dataset to be used in semantic analysis). It is contained in one file — tweetScroll.class.php — and is fairly well documented in-line.

Before uploading tweetScroll.class.php to your server, open the file and change these variables to their appropriate values (tweetScroll uses MySQL):

Source code    
private $dbName = 'YOUR MYSQL DATABASE NAME';
private $dbHost = 'YOUR MYSQL DATABASE HOST';
private $dbUser = 'YOUR MYSQL DATABASE USERNAME';
private $dbPass = 'YOUR MYSQL DATABASE PASSWORD';

Class Methods

To create an instance of tweetScroll, execute the following in a separate PHP file (after downloading tweetScroll.class.php and putting it in the appropriate directory):

Source code    
require_once "tweetScroll.class.php";
// Constructor arguments, in order: include_retweets, include_entities
$ts = new tweet_scroll(false, false);

The constructor takes two parameters: include_retweets and include_entities, which are explained in detail in the Twitter user_timeline API documentation. The constructor builds the tables used by tweetScroll if they don’t already exist.

To crawl a user’s Twitter timeline and save that user for future polling, use the followUser('username') method (don’t include the @ at the beginning of the user’s username). This method adds the user to the database and queries one page of the user’s tweets. One “page” of tweets is 100, set by the count variable in the constructor and passed through the API URL. (Twitter’s user_timeline documentation lists a maximum count of 200, so 100 is within the allowed range.) If you want to receive fewer tweets per page, set the count variable to a lower value in the constructor.
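To make the role of count concrete, here is a minimal sketch of how a request URL for one page might be assembled. The helper name buildTimelineUrl() is my own illustration and does not come from tweetScroll.class.php, and the endpoint and parameter names are what I understand the v1 user_timeline call to use, so treat them as assumptions and check them against Twitter’s documentation.

```php
<?php
// Hypothetical helper (not part of tweetScroll): builds the URL for one page of a timeline.
// The endpoint and parameter names are assumed from the v1 user_timeline documentation.
function buildTimelineUrl(
    string $username,
    int $count = 100,
    bool $includeRetweets = false,
    bool $includeEntities = false,
    ?string $maxId = null
): string {
    $params = [
        'screen_name'      => ltrim($username, '@'),   // usernames are passed without the @
        'count'            => $count,                  // tweets per "page"
        'include_rts'      => $includeRetweets ? 'true' : 'false',
        'include_entities' => $includeEntities ? 'true' : 'false',
    ];
    if ($maxId !== null) {
        $params['max_id'] = $maxId;                    // only needed when paging backwards
    }
    return 'https://api.twitter.com/1/statuses/user_timeline.json?' . http_build_query($params);
}

echo buildTimelineUrl('twitter_user_1'), "\n";
```

Lowering the second argument is the equivalent of lowering the count variable in the constructor.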

To scroll through a user’s entire Twitter timeline and save tweets that don’t already exist in the database, use the scrollTweets('username') method (again without the @ prefix). This method starts from the minimum tweet ID stored in the database and recursively crawls backwards until it reaches the user’s first tweet. It then starts from the user’s most recent tweet and recursively crawls backwards until all tweets that are not yet stored in the database are archived. If the user’s entire timeline is archived, it flags that user as being “complete” in terms of their Twitter timeline and puts them at the bottom of the queue for future archiving. If the user specified in the scrollTweets() method isn’t stored in the database, the method runs followUser() for that user first, so followUser() doesn’t need to be called before scrollTweets().
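The back-filling step can be sketched roughly as follows. This is not the code from tweetScroll.class.php: it uses a loop rather than recursion, the function names are hypothetical, and fetchPage() is a stand-in that reads from an in-memory array instead of calling Twitter. It only illustrates walking backwards with max_id and resuming from the minimum stored tweet ID.

```php
<?php
// Stand-in for the API call (hypothetical): returns up to $count tweets with
// id <= $maxId, newest first. $timeline must already be sorted newest first.
function fetchPage(array $timeline, ?int $maxId, int $count): array
{
    $page = [];
    foreach ($timeline as $tweet) {
        if ($maxId === null || $tweet['id'] <= $maxId) {
            $page[] = $tweet;
            if (count($page) === $count) {
                break;
            }
        }
    }
    return $page;
}

// Walks backwards from $startMaxId, storing unseen tweets in $stored (keyed by id).
// Returns true once a page adds nothing new (first tweet reached), or false if
// the query budget runs out first.
function crawlBackwards(array $timeline, array &$stored, ?int $startMaxId, int $count, int $budget): bool
{
    $maxId = $startMaxId;
    while ($budget-- > 0) {
        $added = 0;
        foreach (fetchPage($timeline, $maxId, $count) as $tweet) {
            if (!isset($stored[$tweet['id']])) {
                $stored[$tweet['id']] = $tweet;
                $added++;
            }
        }
        if ($added === 0) {
            return true;
        }
        // max_id is inclusive, so the next page starts with a tweet we already have
        $maxId = min(array_keys($stored));
    }
    return false;
}

// Demo: a ten-tweet timeline, four tweets per page, two queries in the first session.
$timeline = [];
for ($id = 10; $id >= 1; $id--) {
    $timeline[] = ['id' => $id, 'text' => "tweet $id"];
}
$stored = [];
$done = crawlBackwards($timeline, $stored, null, 4, 2);
echo count($stored), " stored, complete: ", var_export($done, true), "\n";

// A later session resumes from the minimum stored id.
$done = crawlBackwards($timeline, $stored, min(array_keys($stored)), 4, 5);
echo count($stored), " stored, complete: ", var_export($done, true), "\n";
```

The second call is what makes an hourly cron job workable: nothing about the first session needs to be remembered except what is already in the database.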

To get a list of all users that are not “complete”, use the getUsers() method, which returns an array of objects containing user information.

The reason the Twitter API is so difficult to work with when compiling data is that it limits queries to 150 per hour. tweetScroll effectively saves “state” when querying a user’s timeline, so if a crawl is interrupted before a user’s complete timeline is archived (because the rate limit has been reached), tweetScroll will pick up where it left off with that user the next time it crawls. It keeps users in a queue sorted by completion status and then by username; if set up to run regularly (each hour) in a cron job, tweetScroll will cycle through users, back-filling their timelines first and then polling for new tweets since the last update. If a new user is added later, that user will be back-filled before other users are polled.
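One way to picture the queue is as a sort over the user rows. The sketch below is an assumption about the ordering, not a copy of tweetScroll’s query: it puts users that have not been back-filled first, then users not yet completed in the current cycle, then sorts by username. The sortQueue() function is hypothetical; the field names match the twitter_users columns.

```php
<?php
// Hypothetical ordering of the crawl queue (assumed, not taken from tweetScroll):
// 1. users whose timelines have not been back-filled (found_first_tweet = 0)
// 2. users not yet completed in the current cycle (poll_ID = 0)
// 3. alphabetical by username
function sortQueue(array $users): array
{
    usort($users, function (array $a, array $b): int {
        return [$a['found_first_tweet'], $a['poll_ID'], $a['username']]
           <=> [$b['found_first_tweet'], $b['poll_ID'], $b['username']];
    });
    return $users;
}

$queue = sortQueue([
    ['username' => 'carol', 'found_first_tweet' => 1, 'poll_ID' => 1],
    ['username' => 'alice', 'found_first_tweet' => 1, 'poll_ID' => 0],
    ['username' => 'dave',  'found_first_tweet' => 0, 'poll_ID' => 0], // newly added user
    ['username' => 'bob',   'found_first_tweet' => 1, 'poll_ID' => 0],
]);
echo implode(', ', array_column($queue, 'username')), "\n";
```

With this ordering a newly added user such as dave jumps to the front, which matches the behaviour described above.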

Three methods return counts for the most recent crawl session, where a crawl session is whatever crawling is accomplished with the tweetScroll object before the rate limit is reached. uniqueTweets() returns the number of new tweets added to the database. uniqueUsers() returns the number of unique users whose timelines were crawled fully (i.e. their entire timelines are now “complete”). duplicateTweets() returns the number of tweets crawled that already existed in the database and were thus discarded; because crawling through a Twitter timeline involves iterating with the max_id cursor, some redundant tweets will be crawled.
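To show where the duplicate count comes from, here is a small tally sketch. The storePage() function and the $counts array are hypothetical and only mirror the idea behind uniqueTweets() and duplicateTweets(); the real class keeps its own counters.

```php
<?php
// Hypothetical tally: stores unseen tweets and counts the ones already archived.
function storePage(array &$stored, array $page, array &$counts): void
{
    foreach ($page as $tweet) {
        if (isset($stored[$tweet['id']])) {
            $counts['duplicate']++;          // already in the archive, so discarded
        } else {
            $stored[$tweet['id']] = $tweet;
            $counts['unique']++;
        }
    }
}

$stored = [];
$counts = ['unique' => 0, 'duplicate' => 0];

// Two consecutive pages: the second was requested with max_id = 7,
// so it begins with the tweet that ended the first page.
storePage($stored, [['id' => 9], ['id' => 8], ['id' => 7]], $counts);
storePage($stored, [['id' => 7], ['id' => 6], ['id' => 5]], $counts);

echo $counts['unique'], " unique, ", $counts['duplicate'], " duplicate\n";
```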

Usage Examples

If I wanted to begin tracking three Twitter users, I’d execute the following:

Source code    
require_once "tweetScroll.class.php";
// Constructor arguments, in order: include_retweets, include_entities
$ts = new tweet_scroll(false, false);
$ts->followUser('twitter_user_1');
$ts->followUser('twitter_user_2');
$ts->followUser('twitter_user_3');
// getUsers() returns only the users whose timelines are not yet "complete"
$usernames = $ts->getUsers();
foreach ($usernames as $user) {
    $ts->scrollTweets($user->username);
}
echo $ts->uniqueTweets() . " tweets added to database";

This code would add the three users to the database with followUser(), inserting each user into the tracking table and querying one page (100 tweets) of each user’s timeline. It would then create an array of user objects and iterate through that array, trying to back-fill each user’s complete timeline until the query limit was reached. The last line prints the number of new tweets added to the database in the session.

After running the above script once, I would remove the followUser() calls; although followUser() does nothing if the user indicated already exists in the tracking database, it still performs an unnecessary query.

If, after executing the above script through a cron job for a while, I wanted to add a new user to the tracking database, I’d use the followUser method again, like this:

Source code    
require_once "tweetScroll.class.php";
// Constructor arguments, in order: include_retweets, include_entities
$ts = new tweet_scroll(false, false);
$ts->followUser('twitter_user_4');
$usernames = $ts->getUsers();
foreach ($usernames as $user) {
    $ts->scrollTweets($user->username);
}
echo $ts->uniqueTweets() . " tweets added to database";

After this script has run once, I could remove the followUser() call; that user would then be crawled through the scrollTweets() method each time the script runs.

The user objects stored in the usernames variable above contain the following elements:

Source code    
twitter_users.id AS id,
twitter_users.username AS username,
twitter_users.user_real_name AS real_name,
twitter_users.user_lang AS user_lang,
twitter_users.user_location AS user_location,
twitter_users.url_slug AS url_slug,
twitter_users.poll_ID AS poll_ID

What is stored in the database?

The script creates two tables: tweets and twitter_users. The twitter_users table stores metadata about the users tracked; its columns are:

Source code    
id BIGINT(25) -- the twitter user id
addedDate DATETIME -- the date the twitter user was added to the tracking database
username VARCHAR(100) -- the twitter username
user_real_name VARCHAR(100) -- the user's real name
user_location VARCHAR(100) -- the user's location (as they indicate in their profile)
user_lang VARCHAR(5) -- the user's tweet language (as they indicate in their profile)
url_slug VARCHAR(100) -- a url-friendly manipulation of the user's username
poll_ID INTEGER(1) -- an integer that determines if the user's timeline was "completed". It is set to 1 when it is complete in the current user list iteration; it is 0 when it has not yet been completed in the current user list iteration.
found_first_tweet BOOLEAN -- indicates if the user's timeline has been back-filled to the first tweet yet.

The tweets table stores user tweets; its columns are:

Source code    
id BIGINT(25) -- the tweet id
addedDate DATETIME -- the date the tweet was archived
username VARCHAR(100) -- the username of the tweet's author
date VARCHAR(100) -- the date the tweet was published
text VARCHAR(160) -- the content of the tweet
geo VARCHAR(30) -- whatever geographic data was included with the tweet
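For reference, the two column lists could be written as CREATE TABLE statements along these lines. This is a sketch, not the statements from tweetScroll.class.php: the schemaStatements() helper is hypothetical, the column names and types are copied from the lists above, and the PRIMARY KEY (id) clauses are my assumption. IF NOT EXISTS reflects the constructor only building the tables when they are missing.

```php
<?php
// Hypothetical helper returning CREATE TABLE statements for the two tables.
// Column names and types follow the lists above; PRIMARY KEY (id) is assumed.
function schemaStatements(): array
{
    $users = <<<'SQL'
CREATE TABLE IF NOT EXISTS twitter_users (
    id BIGINT(25),
    addedDate DATETIME,
    username VARCHAR(100),
    user_real_name VARCHAR(100),
    user_location VARCHAR(100),
    user_lang VARCHAR(5),
    url_slug VARCHAR(100),
    poll_ID INTEGER(1),
    found_first_tweet BOOLEAN,
    PRIMARY KEY (id)
)
SQL;

    $tweets = <<<'SQL'
CREATE TABLE IF NOT EXISTS tweets (
    id BIGINT(25),
    addedDate DATETIME,
    username VARCHAR(100),
    `date` VARCHAR(100),
    `text` VARCHAR(160),
    geo VARCHAR(30),
    PRIMARY KEY (id)
)
SQL;

    return [$users, $tweets];
}

// Each statement could then be executed against MySQL, for example with PDO::exec().
foreach (schemaStatements() as $sql) {
    echo $sql, "\n\n";
}
```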

Known Issues

tweetScroll probably won’t work well for very large lists of Twitter users. For one, the twitter_users table needs to be normalized to remove partial dependencies, and the counts should be moved to mapping tables to avoid table updates. Also, the limit of 150 queries per hour means a full crawl of a massive user list would take a very long time. This is, unfortunately, a fact of life unless Twitter makes its data more accessible.
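A back-of-the-envelope estimate shows how quickly the rate limit adds up. The function below is a hypothetical helper, and the user and tweet counts in the example are made-up inputs chosen for easy arithmetic. It also ignores the duplicate tweets re-fetched at page boundaries and any later polling, so it gives a lower bound.

```php
<?php
// Hypothetical estimate: hours needed to back-fill a user list, assuming every
// user has the same number of tweets and no query is wasted.
function hoursForUserList(int $users, int $tweetsPerUser, int $perPage = 100, int $queriesPerHour = 150): int
{
    $queriesPerUser = (int) ceil($tweetsPerUser / $perPage);
    $totalQueries   = $users * $queriesPerUser;
    return (int) ceil($totalQueries / $queriesPerHour);
}

// 1,000 users with 3,000 tweets each: 30 queries per user, 30,000 queries in total.
echo hoursForUserList(1000, 3000), " hours\n"; // prints "200 hours"
```

At one cron run per hour, that is more than eight days of continuous crawling for a list that is not especially large.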

Disclaimer: I’m not a software engineer; tweetScroll may have bugs in it, and it may not be optimized. I haven’t tested edge cases, only my very specific use case. The purpose of tweetScroll is to accumulate a large corpus of Twitter data on which to perform textual sentiment analysis; I had no intention of collecting tweets in real time or keeping tweet archives up to date. If you find a bug or can suggest an improvement, feel free to leave a note in the comments below.

Downloading

tweetScroll may be downloaded from this GitHub repository.

Posted in: Big Data, Data Science

About the Author:

I am a quantitative marketing and mobile user acquisition specialist. My specific interests include analytics, user acquisition, freemium economics, and programmatic statistical methods for prediction. ufert.se is where I discuss issues facing the freemium mobile industry as well as general trends in data science and analytics. Feel free to email me using the contact form on the About page.