TweetRoost now uses Twitter Site Streams in production for paid users. We are very excited about this: it gives much better performance for a Twitter client like TweetRoost, especially for an organization with more than a few users simultaneously signed in to one Twitter Account.

So what, you ask, are Twitter Site Streams? They are a new way to gather data from Twitter: a client such as TweetRoost opens one persistent connection and ‘asks’ Twitter to send all the data for the people we manage as a single stream. Instead of our making hundreds or thousands of REST API calls, Twitter simply sends the appropriate data down the stream once the Site Stream connection is opened, and TweetRoost saves that data locally for as long as it is needed. Twitter’s overhead to support applications drops sharply, since one open connection is much cheaper for Twitter than thousands of API calls. And TweetRoost can serve screens like the Home Timeline from its local disk instead of calling Twitter over the internet, which is much faster and more reliable.
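In code, the single connection that replaces thousands of per-user calls looks roughly like this. A minimal sketch only: the endpoint and `follow` parameter come from Twitter's Site Streams documentation, OAuth signing is omitted, and the `site_stream_request` helper is a name invented here for illustration.

```python
def site_stream_request(user_ids):
    """Describe the one persistent connection that replaces
    thousands of per-user REST calls. A single 'follow' list
    covers every account we manage (OAuth signing omitted)."""
    return {
        "method": "POST",
        "url": "https://sitestream.twitter.com/1.1/site.json",
        # All managed accounts ride on this one connection.
        "params": {"follow": ",".join(str(uid) for uid in user_ids)},
    }

req = site_stream_request([12, 34, 56])
print(req["params"]["follow"])  # 12,34,56
```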

We started programming against the Site Streams API a few months ago. We open a stream and announce which users we want data for. As the data arrives, it has to be saved to disk super-fast. That’s all the stream ‘consumer’ does: it reads Tweets in real time and saves them locally. Then, when TweetRoost needs to display data, it reads its local disk and displays the data as if it had come directly from Twitter.
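That consumer loop can be sketched as follows. A minimal sketch, assuming newline-delimited JSON on the wire and Site Streams' `for_user` envelope; the SQLite schema and the `consume_stream` name are illustrative, not our exact production code.

```python
import json
import sqlite3

def consume_stream(lines, db):
    """Read newline-delimited JSON messages and save each one
    to the local SQLite 'fast database' immediately."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages (for_user INTEGER, payload TEXT)"
    )
    for line in lines:
        line = line.strip()
        if not line:  # the stream sends blank keep-alive lines
            continue
        envelope = json.loads(line)
        # Site Streams wraps each message with the account it belongs to.
        db.execute(
            "INSERT INTO messages (for_user, payload) VALUES (?, ?)",
            (envelope["for_user"], json.dumps(envelope["message"])),
        )
        db.commit()  # persist right away so nothing is lost

# Simulated stream data standing in for the live connection:
fake_stream = [
    '{"for_user": 12, "message": {"text": "hello", "id": 1}}',
    "",  # keep-alive
    '{"for_user": 34, "message": {"text": "world", "id": 2}}',
]
db = sqlite3.connect(":memory:")
consume_stream(fake_stream, db)
count = db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(count)  # 2
```

Committing after every message trades a little throughput for durability: if the consumer dies, everything received so far is already on disk.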

Here is an outline of how this works and the work involved. First, calls are made to the Site Streams API with a list of Twitter Accounts to be managed via the Streams. For us, this means going through all customers in our main MySQL database, determining which Twitter Accounts they connect to, and then starting the Stream. Next, we read the arriving data in the stream, classify each message as a Tweet, Mention, Message, etc., determine which Twitter Account it is for, and save all of this immediately to the local fast database (SQLite). Then, when a user asks TweetRoost for something, like a timeline or all their mentions, TweetRoost reads the fast database, which holds data in (more or less) the same format as the usual REST API, and shows it onscreen.
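The classification step might look like this. A sketch only: the bucketing rules and the `classify` helper are illustrative, based on the general shapes of the JSON Twitter sends, not our exact production code.

```python
def classify(message, account_screen_name):
    """Bucket one incoming message the way the stream reader does:
    Tweet, Mention, or (direct) Message."""
    if "direct_message" in message:
        return "message"
    if "text" in message and "user" in message:
        mentions = message.get("entities", {}).get("user_mentions", [])
        if any(m.get("screen_name") == account_screen_name for m in mentions):
            return "mention"
        return "tweet"
    return "other"  # friend lists, delete notices, etc.

# Example messages in the shapes Twitter sends:
tweet = {"text": "hi", "user": {"screen_name": "alice"},
         "entities": {"user_mentions": []}}
mention = {"text": "hi @ourco", "user": {"screen_name": "bob"},
           "entities": {"user_mentions": [{"screen_name": "ourco"}]}}
dm = {"direct_message": {"text": "psst", "sender_screen_name": "carol"}}

print(classify(tweet, "ourco"))    # tweet
print(classify(mention, "ourco"))  # mention
print(classify(dm, "ourco"))       # message
```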

That was the (pretty) easy part. This part is harder: if a new customer signs up (paid), or a paid user adds a Twitter Account for management via TweetRoost, a Stream needs to be opened for that new user or account. Since a Site Stream can carry any number of accounts (even one), we can open a new Stream for a single user or account. But the Site Streams spec says not to keep many Streams open for a long time, so these temporary Streams must eventually be folded back into the main one. Also, the Site Stream connection needs to be tested periodically so we know it hasn’t gone down; otherwise, those Tweets would be lost. And the SQLite database would grow forever as the Site Streams keep dumping data into it, so a maintenance/cleanup function is required. Lastly, when Tweets are deleted, the space they occupied needs to be reclaimed.

To implement this part, we used cron (cron is a scheduler that runs background jobs at whatever times you like). We wrote a master cron job that runs every minute. First, it checks whether the main Streams process is running; if not, it starts it. This handles boot time, and also the case where Twitter’s connection to our Streams has failed. At 4am, it checks the processes running on the servers and kills all Streams (including the temporary Streams for new paid users). Then it goes through the SQLite database and trims any Tweet list over 800 entries (this sounds costly, but it is only one or two SQL statements, and it is very fast). Then it ‘vacuums’ the database to give back the freed-up space. That is also fast. The first time we ran this on some test accounts (in test mode, so it was not 4am), the database size went from 130 MB to 9 MB! Finally, it starts a new main Streams reader, which picks up the new paid accounts.
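The trim-and-vacuum step boils down to SQL like this. A sketch with an illustrative schema; the real table layout differs, but the 800-entry trim and the VACUUM work as described above.

```python
import sqlite3

TIMELINE_CAP = 800  # keep at most this many Tweets per account

def trim_and_vacuum(db):
    # For each account, delete everything older than its 800 newest
    # Tweets. One statement per account, so it stays cheap.
    accounts = [row[0] for row in db.execute("SELECT DISTINCT account FROM tweets")]
    for account in accounts:
        db.execute(
            """DELETE FROM tweets
               WHERE account = ? AND id NOT IN (
                   SELECT id FROM tweets WHERE account = ?
                   ORDER BY id DESC LIMIT ?)""",
            (account, account, TIMELINE_CAP),
        )
    db.commit()
    db.execute("VACUUM")  # hand freed pages back to the filesystem

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, account TEXT, payload TEXT)")
db.executemany(
    "INSERT INTO tweets (account, payload) VALUES (?, ?)",
    [("acme", "tweet %d" % i) for i in range(1000)],
)
trim_and_vacuum(db)
remaining = db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(remaining)  # 800
```

One detail worth noting: SQLite refuses to VACUUM inside an open transaction, which is why the commit happens before the VACUUM call.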

So why is this really good? There are a few reasons. First, you get the data much faster: if you want your home timeline, for instance, TweetRoost with Site Streams just reads the local-disk ‘fast database’ with one simple SQL call (no round-trip API calls to Twitter over the internet). Second, if Twitter is intermittently down but the Site Streams are up (which happened earlier this week), you won’t be adversely affected; when that happened, TweetRoost’s Site Streams connection kept working perfectly. Third, if Twitter and the Site Streams are both down, you can still see your most recent important Twitter data, since we saved it to our disk before Twitter went down. And last, but also important: more people using Site Streams means Twitter will be more stable and fast in general, which is good for everyone.
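That ‘one simple SQL call’ read path looks roughly like this. The schema and names are illustrative, not our production code.

```python
import sqlite3

def home_timeline(db, account, limit=20):
    """Serve a timeline straight from the local 'fast database':
    one simple SQL call, no round-trip to Twitter."""
    rows = db.execute(
        "SELECT payload FROM tweets WHERE account = ? ORDER BY id DESC LIMIT ?",
        (account, limit),
    )
    return [payload for (payload,) in rows]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, account TEXT, payload TEXT)")
db.executemany(
    "INSERT INTO tweets (account, payload) VALUES (?, ?)",
    [("acme", "tweet %d" % i) for i in range(30)],
)
timeline = home_timeline(db, "acme")
print(len(timeline))  # 20
print(timeline[0])    # newest first: 'tweet 29'
```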

You can try out TweetRoost at mediaroost.com and see how saved Tweets, long Tweets, Roles, Scheduling, and more can help your organization.

Mark