This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Twitter uses a Real-Time Delivery Architecture. Twitter is well-known for breaking news at a speed faster than even a news industry. Another need for Twitters Real-Time Delivery is because of applications such as Dataminr which is built upon Twitter's API and includes real-time access to Twitter Firehose through which their analytics engine transforms Twitter's streams into actionable signals for clients in the financial and government sectors, providing one of the earliest warning systems for market-relevant information, noteworthy events and emerging trends (Dataminr, 2012). Therefore, time sensitivity is a critical issue for Twitter, and with growing number of users and increasing traffic, Twitter's Architecture should be designed in a way to continue to provide Real-Time Delivery.
Twitter's standard 4-layered model was centered around a single Ruby-on-Rails structure, also known as the Monorail and was the largest of its kind. The Monorail singlehandedly performed the Routing, Presentation and Business Logic functions of Twitter, with the only function that it performed separately being Back-Up Storage & Retrieval. However, Twitter looked to break the application up because the company was not structured as a monolithic engineering team. When one team made core changes, this would interfere with the other team's work since no clear ownership was present . This step to move from Ruby to Java enabled Twitter to improve their developer innovation speed, as well as the site speed and reliability since the responsibilities and concerns were isolated.
There are four types of Timelines at Twitter -
Pull-Based Timelines wherein requests are issued
Targeted - A request sent in by the user to access his/her Home Timeline.
Queried - A 'Search' option or an explicit query where Twitter traverses its database to return information requested for.
Push-Based Timelines where
Targeted - Tweets that are sent by users with mentions hit the database are pushed to the targeted timeline. This also includes Mobile Push where users receive tweets in the form of SMS' on their mobile phones if they have opted to Fast Follow a user.
Queried - Track Stream includes Twitter Firehose where every single public tweet on a particular topic is streamed out to the user that requests for it. However, this is chargeable and is mostly used by businesses such as Dataminr. Also, Follow Stream pushes tweets to the Timelines of a user who follows them.
---------------------------- Asynchronous Path
---------------------------- Synchronous Path
---------------------------- Query Path
---------------------------- Read Path
---------------------------- Write Path
The above diagram depicts the various stages that a tweet goes through depending on whether it is a Pull Based Timeline or Push Based Timeline.
Application Programming Interface (API)
An application-programming interface (API) is a set of programming instructions and standards for accessing a Web-based software or a Web Tool (Doos, 2012). Twitter bases its API off the Representational State Transfer (REST) Architecture (Strickland, 2011). Due to its REST Architecture, Twitter is able to function with most Web Syndication Formats which include Really Simple Syndication (RSS) and Atom Syndication Format (Atom). Twitter also makes its API partially available for software developers to design products that are powered by Twitter's service. Some of the third-party applications that incorporate Twitter's services include -
Twitterlicious and Twitterific For access on desktop computers
Twessenger Integrates Live Messenger
Twittervision Integrates Twitter Feed into Google Maps
Flotzam Integrates Twitter with Facebook, Flickr
Since its launch, Twitter has made use of API v1. However, in March 2013, Twitter announced that it will officially retire API v1 on 7th May, 2013 making way for API v1.1 instead. The new version will use OAuth, an authentication tool to provide authorized access to its API (Twitter Developers, 2013).
Fanout, Redis & Timeline Service - Pull Based Delivery
As soon as a tweet is sent, the first step is to fan-out the message to the followers of that user. A series of caches are maintained in the datacenters known as Redis instances. A user's Timeline and all the tweets that make the user's Home Timeline are kept in these Redis instances. These are replicated so that the Home Timelines are not lost when the machines malfunction. Fanout uses a Social Graph Service which picks out the Home Timelines of all the user's that the tweet is targeted at, and inserts them into their Timelines when the targeted user opens up the Web. A fanout is pipelined to reach 4,000 destinations at a time only.
In Redis itself, the Tweet ID, User ID and miscellaneous BIT fields are stored. Retweets are stored in pointers attached to the parent tweet itself, to ease the operation.
The only step left is delivery of the actual tweet. Timeline Service gathers the information of all the users from the Redis instance picked and posts the tweet on the Home Timeline of all the targeted users. Only users that are in cache are targeted in Redis and takes 40-50 milliseconds for a tweet to reach them. For users that are out of cache, it takes about 2-3 seconds for a tweet to deliver once they open the Web.
Ingester, Earlybird & Blender - Pull Based Delivery
Ingester inspects the tweet text and puts it into a Search Index which is a series of optimized Lucine instances called Earlybird.
Unlike Fanout, where the tweet is replicated in different Redis instances, Ingester puts the tweet text in only one Earlybird instance, and replicates it for backup.
Blender hits all the Earlybirds or at least one replica of each to check for information that matches a query. It then sorts and re-ranks the results before presenting them. Blender also takes into account a user's following, previous searches, geo location. Blender powers the search experience, as well as the Discover service at Twitter which tells user's of the interesting things happening around them.
HTTP Push & Mobile Push - Push Based Delivery
Hosebird takes every tweet on Twitter and figures out how to route it. All events including social graph changes are sent to Hosebird.
Track / Follow Streams push tweets of those that the user is following to the user's Home Timeline. User Streams try replicate the Home Timeline experience. Twitter uses a Follow Match and the Social Graph to push tweets similar to those that a user is following. It is much easier and cheaper for Twitter to filter tweets similar to a user's Social Graph and push them to the user, even if the number of tweets pushed are more than that the user looks for. This is because it works out more expensive if Twitter were to use Blenders and access all the Earlybirds to Search for information in the database.
Mobile Push queries the Social Graph service to identify the Mobile Following and Follow Service. For those who Fast Follow or Mobile Follow a user, tweets are sent in the form of SMS' through mobile carriers.
Hadoop - Push Based Delivery
Hadoop is not a part of the real-time delivery architecture, but still a part of the delivery. Nightly batch analytics are run in order to send out summarization emails to users. Around tens of millions of emails are sent per day to users worldwide. Tweets that users have already seen are filtered out from these emails.
BREAKDOWN OF THE ARCHITECTURE
Synchronous Path - Responds to the user within 50 milliseconds and queues in the Asynchronous Path
Asynchronous Path - At this point, the user is decoupled from Twitter and the Fanout / Ingestion process begins.
Query Path - Once on the other side, this is how the user requests for data from Twitter itself.
Read Path - This depends on whether the user wants a Search or Home Timeline request
Write Path - Computation occurs immediately upon a tweet being written
(Twitter Developers, 2013).
Twitter users Open Source Tools that are made up of -
Rails on the front that handles rendering, cache composition, DB querying and synchronous inserts
C, Scala and Java in the middle that uses Memcached, Varnish for page caching, Kestrel - Message Queue
MySQL for storing data
The proper use of cache is important for large sites. The sites respond to user requests and the reaction rate is a major factor that affects the user experience. The reading and writing of the hard disk (DISK IO) is important to understand the impact speed.
The table below comapres the memory (RAM), hard disk (Disk) and a flash memory (Flash). Therefore, it is important to cache as much data as possible in order to improve the speed of the site. However, a back up on a hard disk is always suggested in order to prevent losses due to power outages.
Twitter engineers believe that responses should be complete in an average of less than 500ms, but the ideal figure is 200ms - 300ms. Twitter employs a multi-level multi-way cache for its large-scale application.
The cache space to store Tweet IDs is known as Vector Cache and has a hit rate of 99%. Another write-through Row Cache contains database records, users and tweets. It uses Cache Money and has a 95% hit rate. The higher the hit rate, the greater the contribution to the cache.
Twitter is accessed mainly through browsers, but also through mobile phones. Based on this, there are two type of users, Apache Web Server Web Portal Channel and the API Channel which accounts for 80-90% of Twitter's usage.
A read-through Fragment Cache contains serialized versions of tweets that are accessed through API with a 95% hit rate. However, the Page Cache that is used to cache the profiles of popular authors whose Home Timelines are frequented had a hit rate of only 40%. Therefore, the Page Cache was moved into its own pool, and the cache misses dropped by about 50%.
HTTP Accelerator Varnish is an open source project that was used as a tool to reduce the pressure of search by caching key words and corresponding search results.
The main task of the Apache Web Server is to parse HTTP and distribution tasks. The Mongrel Rails Server is used to contact the Vector Cache and Row Cache to read data.
Twitter claimed that the use of Varnish reduced the load of Twitter's website by 50% with the use of Cache Money and libmemcached.
Cache is a clever tool employed by Twitter, but so is the Message Queue. The Message queue is used to take the peak and iron it out over time, so that Twitter can skip having to add extra hardware. Twitter's Message Queue is based on the Memcached protocol, no ordering of jobs, no shared state between servers, all is kept in RAM and transactional. Initially, the Message Queue was written in Ruby, but would often crash. A decision to move to Java was made and currently uses only 1,200 lines of code on 3 servers. (Avram, 2009).
This was used to optimize the cluster load. The current client being used is libmemcached, which is a C client library for interfacing to a memcached server. Based on it, the Fragment Cache Optimization over one year led to 50 times more page requests served per second. In order to deal with requests fasters, the data is precomputed and stored on RAM, instead of computing it each time on the server. This allows it to run almost completely from the memory (Avram, 2009).