Computing for Good
Inneractive's mobile ad exchange supports a broad range of inventory sources, and our technology platform is all about scale. In any given minute, we process over 2.5 million ad requests, and this number can rise to as many as 5 million at peak times. Given this high and constantly fluctuating volume, scaling up and down is essential for us. Without it, we could not be sure of processing such a high volume in real time, and we would overpay for servers whenever the volume goes down.
We use Spotinst for our cloud computing optimization. We run on AWS, where spot instances are billed per hour, the minimum time unit for billing purposes. Our traffic fluctuations mean that anywhere between 300 and 600 spot instances may be terminated each day. Each termination can leave us with 20 to 30 unused minutes in the billing hour. These minutes? They add up! Over the course of a day, we can accumulate 4 to 12 days of unused CPU time.
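As a quick sanity check of that estimate, the arithmetic can be sketched in a few lines of Python. The function name is purely illustrative, and the inputs are the ranges quoted above (300 to 600 terminations a day, 20 to 30 unused minutes each).

```python
MINUTES_PER_DAY = 24 * 60


def unused_cpu_days(terminations_per_day, unused_minutes_each):
    """Convert unused instance minutes accumulated in one day into days of CPU time."""
    return terminations_per_day * unused_minutes_each / MINUTES_PER_DAY


# Lower and upper bounds taken from the figures in the text.
low = unused_cpu_days(300, 20)
high = unused_cpu_days(600, 30)
print(f"roughly {low:.1f} to {high:.1f} days of unused CPU time per day")
```

The bounds come out slightly above 4 days and slightly above 12 days, which matches the 4 to 12 days mentioned above.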
With this CPU power already paid for, we didn't want it to go to waste. We chose to use this computing power to do good and decided to donate it to a worthy cause via the Boinc Project.
The Boinc Project is an open-source volunteer computing platform, developed at the University of California, Berkeley, that allows anyone, companies or individuals alike, to use idle time on a computer, regardless of its operating system, to help with scientific research such as finding cures for diseases or better understanding global warming, to name a few. With the project, companies and individuals can contribute computing power in an easy and secure manner to help advance research. We chose to donate our unused days of CPU time to IBM's World Community Grid (WCG), which runs on Boinc.
Donating the CPU time was for a good cause, yet it was not free of challenges, and we decided to invest the time needed to find solutions ourselves. Our first challenge was the way Spotinst terminates instances during downscaling. With the help of the Spotinst team, we created a new feature so that a session would not be automatically terminated. We combined it with a different feature that sends a message to the AWS SNS service, flagging the instance to be terminated at the end of the billing hour. By using these two features, we resolved our first challenge and were able to accumulate the paid-for but unused time.
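To illustrate what "the end of the billing hour" means for a single instance, here is a minimal sketch that computes how much of the current paid hour is still left. It is not the Spotinst or SNS integration itself; the function name and the assumption that billing hours are counted from the instance launch time are ours.

```python
from datetime import datetime, timedelta, timezone

BILLING_PERIOD = timedelta(hours=1)


def remaining_in_billing_hour(launch_time, now):
    """Return the time left in the current billing hour, counted from launch."""
    elapsed = now - launch_time
    if elapsed < timedelta(0):
        raise ValueError("now must not be earlier than launch_time")
    # Time already consumed inside the billing hour that is currently running.
    used_in_current_hour = elapsed % BILLING_PERIOD
    return BILLING_PERIOD - used_in_current_hour


launch = datetime(2017, 1, 1, 10, 0, tzinfo=timezone.utc)
downscaled_at = datetime(2017, 1, 1, 12, 35, tzinfo=timezone.utc)
print(remaining_in_billing_hour(launch, downscaled_at))
```

In this hypothetical example the instance is downscaled 35 minutes into its third billing hour, so 25 paid minutes remain available for Boinc work.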
Our second challenge stemmed from the significant amount of time Boinc calculations require to complete. On the most common instance types, a calculation may require over an hour. In the first days after installing the Boinc agent on the downscaled instances, we saw that the total calculation time only added up to several hours per day. We found that, on any instance, project data had to be downloaded before calculations could start. This meant that instances with low CPU power, or a short time until termination (TTL), couldn't complete a calculation and upload it to the WCG servers.
Unfortunately, no solution was available online. So, we went ahead and created our own using an NFS share. The Boinc agent data directory, located in 'var/lib/boinc-client', is where all of the partially completed calculation data is saved. By installing an NFS server on a t2.micro instance (which is free or almost free), we could share these partial calculations between instances. Every down-scaled instance installed the Boinc agent, deleted its local Boinc data directory and then linked to the NFS share instead. This way, calculations are continuous instead of starting from scratch on each new instance.
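The delete-and-link step described above can be sketched roughly as follows. This is only an illustration under our own assumptions: the function name is hypothetical, both paths are passed in by the caller, the NFS share is assumed to be mounted already, and the Boinc agent is assumed to be stopped while its data directory is swapped.

```python
import shutil
from pathlib import Path


def link_data_dir_to_share(data_dir, share_dir):
    """Replace the local data directory with a symlink to a directory on the share."""
    data_dir = Path(data_dir)
    share_dir = Path(share_dir)
    # Make sure the target on the (already mounted) share exists.
    share_dir.mkdir(parents=True, exist_ok=True)
    if data_dir.is_symlink():
        data_dir.unlink()          # stale link from an earlier run
    elif data_dir.exists():
        shutil.rmtree(data_dir)    # drop the fresh, empty local data
    data_dir.symlink_to(share_dir, target_is_directory=True)
```

After the call, anything the agent writes to its data directory lands on the share, so the next instance that links to the same directory can pick up the partially completed work.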
Third and last was the challenge of concurrency. As we mentioned before, there were 300 to 600 down-scales a day that could be utilized. When down-scales occurred simultaneously, the instances would all attempt to connect to the same NFS share and use the same Boinc data directory. We had to find a solution so that instances could work concurrently.
So, we turned to scripting and created a Python script that returns a number from 1 to X using a round-robin algorithm. The script checks whether that number already exists as a directory under the aforementioned NFS share and returns the number together with true / false accordingly. This is how the downscaled instance “knows” whether it needs to install the Boinc agent and copy the data from its own directory to the NFS share, or whether it simply needs to use the shared data already on the NFS share.
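A script along these lines could look like the sketch below. It is not our production script: the function name, the counter file name and the one-directory-per-number layout are assumptions made for illustration, and the sketch does not lock the counter file, so two instances calling it at exactly the same moment could still receive the same number.

```python
from pathlib import Path


def next_slot(share_root, slot_count, counter_name=".next_slot"):
    """Return (slot, exists): the next number in 1..slot_count, round robin,
    and whether a directory with that number already exists on the share."""
    share_root = Path(share_root)
    counter_file = share_root / counter_name
    try:
        last = int(counter_file.read_text())
    except (FileNotFoundError, ValueError):
        last = 0  # first call, or unreadable counter: start again from slot 1
    slot = last % slot_count + 1  # wraps from slot_count back to 1
    counter_file.write_text(str(slot))
    return slot, (share_root / str(slot)).is_dir()
```

With a result of (3, False), an instance would seed directory 3 with its own Boinc data; with (3, True), it would simply link to the data that is already there.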
By solving these challenges, we were able to fully utilize all of our AWS spot instances to support scientific research without spending an extra penny. This way, we can donate our unused computing power to an important cause while we continue processing millions of requests per minute.