You experimented with a temporary cluster and big data crunching. Now you have use-cases that benefit from a permanent cluster, e.g., giving business teams access to big data or running long computations. Which setup is efficient, avoids upfront capital investment, and is achievable with in-house know-how?
In my last post, Democratize Big Data With Hive, I described why at Rangespan we moved away from a transient Amazon Web Services Elastic MapReduce (EMR) cluster to a permanent one. The decision was born out of increasing demand for computing time and the lack of interactivity of a setup that required long startup times and had no user-friendly interface. Our cluster is comparatively modest, merely four m1.large EC2 instances on EMR, and there is significant uncertainty about how fast and how large it will grow. Will we double our computing needs in the next year, or increase them by an order of magnitude? That depends on many variables of our success.
As a startup we can only invest capital in proven products; where possible we prefer operational expenditure, which lets us retain the agility to grow, shrink, abandon, or stand up architectures quickly in reaction to customer demand and product development.
The effortless solution would have been to continue with EMR and keep the cluster running 24/7. This was undesirable for two reasons. First, EMR costs $0.06/h per machine, which comes to $2,102.40 per year for our four machines. Second, and more importantly, EMR is simple at the expense of flexibility.
Compare that with a distribution like Cloudera's. It provides the latest software versions and more flexibility, such as simplified installation of additional services, e.g., Hue, YARN, ZooKeeper, HBase, Flume, and Impala. In particular Hue, a browser-based interface to Hadoop and its services like Hive, was one we wanted. It proved very beneficial in opening up access to our data, supporting our cross-team development process, and improving business intelligence. Lastly, Cloudera comes with the Cloudera Manager, which streamlines managing clusters -- installing services or upgrading software cluster-wide.
Consequently, we installed Cloudera on four m1.large EC2 instances, using an m1.micro for the manager installation. We mostly use Hue, Hive, Oozie, and Sqoop at the moment, but use-cases for Flume and other services are already being discussed. The hassle-free installation of services with the Cloudera Manager is an added bonus when we want to experiment with them.
An EC2-based cluster is not a cheap proposition. An on-demand setup as described costs $9,285.60 (4 x $2,277.60 + $175.20) per year. Alternatively, buying high-utilization reserved instances for the m1.larges and a light-utilization reserved instance for the m1.micro drops the cost to $5,490.68 (4 x $1,340.64 + $128.12), a reduction of more than 40 percent.
An alternative would be a mixed cluster with spot and on-demand instances, or a full spot-instance cluster. This requires that you can deal with losing the cluster (or parts of it) for a period of time: spot instances are pulled from you without warning when your bid price falls below the market rate. Such a setup can be implemented by retaining checkpoint data on S3, for example. In this case you can achieve a cost as low as $2,295.12 (4 x $560.64 + $52.56) per year in the best-case scenario (current floor price of $0.064/h for m1.large in EU-West). That is a potential saving of more than 75 percent over on-demand non-reserved instances. In the long run, we will discuss whether owning our own hardware would be more cost-effective. At the moment, however, we appreciate the flexibility we have with AWS.
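The cost figures above can be reproduced with a few lines of arithmetic. This is a sketch based on the 2013 EU-West hourly rates quoted in this post; current AWS pricing will differ, and the helper function below is just an illustration, not an AWS tool.

```python
# Annual cost sketch for the cluster options discussed above.
# Hourly rates are the 2013 EU-West figures quoted in the post.
HOURS_PER_YEAR = 24 * 365  # 8,760

def annual(hourly_rate, machines=1):
    """Annual cost for a number of machines running 24/7 at an hourly rate."""
    return round(hourly_rate * HOURS_PER_YEAR * machines, 2)

# EMR charge alone: $0.06/h per machine, four machines.
emr_charge = annual(0.06, machines=4)            # $2,102.40

# Full cluster: four m1.large workers plus one m1.micro for the manager.
on_demand = 4 * 2277.60 + 175.20                 # $9,285.60
reserved  = 4 * 1340.64 + 128.12                 # $5,490.68
spot      = annual(0.064, machines=4) + 52.56    # $2,295.12 (best case)

savings_reserved = 1 - reserved / on_demand      # just over 40 percent
savings_spot     = 1 - spot / on_demand          # just over 75 percent
```

The spot figure is a floor, not an expectation: it assumes the market price never rises above $0.064/h, which in practice it will.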
Lastly, such a setup does not hamper the needs of a startup. Companies or departments trialling new products or changing architectures can pilot them with modest funds before applying for substantial investment. Furthermore, entire electronic-service companies run in the cloud, as Netflix, Amazon's poster child, demonstrates -- as reported in InformationWeek. It operates nearly its whole business on EC2, using Hadoop and Cassandra clusters and growing and shrinking them with demand.
User Rank: Exabyte Executive 1/26/2013 | 4:26:49 PM
Re: Startups gain with the cloud The hardware and software costs of the cloud computing market have evolved to the point where startups can operate in Big Data like many other types of companies. It is only natural that costs push them to employ cloud services, making them key customers of cloud service providers. Overall, the growth and profit potential of cloud and big data has expanded into new market segments.
Re: Startups gain with the cloud @Daniel, the flexibility is indeed the biggest opportunity. Certainly, comparable setups have become much cheaper. At the same time, big(ger) data means many businesses require more storage, bandwidth, and computing power, which again costs significant amounts of money. The chance to shift the expense to opex rather than capex and pivot quickly is truly amazing.
User Rank: Petabyte Pathfinder 1/18/2013 | 12:44:54 AM
Re: Startups gain with the cloud Before, enormous financial commitments were required to acquire the hardware needed to power those huge IT requirements. Today, blades with processors, memory, and storage -- almost everything we need -- fit in racks accessible via the cloud, and you pay only for what you use. That is quite good, because you don't need to install huge, expensive IT infrastructure yourself.
User Rank: Petabyte Pathfinder 1/18/2013 | 12:35:28 AM
Re: Startups gain with the cloud The battle of the cloud has morphed into a much greater one; it is also about the battle of cloud prices. More and more companies are making great advances in their cloud offerings. Amazon is the tech company making the most buzz, with a head start in offering cloud services at a very low price. They have revolutionized the world of computing.
Startups gain with the cloud My, how times have changed since the dot-com bubble, circa 2000. I recall seeing start-up business plans in those days where a significant amount of the venture funding was spent on building out IT infrastructure, i.e., physical data centers (often bi-coastal and bi-continental). Instead of hundreds of thousands of dollars per year, we are now talking about the modest price ranges in the excellent overview by @Christian. Cloud costing means that many more new ideas can be explored, albeit with both success and failure.
Re: The crystal ball You could, for example, compile your own JARs or upload Python scripts and execute them with EMR jobflows, which would be very low impact and free of dependencies. The data could be parked on and pulled from S3 for the experiments. Considering how scalable the approach is, I would call it a very low-effort option with no unnecessary dependencies.
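The workflow described in this reply -- scripts and data on S3, executed as steps on a transient EMR cluster -- can be sketched roughly as below. This uses today's boto3 API (at the time of the original discussion, boto v2 filled this role), and every bucket name, script path, and instance type is a hypothetical placeholder, not something from the post.

```python
def make_streaming_step(name, mapper, reducer, input_uri, output_uri):
    """Build an EMR Hadoop-streaming step that runs Python scripts stored
    on S3. All S3 URIs here are hypothetical placeholders."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", f"{mapper},{reducer}",
                "-mapper", mapper.rsplit("/", 1)[-1],   # e.g. mapper.py
                "-reducer", reducer.rsplit("/", 1)[-1],
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }

step = make_streaming_step(
    "word-count",
    mapper="s3://my-bucket/scripts/mapper.py",
    reducer="s3://my-bucket/scripts/reducer.py",
    input_uri="s3://my-bucket/input/",
    output_uri="s3://my-bucket/output/",
)

# Launching a transient cluster that runs the step and shuts itself down
# would then look roughly like this (requires AWS credentials):
#
# import boto3
# emr = boto3.client("emr", region_name="eu-west-1")
# emr.run_job_flow(
#     Name="transient-analysis",
#     ReleaseLabel="emr-6.15.0",
#     Instances={"MasterInstanceType": "m5.xlarge",
#                "SlaveInstanceType": "m5.xlarge",
#                "InstanceCount": 4,
#                "KeepJobFlowAliveWhenNoSteps": False},
#     Steps=[step],
#     JobFlowRole="EMR_EC2_DefaultRole",
#     ServiceRole="EMR_DefaultRole",
# )
```

Because the input lives on S3 and the cluster terminates after the step, you pay only for the minutes the job actually runs, which is what makes this attractive for pilot studies.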
User Rank: Exabyte Executive 1/16/2013 | 9:52:06 AM
Re: The crystal ball From an algorithm-prototyping perspective, the majority of analysts prefer the least amount of infrastructure setup and development work. I am just wondering whether prototyping can be done relatively easily with these vendors' services, without worrying about Big Data infrastructure?
Re: The crystal ball That would be interesting. There is a reason why Amazon is so prominent in the space, though. They are the biggest provider and also have a very extensive and growing set of services. EC2 is only the tip of the iceberg. Once you start using SQS, RDS, EMR, and other services, you don't want to miss them.
User Rank: Exabyte Executive 1/15/2013 | 4:43:55 PM
Re: The crystal ball I am wondering, for running a pilot study of a big data analytics algorithm, what options other than EC2 are available under a budget constraint. That might help others understand when to run a pilot study and how far it can go before hitting computational limits. The cost of setting up the infrastructure would then help a business understand the right time to set it up.