How We Built A Data Center With Commodity Hardware And FOSS
Who are we and what do we do?
We are a startup named QuickoLabs, based in Bangalore, India. Our product, SearchEnabler, is on-demand SEO software that crawls and analyzes a user’s website and provides recommendations to help improve its ranking in search engine results.
Our goal is to make SEO easy, affordable and measurable for start-ups and small businesses. To get there, we wanted to keep our operating costs to a minimum without compromising on product capability.
Our Data Needs
Today our infrastructure holds more than 8 TB of data collected from the web and processes nearly 250 GB of data every day. It holds more than 700 million unique URLs, and we have analyzed more than 35 million web pages. These numbers will grow quickly as our customer base increases.
Our infrastructure currently manages:
- 2 Application Servers
- 5 Cassandra Nodes
- 4 Task Trackers
- 9 Data Nodes
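To put the daily processing figure in perspective, here is a small Python sketch that converts 250 GB per day into an average sustained rate. It assumes decimal units (1 GB = 1000 MB) and an even spread over 24 hours. The function names are illustrative, and real load will be burstier than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_mb_per_second(gb_per_day: float) -> float:
    """Average rate in MB/s needed to process gb_per_day (decimal units)."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


def average_mbit_per_second(gb_per_day: float) -> float:
    """Same rate expressed in megabits per second (8 bits per byte)."""
    return average_mb_per_second(gb_per_day) * 8


if __name__ == "__main__":
    daily_gb = 250  # figure quoted in the article
    print(f"{average_mb_per_second(daily_gb):.2f} MB/s")
    print(f"{average_mbit_per_second(daily_gb):.2f} Mbit/s")
```

For the 250 GB figure this works out to roughly 2.9 MB/s, or about 23 Mbit/s, sustained around the clock.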
Cost-Benefit Analysis Of Having Our Own Infrastructure
We have to do a lot of web crawling and data processing to provide metrics and analytics to our customers. We need servers and web crawlers that run 24 x 7.
- Cost Factor – We explored cloud services such as Amazon EC2 and Microsoft Azure, and almost all of them charge based on compute cycles. Our web crawlers run every second of the day, which eats up a huge amount of compute cycles and results in higher costs.
The cost of third-party infrastructure increases linearly as we scale up, but it nearly stabilizes if we build and maintain our own data center.
- Building Capability – By working with our own infrastructure, we get to know and tackle the problems that come up. Typically, it is very hard to move a setup from third-party infrastructure to a private one. It will also be easier for us to scale when the need to expand our infrastructure arises.
How did we build our own data center?
We designed our data center for maximum availability, with redundancy in just about everything, so that if something goes down, availability is not affected.
- Servers Built Using Commodity Hardware – All our servers use desktop-grade components: an Intel Core i3 processor, 16 GB of RAM and 3 TB of storage space in each server. We use multiple hard disk drives, Ethernet cards, routers and switches in our hardware setup for maximum availability.
- Multiple UPS and Power Generators – Power cuts are frequent in India. We cannot afford to bring down our servers because of them, so we have set up multiple UPS units, plus a power generator in case the UPS units cannot last through a longer than expected power cut.
- Multiple Internet Leased Lines – Our servers and crawlers must be able to carry out their tasks 24×7. Even with a leased line, there is a small chance that the internet connection breaks down, so we have redundancy for that too: leased lines from two different service providers.
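The effect of this kind of redundancy can be estimated with simple probability. The Python sketch below assumes that redundant components fail independently and that a group stays up as long as any one member is up. The 99% availability figure is a made-up input for illustration, not a measured value, and the function names are hypothetical.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def combined_availability(availabilities):
    """Availability of a redundant group that is up if any member is up.

    Assumes the members fail independently of each other.
    """
    values = list(availabilities)
    if not values or any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("need at least one availability between 0 and 1")
    # The group is down only when every member is down at the same time.
    return 1.0 - prod(1.0 - a for a in values)


def expected_downtime_hours(availability):
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one leased line
    both = combined_availability([single, single])
    print(f"one line:  {expected_downtime_hours(single):.1f} h/year down")
    print(f"two lines: {expected_downtime_hours(both):.1f} h/year down")
```

Under these assumptions, a second independent line takes expected downtime from roughly 88 hours a year to under 1 hour. Shared failure causes, such as both providers using the same last-mile cable, would make the real gain smaller.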
Rack 1 in Data Center.
Rack 2 in Data Center.
2. Tools For Hardware Monitoring and Software Configuration
Automating installation, configuration and monitoring is crucial for reducing routine maintenance effort.
- Automated Installation and Provisioning – We use Mondo and Puppet to automate installation and configuration management of our systems.
Mondo is used for a bare-metal installation that readies the server for Puppet. Puppet then installs, configures and maintains the server according to its defined role.
- Monitoring and Alerts – We use Nagios, Munin and WinPower to monitor the infrastructure of our private cloud.
Nagios performs frequent checks on hosts and services and raises alerts via email, chat and SMS. We use a Bluetooth dongle paired with a discarded mobile phone to send the SMS notifications generated by Nagios.
- Data Backup – All the data on our servers is replicated across multiple hard disks. Even so, critical data is backed up using a separate system and external storage. We use rsync along with BackupPC to perform a full backup every week and an incremental backup every day. The weekly backup is transferred promptly to the external hard disk drive.
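As an illustration of the SMS alerting described above, here is a minimal Python sketch of a script that a Nagios notification command could call. It assumes the command passes the standard macros $NOTIFICATIONTYPE$, $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$ and $SERVICEOUTPUT$ as arguments. The script only composes the message and trims it to one 160-character SMS; handing the text to the phone over Bluetooth is left out, since that depends on the SMS tool in use. The function name and the sample host name are hypothetical.

```python
import sys

SMS_LIMIT = 160  # length of a single-part GSM text message


def build_sms(notification_type, host, service, state, output, limit=SMS_LIMIT):
    """Compose a one-line alert and trim it to fit a single SMS."""
    text = f"{notification_type}: {service} on {host} is {state} - {output}"
    # Plugin output may contain newlines or runs of spaces; collapse them.
    text = " ".join(text.split())
    if len(text) > limit:
        text = text[: limit - 3] + "..."
    return text


if __name__ == "__main__":
    # Expected call: script.py <type> <host> <service> <state> <plugin output>
    if len(sys.argv) == 6:
        print(build_sms(*sys.argv[1:6]))
```

Calling `build_sms("PROBLEM", "cass-03", "Disk", "CRITICAL", "92% used")` returns `PROBLEM: Disk on cass-03 is CRITICAL - 92% used`, which fits comfortably in one message.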
3. Setting Up Crawl Infrastructure
We use the following open source software to set up 24×7 crawling, distributed storage and processing.
- Hadoop HDFS – Apache Hadoop is an open source software framework that supports data-intensive distributed applications. Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
- Cassandra NoSQL Storage – Apache Cassandra is an open source distributed database management system. It is designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure.
- Hadoop Map-Reduce Tasks – MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. Hadoop MapReduce is used to do distributed computing on clusters of computers with HDFS.
- Zookeeper for co-ordination – Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems.
- Apache Nutch Search Engine – Nutch is an open source web search engine based on Lucene and Java for the search and index component.
- Text Processing Through Lucene – Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for applications that need full-text search, especially cross-platform ones.
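To illustrate the MapReduce model mentioned in the list above, here is a minimal single-process Python sketch that counts crawled URLs per host. It does not use the Hadoop API. The function names and sample URLs are illustrative, and on a real cluster the map and reduce steps would run as distributed tasks over data stored in HDFS.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_url(url):
    """Map step: emit a (host, 1) pair for one crawled URL."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
    if host:
        yield host, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(host, ones):
    """Reduce step: sum the values collected for one host."""
    return host, sum(ones)


def count_urls_per_host(urls):
    """Run map, shuffle and reduce in sequence on a list of URLs."""
    pairs = (pair for url in urls for pair in map_url(url))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "http://example.com/a",
        "http://example.com/b",
        "https://Example.org/",
    ]
    print(count_urls_per_host(sample))
```

The same split into independent map calls and per-key reduce calls is what lets a framework spread the work across many machines.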
- Total Infrastructure Cost – The whole setup cost us around $12K, which includes the price of servers, cooling and power backup components. We buy individual components from local vendors and assemble the servers ourselves, which helps us keep the per-server cost to $500. In terms of effort, it takes 2 to 3 hours to go from assembly to serving our users (1 hour to assemble, 30 minutes to restore the image via Mondo, 1 hour for service installation and configuration via Puppet).
For visual monitoring, we use an LED TV (which costs around $450) to display graphs of memory usage, IO status, disk usage and crawl status. For SMS alerts, we use a discarded mobile phone with a Bluetooth dongle (around $4).
We had an initial investment of $1800 for the power backup solutions.
- Monthly Infrastructure Operating Cost – The monthly cost of operating the whole infrastructure is about $1K, which includes the internet leased lines, electricity and space rental.
The two separate leased lines cost around $250 per month. For monitoring alerts, the phone uses a $1 quarterly SMS plan.
- Maintenance Operation Time – We spend no more than 30 minutes a day on monitoring and maintaining the infrastructure.
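The cost figures above can be turned into a rough break-even estimate. The Python sketch below uses the $12K setup cost and $1K monthly operating cost quoted in this article. The hosted monthly bill is a hypothetical input, modeled as a flat amount for simplicity, and the estimate ignores hardware replacement and staff time.

```python
import math

SETUP_COST_USD = 12_000    # one-time setup cost quoted in the article
MONTHLY_COST_USD = 1_000   # monthly operating cost quoted in the article


def own_datacenter_cost(months):
    """Cumulative cost of the self-built setup after a number of months."""
    return SETUP_COST_USD + MONTHLY_COST_USD * months


def break_even_months(hosted_monthly_usd):
    """First month at which owning costs no more than a flat hosted bill.

    Returns None when the hosted bill is not higher than our monthly
    operating cost, because the setup cost is then never recovered.
    """
    saving = hosted_monthly_usd - MONTHLY_COST_USD
    if saving <= 0:
        return None
    return math.ceil(SETUP_COST_USD / saving)


if __name__ == "__main__":
    hosted = 2_500  # hypothetical hosted bill in USD per month
    print(own_datacenter_cost(12))
    print(break_even_months(hosted))
```

For instance, against a hypothetical hosted bill of $2,500 per month, the $12K setup cost would be recovered after 8 months.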
In this way, we have been successful in building our own data center and cloud network using commodity hardware and keeping the initial as well as maintenance costs low.
Some of our well-wishers have mentioned that we could provide infrastructure services to other start-ups. We are glad to receive the compliments, but we feel it's just a start for us, and we will continue to focus on our objective. Once we reach a good scale, our infrastructure capability will also have matured, and that might be the right time to consider other opportunities. Meanwhile, we are happy to help other start-ups with what we have learned. Feel free to reach us.
We hope this article is helpful to other startups that want to build their own infrastructure while keeping costs down.
HackerNews discussion: http://news.ycombinator.com/item?id=4207439