
New Data Stack Workshop: Building A Scalable Internet Datacenter

Here are a few notes I took during yesterday's conference at Stanford. The organizer (Accel) did a very good job of bringing together some of the leading minds in the emerging fields of 'Scalable' and 'Cloud'. Right… all of the speakers were 'Accel companies', but they are the ones pushing the technology forward, so we're cool with that.

The first company was NorthScale (their chief architect is the same guy we helped a year ago, with some real-world data from High Gear Media's servers). They have a very impressive open source project: Membase. What is Membase, you ask? Well, "Membase is an open-source, distributed, key-value database management system optimized for storing data behind interactive web applications. These applications must service many concurrent users; creating, storing, retrieving, aggregating, manipulating and presenting data in real-time. Supporting these requirements, Membase processes data operations with quasi-deterministic low latency and high sustained throughput."

In a nutshell:

  • Membase is like memcached, but better.
  • Simple: get/set works with any standard memcached client (see the sketch after this list).
  • Fast: RAM, SSD, and disk layers.
  • Predictable latency.
  • Availability: all nodes are created equal.
  • Replication between data centers.
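
Since Membase speaks the memcached wire protocol, any standard memcached client can talk to it. Here is a minimal sketch in Python using the python-memcached library; the host, port, and keys are placeholder assumptions, not anything NorthScale demonstrated:

```python
# Minimal sketch: talking to a Membase node through the memcached protocol.
# Assumes the python-memcached package and a node listening on the default
# memcached port (11211); the host and keys are illustrative only.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

# set() stores the value just like with memcached; Membase adds persistence
# and replication behind the same simple API.
mc.set("user:42:profile", {"name": "Ido", "visits": 1})

# get() reads it back; a miss returns None.
profile = mc.get("user:42:profile")
print(profile)
```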

The second company was Cloudera. They bring to the table a full Analytical Data Platform (ADP) stack.

  • It’s not only a map/reduce solution but a full stack for ‘watching the wheels move’ and understanding where you want to steer them.
  • BI is science for profit.
  • A view of the world / answering questions to make money.
  • Pose hypotheses and validate them.
  • A/B testing: feeding the business with real-world hypotheses and their verification.
  • HDFS and map/reduce (see the word-count sketch after this list).
  • Jim Gray: ‘the fourth paradigm’.
  • HBase: based on Google’s BigTable.
  • Hive: a SQL interface for Hadoop.
    What is Hadoop? Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
  • A good paper from Google: ‘The Unreasonable Effectiveness of Data’.
  • ‘Beautiful Data’: buy the book.
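
A couple of the bullets above are really about the MapReduce model itself, so here is a minimal word-count sketch in Python, in the Hadoop Streaming spirit. In a real cluster the mapper and reducer would be separate scripts and Hadoop would handle the shuffle/sort between them; this single-file version just simulates that flow:

```python
# Minimal word-count sketch of the MapReduce model behind Hadoop.
# The mapper emits (word, 1) pairs and the reducer sums them per word;
# here one script simulates Hadoop's shuffle/sort step with sorted().
import sys
from itertools import groupby

def mapper(lines):
    # Map step: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce step: group the pairs by word and sum the counts.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```

Try it locally with something like `echo "to be or not to be" | python wordcount.py` to see the map, sort, and reduce flow end to end.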

Last (for me) was Facebook. Here the story is very simple… scaling from 4-5M users in 2006 up to 400M now puts some real challenges on the development team.

  • Scaling to 400M users (efficiency takes a hit at the beginning).
  • Need all the data, all the time.
  • You / 100 friends / 100 objects per friend = tens of thousands of possible objects.
  • Web server / memcached / MySQL: make sure you can replicate boxes in each layer without any changes to your code (see the cache sketch after this list).
  • Testing, testing, and some more unit testing, A/B testing, system testing… you get the point.
  • Push a new version every week. Don’t let your software get old and stale.
  • Monitor EVERYTHING, and when there is a problem, always do whatever you can to understand the root cause. Yes, even after you’ve fixed it and everything is working again.
  • No single point of failure (although sometimes your software itself will be the single point of failure).
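
The web server / memcached / MySQL bullet above is the classic cache-aside read path. Here is a minimal sketch of it in Python; the memcached calls are the real python-memcached API, but `fetch_user_from_mysql()` is a hypothetical stand-in for the database layer, not Facebook's actual code:

```python
# Minimal sketch of the web server / memcached / MySQL read path:
# the cache-aside pattern. Try the cache first, fall back to the
# database on a miss, then repopulate the cache for the next reader.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])
CACHE_TTL = 300  # seconds; tune to how stale your app can tolerate

def fetch_user_from_mysql(user_id):
    # Hypothetical placeholder: SELECT ... FROM users WHERE id = %s
    # against one of the replicated MySQL boxes.
    raise NotImplementedError

def get_user(user_id):
    # Read-through: a miss on memcached triggers a database read,
    # and the result is cached so subsequent reads stay off MySQL.
    key = f"user:{user_id}"
    user = mc.get(key)
    if user is None:
        user = fetch_user_from_mysql(user_id)
        mc.set(key, user, time=CACHE_TTL)
    return user
```

Because every box in each layer is interchangeable, you can add web servers, memcached nodes, or MySQL replicas without touching code like this.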

Overall, it was a very productive three hours that will make us try a few new open-source projects. Good times.

