How to Install Spark on Google Compute Engine

What is Google Compute Engine?

Compute Engine is an infrastructure-as-a-service offering that lets you run large-scale computing workloads on virtual machines hosted on Google’s infrastructure. By the way, if you want a new machine up and running in less than 5 minutes, it can be done in 5 easy steps.

What is Spark?

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

To enjoy the best of both worlds, we can take advantage of the large-scale cloud options that Compute Engine offers and install Spark on it. Here are the steps you need to follow.

Installation steps

Create a CentOS image on GCE
1. “A journey of a thousand miles begins with a single step.” and in our case this is the first one.
2. (!) Important: use a machine with at least 3.8 GB of memory, because Spark won’t compile with less.
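
A quick way to confirm how much memory the new machine actually has is sketched below. It assumes a Linux guest that exposes /proc/meminfo; the helper names kb_to_gb and total_kb are made up for this example. Note that the kernel usually reports slightly less than the nominal machine size.

```shell
#!/bin/sh
# kb_to_gb KB: convert a kB figure to GB (1 GB = 1048576 kB), one decimal place.
kb_to_gb() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1048576 }'
}

# total_kb: read MemTotal (in kB) from /proc/meminfo.
total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Only report when /proc/meminfo is readable (assumption: Linux guest).
if [ -r /proc/meminfo ]; then
    echo "Total memory: $(kb_to_gb "$(total_kb)") GB (the build needs at least 3.8 GB)"
fi
```

If the reported number is well below 3.8, it is probably easier to recreate the machine with a bigger type than to fight the build.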

ssh to your new machine. For example:

 gcutil --service_version="xxx1" --project="spark-testing-123" ssh --zone="europe-west1-a" "spark-box-3g"

Install Java – sudo yum install java-1.7.0-openjdk-devel
1. Make sure you install the ‘devel’ package so you get the full SDK. You can see which packages are available with:
  
  yum search java | grep 'java-'
2. You might want 1.7 or 1.6, based on other requirements you might have.
3. Another option is to make sure you install Python, Scala and Java.
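
To double-check that the tools mentioned above ended up on the PATH, something like the following could be used. It is only a sketch: missing_tools is a hypothetical helper, and the list of tool names is an assumption you should adjust to what you actually installed.

```shell
#!/bin/sh
# missing_tools TOOL...: print the name of every tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed tool list; javac is only present when the 'devel' (full SDK) package is installed.
missing_tools java javac python scala
```

An empty output means everything was found. If javac is listed, the plain runtime package was probably installed instead of the ‘devel’ one.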

Install Git – sudo yum install git
wget one of the packages from the Spark download page

Run sbt/sbt assembly
Run sbt/sbt package
You are good to go! Try some of the examples under the spark directory.
1. ./spark-shell
2. Run one of the examples under examples/
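
The download and build steps above can be strung together roughly like this. It is a sketch under a few assumptions: the package URL is whatever you picked from the download page (it is passed in as an argument, not hard-coded), the package is a .tgz archive, and it unpacks into a directory named after the archive. The function names archive_dir and build_spark are made up for this example.

```shell
#!/bin/sh
# archive_dir URL: archive file name without the .tgz suffix.
# Assumption: the archive unpacks into a directory with this name.
archive_dir() {
    basename "$1" .tgz
}

# build_spark URL: download, unpack and build the package at URL.
# Each step runs only if the previous one succeeded.
# Note: this leaves the current shell inside the unpacked directory.
build_spark() {
    url="$1"
    wget "$url" &&
    tar -xzf "$(basename "$url")" &&
    cd "$(archive_dir "$url")" &&
    sbt/sbt assembly &&
    sbt/sbt package
}
```

Call it as build_spark followed by the URL of the package you chose. Chaining with && means a failed download or a failed compile stops the run at that point instead of carrying on with a half-built tree.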
