What is Google Compute Engine?
Compute Engine is an infrastructure-as-a-service offering that lets you run large-scale computing workloads on virtual machines hosted on Google’s infrastructure. By the way, if you want a new machine up and running in less than 5 minutes, it can be done in 5 easy steps.
What is Spark?
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
To get the best of both worlds, we can take advantage of the large-scale cloud options that Compute Engine offers and install Spark on it. Here are the steps you will need to follow.
Installation steps
- Create a CentOS image on GCE
  - “A journey of a thousand miles begins with a single step,” and in our case this is the first one.
  - (!) Important: use a machine with at least 3.8 GB of memory, because Spark won’t compile with less.
- ssh to your new machine. For example:
  gcutil --service_version="xxx1" --project="spark-testing-123" ssh --zone="europe-west1-a" "spark-box-3g"
- Install Java:
  sudo yum install java-1.7.0-openjdk-devel
  - Make sure you install the ‘devel’ package, so you get the full SDK. You can see which packages are available with:
    yum search java | grep 'java-'
  - You might want 1.7 or 1.6, based on your other requirements.
  - Another option is to make sure you install Python, Scala and Java.
- Install Git:
  sudo yum install git
- wget one of the packages from the download page, then unpack it and cd into the Spark directory.
- Run sbt/sbt assembly
- Run sbt/sbt package
- You are good to go! Try some of the examples under the Spark directory:
  - Run ./spark-shell
  - Run one of the examples under examples/
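The checks and build steps above can be sketched as a few shell helpers. This is only a rough sketch, not part of the original instructions: the function names (`mem_total_kb`, `has_enough_memory`, `missing_tools`, `run_step`, `build_spark`) and the `DRY_RUN` variable are made up for illustration, and it assumes a Linux box (it reads /proc/meminfo) with the build run from inside the unpacked Spark source directory.

```shell
#!/bin/sh
# Sketch only: helper functions for the installation steps above.
# All names here are hypothetical; adapt them to your own setup.

# 3.8 GB (the minimum the post recommends) expressed in kB.
# The kernel usually reports a bit less than the nominal machine size,
# so treat a near miss as a warning rather than a hard failure.
MIN_MEM_KB=$((38 * 1024 * 1024 / 10))

# Prints MemTotal in kB from a meminfo-style file (default: /proc/meminfo).
mem_total_kb() {
  awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Succeeds when $1 (kB) is at least $2 (kB, default: MIN_MEM_KB).
has_enough_memory() {
  [ "$1" -ge "${2:-$MIN_MEM_KB}" ]
}

# Prints each named tool that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Runs the given command, or only prints it when DRY_RUN is set to 1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# The two build commands from the steps above, stopping at the first failure.
# Assumes the current directory is the unpacked Spark source tree.
build_spark() {
  run_step sbt/sbt assembly && run_step sbt/sbt package
}
```

For example, `missing_tools java javac git` prints whichever of those are not installed yet, `has_enough_memory "$(mem_total_kb)"` tells you whether the box is big enough, and calling `build_spark` with `DRY_RUN=1` set prints the two sbt commands without running them.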