What is Spark and Why?
Apache Spark is an open-source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop. In the past, I've written an intro on how to install Spark on GCE, and since then I've wanted to do a follow-up on the topic with a more real-world example of installing a cluster. Luckily for me, a reader of the blog did the work! So after getting his approval, I wanted to share his script with you.
Installing Spark Cluster
In order to install it you just need to:
1. Install gcutil and authenticate your project.
2. Open a terminal and get the git repository with the python script in it.
$ git clone https://github.com/sigmoidanalytics/spark_gce.git
$ cd spark_gce
$ python spark_gce.py
You will need to create a new project in the Google Developer Console before running 'spark_gce.py', and make sure to pass all the parameters.
Here is an example:
spark_gce.py project-name slaves slave-type master-type identity-file zone cluster-name
- project-name: One of the hardest things in software is choosing good names. Here, use the name of the Google Cloud project you created above.
- slaves: the number of slave machines in the cluster.
- slave-type: Instance type. For example: n1-standard-1
- master-type: Instance type for the master. Choose something powerful (e.g. n1-standard-1 or above).
- identity-file: Identity file used to authenticate. Once you authenticate with gcutil, it will typically be at ~/.ssh/google_compute_engine.
- zone: Specify the zone where you are going to launch the cluster. For example: us-central1-a
- cluster-name: Name the Spark cluster.
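Putting the parameters above together, a full invocation might look like the sketch below. All values are hypothetical placeholders (project name, machine types, key path, zone) — substitute your own. Echoing the command first is a cheap way to verify the argument order before anything is actually launched:

```shell
# Hypothetical placeholder values -- replace them with your own.
PROJECT=my-spark-project      # lowercase letters, digits, and hyphens only
SLAVES=2                      # number of slave machines
SLAVE_TYPE=n1-standard-1
MASTER_TYPE=n1-standard-2
IDENTITY_FILE=$HOME/.ssh/google_compute_engine
ZONE=us-central1-a
CLUSTER=spark-demo

# Print the command before running it, to double-check the argument order.
echo python spark_gce.py "$PROJECT" "$SLAVES" "$SLAVE_TYPE" \
     "$MASTER_TYPE" "$IDENTITY_FILE" "$ZONE" "$CLUSTER"
```

Once the printed line looks right, drop the `echo` and run it for real.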
That’s it.
Misc
Cool. Now that looks so simple 🙂
Easy to set up and install… But at the end of the day, you do need to invest time in asking the right questions in order to gain value from the answers.
Just curious to know, did you face any problems while/after setting up the cluster?
No.
But I didn’t try to push it on a big cluster (yet).
Ok, Cool.
I tried to install it following the instructions, but I am facing some errors. The errors look something like:
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
ERROR: (gcloud.compute.instances.create) Some requests did not succeed:
- Invalid value 'Test'. Values must match the following regular expression: '(?:(?:[-a-z0-9]{1,63}\.)*(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?):)?(?:[0-9]{1,19}|(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?))'
Could you help me to resolve this.
I am passing the following:
python spark_gce.py 'Test' '3' 'n1-standard-4' 'n1-standard-4' '/home/raghuram_mundru_meredith_com/.ssh/google_compute_engine' 'us-central1-c' 'spark-test'
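As a quick sanity check, the naming pattern from that error message can be tested locally before launching anything. The regular expression below is my reconstruction of the one in the error (the character classes got mangled in the comment); the gist is that GCE resource names must be lowercase letters, digits, and hyphens. The `check_name` helper is a hypothetical name for illustration:

```shell
# Reconstructed GCE resource-name pattern from the error message above
# (assumption: the mangled character classes were [a-z] and [-a-z0-9]).
PATTERN='^(([-a-z0-9]{1,63}\.)*([a-z]([-a-z0-9]{0,61}[a-z0-9])?):)?([0-9]{1,19}|([a-z]([-a-z0-9]{0,61}[a-z0-9])?))$'

# check_name is a hypothetical helper: prints whether a name matches.
check_name() {
    if echo "$1" | grep -Eq "$PATTERN"; then
        echo "$1: valid"
    else
        echo "$1: INVALID"
    fi
}

check_name Test        # uppercase 'T' is rejected by the pattern
check_name spark-test  # all lowercase with a hyphen passes
```

Running this against your own argument values first can save a round-trip to the API.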