What is Spark and Why?
Apache Spark is an open-source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop. In the past, I've written an intro on how to install Spark on GCE, and since then I've wanted to do a follow-up on the topic with a more real-world example of installing a cluster. Luckily for me, a reader of the blog did the work! So after getting his approval, I wanted to share his script with you.
Installing Spark Cluster
In order to install it you just need to:
1. Install gcutil and authenticate your project.
2. Open a terminal and get the git repository with the python script in it.
$ git clone https://github.com/sigmoidanalytics/spark_gce.git
$ cd spark_gce
$ python spark_gce.py
You will need to create a new project in the Google Developer Console before running 'spark_gce.py', and make sure to pass all the parameters.
Here is an example:
spark_gce.py project-name slaves slave-type master-type identity-file zone cluster-name
- project-name: One of the hardest things in software is choosing good names. Here, use the name of the Google Cloud project you created above.
- slaves: the number of slave machines in the cluster.
- slave-type: Instance type. For example: n1-standard-1
- master-type: Instance type for the master. Choose something powerful (e.g. n1-standard-1 or above).
- identity-file: Identity file used to authenticate. Once you authenticate with gcutil, it will typically be at ~/.ssh/google_compute_engine.
- zone: Specify the zone where you are going to launch the cluster. For example: us-central1-a
- cluster-name: Name the Spark cluster.
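Putting the parameters above together, a full invocation might look like the sketch below. All values are hypothetical placeholders (project name, machine types, key path, zone) — substitute your own. Echoing the command first is a cheap way to verify the argument order before anything is actually launched:

```shell
# Hypothetical placeholder values -- replace them with your own.
PROJECT=my-spark-project      # lowercase letters, digits, and hyphens only
SLAVES=2                      # number of slave machines
SLAVE_TYPE=n1-standard-1
MASTER_TYPE=n1-standard-2
IDENTITY_FILE=$HOME/.ssh/google_compute_engine
ZONE=us-central1-a
CLUSTER=spark-demo

# Print the command before running it, to double-check the argument order.
echo python spark_gce.py "$PROJECT" "$SLAVES" "$SLAVE_TYPE" \
     "$MASTER_TYPE" "$IDENTITY_FILE" "$ZONE" "$CLUSTER"
```

Once the printed line looks right, drop the `echo` and run it for real.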
That’s it.
Misc
Cool. Now that looks so simple 🙂
Easy to set up and install… But at the end of the day, you do need to invest time in asking the right questions in order to gain value from the answers.
Just curious to know, did you face any problems while/after setting up the cluster?
No.
But I didn’t try to push it on a big cluster (yet).
Ok, Cool.
I tried to install it following the instructions, but I am facing some errors. The errors look something like:
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
ERROR: (gcloud.compute.instances.create) Some requests did not succeed:
- Invalid value 'Test'. Values must match the following regular expression: '(?:(?:[-a-z0-9]{1,63}\.)*(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?):)?(?:[0-9]{1,19}|(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?))'
Could you help me to resolve this.
I am passing the following:
python spark_gce.py 'Test' '3' 'n1-standard-4' 'n1-standard-4' '/home/raghuram_mundru_meredith_com/.ssh/google_compute_engine' 'us-central1-c' 'spark-test'
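As a quick sanity check, the naming pattern from that error message can be tested locally before launching anything. The regular expression below is my reconstruction of the one in the error (the character classes got mangled in the comment); the gist is that GCE resource names must be lowercase letters, digits, and hyphens. The `check_name` helper is a hypothetical name for illustration:

```shell
# Reconstructed GCE resource-name pattern from the error message above
# (assumption: the mangled character classes were [a-z] and [-a-z0-9]).
PATTERN='^(([-a-z0-9]{1,63}\.)*([a-z]([-a-z0-9]{0,61}[a-z0-9])?):)?([0-9]{1,19}|([a-z]([-a-z0-9]{0,61}[a-z0-9])?))$'

# check_name is a hypothetical helper: prints whether a name matches.
check_name() {
    if echo "$1" | grep -Eq "$PATTERN"; then
        echo "$1: valid"
    else
        echo "$1: INVALID"
    fi
}

check_name Test        # uppercase 'T' is rejected by the pattern
check_name spark-test  # all lowercase with a hyphen passes
```

Running this against your own argument values first can save a round-trip to the API.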