When learning Spark and testing with small datasets, I can simply start a local Spark instance with the following command, which uses all available cores. This local instance has no separate workers; the driver handles all jobs and tasks.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
However, it is more interesting and useful to run a local Spark cluster with a couple of workers, so that the test script can be reused on a production Spark cluster with minimal or no code change at all.
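In practice, the only thing that has to change between the two setups is the master URL, which can even be supplied outside the script via spark-submit --master. A minimal sketch, assuming a standalone master at a placeholder address:

from pyspark.sql import SparkSession

# Same script, different master: "spark://192.168.1.10:7077" is a placeholder
# standalone master URL; use "local[*]" (or omit .master) for the single-machine run.
spark = (SparkSession.builder
         .appName("spark-lab-test")  # placeholder app name
         .master("spark://192.168.1.10:7077")
         .getOrCreate())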
The initial idea was to run a small Spark cluster on a single machine with VirtualBox. My machine is a Mac Mini, so the plan was to run the Spark driver on the Mac itself and two Ubuntu virtual machines as workers.
Downloading and installing VirtualBox is simple and not worth covering here, but a few notes on creating and setting up the virtual machines are worth sharing.
First, the network adapter should be set to ‘Bridged Adapter’ rather than the default NAT setting, where the virtual machine sits behind the host. I want the worker virtual machines to communicate with the host driver as peers, like separate machines on the same local network.
Second, don’t over-allocate memory to the virtual machines: the total memory assigned should stay below the physical memory of the host. Spark jobs are memory-hungry and will easily use up whatever is allocated. I tried over-allocating memory to the worker virtual machines, and the whole system crashed when the Spark application processed a few gigabytes of data.
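On the Spark side, the memory used per executor can also be capped explicitly so the JVM stays within whatever the VM was given. A minimal sketch, with an assumed 2g limit rather than my actual setting:

from pyspark.sql import SparkSession

# Cap executor memory so each worker JVM fits inside the VM's allocation
# (the 2g figure is an illustrative assumption).
spark = (SparkSession.builder
         .config("spark.executor.memory", "2g")
         .getOrCreate())
# Driver memory is better capped at launch time, e.g. spark-submit --driver-memory 1g,
# since it must be set before the driver JVM starts.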
Third, to save configuration effort and avoid copying data around, I put the Apache Spark installation and the data folder on an external HD connected to the host, and shared it with the virtual machines. This way I don’t need to copy and configure Apache Spark on each virtual machine; doing it once on the host covers the whole cluster. To make this work, the VirtualBox Guest Additions must be installed in each VM, since VirtualBox shared folders use the vboxsf file system type on Linux.
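Inside each Ubuntu guest, the shared folder then only needs to be mounted with the vboxsf file system; a sketch with placeholder share and mount-point names:

# inside the Ubuntu guest, after installing Guest Additions
sudo mkdir -p /mnt/spark-share
sudo mount -t vboxsf spark-share /mnt/spark-share  # "spark-share" is the VirtualBox share name (placeholder)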
Finally, running the Spark cluster is rather simple: create a spark-env.sh from the template in the <spark>/conf folder, uncomment the SPARK_MASTER_HOST line, and set it to the IP address of the Mac host running the Spark driver. I kept the defaults for all other settings.
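For reference, the only change in my spark-env.sh looks like this, with a placeholder LAN address standing in for the Mac host’s real IP:

# <spark>/conf/spark-env.sh
# Bind the standalone master to the host's LAN address so the workers can reach it
export SPARK_MASTER_HOST=192.168.1.10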
And that was just the start of the problems.
To test the setup, I wrote a simple Python script that runs over 7 days of Adobe Analytics Data Feed data, 21GB in total, computes a daily hits count and unique visitors count, and then combines the 7 days for an overall hits and unique visitors count.
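A minimal sketch of that script is below. The file paths, day folder names, and the assumption that the visitor ID is the first column are placeholders; the real Data Feed ships as headerless TSV with a separate column-headers file (the unique visitor is the post_visid_high/post_visid_low pair), so the actual script maps the columns first.

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("datafeed-counts").getOrCreate()

days = ["day1", "day2", "day3", "day4", "day5", "day6", "day7"]  # placeholder folder names
daily_frames = []
for day in days:
    df = (spark.read
          .option("sep", "\t")
          .csv("/data/datafeed/%s/hit_data.tsv" % day)   # placeholder path, headerless TSV
          .withColumnRenamed("_c0", "visitor_id"))       # assume the visitor ID is the first column
    daily_frames.append(df)
    # part 1: per-day hits and unique visitors
    print(day, df.count(), df.select("visitor_id").distinct().count())

# part 2: union the 7 days and count hits and unique visitors overall
all_days = reduce(DataFrame.unionByName, daily_frames)
print("overall", all_days.count(), all_days.select("visitor_id").distinct().count())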
While the local Spark instance finished in about half an hour, the same job took almost 2 hours with the 2-virtual-machine cluster. With such poor performance, the cluster setup was not useful at all.
To find a better configuration for the Spark cluster without going through too much engineering optimisation, I came up with 14 different configurations by adjusting the worker mode (real vs virtual), the number of workers (1 to 4), the number of cores per worker, and the data access method (HDFS, local drive, shared folder).
Config | Worker Mode | Number of Workers | Data Access |
config-01 | local | [*] | local drive |
config-02 | local | 1 | local drive |
config-03 | VM | 2 | shared folder |
config-04 | VM | 2 | local drive |
config-05 | real | 2 | local drive |
config-06 | local | (4 cores) | local drive |
config-07 | real | (4 cores) | local drive |
config-08 | local | [*] | hdfs |
config-09 | local | 1 | hdfs |
config-10 | VM | 2 | hdfs |
config-11 | real | 2 | hdfs |
config-12 | local | (4 cores) | hdfs |
config-13 | real | (4 cores) | hdfs |
config-14 | real | 4 | hdfs |
When running local workers, the Mac Mini is both the Spark master and the worker. When running 2 real workers, the Mac Mini is the Spark master and one of the workers, with a MacBook Pro as the other worker. When running 4 real workers, another Mac Mini and a Windows notebook serve as the two additional workers.
As it is a mixed environment of Mac and Windows, there is no configuration for 4 real workers with the local drive, because the local paths cannot be aligned across machines. Likewise, the shared folder option only applies to VM workers.
HDFS runs on the primary Mac Mini alongside the Spark master.
Each of the above configurations was run 5 times to get the average performance.
For the first part, reading 7 days of Adobe Analytics Data Feed data and counting per day, the average time for most configurations is less than 2 minutes. As previously noted, the VM workers did not perform well, and their performance was also inconsistent, with some outliers on the slow side.
For the second part, combining the 7 days of data and doing the overall counting, 2 real workers with data on the local drive outperformed all other configurations, while the VM workers with the shared folder were terrible.
With both parts added up, the configuration with 2 real workers and a local drive is the best. It is also worth noting that the difference between HDFS and the local drive for a single local worker is small and consistent. Moreover, looking at the HDFS option alone and comparing 1 vs 2 vs 4 real workers, the performance drop from 1 to 2 real workers is very likely due to network latency, since with a single worker the Spark worker and HDFS sit on the same machine. The increase in time is moderate but still acceptable.
In the end, my final Spark lab configuration runs the Spark master, a worker, and HDFS on the primary Mac Mini, with the option of adding more real workers by simply starting a Spark worker on an additional machine and pointing it at the Spark master.
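Bringing up the master and attaching an extra worker from another machine looks roughly like this; the IP is a placeholder, and on Spark versions before 3.0 the worker script is named start-slave.sh rather than start-worker.sh:

# on the primary Mac Mini (Spark master)
<spark>/sbin/start-master.sh
# on any additional machine joining as a worker (7077 is the default master port)
<spark>/sbin/start-worker.sh spark://192.168.1.10:7077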
As this test didn’t involve much computation, just reading the Adobe Analytics Data Feed and doing some simple counting, I hope that in future exercises requiring more computation, the processing power of multiple machines will outweigh the penalty of network latency.