Over the last few weeks I have done a lot of work with Apache Spark. You might be asking why I, as a relational database expert, concentrate on a Big Data technology. The answer is quite simple: beginning with SQL Server 2019, you have a complete Apache Spark integration in SQL Server, through a feature called SQL Server Big Data Clusters. Therefore, it is mandatory for me to understand how Apache Spark works, and how you can troubleshoot performance problems in this technology stack :-).
To be honest, Apache Spark is not that different from a relational database engine like SQL Server: there are Execution Plans, and there is even a Query Optimizer. One of the biggest differences is that a query against Apache Spark is distributed across multiple worker nodes in an Apache Hadoop Cluster. Therefore, you have true parallelism across multiple physical machines. This is a huge difference from a traditional relational database engine.
And you can code against Apache Spark in different programming languages like R, Scala, Python, and even SQL!
Apache Spark & Docker
If you want to get familiar with Apache Spark, you need to have an installation of Apache Spark. There are different approaches: you can deploy a whole SQL Server Big Data Cluster within minutes in Azure Kubernetes Service (AKS). My problem with that approach is twofold: first of all, I have to pay continuously for the deployed SQL Server Big Data Cluster in AKS just to get familiar with Apache Spark.
Second, a shutdown of worker nodes can damage your whole Big Data Cluster deployment, because AKS still has a problem with reattaching Persistent Volumes to different worker nodes during startup. Therefore, you have to leave your SQL Server Big Data Cluster up and running all the time, which costs a lot of money over time.
Therefore, I did some research into whether it is possible to run Apache Spark in a Docker environment. I already have a “big” Apple iMac with 40 GB RAM and a powerful processor, which acts as my local Docker host in my Home Lab. Running Apache Spark in Docker is possible (otherwise I wouldn’t write this blog posting), but I had a few requirements for this approach:
- The Worker Nodes of Apache Spark should be deployed directly onto the Apache HDFS Data Nodes. That way, an Apache Spark worker can access its own HDFS data partitions, which provides the benefit of Data Locality for Apache Spark queries.
- I want to scale the Apache Spark Worker and HDFS Data Nodes up and down easily.
- The whole Apache Spark environment should be deployed as easily as possible with Docker.
These are not particularly difficult requirements, but it was not that easy to achieve them. One of the biggest problems is that there are almost no Docker examples where the Apache Spark Worker Nodes are deployed directly onto the Apache HDFS Data Nodes. Everything that I found consisted of a separate Apache HDFS Cluster and a separate Apache Spark Cluster. This is not really a good deployment model, because you have no data locality, which hurts the performance of Apache Spark Jobs.
More or less, I found the following two blog postings, which demonstrate how to run an Apache HDFS Cluster and an Apache Spark Cluster in Docker, in separate deployments:
Therefore, my idea was to combine both deployments into a single deployment, so that I finally have one (!) combined Apache HDFS/Spark Cluster running in Docker. The only change that was needed was to deploy and start Apache Spark on the HDFS Name Node and the various HDFS Data Nodes. So, I took the GitHub repository from the first blog posting and modified it accordingly. The following modified Dockerfile shows how to accomplish that for the HDFS Name Node:
FROM bde2020/hadoop-base
MAINTAINER Ivan Ermilov

HEALTHCHECK CMD curl -f http://localhost:9870/ || exit 1

ENV HDFS_CONF_dfs_namenode_name_dir=file:///hadoop/dfs/name
RUN mkdir -p /hadoop/dfs/name
VOLUME /hadoop/dfs/name

RUN apt-get update
RUN apt-get install -y wget
RUN apt-get install -y tar
RUN apt-get install -y bash
RUN apt-get install -y python
RUN apt-get install -y python3

RUN wget http://apache.mirror.anlx.net/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN tar -xzf spark-2.4.4-bin-hadoop2.7.tgz
RUN mv spark-2.4.4-bin-hadoop2.7 /spark
RUN rm spark-2.4.4-bin-hadoop2.7.tgz

ADD run.sh /run.sh
RUN chmod a+x /run.sh

EXPOSE 9870

CMD ["/run.sh"]
As you can see, I’m just downloading Apache Spark 2.4.4 and extracting the downloaded archive to the folder /spark. And finally, I’m starting the Apache Spark Master Node in the run.sh file. The only trick here was to start both the HDFS Name Node and the Spark Master Node from the same script: the Name Node is started in the background, and the Spark Master Node then runs in the foreground.
#!/bin/bash

namedir=`echo $HDFS_CONF_dfs_namenode_name_dir | perl -pe 's#file://##'`
if [ ! -d $namedir ]; then
  echo "Namenode name directory not found: $namedir"
  exit 2
fi

if [ -z "$CLUSTER_NAME" ]; then
  echo "Cluster name not specified"
  exit 2
fi

if [ "`ls -A $namedir`" == "" ]; then
  echo "Formatting namenode name directory: $namedir"
  $HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR namenode -format $CLUSTER_NAME
fi

$HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR namenode > /dev/null 2>&1 &
/spark/bin/spark-class org.apache.spark.deploy.master.Master --ip namenode --port 7077 --webui-port 8080
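As a side note, the run.sh above calls perl only to strip the file:// prefix from the directory setting. A minimal sketch of the same step with plain bash parameter expansion could look like the following. The helper name strip_file_scheme is hypothetical and not part of the original script, and unlike the perl substitution it only removes the prefix at the very start of the value, which is all that is needed for the value set in the Dockerfile.

```shell
#!/bin/bash
# Sketch: derive the local directory from an HDFS "file://" setting
# without calling perl.

# strip_file_scheme is a hypothetical helper, not part of the original run.sh.
strip_file_scheme() {
  local uri="$1"
  # ${uri#file://} removes a leading "file://" if present, otherwise
  # it leaves the value unchanged.
  printf '%s\n' "${uri#file://}"
}

# Assumption: fall back to the value from the Name Node Dockerfile
# when the variable is not set in the environment.
HDFS_CONF_dfs_namenode_name_dir="${HDFS_CONF_dfs_namenode_name_dir:-file:///hadoop/dfs/name}"

namedir="$(strip_file_scheme "$HDFS_CONF_dfs_namenode_name_dir")"
echo "$namedir"
```

If the variable holds the value from the Dockerfile, namedir ends up as /hadoop/dfs/name, the same directory that the Dockerfile creates with mkdir.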
And here is the modified Dockerfile for the HDFS Data Node:
FROM bde2020/hadoop-base
MAINTAINER Ivan Ermilov

HEALTHCHECK CMD curl -f http://localhost:9864/ || exit 1

ENV HDFS_CONF_dfs_datanode_data_dir=file:///hadoop/dfs/data
RUN mkdir -p /hadoop/dfs/data
VOLUME /hadoop/dfs/data

ENV HADOOP_PREFIX=/opt/hadoop-$HADOOP_VERSION
ENV HADOOP_CONF_DIR=/etc/hadoop
ENV MULTIHOMED_NETWORK=1
ENV USER=root
ENV PATH $HADOOP_PREFIX/bin/:$PATH

RUN apt-get update
RUN apt-get install -y wget
RUN apt-get install -y tar
RUN apt-get install -y bash
RUN apt-get install -y python
RUN apt-get install -y python3

RUN wget http://apache.mirror.anlx.net/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN tar -xzf spark-2.4.4-bin-hadoop2.7.tgz
RUN mv spark-2.4.4-bin-hadoop2.7 /spark
RUN rm spark-2.4.4-bin-hadoop2.7.tgz

ADD run.sh /run.sh
RUN chmod a+x /run.sh

EXPOSE 9864

CMD ["/run.sh"]
And in the file run.sh I’m again starting two processes: the HDFS Data Node in the background and the Spark Worker Node in the foreground, which attaches itself to the Spark Master Node.
#!/bin/bash

datadir=`echo $HDFS_CONF_dfs_datanode_data_dir | perl -pe 's#file://##'`
if [ ! -d $datadir ]; then
  echo "Datanode data directory not found: $datadir"
  exit 2
fi

$HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR datanode > /dev/null 2>&1 &

export SPARK_WORKER_MEMORY=4G
/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8080 spark://namenode:7077
The only additional modification that I made here was to set the environment variable SPARK_WORKER_MEMORY to 4G, so that each Worker Node gets 4 GB of RAM. Without that tweak I got Out-of-Memory exceptions during the execution of Spark Jobs. To make this possible, I also assigned 24 GB of RAM to my Docker environment on the iMac.
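If you want to experiment with other memory sizes, one option is to keep 4G as the default but let the container environment override it. The following is only a sketch under that assumption, not what the run.sh above does. The helper max_workers is hypothetical, and its result is a rough upper bound, because it ignores the memory needed by the HDFS daemons, the Name Node container, and Docker itself.

```shell
#!/bin/bash
# Sketch: default to 4G, but honor SPARK_WORKER_MEMORY if the container
# environment already defines it (assumption: it would be set from outside,
# for example in the container definition).
export SPARK_WORKER_MEMORY="${SPARK_WORKER_MEMORY:-4G}"

# max_workers is a hypothetical helper: integer division of the memory
# assigned to Docker by the memory per worker, both in GB.
max_workers() {
  local total_gb="$1"
  local worker_gb="$2"
  echo $(( total_gb / worker_gb ))
}

echo "SPARK_WORKER_MEMORY=$SPARK_WORKER_MEMORY"
echo "Upper bound for 4 GB workers in 24 GB: $(max_workers 24 4)"
```

With 24 GB assigned to Docker and 4 GB per worker, the upper bound is 6 workers. In practice you will want to stay below that number.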
With these changes in place, you have a combined Apache HDFS/Spark Cluster. The cool thing is that you can now start up this environment with a simple command through docker-compose:
The idea of docker-compose is to manage multiple Docker containers instead of starting them up manually. docker-compose takes its input from the file docker-compose.yml, which you can see here:
It just describes which Docker containers should be started and which ports should be exposed to the Docker host. As you can see from the file, I have exposed several additional ports so that I can access everything from my iMac. This is one limitation of Docker on macOS: you cannot directly access a Docker container through its IP address from the Mac host…
As soon as everything is up and running, I can access the Apache HDFS Name Node through http://localhost:9870:
The Apache Spark Master Node is accessed through http://localhost:8080:
And if you have a running Apache Spark Job, it can be accessed through http://localhost:4040, which also gives you access to the Execution Plans.
If I want to have additional Worker Nodes, I just have to change the docker-compose.yml file. It's as easy as that.
Running Apache Spark in a Docker environment is not a big deal, but running the Spark Worker Nodes on the HDFS Data Nodes is a bit more involved. But as you have seen in this blog posting, it is possible. And in combination with docker-compose you can deploy and run an Apache Hadoop environment with a single command.
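To check quickly from the Docker host whether the three web UIs respond, you could use a small script like the following. It is a minimal sketch that assumes curl is installed on the host and that the ports 9870, 8080, and 4040 are published as described above. The helper names ui_url and check_ui are hypothetical. Keep in mind that port 4040 only answers while a Spark Job is running.

```shell
#!/bin/bash
# Sketch: probe the web UIs of the combined HDFS/Spark cluster.

# ui_url (hypothetical helper): build the URL for a published port.
ui_url() {
  printf 'http://localhost:%s/\n' "$1"
}

# check_ui (hypothetical helper): report whether a UI answers.
check_ui() {
  local name="$1"
  local port="$2"
  local url
  url="$(ui_url "$port")"
  # -f: treat HTTP errors as failures, -s: no progress output,
  # --max-time: give up after 5 seconds instead of hanging.
  if curl -fs --max-time 5 -o /dev/null "$url"; then
    echo "$name: reachable at $url"
  else
    echo "$name: not reachable at $url"
  fi
}

check_ui "HDFS Name Node" 9870
check_ui "Spark Master Node" 8080
check_ui "Spark Job UI" 4040
```

The -f flag is the same one that the HEALTHCHECK instructions in the Dockerfiles use.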
Thanks for your time,