Running an Apache Spark standalone cluster on Docker

For those familiar with Docker, it can be one of the simplest ways to run a Spark standalone cluster.

Here is a Dockerfile that builds an image with Spark 2.0 and Oracle's Server JRE 1.8 on Ubuntu 14.04. Build it with docker build -t spark-2 . so that the tag matches the image name used in the compose file further down:

FROM ubuntu:14.04

# Refresh the package index and install curl in a single layer
RUN apt-get update && apt-get install -y curl

# JAVA
ARG JAVA_ARCHIVE=http://download.oracle.com/otn-pub/java/jdk/8u102-b14/server-jre-8u102-linux-x64.tar.gz
ENV JAVA_HOME /usr/local/jdk1.8.0_102

ENV PATH $PATH:$JAVA_HOME/bin
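# Download the archive (the cookie accepts the Oracle license; -L follows redirects) and unpack it
RUN curl -sL --insecure \
  --header "Cookie: oraclelicense=accept-securebackup-cookie;" ${JAVA_ARCHIVE} \
  | tar -xz -C /usr/local/ && ln -s $JAVA_HOME /usr/local/java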

# SPARK
ARG SPARK_ARCHIVE=http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
ENV SPARK_HOME /usr/local/spark-2.0.0-bin-hadoop2.7

ENV PATH $PATH:${SPARK_HOME}/bin
RUN curl -s ${SPARK_ARCHIVE} | tar -xz -C /usr/local/

WORKDIR $SPARK_HOME

With the image built, Docker Compose can be used to run multiple containers at the same time. One master and one worker are enough for a working cluster, and the compose file can be extended with additional workers. The original file is available from gettyimages; a slightly modified version is shown here:

spark-master:
    image: spark-2
    command: bin/spark-class org.apache.spark.deploy.master.Master -h spark-master
    hostname: spark-master
    environment:
      MASTER: spark://spark-master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: 127.0.0.1
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7006
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/spark-master:/conf
      - ./data:/tmp/data  

spark-worker-1:
    image: spark-2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    hostname: spark-worker-1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: 127.0.0.1
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
    links:
      - spark-master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 7016
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/spark-worker-1:/conf
      - ./data:/tmp/data
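As noted above, more workers can be added to the compose file. The following is a minimal sketch that appends a second worker, assuming it is run from the folder that holds docker-compose.yml. The service name spark-worker-2, the ports 8882 and 8082, and the conf/spark-worker-2 folder are assumptions chosen so that nothing collides with spark-worker-1 on the host; they are not part of the original setup.

```shell
# Folder for the new worker's configuration (assumed name).
mkdir -p conf/spark-worker-2

# Append a second worker definition to the existing compose file.
# Only the names and the host-facing ports differ from spark-worker-1.
cat >> docker-compose.yml <<'EOF'

spark-worker-2:
    image: spark-2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    hostname: spark-worker-2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: 127.0.0.1
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8882
      SPARK_WORKER_WEBUI_PORT: 8082
    links:
      - spark-master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 7016
      - 8882
    ports:
      - 8082:8082
    volumes:
      - ./conf/spark-worker-2:/conf
      - ./data:/tmp/data
EOF
```

The container-internal ports under expose can stay the same for every worker; only the ports published to the host need to be unique.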

The SPARK_PUBLIC_DNS variable is set to 127.0.0.1 (localhost), which works only on Linux. Mac and Windows users should replace it with, for example, the IP address of the virtual machine that hosts Docker. The conf folder contains one subfolder per container (spark-master and spark-worker-1), each holding a spark-defaults.conf file, and each subfolder is mounted into its container as the /conf directory.
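The folder layout that the compose file mounts can be prepared before the first start. This is a rough sketch, assuming it is run from the folder that holds docker-compose.yml. The helper name write_conf is hypothetical, and the single property written to each spark-defaults.conf is an illustrative assumption (the original configuration files are not shown here); the port values are simply picked from the ranges the compose file exposes.

```shell
# Hypothetical helper: creates conf/<name>/spark-defaults.conf.
#   $1 = conf subfolder name, $2 = assumed block manager port
write_conf() {
  mkdir -p "conf/$1"
  cat > "conf/$1/spark-defaults.conf" <<EOF
# Illustrative value only, not taken from the original setup.
spark.blockManager.port  $2
EOF
}

# One subfolder per container, matching the volumes in the compose file.
write_conf spark-master 7005
write_conf spark-worker-1 7015

# Shared data folder, mounted as /tmp/data in both containers.
mkdir -p data
```

Any other Spark property can be placed in these files in the same "name value" form.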

To start the Apache Spark standalone cluster, run:

docker-compose up

The command should be executed from the folder that contains the docker-compose.yml file.
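Once the containers are up, one way to check that the cluster accepts work is to run the SparkPi example bundled with the Spark distribution. The sketch below makes a few assumptions: the helper name submit_pi is hypothetical, the installed Compose version provides the exec subcommand, and the job picks up the master URL from the MASTER variable that the compose file sets on the spark-master container.

```shell
# Hypothetical helper: runs the bundled SparkPi example inside the
# running spark-master container.
#   $1 = number of partitions to compute with (defaults to 10)
submit_pi() {
  slices="${1:-10}"
  # The working directory inside the container is $SPARK_HOME,
  # so the relative path bin/run-example resolves there.
  docker-compose exec spark-master bin/run-example SparkPi "$slices"
}

# Usage, from the folder containing docker-compose.yml:
#   submit_pi 10
```

While the job runs, it should also appear in the master web UI, which the compose file publishes on port 8080.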
