In this tutorial, we’ll explain how to install Apache Spark on Ubuntu. Apache Spark is an open-source unified analytics engine for large-scale data processing.
Steps to Install Apache Spark on Ubuntu 20.10/20.04/18.04
Step 1: Update Ubuntu system.
It is recommended to update the system before installation of Apache Spark.
sudo apt update
Step 2: Install Java on Ubuntu
Java is a prerequisite for Apache Spark. For now, we’ll install the default JDK from the Ubuntu repositories, along with curl and mlocate, which are used later in this tutorial.
sudo apt install curl mlocate default-jdk -y
Step 3: Verify Java version
$ java -version
root@Apache-Spark:~# java -version
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.10)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.10, mixed mode, sharing)
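If you would rather check the requirement from a script than by eye, the sketch below parses the major version number out of `java -version` (Spark 3.x runs on Java 8 or 11; the parsing logic is an illustrative assumption, not part of Spark itself):

```shell
# Extract the major Java version; handles both "1.8.0_292" and "11.0.11" formats.
JAVA_MAJOR=$(java -version 2>&1 | awk -F'"' '/version/ {split($2, v, "."); print ((v[1] == "1") ? v[2] : v[1])}')
if [ "${JAVA_MAJOR:-0}" -ge 8 ]; then
  echo "Java ${JAVA_MAJOR} is recent enough for Spark"
else
  echo "Install Java 8 or newer before continuing"
fi
```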
Step 4: Download Apache Spark on Ubuntu
Check the Apache Spark downloads page for the latest version; this tutorial uses 3.1.1.
curl -O https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
Extract Spark tarball
tar xvf spark-3.1.1-bin-hadoop3.2.tgz
Move spark directory to /opt
sudo mv spark-3.1.1-bin-hadoop3.2/ /opt/spark
Configure Spark environment
vim ~/.bashrc
Add the lines below:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Activate the changes in the current shell:
source ~/.bashrc
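A quick way to confirm the variables took effect (the /opt/spark path is the one used in the move step above):

```shell
# Re-state the two exports and verify them; safe to run in any shell.
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
echo "SPARK_HOME=$SPARK_HOME"
case ":$PATH:" in
  *":/opt/spark/bin:"*) echo "Spark bin directory is on PATH" ;;
  *) echo "Spark bin directory is missing from PATH" ;;
esac
```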
Step 5: Start a standalone master server
$ start-master.sh
Sample Output:
root@Apache-Spark:~# start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-Apache-Spark.out
Step 6: Verify the Spark master TCP port
$ sudo ss -tunelp | grep 8080
Sample Output:
root@Apache-Spark:~# sudo ss -tunelp | grep 8080
tcp   LISTEN  0  1  *:8080  *:*  users:(("java",pid=11444,fd=308)) ino:946073 sk:17 v6only:0 <->
Access the Web UI
Open http://localhost:8080/ (or http://127.0.0.1:8080) in your browser to view the Spark master web UI.
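On a headless server, the same check can be done from the terminal with curl (installed in Step 2); 8080 is the default master UI port from the previous step:

```shell
# Probe the master web UI; prints a friendly message either way.
if curl -sf http://localhost:8080/ >/dev/null; then
  echo "Spark master web UI is reachable"
else
  echo "Spark master web UI is not reachable yet"
fi
```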
Step 7: Start a Spark worker
$ start-workers.sh spark://localhost:7077
Note that start-workers.sh connects over SSH to every host listed in $SPARK_HOME/conf/workers (just localhost by default), which is why it may prompt for a password. To start a single local worker without SSH, use start-worker.sh spark://localhost:7077 instead.
Sample Output:
ubuntu@php:/opt$ start-workers.sh spark://localhost:7077 ubuntu@localhost's password: localhost: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-ubuntu-org.apache.spark.deploy.worker.Worker-1-Apache-Spark.out ubuntu@php:/opt$
If locate cannot find start-slave.sh (renamed start-worker.sh in Spark 3.x), update the mlocate database first:
$ sudo updatedb
$ locate start-worker.sh
Once the worker has started, go back to the browser and refresh the Spark master UI; the worker should appear in the Workers list.
How to access the Spark shell
$ /opt/spark/bin/spark-shell
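Beyond the interactive prompt, spark-shell can also evaluate a one-off expression in batch mode. This is a hedged sketch: the path assumes the /opt/spark install from Step 4, and the one-line job is purely illustrative.

```shell
SPARK_SHELL=/opt/spark/bin/spark-shell
if [ -x "$SPARK_SHELL" ]; then
  # sc is the SparkContext that spark-shell creates for you
  echo 'println("sum = " + sc.parallelize(1 to 100).sum())' | "$SPARK_SHELL"
else
  echo "spark-shell not found at $SPARK_SHELL; check the installation path"
fi
```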
How to access the PySpark shell
$ /opt/spark/bin/pyspark
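For non-interactive use, a PySpark script can be submitted with spark-submit instead of typing into the shell. The script below is a minimal, illustrative word count; the /tmp path and file name are arbitrary choices for this sketch.

```shell
# Write a tiny PySpark job to a temp file, then submit it if Spark is installed.
cat > /tmp/wordcount.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.parallelize(["apache spark on ubuntu", "apache spark"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
spark.stop()
EOF
if [ -x /opt/spark/bin/spark-submit ]; then
  /opt/spark/bin/spark-submit /tmp/wordcount.py
else
  echo "spark-submit not found; check the /opt/spark installation"
fi
```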
How to shut down the master and worker Spark processes
$ $SPARK_HOME/sbin/stop-worker.sh
$ $SPARK_HOME/sbin/stop-master.sh
Alternatively, $SPARK_HOME/sbin/stop-all.sh stops the master and all workers in one go.
That wraps up this article; we’ve seen how to install Apache Spark on Ubuntu.