download.sh

This module downloads Spark and its dependencies. It ships with the xml2er, json2er and merge2er modules.

This module is unnecessary if the environment already has Spark installed, for example:

AWS EMR
MapR distributions
HortonWorks distributions
Cloudera distributions
Custom installations
Docker-based containers and Kubernetes
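
Whether the module is needed can be checked before running it. The sketch below is one possible check, not part of Flexter: it assumes that an existing installation either sets SPARK_HOME or exposes spark-submit on the PATH, which is the usual Spark convention but may not hold in every environment listed above. The function name spark_already_available is hypothetical.

```shell
# Return 0 if a Spark installation appears to be present, 1 otherwise.
# Assumption: Spark is reachable through SPARK_HOME or through the PATH.
spark_already_available() {
  if [ -n "${SPARK_HOME:-}" ] && [ -x "$SPARK_HOME/bin/spark-submit" ]; then
    return 0
  fi
  command -v spark-submit >/dev/null 2>&1
}

# Possible use:
#   if spark_already_available; then
#     echo "Spark found, download.sh is not required"
#   fi
```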

Parameters

To list all parameters supported by download.sh, run:

$ /usr/share/flexter/sbin/download.sh -h

or 

$ ~/.local/share/flexter/sbin/download.sh -h

The command prints all supported parameters:

Usage: /usr/share/flexter/sbin/download.sh [OPTIONS] | <spark|hadoop|dep> [OPTIONS]

  -h                      print this message.

  -f                      Force switch/installation.

  -r                      Force re-installation.

  -c                      CI/CD mode, preventing progress messages.

  -v <SPARK_VERSION>      Spark version
                          default: 3.1.2

  -b [hadoop]<SPARK_BIN>  Spark binary package option.
                          Ex: hadoop3.2, 3.2, hadoop3, 3 or without-hadoop
                          It also accepts to choose an specific hadoop version, which is downloaded
                          automatically with spark's without-hadoop package.
                          Ex: hadoop3.4.4, 3.4.4, hadoop3.2.0 or 3.2.0
                          default: hadoop3.2

  -H <HADOOP_VERSION>     Hadoop binary package option.
                          default: 3.2.0

  -w                      Spark binary package without-hadoop.

  -V                      Download hive dependencies for spark without-hadoop.

  -a                      Download AWS dependencies (s3).

  -z                      Download Azure dependencies (blob storage/datalake).

  -s                      Download Snowflake dependencies (spark/jdbc).

  -S <VERSION>            Spark Snowflake dependency version.

  -g                      Download Google Cloud dependencies (storage/bigquery)

  -G <VERSION>            Google Cloud Storage dependency version.

  -B <VERSION>            Google Cloud Big Query dependency version.

  -p <GROUP:NAME:VERSION> Download custom package, comma separated. Same as "spark-submit --packages".

  -R <REPOSITORY_URL>     Extra repositories for custom packages, comma separated.

  SUB-COMMANDS

  spark [OPTIONS]         Download only spark package.

  hadoop [OPTIONS]        Download only hadoop package.

  dep [OPTIONS]           Download only spark dependencies.
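
Because the script may live in either of the two locations shown above, a small helper can pick whichever one exists. This is a minimal sketch, assuming only those two paths; the function name find_download_sh is hypothetical and not part of Flexter.

```shell
# Print the first executable download.sh among the candidate paths.
# With no arguments, the two locations mentioned on this page are checked.
find_download_sh() {
  if [ "$#" -eq 0 ]; then
    set -- "/usr/share/flexter/sbin/download.sh" \
           "${HOME:-}/.local/share/flexter/sbin/download.sh"
  fi
  for candidate in "$@"; do
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Possible use:
#   script="$(find_download_sh)" && "$script" -h
```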

Examples

Ways to enforce an installation

Install the default Spark version if no other version is already installed. If an older version or an external installation exists, the script does nothing.

$ /usr/share/flexter/sbin/download.sh

Force installation of the default Spark version, even if another version is installed. If that version is already installed, the script switches the /usr/share/flexter/spark/default link to it.

$ /usr/share/flexter/sbin/download.sh -f

Force a reinstallation, downloading the packages again, even if the same or another version is already installed.

$ /usr/share/flexter/sbin/download.sh -r
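
After a forced install or switch, it can help to confirm where the default link points. The sketch below assumes that /usr/share/flexter/spark/default is a symbolic link, as the description of -f suggests; the function name spark_default_target is hypothetical.

```shell
# Print the target of the default Spark link.
# Assumption: the link is a symbolic link; the path can be overridden.
spark_default_target() {
  link="${1:-/usr/share/flexter/spark/default}"
  if [ -L "$link" ]; then
    readlink "$link"
  else
    echo "no symbolic link at $link" >&2
    return 1
  fi
}

# Possible use, after running download.sh -f:
#   spark_default_target
```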

Choosing a particular Spark version

In some cases a Spark release includes patches you need, or the server environment demands a particular Spark version for other reasons.

In that case, pass the version with -v:

$ /usr/share/flexter/sbin/download.sh -v 3.1.2

Choosing a particular Spark binary package

Each Spark version is published in several builds. For example, version 3.1.x offers:

hadoop3.2: Embedded Hadoop 3.2.x dependencies (the default).
hadoop2.7: Embedded Hadoop 2.7.x dependencies.
without-hadoop: Spark comes without Hadoop and Hive. The script then downloads the default Hadoop version and the Hive dependencies.
hadoop<MAJOR>.<MINOR>.<PATCH>: Spark comes without Hadoop and Hive. The script then downloads the given Hadoop version and the Hive dependencies.
<MAJOR>.<MINOR>.<PATCH>: Same as above.

To select a build, pass it with -b:

$ /usr/share/flexter/sbin/download.sh -b hadoop3.2

If you choose the without-hadoop option, the script also downloads a separate Hadoop package and the Hive dependencies.

The default approach is shown below:

$ /usr/share/flexter/sbin/download.sh -b without-hadoop

or

$ /usr/share/flexter/sbin/download.sh -w

You can also specify a particular Hadoop version to download together with Spark (without-hadoop package) and the Hive dependencies:

$ /usr/share/flexter/sbin/download.sh -b hadoop3.2.0

or

$ /usr/share/flexter/sbin/download.sh -b 3.2.0

or

$ /usr/share/flexter/sbin/download.sh -w -H 3.2.0
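
The accepted -b values can be summarised as a small classifier. This is a sketch of how the help text and the list above read, not the script's actual parsing logic; the function name classify_spark_bin and its messages are hypothetical.

```shell
# Describe what a -b value selects, following the help text:
#   [hadoop]MAJOR or [hadoop]MAJOR.MINOR -> Spark build with embedded Hadoop
#   [hadoop]MAJOR.MINOR.PATCH            -> without-hadoop build plus that Hadoop
#   without-hadoop                       -> without-hadoop build plus default Hadoop
classify_spark_bin() {
  value="${1#hadoop}"   # the "hadoop" prefix is optional
  case "$value" in
    without-hadoop) echo "without-hadoop, default Hadoop version" ;;
    *[!0-9.]*|""|.*|*.|*..*) echo "unrecognised"; return 1 ;;
    *.*.*.*) echo "unrecognised"; return 1 ;;
    *.*.*) echo "without-hadoop, Hadoop $value" ;;
    *) echo "embedded Hadoop $value" ;;
  esac
}

# classify_spark_bin hadoop3.2       prints: embedded Hadoop 3.2
# classify_spark_bin 3.4.4           prints: without-hadoop, Hadoop 3.4.4
# classify_spark_bin without-hadoop  prints: without-hadoop, default Hadoop version
```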

Downloading dependencies

Some dependencies, such as the AWS and Azure ones, already come with the separate Hadoop package. Spark packages with embedded Hadoop do not include them.

The without-hadoop Spark package, in turn, contains no Hive dependencies.

download.sh can also install other external dependencies that don’t come with Spark or Hadoop.

Hive dependencies

They are downloaded automatically if you chose to install the Spark package without Hadoop. You can still download them again using:

$ /usr/share/flexter/sbin/download.sh -V

hadoop-aws dependencies

They come with the Hadoop package when it is downloaded separately from Spark. If you need them without a separate Hadoop package, run:

$ /usr/share/flexter/sbin/download.sh -a

hadoop-azure dependencies

They come with the Hadoop package when it is downloaded separately from Spark. If you need them without a separate Hadoop package, run:

$ /usr/share/flexter/sbin/download.sh -z

Google Cloud dependencies

The latest Google Cloud Storage and BigQuery dependencies can be downloaded with the command below:

$ /usr/share/flexter/sbin/download.sh -g

If a particular Google Cloud Storage connector version is required, specify it with -G:

$ /usr/share/flexter/sbin/download.sh -G <GCS_CONNECTOR_HADOOP_VERSION>

Likewise, if a particular Google Cloud BigQuery connector version is required, specify it with -B:

$ /usr/share/flexter/sbin/download.sh -B <SPARK_BIGQUERY_VERSION>

Snowflake dependencies

The latest versions of the Snowflake Spark and JDBC connectors can be downloaded with the command below:

$ /usr/share/flexter/sbin/download.sh -s

If a particular version is required, specify the Spark Snowflake connector version with -S:

$ /usr/share/flexter/sbin/download.sh -S <SPARK-SNOWFLAKE_VERSION>

* The JDBC driver that matches the Spark connector is downloaded as well.
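
The dependency switches described so far can be collected from readable names. The sketch below only assembles the flags; it assumes that several switches may be combined in one call, by analogy with the -w -H 3.2.0 example earlier, which this page does not confirm for every combination. The function name dependency_flags and the group names are hypothetical.

```shell
# Map dependency group names to the download.sh switches from the help text.
dependency_flags() {
  flags=""
  for name in "$@"; do
    case "$name" in
      hive)      flag="-V" ;;
      aws)       flag="-a" ;;
      azure)     flag="-z" ;;
      gcloud)    flag="-g" ;;
      snowflake) flag="-s" ;;
      *) echo "unknown dependency group: $name" >&2; return 1 ;;
    esac
    flags="${flags:+$flags }$flag"
  done
  printf '%s\n' "$flags"
}

# dependency_flags aws azure   prints: -a -z
# Possible use (unquoted on purpose so the flags split into words):
#   /usr/share/flexter/sbin/download.sh $(dependency_flags aws snowflake)
```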

Custom dependencies

The same packages accepted by spark-submit --packages can be downloaded and installed permanently with the command below:

$ /usr/share/flexter/sbin/download.sh -p <GROUP:NAME:VERSION>
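
According to the help text, -p takes a comma-separated list of coordinates. The sketch below joins several coordinates into that form and rejects values that do not look like GROUP:NAME:VERSION. The function name join_packages is hypothetical, and the coordinates in the usage comment are placeholders for illustration only.

```shell
# Join GROUP:NAME:VERSION coordinates with commas, as expected by -p.
join_packages() {
  joined=""
  for coordinate in "$@"; do
    case "$coordinate" in
      *:*:*) ;;
      *) echo "not GROUP:NAME:VERSION: $coordinate" >&2; return 1 ;;
    esac
    joined="${joined:+$joined,}$coordinate"
  done
  printf '%s\n' "$joined"
}

# join_packages org.example:first:1.0 org.example:second:2.0
#   prints: org.example:first:1.0,org.example:second:2.0
# Possible use:
#   /usr/share/flexter/sbin/download.sh -p "$(join_packages org.example:first:1.0 org.example:second:2.0)"
```

Packages hosted outside the default repositories would additionally need -R with the repository URL, as listed in the help output.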