Invoking SystemML in Spark Batch Mode
- Overview
- Spark Batch Mode Invocation Syntax
- Execution modes
- Recommended Spark Configuration Settings
- Examples
Overview
Given that a primary purpose of SystemML is to perform machine learning on large distributed data sets, one of the most important ways to invoke SystemML is Spark Batch mode. This guide looks at that mode in more depth.
NOTE: For a programmatic API to run and interact with SystemML via Scala or Python, please see the Spark MLContext Programming Guide.
Spark Batch Mode Invocation Syntax
SystemML can be invoked in Spark Batch mode using the following syntax:
spark-submit SystemML.jar [-? | -help | -f <filename>] (-config <config_filename>) ([-args | -nvargs] <args-list>)
The DML script to invoke is specified after the -f argument. Configuration settings can be passed to SystemML using the optional -config argument. DML scripts can optionally take named arguments (-nvargs) or positional arguments (-args). Named arguments are preferred; positional arguments are considered deprecated. All the primary algorithm scripts included with SystemML use named arguments.
Example #1: DML Invocation with Named Arguments
spark-submit SystemML.jar -f scripts/algorithms/Kmeans.dml -nvargs X=X.mtx k=5
Example #2: DML Invocation with Positional Arguments
spark-submit SystemML.jar -f src/test/scripts/applications/linear_regression/LinearRegression.dml -args "v" "y" 0.00000001 "w"
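Neither example above uses the optional -config argument. The following is a minimal sketch of how it fits into the syntax shown earlier, not a definitive invocation. It assumes SystemML.jar is in the current directory, and SystemML-config.xml is a hypothetical configuration file name used only for illustration. The script prints the command rather than running it, so it can be checked without a Spark installation.

```shell
#!/bin/sh
# Dry-run sketch: build the command as a string and print it.
# Assumptions: SystemML.jar is in the current directory;
# SystemML-config.xml is a hypothetical configuration file name.
SCRIPT="scripts/algorithms/Kmeans.dml"
CONFIG="SystemML-config.xml"

# Order follows the invocation syntax: -f, then -config, then -nvargs.
CMD="spark-submit SystemML.jar -f $SCRIPT -config $CONFIG -nvargs X=X.mtx k=5"

echo "$CMD"
```

To execute the command for real, type the printed line at the prompt once the paths point at actual files.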
Execution modes
SystemML works seamlessly with all Spark execution modes, including local (--master local[*]), yarn client (--master yarn --deploy-mode client), yarn cluster (--master yarn --deploy-mode cluster), etc. More information on Spark cluster execution modes can be found in the official Spark cluster deployment documentation.
Note that Spark can easily be run on a laptop in local mode using the --master local[*] setting described above, which SystemML supports.
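The execution-mode options are spark-submit options, so they are placed before the application JAR. As a rough sketch, the helper below maps a mode name to the corresponding options and prints the resulting command without running it. The function name systemml_cmd and the mode labels (local, yarn-client, yarn-cluster) are illustrative assumptions, not part of SystemML or Spark; the Kmeans invocation is reused from Example #1.

```shell
#!/bin/sh
# Print (do not run) a spark-submit command for a given execution mode.
# systemml_cmd and the mode labels are hypothetical names for this sketch.
systemml_cmd() {
  case "$1" in
    local)        master_opts='--master local[*]' ;;
    yarn-client)  master_opts='--master yarn --deploy-mode client' ;;
    yarn-cluster) master_opts='--master yarn --deploy-mode cluster' ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
  # spark-submit options come first, then the JAR, then SystemML arguments.
  echo "spark-submit $master_opts SystemML.jar -f scripts/algorithms/Kmeans.dml -nvargs X=X.mtx k=5"
}

systemml_cmd local
systemml_cmd yarn-client
systemml_cmd yarn-cluster
```

When typing the local-mode command by hand, it may be safer to quote the master value as 'local[*]', since some shells treat the square brackets as a glob pattern.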
Recommended Spark Configuration Settings
For best performance, we recommend setting the following configuration value when running SystemML with Spark: --conf spark.driver.maxResultSize=0. In Spark, a value of 0 removes the limit on the total size of results returned to the driver.
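As an illustration of where this setting goes on the command line, the sketch below combines it with a yarn client invocation of the Kmeans script from Example #1. The variable names SPARK_OPTS and SYSTEMML_ARGS are assumptions introduced for readability, and the script only prints the command instead of executing it.

```shell
#!/bin/sh
# Dry-run sketch: spark-submit options (including --conf) go before the
# application JAR; everything after SystemML.jar is passed to SystemML.
# SPARK_OPTS and SYSTEMML_ARGS are hypothetical variable names.
SPARK_OPTS="--master yarn --deploy-mode client --conf spark.driver.maxResultSize=0"
SYSTEMML_ARGS="-f scripts/algorithms/Kmeans.dml -nvargs X=X.mtx k=5"

FULL_CMD="spark-submit $SPARK_OPTS SystemML.jar $SYSTEMML_ARGS"
echo "$FULL_CMD"
```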
Examples
Please see the MNIST examples in the included SystemML-NN library for examples of Spark Batch mode execution with SystemML to train MNIST classifiers: