Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a broad set of higher-level tools, including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Get Spark from the downloads page on the project website. This documentation is for Spark version 3.4.0. Spark uses the Hadoop client libraries for HDFS and YARN. Downloads are prepackaged for a handful of popular versions of Hadoop. Users can also download a “Hadoop-free” binary and run Spark with any version of Hadoop by augmenting the Spark classpath. Scala and Java users can include Spark in their projects using its Maven coordinates and Python users can install Spark from PyPI.
If you want to build Spark from source, visit Building Spark.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, macOS), and it should run on any platform that runs a supported version of Java. This includes JVMs on x86_64 and ARM64. It’s easy to run locally on one machine: all you need is to have java installed on your system’s PATH, or the JAVA_HOME environment variable pointing to a Java installation.
Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+. Support for Python 3.7 is deprecated as of Spark 3.4.0. Java 8 support prior to version 8u362 is deprecated as of Spark 3.4.0. When using the Scala API, applications must use the same version of Scala for which Spark was built. For example, when using Scala 2.13, use Spark compiled for 2.13 and compile code/applications for Scala 2.13 as well.
For Java 11, you need to set -Dio.netty.tryReflectionSetAccessible=true for the Apache Arrow library. This prevents the java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available error that occurs when Apache Arrow uses Netty internally.
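One way to pass this flag is through Spark's standard properties for extra driver and executor JVM options; the sketch below assumes a Spark distribution on Java 11, and the application file name is hypothetical:

```shell
# Pass the Netty/Arrow workaround flag to both the driver and executor JVMs
# (my_app.py is a hypothetical application path).
./bin/spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  my_app.py
```

The same options can instead be set once in conf/spark-defaults.conf.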
Spark comes with several sample programs. The Python, Scala, Java, and R examples are located in the examples/src/main directory.
To run Spark interactively in a Python interpreter, use bin/pyspark:
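For example, from the top-level directory of a Spark distribution (local[2] runs locally with two threads; a sketch, assuming java is on your PATH):

```shell
# Start the interactive Python shell against a local two-thread master
./bin/pyspark --master "local[2]"
```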
Sample applications are also provided in Python. For example:
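A sketch using the bundled pi-estimation example (the path below matches the examples/src/main layout mentioned above; adjust it if your distribution differs):

```shell
# Submit the Python pi-estimation example with 10 partitions
./bin/spark-submit examples/src/main/python/pi.py 10
```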
To run one of the Scala or Java sample programs, use bin/run-example <class> [params] in the top-level Spark directory. (Behind the scenes, this invokes the more general spark-submit script to launch applications.) For example,
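A sketch, using the SparkPi example class shipped with Spark:

```shell
# Run the bundled Scala/Java SparkPi example with 10 partitions
./bin/run-example SparkPi 10
```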
You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.
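For example, from the top-level Spark directory (a sketch; local[2] runs locally with two threads):

```shell
# Start the interactive Scala shell against a local two-thread master
./bin/spark-shell --master "local[2]"
```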
The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run the Spark shell with the --help option.
Since version 1.4, Spark has provided an R API (only DataFrame APIs are included). To run Spark interactively in an R interpreter, use bin/sparkR:
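For example, from the top-level Spark directory (a sketch):

```shell
# Start the interactive R shell against a local two-thread master
./bin/sparkR --master "local[2]"
```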
Sample applications are also provided in R. For example:
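A sketch using the bundled R DataFrame example (path as shipped under examples/src/main; adjust if your layout differs):

```shell
# Submit the bundled R DataFrame example
./bin/spark-submit examples/src/main/r/dataframe.R
```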
Run Spark client applications anywhere with Spark Connect

Spark Connect is a new client-server architecture introduced in Spark 3.4 that decouples client applications from Spark and enables remote connectivity to Spark clusters. The separation between client and server allows Spark and its open ecosystem to be leveraged from anywhere, integrated into any application. In Spark 3.4, Spark Connect provides DataFrame API coverage for PySpark and DataFrame/Dataset API support in Scala.
To learn more about Spark Connect and how to use it, see Spark Connect overview.
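To illustrate the client-server split, here is a minimal PySpark sketch: it assumes pyspark >= 3.4 is installed and a Spark Connect server is already listening at the endpoint shown (sc://localhost:15002 is used here as an assumed local endpoint):

```python
from pyspark.sql import SparkSession

# Create a remote session: the client holds no local Spark engine;
# DataFrame operations are sent to the Spark Connect server for execution.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(5)       # built lazily on the client
print(df.count())         # executed on the server, result returned to the client
```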
The Spark cluster mode overview explains the key concepts for running on a cluster. Spark can run both by itself and on top of several existing cluster managers. It currently provides several options for deployment:
- Standalone deployment mode: Easiest way to deploy Spark on a private cluster
- Apache Mesos (deprecated)
- Hadoop YARN
- Kubernetes

Programming Guides:
- Quick Start: A quick introduction to the Spark API; start here!
- RDD Programming Guide: Overview of Spark basics: RDDs (core but old API), accumulators, and broadcast variables
- Spark SQL, Datasets, and DataFrames: Structured data processing with relational queries (newer API than RDDs)
- Structured Streaming: Processing structured data streams with relational queries (using Datasets and DataFrames, newer API than DStreams)
- Spark Streaming: Processing data streams with DStreams (old API)
- MLlib: Applying machine learning algorithms
- GraphX: Graph processing
- SparkR: Data processing with Spark in R
- PySpark: Data processing with Spark in Python
- Spark SQL CLI: Data processing with SQL on the command line
API Docs:
- Spark Scala API (Scaladoc)
- Spark Java API (Javadoc)
- Spark Python API (Sphinx)
- Spark R API (Roxygen2)
- Spark SQL Built-in Functions (MkDocs)
Deployment Guides:
- Cluster Overview: Overview of concepts and components when running on a cluster
- Submitting Applications: Packaging and deploying applications
- Deployment modes:
  - Amazon EC2: Scripts that allow you to launch a cluster on EC2 in approximately 5 minutes
  - Standalone deployment mode: Start a standalone cluster quickly without a third-party cluster manager
  - Mesos: Deploy a private cluster with Apache Mesos
  - YARN: Deploy Spark on top of Hadoop NextGen (YARN)
  - Kubernetes: Deploy Spark on top of Kubernetes
Other Documents:
- Configuration: Customize Spark through its configuration system
- Monitoring: Track the behavior of your Spark applications
- Tuning Guide: Best practices for optimizing performance and memory usage
- Job Scheduling: Scheduling resources between and within Spark applications
- Security: Spark security support
- Hardware Provisioning: Cluster hardware recommendations
- Integration with other storage systems:
  - Cloud infrastructures
  - OpenStack Swift
- Migration Guide: Migration guides for Spark components
- Building Spark: Build Spark with the Maven system
- Contributing to Spark
- Third-party projects: Related third-party Spark projects
External Resources:
- Spark home page
- Spark community resources, including local meetups
- StackOverflow tag apache-spark
- Mailing lists: Ask questions about Spark here
- AMP Camps: A series of boot camps at UC Berkeley that featured talks and exercises on Spark, Spark Streaming, Mesos, and more. Videos are available online for free.
- Code samples: More are also available in the examples subfolder of Spark (Scala, Java, Python, R)