
Apache Spark™ – Unified Engine for large-scale data analytics

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a broad set of higher-level tools, including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Get Spark from the downloads page on the project website. This documentation is for Spark version 3.4.0. Spark uses the Hadoop client libraries for HDFS and YARN. Downloads are prepackaged for a handful of popular versions of Hadoop. Users can also download a “Hadoop-free” binary and run Spark with any version of Hadoop by augmenting the Spark classpath. Scala and Java users can include Spark in their projects using its Maven coordinates and Python users can install Spark from PyPI.
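For instance, Python users can typically pull the matching release straight from PyPI, and Scala or Java projects would depend on the corresponding Maven artifacts, e.g. org.apache.spark:spark-sql_2.12:3.4.0 (the version pin below is shown only for illustration):

    # Install the PySpark release from PyPI; pin the version that matches your cluster
    pip install pyspark==3.4.0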

If you want to build Spark from source, visit Building Spark.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on any platform that runs a supported version of Java. This includes JVMs on x86_64 and ARM64. It's easy to run locally on one machine: all you need is to have java installed on your system's PATH, or the JAVA_HOME environment variable pointing to a Java installation.
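For instance, on a UNIX-like shell it might look roughly like this (the JDK path is a placeholder; use the location of your own installation):

    # Point Spark at a specific JDK, then start a local shell
    export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
    ./bin/spark-shell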

Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+. Support for Python 3.7 is deprecated as of Spark 3.4.0. Java 8 support prior to version 8u362 is deprecated as of Spark 3.4.0. When using the Scala API, applications must use the same version of Scala for which Spark was built. For example, when using Scala 2.13, use Spark compiled for 2.13 and compile code/applications for Scala 2.13 as well.

For Java 11, you need to set -Dio.netty.tryReflectionSetAccessible=true for the Apache Arrow library. This prevents the java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available error when Apache Arrow uses Netty internally.
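One common way to pass this JVM flag is through the driver and executor extraJavaOptions settings at submit time; a minimal sketch, with the bundled pi.py example standing in for your own application:

    ./bin/spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
      --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
      examples/src/main/python/pi.py 10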

Spark comes with several sample programs. The Python, Scala, Java, and R examples are located in the examples/src/main directory.

To run Spark interactively in a Python interpreter, use bin/pyspark. Sample applications are also provided in Python.

To run one of the Scala or Java sample programs, use bin/run-example <class> [params] in the top-level Spark directory. (Behind the scenes, this invokes the more general spark-submit script to launch applications.)

You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

The --master option specifies the master URL of a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a complete list of options, run the Spark shell with the --help option.
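For example, the following commands run a sample job and start the interactive shells; they use the SparkPi example class and the pi.py example script that ship in the distribution's examples directory, so treat them as a sketch and adjust paths and options to your own setup:

    ./bin/run-example SparkPi 10
    ./bin/spark-shell --master "local[2]"
    ./bin/pyspark --master "local[2]"
    ./bin/spark-submit examples/src/main/python/pi.py 10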

Since version 1.4, Spark has provided an R API (only the DataFrame APIs are included). To run Spark interactively in an R interpreter, use bin/sparkR. Sample applications are also provided in R.
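For example, the following commands start the R shell and run a sample application; they assume the dataframe.R example script that ships with the Spark distribution:

    ./bin/sparkR --master "local[2]"
    ./bin/spark-submit examples/src/main/r/dataframe.R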

Run Spark client applications anywhere with Spark Connect

Spark Connect is a new client-server architecture introduced in Spark 3.4 that decouples client applications from Spark and enables remote connectivity to Spark clusters. The separation between client and server allows Spark and its open ecosystem to be leveraged from anywhere, integrated into any application. In Spark 3.4, Spark Connect provides DataFrame API coverage for PySpark and DataFrame/Dataset API support in Scala.

To learn more about Spark Connect and how to use it, see Spark Connect overview.
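As a rough sketch of what this looks like in practice, you start a Spark Connect server and then point a client at it. The commands below assume a Spark 3.4 binary distribution built for Scala 2.12 and the default server port 15002; adjust the package coordinate and host to your environment:

    # Start a Spark Connect server (listens on port 15002 by default)
    ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0

    # Connect an interactive PySpark client to the remote server
    ./bin/pyspark --remote "sc://localhost"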

The Spark cluster mode overview explains the key concepts for running on a cluster. Spark can run both on its own and on top of several existing cluster managers. It currently provides several options for deployment (an example cluster submission follows the list):

  • Standalone deployment mode: Easiest way to deploy Spark on a private cluster
  • Apache Mesos (deprecated)
  • Hadoop YARN
  • Kubernetes
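For illustration, a submission to a YARN cluster might look roughly like the following; the jar name matches the SparkPi example bundled with a default 3.4.0 / Scala 2.12 build and is only a placeholder, and other cluster managers take a different --master URL:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \
      examples/jars/spark-examples_2.12-3.4.0.jar \
      10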

Programming Guides:

  • Quick Start: A quick introduction to the Spark API; start here!
  • RDD Programming Guide: Overview of Spark basics: RDDs (core but old API), accumulators, and broadcast variables
  • Spark SQL, Datasets, and DataFrames: Structured data processing with relational queries (newer API than RDDs)
  • Structured Streaming: Processing structured data streams with relational queries (using Datasets and DataFrames, newer API than DStreams)
  • Spark Streaming: Processing data streams with DStreams (old API)
  • MLlib: Applying machine learning algorithms
  • GraphX: Graph processing
  • SparkR: Data processing with Spark in R
  • PySpark: Data processing with Spark in Python
  • Spark SQL CLI: Data processing with SQL on the command line

API Documents:

  • Spark Scala API (Scaladoc)
  • Spark Java API (Javadoc)
  • Spark Python API (Sphinx)
  • Spark R API (Roxygen2)
  • Spark SQL Built-in Functions (MkDocs)

Deployment Guides:

  • Cluster Overview: Overview of concepts and components when running on a cluster
  • Application submission: Packaging and deploying applications
  • Deployment modes:
    • Amazon EC2: Scripts that allow you to launch a cluster on EC2 in approximately 5 minutes
    • Standalone deployment mode: Start a standalone cluster quickly without a third-party cluster manager
    • Mesos: Deploy a private cluster with Apache Mesos
    • YARN: Deploy Spark on top of Hadoop NextGen (YARN)
    • Kubernetes: Deploy Spark on top of Kubernetes

Other documents:

  • Configuration: Customize Spark through its configuration system
  • Monitoring: Track the behavior of your applications
  • Tuning Guide: Best practices for optimizing performance and memory usage
  • Job scheduling: Scheduling resources between and within Spark applications
  • Security: Spark security support
  • Hardware provisioning: Cluster hardware recommendations
  • Integration with other storage systems:
    • Cloud infrastructures
    • OpenStack Swift
  • Migration Guide: Migration guides for Spark components
  • Building Spark: Building Spark with the Maven system
  • Contributing to Spark
  • Third-party projects: Related third-party Spark projects

External resources:

  • Spark home page
  • Spark community resources, including local meetups
  • StackOverflow tag apache-spark
  • Mailing lists: Ask questions about Spark here
  • AMP Camps: A series of boot camps at UC Berkeley that featured talks and exercises on Spark, Spark Streaming, Mesos, and more. Videos are available online for free.
  • Code samples: More are also available in the examples subfolder of Spark (Scala, Java, Python, R)
