
Apache Spark™ – Unified Engine for large-scale data analytics

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a broad set of higher-level tools, including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Get Spark from the downloads page on the project website. This documentation is for Spark version 3.4.0. Spark uses the Hadoop client libraries for HDFS and YARN. Downloads are prepackaged for a handful of popular versions of Hadoop. Users can also download a “Hadoop-free” binary and run Spark with any version of Hadoop by augmenting the Spark classpath. Scala and Java users can include Spark in their projects using its Maven coordinates and Python users can install Spark from PyPI.
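For instance, Python users can typically pull the matching release straight from PyPI, and Scala or Java projects would depend on the corresponding Maven artifacts, e.g. org.apache.spark:spark-sql_2.12:3.4.0 (the version pin below is shown only for illustration):

    # Install the PySpark release from PyPI; pin the version that matches your cluster
    pip install pyspark==3.4.0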

If you want to build Spark from source, visit Building Spark.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on any platform that runs a supported version of Java. This includes JVMs on x86_64 and ARM64. It's easy to run locally on one machine: all you need is to have java installed on your system's PATH, or the JAVA_HOME environment variable pointing to a Java installation.
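For instance, on a UNIX-like shell it might look roughly like this (the JDK path is a placeholder; use the location of your own installation):

    # Point Spark at a specific JDK, then start a local shell
    export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
    ./bin/spark-shell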

Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+. Support for Python 3.7 is deprecated as of Spark 3.4.0. Java 8 support prior to version 8u362 is deprecated as of Spark 3.4.0. When using the Scala API, applications must use the same version of Scala for which Spark was built. For example, when using Scala 2.13, use Spark compiled for 2.13 and compile code/applications for Scala 2.13 as well.

For Java 11, you need to set -Dio.netty.tryReflectionSetAccessible=true for the Apache Arrow library. This prevents the java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available error when Apache Arrow uses Netty internally.
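One common way to pass this JVM flag is through the driver and executor extraJavaOptions settings at submit time; a minimal sketch, with the bundled pi.py example standing in for your own application:

    ./bin/spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
      --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
      examples/src/main/python/pi.py 10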

Spark comes with several sample programs. The Python, Scala, Java, and R examples are located in the examples/src/main directory.

To run Spark interactively in a Python interpreter, use bin/pyspark. Sample applications are also provided in Python.

To run one of the Scala or Java sample programs, use bin/run-example <class> [params] in the top-level Spark directory. (Behind the scenes, this invokes the more general spark-submit script to launch applications.)

You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

The --master option specifies the master URL of a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a complete list of options, run the Spark shell with the --help option.
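For example, the following commands run a sample job and start the interactive shells; they use the SparkPi example class and the pi.py example script that ship in the distribution's examples directory, so treat them as a sketch and adjust paths and options to your own setup:

    ./bin/run-example SparkPi 10
    ./bin/spark-shell --master "local[2]"
    ./bin/pyspark --master "local[2]"
    ./bin/spark-submit examples/src/main/python/pi.py 10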

Since version 1.4, Spark has provided an R API (only the DataFrame APIs are included). To run Spark interactively in an R interpreter, use bin/sparkR. Sample applications are also provided in R.
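For example, the following commands start the R shell and run a sample application; they assume the dataframe.R example script that ships with the Spark distribution:

    ./bin/sparkR --master "local[2]"
    ./bin/spark-submit examples/src/main/r/dataframe.R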

Run Spark client applications anywhere with Spark Connect

Spark Connect is a new client-server architecture introduced in Spark 3.4 that decouples client applications from Spark and enables remote connectivity to Spark clusters. The separation between client and server allows Spark and its open ecosystem to be leveraged from anywhere, integrated into any application. In Spark 3.4, Spark Connect provides DataFrame API coverage for PySpark and DataFrame/Dataset API support in Scala.

To learn more about Spark Connect and how to use it, see Spark Connect overview.
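As a rough sketch of what this looks like in practice, you start a Spark Connect server and then point a client at it. The commands below assume a Spark 3.4 binary distribution built for Scala 2.12 and the default server port 15002; adjust the package coordinate and host to your environment:

    # Start a Spark Connect server (listens on port 15002 by default)
    ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0

    # Connect an interactive PySpark client to the remote server
    ./bin/pyspark --remote "sc://localhost"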

The Spark cluster mode overview explains the key concepts for running on a cluster. Spark can run both on its own and on top of several existing cluster managers. It currently provides several options for deployment (an example cluster submission follows the list):

  • Standalone deployment mode: Easiest way to deploy Spark on a private cluster
  • Apache Mesos (deprecated)
  • Hadoop YARN
  • Kubernetes
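For illustration, a submission to a YARN cluster might look roughly like the following; the jar name matches the SparkPi example bundled with a default 3.4.0 / Scala 2.12 build and is only a placeholder, and other cluster managers take a different --master URL:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \
      examples/jars/spark-examples_2.12-3.4.0.jar \
      10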

Programming Guides:

  • Quick Start: A quick introduction to the Spark API; start here!
  • RDD Programming Guide: Overview of Spark basics: RDDs (core but old API), accumulators, and broadcast variables
  • Spark SQL, Datasets, and DataFrames: Structured data processing with relational queries (newer API than RDDs)
  • Structured Streaming: Processing structured data streams with relational queries (using Datasets and DataFrames, newer API than DStreams)
  • Spark Streaming: Processing data streams with DStreams (old API)
  • MLlib: Applying machine learning algorithms
  • GraphX: Graph processing
  • SparkR: Data processing with Spark in R
  • PySpark: Data processing with Spark in Python
  • Spark SQL CLI: Data processing with SQL on the command line

API Documents:

  • Spark Scala API (Scaladoc)
  • Spark Java API (Javadoc)
  • Spark Python API (Sphinx)
  • Spark R API (Roxygen2)
  • Spark SQL Built-in Functions (MkDocs)

Deployment Guides:

  • Cluster Overview: Overview of concepts and components when running on a cluster
  • Application submission: Packaging and deploying applications
  • Deployment modes:
    • Amazon EC2: Scripts that allow you to launch a cluster on EC2 in approximately 5 minutes
    • Standalone deployment mode: Start a standalone cluster quickly without a third-party cluster manager
    • Mesos: Deploy a private cluster with Apache Mesos
    • YARN: Deploy Spark on top of Hadoop NextGen (YARN)
    • Kubernetes: Deploy Spark on top of Kubernetes

Other documents:

  • Configuration: Customize Spark through its configuration system
  • Monitoring: Track the behavior of your applications
  • Tuning Guide: Best practices for optimizing performance and memory usage
  • Job scheduling: Scheduling resources between and within Spark applications
  • Security: Spark security support
  • Hardware provisioning: Cluster hardware recommendations
  • Integration with other storage systems:
    • Cloud infrastructures
    • OpenStack Swift
  • Migration Guide: Migration guides for Spark components
  • Building Spark: Building Spark with the Maven system
  • Contributing to Spark
  • Third-party projects: Related third-party Spark projects

External resources:

  • Spark home page
  • Spark community resources, including local meetups
  • StackOverflow tag apache-spark
  • Mailing lists: Ask questions about Spark here
  • AMP Camps: A series of boot camps at UC Berkeley that featured talks and exercises on Spark, Spark Streaming, Mesos, and more. Videos are available online for free.
  • Code samples: More are also available in the examples subfolder of Spark (Scala, Java, Python, R)
