The Spark Ecosystem
Spark’s ecosystem is presented in Figure 1–2. It has three main components:

- Environments: Spark can run anywhere and integrates well with other environments.
- Applications: Spark integrates well with a variety of big data platforms and applications.
- Data sources: Spark can read and write data from and to many data sources.
Spark’s expansive ecosystem makes PySpark a great tool for ETL, data analysis, and many other tasks. With PySpark, you can read data from many different data sources (the Linux filesystem, Amazon S3, the Hadoop Distributed File System, relational tables, MongoDB, Elasticsearch, Parquet files, etc.) and represent it as a Spark data abstraction, such as an RDD or a DataFrame.
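
As a brief illustration, the following sketch (assuming a hypothetical local CSV file at /tmp/users.csv and a standard SparkSession) shows how the same file can be read either as an RDD of lines or as a DataFrame:

```python
# Minimal sketch: reading data into Spark's two core abstractions.
# The file path and column layout are hypothetical, for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-examples").getOrCreate()

# Read a text file from the local filesystem as an RDD of lines.
rdd = spark.sparkContext.textFile("file:///tmp/users.csv")

# Read the same file as a DataFrame, letting Spark infer the schema.
df = spark.read.csv("file:///tmp/users.csv", header=True, inferSchema=True)

df.printSchema()
```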
Once your data is in that form, you can use a series of simple and powerful Spark transformations to shape it into the desired format. For example, you might use the filter() transformation to drop unwanted records, use groupByKey() to group your data by a desired key, and finally use the mapValues() transformation to perform an aggregation (such as finding the average, median, and standard deviation of the numbers) on each group, as shown in the sketch below.
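
A minimal sketch of that filter() / groupByKey() / mapValues() pipeline, using hypothetical in-memory (key, value) pairs rather than a real data source, might look like this:

```python
# Sketch of a filter -> groupByKey -> mapValues pipeline on a small RDD.
# The data and the filter threshold are hypothetical.
import statistics

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([
    ("a", 10), ("a", 20), ("a", 3),
    ("b", 5),  ("b", 70), ("b", 80),
])

def summarize(values):
    # Aggregate one group: average, median, and standard deviation.
    nums = list(values)
    return (statistics.mean(nums),
            statistics.median(nums),
            statistics.stdev(nums))

result = (
    pairs
    .filter(lambda kv: kv[1] > 4)   # drop unwanted records
    .groupByKey()                   # group values by key
    .mapValues(summarize)           # aggregate each group
)

print(result.collect())
```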
All of these transformations are made possible by the simple but powerful PySpark API.