Apache Superset: Data Visualization & Exploration at Scale

Nuzhi Meyen
Python in Plain English
3 min read · Oct 21, 2021

Photo by Mohammad Rahmani on Unsplash

Apache Superset is one of the neater data visualization and exploration tools that I have come across.

Positioned as an open-source, cloud-native application, Superset was created by Maxime Beauchemin while he was working at Airbnb. Beauchemin is also the creator of Apache Airflow, another open-source, Python-based project for workflow management.

Apache Superset Logo — Image Courtesy — https://github.com/apache/superset

In terms of specifications, Superset's backend currently runs on Python 3.7+, and PostgreSQL or MySQL is the suggested choice for the metadata database that stores Superset's own state. Superset supports multiple data sources: any database with a SQLAlchemy dialect and the relevant connector can be queried, as shown below.

Supported Data Sources on Apache Superset — Image Courtesy — https://github.com/apache/superset
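To make the SQLAlchemy requirement concrete: a database is typically registered through a SQLAlchemy-style connection URI of the form `dialect+driver://user:password@host:port/database`. The snippet below is a minimal sketch that only assembles such URIs using the standard library. The helper name `build_sqlalchemy_uri`, the hosts and the credentials are all hypothetical, and the `pymysql` driver is just one possible choice.

```python
from urllib.parse import quote_plus


def build_sqlalchemy_uri(dialect, user, password, host, port, database, driver=None):
    """Compose a URI shaped like dialect+driver://user:password@host:port/database."""
    scheme = f"{dialect}+{driver}" if driver else dialect
    # Percent-encode credentials so characters such as "@" or "/" do not break the URI.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"{scheme}://{credentials}@{host}:{port}/{database}"


# Hypothetical credentials and hosts, for illustration only.
postgres_uri = build_sqlalchemy_uri(
    "postgresql", "analyst", "p@ss/word", "db.example.com", 5432, "sales"
)
mysql_uri = build_sqlalchemy_uri(
    "mysql", "analyst", "secret", "db.example.com", 3306, "sales", driver="pymysql"
)
print(postgres_uri)
print(mysql_uri)
```

The resulting string is what you would paste into the database connection form, provided the matching Python driver is installed alongside Superset.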

Besides these, Superset also uses other Python libraries on the backend, such as Flask, Pandas and Apache Arrow. On the frontend, the core is built with Node 14+ and is based on TypeScript, React, Redux, Ant Design and Emotion. Chart plugins are written in JavaScript and TypeScript.

In addition, there are optional components: Redis for caching (highly recommended when serving petabyte-scale data), Celery for scheduling and asynchronous functionality, and Selenium, with web driver support for Chrome and Firefox, for thumbnails, reports and alerts.
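As a rough idea of how these optional pieces are wired in, here is a sketch of a `superset_config.py` fragment that points the cache and Celery at a Redis server. It assumes Redis is running locally on its default port. The database numbers, timeout and key prefix are arbitrary choices of mine, and the exact `CACHE_TYPE` value depends on the Flask-Caching version you have installed, so treat this as a starting point and check it against the documentation for your Superset release.

```python
# superset_config.py (sketch): optional Redis cache and Celery settings.
# Hostname, port and database numbers below are assumptions for a local setup.
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Flask-Caching style settings read by Superset.
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",  # some older Flask-Caching releases use "redis"
    "CACHE_DEFAULT_TIMEOUT": 300,  # seconds
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": f"redis://{REDIS_HOST}:{REDIS_PORT}/0",
}


class CeleryConfig:
    """Celery settings for asynchronous queries and scheduled work."""

    broker_url = f"redis://{REDIS_HOST}:{REDIS_PORT}/1"
    result_backend = f"redis://{REDIS_HOST}:{REDIS_PORT}/2"


CELERY_CONFIG = CeleryConfig
```

Keeping the cache, broker and result backend in separate Redis databases is only a convention to make keys easier to inspect; a single database would also work.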

In terms of functionality, Superset's major features are as follows:

  • An easy-to-use and intuitive interface for visualizing data and creating dashboards.
  • A variety of visualizations to showcase data based on Apache ECharts integration.
  • Code-free visualization builder to extract and present datasets.
  • A built-in SQL IDE for preparing data for visualization, and browsing metadata.
  • A lightweight semantic layer for defining custom dimensions and metrics.
  • Out-of-the-box support for most SQL-speaking databases (those that have a Python DB-API driver and a SQLAlchemy dialect).
  • Seamless, in-memory asynchronous caching and queries (with Redis integration).
  • An extensible security model that allows configuring rules on who can access which product features and datasets (e.g. role-based access).
  • Integration with major authentication backends (database, OpenID, LDAP, OAuth, REMOTE_USER, etc.)
  • The ability to add custom visualization plugins based on Apache ECharts.
  • An API for programmatic customization.
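To give a feel for the API mentioned in the last point, the sketch below prepares two requests against Superset's REST API using only the standard library: one to log in and obtain an access token, and one to list charts with that token. The base URL and credentials are placeholders, the helper names are my own, and nothing is sent over the network until you pass a request to `urlopen`.

```python
import json
from urllib.request import Request

# Hypothetical base URL; point this at your own Superset instance.
BASE_URL = "http://localhost:8088"


def build_login_request(username, password, base_url=BASE_URL):
    """Build a POST request for the login endpoint, which returns an access token."""
    payload = {
        "username": username,
        "password": password,
        "provider": "db",  # authenticate against Superset's own user database
        "refresh": True,
    }
    return Request(
        f"{base_url}/api/v1/security/login",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_chart_list_request(access_token, base_url=BASE_URL):
    """Build a GET request that lists charts, authenticated with a bearer token."""
    return Request(
        f"{base_url}/api/v1/chart/",
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


login_request = build_login_request("admin", "admin")  # placeholder credentials
print(login_request.get_method(), login_request.full_url)
# To send it against a running instance:
#   from urllib.request import urlopen
#   token = json.load(urlopen(login_request))["access_token"]
```

Building the requests separately from sending them keeps the example runnable without a live server, and makes it easy to swap in another HTTP client later.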

While Apache Superset has a lot going for it, there are some limitations. For example, it cannot yet join data from different databases or sources to create a single report.

However, as an open-source solution for business intelligence reporting, Apache Superset is, in my opinion, probably one of the best tools available for handling data at scale.

And with that, we end our topic. Thank you for reading.


Co-founder of Helios P2P. Sri Lankan. Interested in Finance, Advanced Analytics, BI, Data Visualization, Computer Science, Statistics, and Design Thinking.