Downloading Overture Map Foundation’s buildings data using Apache Sedona with Docker — Python and stored them in PostGIS

Pavlos Demetriades
Python in Plain English
7 min readJan 2, 2024

--

The Overture Maps Foundation is a project by the Linux Foundation that aims to create reliable, easy-to-use, and interoperable open map data 1. It is a collaborative effort by Amazon Web Services (AWS), Meta (i.e. Facebook), Microsoft, and TomTom to build an ecosystem of interoperable open map data. The foundation is intended for developers who build map services or use geospatial data. Overture aims to incorporate map data from multiple sources including Overture Members, civic organizations, and open data sources.The foundation will simplify interoperability by providing a system that links entities from different data sets to the same real-world entities. Overture Maps data will undergo validation checks to detect map errors, breakage, and vandalism to help ensure that map data can be used in production systems. Overture will define and drive adoption of a common, well-structured, and documented data schema to create an easy-to-use ecosystem of map data.

Apache Sedona is a cluster computing system that processes large-scale spatial data. It is an open-source project of the Apache Software Foundation that extends existing cluster computing systems, such as Apache Spark and Apache Flink, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines. It supports various geospatial data formats, tools, and languages, and has many use cases and applications in various domains. The system offers Scala, Java, Python, and R APIs and integrates them into the underlying system kernels for spatial analytics and data mining applications.

The use case.

In this article we are going to download all the buildings of Cyprus (we hope so that all exist 😎) from the Overture Maps Foundation dataset — buildings and stored them in PostGIS using docker — python.

For more about the conversion of parquet to geoparquet format, read this article from Jia Yu.

Steps

  1. Docker Hub — Pulling the Apache Sedona image — Running a container — Configuring Ports
  2. Jupyter Lab in Docker — Converting buildings data to GeoDataFrame
  3. Downloading the buildings data locally — Geojson format
  4. Stored the building data in our PostGIS Geodatabase
  5. Using QGIS to visualize Cyprus Buildings which are stored in PostGIS.

1. Docker hub— Pulling the Apache Sedona image — Running a container — Configuring Ports

First of all, we must download Docker hub. After the download we search for apache/sedona. Then we press Run.

After that, we choose the Images tab and the apache/sedona Image must show up.

Then we press the Run button (under the Actions Group). We choose optional Settings and then we configure the ports as shown in image, below, after we run the container ▶️.

After that in the container tab, a container of Apache Sedona must show up (For me the name is thirsty_leavitt, for you must be Apache Sedona — 1).

👉 Ctrl + Right click http://127.0.0.1:8888/lab

After pressing Ctrl + Right click http://127.0.0.1:8888/ lab a jupyter lab environment open in the browser.

2. Jupyter Lab in Docker — Converting buildings data to Geopandas GeoDataFrame

To start the procedure to find all the building of Cyprus we must open the Sedona_OvertureMaps_GeoParquet.ipynb for guidance. Also, we create a new .ipynb file called Cyprus_Buildings.ipynb. After that we start coding in Cyprus_Buildings.ipynb.

# Importing Libraries

from sedona.spark import *
import os
import time
import pandas as pd
import geopandas as gpd
# AWS Data link for Overture Maps Data

DATA_LINK = "s3a://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/"
# Configuring 

config = SedonaContext.builder().master("spark://localhost:7077") .\
config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
getOrCreate()

sedona = SedonaContext.create(config)
# Cyprus closed polygon or your Area of Interest, THE POLYGON must be closed
spatial_filter = "POLYGON ((32.222900390625 34.479391971048, 32.222900390625 35.719757939334, 34.749755859375 35.719757939334, 34.749755859375 34.479391971048,32.222900390625 34.479391971048))"
%%time # Magic command to print the processing time

df_building = sedona.read.format("geoparquet").load(DATA_LINK+"theme=buildings/type=building")

df_building = df_building.filter("ST_Contains(ST_GeomFromWKT('"+spatial_filter+"'), geometry) = true")

df_building = df_building.limit(2_000_000) # 2_00_000 optional

# Wall time: 24.5s for ~ 800_000 Buildings - Awesome !!🚀🚀
%%time
# Visualizing all the buildings using Kepler

map_building = SedonaKepler.create_map(df_building, 'Building')
map_building

Converting Apache Sedona DataFrame to Pandas DataFrame and then to GeoDataFrame

# Converting Apache Sdona DataFrame to Pandas DataFrame

df = df_building.toPandas()

# and then converting the DataFrame to GeoDataFrame

gdf = gpd.GeoDataFrame(data=df)

Downloading Cyprus Buildings in csv format in the Container.

# Downloading the data in the container in Geojson format

gdf.to_csv(path_or_buf="Downloaded_Data/OMF_Buildings.csv")

3. Downloading the buildings data locally — Geojson format

A new file called Downloaded_Data must be shown up (left panel of jupyter lab). Inside of this folder a file Called OMF_Buildings.geojson appears. To download it locally we press right click on it and then download. We save the file in a folder so that we know where it is located.

4. Storing the building data in our PostGIS Geodatabase using Python — Anaconda locally (not using docker).

Before we start, we must create a virtual environment using conda. In Anaconda Shell we type

Command Line

conda create -n sql python # creating the new environment
conda activate sql # activate the new environment
conda install ipython-sql sqlalchemy psycopg2 notebook pandas geopandas -c conda-forge # installing necessary libraries

Python

# importing libraries
import os
from sqlalchemy import create_engine
from sqlalchemy import inspect
import pandas as pd
import geopandas as gpd
# variables for the connection of our database

host = "localhost" # Geodatabase is local
database = "Cyprus_Data" # The name of your Postgres geodatabase
user=os.getenv("USERNAME") # The username of your geodatabase is using
password = os.getenv("PASSWORD") # Password of your geodatabase
connection_string = f"postgresql://{user}:{password}@{host}/{database}"
# Inspect the engine if it is working
insp = inspect(engine)
Cyprus_Data_tables = insp.get_table_names() # Here we collect all the tables of our database to see if it is working

To store the data in PostGIS first we must create a new null table in PostGIS. One way to do this is using this code in python.

%%sql # Magic command in python to 'transform' the cell in a 'sql cell'

# only lower case to name a table in a database (e.g. ofm_buildings)

CREATE TABLE omf_buildings(name varchar, geom geometry);

or

In your pgAdmin 4 you find your database -> tables -> create new table (please use only lower case to name the new table).

Loading ofm_building using geopandas

CY_Buildings.to_postgis(name='omf_buildings', con=engine, if_exists='replace', index=False)

💡Hint: If your other geospatial data in the is in different EPSG from 4326, I suggest you to transform the ofm_building to that EPSG. For me the EPSG is 6312 (For Cyprus).

Transforming the EPSG.

CY_Buildings = gpd.read_postgis(sql= "SELECT * FROM omf_buildings",con = engine,geom_col='geometry')

# Replace 4326 and 6312 with your current and target EPSG codes
source_epsg = 4326
target_epsg = 6312

# Transform the geometry column to the new EPSG using SQL
sql_query = f"""
SELECT *, ST_Transform(geometry, {target_epsg}) AS transformed_geom
FROM omf_buildings;
"""
transformed_gdf = gpd.read_postgis(sql_query, con=engine,geom_col="transformed_geom")

# Save the transformed GeoDataFrame back to the database
transformed_gdf.to_postgis("your_table_name_transformed", con=engine, if_exists="replace", index=False)

5. Using QGIS to visualize Cyprus Buildings which are stored in our PostGIS database.

Database Tab -> DB Manager -> Cyprus Data (name of GeodataBase) -> ofm_buildings

Let’s try the new 3D Tiles in QGIS version 3.34. So fast … 🚀🚀

PlainEnglish.io 🚀

Thank you for being a part of the In Plain English community! Before you go:

--

--

Higher Diploma - Rural, Surveyor and Geoinformatics Engineer (NTUA), MSc - Geoinformatics and Earth Observation (CUT)