
Representing an EO data cube#

An EO data cube stores and organizes EO data, and possibly also additional data from other sources (e.g. a digital surface model). It falls entirely within the image domain of the framework and is constructed and managed by a software expert.

This notebook describes the data structures to represent an EO data cube in semantique, and how to interact with it.


Prepare#

Import packages:

[1]:
import semantique as sq
[2]:
import geopandas as gpd
import json

Structure#

There is not a single standard for how an EO data cube should be implemented in practice. Many approaches exist, which we refer to as different configurations of an EO data cube. The structure of semantique is flexible and does not put any limitations on what the configuration of an EO data cube should be. Instead, it allows expert users to create their own class that represents their specific EO data cube configuration. Such a class should always inherit from the abstract base class Datacube that semantique contains. This base class can be seen as a blueprint for any object that represents an EO data cube. Do note that an EO data cube representation in semantique never contains the data values of the cube themselves. Instead, it will have some property that allows the retriever to access the actual storage location. What this property looks like depends on the configuration of the EO data cube.

[3]:
base = sq.datacube.Datacube()

To ensure interoperability, an EO data cube representation of any configuration is required to contain two standard elements. The first one is the layout, which is generally provided when initializing an instance of an EO data cube class and stored inside the layout property. The layout is a repository containing a description of the content of the EO data cube. Each distinct data layer in the cube (e.g. the different spectral bands) should be described by a dedicated metadata object in the layout that can be indexed by a unique name assigned to that layer. This metadata object provides information about the values in the layer, but also about where the layer can be found inside the structure of the cube. In semantique, the layout is always expected to have the structure of a Python dictionary. In this dictionary, the names of the data layers are the keys and the metadata objects (which are themselves dictionaries) are the values. The dictionary may have a nested structure formalizing a categorization of data layers, e.g. the metadata object of the data layer “blue band” can be indexed by the key pair (“reflectance”, “blue band”) instead of only “blue band”.
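To make this concrete, a minimal layout with one level of nesting could look as follows. The layer names, value types and metadata keys shown here are purely illustrative; the keys that are actually required depend on the specific data cube configuration (see the sections below).

# A minimal, hypothetical layout dictionary with one level of nesting.
layout = {
    "reflectance": {
        "blue band": {
            "type": "continuous",
            "values": {"min": 0, "max": 1, "precision": 0.0001}
        },
        "green band": {
            "type": "continuous",
            "values": {"min": 0, "max": 1, "precision": 0.0001}
        }
    }
}

# The metadata object of the blue band is indexed by the key pair
# ("reflectance", "blue band").
layout["reflectance"]["blue band"]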

[4]:
hasattr(base, "layout")
[4]:
True

The second required element is the retriever function, which should be implemented as the retrieve() method of an EO data cube class. The retriever is an interface that exposes the content of the EO data cube to the other components in the querying framework. Given a textual reference to a data layer and a spatio-temporal extent, it retrieves the corresponding data values from the cube and returns them in a three-dimensional, spatio-temporal array. The exact implementation of the retriever depends on the configuration of the EO data cube, but the inputs it accepts and the output it returns do not. Hence, the retriever serves as a standardized API that allows other components of the framework to access the content of the EO data cube independently of its configuration.

[5]:
hasattr(base, "retrieve")
[5]:
True

It is important to note here that when evaluating a query recipe the query processor does not care at all what the configuration of the EO data cube is or how it retrieves data. It only interacts with the retriever function, and thus only cares about what input the retriever function accepts and in what format it returns the retrieved data.
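To illustrate what a custom configuration could look like, consider the sketch below. The class name, the connection object and the _fetch() helper are hypothetical assumptions; only the inherited structure (a layout property and a retrieve() method that returns a three-dimensional, spatio-temporal array) is what semantique expects.

import xarray as xr
import semantique as sq

class MyDatacube(sq.datacube.Datacube):
    """Hypothetical representation of a custom EO data cube configuration."""

    def __init__(self, layout, connection):
        super().__init__(layout)
        # Some handle to the actual storage location, e.g. a database
        # connection or a file path. What this is depends on the configuration.
        self.connection = connection

    def retrieve(self, *reference, extent):
        # Resolve the (possibly nested) layer reference against the layout.
        metadata = self.layout
        for key in reference:
            metadata = metadata[key]
        # Fetch the corresponding values from the storage backend and subset
        # them to the given spatio-temporal extent. The _fetch() helper is
        # hypothetical and stands for whatever access logic the backend needs.
        values = self._fetch(metadata, extent)
        # Return the values as a three-dimensional, spatio-temporal
        # xarray.DataArray (time, y, x).
        return xr.DataArray(values, dims = ("time", "y", "x"))

The query processor would then call something like dc.retrieve("reflectance", "blue band", extent = extent) on an instance of this class, without needing to know anything about the underlying storage.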

The Opendatacube configuration#

Through the Opendatacube subclass, semantique has built-in support for EO data cubes that are deployed using the Open Data Cube software. It contains a retriever that knows how to interact with the Python API of Open Data Cube.

To initialize a representation of such a data cube, you need to provide a valid layout dictionary as well as a connection in the form of a datacube.Datacube object from the Open Data Cube Python library, which allows the retriever to interact with the actual data storage. A valid layout for this configuration means that the metadata objects for each data layer should contain at least the following keys and values:

  • name: The name of the data layer inside the Open Data Cube. In Open Data Cube terminology this is called the measurement.

  • product: The product in which the data layer is stored inside the Open Data Cube.

  • type: The value type of the values in the data layer. There are no limits to what the value type categorization should be. If it differs from the semantique value types, it needs to be mapped through a configuration parameter when initializing the instance; see below.

  • values: An overview of the values in the data layer. If the values are numerical, this should be a dictionary containing the keys “min”, “max” and “precision”. If the values are categorical, this should be a list of dictionaries, with each dictionary referring to a single category and containing the keys “id”, “label” and “description”.

When initializing the instance, you should also provide the timezone in which the temporal coordinates inside the data cube are stored. By default, semantique assumes this is UTC. Additionally, you can provide several configuration parameters that tune the data retrieval process; see the documentation of the Opendatacube class for details.
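As an illustration of the layout requirements above, a layout entry for a hypothetical categorical layer in an Opendatacube configuration could look like this (the measurement, product and category names are made up):

# Hypothetical layout entry for a single data layer in an Opendatacube
# configuration. "name" is the measurement and "product" the product under
# which it is registered inside the Open Data Cube.
layout_entry = {
    "name": "SCL",
    "product": "s2_l2a",
    "type": "nominal",
    "values": [
        {"id": 4, "label": "vegetation", "description": "Vegetated pixels"},
        {"id": 6, "label": "water", "description": "Water pixels"}
    ]
}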

Since we do not have an Open Data Cube running in our semantique development environment, nor an Open Data Cube layout file stored inside the package, we cannot show code that initializes and interacts with an Opendatacube instance. However, the following code snippet should give an indication:

import datacube
import json
import semantique as sq

# Connect to the Open Data Cube instance.
odc_obj = datacube.Datacube()

# Initialize the semantique representation with a layout file and the connection.
with open("somedir/layout.json", "r") as file:
    dc = sq.datacube.Opendatacube(json.load(file), connection = odc_obj)

The GeotiffArchive configuration#

As a second built-in EO data cube configuration, semantique contains the so-called GeotiffArchive. This configuration is targeted at small data cubes that are nothing more than a zipped archive of multiple GeoTIFF files, each having two spatial dimensions and one temporal dimension. Given the nature of GeoTIFF, the time dimension will be recognized as the band dimension, and therefore the time coordinate of each band should be stored as its band description, as sketched below. Be aware that the GeotiffArchive configuration in semantique is mainly meant for demonstration purposes, and the retriever function is not optimized for performance (e.g. the full GeoTIFF is loaded into memory before subsetting in space and time).
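To illustrate the expected file structure, the sketch below writes such a GeoTIFF with rasterio, using the band descriptions to store the time coordinates. The file name, grid, CRS and dates are all made up; this is not code from semantique itself.

import numpy as np
import rasterio
from rasterio.transform import from_origin

# Three timesteps of a single data layer on a 100 x 100 grid (made-up values).
data = np.random.randint(0, 10, size = (3, 100, 100)).astype("float64")
timestamps = ["2019-01-01", "2019-06-01", "2019-12-01"]

profile = {
    "driver": "GTiff",
    "width": 100,
    "height": 100,
    "count": 3,  # the temporal dimension becomes the band dimension
    "dtype": "float64",
    "crs": "EPSG:3035",
    "transform": from_origin(4530000, 2697000, 10, 10),
}

with rasterio.open("colortype.tif", "w", **profile) as dst:
    for i, timestamp in enumerate(timestamps, start = 1):
        dst.write(data[i - 1], i)
        # Store the time coordinate of each band as its band description.
        dst.set_band_description(i, timestamp)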

To initialize a representation of a GeotiffArchive data cube, you need to provide a valid layout dictionary and the filepath to the zipped archive of GeoTIFF files. A valid layout for this configuration means that the metadata objects for each data layer should contain at least the following keys and values:

  • file: The filename of the GeoTIFF file in which the values for the data layer are stored.

  • type: The value type of the values in the data layer. There are no limits to what the value type categorization should be. If it differs from the semantique value types, it needs to be mapped through a configuration parameter when initializing the instance; see below.

  • values: An overview of the values in the data layer. If the values are numerical, this should be a dictionary containing the keys “min”, “max” and “precision”. If the values are categorical, this should be a list of dictionaries, with each dictionary referring to a single category and containing the keys “id”, “label” and “description”.

When initializing the instance, you should also provide the timezone in which the temporal coordinates inside the data cube are stored. By default, semantique assumes this is UTC. Additionally, you can provide several configuration parameters that tune the data retrieval process; see the documentation of the GeotiffArchive class for details.
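For comparison with the Opendatacube layout entry sketched earlier, a hypothetical entry for a GeotiffArchive layer could look like the following, with the file key pointing to a GeoTIFF inside the zipped archive. The entry is illustrative; the ids and labels are borrowed from the demo layer shown further below.

# Hypothetical layout entry for a single data layer in a GeotiffArchive
# configuration. "file" is the name of the GeoTIFF inside the zipped archive.
layout_entry = {
    "file": "colortype.tif",
    "type": "ordinal",
    "values": [
        {"id": 1, "label": "SVHNIR", "description": "Strong vegetation with high NIR"},
        {"id": 2, "label": "SVLNIR", "description": "Strong vegetation with low NIR"}
    ]
}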

Now we can initialize an instance of the GeotiffArchive class using the small demo data cube that is contained in semantique. The layout we use is read from the file files/layout_gtiff.json.

[6]:
with open("files/layout_gtiff.json", "r") as file:
    dc = sq.datacube.GeotiffArchive(json.load(file), src = "files/layers_gtiff.zip")
[7]:
isinstance(dc, sq.datacube.Datacube)
[7]:
True

To interact with the retriever, we need to provide an array that specifies the spatio-temporal extent inside which the requested data should be retrieved.

[8]:
from semantique.processor.utils import parse_extent
[9]:
space = sq.SpatialExtent(gpd.read_file("files/footprint.geojson"))
time = sq.TemporalExtent("2019-01-01", "2020-12-31")

extent = parse_extent(space, time, spatial_resolution = [-10, 10], crs = 3035)
[10]:
colortype = dc.retrieve("appearance", "colortype", extent = extent)
[11]:
colortype
[11]:
<xarray.DataArray 'colortype' (time: 3, y: 563, x: 576)>
array([[[29., 29., 29., ..., 29., 29., 29.],
        [29., 29., 29., ..., 29., 29., 29.],
        [29., 29., 29., ...,  5., 29., 29.],
        ...,
        [ 5.,  5.,  5., ..., 29., 29., 29.],
        [ 5.,  5.,  5., ..., 29., 29., 29.],
        [ 5., 29., 29., ..., 29., 29., 29.]],

       [[ 4.,  4.,  4., ...,  4.,  1.,  1.],
        [ 6.,  4.,  4., ...,  3.,  1.,  4.],
        [ 6.,  4.,  4., ...,  4.,  4.,  4.],
        ...,
        [ 1.,  1.,  1., ...,  1.,  1.,  1.],
        [ 1.,  1.,  1., ...,  1.,  1.,  1.],
        [ 1.,  1.,  1., ...,  1.,  1.,  1.]],

       [[21., 21., 21., ...,  4.,  5.,  4.],
        [21., 21., 21., ...,  3.,  6.,  4.],
        [21., 21., 21., ...,  3.,  4.,  4.],
        ...,
        [ 3.,  3.,  3., ...,  3., 27., 27.],
        [ 3., 27.,  3., ..., 27., 27., 27.],
        [27.,  3.,  7., ..., 27., 27., 27.]]])
Coordinates:
    spatial_ref    int32 0
  * x              (x) float64 4.53e+06 4.53e+06 ... 4.536e+06 4.536e+06
  * y              (y) float64 2.697e+06 2.697e+06 ... 2.691e+06 2.691e+06
    temporal_ref   int32 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    _FillValue:     1.7976931348623157e+308
    value_type:     ordinal
    value_labels:   {1: 'SVHNIR', 2: 'SVLNIR', 3: 'AVHNIR', 4: 'AVLNIR', 5: '...
[12]:
colors = {
    "Strong vegetation with high NIR": "darkgreen",
    "Strong vegetation with low NIR": "forestgreen",
    "Average vegetation with high NIR": "limegreen",
    "Average vegetation with low NIR": "mediumseagreen",
    "Weak vegetation": "springgreen",
    "Shadow area with vegetation": "palegreen",
    "Shrub Rangeland with High NIR": "darkgoldenrod",
    "Shrub Rangeland with Low": "goldenrod",
    "Herbaceous Rangeland": "gold",
    "Weak Rangeland": "khaki",
    "Pit or bog": "olive",
    "Greenhouse": "yellowgreen",
    "Very bright barren land or built-up": "darkred",
    "Bright barren land or built-up": "firebrick",
    "Strong barren land or built-up": "red",
    "Average barren land or built-up": "tomato",
    "Dark barren land or built-up": "salmon",
    "Weak barren land or Shadow area with Barren Land": "lightsalmon",
    "Near InfraRed-Peaked Barren land and Built-up": "coral",
    "Burned area": "yellow",
    "Deep water or shadow": "dodgerblue",
    "Shallow water or shadow": "deepskyblue",
    "Turbid water or shadow": "lightskyblue",
    "Salty Shallow Water": "lightblue",
    "Cloud": "lightgrey",
    "Smoke plume": "dimgrey",
    "Thin clouds over vegetation": "aquamarine",
    "Thin clouds on water area or Barren land or Built-up areas": "paleturquoise",
    "Snow or water ice": "snow",
    "Snow shadow": "mistyrose",
    "Shadow areas": "slategrey",
    "Flame": "plum",
    "Unknown": "black",
}
[13]:
values = list(range(1, 34))
levels = [x - 0.5 for x in values + [max(values) + 1]]
legend = {"ticks": [1, 10, 20, 30], "label": "colortype"}
colortype.plot(x = "x", y = "y", col = "time", levels = levels, colors = list(colors.values()), cbar_kwargs = legend)
[13]:
<xarray.plot.facetgrid.FacetGrid at 0x1e7da91caf0>
[Output figure: faceted maps of the colortype layer, one panel per time step]

The STACCube configuration#

Semantique contains a third built-in EO data cube configuration called STACCube. This configuration is targeted at ad-hoc data cubes built from the results of a STAC metadata search. Contrary to the Opendatacube and the GeotiffArchive, it doesn’t require pre-organising the data (e.g. ingesting the data into a database in the case of the Opendatacube, or creating a temporally stacked GeoTIFF in the case of the GeotiffArchive). Instead, the STACCube contains a retriever that knows how to fetch the assets linked in STAC search results into a data cube. If the linked assets are provided in Cloud Optimized GeoTIFF (COG) format, the STACCube offers a high-performance option for fetching and subsequently processing earth observation data only for the area and timespan of interest.

To initialize a representation of such a data cube, you need to provide the STAC search results and a valid layout dictionary. The STAC search results can be provided as a pystac.item_collection.ItemCollection or as an iterable of pystac.item.Item. It is important that the properties object of each item contains a non-null datetime attribute. Otherwise, the data can’t be indexed temporally. The spatial index is created based upon the asset-specific proj:epsg, proj:bbox, proj:shape and proj:transform attributes.
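As a quick illustration of these requirements, one could check the items before handing them to the STACCube. The helper below is just a sketch, assuming items is an iterable of pystac.item.Item objects:

def check_stac_items(items):
    """Illustrative sanity check of the STAC metadata that STACCube relies on."""
    for item in items:
        # Each item needs a non-null datetime, otherwise the data cannot be
        # indexed temporally.
        if item.datetime is None:
            raise ValueError(f"Item {item.id} has no datetime set")
        # The spatial index is built from the projection metadata, which may
        # be stored on the item itself or on its individual assets.
        has_proj = "proj:epsg" in item.properties or any(
            "proj:epsg" in asset.extra_fields for asset in item.assets.values()
        )
        if not has_proj:
            raise ValueError(f"Item {item.id} lacks proj:epsg metadata")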

A valid layout for this configuration means that the metadata objects for each data layer should contain at least the following keys and values:

  • name: The name of the data layer, called asset in STAC terminology.

  • type: The value type of the values in the data layer. There are no limits to what the value type categorization should be. If it differs from the semantique value types, it needs to be mapped through a configuration parameter when initializing the instance; see below.

  • values: An overview of the values in the data layer. If the values are numerical, this should be a dictionary containing the keys “min”, “max” and “precision”. If the values are categorical, this should be a list of dictionaries, with each dictionary referring to a single category and containing the keys “id”, “label” and “description”.

Additionally, you can provide several configuration parameters that tune the data retrieval process; see the documentation of the STACCube class for details. Unlike the Opendatacube and the GeotiffArchive, the STACCube doesn’t require you to specify the timezone in which the temporal coordinates inside the data cube are stored. Instead, the timezone is inferred from the STAC search results and the temporal coordinates are converted to UTC.

We will demonstrate the initialization of an instance of the STACCube class using some Sentinel-2 data fetched from the Registry of Open Data on AWS. The layout we use is read from the file files/layout_stac.json.

[14]:
from pystac_client import Client
from shapely.geometry import box

# define temporal & spatial range to perform STAC query
xmin, ymin, xmax, ymax = -2.75, 47.25, -2.25, 47.75
aoi = box(xmin, ymin, xmax, ymax)
t_range = ["2020-07-15", "2020-08-01"]

# STAC-based metadata retrieval
catalog = Client.open("https://earth-search.aws.element84.com/v1")
query = catalog.search(
    collections="sentinel-2-l2a",
    datetime=t_range,
    limit=100,
    intersects=aoi
)
item_coll = query.item_collection()

# list results - part I
items = list(query.items())
print(f"Found: {len(items):d} items.")

# list results - part II
stac_json = query.item_collection_as_dict()
gdf = gpd.GeoDataFrame.from_features(stac_json, "epsg:4326")
gdf
Found: 8 items.
[14]:
geometry created platform constellation instruments eo:cloud_cover proj:epsg mgrs:utm_zone mgrs:latitude_band mgrs:grid_square ... s2:datastrip_id s2:granule_id s2:reflectance_conversion_factor datetime s2:sequence earthsearch:s3_path earthsearch:payload_id earthsearch:boa_offset_applied processing:software updated
0 POLYGON ((-3.00025 47.85369, -3.00025 46.86571... 2023-09-09T07:24:43.535Z sentinel-2b sentinel-2 [msi] 0.000680 32630 30 T WT ... S2B_OPER_MSI_L2A_DS_S2RP_20230503T232911_S2020... S2B_OPER_MSI_L2A_TL_S2RP_20230503T232911_A0177... 0.969660 2020-07-30T11:18:07.218000Z 1 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/b2b3... True {'sentinel2-to-stac': '0.1.1'} 2023-09-09T07:24:43.535Z
1 POLYGON ((-3.00025 47.85369, -1.53274 47.84432... 2022-11-06T09:10:59.398Z sentinel-2b sentinel-2 [msi] 0.044423 32630 30 T WT ... S2B_OPER_MSI_L2A_DS_EPAE_20200730T131922_S2020... S2B_OPER_MSI_L2A_TL_EPAE_20200730T131922_A0177... 0.969660 2020-07-30T11:18:07.217000Z 0 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/2aa2... False {'sentinel2-to-stac': '0.1.0'} 2022-11-06T09:10:59.398Z
2 POLYGON ((-3.00025 47.85369, -3.00025 46.86571... 2023-07-18T00:30:34.159Z sentinel-2a sentinel-2 [msi] 82.378161 32630 30 T WT ... S2A_OPER_MSI_L2A_DS_S2RP_20230323T230825_S2020... S2A_OPER_MSI_L2A_TL_S2RP_20230323T230825_A0265... 0.968747 2020-07-25T11:18:09.991000Z 1 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/e73f... True {'sentinel2-to-stac': '0.1.1'} 2023-07-18T00:30:34.159Z
3 POLYGON ((-3.00025 47.85369, -1.53274 47.84432... 2022-11-06T09:16:29.219Z sentinel-2a sentinel-2 [msi] 76.555313 32630 30 T WT ... S2A_OPER_MSI_L2A_DS_MPS__20200725T120536_S2020... S2A_OPER_MSI_L2A_TL_MPS__20200725T120536_A0265... 0.968747 2020-07-25T11:18:09.991000Z 0 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/1ede... False {'sentinel2-to-stac': '0.1.0'} 2022-11-06T09:16:29.219Z
4 POLYGON ((-3.00025 47.85369, -3.00025 46.86571... 2023-09-09T03:08:50.038Z sentinel-2b sentinel-2 [msi] 25.680962 32630 30 T WT ... S2B_OPER_MSI_L2A_DS_S2RP_20230423T113853_S2020... S2B_OPER_MSI_L2A_TL_S2RP_20230423T113853_A0176... 0.968061 2020-07-20T11:18:06.203000Z 1 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/d045... True {'sentinel2-to-stac': '0.1.1'} 2023-09-09T03:08:50.038Z
5 POLYGON ((-3.00025 47.85369, -1.53274 47.84432... 2022-11-06T05:16:55.629Z sentinel-2b sentinel-2 [msi] 19.963404 32630 30 T WT ... S2B_OPER_MSI_L2A_DS_EPAE_20200720T131521_S2020... S2B_OPER_MSI_L2A_TL_EPAE_20200720T131521_A0176... 0.968061 2020-07-20T11:18:06.202000Z 0 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/ba8f... False {'sentinel2-to-stac': '0.1.0'} 2022-11-06T05:16:55.629Z
6 POLYGON ((-3.00025 47.85369, -3.00025 46.86571... 2023-09-09T09:10:47.384Z sentinel-2a sentinel-2 [msi] 85.173172 32630 30 T WT ... S2A_OPER_MSI_L2A_DS_S2RP_20230423T090139_S2020... S2A_OPER_MSI_L2A_TL_S2RP_20230423T090139_A0264... 0.967606 2020-07-15T11:18:09.131000Z 1 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/61a3... True {'sentinel2-to-stac': '0.1.1'} 2023-09-09T09:10:47.384Z
7 POLYGON ((-3.00025 47.85369, -1.53274 47.84432... 2022-11-06T05:17:10.863Z sentinel-2a sentinel-2 [msi] 81.696101 32630 30 T WT ... S2A_OPER_MSI_L2A_DS_SGS__20200715T140105_S2020... S2A_OPER_MSI_L2A_TL_SGS__20200715T140105_A0264... 0.967606 2020-07-15T11:18:09.130000Z 0 s3://sentinel-cogs/sentinel-s2-l2a-cogs/30/T/W... roda-sentinel2/workflow-sentinel2-to-stac/cbfa... False {'sentinel2-to-stac': '0.1.0'} 2022-11-06T05:17:10.863Z

8 rows × 42 columns

[15]:
import warnings

# define datacube
with open("files/layout_stac.json", "r") as file:
    dc = sq.datacube.STACCube(
        json.load(file),
        src = item_coll,
        dtype="int8",
        na_value=0,
        )

# use same extents as for STAC query to set up the context for the datacube
space = sq.SpatialExtent(gpd.GeoDataFrame(geometry=[aoi], crs = 4326))
time = sq.TemporalExtent(*t_range)
extent = parse_extent(space, time, spatial_resolution = [-0.001, 0.001], crs = 4326)

# load the data
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    data = dc.retrieve("appearance", "scl", extent=extent)
data
[15]:
<xarray.DataArray 'stackstac-e4a20ff0df5f8779104284b323af0f6b' (time: 4,
                                                                y: 500, x: 500)>
array([[[9., 9., 9., ..., 9., 9., 9.],
        [9., 9., 9., ..., 9., 9., 9.],
        [9., 9., 9., ..., 9., 9., 9.],
        ...,
        [6., 6., 6., ..., 9., 9., 9.],
        [6., 6., 6., ..., 9., 9., 9.],
        [6., 6., 6., ..., 9., 9., 9.]],

       [[9., 9., 8., ..., 4., 7., 8.],
        [4., 4., 4., ..., 4., 3., 4.],
        [4., 4., 4., ..., 4., 3., 7.],
        ...,
        [6., 6., 6., ..., 6., 6., 6.],
        [6., 6., 6., ..., 6., 6., 8.],
        [6., 6., 6., ..., 9., 9., 9.]],

       [[9., 9., 8., ..., 9., 9., 9.],
        [9., 9., 9., ..., 9., 9., 9.],
        [9., 9., 9., ..., 9., 9., 9.],
        ...,
        [6., 6., 6., ..., 8., 8., 8.],
        [6., 6., 6., ..., 8., 6., 8.],
        [6., 6., 6., ..., 8., 8., 8.]],

       [[4., 4., 4., ..., 4., 4., 4.],
        [4., 4., 4., ..., 4., 4., 4.],
        [4., 4., 4., ..., 4., 5., 4.],
        ...,
        [6., 6., 6., ..., 6., 6., 6.],
        [6., 6., 6., ..., 6., 6., 6.],
        [6., 6., 6., ..., 6., 6., 6.]]], dtype=float32)
Coordinates:
  * x              (x) float64 -2.75 -2.749 -2.748 ... -2.253 -2.252 -2.251
  * y              (y) float64 47.75 47.75 47.75 47.75 ... 47.25 47.25 47.25
    temporal_ref   int32 0
  * time           (time) datetime64[ns] 2020-07-15 2020-07-20 ... 2020-07-30
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    spec:          RasterSpec(epsg=4326, bounds=(-2.75, 47.25, -2.25, 47.75),...
    crs:           epsg:4326
    transform:     | 0.00, 0.00,-2.75|\n| 0.00,-0.00, 47.75|\n| 0.00, 0.00, 1...
    resolution:    0.001
    value_type:    ordinal
    value_labels:  {0: 'mask', 1: 'saturated', 2: 'dark', 3: 'shadow', 4: 've...
[16]:
import numpy as np

# plot results
colors = {
    "no data": [0, 0, 0],
    "saturated/defective": [255, 0, 0],
    "dark areas": [64, 64, 64],
    "cloud shadows": [131, 60, 12],
    "vegetation": [0, 255, 0],
    "not vegetated": [255, 255, 0],
    "water": [0, 0, 204],
    "unclassified": [117, 113, 113],
    "cloud medium prob": [174, 170, 170],
    "cloud high prob": [255, 255, 255],
    "thin cirrus": [0, 204, 255],
    "snow": [255, 102, 255]
}
values = list(range(0, 12))
levels = [x - 0.5 for x in values + [max(values) + 1]]
colors = list([np.array(x)/255 for x in colors.values()])
legend = {"ticks": [], "label": "colortype"}
data.plot(x = "x", y = "y", col = "time", levels = levels, colors = colors, cbar_kwargs = legend);
[Output figure: faceted maps of the scene classification (SCL) layer, one panel per time step]