
Internal query processing#

A semantic query is processed by a query processor. In semantique, this query processor is internally modelled as an object of class QueryProcessor, which can be seen as a worker class taking care of all tasks involving query processing. An instance of this class is initialized whenever a query recipe is executed within a specific context. That is, whenever you call the execute() method of a QueryRecipe instance, semantique internally creates a query processor object to take care of all processing tasks.

The query processor processes a query in three core phases: query parsing, query optimization and query execution. These align with the common phases in regular relational database querying. Each of these phases has its own dedicated method. In this notebook we will explain in more detail how each phase is implemented in semantique.


Prepare#

Import packages:

[1]:
import semantique as sq
from semantique.processor.core import QueryProcessor
[2]:
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import copy
import json

Load a query recipe:

[3]:
# Load a recipe.
with open("files/recipe.json", "r") as file:
    recipe = sq.QueryRecipe(json.load(file))

Set the context of query execution:

[4]:
# Load a mapping.
with open("files/mapping.json", "r") as file:
    mapping = sq.mapping.Semantique(json.load(file))

# Represent an EO data cube.
with open("files/layout_gtiff.json", "r") as file:
    dc = sq.datacube.GeotiffArchive(json.load(file), src = "files/layers_gtiff.zip")

# Set the spatio-temporal extent.
space = sq.SpatialExtent(gpd.read_file("files/footprint.geojson"))
time = sq.TemporalExtent("2019-01-01", "2020-12-31")

# Collect the full context.
# Including additional configuration parameters.
context = {
    "datacube": dc,
    "mapping": mapping,
    "space": space,
    "time": time,
    "crs": 3035,
    "tz": "UTC",
    "spatial_resolution": [-1500, 1500]
}

Query parsing#

During query parsing, all required components for processing (i.e., the query recipe, the mapping, the EO data cube, the spatial and temporal extents, and the additional configuration parameters) are read and combined into a single object which is used internally for further processing of the query. Hence, query parsing takes care of initializing a query processor instance. Therefore, the parse() method of the QueryProcessor is implemented as a classmethod.

One of the main tasks of the query parser is to parse the provided spatial and temporal extents into a single spatio-temporal array, see below. This array is used as a template array when retrieving data values from the EO data cube. Ideally, parsing should also take care of validating the components and their interrelations. For example, it should check if referenced concepts in the provided query recipe are actually defined in the provided ontology. Such functionality is not implemented yet in the current version of semantique.

[5]:
processor = QueryProcessor.parse(recipe, **context)
[6]:
type(processor)
[6]:
semantique.processor.core.QueryProcessor

Parsing the spatial and temporal extent#

To parse the spatial and temporal extent into a single spatio-temporal array, the query parser calls the utility function parse_extent(). The result is a DataArray object with a spatial and a temporal dimension. The workflow is as follows:

  1. The spatial extent is reprojected and rasterized given the coordinate reference system and spatial resolution specified in the configuration parameters. The resulting array is always rectangular and therefore covers the full bounding box of the area that was originally given as the spatial extent. Cells that don't overlap with the area itself are filled with nodata values. The other cells are filled with an integer index. If the given area consists of a single feature, they are all indexed with a value of 1. If the given area consists of multiple disconnected features, each feature gets a different index value.

  2. The spatial array is expanded over a time dimension. The time dimension has two coordinates: the first coordinate value corresponds to the timestamp at the start of the given time interval, and the second coordinate value corresponds to the timestamp at the end of the given time interval. The time values are converted into the specified time zone in the configuration parameters.

  3. The spatio-temporal array is trimmed, meaning that coordinates for which all pixel values are nodata are removed from the array. The spatial dimensions are treated differently: they are trimmed only at the edges, thus maintaining the regularity of the spatial dimensions.

You can call the parse_extent() function directly to create a spatio-temporal extent cube from a spatial and a temporal extent. It may occur that, due to the reprojection of the provided spatial extent, some pixels at the edges of the extent get a nodata value.

[7]:
from semantique.processor.utils import parse_extent
[8]:
extent = parse_extent(
    spatial_extent = context["space"],
    temporal_extent = context["time"],
    spatial_resolution = context["spatial_resolution"],
    crs = context["crs"],
    tz = context["tz"]
)
[9]:
extent
[9]:
<xarray.DataArray 'index' (time: 2, y: 4, x: 4)>
array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])
Coordinates:
  * time           (time) datetime64[ns] 2019-01-01 2020-12-31
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
    spatial_ref    int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
    temporal_ref   int64 0
Attributes:
    name:          index
    long_name:     index
    _FillValue:    nan
    value_type:    nominal
    value_labels:  {1: 'feature_1'}

Query optimization#

During query optimization, the query components are scanned and certain properties of the query processor are set. These properties influence some tweaks in how the query processor will behave when processing the query. For example, if the given spatial extent consists of multiple dispersed sub-areas, the query processor might instruct itself to load data separately for each sub-area, instead of loading data for the full extent and then subset it afterwards.

The query processor contains the optimize() method to run the optimization phase. It returns the same instance, but with updated properties. However, in the current version of semantique, the optimization phase only exists as a placeholder, and no properties are updated yet.

[10]:
processor = processor.optimize()
[11]:
type(processor)
[11]:
semantique.processor.core.QueryProcessor

Query execution#

Finally, the execution phase is where the query recipe is executed. The query processor contains the execute() method to run this phase. It returns the response of the query as a dictionary containing the outputs of all formulated results in the query recipe. Each of these outputs is a DataArray object.

[12]:
response = processor.execute()
[13]:
response
[13]:
{'blue_map': <xarray.DataArray 'blue_map' (y: 4, x: 4)>
 array([[1., 0., 0., 0.],
        [0., 0., 0., 1.],
        [1., 0., 1., 0.],
        [0., 0., 0., 0.]])
 Coordinates:
   * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
   * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
     spatial_ref    int64 0
     temporal_ref   int64 0
     spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
 Attributes:
     value_type:  discrete,
 'green_map': <xarray.DataArray 'green_map' (y: 4, x: 4)>
 array([[1., 2., 1., 2.],
        [2., 2., 0., 1.],
        [1., 1., 0., 1.],
        [2., 0., 1., 2.]])
 Coordinates:
   * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
   * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
     spatial_ref    int64 0
     temporal_ref   int64 0
     spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
 Attributes:
     value_type:  discrete,
 'blue_curve': <xarray.DataArray 'blue_curve' (time: 3)>
 array([0., 1., 3.])
 Coordinates:
     spatial_ref   int64 0
   * time          (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-1...
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete,
 'green_curve': <xarray.DataArray 'green_curve' (time: 3)>
 array([ 0., 13.,  6.])
 Coordinates:
     spatial_ref   int64 0
   * time          (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-1...
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete,
 'blue_stat': <xarray.DataArray 'blue_stat' ()>
 array(4.)
 Coordinates:
     spatial_ref   int64 0
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete,
 'green_stat': <xarray.DataArray 'green_stat' ()>
 array(19.)
 Coordinates:
     spatial_ref   int64 0
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete}

Handler functions#

When executing a query recipe, the query processor moves through the building blocks of the result instructions. For each type of building block, it has a specific handler function implemented as a method. Whenever the processor reaches a new block, it looks up its type and then calls the corresponding handler with the call_handler() method. For example, the handle_concept() method knows how to call the translator of the provided mapping in order to translate a textual semantic concept reference into a semantic array, while the handle_reduce() method knows how to apply the reduce verb to such an array.

[14]:
processor.handle_concept(sq.entity("vegetation"))
[14]:
<xarray.DataArray 'vegetation' (time: 3, y: 4, x: 4)>
array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[1., 1., 1., 1.],
        [1., 1., 0., 1.],
        [1., 1., 0., 1.],
        [1., 0., 1., 1.]],

       [[0., 1., 0., 1.],
        [1., 1., 0., 0.],
        [0., 0., 0., 0.],
        [1., 0., 0., 1.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    _FillValue:     1.7976931348623157e+308
    value_type:     binary

Data structures#

Internally the query processor uses DataArray objects from the xarray package to represent multi-dimensional arrays. It extends these objects by adding a so-called accessor with the name sq. In practice, this means that all semantique-specific methods and properties for these objects can be called on any DataArray object through the .sq prefix:

xarray_object.sq.method()
xarray_object.sq.property

Extending xarray in this way is recommended by its developers. Read more about it here. The semantique-specific properties for data arrays include for example the value type of the cube as well as spatio-temporal properties like the spatial resolution, the coordinate reference system in which the spatial coordinates are expressed and the time zone in which the temporal coordinates are expressed. The semantique-specific methods for data arrays include all the implemented verbs, as well as other functionalities that are internally used by the query processor, such as unstacking and aligning arrays. Finally, there are methods for conversions to other objects or to files on disk. For all semantique specific methods and properties for data arrays, see the API reference.
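For example, the parsed spatio-temporal extent array created earlier exposes its value type and value labels through the accessor. Their values correspond to the value_type and value_labels attributes shown in its array representation above:

extent.sq.value_type
extent.sq.value_labels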

For collections of multiple arrays, semantique contains the Collection class. This is simply a list of DataArray objects. Its methods include the verbs for array collections, as well as some single-array methods that are mapped over each member of the collection.

For all specific methods and properties of these objects, see the API reference. To ensure compatible behaviour with single arrays, you may also call all these methods and properties with the prefix .sq. However, this is not an xarray accessor; it is merely provided to simplify internal code.
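A minimal sketch of working with such a collection is shown below. Note that the import path and the exact constructor of the Collection class are assumptions here; see the API reference for its precise location and signature.

from semantique.processor.arrays import Collection

coll = Collection([response["blue_map"], response["green_map"]])
len(coll)                # behaves like a regular list of DataArray objects
coll[0].sq.value_type    # single-array methods and properties remain available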

Tracking value types#

When moving through the building blocks of a query recipe, the query processor keeps track of the value type of the wrangled arrays. Such a value type describes what kind of data the array contains. It differs from the very technical, computer-oriented numpy.dtype categorization, which contains e.g. int, float, etc. Instead, the value type describes data on a more general, statistical level.

Currently semantique makes a distinction between two value types for quantitative data:

  • Continuous

  • Discrete

And three value types for qualitative data:

  • Nominal

  • Ordinal

  • Binary

Additional value types exist for spatio-temporal data:

  • datetime (for timestamps)

  • coords (for spatial coordinate tuples)

  • geometry (for spatial geometries stored in GeoDataFrame objects)

It is expected that whenever an array is retrieved from the EO data cube, the corresponding value type is stored as a semantique-specific array attribute value_type. Even when the data are qualitative, they are usually stored as numbers. In these cases, an additional attribute value_labels may be used to define the mapping between character-encoded labels and integer-encoded indices.

[15]:
colors = dc.retrieve("appearance", "colortype", extent = extent)
[16]:
colors.sq.value_type
[16]:
'ordinal'
[17]:
colors.sq.value_labels
[17]:
{1: 'SVHNIR',
 2: 'SVLNIR',
 3: 'AVHNIR',
 4: 'AVLNIR',
 5: 'WV',
 6: 'SHV',
 7: 'SHRBRHNIR',
 8: 'SHRBRLNIR',
 9: 'HRBCR',
 10: 'WR',
 11: 'PB',
 12: 'GH',
 13: 'VBBB',
 14: 'BBB',
 15: 'SBB',
 16: 'ABB',
 17: 'DBB',
 18: 'WBBorSHB',
 19: 'NIRPBB',
 20: 'BA',
 21: 'DPWASH',
 22: 'SLWASH',
 23: 'TWASH',
 24: 'SASLWA',
 27: 'TNCLV',
 28: 'TNCLWA_BB',
 29: 'SN',
 30: 'SHSN',
 31: 'SH',
 32: 'FLAME'}
[18]:
water = mapping.translate("entity", "water", extent = extent, datacube = dc)
[19]:
water.sq.value_type
[19]:
'binary'

Whenever applying actions to an array, its value type might change. For example, when evaluating an expression involving a comparison operator, the resulting values are always binary, and when counting the number of “true” values in a binary array with a reducer, the resulting values are discrete. This is what we call type promotion. Each implemented operator and reducer function is able to promote the value type of the output given the type(s) of the input(s), by using a type promotion manual.

The type promotion manual of a univariate operator or a reducer has a dictionary structure with the supported input value types as keys. The value of each key is the corresponding output value type. For example, the count() reducer should only be applied to arrays containing binary data, since it counts the number of “true” values. The output of that operation, however, is an array with discrete quantitative data. Hence, the corresponding type promotion manual looks like this:

[20]:
from semantique.processor.types import TYPE_PROMOTION_MANUALS
[21]:
TYPE_PROMOTION_MANUALS["count"]
[21]:
{'binary': 'discrete', '__preserve_labels__': 0}

The additional key “__preserve_labels__” defines if value labels should be preserved after the reducer is applied, with 0 meaning “no” and 1 meaning “yes”.
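To illustrate this key, the type promotion manual of a hypothetical reducer that accepts qualitative inputs, returns the same value type and keeps the value labels of its input could look like this:

hypothetical_manual = {"ordinal": "ordinal", "nominal": "nominal", "__preserve_labels__": 1}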

The reduction function will apply the type promotion whenever track_types = True, which is the default. When we apply the count reducer to a binary array, this means the value type of the output will be discrete:

[22]:
from semantique.processor import reducers
[23]:
new = water.sq.reduce(reducers.count_, "time")
new.sq.value_type
[23]:
'discrete'

Now, when we call this reducer function on e.g. an ordinal array instead of a binary array, an InvalidValueTypeError will be thrown. Hence, the query processor uses type tracking also to check type promotions, and thus, to detect invalid operations.

[24]:
from semantique.exceptions import InvalidValueTypeError
[25]:
try:
    colors.sq.reduce(reducers.count_, "time")
except InvalidValueTypeError as e:
    print(e)
Unsupported operand value type(s) for 'count': [['ordinal']]

This will not happen when track_types = False. In that case, value types are not tracked and the output array has no value type at all.

[26]:
new = colors.sq.reduce(reducers.count_, "time", track_types = False)
"value_type" in new.attrs
[26]:
False

For bivariate expressions, the type promotion manuals have an extra “layer” for the second operand. The first layer of keys refers to the value type of the first operand (i.e. x) and the second layer of keys to the value type of the second operand (i.e. y). For example, the boolean operator and() accepts only arrays with binary values, and returns binary values as well. Hence, the corresponding type promotion manual looks like this:

[27]:
TYPE_PROMOTION_MANUALS["and"]
[27]:
{'binary': {'binary': 'binary'}, '__preserve_labels__': 1}

In the case of these bivariate operators, the “__preserve_labels__” key may also have a value of 2, meaning that instead of the labels of the first operand, the labels of the second operand should be preserved (this is the case e.g. with the assign() operator).

Just as with the reducers, the operators promote the value type of the output based on the manual, and throw an error if a combination of types is not supported.

[28]:
from semantique.processor import operators
[29]:
new = water.sq.evaluate(operators.and_, water)
new.sq.value_type
[29]:
'binary'
[30]:
try:
    colors.sq.evaluate(operators.and_, water)
except InvalidValueTypeError as e:
    print(e)
Unsupported operand value type(s) for 'and': [['ordinal'], ['binary']]

If the second operand in a bivariate expression is not an array but a single value, its value type is determined based on its numpy data type. In semantique, each of these numpy data types is mapped to one or more semantique value types. In the case of a mapping to multiple value types, all operations supporting at least one of these types will accept the value.

[31]:
from semantique.processor.types import DTYPE_MAPPING
[32]:
DTYPE_MAPPING
[32]:
{'b': ['binary'],
 'i': ['continuous', 'discrete', 'ordinal', 'nominal'],
 'u': ['continuous', 'discrete', 'ordinal', 'nominal'],
 'f': ['continuous'],
 'M': ['datetime'],
 'O': ['ordinal', 'nominal'],
 'U': ['ordinal', 'nominal']}
[33]:
new = water.sq.evaluate(operators.and_, True)
new.sq.value_type
[33]:
'binary'

Inside the reducer and operator functions, an instance of the TypePromoter class is the worker that actually takes care of the type promotion. It can be initialized by providing it with the operand(s) of the process, and the name of the function. Then, it will find the type promotion manual belonging to that function. Alternatively (e.g. when you add your own reducers or operators, see the next sections) you can provide it your own custom type promotion manual.

The promoter can determine the value types of the operands and check if they are supported by the operation. Subsequently, it can promote the value type of the operation output.

[34]:
from semantique.processor.types import TypePromoter
[35]:
promoter = TypePromoter(water, colors, function = "and")
[36]:
try:
    promoter.check()
except InvalidValueTypeError as e:
    print(e)
Unsupported operand value type(s) for 'and': [['binary'], ['ordinal']]
[37]:
promoter = TypePromoter(water, water, function = "and")
[38]:
promoter.check()
[39]:
promoter.promote(new)
[39]:
<xarray.DataArray 'water' (time: 3, y: 4, x: 4)>
array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 0.]],

       [[1., 0., 0., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [0., 0., 0., 0.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    _FillValue:     1.7976931348623157e+308
    value_type:     binary

Now what if you don’t want the query processor to track types at all? There are several reasons why you’d want to do so. For example, type tracking can be so strict that it limits your flexibility of using certain processes “out of the box”. Some EO data cube layouts may not specify value types at all, and some custom operators or reducers that you use may not have any type promotion manual. Also, type tracking of course takes time and thus decreases query performance. No worries, it is easy to tell the query processor to not track types. Simply add the configuration parameter “track_types” when initializing the query processor, and set its value to False:

context["track_types"] = False
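Alternatively, you can set the parameter only for a single execution, without modifying the original context, using the same pattern as in the caching examples further below:

response = recipe.execute(**{**context, "track_types": False})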

Adding custom verbs#

Besides the built-in semantique verbs, there is the option to provide your own set of custom verb functions when initializing the query processor. Each custom verb requires a name and a callable function. The function has to accept a DataArray object as its first argument. This is the active evaluation object in the processing chain, to which the verb is applied. It must also contain a boolean keyword argument called “track_types”, which defines if value type tracking should be applied. Hence, the function should also be able to track value types if this is requested. You can add as many additional keyword arguments as you want. The return value of the function must again be a DataArray object. Internally, semantique will forward your function and the keyword arguments to the pipe() method in xarray. Only use this functionality if you really know what you are doing!

Let's create a trivial custom verb that simply assigns a true value (i.e. 1) to each pixel in an array.

[40]:
def make_true(obj, track_types = True, **kwargs):
    # Work on a deep copy so the input array itself is not modified.
    newobj = obj.copy(deep = True)
    newobj.values = np.ones_like(newobj)
    if track_types:
        # All values are "true" now, hence the output has a binary value type.
        newobj.sq.value_type = "binary"
        del newobj.sq.value_labels
    return newobj

When executing a query recipe, we can now add this custom verb to the configuration parameters as a dictionary with the function name as a key and the function itself as a value.

[41]:
new_recipe = sq.QueryRecipe()
new_context = copy.deepcopy(context)

new_recipe["vals"] = sq.reflectance("s2_band04")
new_recipe["ones"] = sq.result("vals").apply_custom("make_true")
[42]:
from semantique.exceptions import UnknownVerbError
[43]:
try:
    print(new_recipe.execute(**new_context)["ones"])
except UnknownVerbError as e:
    print(e)
Custom verb 'make_true' is not defined
[44]:
new_context["custom_verbs"] = {"make_true": make_true}
[45]:
try:
    print(new_recipe.execute(**new_context)["ones"])
except UnknownVerbError as e:
    print(e)
<xarray.DataArray 'ones' (time: 3, y: 4, x: 4)>
array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    _FillValue:     1.7976931348623157e+308
    value_type:     binary

Adding custom operators#

Operators are used when evaluating expressions on arrays with the evaluate() verb. For example, you can compare each value in an array with a given constant (using e.g. the greater() operator), add two arrays together (using the add() operator), etcetera. When creating a query recipe, you can refer to these operators by their name, for example:

sq.result("water_count").evaluate("multiply", 2)

By default, the operator functions are taken from the operators module of semantique. This is the module in which all built-in operator functions are defined. However, when initializing a query processor you can provide a set of additional operator functions that may be used during query execution.

Let us prepare such a custom operator. This requires a name, an operator function and a type promotion manual. If you plan to disable type tracking either way, the latter is of course not needed. An operator function has to accept a DataArray object as its first argument. This is the active evaluation object in the processing chain, to which the verb is applied. Bivariate operators also accept a second argument, which can either be a DataArray object as well, or a single value. The return value of the operator must again be a DataArray object. Usually it is a good idea to use the xarray.apply_ufunc() function inside operator functions, as shown below. Also, you should not forget to align the second operand to the input object before applying the core operation!

It is also possible to implement operator functions with more than two operands. All additional keyword arguments that the evaluate verb receives are simply forwarded to the operator function. You could for example build a query recipe like the following:

sq.result("water_count").evaluate("custom", y = 1, foo = 10, bar = 100)

You then need to implement an operator with four arguments:

def custom(x, y, foo, bar)
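For illustration, a fuller but hypothetical version of such an operator could look like the sketch below. It follows the same pattern as the modulus operator defined further down, but leaves out type tracking, which is discussed next.

def custom(x, y, foo, bar, **kwargs):
    # Hypothetical logic: scale y by foo, add bar, then add the result to x.
    f = lambda x, y: x + y * foo + bar
    # Align the second operand to the input object before applying the operation.
    y = xr.DataArray(y).sq.align_with(x)
    return xr.apply_ufunc(f, x, y)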

If you want to track value types, the custom operator should also have a binary “track_types” argument. Inside the function, the type promotion should be applied whenever the “track_types” argument is set to True. You can create your own type promotion manual and provide it to the TypePromoter instance through the “manual” argument, or use the type promotion manual of a built-in operator function by providing the name of that function through the “function” argument.

Let's create a custom operator that calculates the modulus of two arrays.

[46]:
def modulus(x, y, track_types = True, **kwargs):
    if track_types:
        # Only continuous operands are supported; the output is continuous as well.
        manual = {"continuous": {"continuous": "continuous"}, "__preserve_labels__": 0}
        promoter = TypePromoter(x, y, manual = manual)
        promoter.check()
    f = lambda x, y: np.mod(x, y)
    # Align the second operand with the input object before applying the operation.
    y = xr.DataArray(y).sq.align_with(x)
    out = xr.apply_ufunc(f, x, y)
    if track_types:
        out = promoter.promote(out)
    return out

When executing a query recipe, we can now add this custom operator to the configuration parameters as a dictionary with the function name as a key and the function itself as a value. Note that if a custom operator has the same name as a default operator, the default operator will be replaced by the custom operator.

[47]:
new_recipe = sq.QueryRecipe()
new_context = copy.deepcopy(context)

red = sq.reflectance("s2_band04")
nir = sq.reflectance("s2_band08")

new_recipe["mod"] = nir.evaluate("modulus", red).evaluate("floor")
[48]:
from semantique.exceptions import UnknownOperatorError
[49]:
try:
    print(new_recipe.execute(**new_context)["mod"])
except UnknownOperatorError as e:
    print(e)
Operator 'modulus' is not defined
[50]:
new_context["custom_operators"] = {"modulus": modulus}
[51]:
try:
    print(new_recipe.execute(**new_context)["mod"])
except UnknownOperatorError as e:
    print(e)
<xarray.DataArray 'mod' (time: 3, y: 4, x: 4)>
array([[[ 434.,  530.,  553.,  856.],
        [ 499.,  725.,  291.,  378.],
        [ 427.,  530., 2690.,  310.],
        [ 644.,  231.,  317.,  656.]],

       [[ 211.,  273.,  195.,  255.],
        [ 245.,   61.,  367.,  229.],
        [ 210.,   20.,  188.,  288.],
        [  77.,  709.,  136.,  172.]],

       [[ 314.,  161.,  737.,  101.],
        [ 752.,  340.,  409.,  247.],
        [ 344.,  724., 1085.,   64.],
        [ 387., 1549.,  556.,  369.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    value_type:  discrete

Adding custom reducers#

Reducers are used inside the reduce() verb as a function to aggregate values along a dimension of an array. For example, you can reduce the temporal dimension of an array by calculating the average value of each slice along the time dimension axis (using the mean() reducer), etcetera. When creating a query recipe, you can refer to these reducers by their name, for example:

sq.result("water_count_space").reduce("mean", "time")

By default, the reducer functions are taken from the reducers module of semantique. This is the module in which all built-in reducer functions are defined. However, when initializing a query processor you can provide a set of additional reducer functions that may be used during query execution.

Let us prepare such a custom reducer. This requires a name, a reducer function and a type promotion manual. If you plan to disable type tracking either way, the latter is of course not needed. A reducer function has to accept a DataArray object as its first argument. This is the active evaluation object in the processing chain, to which the verb is applied. The return value of the reducer must again be a DataArray object. Usually it is a good idea to use the reduce() method of xarray inside reducer functions, as shown below. This method will take care e.g. of translating the dimension name (if provided) into its corresponding axis number.

A reducer function also needs to accept additional keyword arguments. Two of them are “reserved”, and should never be used for a custom argument: “dim” for the name of the dimension to reduce over, and “axis” for the axis number belonging to that dimension. All other keyword arguments are simply forwarded to the reducer function. That allows you to include additional variables in your reducer functions (e.g. constants).

If you want to track value types, the custom reducer should also have a binary “track_types” argument. Inside the function, the type promotion should be applied whenever the “track_types” argument is set to True. You can create your own type promotion manual and provide it to the TypePromoter instance through the “manual” argument, or use the type promotion manual of a built-in reducer function by providing the name of that function through the “function” argument.

Let's create a custom reducer that reduces a set of values to their sum of squares.

[52]:
def sum_of_squares(x, track_types = False, **kwargs):
    if track_types:
        # Re-use the type promotion manual of the built-in sum reducer.
        promoter = TypePromoter(x, function = "sum")
        promoter.check()
    f = lambda x, axis = None: np.sum(np.square(x), axis)
    out = x.reduce(f, **kwargs)
    if track_types:
        out = promoter.promote(out)
    return out

When executing a query recipe, we can now add this custom reducer to the configuration parameters as a dictionary with the function name as a key and the function itself as a value. Note that if a custom reducer has the same name as a default reducer, the default reducer will be replaced by the custom reducer.

[53]:
new_recipe = sq.QueryRecipe()
new_context = copy.deepcopy(context)

nir = sq.reflectance("s2_band08")

new_recipe["foo"] = nir.reduce("sum_of_squares", "space")
[54]:
from semantique.exceptions import UnknownReducerError
[55]:
try:
    print(new_recipe.execute(**new_context)["foo"])
except UnknownReducerError as e:
    print(e)
Reducer 'sum_of_squares' is not defined
[56]:
new_context["custom_reducers"] = {"sum_of_squares": sum_of_squares}
[57]:
try:
    print(new_recipe.execute(**new_context)["foo"])
except UnknownReducerError as e:
    print(e)
<xarray.DataArray 'foo' (time: 3)>
array([2.1983427e+08, 5.3802670e+07, 3.9970504e+07])
Coordinates:
    spatial_ref   int64 0
  * time          (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-1...
    temporal_ref  int64 0
Attributes:
    value_type:  continuous

Exporting the response#

Each array in the response is a DataArray object, and can thus be further processed with one of the many functions that the xarray package has to offer. It is also possible to convert a response to another format. Semantique contains functions to convert an array to a DataFrame object from the pandas package, and to a GeoDataFrame object from the geopandas package (requires spatial dimensions). These conversion methods can be called through the sq-accessor of the arrays.

[58]:
response["blue_curve"].sq.to_dataframe()
[58]:
blue_curve
time
2019-12-15 10:17:33.408715 0.0
2020-09-05 10:17:43.167942 1.0
2020-12-19 10:17:34.610661 3.0
[59]:
response["blue_map"].sq.to_geodataframe()
[59]:
y x blue_map geometry
0 2696250.0 4530750.0 1.0 POINT (4530750.000 2696250.000)
1 2696250.0 4532250.0 0.0 POINT (4532250.000 2696250.000)
2 2696250.0 4533750.0 0.0 POINT (4533750.000 2696250.000)
3 2696250.0 4535250.0 0.0 POINT (4535250.000 2696250.000)
4 2694750.0 4530750.0 0.0 POINT (4530750.000 2694750.000)
5 2694750.0 4532250.0 0.0 POINT (4532250.000 2694750.000)
6 2694750.0 4533750.0 0.0 POINT (4533750.000 2694750.000)
7 2694750.0 4535250.0 1.0 POINT (4535250.000 2694750.000)
8 2693250.0 4530750.0 1.0 POINT (4530750.000 2693250.000)
9 2693250.0 4532250.0 0.0 POINT (4532250.000 2693250.000)
10 2693250.0 4533750.0 1.0 POINT (4533750.000 2693250.000)
11 2693250.0 4535250.0 0.0 POINT (4535250.000 2693250.000)
12 2691750.0 4530750.0 0.0 POINT (4530750.000 2691750.000)
13 2691750.0 4532250.0 0.0 POINT (4532250.000 2691750.000)
14 2691750.0 4533750.0 0.0 POINT (4533750.000 2691750.000)
15 2691750.0 4535250.0 0.0 POINT (4535250.000 2691750.000)

Semantique also allows you to export an array to either a CSV file or a GeoTIFF file (the latter requires spatial dimensions). To do so, call respectively the to_csv() or to_geotiff() methods through the sq-accessor of the arrays.
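For example (a sketch with hypothetical file paths, assuming the methods take the output path as their first argument):

response["blue_curve"].sq.to_csv("files/blue_curve.csv")
response["blue_map"].sq.to_geotiff("files/blue_map.tif")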

Caching data layers#

The query processor allows caching of retrieved data layers, such that a data layer that is referenced multiple times in the query recipe or the mapping does not have to be retrieved again from the data source. Since RAM memory requirements are proportional to the number of data layers that are stored as intermediate results, data layers should only be cached in RAM if they are needed again when evaluating downstream parts of the recipe. This requires foresight about the execution order of the recipe, which accordingly requires a preview run preceding the actual execution. This preview run is performed by loading the data at a drastically reduced spatial resolution (a 5x5 pixel grid). It resolves the data references and fills a cache by creating a list of the data references in the order in which they are evaluated. This list is then used dynamically during the actual execution of the recipe as a basis for keeping data layers in the cache and reading them from there if they are needed again.

Below the result of the preview run is shown first to demonstrate what the resolved data references look like. You will see that the same data layer is referenced multiple times. The resulting initialised cache can then be fed as an argument to the QueryProcessor in a second step for the actual recipe execution.

[60]:
# Step I: preview run.
qp = QueryProcessor.parse(recipe, **{**context, "preview": True})
qp.optimize().execute()
qp.cache.seq
[60]:
[['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype']]
[61]:
# Step II: query processor execution.
qp = QueryProcessor.parse(recipe, **{**context, "cache": qp.cache})
response = qp.optimize().execute()

When executing a query recipe you can directly initiate the entire caching workflow (preview & full resolution recipe execution) by setting the “cache_data” argument to True:

[62]:
# Same as above in a single step.
response = recipe.execute(**{**context, "cache_data": True})

Caching does not always lead to an increase in performance. The effect depends on:

  • The resolution in which the query recipe is executed.

  • The redundancy of the data references in the recipe, i.e. if layers are referenced multiple times, loading them from the cache will reduce the overall processing time significantly.

  • The data source (EO data cube) from which they are retrieved.

It should be noted that in our demos only data loaded from locally stored GeoTIFF files are analysed. This is more or less the worst case for demonstrating the benefits of caching, since the data are stored locally and are therefore quickly accessible. Also, GeoTIFF files that are not stored in a cloud-optimised format (COGs) require loading all the data into memory even when running in preview mode, just to evaluate the sequence of data layers. Keep in mind, however, that caching is designed for and particularly beneficial in the case of STACCubes when loading data over the internet.

Logging progress#

When processing a query, the query processor logs its progress. Most of these logs are of the “DEBUG” level. They are helpful to debug the query processing, since they show intermediate outputs when the query processor iterates through the building blocks in the query recipe.

[63]:
import logging
import sys

logger = logging.getLogger("semantique.processor.core")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.FileHandler(filename = "files/logs.txt", mode = "w"))
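If you rather want to see the logs directly in the console, you can attach a stream handler to the same logger instead (a minimal sketch using the standard logging module):

logger.addHandler(logging.StreamHandler(sys.stdout))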
[64]:
response = recipe.execute(**context)

See here the log file we created.