
Internal query processing#

A semantic query is processed by a query processor. In semantique, this query processor is internally modelled as an object of class QueryProcessor, which can be seen as a worker class taking care of all tasks involving query processing. An instance of this class is initialized whenever a query recipe is executed within a specific context. That is, whenever you call the execute() method of a QueryRecipe instance, semantique internally creates a query processor object to take care of all processing tasks.

The query processor processes a query in three core phases: query parsing, query optimization and query execution. These align with the common phases in regular relational database querying. Each of these phases has its own dedicated method. In this notebook we will explain in more detail how each phase is implemented in semantique.


Prepare#

Import packages:

[1]:
import semantique as sq
from semantique.processor.core import QueryProcessor
[2]:
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import copy
import json

Load a query recipe:

[3]:
# Load a recipe.
with open("files/recipe.json", "r") as file:
    recipe = sq.QueryRecipe(json.load(file))

Set the context of query execution:

[4]:
# Load a mapping.
with open("files/mapping.json", "r") as file:
    mapping = sq.mapping.Semantique(json.load(file))

# Represent an EO data cube.
with open("files/layout_gtiff.json", "r") as file:
    dc = sq.datacube.GeotiffArchive(json.load(file), src = "files/layers_gtiff.zip")

# Set the spatio-temporal extent.
space = sq.SpatialExtent(gpd.read_file("files/footprint.geojson"))
time = sq.TemporalExtent("2019-01-01", "2020-12-31")

# Collect the full context.
# Including additional configuration parameters.
context = {
    "datacube": dc,
    "mapping": mapping,
    "space": space,
    "time": time,
    "crs": 3035,
    "tz": "UTC",
    "spatial_resolution": [-1500, 1500]
}

Query parsing#

During query parsing, all required components for processing (i.e., the query recipe, the mapping, the EO data cube, the spatial and temporal extents, and the additional configuration parameters) are read and combined into a single object which is used internally for further processing of the query. Hence, query parsing takes care of initializing a query processor instance. Therefore, the parse() method of the QueryProcessor is implemented as a classmethod.

One of the main tasks of the query parser is to parse the provided spatial and temporal extents into a single spatio-temporal array, see below. This array is used as a template array when retrieving data values from the EO data cube. Ideally, parsing should also take care of validating the components and their interrelations. For example, it should check if referenced concepts in the provided query recipe are actually defined in the provided ontology. Such functionality is not implemented yet in the current version of semantique.

[5]:
processor = QueryProcessor.parse(recipe, **context)
[6]:
type(processor)
[6]:
semantique.processor.core.QueryProcessor

Parsing the spatial and temporal extent#

To parse the spatial and temporal extent into a single spatio-temporal array, the query parser calls the utility function parse_extent(). The result is a DataArray object with a spatial and a temporal dimension. The workflow is as follows:

  1. The spatial extent is reprojected and rasterized given the coordinate reference system and spatial resolution specified in the configuration parameters. The resulting array is always rectangular and therefore covers the full bounding box of the area that was originally given as the spatial extent. Cells that don't overlap with the area itself are filled with nodata values. The other cells are filled with an integer index. If the given area consists of a single feature, they are all indexed with a value of 1. If the given area consists of multiple disconnected features, each feature gets a different index value.

  2. The spatial array is expanded over a time dimension. The time dimension has two coordinates: the first coordinate value corresponds to the timestamp at the start of the given time interval, and the second coordinate value corresponds to the timestamp at the end of the given time interval. The time values are converted into the specified time zone in the configuration parameters.

  3. The spatio-temporal array is trimmed, meaning that coordinates for which all pixel values are nodata are removed from the array. The spatial dimensions are treated differently: they are trimmed only at the edges, thus maintaining the regularity of the spatial dimensions.

You can call the parse_extent() function directly to create a spatio-temporal extent cube from a spatial and a temporal extent. It may occur that, due to the reprojection of the provided spatial extent, some pixels at the edges of the extent get a nodata value.

[7]:
from semantique.processor.utils import parse_extent
[8]:
extent = parse_extent(
    spatial_extent = context["space"],
    temporal_extent = context["time"],
    spatial_resolution = context["spatial_resolution"],
    crs = context["crs"],
    tz = context["tz"]
)
[9]:
extent
[9]:
<xarray.DataArray 'index' (time: 2, y: 4, x: 4)>
array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])
Coordinates:
  * time           (time) datetime64[ns] 2019-01-01 2020-12-31
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
    spatial_ref    int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
    temporal_ref   int64 0
Attributes:
    name:          index
    long_name:     index
    _FillValue:    nan
    value_type:    nominal
    value_labels:  {1: 'feature_1'}

Query optimization#

During query optimization, the query components are scanned and certain properties of the query processor are set. These properties influence some tweaks in how the query processor will behave when processing the query. For example, if the given spatial extent consists of multiple dispersed sub-areas, the query processor might instruct itself to load data separately for each sub-area, instead of loading data for the full extent and then subset it afterwards.

The query processor contains the optimize() method to run the optimization phase. It returns the same instance, but with updated properties. However, in the current version of semantique, the optimization phase only exists as a placeholder, and no properties are updated yet.

[10]:
processor = processor.optimize()
[11]:
type(processor)
[11]:
semantique.processor.core.QueryProcessor

Query execution#

Finally, the execution phase is where the query recipe is executed. The query processor contains the execute() method to run this phase. It returns the response of the query as a dictionary containing the outputs of all formulated results in the query recipe. Each of these outputs is a DataArray object.

[12]:
response = processor.execute()
[13]:
response
[13]:
{'blue_map': <xarray.DataArray 'blue_map' (y: 4, x: 4)>
 array([[1., 0., 0., 0.],
        [0., 0., 0., 1.],
        [1., 0., 1., 0.],
        [0., 0., 0., 0.]])
 Coordinates:
   * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
   * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
     spatial_ref    int64 0
     temporal_ref   int64 0
     spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
 Attributes:
     value_type:  discrete,
 'green_map': <xarray.DataArray 'green_map' (y: 4, x: 4)>
 array([[1., 2., 1., 2.],
        [2., 2., 0., 1.],
        [1., 1., 0., 1.],
        [2., 0., 1., 2.]])
 Coordinates:
   * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
   * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
     spatial_ref    int64 0
     temporal_ref   int64 0
     spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
 Attributes:
     value_type:  discrete,
 'blue_curve': <xarray.DataArray 'blue_curve' (time: 3)>
 array([0., 1., 3.])
 Coordinates:
     spatial_ref   int64 0
   * time          (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-1...
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete,
 'green_curve': <xarray.DataArray 'green_curve' (time: 3)>
 array([ 0., 13.,  6.])
 Coordinates:
     spatial_ref   int64 0
   * time          (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-1...
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete,
 'blue_stat': <xarray.DataArray 'blue_stat' ()>
 array(4.)
 Coordinates:
     spatial_ref   int64 0
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete,
 'green_stat': <xarray.DataArray 'green_stat' ()>
 array(19.)
 Coordinates:
     spatial_ref   int64 0
     temporal_ref  int64 0
 Attributes:
     value_type:  discrete}

Handler functions#

When executing a query recipe, the query processor moves through the building blocks of the result instructions. For each type of building block, it has a specific handler function implemented as a method. Whenever the processor reaches a new block, it looks up its type and then calls the corresponding handler with the call_handler() method. For example, the handle_concept() method knows how to call the translator of the provided mapping in order to translate a textual semantic concept reference into a semantic array, while the handle_reduce() method knows how to apply the reduce verb to such an array.

[14]:
processor.handle_concept(sq.entity("vegetation"))
[14]:
<xarray.DataArray 'vegetation' (time: 3, y: 4, x: 4)>
array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[1., 1., 1., 1.],
        [1., 1., 0., 1.],
        [1., 1., 0., 1.],
        [1., 0., 1., 1.]],

       [[0., 1., 0., 1.],
        [1., 1., 0., 0.],
        [0., 0., 0., 0.],
        [1., 0., 0., 1.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    _FillValue:     1.7976931348623157e+308
    value_type:     binary

Data structures#

Internally the query processor uses DataArray objects from the xarray package to represent multi-dimensional arrays. It extends these objects by adding a so-called accessor with the name sq. In practice, this means that all semantique-specific methods and properties for these objects can be called on any DataArray object through the .sq prefix:

xarray_object.sq.method()
xarray_object.sq.property

Extending xarray in this way is recommended by its developers. Read more about it here. The semantique-specific properties for data arrays include for example the value type of the cube as well as spatio-temporal properties like the spatial resolution, the coordinate reference system in which the spatial coordinates are expressed and the time zone in which the temporal coordinates are expressed. The semantique-specific methods for data arrays include all the implemented verbs, as well as other functionalities that are internally used by the query processor, such as unstacking and aligning arrays. Finally, there are methods for conversions to other objects or to files on disk. For all semantique specific methods and properties for data arrays, see the API reference.
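For example, the parsed spatio-temporal extent array created earlier exposes its value type and value labels through the accessor. Their values correspond to the value_type and value_labels attributes shown in its array representation above:

extent.sq.value_type
extent.sq.value_labels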

For collections of multiple arrays, semantique contains the Collection class. This is simply a list of DataArray objects. Its methods include the verbs for array collections, as well as some single-array methods that are mapped over each member of the collection.

For all specific methods and properties of these objects, see the API reference. To ensure compatible behaviour with single arrays, you may also call all these methods and properties with the prefix .sq. However, this is not an xarray accessor; it is merely provided to simplify internal code.
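A minimal sketch of working with such a collection is shown below. Note that the import path and the exact constructor of the Collection class are assumptions here; see the API reference for its precise location and signature.

from semantique.processor.arrays import Collection

coll = Collection([response["blue_map"], response["green_map"]])
len(coll)                # behaves like a regular list of DataArray objects
coll[0].sq.value_type    # single-array methods and properties remain available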

Tracking value types#

When moving through the building blocks of a query recipe, the query processor keeps track of the value type of the wrangled arrays. Such a value type describes what kind of data the array contains. It differs from the very technical, computer-oriented numpy.dtype categorization, which contains e.g. int, float, etc. Instead, the value type describes data on a more general, statistical level.

Currently semantique makes a distinction between two value types for quantitative data:

  • Continuous

  • Discrete

And three value types for qualitative data:

  • Nominal

  • Ordinal

  • Binary

Additional value types exist for spatio-temporal data:

  • datetime (for timestamps)

  • coords (for spatial coordinate tuples)

  • geometry (for spatial geometries stored in GeoDataFrame objects)

It is expected that whenever an array is retrieved from the EO data cube, the corresponding value type is stored as a semantique-specific array attribute value_type. Even when the data are qualitative, they are usually stored as numbers. In these cases, an additional attribute value_labels may be used to define the mapping between character-encoded labels and integer-encoded indices.

[15]:
colors = dc.retrieve("appearance", "colortype", extent = extent)
[16]:
colors.sq.value_type
[16]:
'ordinal'
[17]:
colors.sq.value_labels
[17]:
{1: 'SVHNIR',
 2: 'SVLNIR',
 3: 'AVHNIR',
 4: 'AVLNIR',
 5: 'WV',
 6: 'SHV',
 7: 'SHRBRHNIR',
 8: 'SHRBRLNIR',
 9: 'HRBCR',
 10: 'WR',
 11: 'PB',
 12: 'GH',
 13: 'VBBB',
 14: 'BBB',
 15: 'SBB',
 16: 'ABB',
 17: 'DBB',
 18: 'WBBorSHB',
 19: 'NIRPBB',
 20: 'BA',
 21: 'DPWASH',
 22: 'SLWASH',
 23: 'TWASH',
 24: 'SASLWA',
 27: 'TNCLV',
 28: 'TNCLWA_BB',
 29: 'SN',
 30: 'SHSN',
 31: 'SH',
 32: 'FLAME'}
[18]:
water = mapping.translate("entity", "water", extent = extent, datacube = dc)
[19]:
water.sq.value_type
[19]:
'binary'

Whenever applying actions to an array, its value type might change. For example, when evaluating an expression involving a comparison operator, the resulting values are always binary, and when counting the number of “true” values in a binary array with a reducer, the resulting values are discrete. This is what we call type promotion. Each implemented operator and reducer function is able to promote the value type of the output given the type(s) of the input(s), by using a type promotion manual.

The type promotion manual of a univariate operator or a reducer has a dictionary structure with the supported input value types as keys. The value of each key is the corresponding output value type. For example, the count() reducer should only be applied to arrays containing binary data, since it counts the number of “true” values. The output of that operation, however, is an array with discrete quantitative data. Hence, the corresponding type promotion manual looks like this:

[20]:
from semantique.processor.types import TYPE_PROMOTION_MANUALS
[21]:
TYPE_PROMOTION_MANUALS["count"]
[21]:
{'binary': 'discrete', '__preserve_labels__': 0}

The additional key “__preserve_labels__” defines if value labels should be preserved after the reducer is applied, with 0 meaning “no” and 1 meaning “yes”.
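To illustrate this key, the type promotion manual of a hypothetical reducer that accepts qualitative inputs, returns the same value type and keeps the value labels of its input could look like this:

hypothetical_manual = {"ordinal": "ordinal", "nominal": "nominal", "__preserve_labels__": 1}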

The reduction function will apply the type promotion whenever track_types = True, which is the default. When we apply the count reducer to a binary array, this means the value type of the output will be discrete:

[22]:
from semantique.processor import reducers
[23]:
new = water.sq.reduce(reducers.count_, "time")
new.sq.value_type
[23]:
'discrete'

Now, when we call this reducer function on e.g. an ordinal array instead of a binary array, an InvalidValueTypeError will be thrown. Hence, the query processor uses type tracking also to check type promotions, and thus, to detect invalid operations.

[24]:
from semantique.exceptions import InvalidValueTypeError
[25]:
try:
    colors.sq.reduce(reducers.count_, "time")
except InvalidValueTypeError as e:
    print(e)
Unsupported operand value type(s) for 'count': [['ordinal']]

This will not happen when track_types = False. In that case, value types are not tracked and the output array has no value type at all.

[26]:
new = colors.sq.reduce(reducers.count_, "time", track_types = False)
"value_type" in new.attrs
[26]:
False

For bivariate expressions, the type promotion manuals have an extra “layer” for the second operand. The first layer of keys refers to the value type of the first operand (i.e. x) and the second layer of keys to the value type of the second operand (i.e. y). For example, the boolean operator and() accepts only arrays with binary values, and returns binary values as well. Hence, the corresponding type promotion manual looks like this:

[27]:
TYPE_PROMOTION_MANUALS["and"]
[27]:
{'binary': {'binary': 'binary'}, '__preserve_labels__': 1}

In the case of these bivariate operators, the “__preserve_labels__” key may also have a value of 2, meaning that instead of the labels of the first operand, the labels of the second operand should be preserved (this is the case e.g. with the assign() operator).

Just as with the reducers, the operators promote the value type of the output based on the manual, and throw an error if a combination of types is not supported.

[28]:
from semantique.processor import operators
[29]:
new = water.sq.evaluate(operators.and_, water)
new.sq.value_type
[29]:
'binary'
[30]:
try:
    colors.sq.evaluate(operators.and_, water)
except InvalidValueTypeError as e:
    print(e)
Unsupported operand value type(s) for 'and': [['ordinal'], ['binary']]

If the second operand in a bivariate expression is not an array but a single value, its value type is determined based on its numpy data type. In semantique, each of these numpy data types is mapped to one or more semantique value types. In the case of a mapping to multiple value types, all operations supporting at least one of these types will accept the value.

[31]:
from semantique.processor.types import DTYPE_MAPPING
[32]:
DTYPE_MAPPING
[32]:
{'b': ['binary'],
 'i': ['continuous', 'discrete', 'ordinal', 'nominal'],
 'u': ['continuous', 'discrete', 'ordinal', 'nominal'],
 'f': ['continuous'],
 'M': ['datetime'],
 'O': ['ordinal', 'nominal'],
 'U': ['ordinal', 'nominal']}
[33]:
new = water.sq.evaluate(operators.and_, True)
new.sq.value_type
[33]:
'binary'

Inside the reducer and operator functions, an instance of the TypePromoter class is the worker that actually takes care of the type promotion. It can be initialized by providing it with the operand(s) of the process, and the name of the function. Then, it will find the type promotion manual belonging to that function. Alternatively (e.g. when you add your own reducers or operators, see the next sections) you can provide it your own custom type promotion manual.

The promoter can determine the value types of the operands and check if they are supported by the operation. Subsequently, it can promote the value type of the operation output.

[34]:
from semantique.processor.types import TypePromoter
[35]:
promoter = TypePromoter(water, colors, function = "and")
[36]:
try:
    promoter.check()
except InvalidValueTypeError as e:
    print(e)
Unsupported operand value type(s) for 'and': [['binary'], ['ordinal']]
[37]:
promoter = TypePromoter(water, water, function = "and")
[38]:
promoter.check()
[39]:
promoter.promote(new)
[39]:
<xarray.DataArray 'water' (time: 3, y: 4, x: 4)>
array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 0.]],

       [[1., 0., 0., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [0., 0., 0., 0.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    _FillValue:     1.7976931348623157e+308
    value_type:     binary

Now what if you don’t want the query processor to track types at all? There are several reasons why you’d want to do so. For example, type tracking can be so strict that it limits your flexibility of using certain processes “out of the box”. Some EO data cube layouts may not specify value types at all, and some custom operators or reducers that you use may not have any type promotion manual. Also, type tracking of course takes time and thus decreases query performance. No worries, it is easy to tell the query processor to not track types. Simply add the configuration parameter “track_types” when initializing the query processor, and set its value to False:

context["track_types"] = False
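Alternatively, you can set the parameter only for a single execution, without modifying the original context, using the same pattern as in the caching examples further below:

response = recipe.execute(**{**context, "track_types": False})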

Adding custom verbs#

Besides the built-in semantique verbs, there is the option to provide your own set of custom verb functions when initializing the query processor. Each custom verb requires a name and a callable function. The function has to accept a DataArray object as its first argument. This is the active evaluation object in the processing chain, to which the verb is applied. It must also contain a boolean keyword argument called “track_types”, which defines if value type tracking should be applied. Hence, the function should also be able to track value types if this is requested. You can add as many additional keyword arguments as you want. The return value of the function must again be a DataArray object. Internally, semantique will forward your function and the keyword arguments to the pipe() method in xarray. Only use this functionality if you really know what you are doing!

Let's create a trivial custom verb that simply assigns a true value (i.e. 1) to each pixel in an array.

[40]:
def make_true(obj, track_types = True, **kwargs):
    # Work on a deep copy so the input array itself is not modified.
    newobj = obj.copy(deep = True)
    newobj.values = np.ones_like(newobj)
    if track_types:
        # All values are "true" now, hence the output has a binary value type.
        newobj.sq.value_type = "binary"
        del newobj.sq.value_labels
    return newobj

When executing a query recipe, we can now add this custom verb to the configuration parameters as a dictionary with the function name as a key and the function itself as a value.

[41]:
new_recipe = sq.QueryRecipe()
new_context = copy.deepcopy(context)

new_recipe["vals"] = sq.reflectance("s2_band04")
new_recipe["ones"] = sq.result("vals").apply_custom("make_true")
[42]:
from semantique.exceptions import UnknownVerbError
[43]:
try:
    print(new_recipe.execute(**new_context)["ones"])
except UnknownVerbError as e:
    print(e)
Custom verb 'make_true' is not defined
[44]:
new_context["custom_verbs"] = {"make_true": make_true}
[45]:
try:
    print(new_recipe.execute(**new_context)["ones"])
except UnknownVerbError as e:
    print(e)
<xarray.DataArray 'ones' (time: 3, y: 4, x: 4)>
array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    _FillValue:     1.7976931348623157e+308
    value_type:     binary

Adding custom operators#

Operators are used when evaluating expressions on arrays with the evaluate() verb. For example, you can compare each value in an array with a given constant (using e.g. the greater() operator), add two arrays together (using the add() operator), etcetera. When creating a query recipe, you can refer to these operators by their name, for example:

sq.result("water_count").evaluate("multiply", 2)

By default, the operator functions are taken from the operators module of semantique. This is the module in which all built-in operator functions are defined. However, when initializing a query processor you can provide a set of additional operator functions that may be used during query execution.

Let us prepare such a custom operator. This requires a name, an operator function and a type promotion manual. If you plan to disable type tracking either way, the latter is of course not needed. An operator function has to accept a DataArray object as its first argument. This is the active evaluation object in the processing chain, to which the verb is applied. Bivariate operators also accept a second argument, which can either be a DataArray object as well, or a single value. The return value of the operator must again be a DataArray object. Usually it is a good idea to use the xarray.apply_ufunc() function inside operator functions, as shown below. Also, you should not forget to align the second operand to the input object before applying the core operation!

It is also possible to implement operator functions with more than two operands. All additional keyword arguments that the evaluate verb receives are simply forwarded to the operator function. You could for example build a query recipe like the following:

sq.result("water_count").evaluate("custom", y = 1, foo = 10, bar = 100)

You then need to implement an operator with four arguments:

def custom(x, y, foo, bar)
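For illustration, a fuller but hypothetical version of such an operator could look like the sketch below. It follows the same pattern as the modulus operator defined further down, but leaves out type tracking, which is discussed next.

def custom(x, y, foo, bar, **kwargs):
    # Hypothetical logic: scale y by foo, add bar, then add the result to x.
    f = lambda x, y: x + y * foo + bar
    # Align the second operand to the input object before applying the operation.
    y = xr.DataArray(y).sq.align_with(x)
    return xr.apply_ufunc(f, x, y)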

If you want to track value types, the custom operator should also have a binary “track_types” argument. Inside the function, the type promotion should be applied whenever the “track_types” argument is set to True. You can create your own type promotion manual and provide it to the TypePromoter instance through the “manual” argument, or use the type promotion manual of a built-in operator function by providing the name of that function through the “function” argument.

Let's create a custom operator that calculates the modulus of two arrays.

[46]:
def modulus(x, y, track_types = True, **kwargs):
    if track_types:
        # Only continuous operands are supported; the output is continuous as well.
        manual = {"continuous": {"continuous": "continuous"}, "__preserve_labels__": 0}
        promoter = TypePromoter(x, y, manual = manual)
        promoter.check()
    f = lambda x, y: np.mod(x, y)
    # Align the second operand with the input object before applying the operation.
    y = xr.DataArray(y).sq.align_with(x)
    out = xr.apply_ufunc(f, x, y)
    if track_types:
        out = promoter.promote(out)
    return out

When executing a query recipe, we can now add this custom operator to the configuration parameters as a dictionary with the function name as a key and the function itself as a value. Note that if a custom operator has the same name as a default operator, the default operator will be replaced by the custom operator.

[47]:
new_recipe = sq.QueryRecipe()
new_context = copy.deepcopy(context)

red = sq.reflectance("s2_band04")
nir = sq.reflectance("s2_band08")

new_recipe["mod"] = nir.evaluate("modulus", red).evaluate("floor")
[48]:
from semantique.exceptions import UnknownOperatorError
[49]:
try:
    print(new_recipe.execute(**new_context)["mod"])
except UnknownOperatorError as e:
    print(e)
Operator 'modulus' is not defined
[50]:
new_context["custom_operators"] = {"modulus": modulus}
[51]:
try:
    print(new_recipe.execute(**new_context)["mod"])
except UnknownOperatorError as e:
    print(e)
<xarray.DataArray 'mod' (time: 3, y: 4, x: 4)>
array([[[ 434.,  530.,  553.,  856.],
        [ 499.,  725.,  291.,  378.],
        [ 427.,  530., 2690.,  310.],
        [ 644.,  231.,  317.,  656.]],

       [[ 211.,  273.,  195.,  255.],
        [ 245.,   61.,  367.,  229.],
        [ 210.,   20.,  188.,  288.],
        [  77.,  709.,  136.,  172.]],

       [[ 314.,  161.,  737.,  101.],
        [ 752.,  340.,  409.,  247.],
        [ 344.,  724., 1085.,   64.],
        [ 387., 1549.,  556.,  369.]]])
Coordinates:
  * x              (x) float64 4.531e+06 4.532e+06 4.534e+06 4.535e+06
  * y              (y) float64 2.696e+06 2.695e+06 2.693e+06 2.692e+06
    spatial_ref    int64 0
  * time           (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-...
    temporal_ref   int64 0
    spatial_feats  (y, x) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    value_type:  discrete

Adding custom reducers#

Reducers are used inside the reduce() verb as a function to aggregate values along a dimension of an array. For example, you can reduce the temporal dimension of an array by calculating the average value of each slice along the time dimension axis (using the mean() reducer), etcetera. When creating a query recipe, you can refer to these reducers by their name, for example:

sq.result("water_count_space").reduce("mean", "time")

By default, the reducer functions are taken from the reducers module of semantique. This is the module in which all built-in reducer functions are defined. However, when initializing a query processor you can provide a set of additional reducer functions that may be used during query execution.

Let us prepare such a custom reducer. This requires a name, a reducer function and a type promotion manual. If you plan to disable type tracking either way, the latter is of course not needed. A reducer function has to accept a DataArray object as its first argument. This is the active evaluation object in the processing chain, to which the verb is applied. The return value of the reducer must again be a DataArray object. Usually it is a good idea to use the reduce() method of xarray inside reducer functions, as shown below. This method will take care e.g. of translating the dimension name (if provided) into its corresponding axis number.

A reducer function also needs to accept additional keyword arguments. Two of them are “reserved”, and should never be used for a custom argument: “dim” for the name of the dimension to reduce over, and “axis” for the axis number belonging to that dimension. All other keyword arguments are simply forwarded to the reducer function. That allows you to include additional variables in your reducer functions (e.g. constants).

If you want to track value types, the custom reducer should also have a binary “track_types” argument. Inside the function, the type promotion should be applied whenever the “track_types” argument is set to True. You can create your own type promotion manual and provide it to the TypePromoter instance through the “manual” argument, or use the type promotion manual of a built-in reducer function by providing the name of that function through the “function” argument.

Let's create a custom reducer that reduces a set of values to their sum of squares.

[52]:
def sum_of_squares(x, track_types = False, **kwargs):
    if track_types:
        # Re-use the type promotion manual of the built-in sum reducer.
        promoter = TypePromoter(x, function = "sum")
        promoter.check()
    f = lambda x, axis = None: np.sum(np.square(x), axis)
    out = x.reduce(f, **kwargs)
    if track_types:
        out = promoter.promote(out)
    return out

When executing a query recipe, we can now add this custom reducer to the configuration parameters as a dictionary with the function name as a key and the function itself as a value. Note that if a custom reducer has the same name as a default reducer, the default reducer will be replaced by the custom reducer.

[53]:
new_recipe = sq.QueryRecipe()
new_context = copy.deepcopy(context)

nir = sq.reflectance("s2_band08")

new_recipe["foo"] = nir.reduce("sum_of_squares", "space")
[54]:
from semantique.exceptions import UnknownReducerError
[55]:
try:
    print(new_recipe.execute(**new_context)["foo"])
except UnknownReducerError as e:
    print(e)
Reducer 'sum_of_squares' is not defined
[56]:
new_context["custom_reducers"] = {"sum_of_squares": sum_of_squares}
[57]:
try:
    print(new_recipe.execute(**new_context)["foo"])
except UnknownReducerError as e:
    print(e)
<xarray.DataArray 'foo' (time: 3)>
array([2.1983427e+08, 5.3802670e+07, 3.9970504e+07])
Coordinates:
    spatial_ref   int64 0
  * time          (time) datetime64[ns] 2019-12-15T10:17:33.408715 ... 2020-1...
    temporal_ref  int64 0
Attributes:
    value_type:  continuous

Exporting the response#

Each array in the response is a DataArray object, and can thus be further processed with one of the many functions that the xarray package has to offer. It is also possible to convert a response to another format. Semantique contains functions to convert an array to a DataFrame object from the pandas package, and to a GeoDataFrame object from the geopandas package (requires spatial dimensions). These conversion methods can be called through the sq-accessor of the arrays.

[58]:
response["blue_curve"].sq.to_dataframe()
[58]:
blue_curve
time
2019-12-15 10:17:33.408715 0.0
2020-09-05 10:17:43.167942 1.0
2020-12-19 10:17:34.610661 3.0
[59]:
response["blue_map"].sq.to_geodataframe()
[59]:
y x blue_map geometry
0 2696250.0 4530750.0 1.0 POINT (4530750.000 2696250.000)
1 2696250.0 4532250.0 0.0 POINT (4532250.000 2696250.000)
2 2696250.0 4533750.0 0.0 POINT (4533750.000 2696250.000)
3 2696250.0 4535250.0 0.0 POINT (4535250.000 2696250.000)
4 2694750.0 4530750.0 0.0 POINT (4530750.000 2694750.000)
5 2694750.0 4532250.0 0.0 POINT (4532250.000 2694750.000)
6 2694750.0 4533750.0 0.0 POINT (4533750.000 2694750.000)
7 2694750.0 4535250.0 1.0 POINT (4535250.000 2694750.000)
8 2693250.0 4530750.0 1.0 POINT (4530750.000 2693250.000)
9 2693250.0 4532250.0 0.0 POINT (4532250.000 2693250.000)
10 2693250.0 4533750.0 1.0 POINT (4533750.000 2693250.000)
11 2693250.0 4535250.0 0.0 POINT (4535250.000 2693250.000)
12 2691750.0 4530750.0 0.0 POINT (4530750.000 2691750.000)
13 2691750.0 4532250.0 0.0 POINT (4532250.000 2691750.000)
14 2691750.0 4533750.0 0.0 POINT (4533750.000 2691750.000)
15 2691750.0 4535250.0 0.0 POINT (4535250.000 2691750.000)

Semantique also allows you to export an array to either a CSV file or a GeoTIFF file (the latter requires spatial dimensions). To do so, call respectively the to_csv() or to_geotiff() methods through the sq-accessor of the arrays.
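For example (a sketch with hypothetical file paths, assuming the methods take the output path as their first argument):

response["blue_curve"].sq.to_csv("files/blue_curve.csv")
response["blue_map"].sq.to_geotiff("files/blue_map.tif")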

Caching data layers#

The query processor allows caching of retrieved data layers, such that a data layer that is referenced multiple times in the query recipe or the mapping does not have to be retrieved again from the data source. Since RAM memory requirements are proportional to the number of data layers that are stored as intermediate results, data layers should only be cached in RAM if they are needed again when evaluating downstream parts of the recipe. This requires foresight about the execution order of the recipe, which accordingly requires a preview run preceding the actual execution. This preview run is performed by loading the data at a drastically reduced spatial resolution (a 5x5 pixel grid). It resolves the data references and fills a cache by creating a list of the data references in the order in which they are evaluated. This list is then used dynamically during the actual execution of the recipe as a basis for keeping data layers in the cache and reading them from there if they are needed again.

Below the result of the preview run is shown first to demonstrate what the resolved data references look like. You will see that the same data layer is referenced multiple times. The resulting initialised cache can then be fed as an argument to the QueryProcessor in a second step for the actual recipe execution.

[60]:
# Step I: preview run.
qp = QueryProcessor.parse(recipe, **{**context, "preview": True})
qp.optimize().execute()
qp.cache.seq
[60]:
[['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype'],
 ['appearance', 'colortype']]
[61]:
# Step II: query processor execution.
qp = QueryProcessor.parse(recipe, **{**context, "cache": qp.cache})
response = qp.optimize().execute()

When executing a query recipe you can directly initiate the entire caching workflow (preview & full resolution recipe execution) by setting the “cache_data” argument to True:

[62]:
# Same as above in a single step.
response = recipe.execute(**{**context, "cache_data": True})

Caching does not always lead to an increase in performance. The effect depends on:

  • The resolution in which the query recipe is executed.

  • The redundancy of the data references in the recipe, i.e. if layers are referenced multiple times, loading them from the cache will reduce the overall processing time significantly.

  • The data source (EO data cube) from which they are retrieved.

It should be noted that in our demos only data loaded from locally stored GeoTIFF files are analysed. This is more or less the worst case for demonstrating the benefits of caching, since the data are stored locally and are therefore quickly accessible. Also, GeoTIFF files that are not stored in a cloud-optimised format (COGs) require loading all the data into memory even when running in preview mode, just to evaluate the sequence of data layers. Keep in mind, however, that caching is designed for and particularly beneficial in the case of STACCubes when loading data over the internet.

Logging progress#

When processing a query, the query processor logs its progress. Most of these logs are of the “DEBUG” level. They are helpful to debug the query processing, since they show intermediate outputs when the query processor iterates through the building blocks in the query recipe.

[63]:
import logging
import sys

logger = logging.getLogger("semantique.processor.core")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.FileHandler(filename = "files/logs.txt", mode = "w"))
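If you rather want to see the logs directly in the console, you can attach a stream handler to the same logger instead (a minimal sketch using the standard logging module):

logger.addHandler(logging.StreamHandler(sys.stdout))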
[64]:
response = recipe.execute(**context)

See here the log file we created.