During this examples, we will use openEO Platform as back-end implementation. Also the openEO Platform as a project is under continuous development and the assumptions for receiving sample data made in this project, had to consider the current state of the back-end implementations. In the following course of this vignette, we will give some examples, how to obtain sample data for developing and debugging your openEO workflows. The example code is printed, but the results for many code blocks will not be shown. This has to be this way, because there are interactive elements involved like authenticating with personal user credentials against the back-end service.
For dealing with remote processing trust is one very important aspect. For once you must trust the provider that the data is correctly pre-processed, meaning the data collection is correct and valid. Then the provided operators like mathematical band operations or kernel operations need to work correctly. Also you must trust that the optimization and parallelization strategy work correctly and do not affect the quality of your analysis. One way to build trust towards a remote processing system is testing through local inspection of data. The other aspects in remote processing for the need of sample data with regard to openEO is debugging and the development of user defined functions (UDF).
As for now, openEO Platform supports the user on this matter in the form that data can be processed until a specific processing step and the result can be downloaded.
In order to minimize the effort for an R user to obtain a small data
chunk as sample data for local inspection, the function
get_sample()
was developed. As openEO normally
operates on spatio-temporal data cubes, the get_sample()
function aims to reduce the amount of data by sub setting the initial
data spatially. Furthermore by using this function you already download
the sample data and retrieve it optionally as a stars
object. The latter will require the package stars
to be
installed in your R environment.
At first, we need an enabled account at openEO platform, then we establish a connection to the back-end and start the login procedure. In the process you will be asked for your authentication provider and potentially your login credentials (e.g. Authentication with Github, where you will be redirected to the Github page). During the process you will be very likely asked for a device code, which will be printed to console during the authentication procedure. This code needs to be copied and pasted into the respective field in the internet browser. You will then need to authorize the connection of ‘openeo’ for this session, which you will be asked also in the browser window.
As an example use case we will calculate the NDVI for a certain
temporal interval. To make comparison easier at the end, the original
bounding box is stored as a separate variable bounding_box
.
Potentially the steps afterwards could be a spatial aggregation
operation on polygons and calculating zonal statistics, but this will
not be covered in this example.
p = processes()
coll = list_collections()
f = list_file_formats()
bounding_box = list(west=6.75,south=51.85,east=7.25,north=52.15)
data = p$load_collection(id = coll$SENTINEL2_L2A,
bands=c("B04","B08"),
spatial_extent = bounding_box,
temporal_extent = list("2021-03-01","2021-07-15"))
ndvi = p$reduce_dimension(data=data, dimension = "bands",reducer = function(x, context) {
b04 = x[1]
b08 = x[2]
(b08-b04)/(b08+b04)
})
get_sample()
functionWithout the sampling function you need to configure the workflow completely by yourself, which involves:
compute_result
you would
have to manually set the save_result
node.stars
or any other package that can
handle spatial or spatio-temporal dataA workflow would have looked like the following:
Calculate a new bounding box.
center = list(lon=mean(c(bounding_box$west, bounding_box$east)), lat=mean(c(bounding_box$south, bounding_box$north)))
diff = 0.0003
sample_bbox = list(west=center$lon-diff,south=center$lat-diff,east=center$lon+diff,north=center$lat+diff)
Replace the old bounding box with the new one.
Compute the results and store them in a file on your local disk.
Check the data by opening the file with stars
.
Update the time dimension with a more useful representation.
library(lubridate)
library(stars)
library(ggplot2)
dates = as_date(st_get_dimension_values(obj,"t"))
obj = st_set_dimensions(obj,which="t",values = dates)
obj
Running sample data retrieval this way is still valid and reasonable, when more complex spatial features are used for spatial sub setting.
library(mapview)
library(sf)
initial_bbox = as.numeric(bounding_box)
names(initial_bbox) = c("xmin","ymin","xmax","ymax")
sample_bbox = as.numeric(sample_bbox)
names(sample_bbox) = c("xmin","ymin","xmax","ymax")
initial_bbox = st_as_sfc(st_bbox(initial_bbox,crs=4326))
sample_bbox = st_as_sfc(st_bbox(sample_bbox,crs=4326))
received_data_bbox = st_as_sfc(st_bbox(obj))
mapview(list(initial_bbox,sample_bbox,received_data_bbox))
The two driving functions for sample retrieval are
get_sample()
in combination with
compute_result()
For convenience and auto-completion options in RStudio, we can store
the formats as f
. A plain text format identifier like
format="netCDF"
would do the same in the end.
sample_file = "test_get_sample.nc"
obj = get_sample(ndvi,
as_stars=TRUE,
output_file=sample_file,
format = f$output$netCDF)
The get_sample()
function can be applied directly on an
intermediate processing step, without the need of explicitly defining a
save_result
process. Internally either
compute_result
(synchronous call) or
create_job
(asynchronous call) is used, depending on the
parameter execution='async'|'sync'
. For
compute_result
several configurations can be passed to the
function using the ...
parameter. In that function the
automatic addition of save_result
is handled. If no
specific format was specified, then ‘netCDF’ will be chosen as a
default, if the back-end support that format.
In the former example, there were two more interesting parameters. As
format
, output_file
is defined in
compute_result
and allows to specify a path were to store
the sample data on your local machine. The parameter
as_stars
controls whether or not the downloaded sample will
be opened as a stars
object (requires package
stars
to be installed).
And as obj
is already a stars
object, we
now completely work on local sample data. You can now test and check the
sample data and potentially develop UDFs.
Hint: Newer versions of ‘stars’ already interpret the temporal dimension into POSIX dates and the next code block can be skipped.
library(lubridate)
library(stars)
library(ggplot2)
dates = as_date(st_get_dimension_values(obj,"t"))
obj = st_set_dimensions(obj,which="t",values = dates)
Now, lets compare the bounding boxes of the original request with the one of the obtained data.
library(mapview)
library(sf)
received_data_bbox = st_as_sfc(st_bbox(obj))
mapview(list(initial_bbox,received_data_bbox))
Note: The bounding box of the obtained sample data is smaller than
the original one. It should be even smaller, get_sample()
takes the center point of the bounding box and fetches data in a 0.0003°
radius, which is approximately 30 meter in all directions. If the
bounding box coordinate reference is not in WGS84 (default EPSG:4326)
the package sf
is required to handle the coordinate
transformations. As mentioned in the beginning the current realization
of the back-end provider has realized a serialization of a minimum tile
sizes of 256x256 pixel.
stars
The stars
package was chosen to represent
multidimensional spatio-temporal data in this context, because that was
the exact developers intention. Also stars
offer several
coercion functions to convert the data into formats of other packages
for a more specialized purpose like terra
for raster
operations. Another reason is the planned use of stars
in
the UDF module for R. With this in mind the user can obtain a
stars
object and develop an R-UDF locally, which can
potentially run on the back-end.