Source code of the pyveg package

pyveg.src.analysis_preprocessing module

This module consists of methods to process downloaded GEE data. The starting point is a json file written out at the end of the downloading step. This module cleans, resamples, and reformats the data to make it ready for analysis.

pyveg.src.analysis_preprocessing.detrend_data(dfs, period='MS')[source]

Loop over the sub-image time series DataFrames and remove seasonality from each time series by subtracting the previous year's values. Remove seasonality from precipitation data in the same way.

Parameters
  • dfs (dict of DataFrame) – Time series data for multiple sub-image locations.

  • period (str, optional) – Resample the time series to this frequency and then infer the lag to use for deseasonalizing.

Returns

Time series data for multiple sub-image locations with seasonality removed.

Return type

dict of DataFrame

pyveg.src.analysis_preprocessing.detrend_df(df, period='MS')[source]

Remove seasonality from a DataFrame containing the time series for a single sub-image.

Parameters
  • df (DataFrame) – Time series data for a single sub-image location.

  • period (str, optional) – Resample the time series to this frequency and then infer the lag to use for deseasonalizing.

Returns

Input with seasonality removed from time series columns.

Return type

DataFrame
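
For illustration, a minimal sketch of the "subtract the previous year" idea using pandas (not the pyveg implementation itself; the column name offset50 and the synthetic data are only examples):

    import pandas as pd

    # Monthly ('MS') series: deseasonalize by subtracting the value 12 periods earlier.
    dates = pd.date_range("2016-01-01", "2019-12-01", freq="MS")
    df = pd.DataFrame({"offset50": range(len(dates))}, index=dates)

    lag = 12  # one year of monthly observations
    df["offset50_detrended"] = df["offset50"] - df["offset50"].shift(lag)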

pyveg.src.analysis_preprocessing.drop_veg_outliers(dfs, column='offset50', sigmas=3.0)[source]

Loop over vegetation DataFrames and drop points in the time series that are significantly far away from the mean of the time series. Such points are assumed to be unphysical.

Parameters
  • dfs (dict of DataFrame) – Time series data for multiple sub-image locations.

  • column (str) – Name of the column to drop outliers on.

  • sigmas (float) – Number of standard deviations a data point has to be from the mean to be labelled as an outlier and dropped.

Returns

Time series data for multiple sub-image locations with some values in column potentially set to NaN.

Return type

dict of DataFrame
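
A hedged sketch of this kind of sigma-clipping on a single column (illustrative only, not the pyveg source; the helper name is hypothetical):

    import numpy as np

    def sigma_clip(df, column="offset50", sigmas=3.0):
        # Points further than `sigmas` standard deviations from the mean become NaN.
        values = df[column]
        outliers = (values - values.mean()).abs() > sigmas * values.std()
        df.loc[outliers, column] = np.nan
        return df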

pyveg.src.analysis_preprocessing.fill_veg_gaps(dfs, missing)[source]

Loop through sub-image time series and replace any gaps with mean value of the same month in other years.

Parameters
  • dfs (dict of DataFrame) – Time series data for multiple sub-image locations.

  • missing (dict of array) – Missing time points where no sub-images were analysed, for each veg dataframe in dfs.
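
A small sketch of the "mean of the same month in other years" idea with pandas (illustrative only; the column name and dates are invented):

    import numpy as np
    import pandas as pd

    dates = pd.date_range("2016-01-01", "2018-12-01", freq="MS")
    df = pd.DataFrame({"offset50": np.arange(len(dates), dtype=float)}, index=dates)
    df.loc["2017-06-01", "offset50"] = np.nan  # a missing time point

    # Fill the gap with the mean of the same calendar month in other years.
    same_month_mean = df.groupby(df.index.month)["offset50"].transform("mean")
    df["offset50"] = df["offset50"].fillna(same_month_mean)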

pyveg.src.analysis_preprocessing.get_missing_time_points(dfs)[source]

Find missing time points for each vegetation dataframe in dfs, and return a dict, with the same key as in dfs, but with values corresponding to missing dates.

Parameters

dfs (dict of DataFrame) – Time series data for multiple sub-image locations.

Returns

Missing time points for each vegetation df.

Return type

dict

pyveg.src.analysis_preprocessing.make_time_series(dfs)[source]

Given a dictionary of DataFrames which may contain many rows per time point (corresponding to the network centrality values of different sub-locations), collapse this into a time series by calculating the mean and std of the different sub-locations at each date.

Parameters

dfs (dict of DataFrame) – Input DataFrame read by read_json_to_dataframes.

Returns

ts_list – The time-series results averaged over sub-locations. First entry will be main dataframe of vegetation and weather. Second one (if present) will be historical weather.

Return type

list of DataFrames

pyveg.src.analysis_preprocessing.preprocess_data(input_json, output_basedir, drop_outliers=True, fill_missing=True, resample=True, smoothing=True, detrend=True, n_smooth=4, period='MS')[source]

This function reads and processes data downloaded by GEE. Processing can be configured by the function arguments. Processed data is written to csv.

Parameters
  • input_json (dict) – JSON data created during a GEE download job.

  • output_basedir (str) – Directory where the time-series csv files will be put.

  • drop_outliers (bool, optional) – Remove outliers in sub-image time series.

  • fill_missing (bool, optional) – Fill missing points in the time series.

  • resample (bool, optional) – Resample the time series using linear interpolation.

  • smoothing (bool, optional) – Smooth the time series using LOESS smoothing.

  • detrend (bool, optional) – Remove seasonal component by subtracting previous year.

  • n_smooth (int, optional) – Number of time points to use for the smoothing window size.

  • period (str, optional) – Pandas DateOffset string describing sampling frequency.

Returns

  • output_dir (str) – Path to the csv file containing processed data.

  • dfs (dict) – Dictionary of processed DataFrames.
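
A typical call, based on the signature and return values documented above (the file and directory names are hypothetical):

    import json
    from pyveg.src.analysis_preprocessing import preprocess_data

    with open("results_summary.json") as f:
        input_json = json.load(f)

    output_dir, dfs = preprocess_data(input_json, "analysis_output", n_smooth=4, period="MS")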

pyveg.src.analysis_preprocessing.read_json_to_dataframes(data)[source]

Convert JSON data to a dict of DataFrames.

Parameters

data (dict) – JSON data output from run_pyveg_pipeline.

Returns

A dict of the saved results in a DataFrame format. Keys are names of collections and the values are DataFrame of results for that collection.

Return type

dict

pyveg.src.analysis_preprocessing.read_results_summary(input_location, input_filename='results_summary.json', input_location_type='local')[source]

Read the results_summary.json, either from local storage, Azure blob storage, or zenodo.

Parameters
  • input_location (str) – Directory or container with results_summary.json in, or coords_id if reading from Zenodo.

  • input_filename (str, name of json file, default is "results_summary.json") –

  • input_location_type (str: 'local' or 'azure' or 'zenodo' or 'zenodo_test') –

Returns

json_data

Return type

dict, the contents of results_summary.json
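
Example usage, based on the signature above (the directory name is hypothetical):

    from pyveg.src.analysis_preprocessing import read_results_summary

    json_data = read_results_summary("output/my_gee_job", input_location_type="local")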

pyveg.src.analysis_preprocessing.resample_data(dfs, period='MS')[source]

Resample vegetation and rainfall DataFrames. Vegetation DataFrames are resampled at the sub-image level.

Parameters
  • dfs (dict of DataFrame) – Time series data for multiple sub-image locations.

  • period (string) – Period for resampling.

Returns

Resampled data.

Return type

dict of DataFrame

pyveg.src.analysis_preprocessing.resample_dataframe(df, columns, period='MS')[source]

Resample and interpolate a time series dataframe so we have one row per time period.

Parameters
  • df (DataFrame) – Dataframe with date as index.

  • columns (list) – List of column names to resample. Should contain numeric data.

  • period (string) – Period for resampling.

Returns

DataFrame with resampled time series in columns.

Return type

DataFrame
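
For illustration, the core resample-and-interpolate step in plain pandas (a sketch, not the pyveg source; the data are invented):

    import pandas as pd

    dates = pd.to_datetime(["2019-01-04", "2019-02-17", "2019-04-02"])
    df = pd.DataFrame({"offset50": [1.0, 2.0, 4.0]}, index=dates)

    # One row per month start ('MS'); linear interpolation fills gaps created by resampling.
    resampled = df[["offset50"]].resample("MS").mean().interpolate(method="linear")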

pyveg.src.analysis_preprocessing.resample_time_series(series, period='MS')[source]

Resample and interpolate a time series dataframe so we have one row per time period (useful for FFT)

Parameters
  • series (Series) – Time series with a datetime index.

  • period (string) – Period for resampling.

Returns

pandas Series with datetime index, and one column, one row per day

Return type

Series

pyveg.src.analysis_preprocessing.save_ts_summary_stats(ts_dirname, output_dir, metadata)[source]

Given time series DataFrames (constructed with make_time_series), compute summary statistics of all the available time series.

Parameters
  • ts_dirname (str) – Directory where the time series are saved.

  • output_dir (str) – Directory to save the plots in.

  • metadata (dict) – Dictionary with metadata from location

pyveg.src.analysis_preprocessing.smooth_all_sub_images(df, column='offset50', n=4, it=3)[source]

Perform LOWESS (Locally Weighted Scatterplot Smoothing) on the time series of a set of sub-images.

Parameters
  • df (DataFrame) – DataFrame containing time series results for all sub-images, with multiple rows per time point and (lat,long) point.

  • column (string, optional) – Name of the column in df to smooth.

  • n (int, optional) – Size of smoothing window.

  • it (int, optional) – Number of iterations of LOESS smoothing to perform.

Returns

DataFrame of results with a new column containing a LOESS-smoothed version of the chosen column.

Return type

Dataframe

pyveg.src.analysis_preprocessing.smooth_subimage(df, column='offset50', n=4, it=3)[source]

Perform LOWESS (Locally Weighted Scatterplot Smoothing) on the time series of a single sub-image.

Parameters
  • df (DataFrame) – Input DataFrame containing the time series for a single sub-image.

  • column (string, optional) – Name of the column in df to smooth.

  • n (int, optional) – Size of smoothing window.

  • it (int, optional) – Number of iterations of LOESS smoothing to perform.

Returns

The time-series DataFrame with a new column containing the smoothed results.

Return type

DataFrame
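
A hedged sketch of LOWESS smoothing of a single series with statsmodels (one possible approach; the mapping from the window size n to the frac argument is an assumption, not necessarily what pyveg does):

    import pandas as pd
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def smooth_series(series, n=4, it=3):
        frac = min(n / len(series), 1.0)  # fraction of points used in each local fit
        x = list(range(len(series)))
        smoothed = lowess(series.values, x, frac=frac, it=it, return_sorted=False)
        return pd.Series(smoothed, index=series.index)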

pyveg.src.analysis_preprocessing.smooth_veg_data(dfs, column='offset50', n=4)[source]

Loop over vegetation DataFrames and perform LOESS smoothing on the time series of each sub-image.

Parameters
  • dfs (dict of DataFrame) – Time series data for multiple sub-image locations.

  • column (str) – Name of the column to drop outliers and smooth.

  • n (int) – Number of neighbouring points to use in smoothing.

Returns

Time series data for multiple sub-image locations, with new columns for the smoothed data and its confidence interval.

Return type

dict of DataFrame

pyveg.src.analysis_preprocessing.store_feature_vectors(dfs, output_dir)[source]

Write out all feature vector information to a csv file, to be read later by the feature vector plotting script.

Parameters
  • dfs (dict of DataFrame) – Time series data for multiple sub-image locations.

  • output_dir (str) – Path to directory to save the csv.

pyveg.src.azure_utils module

pyveg.src.azure_utils.check_blob_exists(blob_name, container_name, bbs=None)[source]

See if a blob already exists for this account name.

pyveg.src.azure_utils.check_container_exists(container_name, bbs=None)[source]

See if a container already exists for this account name.

pyveg.src.azure_utils.create_container(container_name, bbs=None)[source]
pyveg.src.azure_utils.delete_blob(blob_name, container_name, bbs=None)[source]
pyveg.src.azure_utils.download_rgb(container, rgb_dir)[source]
Parameters
  • container (str, the container name) –

  • rgb_dir (str, directory into which to put image files.) –

pyveg.src.azure_utils.download_summary_json(container, json_dir)[source]
Parameters
  • container (str, the container name) –

  • json_dir (str, temporary directory into which to put json file.) –

pyveg.src.azure_utils.get_blob_to_tempfile(filename, container_name, bbs=None)[source]
pyveg.src.azure_utils.get_sas_token(container_name, token_duration=1, permissions='READ', bbs=None)[source]
pyveg.src.azure_utils.list_directory(path, container_name, bbs=None)[source]
pyveg.src.azure_utils.read_image(blob_name, container_name, bbs=None)[source]
pyveg.src.azure_utils.read_json(blob_name, container_name, bbs=None)[source]
pyveg.src.azure_utils.remove_container_name_from_blob_path(blob_path, container_name)[source]

Get the bit of the filepath after the container name.

pyveg.src.azure_utils.retrieve_blob(blob_name, container_name, destination='/tmp/', bbs=None)[source]

Use the BlockBlobService to retrieve a file from Azure, and place it in the destination folder.

pyveg.src.azure_utils.sanitize_container_name(orig_name)[source]

Only alphanumeric characters and dashes are allowed in container names.

pyveg.src.azure_utils.save_image(image, output_location, output_filename, container_name, format='png', bbs=None)[source]

Given a PIL.Image (list of pixel values), save it to the requested filename. Note that the file extension will determine the output file type; it can be .png, .tif, probably others.

pyveg.src.azure_utils.save_json(data, blob_path, filename, container_name, bbs=None)[source]
pyveg.src.azure_utils.write_file_to_blob(file_path, blob_name, container_name, bbs=None)[source]
pyveg.src.azure_utils.write_files_to_blob(path, container_name, blob_path=None, file_endings=[], bbs=None)[source]

Upload a whole directory structure to blob storage. If we are given ‘blob_path’ we use that - if not we preserve the given file path structure. In both cases we take care to remove the container name from the start of the blob path

pyveg.src.batch_utils module

Functions for submitting batch jobs. Currently only Azure Batch is supported. Largely taken from https://github.com/Azure-Samples/batch-python-quickstart

pyveg.src.batch_utils.add_task(task_id, job_name, input_script, input_config, input_azure_config, task_dependencies, batch_service_client=None)[source]

add the batch task to the job.

Parameters
  • task_id (str, unique ID within this job for the task) –

  • job_name (str, name for the job - usually Sequence name + timestamp) –

  • input_script (ResourceFile corresponding to bash script uploaded to blob storage) –

  • input_config (ResourceFile corresponding to json config for this task uploaded to blob storage) –

  • input_azure_config (ResourceFile corresponding to azure config, uploaded to blob storage) –

  • task_dependencies (list of str, task_ids of any tasks that this one depends on) –

  • batch_service_client (BatchServiceClient) –

pyveg.src.batch_utils.check_task_failed_dependencies(task, job_id, batch_service_client=None)[source]

If a task depends on other task(s), and those have failed, the job will not be able to run.

Parameters
  • task (azure.batch.models.CloudTask, the task we will look at dependencies for) –

  • job_id (str, the unique ID of the Job.) –

  • batch_service_client (BatchServiceClient - will create if not provided.) –

Returns

  • True if the task depends on other tasks that have failed (or those tasks depend on failed tasks)

  • False otherwise

pyveg.src.batch_utils.check_tasks_status(job_id, task_name_prefix='', batch_service_client=None)[source]

For a given job, query the status of all the tasks.

Returns

task_status –
  • num_success (int) – successfully completed
  • num_failed (int) – completed but with a non-zero exit code
  • num_running (int) – currently running
  • num_waiting (int) – in “active” state
  • num_cannot_run (int) – in “active” state, but with dependent tasks that failed

Return type

dict, containing the following keys/values:

pyveg.src.batch_utils.create_batch_client()[source]
pyveg.src.batch_utils.create_job(job_id, pool_id=None, batch_service_client=None)[source]

Creates a job with the specified ID, associated with the specified pool.

Parameters
  • job_id (str, ID for the job - will typically be module or sequence name +timestamp) –

  • pool_id (str, ID for the pool. If not provided, use the one from azure_config.py) –

  • batch_service_client (BatchServiceClient instance. Create one if not provided.) –

pyveg.src.batch_utils.create_pool(pool_id, batch_service_client=None)[source]

Creates a pool of compute nodes.

Parameters
  • pool_id (str, identifier for the pool) –

  • batch_service_client (azure.batch.BatchServiceClient, A Batch service client.) –

pyveg.src.batch_utils.delete_job(job_id, batch_service_client=None)[source]

Removes a job, and associated tasks.

pyveg.src.batch_utils.delete_pool(pool_id=None, batch_service_client=None)[source]

Removes a pool of batch nodes

pyveg.src.batch_utils.prepare_for_task_submission(job_name, config_container_name, batch_service_client, blob_client)[source]

Create pool and job if not already existing, and upload the azure config file and the bash script used to run the batch job.

Parameters
  • job_name (str, ID of the job) –

  • batch_service_client (BatchServiceClient to interact with Azure batch.) –

Returns

input_azure_config, input_script

Return type

ResourceFiles corresponding to the azure_config.py and batch_commands.sh scripts, uploaded to blob storage.

pyveg.src.batch_utils.print_task_output(batch_service_client, job_id, encoding=None)[source]

Prints the stdout.txt file for each task in the job.

Parameters
  • batch_client (batchserviceclient.BatchServiceClient) – The batch client to use.

  • job_id (str) – The id of the job with task output files to print.

pyveg.src.batch_utils.submit_tasks(task_dicts, job_name)[source]

Submit batch jobs to Azure batch.

task_dicts: list of dicts, [{“task_id”: <task_id>, “config”: <config_dict>, “depends_on”: [<task_ids>]}]

job_name: str, should identify the sequence generating the jobs

pyveg.src.batch_utils.upload_file_to_container(block_blob_client, container_name, file_path)[source]

Uploads a local file to an Azure Blob storage container.

Parameters
  • block_blob_client (azure.storage.blob.BlockBlobService) – A blob service client.

  • container_name (str) – The name of the Azure Blob storage container.

  • file_path (str) – The local path to the file.

Return type

azure.batch.models.ResourceFile

Returns

A ResourceFile initialized with a SAS URL appropriate for Batch tasks.

pyveg.src.batch_utils.wait_for_tasks_to_complete(job_id, timeout=60, batch_service_client=None)[source]

Returns when all tasks in the specified job reach the Completed state.

Parameters
  • batch_service_client (azure.batch.BatchServiceClient) – A Batch service client.

  • job_id (str) – The id of the job whose tasks should be monitored.

  • timeout (timedelta) – The duration to wait for task completion. If all tasks in the specified job do not reach the Completed state within this time period, an exception will be raised.

pyveg.src.combiner_modules module

Modules that can consolidate inputs from different sources and produce combined output file (typically JSON).

class pyveg.src.combiner_modules.CombinerModule(name=None)[source]

Bases: pyveg.src.pyveg_pipeline.BaseModule

class pyveg.src.combiner_modules.VegAndWeatherJsonCombiner(name=None)[source]

Bases: pyveg.src.combiner_modules.CombinerModule

Expect directory structures like: <something>/<input_veg_location>/<date>/network_centralities.json <something>/<input_weather_location>/RESULTS/weather_data.json

check_output_dict(output_dict)[source]

For all the keys (i.e. dates) in the vegetation time-series, count how many have data for both veg and weather

combine_json_lists(json_lists)[source]

If for example we have json files from the NetworkCentrality and NDVI calculators, all containing lists of dicts for sub-images, combine them here by matching by coordinate.

get_metadata()[source]

Fill a dictionary with info about this job - coords, date range etc.

get_veg_time_series()[source]

Combine contents of JSON files written by the NetworkCentrality and NDVI calculator Modules. If we are running in a Pipeline, get the expected set of date strings from the vegetation sequence we depend on, and if there is no data for a particular date, make a null entry in the output.

get_weather_time_series()[source]
run()[source]
set_default_parameters()[source]

See if we can set our input directories from the output directories of previous Sequences in the pipeline. The pipeline (if there is one) will be a grandparent, i.e. self.parent.parent and the names of the Sequences we will want to combine should be in the variable self.depends_on.

pyveg.src.coordinate_utils module

Collection of utility functions for manipulating coordinates and their string representations.

pyveg.src.coordinate_utils.coords_dict_to_coords_string(coords)[source]

Given a dict of long/lat values, return a string, rounding to 2 decimal places.

pyveg.src.coordinate_utils.coords_list_to_coords_string(coords)[source]

Given a list or tuple of [long, lat], return a string, rounding to 2 decimal places.

pyveg.src.coordinate_utils.find_coords_string(file_path)[source]

Parse a file path using a regular expression to find a substring that looks like a set of coordinates, and return that.

pyveg.src.coordinate_utils.get_region_string(coords, region_size)[source]

Given a set of (long,lat) coordinates, and the size of a square region in long,lat space, return a string in the format expected by GEE.

Parameters
  • coords (list of floats, [longitude,latitude]) –

  • region_size (float, size of each side of the region, in degrees) –

Returns

region_string – representing four corners of the region.

Return type

str, string representation of list of four coordinates,

pyveg.src.coordinate_utils.get_sub_image_coords(coords, region_size, x_parts, y_parts)[source]

If an image is divided into sub_images, return a list of coordinates for all the sub-images.

Parameters
  • coords (list of floats, [long,lat]) –

  • region_size (float, size of square image in degrees long,lat) –

  • x_parts (int, number of sub-images in x-direction) –

  • y_parts (int, number of sub-images in y-direction) –

Returns

sub_image_coords

Return type

list, of lists of floats [[long,lat],..]

pyveg.src.coordinate_utils.lookup_country(latitude, longitude)[source]

Use the OpenCage API to do reverse geocoding

pyveg.src.data_analysis_utils module

Data analysis code including functions to read the .json results file, and functions to analyse and plot the data.

pyveg.src.data_analysis_utils.ar1_moving_average_time_series(series, length=1)[source]

Calculate an AR1 time series using a moving average

Parameters
  • series (pandas Series) – Time series observations.

  • length (int) – Length of the moving window in number of observations.

Returns

pandas Series with datetime index, and one column, one row per date

Return type

pandas Series

pyveg.src.data_analysis_utils.calculate_ci(data, ci_level=0.99)[source]

Calculate the confidence interval on the mean for a set of data.

Parameters
  • data (Series) – Series of data to calculate the confidence interval of the mean.

  • ci_level (float, optional) – Size of the confidence interval to calculate.

Returns

Confidence interval value where the CI is [mu - h, mu + h], where mu is the mean.

Return type

float
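
A minimal sketch of such a confidence-interval half-width using the t-distribution (illustrative; not necessarily the exact method used here):

    import numpy as np
    from scipy import stats

    def ci_half_width(data, ci_level=0.99):
        data = np.asarray(data)
        sem = stats.sem(data)  # standard error of the mean
        return sem * stats.t.ppf((1 + ci_level) / 2.0, len(data) - 1)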

pyveg.src.data_analysis_utils.cball(x=range(1, 13), alpha=1.5, n=150.0, xbar=8.0, sigma=2.0)[source]

Calculates the Crystal Ball pdf on the values 1 to 12 by default (i.e. monthly). Default parameter values give a fit close to those we would expect from offset50 time series.

Parameters
  • x (Time series) – Index values going from 1 to the length of the annual time series.

  • alpha (Model parameters, int) – Parameter used in the Crystal Ball pdf calculation.

  • n (Model parameters, int) – Parameter used in the Crystal Ball pdf calculation.

  • xbar (Model parameters, int) – Parameter used in the Crystal Ball pdf calculation.

  • sigma (Model parameters, int) – Parameter used in the Crystal Ball pdf calculation.

Returns

The values of the Crystal Ball pdf for each index of x

Return type

ndarray

pyveg.src.data_analysis_utils.cball_parfit(p0, timeseries, plot_name='CB_fit.png', output_dir='')[source]

Uses least squares regression to optimise the parameters in cball to fit the supplied time series. The supplied time series should be the original series, as this function finds the mean annual time series and reverses and normalises it.

Parameters
  • p0 (Initial parameters, list) – A list of parameters (alpha, n, xbar, sigma) to use in the Crystal Ball calculation as an initial estimate.

  • timeseries – Original time series to calculate the mean annual time series on, reverse and normalise, and then use to optimise the parameters on.

  • plot_name (string) – Name for the data/fit comparison plot

  • output_dir (str) – Directory to save the plots in.

Returns

  • ndarray – A list of optimised parameters (alpha, n, xbar, sigma)

  • int – A indication that the optimisation works (if output is 1,2,3 or 4 then ok)

  • float – The residuals from the best CB fit

pyveg.src.data_analysis_utils.coarse_dataframe(geodf, side_square)[source]

Coarsen the granularity of a dataframe by grouping lat,long points that are close to each other into a square of side L = side_square.

Parameters
  • geodf (Dataframe) – Input dataframe.

  • side_square (integer) – Side of the square.

Returns

A coarser dataframe

Return type

A dataframe

pyveg.src.data_analysis_utils.convert_to_geopandas(df)[source]

Given a pandas DataFrame with lat and long columns, convert to a geopandas DataFrame.

Parameters

df (DataFrame) – Pandas DataFrame with lat and long columns.

Returns

Return type

geopandas DataFrame

pyveg.src.data_analysis_utils.create_lat_long_metric_figures(geodf, metric, output_dir)[source]

From an input dataframe with processed network metrics, create a 2D grid figure for each date available using Geopandas.

Parameters
  • geodf (GeoDataframe) – Input dataframe.

  • metric (string) – Variable to plot.

  • output_dir – Directory to save the figures in.

pyveg.src.data_analysis_utils.decay_rate(x, resolution=12, method='basic')[source]

Calculates the decay rate between the max and min values of a time series.

Parameters
  • x (Time series) – Time series to calculate the decay rate on. mean_annual_ts is calculated on this series within this function, so a raw time series is expected.

  • resolution (int) – Number of values each year in a time series (12 is monthly for example).

  • method ('basic' (default) or 'adjusted') – A choice of whether to calculate the decay rate on the mean annual time series calculated within the function, or to adjust the time series such that the min value is set to 1 by subtracting the minimum plus 1 of the mean annual time series (useful for offset50 values).

Returns

The decay rate value

Return type

float

pyveg.src.data_analysis_utils.early_warnings_null_hypothesis(series, indicators=['var', 'ac'], roll_window=0.4, smooth='Lowess', span=0.1, band_width=0.2, lag_times=[1], n_simulations=1000)[source]

Function to estimate the significance of the early warnings analysis by performing a null hypothesis test. The function estimates distributions of trends in early warning indicators from surrogate time series generated by fitting an ARMA(p,q) model on the original data. The trends are estimated by the nonparametric Kendall tau correlation coefficient and can be compared to the trends estimated in the original time series to produce probabilities of false positives. The function returns a dataframe that contains the Kendall tau rank correlation estimates for original data and surrogates.

Parameters
  • series (pandas Series) – Time series observations.

  • indicators (list of strings) – The statistics (leading indicators) selected for which the analysis is performed.

  • roll_window – Rolling window size as a proportion of the length of the time-series data.

  • smooth (string) – Type of detrending. It can be {‘Gaussian’, ‘Lowess’, ‘None’}.

  • span (float) – Span of time-series data used for Lowess filtering. Taken as a proportion of time-series length if in (0,1), otherwise taken as absolute.

  • band_width (float) – Bandwidth of Gaussian kernel. Taken as a proportion of time-series length if in (0,1), otherwise taken as absolute.

  • lag_times (list of int) – List of lag times at which to compute autocorrelation.

  • n_simulations (int) – The number of surrogate data. Default is 1000.

Returns

A dataframe that contains the Kendall tau rank correlation estimates for each indicator estimated on each surrogate dataset.

Return type

DataFrame

pyveg.src.data_analysis_utils.early_warnings_sensitivity_analysis(series, indicators=['var', 'ac'], winsizerange=[0.1, 0.8], incrwinsize=0.1, smooth='Gaussian', bandwidthrange=[0.05, 1.0], spanrange=[0.05, 1.1], incrbandwidth=0.2, incrspanrange=0.1)[source]

Function to estimate the sensitivity of the early warnings analysis to the smoothing and window size used. The function returns a dataframe that contains the Kendall tau rank correlation estimates for the rolling window sizes (winsize variable) and bandwidths or span sizes, depending on the de-trending (smooth variable). This function is inspired by the sensitivity_ews.R function from Vasilis Dakos and Leo Lahti in the early-warnings-R package: https://github.com/earlywarningtoolbox/earlywarnings-R.

Parameters
  • series (pandas Series) – Time series observations.

  • indicators (list of strings) – The statistics (leading indicator) selected for which the sensitivity analysis is performed.

  • winsizerange (list of float) – Range of the rolling window sizes expressed as a ratio of the time series length (must be numeric between 0 and 1). Default is 0.25 - 0.75.

  • incrwinsize (float) – Increment of the rolling window size (must be numeric between 0 and 1). Default is 0.25.

  • smooth (string) – Type of detrending. It can be {‘Gaussian’, ‘Lowess’, ‘None’}.

  • bandwidthrange (list of float) – Range of the bandwidth used for the Gaussian kernel when Gaussian filtering is selected. It is expressed as a percentage of the time series length (must be numeric between 0 and 100). Default is 5% - 100%.

  • spanrange (list of float) – Parameter that controls the degree of Lowess smoothing (numeric between 0 and 1). Default is 0.05 - 1.

  • incrbandwidth (float) – Size to increment the bandwidth used for the Gaussian kernel when Gaussian filtering is applied. It is expressed as a percentage of the time series length (must be numeric between 0 and 1). Default is 0.2.

  • incrspanrange (float) – Size to increment the span used for the Lowess smoothing.

Returns

A dataframe that contains the Kendall tau rank correlation estimates for the rolling window sizes (winsize variable) and bandwidths or span sizes depending on the de-trending (smooth variable).

Return type

DataFrame

pyveg.src.data_analysis_utils.err_func(params, ts)[source]

Calculates the difference between the cball function with the supplied params and a supplied time series of the same length. err_func is used within the cball_parfit function, where the full time series needs to be supplied.

Parameters
  • params (list) – Parameters used in the Crystal Ball pdf calculation (alpha, n, xbar, sigma).

  • ts (Time series) – Time series to compare the output of the cball function to.

Returns

Residuals/differences between the Crystal Ball pdf and the supplied time series.

Return type

ndarray

pyveg.src.data_analysis_utils.exp_model_fit(x, resolution=12, method='basic')[source]

Fits an exponential model from the maximum to the minimum of the mean annual time series. A raw time series is expected as an input.

Parameters
  • x (Time series) – Time series to fit the model on. mean_annual_ts is calculated on this series within this function, so a raw time series is expected.

  • resolution (int) – Number of values each year in a time series (12 is monthly for example).

  • method ('basic' (default) or 'adjusted') – A choice of whether to fit the exponential model on the mean annual time series calculated within the function, or to adjust the time series such that the min value is set to 1 by subtracting the minimum plus 1 of the mean annual time series (useful for offset50 values).

Returns

The coefficient values from the exponential model fit

Return type

ndarray

pyveg.src.data_analysis_utils.fft_series(time_series)[source]

Perform a Fast Fourier Transform on an input series (assumes one row per day).

Parameters

time_series (pandas Series) – Series with one row per day, and datetime index (which we’ll ignore).

Returns

xvals, yvals – Ready to be plotted directly in a matplotlib plot.

Return type

np.arrays of frequencies (1/day) and strengths in frequency space.
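
For illustration, the usual numpy recipe for such a transform (a sketch under the one-row-per-day assumption, not the pyveg source):

    import numpy as np

    def fft_frequencies(values):
        n = len(values)
        strengths = np.abs(np.fft.fft(values - np.mean(values)))[: n // 2]
        freqs = np.fft.fftfreq(n, d=1.0)[: n // 2]  # d=1.0 -> one sample per day
        return freqs, strengths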

pyveg.src.data_analysis_utils.get_AR1_parameter_estimate(ys)[source]

Fit an AR(1) model to the time series data and return the associated parameter of the model.

Parameters

ys (array) – Input time series data.

Returns

  • float – The parameter value of the AR(1) model.

  • float – The parameter standard error
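
One way to obtain such an estimate with statsmodels (a sketch, not necessarily pyveg’s exact implementation; requires a reasonably recent statsmodels):

    from statsmodels.tsa.ar_model import AutoReg

    def ar1_parameter(ys):
        result = AutoReg(ys, lags=1).fit()
        # Index 0 of params/bse is the constant term; index 1 is the lag-1 coefficient.
        return result.params[1], result.bse[1]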

pyveg.src.data_analysis_utils.get_ar1_var_timeseries_df(series, window_size=0.5)[source]

Given a time series, calculate AR1 and variance using a moving window. Put the two resulting time series into a new DataFrame and return the result.

Parameters
  • series (pandas Series) – Time series observations.

  • window_size (float, optional) – Size of the moving window as a fraction of the time series length.

Returns

The AR1 and variance results in a time series dataframe.

Return type

DataFrame

pyveg.src.data_analysis_utils.get_confidence_intervals(df, column, ci_level=0.99)[source]

Calculate the confidence interval at each time point of a DataFrame containing data for a large image.

Parameters
  • df (DataFrame) – Time series data for multiple sub-image locations.

  • column (str) – Name of the column to calculate the CI of.

  • ci_level (float, optional) – Size of the confidence interval to calculate.

Returns

Time series data for multiple sub-image locations with added column for the ci.

Return type

DataFrame

pyveg.src.data_analysis_utils.get_correlation_lag_ts(series_A, series_B, window_size=0.5)[source]

Given two time series and a lag between them, calculate the lagged correlation between the two time series using a moving window. Additionally, calculate the lag of the maximum precipitation using the moving window.

Parameters
  • series_A (pandas Series) – Observations of the first time series.

  • series_B (pandas Series) – Observations of the second time series.

  • window_size (float, optional) – Size of the moving window as a fraction of the time series length.

Returns

Lagged correlation, and the lag which maximises the correlation, as time series.

Return type

DataFrame

pyveg.src.data_analysis_utils.get_corrs_by_lag(series_A, series_B)[source]
pyveg.src.data_analysis_utils.get_datetime_xs(df)[source]

Return the date column of df as datetime objects.

pyveg.src.data_analysis_utils.get_kendell_tau(ys)[source]

Kendall’s tau gives information about the trend of the time series. It is just a rank correlation test with one variable being time (or the vector 1 to the length of the time series), and the other variable being the data itself. A tau value of 1 means that the time series is always increasing, whereas -1 means always decreasing, and 0 signifies no overall trend.

Parameters

ys (array) – Input time series data.

Returns

  • float – The value of tau.

  • float – The p value of the rank correlation test.
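
A minimal sketch of this trend test with scipy (illustrative only):

    import numpy as np
    from scipy import stats

    def kendall_tau_trend(ys):
        time_index = np.arange(len(ys))
        tau, p_value = stats.kendalltau(time_index, ys)
        return tau, p_value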

pyveg.src.data_analysis_utils.get_max_lagged_cor(dirname, veg_prefix)[source]

Convenience function which returns the maximum correlation as a function of lag (using a file saved earlier).

Parameters
  • dirname (str) – Path to the analysis/ directory of the current analysis job.

  • veg_prefix (str) – Compact representation of the satellite collection name used to obtain vegetation data.

Returns

Max correlation, and lag, for smoothed and unsmoothed vegetation time series.

Return type

tuple

pyveg.src.data_analysis_utils.mean_annual_ts(x, resolution=12)[source]

Calculate the mean annual time series from a time series. Also fills in missing values by linear interpolation. NB: fails if there is a missing value at the start or end.

Parameters
  • x (Time series) – Time series to calculate the mean annual time series for.

  • resolution (float) – Number of values each year in a time series (12 is monthly for example).

Returns

Array of length equal to resolution that is the mean annual time series

Return type

ndarray
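
The core averaging step looks roughly like this (a sketch that omits the interpolation of missing values described above):

    import numpy as np

    def mean_annual_cycle(x, resolution=12):
        x = np.asarray(x, dtype=float)
        n_years = len(x) // resolution
        # One mean value per position in the year, averaged over all complete years.
        return x[: n_years * resolution].reshape(n_years, resolution).mean(axis=0)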

pyveg.src.data_analysis_utils.moving_window_analysis(df, output_dir, window_size=0.5)[source]

Run moving window AR1 and variance calculations for several input time series.

Parameters
  • df (DataFrame) – Input time series DataFrame containing several time series.

  • output_dir (str) – Path to the output plotting directory.

  • window_size (float, optional) – Size of the moving window as a fraction of the time series length.

Returns

AR1 and variance time-series for each of the input time series.

Return type

DataFrame

pyveg.src.data_analysis_utils.network_figure(df, date, metric, vmin, vmax, output_dir)[source]

Make a 2D heatmap plot with network centrality measures.

Parameters
  • df (Dataframe) – Input dataframe.

  • date (String) – Date to be plotted.

  • metric (string) – Which metric is going to be plotted.

  • vmin (int) – Colorbar minimum value.

  • vmax (int) – Colorbar maximum value.

  • output_dir (string) – Directory where to save the plots.

pyveg.src.data_analysis_utils.reverse_normalise_ts(x)[source]

Takes what is expected to be a mean annual time series (from mean_annual_ts), arranges it so the first value is the last, reverses it and then normalises it. It is to be used within the cball function.

Parameters

x (time series) – Time series to reverse and normalise. Assumed to be mean_annual_ts output.

Returns

The reversed and normalised time series

Return type

ndarray

pyveg.src.data_analysis_utils.stl_decomposition(series, period=12)[source]

Run STL decomposition on a pandas Series object.

Parameters
  • series (Series object) – The observations to be deseasonalised.

  • period (int, optional) – Length of the seasonal period in observations.
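
For reference, a minimal STL decomposition with statsmodels on a synthetic monthly series (illustrative only):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import STL

    dates = pd.date_range("2015-01-01", periods=48, freq="MS")
    series = pd.Series(np.sin(np.arange(48) * 2 * np.pi / 12), index=dates)

    result = STL(series, period=12).fit()
    trend, seasonal, resid = result.trend, result.seasonal, result.resid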

pyveg.src.data_analysis_utils.variance_moving_average_time_series(series, length)[source]

Calculate a variance time series using a moving average.

Parameters
  • series (pandas Series) – Time series observations.

  • length (int) – Length of the moving window in number of observations.

Returns

pandas Series with datetime index, and one column, one row per date.

Return type

pandas Series

pyveg.src.data_analysis_utils.write_slimmed_csv(dfs, output_dir, filename_suffix='')[source]
pyveg.src.data_analysis_utils.write_to_json(filename, out_dict)[source]

Create or append the contents of out_dict to the json file filename.

Parameters
  • filename (str) – Output json filename.

  • out_dict (dict) – Information to save.

pyveg.src.date_utils module

Useful functions for manipulating dates and date strings, e.g. splitting a period into sub-periods.

When dealing with date strings, ALWAYS use the ISO format YYYY-MM-DD

pyveg.src.date_utils.assign_dates_to_tasks(date_list, n_tasks)[source]

For batch jobs, will want to split dates as evenly as possible over some number of tasks.

pyveg.src.date_utils.find_mid_period(start_date, end_date)[source]

Given two strings in the format YYYY-MM-DD return a string in the same format representing the middle (to the nearest day)

Parameters
  • start_date (str, date in format YYYY-MM-DD) –

  • end_date (str, date in format YYYY-MM-DD) –

Returns

mid_date

Return type

str, mid point of those dates, format YYYY-MM-DD
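
A short sketch of the midpoint calculation with the standard library (illustrative only):

    from datetime import datetime

    def mid_date(start_date, end_date):
        start = datetime.strptime(start_date, "%Y-%m-%d")
        end = datetime.strptime(end_date, "%Y-%m-%d")
        return (start + (end - start) / 2).strftime("%Y-%m-%d")

    print(mid_date("2016-01-01", "2016-01-31"))  # 2016-01-16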

pyveg.src.date_utils.get_date_range_for_collection(date_range, coll_dict)[source]

Return the intersection of the date range asked for by the user, and the min and max dates for that collection.

Parameters
  • date_range (list or tuple of strings, format YYYY-MM-DD) –

  • coll_dict (dictionary containing min_date and max_date keys) –

Returns

Return type

tuple of strings, format YYYY-MM-DD

pyveg.src.date_utils.get_date_strings_for_time_period(start_date, end_date, period_length)[source]

Use the two functions above to slice a time period into sub-periods, then find the mid-date of each of these.

Parameters
  • start_date (str, format YYYY-MM-DD) –

  • end_date (str, format YYYY-MM-DD) –

  • period_length (str, format '<integer><d|w|m|y>', e.g. 30d) –

Returns

periods – each of which is the mid-point of a sub-period

Return type

list of strings in format YYYY-MM-DD,

pyveg.src.date_utils.get_num_n_day_slices(start_date, end_date, days_per_chunk)[source]

Divide the full period between the start_date and end_date into n equal-length (to the nearest day) chunks. The size of the chunk is defined by days_per_chunk. Takes start_date and end_date as strings ‘YYYY-MM-DD’. Returns an integer with the number of possible points available in that time period.

pyveg.src.date_utils.get_time_diff(date1, date2, units='years')[source]

Calculate the time difference between two dates.

Parameters
  • date1 (str) – Date in format YYYY-MM-DD.

  • date2 (str) – Date in format YYYY-MM-DD.

  • units (str) – Can be “years”, “months”, or “days”.

Returns

time_diff

Return type

int, difference in times, in specified units

pyveg.src.date_utils.slice_time_period(start_date, end_date, period_length)[source]

Slice a time period into chunks, whose length is determined by the period_length, which will be e.g. ‘30d’ for 30 days, or ‘1m’ for one month.

Parameters
  • start_date (str, format YYYY-MM-DD) –

  • end_date (str, format YYYY-MM-DD) –

  • period_length (str, format '<integer><d|w|m|y>', e.g. 30d) –

Returns

periods – each of which is the start and end of a sub-period

Return type

list of lists of strings in format YYYY-MM-DD,
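
Example usage, based on the signatures documented above:

    from pyveg.src.date_utils import get_date_strings_for_time_period, slice_time_period

    # Split one year into 30-day chunks, and also get the mid-point of each chunk.
    chunks = slice_time_period("2016-01-01", "2017-01-01", "30d")
    mid_points = get_date_strings_for_time_period("2016-01-01", "2017-01-01", "30d")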

pyveg.src.date_utils.slice_time_period_into_n(start_date, end_date, n)[source]

Divide the full period between the start_date and end_date into n equal-length (to the nearest day) chunks. Takes start_date and end_date as strings ‘YYYY-MM-DD’. Returns a list of tuples [ (chunk0_start,chunk0_end),…]

pyveg.src.download_modules module

pyveg.src.file_utils module

pyveg.src.file_utils.consolidate_json_to_list(json_dir, output_dir=None, output_filename=None)[source]

Load all the json files (e.g. from individual sub-images), and return a list of dictionaries, to be written out into one json file.

Parameters
  • json_dir (str, full path to directory containing temporary json files) –

  • output_dir (str, full path to desired output directory.) – Can be None, in which case no output written to disk.

  • output_filename (str, name of the output json file.) – Can be None, in which case no output written to disk.

Returns

results

Return type

list of dicts.

pyveg.src.file_utils.construct_filename_from_metadata(metadata, suffix)[source]

Given a dictionary of metadata, construct a filename. Will be used for the results summary json, and the summary stats csv as they are uploaded to Zenodo.

pyveg.src.file_utils.construct_image_savepath(output_dir, collection_name, coords, date_range, image_type)[source]

Function to abstract output image filename construction. Current approach is to create a new dir inside output_dir for the satellite, and then save date and coordinate stamped images in this dir.

pyveg.src.file_utils.download_and_unzip(url, output_tmpdir)[source]

Given a URL from GEE, download it (will be a zipfile) to a temporary directory, then extract archive to that same dir. Then find the base filename of the resulting .tif files (there should be one-file-per-band) and return that.

Parameters
  • url (str, URL of zipfile on GEE server.) –

  • output_tmpdir (str, full path of directory into which to unpack zipfile.) –

Returns

tif_filenames

Return type

list of strings, the full paths to unpacked tif files.

pyveg.src.file_utils.get_filepath_after_directory(path, dirname, include_dirname=False)[source]

Return part of a filepath from a certain point onwards. e.g. if we have path /a/b/c/d/e/f and we say dirname=c, then this will return d/e/f if include_dirname==False, or c/d/e/f if it is True.

Parameters
  • path (str, full filepath) –

  • dirname (str, delimiter, from where we will take the remaining filepath) –

  • include_dirname (bool, if True, the returned path will have dirname as its root.) –

pyveg.src.file_utils.get_tag()[source]

Get the git tag currently checked out.

pyveg.src.file_utils.save_image(image, output_dir, output_filename, verbose=False)[source]

Given a PIL.Image (list of pixel values), save it to the requested filename. Note that the file extension will determine the output file type; it can be .png, .tif, probably others.

pyveg.src.file_utils.save_json(out_dict, output_dir, output_filename, verbose=False)[source]

Given a dictionary, save it to the requested filename.

pyveg.src.file_utils.split_filepath(path)[source]

pyveg.src.gee_interface module

pyveg.src.image_utils module

Modify and slice up tif and png images using the Python Image Library. Needs a relatively recent version of pillow (fork of PIL): ` pip install --upgrade pillow `

pyveg.src.image_utils.adaptive_threshold(img)[source]

Threshold a grayscale image using the mean pixel value of a local area to set the threshold at each pixel location. At the moment set above average brightness pixels to the max (255) and vice versa for below average brightness pixels.

@param img 2D numpy array representing a grayscale image @return thresholded image

pyveg.src.image_utils.check_image_ok(rgb_image, black_pix_threshold=0.05)[source]

Check the quality of an RGB image. Currently checking if we have > X% pixels being masked. This indicates problems with cloud masking in previous steps.

Parameters

rgb_image (Pillow.Image) – Input image to check the quality of

Returns

True if image passes quality requirements, else False.

Return type

bool

pyveg.src.image_utils.combine_tif(band_dict)[source]

Read tif files (one per specified band), and rescale and combine pixel values to r,g,b values between 0 and 255 in a combined output image.

Parameters

band_dict (dict, format {'<r|g|b>': {'band': <band_name>, 'filename': <filename>}}) –

Returns

new_img

Return type

PIL Image, 8-bit rgb image.

pyveg.src.image_utils.compare_binary_image_files(filename1, filename2)[source]

Wrapper for compare_binary_images that opens and closes the image files.

pyveg.src.image_utils.compare_binary_images(image1, image2)[source]

Return the fraction of pixels that are the same in the two images.

pyveg.src.image_utils.convert_to_bw(input_image, threshold, invert=False)[source]

Given an RGB input, apply a threshold to each pixel. If pix(r,g,b)>threshold, set to 255,255,255, if <threshold, set to 0,0,0
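
A hedged sketch of this kind of thresholding with Pillow and numpy (illustrative; thresholding the summed r+g+b value is an assumption consistent with the default threshold of 470 used elsewhere in this module):

    import numpy as np
    from PIL import Image

    def to_black_and_white(image, threshold=470, invert=False):
        summed = np.asarray(image.convert("RGB")).sum(axis=2)  # r+g+b per pixel
        mask = summed < threshold if invert else summed > threshold
        bw = np.where(mask, 255, 0).astype(np.uint8)
        return Image.fromarray(bw).convert("RGB")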

pyveg.src.image_utils.convert_to_rgb(band_dict)[source]

If we are given three or more bands, interpret the first as red, the second as green, the third as blue, and scale them to be between 0 and 255 using the combine_tif function. If we are only given one band, use the scale_tif function to scale the range of input values to between 0 and 255 then apply this to all of r,g,b

Parameters

band_dict (dict, format {'<r|g|b|rgb>': {'band': <band_name>, 'filename': <filename>}}) –

pyveg.src.image_utils.create_gif_from_images(directory_path, output_name, string_in_filename='')[source]

Loop through a directory and convert all images in it into a gif chronologically

Parameters
  • directory_path – directory where all the files are.

  • output_name – name to be given to the output gif

  • string_in_filename – select only files that contain a particular string; default is “”, which implies all files in the directory are selected


pyveg.src.image_utils.crop_and_convert_all(input_dir, output_dir, threshold=470, num_x=50, num_y=50)[source]

Loop through a whole directory and crop and convert to black+white all files within it.

pyveg.src.image_utils.crop_and_convert_to_bw(input_filename, output_dir, threshold=470, num_x=50, num_y=50)[source]

Open an image file, convert to monochrome, and crop into sub-images.

pyveg.src.image_utils.crop_image_nparts(input_image, n_parts_x, n_parts_y=None)[source]

Divide an image into n_parts_x*n_parts_y equal smaller sub-images.

pyveg.src.image_utils.crop_image_npix(input_image, n_pix_x, n_pix_y=None, region_size=None, coords=None)[source]

Divide an image into smaller sub-images with fixed pixel size. If region_size and coordinates are provided, we want to return the coordinates of the sub-images along with the sub-images themselves.

pyveg.src.image_utils.hist_eq(img, clip_limit=2)[source]

Perform contrast limited local histogram equalisation on an input image.

@param img 2D numpy array representing a grayscale image @param clip_limit controls the strength of the equalisation @return 2D numpy array representing the equalised image

pyveg.src.image_utils.image_all_same_colour(image, colour=(255, 255, 255), threshold=0.99)[source]

Return true if all (or nearly all) pixels are same colour

pyveg.src.image_utils.image_file_all_same_colour(image_filename, colour=(255, 255, 255), threshold=0.99)[source]

Wrapper for image_all_same_colour that opens and closes the image file

pyveg.src.image_utils.image_file_to_array(input_filename)[source]

Read an image file and convert to a 2D numpy array, with values 0 for background pixels and 255 for signal. Assume that the input image has only two colours, and take the one with higher sum(r,g,b) to be “signal”.

pyveg.src.image_utils.image_from_array(input_array, output_size=None, sel_val=200)[source]

Convert a 2D numpy array of values into an image where each pixel has r,g,b set to the corresponding value in the array. If an output size is specified, rescale to this size.

pyveg.src.image_utils.invert_binary_image(image)[source]

Swap (255,255,255) with (0,0,0) for all pixels

pyveg.src.image_utils.median_filter(img, r=3)[source]

Convolve a median filter over the image.

@param img 2D numpy array representing a grayscale image @param r the size of the grid to convolve @return 2D numpy array representing the smoothed image

pyveg.src.image_utils.numpy_to_pillow(numpy_image)[source]

Convert a 2D numpy array to a PIL Image object.

@param img 2D numpy array to convert @return PIL Image object

pyveg.src.image_utils.pillow_to_numpy(pil_image)[source]

Convert a PIL Image object to a numpy array (used by openCV).

@param img PIL Image object to convert @return 2D or 3D numpy array (depending on input image)

pyveg.src.image_utils.plot_band_values(input_filebase, bands=['B4', 'B3', 'B2'])[source]

Plot histograms of the values in the chosen bands of the input image

pyveg.src.image_utils.process_and_threshold(img, r=3)[source]

Perform histogram equalisation, adaptive thresholding, and median filtering on an input PIL Image. Return the result converted back to a PIL Image.

@param img input PIL Image object @return processed PIL Image

pyveg.src.image_utils.scale_tif(input_filename)[source]

Given only a single band, scale to range 0,255 and apply this value to all of r,g,b

Parameters

input_filename (str, location of input image) –

Returns

new_img

Return type

pillow Image.

pyveg.src.pattern_generation module

Translation of Matlab code to model patterned vegetation in semi-arid landscapes.

class pyveg.src.pattern_generation.PatternGenerator[source]

Bases: object

Class that can generate simulated vegetation patterns, optionally from a loaded starting pattern, and propagate them through time according to various amounts of rainfall and/or surface and soil water density.

static calc_plant_change(plant_biomass, soil_water, uptake, uptake_saturation, growth_constant, senescence, grazing_loss)[source]

Change in plant biomass as a function of available soil water and various constants.

static calc_soil_water_change(soil_water, surface_water, plant_biomass, frac_surface_water_available, bare_soil_infilt, infilt_saturation, plant_growth, soil_water_evap, uptake_saturation)[source]

Change in soil water as a function of surface water, plant_biomass, and various constants.

static calc_surface_water_change(surface_water, plant_biomass, rainfall, frac_surface_water_available, bare_soil_infilt, infilt_saturation)[source]

Change in surface water as a function of rainfall, plant_biomass, and various constants.

configure()[source]

Set initial parameters, loaded from JSON.

evolve_pattern(steps=10000, dt=1)[source]

Run the code to converge on a vegetation pattern

initial_conditions()[source]

Set initial arrays of soil and surface water.

initialize()[source]

Set initial values to zero, and boundary conditions.

load_config(config_filename)[source]

Load a set of configuration parameters from a JSON file

make_binary(threshold=None)[source]

If not given a threshold to use, take the (max+min)/2 value: anything below is set to zero, anything above is set to 1.

plot_image()[source]

Display the current pattern.

print_config()[source]
save_as_csv(filename)[source]

Save the image as a csv file

save_as_matlab(filename)[source]

Save the image as a matlab file

save_as_png(filename)[source]

Save the image as a png file

set_rainfall(rainfall)[source]

Rainfall in mm

set_random_starting_pattern()[source]

Use the frac from config file to randomly cover some fraction of cells.

set_starting_pattern_from_file(filename)[source]

Takes full path to a CSV file containing m rows of m comma-separated values, which are zero (bare soil) or not-zero (vegetation covered).

pyveg.src.plotting module

Plotting code.

pyveg.src.plotting.kendall_tau_histograms(series_name, df, output_dir)[source]

Produce histograms of the Kendall tau distribution from surrogates, for significance analysis.

Parameters
  • series_name (str) – String containing data collection and time series variable.

  • df (Dataframe) – The output dataframe from the sensitivity analysis function.

  • output_dir – Path to the directory to save the produced figures

pyveg.src.plotting.plot_autocorrelation_function(df, output_dir, filename_suffix='')[source]

Given a time series DataFrame (constructed with make_time_series), plot the autocorrelation function of the relevant columns.

Parameters
  • df (DataFrame) – Time series DataFrame.

  • output_dir (str) – Directory to save the plots in.

pyveg.src.plotting.plot_correlation_mwa(df, output_dir, filename_suffix='')[source]

Given a moving window time series DataFrame, plot the time series of veg-precip correlation.

Parameters
  • df (DataFrame) – The time-series results for veg-precip correlation coeff and lag.

  • output_dir (str) – Directory to save the plot in.

  • filename_suffix (str) – Add suffix string to file name

pyveg.src.plotting.plot_cross_correlations(df, output_dir)[source]

Plot a scatterplot matrix showing correlations between vegetation and precipitation time series, with different lags. Additionally write out the correlations as a function of the lag for later use.

Parameters
  • df (DataFrame) – Time-series data.

  • output_dir (str) – Directory to save the plot in.

pyveg.src.plotting.plot_ews_resiliance(series_name, EWSmetrics_df, Kendalltau_df, dates, output_dir)[source]

Make early warning signals resiliance plots using the output from the ewstools package.

Parameters
  • series_name (str) – String containing data collection and time series variable.

  • EWSmetrics_df (DataFrame) – DataFrame from ewstools containing ews time series.

  • Kendalltau_df (DataFrame) – DataFrame from ewstools containing Kendall tau values for EWSmetrics_df time series

  • output_dir (str) – Output dir to save plot in.

pyveg.src.plotting.plot_feature_vector(output_dir)[source]

Read feature vectors from csv (if they exist) and then make feature vector plots.

Parameters

output_dir (str) – Directory to save the plot in.

pyveg.src.plotting.plot_moving_window_analysis(df, output_dir, filename_suffix='')[source]

Given a moving window time series DataFrame, plot the time series of AR1 and Variance.

Parameters
  • df (DataFrame) – The time-series results for variance and AR1.

  • output_dir (str) – Directory to save the plot in.

  • filename_suffix (str) – Add suffix string to file name

pyveg.src.plotting.plot_ndvi_time_series(df, output_dir)[source]
pyveg.src.plotting.plot_sensitivity_heatmap(series_name, df, output_dir)[source]

Produce a heatmap plot for the sensitivity analysis.

Parameters
  • df (Dataframe) – The output dataframe from the sensitivity analysis function.

  • output_dir – Path to the directory to save the produced figures

pyveg.src.plotting.plot_stl_decomposition(df, period, output_dir)[source]

Run the STL decomposition and plot the results for the network centrality and precipitation time series in df.

Parameters
  • df (DataFrame) – The time-series results.

  • period (float) – Periodicity to model.

  • output_dir (str) – Directory to save the plot in.

pyveg.src.plotting.plot_time_series(df, output_dir)[source]

Given a time series DataFrame (constructed with make_time_series), plot the vegetation and precipitation time series.

Parameters
  • df (DataFrame) – Time series DataFrame.

  • output_dir (str) – Directory to save the plots in.

pyveg.src.processor_modules module

Class for holding analysis modules that can be chained together to build a sequence.

class pyveg.src.processor_modules.NDVICalculator(name=None)[source]

Bases: pyveg.src.processor_modules.ProcessorModule

Class to look at NDVI on sub-images, and return the results as json. Note that the input directory is expected to be the level above the subdirectories for the date sub-ranges.

check_sub_image(ndvi_filename, input_path)[source]

Check the RGB sub-image corresponding to this NDVI image looks OK.

process_single_date(date_string)[source]

Each date will have a subdirectory called ‘SPLIT’ with ~400 NDVI sub-images.

process_sub_image(ndvi_filepath, date_string, coords_string)[source]

Calculate mean and standard deviation of NDVI in a sub-image, both with and without masking out non-vegetation pixels.

set_default_parameters()[source]

Default values. Note that these can be overridden by parent Sequence or by calling configure().

class pyveg.src.processor_modules.NetworkCentralityCalculator(name=None)[source]

Bases: pyveg.src.processor_modules.ProcessorModule

Class to run network centrality calculation on small black+white images, and return the results as json. Note that the input directory is expected to be the level above the subdirectories for the date sub-ranges.

check_sub_image(ndvi_filename, input_path)[source]

Check the RGB sub-image corresponding to this NDVI image looks OK.

process_single_date(date_string)[source]

Each date will have a subdirectory called ‘SPLIT’ with ~400 BWNDVI sub-images.

set_default_parameters()[source]

Default values. Note that these can be overridden by parent Sequence or by calling configure().

class pyveg.src.processor_modules.ProcessorModule(name)[source]

Bases: pyveg.src.pyveg_pipeline.BaseModule

check_if_finished()[source]
check_input_data_exists(date_string)[source]

Processor modules will look for inputs in <input_location>/<date_string>/<input_location_subdirs>. Check that the subdirs exist and are not empty.

Parameters

date_string (str, format YYYY-MM-DD) –

Returns

Return type

True if input directories exist and are not empty, False otherwise.

check_output_data_exists(date_string)[source]

Processor modules will write output to <output_location>/<date_string>/<output_location_subdirs>. Check whether the expected output files already exist there.

Parameters

date_string (str, format YYYY-MM-DD) –

Returns

  • True if the expected number of output files are already in the output location, AND self.replace_existing_files is set to False

  • False otherwise

check_timeout(task_status)[source]

See how long since task_status last changed.

create_task_dict(task_id, date_list, dependencies=[])[source]
get_dependent_batch_tasks()[source]

When running in batch, we are likely to depend on tasks submitted by the previous Module in the Sequence. That Module should be listed in the “depends_on” attribute of this one.

Task dependencies will be a dict of format {“task_id”: <task_id>, “date_range”: [<dates>]}
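For concreteness, one task dependency in the format described above might look like the following sketch (the task ID and dates are illustrative values only, normally taken from the previously submitted Module's batch tasks):

    # Illustrative values only
    task_dependency = {
        "task_id": "veg_image_processor_task_000",
        "date_range": ["2018-01-01", "2018-02-01", "2018-03-01"],
    }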

get_image(image_location)[source]
run()[source]
run_batch()[source]

Write a config json file for each set of dates. If this module depends on another module running in batch, we first get the tasks on which this module's tasks will depend. If not, we look at the input date subdirectories and divide them up amongst the number of batch nodes.

We want to create a list of dictionaries [{“task_id”: <task_id>, “config”: <config_dict>, “depends_on”: [<task_ids>]}] to pass to the batch_utils.submit_tasks function.
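A sketch of that list-of-dicts structure with placeholder values; the keys inside each "config" dict are module-specific and the one shown here is hypothetical:

    # Placeholder values; each entry corresponds to one batch task
    tasks = [
        {
            "task_id": "ndvi_calculator_task_000",
            "config": {"dates_to_process": ["2018-01-01"]},    # hypothetical key; real config is module-specific
            "depends_on": ["veg_image_processor_task_000"],    # IDs of upstream batch tasks
        },
    ]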

run_local()[source]

Loop over dates and call process_single_date on all of them.

save_image(image, output_location, output_filename, verbose=True)[source]
set_default_parameters()[source]

Set some basic defaults. Note that these might get overridden by a parent Sequence, or by calling configure() with a dict of values.

class pyveg.src.processor_modules.VegetationImageProcessor(name=None)[source]

Bases: pyveg.src.processor_modules.ProcessorModule

Class to convert tif files downloaded from GEE into png files that can be looked at or used as input to further analysis.

Current default is to output: 1) Full-size RGB image 2) Full-size NDVI image (greyscale) 3) Full-size black+white NDVI image (after processing, thresholding, …) 4) Many 50x50 pixel sub-images of RGB image 5) Many 50x50 pixel sub-images of black+white NDVI image.

construct_image_savepath(date_string, coords_string, image_type='RGB')[source]

Function to abstract output image filename construction. Current approach is to create a ‘PROCESSED’ subdir inside the sub-directory corresponding to the mid-period of the date range for the full-size images and a ‘SPLIT’ subdirectory for the sub-images.

process_single_date(date_string)[source]

For a single set of .tif files corresponding to a date range (normally a sub-range of the full date range for the pipeline), construct RGB, and NDVI greyscale images. Then do processing and thresholding to make black+white NDVI images. Split the RGB and black+white NDVI ones into small (50x50pix) sub-images.

Parameters

date_string (str, format YYYY-MM-DD) –

Returns

Return type

True if everything was processed and saved OK, False otherwise.

save_rgb_image(band_dict, date_string, coords_string)[source]

Merge the separate tif files for the R, G, B bands into one image, and save it.

set_default_parameters()[source]

Set some basic defaults. Note that these might get overridden by a parent Sequence, or by calling configure() with a dict of values.

split_and_save_sub_images(image, date_string, coords_string, image_type, npix=50)[source]

Split the full-size image into lots of small sub-images

Parameters
  • image (pillow Image) –

  • date_string (str, format YYYY-MM-DD) –

  • coords_string (str, format long_lat) –

  • image_type (str, typically ‘RGB’ or ‘BWNDVI’) –

  • npix (int, dimension in pixels of the side of each sub-image; default 50, giving 50x50 sub-images) –

Returns

Return type

True if all sub-images saved correctly.

class pyveg.src.processor_modules.WeatherImageToJSON(name=None)[source]

Bases: pyveg.src.processor_modules.ProcessorModule

Read the weather-related tif files downloaded from GEE, and write the temp and precipitation values out as a JSON file.

process_single_date(date_string)[source]

Read the tif files downloaded from GEE and extract the values (should be the same for all pixels in the image, so just take mean())

Parameters

date_string (str, format "YYYY-MM-DD") –

set_default_parameters()[source]

Set some basic defaults. Note that these might get overridden by a parent Sequence, or by calling configure() with a dict of values.

pyveg.src.processor_modules.process_sub_image(i, input_filepath, output_location, date_string, coords_string)[source]

Read file and run network centrality

pyveg.src.pyveg_pipeline module

Definitions:

A PIPELINE is the whole analysis procedure for one set of coordinates. It will likely consist of a couple of SEQUENCES - e.g. one for vegetation data and one for weather data.

A SEQUENCE is composed of one or more MODULES, that each do specific tasks, e.g. download data, process images, calculate quantities from image.

A special type of MODULE may be placed at the end of a PIPELINE to combine the results of the different SEQUENCES into one output file.
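A minimal sketch of how these pieces fit together is shown below. The exact mechanism for attaching Modules to a Sequence and Sequences to a Pipeline is not spelled out in this reference, so the composition step is only indicated by a comment; the configure() and run() calls are the documented ones.

    from pyveg.src.pyveg_pipeline import Pipeline, Sequence
    from pyveg.src.processor_modules import VegetationImageProcessor, NDVICalculator

    pipeline = Pipeline("my_location")        # one PIPELINE per set of coordinates
    veg_sequence = Sequence("vegetation")     # one SEQUENCE per data collection

    processor = VegetationImageProcessor()
    ndvi = NDVICalculator()
    # ... attach the Modules to veg_sequence, and veg_sequence to pipeline,
    #     using the composition mechanism provided by the package ...

    pipeline.configure()   # configure all sequences (see Pipeline.configure below)
    pipeline.run()         # run all sequences (see Pipeline.run below)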

class pyveg.src.pyveg_pipeline.BaseModule(name=None)[source]

Bases: object

A “Module” is a building block of a sequence - takes some input, does something (e.g. Downloads from GEE, processes some images, …) and produces some output. The working directory for all modules within a sequence will be given by the sequence - modules may write output to subdirectories of this (e.g. for different dates), but what we call “output_location” will be the base directory common to all modules, and will contain info about the image collection name, and the coordinates.

check_config()[source]

Loop through list of parameters, which will each be a tuple (name, [allowed_types]) and check that the parameter exists, and is of the correct type.

check_for_existing_files(location, num_files_expected)[source]

See if there are already num_files_expected files in the specified location. If “replace_existing_files” is set to True, always return False.

check_if_finished()[source]
configure(config_dict=None)[source]

Order of preference for configuration: 1) config_dict, 2) values held by the parent Sequence, 3) default values. We therefore set them in reverse order here, so that higher-priority values override lower-priority ones.
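For example, an explicitly passed config_dict entry wins over both the parent Sequence's value and the module default. A minimal sketch, using "replace_existing_files" only because that parameter appears elsewhere in this reference:

    from pyveg.src.processor_modules import NDVICalculator

    module = NDVICalculator("ndvi_calc")
    # Highest priority: the value passed in config_dict overrides the default
    # set by set_default_parameters() and any Sequence-level value.
    module.configure({"replace_existing_files": True})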

copy_to_output_location(tmpdir, output_location, file_endings=[])[source]

Copy contents of a temporary directory to a specified output location.

Parameters
  • tmpdir (str, location of temporary directory) –

  • output_location (str, either a path to a local directory (if self.output_location_type is “local”) or an Azure <container>/<blob_path> (if self.output_location_type is “azure”)) –

  • file_endings (list of str, optional. If given, only files with those endings will be copied.) –

get_config()[source]

Get the configuration of this module as a dict.

get_file(filename, location_type)[source]

Just return the filename if location_type is “local”. Otherwise return a tempfile with the contents of a blob if the location is “azure”.

get_json(filepath, location_type)[source]

Read a json file from either local or blob storage.

join_path(*path_elements)[source]

If output_location_type is ‘local’, we will just use os.path.join, which puts a “/” separator in for posix, or “\” for Windows. However, if output_location_type is ‘azure’, we always want “/”.

Parameters

path_elements (list of strings. Directory-like path elements.) –

Returns

path

Return type

str, the path elements joined by “/” or “\”.

list_directory(directory_path, location_type)[source]

List contents of a directory, either on local file system or Azure blob storage.

prepare_for_run()[source]
print_run_status()[source]

Print out how many jobs succeeded or failed

save_config(config_location)[source]

Write out the configuration of this module as a json file.

save_json(data, filename, location, location_type)[source]

Save json to local filesystem or blob storage depending on location_type

set_default_parameters()[source]
set_parameters(config_dict)[source]
class pyveg.src.pyveg_pipeline.Pipeline(name)[source]

Bases: object

A Pipeline contains all the Sequences we want to run on a particular set of coordinates and a date range. e.g. there might be one Sequence for vegetation data and one for weather data.

cleanup()[source]

Call cleanup() for all our sequences

configure()[source]

Configure all the sequences in this pipeline.

get(seq_name)[source]

Return a sequence object when asked for by name.

print_run_status()[source]
run()[source]

Run all the sequences in this pipeline.

class pyveg.src.pyveg_pipeline.Sequence(name)[source]

Bases: object

A Sequence is a collection of Modules where the output of one module is typically the input to the next one. It will typically correspond to a particular data collection, e.g. for vegetation imagery, we might have one module to download the images, one to process them, and one to analyze the processed images.

check_if_finished()[source]

Only relevant when one or more modules are running in batch mode. Sequences that depend on this Sequence will call this function while they wait for all Modules to finish.

cleanup()[source]

If we have batch resources (job/pool), remove them to avoid charges

configure()[source]
create_batch_job_if_needed()[source]

If any modules in this sequence are to be run in batch mode, create a batch job for them.

get(mod_name)[source]

Return a module object when asked for by name, or by class name

has_batch_job()[source]

Do any of the Modules in this sequence have run_mode == ‘batch’?

join_path(*path_elements)[source]

If output_location_type is ‘local’, we will just use os.path.join, which puts a “/” separator in for posix, or “\” for Windows. However, if output_location_type is ‘azure’, we always want “/”.

Parameters

path_elements (list of strings. Directory-like path elements.) –

Returns

path

Return type

str, the path elements joined by “/” or “\”.

print_run_status()[source]

For all modules in the sequence, print out how many jobs succeeded or failed.

run()[source]

Before we run the Modules in this Sequence, check if there are any other Sequences on which we depend, and if so, wait for them to finish.

set_config(config_dict)[source]
set_output_location()[source]

pyveg.src.subgraph_centrality module

Python version of mao_pollen.m matlab code to look at connectedness of pixels on a binary image, using “Subgraph Centrality” as described in:

Mander et al., “A morphometric analysis of vegetation patterns in dryland ecosystems”, R. Soc. Open Sci. (2017). https://royalsocietypublishing.org/doi/10.1098/rsos.160443

Mander et al., “Classification of grass pollen through the quantitative analysis of surface ornamentation and texture”, Proc. R. Soc. B 280: 20131905. https://royalsocietypublishing.org/doi/pdf/10.1098/rspb.2013.1905

Estrada et al., “Subgraph Centrality in Complex Networks”. https://arxiv.org/pdf/cond-mat/0504730.pdf

pyveg.src.subgraph_centrality.calc_adjacency_matrix(distance_matrix, include_diagonal_neighbours=False)[source]

Return a symmetric matrix of size (n-pixels-over-threshold) x (n-pixels-over-threshold), where element ij is 1 if the distance between pixel i and pixel j is below neighbour_threshold, and 0 otherwise.
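A minimal numpy sketch of the thresholding described above (not the package's actual implementation; the handling of diagonal neighbours is omitted):

    import numpy as np

    def adjacency_from_distances(distance_matrix, neighbour_threshold=1.0):
        # 1 where two pixels are closer than the threshold, 0 elsewhere
        adjacency = (distance_matrix < neighbour_threshold).astype(int)
        np.fill_diagonal(adjacency, 0)   # a pixel is not its own neighbour
        return adjacency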

pyveg.src.subgraph_centrality.calc_and_sort_sc_indices(adjacency_matrix)[source]

Given an input adjacency matrix, calculate eigenvalues and eigenvectors, calculate the subgraph centrality (see the Estrada et al. reference above), then sort.

pyveg.src.subgraph_centrality.calc_distance_matrix(signal_coords)[source]

Calculate the distances between all signal pixels in the original image.

pyveg.src.subgraph_centrality.calc_euler_characteristic(pix_indices, graph)[source]

Find the edges where both ends are within the pix_indices list

pyveg.src.subgraph_centrality.crop_image_array(input_image, x_range, y_range)[source]

Return a new image from the specified pixel range of the input image.

pyveg.src.subgraph_centrality.feature_vector_metrics(feature_vector, output_csv=None)[source]

Calculate different metrics for the feature vector

pyveg.src.subgraph_centrality.fill_feature_vector(pix_indices, coords, adj_matrix, num_quantiles=20)[source]

Given indices and coordinates of signal pixels ordered by SC value, put them into quantiles and calculate an element of a feature vector for each quantile, using the Euler characteristic.

Will return:

selected_pixels, feature_vector

where selected_pixels is a vector of the pixel coordinates in each quantile, and feature_vector contains either the number of connected components or the Euler characteristic for each quantile.

pyveg.src.subgraph_centrality.fill_sc_pixels(sel_pixels, orig_image, val=200)[source]

Given an original 2D array where all the elements are 0 (background) or 255 (signal), fill in a selected subset of signal pixels as 123 (grey).

pyveg.src.subgraph_centrality.generate_sc_images(sel_pixels, orig_image, val=200)[source]

Return a dict of images with the selected subsets of signal pixels filled in cyan.

pyveg.src.subgraph_centrality.get_signal_pixels(input_array, threshold=255, lower_threshold=True, invert_y=False)[source]

Find coordinates of all pixels within the image that are > or < the threshold (require < threshold if lower_threshold==True). NOTE: if invert_y is set, we make the second coordinate negative, for reasons.

pyveg.src.subgraph_centrality.invert_y_coord(coord_list)[source]

Convert [(x1,y1),(x2,y2),…] to [(x1,-y1),(x2,-y2),…]
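An equivalent one-liner, for illustration only:

    coords = [(1, 2), (3, 4)]
    inverted = [(x, -y) for x, y in coords]   # [(1, -2), (3, -4)]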

pyveg.src.subgraph_centrality.make_graph(adj_matrix)[source]

Use igraph to create a graph from our adjacency matrix

pyveg.src.subgraph_centrality.save_sc_images(image_dict, file_prefix)[source]

Saves images from dictionary.

pyveg.src.subgraph_centrality.subgraph_centrality(image, use_diagonal_neighbours=False, num_quantiles=20, threshold=255, lower_threshold=True, output_csv=None)[source]

Go through the whole calculation, from input image to output vector of pixels in each SC quantile, and feature vector (either connected-components or Euler characteristic).
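A usage sketch with a small synthetic binary image. Which pixel value counts as “signal” depends on the threshold and lower_threshold arguments described for get_signal_pixels above, and the unpacking of the return value is assumed to mirror fill_feature_vector's outputs.

    import numpy as np
    from pyveg.src.subgraph_centrality import subgraph_centrality

    # Synthetic binary image containing 0s and 255s
    image = np.zeros((20, 20), dtype=int)
    image[5:15, 5:15] = 255

    # Assumed return values: selected pixels per quantile, feature vector
    sel_pixels, feature_vec = subgraph_centrality(image, num_quantiles=20)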

pyveg.src.subgraph_centrality.text_file_to_array(input_filename)[source]

Read a csv-like representation of an image, where each row (representing a row of pixels in the image) is a comma-separated list of pixel values 0 (for black) or 255 (for white).

pyveg.src.subgraph_centrality.write_csv(feature_vec, output_filename)[source]

Write the feature vector to a 1-line csv

pyveg.src.subgraph_centrality.write_dict_to_csv(metrics_dict, output_filename)[source]

pyveg.src.zenodo_utils module

Use the Zenodo API to deposit or retrieve data.

Needs an API token. To create one:

  • Sign in or create an account at https://zenodo.org

  • Create an API token by going to https://zenodo.org/account/settings/applications/tokens/new/

  • Tick “deposit:actions” and “deposit:write” in the “Scopes” section, and click Create.

  • Copy the created token into a file called “zenodo_api_token” in the pyveg/configs/ directory.

OR, to use the “Sandbox” API for testing, follow the same steps but replace “zenodo.org” with “sandbox.zenodo.org” in the URLs, put the token into a file named “zenodo_test_api_token”, and then call the functions in this module with the “test” argument set to True.
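A rough end-to-end sketch against the sandbox API (test=True), using only the functions documented below; the “id” key used to read the deposition ID out of the API response is an assumption about the response format.

    from pyveg.src import zenodo_utils

    # Create an empty deposition on the sandbox, then upload a file, add metadata, and publish
    dep_info = zenodo_utils.create_deposition(test=True)
    deposition_id = dep_info["id"]                      # assumed key in the API response
    zenodo_utils.upload_file("results_summary.json", deposition_id, test=True)
    zenodo_utils.upload_standard_metadata(deposition_id, json_or_csv="json", test=True)
    zenodo_utils.publish_deposition(deposition_id, test=True)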

pyveg.src.zenodo_utils.create_deposition(test=False)[source]

Create a new, empty deposition.

Parameters

test (bool, True if we will use the sandbox API, False otherwise) –

Returns

r

Return type

dict, response from the API with info about the newly created deposition

pyveg.src.zenodo_utils.delete_file(filename, deposition_id, test=False)[source]

Delete a file from a deposition.

Parameters
  • filename (str, full path to the file to be deleted) –

  • deposition_id (int, ID of the deposition containing this file) –

  • test (bool, True if we will use the sandbox API, False otherwise) –

Returns

Return type

True if file was deleted OK, False otherwise.

pyveg.src.zenodo_utils.download_file(filename, deposition_id, destination_path='.', test=False)[source]

Download a file from a deposition.

Parameters
  • filename (str, name of the file to be downloaded) –

  • deposition_id (int, ID of the deposition containing this file) –

  • destination_path (str, where to put the downloaded file) –

  • test (bool, True if we will use the sandbox API, False otherwise) –

Returns

filepath

Return type

str, location of downloaded file.

pyveg.src.zenodo_utils.download_results_by_coord_id(coords_id, json_or_csv='json', destination_path=None, deposition_id=None, test=False)[source]

Search the deposition (defined by the deposition_id in zenodo_config.py) for results_summary json or summary_stats csv files beginning with coords_id, and download the most recent one.

Parameters
  • coords_id (str, two-digit string identifying the row of the location in coordinates.py) –

  • json_or_csv (str, if "json", download 'results_summary.json', otherwise download 'ts_summary_stats.csv'.) –

  • destination_path (str, directory to download to. If not given, put in temporary dir) –

  • deposition_id (str, deposition ID in Zenodo. If not given, use the one from zenodo_config.py) –

  • test (bool, if True, use the sandbox Zenodo repository) –

pyveg.src.zenodo_utils.get_base_url_and_token(test=False)[source]

Get the base URL for the API, and the API token, for use in requests.

Parameters

test (bool, True if we will use the sandbox API, False otherwise) –

Returns

  • base_url (str, the first part of the URL for the API)

  • api_token (str, the personal access token, read from a file.)

pyveg.src.zenodo_utils.get_bucket_url(deposition_id, test=False)[source]

For a given deposition_id, find the URL needed to upload a file.

Parameters
  • deposition_id (int, ID of the deposition.) –

  • test (bool, if True use the sandbox API, if False will use the real one.) –

Returns

bucket_url

Return type

str, the URL of the bucket for this deposition, or empty string if id not found

pyveg.src.zenodo_utils.get_deposition_id(json_or_csv='json', test=False)[source]

If we have previously created a deposition, we hopefully stored its ID in the zenodo_config.py file.

pyveg.src.zenodo_utils.get_deposition_info(deposition_id, test=False)[source]

Get the JSON object containing details of a deposition.

Parameters
  • deposition_id (int, ID of the deposition.) –

  • test (bool, if True use the sandbox API, if False will use the real one.) –

Returns

dep_info

Return type

dict, information about the deposition

pyveg.src.zenodo_utils.get_results_summary_json(coords_string, collection, deposition_id, test=False)[source]

Assuming the zipfile is named following the convention results_<long>_<lat>_<collection>.zip, download this from the deposition and extract the results_summary.json.

pyveg.src.zenodo_utils.list_depositions(test=False)[source]

List all the depositions created by this account.

Parameters

test (bool, True if we will use the sandbox API, False otherwise) –

Returns

r

Return type

list of dicts, response from the API with info about the depositions

pyveg.src.zenodo_utils.list_files(deposition_id, json_or_csv='json', test=False)[source]

List all the files in a deposition.

Parameters
  • deposition_id (int, ID of the deposition on which to list files) –

  • json_or_csv (str, if ‘json’, list the deposition containing the results_summary.json, otherwise list the one containing ts_summary_stats.csv) –

  • test (bool, True if using the sandbox API, False otherwise) –

Returns

files

Return type

list[str], list of all filenames in the deposition.

pyveg.src.zenodo_utils.prepare_results_zipfile(collection_name, png_location, png_location_type='local', json_location=None, json_location_type='local')[source]

Create a zipfile called <results_long_lat_collection> containing the ‘results_summary.json’, and the outputs of the analysis.

Parameters
  • collection_name (str, typically "Sentinel2" or "Landsat8" or similar) –

  • png_location (str, directory containing analysis/ subdirectory) –

  • png_location_type (str, either "local" or "azure") –

  • json_location (str, directory containing “results_summary.json”. If not specified, assume same as png_location) –

  • json_location_type (str, either "local" or "azure") –

Returns

zip_filename

Return type

str, location of the produced zipfile

pyveg.src.zenodo_utils.publish_deposition(deposition_id, test=False)[source]

Submit the deposition, so it will be findable on Zenodo and have a DOI.

pyveg.src.zenodo_utils.unlock_deposition(deposition_id, test=False)[source]

Unlock a previously submitted deposition, so we can add to it.

pyveg.src.zenodo_utils.upload_custom_metadata(title, upload_type, description, creators, deposition_id, test=False)[source]

Upload a dict to the deposition containing metadata with the format:

{
    'metadata': {
        'title': 'My first upload',
        'upload_type': 'poster',
        'description': 'This is my first upload',
        'creators': [{'name': 'Doe, John', 'affiliation': 'Zenodo'}]
    }
}

Parameters
  • title (str, title of the deposition) –

  • upload_type (str, type of upload, typically “dataset”) –

  • description (str, description of the deposition) –

  • creators (dict, format {“name”: <str:name>, “affiliation”: <str:affiliation>}) –

Returns

r

Return type

dict, JSON response from the API.
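A usage sketch with illustrative values; the creators argument follows the format given in the parameter description above, and deposition_id is a placeholder for an existing deposition's ID.

    from pyveg.src.zenodo_utils import upload_custom_metadata

    r = upload_custom_metadata(
        title="Vegetation pattern results",
        upload_type="dataset",
        description="Results from the pyveg pipeline",
        creators={"name": "Doe, John", "affiliation": "Zenodo"},
        deposition_id=12345,            # placeholder ID of an existing deposition
        test=True,                      # use the sandbox API
    )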

pyveg.src.zenodo_utils.upload_file(filename, deposition_id, test=False)[source]

Upload a file to a deposition.

Parameters
  • filename (str, full path to the file to be uploaded) –

  • deposition_id (int, ID of the deposition to which we want to upload.) –

  • test (bool, True if we will use the sandbox API, False otherwise) –

Returns

uploaded_ok

Return type

bool, True if we get status code 200 from the API

pyveg.src.zenodo_utils.upload_standard_metadata(deposition_id, json_or_csv='json', test=False)[source]

Upload the metadata dict defined in zenodo_config.py to the specified deposition ID.

Parameters
  • deposition_id (int, ID of the deposition to which to upload) –

  • json_or_csv (str, either ‘json’ to upload the metadata for results_summary.json, or ‘csv’ to upload the metadata for ts_summary_stats.csv) –

  • test (bool, if True use the sandbox API, if False use the production one) –

Returns

r

Return type

dict, JSON response from the API.

Module contents