Source code of the pyveg package¶
pyveg.src.analysis_preprocessing module¶
This module consists of methods to process downloaded GEE data. The starting point is a json file written out at the end of the downloading step. This module cleans, resamples, and reformats the data to make it ready for analysis.
-
pyveg.src.analysis_preprocessing.
detrend_data
(dfs, period='MS')[source]¶ Loop over the time series DataFrame of each sub-image and remove seasonality by subtracting the value from the previous year. Seasonality is removed from precipitation data in the same way.
- Parameters
dfs (dict of DataFrame) – Time series data for multiple sub-image locations.
period (str, optional) – Resample the time series to this frequency, then infer the lag to use for deseasonalizing.
- Returns
Time series data for multiple sub-image with seasonality removed.
- Return type
dict of DataFrame
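The "subtract the previous year" step can be illustrated with a minimal sketch. It assumes a gap-free monthly series held in a plain Python list, so the lag is 12. The helper name `subtract_previous_year` is hypothetical, and the package itself works on DataFrames and may treat the first year differently.

```python
import math


def subtract_previous_year(values, lag=12):
    """Return values[i] - values[i - lag] for each i.

    The first `lag` entries have no previous-year partner and are
    returned as NaN (an assumption made for this sketch).
    """
    out = []
    for i, value in enumerate(values):
        if i < lag:
            out.append(math.nan)
        else:
            out.append(value - values[i - lag])
    return out


# Two years of a seasonal monthly signal; year two is shifted up by 1.0.
year_one = [float(month) for month in range(12)]
year_two = [value + 1.0 for value in year_one]
detrended = subtract_previous_year(year_one + year_two)
```

After differencing, the seasonal shape cancels and only the year-on-year change remains.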
-
pyveg.src.analysis_preprocessing.
detrend_df
(df, period='MS')[source]¶ Remove seasonality from a DataFrame containing the time series for a single sub-image.
- Parameters
df (DataFrame) – Time series data for a single sub-image location.
period (str, optional) – Resample the time series to this frequency, then infer the lag to use for deseasonalizing.
- Returns
Input with seasonality removed from time series columns.
- Return type
DataFrame
-
pyveg.src.analysis_preprocessing.
drop_veg_outliers
(dfs, column='offset50', sigmas=3.0)[source]¶ Loop over vegetation DataFrames and drop points in the time series that are significantly far from the mean of the time series. Such points are assumed to be unphysical.
- Parameters
dfs (dict of DataFrame) – Time series data for multiple sub-image locations.
column (str) – Name of the column to drop outliers on.
sigmas (float) – Number of standard deviations a data point has to be from the mean to be labelled as an outlier and dropped.
- Returns
Time series data for multiple sub-image locations with some values in column potentially set to NaN.
- Return type
dict of DataFrame
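As a rough illustration of the sigma cut described above, the sketch below replaces values further than `sigmas` standard deviations from the mean with NaN. It uses plain lists and the sample standard deviation. The helper name `mask_outliers` is hypothetical, and the package may compute the mean and spread differently.

```python
import math
import statistics


def mask_outliers(values, sigmas=3.0):
    """Set values more than `sigmas` standard deviations from the mean to NaN."""
    finite = [v for v in values if not math.isnan(v)]
    mean = statistics.fmean(finite)
    std = statistics.stdev(finite)  # sample standard deviation
    # Comparisons with NaN are False, so existing NaNs pass through unchanged.
    return [math.nan if abs(v - mean) > sigmas * std else v for v in values]


series = [1.0] * 20 + [100.0]
cleaned = mask_outliers(series)
```

Note that a single extreme value inflates the standard deviation itself, so in short series an outlier may survive a 3-sigma cut.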
-
pyveg.src.analysis_preprocessing.
fill_veg_gaps
(dfs, missing)[source]¶ Loop through sub-image time series and replace any gaps with mean value of the same month in other years.
- Parameters
dfs (dict of DataFrame) – Time series data for multiple sub-image locations.
missing (dict of array) – Missing time points where no sub-images were analysed, for each vegetation DataFrame in dfs.
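The gap-filling rule (use the mean of the same month in other years) can be sketched as follows. The data layout, a dict keyed by `(year, month)`, and the helper name `fill_with_monthly_mean` are assumptions made for illustration only.

```python
import statistics


def fill_with_monthly_mean(series, missing):
    """Fill each missing (year, month) with the mean of that month in other years.

    series:  dict mapping (year, month) -> value
    missing: iterable of (year, month) keys that have no observation
    """
    filled = dict(series)
    for year, month in missing:
        same_month = [
            value for (y, m), value in series.items() if m == month and y != year
        ]
        if same_month:  # leave the gap if no other year has this month
            filled[(year, month)] = statistics.fmean(same_month)
    return filled


observed = {(2018, 6): 10.0, (2019, 6): 14.0, (2018, 7): 3.0}
filled = fill_with_monthly_mean(observed, missing=[(2020, 6), (2020, 8)])
```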
-
pyveg.src.analysis_preprocessing.
get_missing_time_points
(dfs)[source]¶ Find missing time points for each vegetation dataframe in dfs, and return a dict, with the same key as in dfs, but with values corresponding to missing dates.
- Parameters
dfs (dict of DataFrame) – Time series data for multiple sub-image locations.
- Returns
Missing time points for each vegetation df.
- Return type
dict
-
pyveg.src.analysis_preprocessing.
make_time_series
(dfs)[source]¶ Given a dictionary of DataFrames which may contain many rows per time point (corresponding to the network centrality values of different sub-locations), collapse this into a time series by calculating the mean and std of the different sub-locations at each date.
- Parameters
dfs (dict of DataFrame) – Input DataFrame read by read_json_to_dataframes.
- Returns
ts_list – The time-series results averaged over sub-locations. First entry will be main dataframe of vegetation and weather. Second one (if present) will be historical weather.
- Return type
list of DataFrames
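The collapse from many rows per date to one mean and one standard deviation per date can be sketched with plain dictionaries. The row layout (a `date` key plus a value column such as `offset50`) and the helper name `collapse_to_time_series` are assumptions. The real function operates on DataFrames and returns a list of them.

```python
import statistics
from collections import defaultdict


def collapse_to_time_series(rows, column):
    """Group rows by date and return {date: {"mean": ..., "std": ...}}."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row[column])
    return {
        date: {
            "mean": statistics.fmean(values),
            # A single sub-location has no spread; report 0.0 in that case.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for date, values in sorted(by_date.items())
    }


rows = [
    {"date": "2020-01-01", "offset50": 1.0},
    {"date": "2020-01-01", "offset50": 3.0},
    {"date": "2020-02-01", "offset50": 5.0},
]
ts = collapse_to_time_series(rows, "offset50")
```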
-
pyveg.src.analysis_preprocessing.
preprocess_data
(input_json, output_basedir, drop_outliers=True, fill_missing=True, resample=True, smoothing=True, detrend=True, n_smooth=4, period='MS')[source]¶ Read and process data downloaded from GEE. Processing can be configured through the function arguments. Processed data is written to csv.
- Parameters
input_json (dict) – JSON data created during a GEE download job.
output_basedir (str) – Directory where the time-series csv will be written.
drop_outliers (bool, optional) – Remove outliers in sub-image time series.
fill_missing (bool, optional) – Fill missing points in the time series.
resample (bool, optional) – Resample the time series using linear interpolation.
smoothing (bool, optional) – Smooth the time series using LOESS smoothing.
detrend (bool, optional) – Remove seasonal component by subtracting previous year.
n_smooth (int, optional) – Number of time points to use for the smoothing window size.
period (str, optional) – Pandas DateOffset string describing sampling frequency.
- Returns
output_dir (str) – Path to the csv file containing processed data.
defs (dict) – Dictionary of dataframes.
-
pyveg.src.analysis_preprocessing.
read_json_to_dataframes
(data)[source]¶ Convert JSON data to a dict of DataFrame.
- Parameters
data (dict) – JSON data output from run_pyveg_pipeline.
- Returns
A dict of the saved results in a DataFrame format. Keys are names of collections and the values are DataFrame of results for that collection.
- Return type
dict
-
pyveg.src.analysis_preprocessing.
read_results_summary
(input_location, input_filename='results_summary.json', input_location_type='local')[source]¶ Read the results_summary.json, either from local storage, Azure blob storage, or zenodo.
- Parameters
input_location (str) – Directory or container that holds results_summary.json, or the coords_id if reading from zenodo.
input_filename (str) – Name of the json file. Default is "results_summary.json".
input_location_type (str) – One of 'local', 'azure', 'zenodo' or 'zenodo_test'.
- Returns
json_data – The contents of results_summary.json.
- Return type
dict
-
pyveg.src.analysis_preprocessing.
resample_data
(dfs, period='MS')[source]¶ Resample vegetation and rainfall DataFrames. Vegetation DataFrames are resampled at the sub-image level.
- Parameters
dfs (dict of DataFrame) – Time series data for multiple sub-image locations.
period (string) – Period for resampling.
- Returns
Resampled data.
- Return type
dict of DataFrame
-
pyveg.src.analysis_preprocessing.
resample_dataframe
(df, columns, period='MS')[source]¶ Resample and interpolate a time series dataframe so we have one row per time period.
- Parameters
df (DataFrame) – Dataframe with date as index.
columns (list) – List of column names to resample. Should contain numeric data.
period (string) – Period for resampling.
- Returns
DataFrame with resample time series in columns.
- Return type
DataFrame
-
pyveg.src.analysis_preprocessing.
resample_time_series
(series, period='MS')[source]¶ Resample and interpolate a time series dataframe so we have one row per time period (useful for FFT)
- Parameters
df (DataFrame) – Dataframe with date as index
col_name (string,) – Identifying the column we will pull out
period (string) – Period for resampling
- Returns
pandas Series with datetime index, and one column, one row per day
- Return type
Series
-
pyveg.src.analysis_preprocessing.
save_ts_summary_stats
(ts_dirname, output_dir, metadata)[source]¶ Given time series DataFrames (constructed with make_time_series), give summary statistics of all the available time series.
- Parameters
ts_dirname (str) – Directory where the time series are saved.
output_dir (str) – Directory to save the plots in.
metadata (dict) – Dictionary with metadata from location
-
pyveg.src.analysis_preprocessing.
smooth_all_sub_images
(df, column='offset50', n=4, it=3)[source]¶ Perform LOWESS (Locally Weighted Scatterplot Smoothing) on the time series of a set of sub-images.
- Parameters
df (DataFrame) – DataFrame containing time series results for all sub-images, with multiple rows per time point and (lat,long) point.
column (string, optional) – Name of the column in df to smooth.
n (int, optional) – Size of smoothing window.
it (int, optional) – Number of iterations of LOESS smoothing to perform.
- Returns
DataFrame of results with a new column containing a LOESS smoothed version of the column named by column.
- Return type
DataFrame
-
pyveg.src.analysis_preprocessing.
smooth_subimage
(df, column='offset50', n=4, it=3)[source]¶ Perform LOWESS (Locally Weighted Scatterplot Smoothing) on the time series of a single sub-image.
- Parameters
df (DataFrame) – Input DataFrame containing the time series for a single sub-image.
column (string, optional) – Name of the column in df to smooth.
n (int, optional) – Size of smoothing window.
it (int, optional) – Number of iterations of LOESS smoothing to perform.
- Returns
The time-series DataFrame with a new column containing the smoothed results.
- Return type
DataFrame
-
pyveg.src.analysis_preprocessing.
smooth_veg_data
(dfs, column='offset50', n=4)[source]¶ Loop over vegetation DataFrames and perform LOESS smoothing on the time series of each sub-image.
- Parameters
dfs (dict of DataFrame) – Time series data for multiple sub-image locations.
column (str) – Name of the column to drop outliers and smooth.
n (int) – Number of neighbouring points to use in smoothing.
- Returns
Time series data for multiple sub-image locations with new column for smoothed data and ci.
- Return type
dict of DataFrame
-
pyveg.src.analysis_preprocessing.
store_feature_vectors
(dfs, output_dir)[source]¶ Write out all feature vector information to a csv file, to be read later by the feature vector plotting script.
- Parameters
dfs (dict of DataFrame) – Time series data for multiple sub-image locations.
output_dir (str) – Path to directory to save the csv.
pyveg.src.azure_utils module¶
-
pyveg.src.azure_utils.
check_blob_exists
(blob_name, container_name, bbs=None)[source]¶ See if a blob already exists for this account name.
-
pyveg.src.azure_utils.
check_container_exists
(container_name, bbs=None)[source]¶ See if a container already exists for this account name.
-
pyveg.src.azure_utils.
download_rgb
(container, rgb_dir)[source]¶ - Parameters
container (str, the container name) –
rgb_dir (str, directory into which to put image files.) –
-
pyveg.src.azure_utils.
download_summary_json
(container, json_dir)[source]¶ - Parameters
container (str, the container name) –
json_dir (str, temporary directory into which to put json file.) –
-
pyveg.src.azure_utils.
get_sas_token
(container_name, token_duration=1, permissions='READ', bbs=None)[source]¶
-
pyveg.src.azure_utils.
remove_container_name_from_blob_path
(blob_path, container_name)[source]¶ Get the bit of the filepath after the container name.
-
pyveg.src.azure_utils.
retrieve_blob
(blob_name, container_name, destination='/tmp/', bbs=None)[source]¶ Use the BlockBlobService to retrieve a file from Azure and place it in the destination folder.
-
pyveg.src.azure_utils.
sanitize_container_name
(orig_name)[source]¶ Sanitize a container name so that it contains only the allowed characters: alphanumeric characters and dashes.
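One way such a sanitizer could work is sketched below. This is not the package's implementation: the helper name `sanitize_name_sketch` and the choices to lowercase, to replace disallowed characters with dashes, and to collapse repeated dashes are all assumptions, guided by the Azure rule that container names use lowercase letters, digits and single dashes.

```python
import re


def sanitize_name_sketch(orig_name):
    """Return a lowercase name containing only letters, digits and single dashes."""
    lowered = orig_name.lower()
    # Replace every run of disallowed characters with one dash.
    dashed = re.sub(r"[^a-z0-9-]+", "-", lowered)
    # Collapse repeated dashes and trim dashes from both ends.
    collapsed = re.sub(r"-{2,}", "-", dashed)
    return collapsed.strip("-")


example = sanitize_name_sketch("Sentinel2/27.95_11.57 results")
```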
-
pyveg.src.azure_utils.
save_image
(image, output_location, output_filename, container_name, format='png', bbs=None)[source]¶ Given a PIL.Image (list of pixel values), save to requested filename - note that the file extension will determine the output file type, can be .png, .tif, probably others…
-
pyveg.src.azure_utils.
write_files_to_blob
(path, container_name, blob_path=None, file_endings=[], bbs=None)[source]¶ Upload a whole directory structure to blob storage. If we are given ‘blob_path’ we use that - if not we preserve the given file path structure. In both cases we take care to remove the container name from the start of the blob path
pyveg.src.batch_utils module¶
Functions for submitting batch jobs. Currently only Azure Batch is supported. Largely taken from https://github.com/Azure-Samples/batch-python-quickstart
-
pyveg.src.batch_utils.
add_task
(task_id, job_name, input_script, input_config, input_azure_config, task_dependencies, batch_service_client=None)[source]¶ Add the batch task to the job.
- Parameters
task_id (str, unique ID within this job for the task) –
job_name (str, name for the job - usually Sequence name + timestamp) –
input_script (ResourceFile corresponding to bash script uploaded to blob storage) –
input_config (ResourceFile corresponding to json config for this task uploaded to blob storage) –
input_azure_config (ResourceFile corresponding to azure config, uploaded to blob storage) –
task_dependencies (list of str, task_ids of any tasks that this one depends on) –
batch_service_client (BatchServiceClient) –
-
pyveg.src.batch_utils.
check_task_failed_dependencies
(task, job_id, batch_service_client=None)[source]¶ If a task depends on other task(s), and those have failed, the job will not be able to run.
- Parameters
task (azure.batch.models.CloudTask, the task we will look at dependencies for) –
job_id (str, the unique ID of the Job.) –
batch_service_client (BatchServiceClient - will create if not provided.) –
- Returns
True if the job depends on other tasks that have failed (or those tasks depend on failed tasks), False otherwise.
-
pyveg.src.batch_utils.
check_tasks_status
(job_id, task_name_prefix='', batch_service_client=None)[source]¶ For a given job, query the status of all the tasks.
- Returns
task_status – Dictionary containing the following keys/values: num_success: int, successfully completed; num_failed: int, completed but with non-zero exit code; num_running: int, currently running; num_waiting: int, in “active” state; num_cannot_run: int, in “active” state, but with dependent tasks that failed.
- Return type
dict
-
pyveg.src.batch_utils.
create_job
(job_id, pool_id=None, batch_service_client=None)[source]¶ Creates a job with the specified ID, associated with the specified pool.
- Parameters
job_id (str, ID for the job - will typically be module or sequence name +timestamp) –
pool_id (str, ID for the pool. If not provided, use the one from azure_config.py) –
batch_service_client (BatchServiceClient instance. Create one if not provided.) –
-
pyveg.src.batch_utils.
create_pool
(pool_id, batch_service_client=None)[source]¶ Creates a pool of compute nodes.
- Parameters
pool_id (str, identifier for the pool) –
batch_service_client (azure.batch.BatchServiceClient, A Batch service client.) –
-
pyveg.src.batch_utils.
delete_job
(job_id, batch_service_client=None)[source]¶ Removes a job, and associated tasks.
-
pyveg.src.batch_utils.
delete_pool
(pool_id=None, batch_service_client=None)[source]¶ Removes a pool of batch nodes
-
pyveg.src.batch_utils.
prepare_for_task_submission
(job_name, config_container_name, batch_service_client, blob_client)[source]¶ Create pool and job if not already existing, and upload the azure config file and the bash script used to run the batch job.
- Parameters
job_name (str, ID of the job) –
batch_service_client (BatchServiceClient to interact with Azure batch.) –
- Returns
input_azure_config, input_script – ResourceFiles corresponding to the azure_config.py and batch_commands.sh scripts, uploaded to blob storage.
- Return type
ResourceFiles
-
pyveg.src.batch_utils.
print_task_output
(batch_service_client, job_id, encoding=None)[source]¶ Prints the stdout.txt file for each task in the job.
- Parameters
batch_service_client (batchserviceclient.BatchServiceClient) – The batch client to use.
job_id (str) – The id of the job with task output files to print.
-
pyveg.src.batch_utils.
submit_tasks
(task_dicts, job_name)[source]¶ Submit batch jobs to Azure batch.
- Parameters
task_dicts (list of dict) – Each entry has the form {"task_id": <task_id>, "config": <config_dict>, "depends_on": [<task_ids>]}.
job_name (str) – Should identify the sequence generating the jobs.
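A small example of the task_dicts structure may help. The three keys come from the description above. The task IDs, the contents of each `config` dict and the helper `unknown_dependencies` are hypothetical and only illustrate a sanity check that could be run before submission.

```python
# Hypothetical task list: the keys follow the documented structure,
# but the IDs and config contents are made up for illustration.
task_dicts = [
    {
        "task_id": "download_0",
        "config": {"date_range": ["2019-01-01", "2019-02-01"]},
        "depends_on": [],
    },
    {
        "task_id": "combine",
        "config": {},
        "depends_on": ["download_0"],
    },
]


def unknown_dependencies(task_dicts):
    """Return dependency IDs that do not match any task_id in the list."""
    known = {task["task_id"] for task in task_dicts}
    return [
        dep
        for task in task_dicts
        for dep in task["depends_on"]
        if dep not in known
    ]
```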
-
pyveg.src.batch_utils.
upload_file_to_container
(block_blob_client, container_name, file_path)[source]¶ Uploads a local file to an Azure Blob storage container.
- Parameters
block_blob_client (azure.storage.blob.BlockBlobService) – A blob service client.
container_name (str) – The name of the Azure Blob storage container.
file_path (str) – The local path to the file.
- Return type
azure.batch.models.ResourceFile
- Returns
A ResourceFile initialized with a SAS URL appropriate for Batch
tasks.
-
pyveg.src.batch_utils.
wait_for_tasks_to_complete
(job_id, timeout=60, batch_service_client=None)[source]¶ Returns when all tasks in the specified job reach the Completed state.
- Parameters
batch_service_client (azure.batch.BatchServiceClient) – A Batch service client.
job_id (str) – The id of the job whose tasks should be monitored.
timeout (timedelta) – The duration to wait for task completion. If all tasks in the specified job do not reach Completed state within this time period, an exception will be raised.
pyveg.src.combiner_modules module¶
Modules that can consolidate inputs from different sources and produce combined output file (typically JSON).
-
class
pyveg.src.combiner_modules.
VegAndWeatherJsonCombiner
(name=None)[source]¶ Bases:
pyveg.src.combiner_modules.CombinerModule
Expect directory structures like: <something>/<input_veg_location>/<date>/network_centralities.json <something>/<input_weather_location>/RESULTS/weather_data.json
-
check_output_dict
(output_dict)[source]¶ For all the keys (i.e. dates) in the vegetation time-series, count how many have data for both veg and weather
-
combine_json_lists
(json_lists)[source]¶ If for example we have json files from the NetworkCentrality and NDVI calculators, all containing lists of dicts for sub-images, combine them here by matching by coordinate.
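Matching sub-image records by coordinate can be sketched as a dictionary merge. The key names `latitude` and `longitude`, the example fields and the helper name `combine_by_coordinate` are assumptions. The actual method may use different keys or tolerance rules.

```python
def combine_by_coordinate(json_lists):
    """Merge dicts from several lists when they share the same coordinates."""
    combined = {}
    for json_list in json_lists:
        for record in json_list:
            key = (record["latitude"], record["longitude"])
            # Start a new entry for unseen coordinates, then merge fields in.
            combined.setdefault(key, {}).update(record)
    return list(combined.values())


ndvi = [{"latitude": 11.5, "longitude": 27.9, "ndvi": 0.3}]
centrality = [
    {"latitude": 11.5, "longitude": 27.9, "offset50": -1.2},
    {"latitude": 11.6, "longitude": 27.9, "offset50": -0.8},
]
combined = combine_by_coordinate([ndvi, centrality])
```

Exact float equality is used for the match here, which only works when every calculator writes out identical coordinate values.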
-
get_veg_time_series
()[source]¶ Combine contents of JSON files written by the NetworkCentrality and NDVI calculator Modules. If we are running in a Pipeline, get the expected set of date strings from the vegetation sequence we depend on, and if there is no data for a particular date, make a null entry in the output.
-
set_default_parameters
()[source]¶ See if we can set our input directories from the output directories of previous Sequences in the pipeline. The pipeline (if there is one) will be a grandparent, i.e. self.parent.parent and the names of the Sequences we will want to combine should be in the variable self.depends_on.
-
pyveg.src.coordinate_utils module¶
Collection of utility functions for manipulating coordinates and their string representations.
-
pyveg.src.coordinate_utils.
coords_dict_to_coords_string
(coords)[source]¶ Given a dict of long/lat values, return a string, rounding to 2 decimal places.
-
pyveg.src.coordinate_utils.
coords_list_to_coords_string
(coords)[source]¶ Given a list or tuple of [long, lat], return a string, rounding to 2 decimal places.
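A minimal sketch of turning a [long, lat] pair into a string rounded to 2 decimal places is shown below. The underscore separator and the helper name `coords_to_string_sketch` are assumptions. Only the rounding to two decimals is taken from the description above.

```python
def coords_to_string_sketch(coords):
    """Format [long, lat] as 'long_lat' with two decimal places each."""
    lon, lat = coords
    return f"{lon:.2f}_{lat:.2f}"


label = coords_to_string_sketch([27.951, 11.578])
```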
-
pyveg.src.coordinate_utils.
find_coords_string
(file_path)[source]¶ Parse a file path using a regular expression to find a substring that looks like a set of coordinates, and return that.
-
pyveg.src.coordinate_utils.
get_region_string
(coords, region_size)[source]¶ Given a set of (long,lat) coordinates, and the size of a square region in long,lat space, return a string in the format expected by GEE.
- Parameters
coords (list of floats, [longitude,latitude]) –
region_size (float, size of each side of the region, in degrees) –
- Returns
region_string – String representation of a list of four coordinates, representing the four corners of the region.
- Return type
str
-
pyveg.src.coordinate_utils.
get_sub_image_coords
(coords, region_size, x_parts, y_parts)[source]¶ If an image is divided into sub_images, return a list of coordinates for all the sub-images.
- Parameters
coords (list of floats, [long,lat]) –
region_size (float, size of square image in degrees long,lat) –
x_parts (int, number of sub-images in x-direction) –
y_parts (int, number of sub-images in y-direction) –
- Returns
sub_image_coords
- Return type
list, of lists of floats [[long,lat],..]
pyveg.src.data_analysis_utils module¶
Data analysis code including functions to read the .json results file, and functions to analyse and plot the data.
-
pyveg.src.data_analysis_utils.
ar1_moving_average_time_series
(series, length=1)[source]¶ Calculate an AR1 time series using a moving average
- Parameters
series (pandas Series) – Time series observations.
length (int) – Length of the moving window in number of observations.
- Returns
pandas Series with datetime index, and one column, one row per date
- Return type
pandas Series
-
pyveg.src.data_analysis_utils.
calculate_ci
(data, ci_level=0.99)[source]¶ Calculate the confidence interval on the mean for a set of data.
- Parameters
data (Series) – Series of data to calculate the confidence interval of the mean.
ci_level (float, optional) – Size of the confidence interval to calculate.
- Returns
Confidence interval value where the CI is [mu - h, mu + h], where mu is the mean.
- Return type
float
-
pyveg.src.data_analysis_utils.
cball
(x=range(1, 13), alpha=1.5, n=150.0, xbar=8.0, sigma=2.0)[source]¶ Calculates the Crystal Ball pdf on the values 1 to 12 by default (i.e. monthly). Default parameter values give a fit close to those we would expect from offset50 time series.
- Parameters
x (Time series) – Index values going from 1 to the length of the annual time series.
alpha (Model parameter) – Parameter used in Crystal Ball pdf calculation.
n (Model parameter) – Parameter used in Crystal Ball pdf calculation.
xbar (Model parameter) – Parameter used in Crystal Ball pdf calculation.
sigma (Model parameter) – Parameter used in Crystal Ball pdf calculation.
- Returns
The values of the Crystal Ball pdf for each index of x
- Return type
ndarray
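For reference, the textbook Crystal Ball density (a Gaussian core joined to a power-law tail on the low side) can be written with the standard library alone. This is a sketch of the standard formula, evaluated one point at a time. The function name `crystal_ball_pdf` is hypothetical, and the package's cball may differ in normalisation or vectorisation.

```python
import math


def crystal_ball_pdf(x, alpha=1.5, n=150.0, xbar=8.0, sigma=2.0):
    """Normalised Crystal Ball density at a single point x (alpha != 0, n > 1)."""
    a = abs(alpha)
    t = (x - xbar) / sigma
    # Normalisation constant N = 1 / (sigma * (C + D)).
    c = (n / a) / (n - 1.0) * math.exp(-0.5 * a * a)
    d = math.sqrt(math.pi / 2.0) * (1.0 + math.erf(a / math.sqrt(2.0)))
    norm = 1.0 / (sigma * (c + d))
    if t > -a:
        return norm * math.exp(-0.5 * t * t)  # Gaussian core
    # Power-law tail A * (B - t)**(-n) with A = (n/a)**n * exp(-a*a/2),
    # written as a ratio so that (n/a)**n is never formed explicitly.
    b = n / a - a
    return norm * math.exp(-0.5 * a * a) * ((n / a) / (b - t)) ** n


monthly = [crystal_ball_pdf(month) for month in range(1, 13)]
```

With the default parameters the curve peaks at month 8 (the value of xbar) and switches from the Gaussian core to the tail at x = xbar - alpha * sigma = 5.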
-
pyveg.src.data_analysis_utils.
cball_parfit
(p0, timeseries, plot_name='CB_fit.png', output_dir='')[source]¶ Uses least squares regression to optimise the parameters in cball to fit the supplied time series. The supplied time series should be the original series, as this function finds the mean annual time series, then reverses and normalises it.
- Parameters
p0 (list) – Initial estimate of the parameters (alpha, n, xbar, sigma) to use in the Crystal Ball calculation.
timeseries – Original time series, from which the mean annual time series is calculated, reversed and normalised, and then used to optimise the parameters.
plot_name (string) – Name for the data/fit comparison plot
output_dir (str) – Directory to save the plots in.
- Returns
ndarray – A list of optimised parameters (alpha, n, xbar, sigma)
int – An indication of whether the optimisation worked (an output of 1, 2, 3 or 4 means ok)
float – The residuals from the best CB fit
-
pyveg.src.data_analysis_utils.
coarse_dataframe
(geodf, side_square)[source]¶ Coarsen the granularity of a dataframe by grouping lat,long points that are close to each other in a square of side L = side_square.
- Parameters
geodf (Dataframe) – Input dataframe.
side_square (integer) – Side of the square.
- Returns
A coarser dataframe
- Return type
A dataframe
-
pyveg.src.data_analysis_utils.
convert_to_geopandas
(df)[source]¶ Given a pandas DataFrame with lat and long columns, convert to geopandas DataFrame.
- Parameters
df (DataFrame) – Pandas DataFrame with lat and long columns.
- Return type
geopandas DataFrame
-
pyveg.src.data_analysis_utils.
create_lat_long_metric_figures
(geodf, metric, output_dir)[source]¶ From an input dataframe with processed network metrics, create a 2D grid figure for each available date using Geopandas.
- Parameters
geodf (GeoDataframe) – Input dataframe.
metric (string) – Variable to plot.
output_dir – Directory to save the figures.
-
pyveg.src.data_analysis_utils.
decay_rate
(x, resolution=12, method='basic')[source]¶ Calculates the decay rate between the max and min values of a time series.
- Parameters
x – Time series to calculate decay rate on. mean_annual_ts is calculated on this series within this function, so the raw time series is expected.
resolution (int) – Number of values each year in a time series (12 is monthly for example)
method ('basic' (default) or 'adjusted') – A choice on whether to calculate the decay rate on the mean annual time series calculated within the function, or to adjust the time series such that the min value is set to 1 by subtracting the minimum of the mean annual time series and adding 1 (useful for offset50 values)
- Returns
The decay rate value
- Return type
float
-
pyveg.src.data_analysis_utils.
early_warnings_null_hypothesis
(series, indicators=['var', 'ac'], roll_window=0.4, smooth='Lowess', span=0.1, band_width=0.2, lag_times=[1], n_simulations=1000)[source]¶ Function to estimate the significance of the early warnings analysis by performing a null hypothesis test. The function estimates distributions of trends in early warning indicators from different surrogate time series generated after fitting an ARMA(p,q) model on the original data. The trends are estimated by the nonparametric Kendall tau correlation coefficient and can be compared to the trends estimated in the original time series to produce probabilities of false positives. The function returns a dataframe that contains the Kendall tau rank correlation estimates for the original data and surrogates.
- Parameters
series (pandas Series) – Time series observations.
indicators (list of strings) – The statistics (leading indicators) selected for which the sensitivity analysis is performed.
roll_window – Rolling window size as a proportion of the length of the time-series data.
smooth (string) – Type of detrending. It can be {‘Gaussian’, ‘Lowess’, ‘None’}.
span (float) – Span of time-series data used for Lowess filtering. Taken as a proportion of time-series length if in (0,1), otherwise taken as absolute.
band_width (float) – Bandwidth of Gaussian kernel. Taken as a proportion of time-series length if in (0,1), otherwise taken as absolute.
lag_times (list of int) – List of lag times at which to compute autocorrelation.
n_simulations (int) – The number of surrogate data. Default is 1000.
- Returns
A dataframe that contains the Kendall tau rank correlation estimates for each indicator estimated on each surrogate dataset.
- Return type
DataFrame
-
pyveg.src.data_analysis_utils.
early_warnings_sensitivity_analysis
(series, indicators=['var', 'ac'], winsizerange=[0.1, 0.8], incrwinsize=0.1, smooth='Gaussian', bandwidthrange=[0.05, 1.0], spanrange=[0.05, 1.1], incrbandwidth=0.2, incrspanrange=0.1)[source]¶ Function to estimate the sensitivity of the early warnings analysis to the smoothing and window size used. The function returns a dataframe that contains the Kendall tau rank correlation estimates for the rolling window sizes (winsize variable) and bandwidths or span sizes depending on the de-trending (smooth variable). This function is inspired by the sensitivity_ews.R function from Vasilis Dakos and Leo Lahti in the early-warnings-R package: https://github.com/earlywarningtoolbox/earlywarnings-R.
- Parameters
series (pandas Series) – Time series observations.
indicators (list of strings) – The statistics (leading indicators) selected for which the sensitivity analysis is performed.
winsizerange (list of float) – Range of the rolling window sizes expressed as ratio of the timeseries length (must be numeric between 0 and 1). Default is 0.1 - 0.8.
incrwinsize (float) – Increments the rolling window size (must be numeric between 0 and 1). Default is 0.1.
smooth (string) – Type of detrending. It can be {‘Gaussian’, ‘Lowess’, ‘None’}.
bandwidthrange (list of float) – Range of the bandwidth used for the Gaussian kernel when Gaussian filtering is selected. It is expressed as percentage of the timeseries length (must be numeric between 0 and 100). Default is 5% - 100%.
spanrange (list of float) – Parameter that controls the degree of Lowess smoothing (numeric between 0 and 1). Default is 0.05 - 1.1.
incrbandwidth (float) – Size to increment the bandwidth used for the Gaussian kernel when Gaussian filtering is applied. It is expressed as percentage of the timeseries length (must be numeric between 0 and 1). Default is 0.2.
incrspanrange (float) – Size to increment the span used for the Lowess smoothing.
- Returns
A dataframe that contains the Kendall tau rank correlation estimates for the rolling window sizes (winsize variable) and bandwidths or span sizes depending on the de-trending (smooth variable).
- Return type
DataFrame
-
pyveg.src.data_analysis_utils.
err_func
(params, ts)[source]¶ Calculates the difference between the cball function with the supplied params and a supplied time series of the same length. err_func is used within the cball_parfit function, where the full time series needs to be supplied.
- Parameters
params – Parameters used in Crystal Ball pdf calculation: alpha, n, xbar, sigma.
ts (Time series) – Time series to compare output of cball function to
- Returns
Residuals/differences between Crystal Ball pdf and supplied time series
- Return type
ndarray
-
pyveg.src.data_analysis_utils.
exp_model_fit
(x, resolution=12, method='basic')[source]¶ Fits an exponential model from the maximum to the minimum of the mean annual time series. A raw time series is expected as an input.
- Parameters
x – Time series to calculate decay rate on. mean_annual_ts is calculated on this series within this function, so the raw time series is expected.
resolution (int) – Number of values each year in a time series (12 is monthly for example)
method ('basic' (default) or 'adjusted') – A choice on whether to fit the exponential model on the mean annual time series calculated within the function, or to adjust the time series such that the min value is set to 1 by subtracting the minimum of the mean annual time series and adding 1 (useful for offset50 values)
- Returns
The coefficient values from the exponential model fit
- Return type
ndarray
-
pyveg.src.data_analysis_utils.
fft_series
(time_series)[source]¶ Perform Fast Fourier Transform on an input series (assume one row per day).
- Parameters
time_series (pandas Series) – Series with one row per day and a datetime index (which is ignored).
- Returns
xvals, yvals – Arrays of frequencies (1/day) and strengths in frequency space, ready to be plotted directly in a matplotlib plot.
- Return type
np.array
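To show what "frequencies and strengths" means for an evenly sampled series, here is a direct discrete Fourier transform written with the standard library. It is an O(n²) illustration rather than a fast transform, and the function name `dft_magnitudes` is hypothetical. In practice a library routine such as numpy.fft.rfft would normally be used.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return (frequencies, magnitudes) for the non-negative frequencies.

    Frequencies are in cycles per sample, so with one row per day they
    are in units of 1/day.
    """
    n = len(samples)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples)
        )
        freqs.append(k / n)
        mags.append(abs(coeff))
    return freqs, mags


# A sine wave with a period of 4 samples, observed for 16 samples.
samples = [math.sin(2 * math.pi * t / 4) for t in range(16)]
freqs, mags = dft_magnitudes(samples)
peak = max(range(len(mags)), key=mags.__getitem__)
```

The strongest component sits at a frequency of 0.25 cycles per sample, the reciprocal of the 4-sample period.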
-
pyveg.src.data_analysis_utils.
get_AR1_parameter_estimate
(ys)[source]¶ Fit an AR(1) model to the time series data and return the associated parameter of the model.
- Parameters
ys (array) – Input time series data.
- Returns
float – The parameter value of the AR(1) model.
float – The parameter standard error
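The AR(1) parameter can be approximated by regressing each observation on the one before it. The sketch below does this with ordinary least squares in plain Python and returns only the point estimate. The function name `ar1_coefficient` is hypothetical, and the package's fitting method and standard error calculation are not reproduced here.

```python
def ar1_coefficient(ys):
    """Least-squares slope of y[t] regressed on y[t-1] (with an intercept)."""
    previous = ys[:-1]
    current = ys[1:]
    mean_prev = sum(previous) / len(previous)
    mean_curr = sum(current) / len(current)
    covariance = sum(
        (p - mean_prev) * (c - mean_curr) for p, c in zip(previous, current)
    )
    variance = sum((p - mean_prev) ** 2 for p in previous)
    return covariance / variance


# Each value is exactly half the previous one, so the slope is 0.5.
halving = [0.5 ** k for k in range(10)]
phi = ar1_coefficient(halving)
```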
-
pyveg.src.data_analysis_utils.
get_ar1_var_timeseries_df
(series, window_size=0.5)[source]¶ Given a time series, calculate AR1 and variance using a moving window. Put the two resulting time series into a new DataFrame and return the result.
- Parameters
series (pandas Series) – Time series observations.
window_size (float, optional) – Size of the moving window as a fraction of the time series length.
- Returns
The AR1 and variance results in a time series dataframe.
- Return type
DataFrame
-
pyveg.src.data_analysis_utils.
get_confidence_intervals
(df, column, ci_level=0.99)[source]¶ Calculate the confidence interval at each time point of a DataFrame containing data for a large image.
- Parameters
df (DataFrame) – Time series data for multiple sub-image locations.
column (str) – Name of the column to calculate the CI of.
ci_level (float, optional) – Size of the confidence interval to calculate.
- Returns
Time series data for multiple sub-image locations with added column for the ci.
- Return type
DataFrame
-
pyveg.src.data_analysis_utils.
get_correlation_lag_ts
(series_A, series_B, window_size=0.5)[source]¶ Given two time series and a lag between them, calculate the lagged correlation between the two time series using a moving window. Additionally calculate the lag of the maximum precipitation using the moving window.
- Parameters
series_A (pandas Series) – Observations of the first time series.
series_B (pandas Series) – Observations of the second time series.
window_size (float, optional) – Size of the moving window as a fraction of the time series length.
- Returns
Lagged corrleation and lag which maximises the correlation time series.s
- Return type
DataFrame
-
pyveg.src.data_analysis_utils.
get_datetime_xs
(df)[source]¶ Return the date column of df as datetime objects.
-
pyveg.src.data_analysis_utils.
get_kendell_tau
(ys)[source]¶ Kendall’s tau gives information about the trend of the time series. It is just a rank correlation test with one variable being time (or the vector 1 to the length of the time series), and the other variable being the data itself. A tau value of 1 means that the time series is always increasing, whereas -1 means always decreasing, and 0 signifies no overall trend.
- Parameters
ys (array) – Input time series data.
- Returns
float – The value of tau.
float – The p value of the rank correlation test.
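The description above can be illustrated with a small pure-Python sketch that counts concordant and discordant pairs against the time index. This is the simple "tau-a" variant with no correction for ties and no p-value, and the name `kendall_tau_vs_time` is hypothetical; in practice `scipy.stats.kendalltau` computes both the statistic and the p-value.

```python
def kendall_tau_vs_time(ys):
    """Kendall's tau-a of a series against its own time index (hypothetical helper)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least 2 observations")
    concordant = 0
    discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Time always increases from i to j, so only the data ordering matters.
            if ys[j] > ys[i]:
                concordant += 1
            elif ys[j] < ys[i]:
                discordant += 1
    n_pairs = n * (n - 1) / 2
    return (concordant - discordant) / n_pairs


increasing = kendall_tau_vs_time([1, 2, 3, 4])
decreasing = kendall_tau_vs_time([4, 3, 2, 1])
```

A strictly increasing series gives 1 and a strictly decreasing one gives -1, matching the interpretation given in the docstring.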
-
pyveg.src.data_analysis_utils.
get_max_lagged_cor
(dirname, veg_prefix)[source]¶ Convenience function which returns the maximum correlation as a function of lag (using a file saved earlier).
- Parameters
dirname (str) – Path to the analysis/ directory of the current analysis job.
veg_prefix – Compact representation of the satellite collection name used to obtain vegetation data.
- Returns
Max correlation, and lag, for smoothed and unsmoothed vegetation time series.
- Return type
tuple
-
pyveg.src.data_analysis_utils.
mean_annual_ts
(x, resolution=12)[source]¶ Calculate mean annual time series from time series. Also fills in missing values by linear interpolation. NB: fails if there is a missing value at the start or end.
- Parameters
x (Time series) – Time series to calculate the mean annual time series for.
resolution (float) – Number of values each year in a time series (12 is monthly, for example).
- Returns
Array of length equal to resolution that is the mean annual time series
- Return type
ndarray
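As a rough illustration of what a "mean annual time series" is, the sketch below averages all values that fall at the same position within the year. It assumes the series starts at the first position of a year and contains no missing values, so the interpolation step is left out; `mean_annual` is a hypothetical name, not the pyveg function.

```python
def mean_annual(xs, resolution=12):
    """Average the values sharing the same within-year position (hypothetical helper)."""
    if len(xs) < resolution:
        raise ValueError("need at least one full year of data")
    totals = [0.0] * resolution
    counts = [0] * resolution
    for i, value in enumerate(xs):
        slot = i % resolution  # position within the year
        totals[slot] += value
        counts[slot] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two "years" of four values each.
profile = mean_annual([1, 2, 3, 4, 3, 4, 5, 6], resolution=4)
```

The result always has length `resolution`, as stated in the Returns field above.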
-
pyveg.src.data_analysis_utils.
moving_window_analysis
(df, output_dir, window_size=0.5)[source]¶ Run moving window AR1 and variance calculations for several input time series.
- Parameters
df (DataFrame) – Input time series DataFrame containing several time series.
output_dir (str) – Path to the output plotting directory.
window_size (float, optional) – Size of the moving window as a fraction of the time series length.
- Returns
AR1 and variance time-series for each of the input time series.
- Return type
DataFrame
-
pyveg.src.data_analysis_utils.
network_figure
(df, date, metric, vmin, vmax, output_dir)[source]¶ Make a 2D heatmap plot with network centrality measures.
- Parameters
df (DataFrame) – Input dataframe.
date (str) – Date to be plotted.
metric (str) – Which metric is going to be plotted.
vmin (int) – Colorbar minimum value.
vmax (int) – Colorbar maximum value.
output_dir (str) – Directory in which to save the plots.
-
pyveg.src.data_analysis_utils.
reverse_normalise_ts
(x)[source]¶ Takes what is expected to be a mean annual time series (from mean_annual_ts), arranges it so the first value is the last, reverses it and then normalises it. It is to be used within cball function below.
- Parameters
x (time series) – Time series to reverse and normalise. Assumed to be the output of mean_annual_ts.
- Returns
The reversed and normalised time series
- Return type
ndarray
-
pyveg.src.data_analysis_utils.
stl_decomposition
(series, period=12)[source]¶ Run STL decomposition on a pandas Series object.
- Parameters
series (Series object) – The observations to be deseasonalised.
period (int, optional) – Length of the seasonal period in observations.
-
pyveg.src.data_analysis_utils.
variance_moving_average_time_series
(series, length)[source]¶ Calculate a variance time series using a moving average.
- Parameters
series (pandas Series) – Time series observations.
length (int) – Length of the moving window in number of observations.
- Returns
pandas Series with datetime index, and one column, one row per date.
- Return type
pandas Series
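The moving-window variance can be sketched without pandas as follows. The sketch uses the sample variance over each full window and drops the leading positions where the window is incomplete; both choices are assumptions, and `rolling_variance` is a hypothetical name.

```python
import statistics


def rolling_variance(values, length):
    """Sample variance over each full window of `length` observations (hypothetical helper)."""
    if length < 2 or length > len(values):
        raise ValueError("length must be between 2 and len(values)")
    return [
        statistics.variance(values[end - length:end])
        for end in range(length, len(values) + 1)
    ]


variances = rolling_variance([1, 2, 3, 4], 2)
```

With pandas, the equivalent idiom would be `series.rolling(length).var()`, which keeps the original index and fills the incomplete leading windows with NaN.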
pyveg.src.date_utils module¶
Useful functions for manipulating dates and date strings, e.g. splitting a period into sub-periods.
When dealing with date strings, ALWAYS use the ISO format YYYY-MM-DD
-
pyveg.src.date_utils.
assign_dates_to_tasks
(date_list, n_tasks)[source]¶ For batch jobs, will want to split dates as evenly as possible over some number of tasks.
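One way to split a list of dates as evenly as possible is to give every task the same base number of dates and hand the remainder out one at a time to the first few tasks. The sketch below does this with contiguous blocks; whether pyveg assigns contiguous blocks is an assumption, and `split_evenly` is a hypothetical name.

```python
def split_evenly(items, n_tasks):
    """Split items into n_tasks contiguous chunks whose sizes differ by at most one."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    base, extra = divmod(len(items), n_tasks)
    chunks = []
    start = 0
    for task_index in range(n_tasks):
        # The first `extra` tasks each take one additional item.
        size = base + (1 if task_index < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


dates = ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"]
assignments = split_evenly(dates, 2)
```

If there are more tasks than dates, the trailing chunks are empty lists.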
-
pyveg.src.date_utils.
find_mid_period
(start_date, end_date)[source]¶ Given two strings in the format YYYY-MM-DD return a string in the same format representing the middle (to the nearest day)
- Parameters
start_date (str, date in format YYYY-MM-DD) –
end_date (str, date in format YYYY-MM-DD) –
- Returns
mid_date
- Return type
str, mid point of those dates, format YYYY-MM-DD
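A minimal sketch of the mid-point calculation using only the standard library is shown below. It rounds down when the interval contains an odd number of days, which is an assumption about how "to the nearest day" is resolved; `mid_date` is a hypothetical name.

```python
from datetime import date, timedelta


def mid_date(start_date, end_date):
    """Return the ISO date halfway between two ISO dates, rounding down (hypothetical helper)."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    half = (end - start).days // 2  # whole days only
    return (start + timedelta(days=half)).isoformat()


middle = mid_date("2020-01-01", "2020-01-31")
```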
-
pyveg.src.date_utils.
get_date_range_for_collection
(date_range, coll_dict)[source]¶ Return the intersection of the date range asked for by the user, and the min and max dates for that collection.
- Parameters
date_range (list or tuple of strings, format YYYY-MM-DD) –
coll_dict (dictionary containing min_date and max_date keys) –
- Returns
- Return type
tuple of strings, format YYYY-MM-DD
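Because ISO-format date strings sort in chronological order, the intersection can be sketched with plain string comparison. The `min_date` and `max_date` keys come from the description above; the function name and the choice to raise an error when the ranges do not overlap are assumptions.

```python
def intersect_date_ranges(date_range, coll_dict):
    """Intersect a requested (start, end) range with a collection's available dates."""
    # ISO YYYY-MM-DD strings compare correctly as plain strings.
    start = max(date_range[0], coll_dict["min_date"])
    end = min(date_range[1], coll_dict["max_date"])
    if start > end:
        raise ValueError("requested range does not overlap the collection")
    return (start, end)


available = intersect_date_ranges(
    ("2015-01-01", "2020-01-01"),
    {"min_date": "2016-06-01", "max_date": "2019-12-31"},
)
```

This is also one practical reason for the rule at the top of this module to always use the YYYY-MM-DD format.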
-
pyveg.src.date_utils.
get_date_strings_for_time_period
(start_date, end_date, period_length)[source]¶ Use the two functions above to slice a time period into sub-periods, then find the mid-date of each of these.
- Parameters
start_date (str, format YYYY-MM-DD) –
end_date (str, format YYYY-MM-DD) –
period_length (str, format '<integer><d|w|m|y>', e.g. 30d) –
- Returns
periods – each of which is the mid-point of a sub-period
- Return type
list of strings in format YYYY-MM-DD,
-
pyveg.src.date_utils.
get_num_n_day_slices
(start_date, end_date, days_per_chunk)[source]¶ Divide the full period between the start_date and end_date into n equal-length (to the nearest day) chunks. The size of the chunk is defined by days_per_chunk. Takes start_date and end_date as strings ‘YYYY-MM-DD’. Returns an integer with the number of possible points available in that time period.
-
pyveg.src.date_utils.
get_time_diff
(date1, date2, units='years')[source]¶ Calculate the time difference between two dates.
- Parameters
date1 (str, date in format YYYY-MM-DD) –
date2 (str, date in format YYYY-MM-DD) –
units (str, can be “years”, “months”, “days”) –
- Returns
time_diff
- Return type
int, difference in times, in specified units
-
pyveg.src.date_utils.
slice_time_period
(start_date, end_date, period_length)[source]¶ Slice a time period into chunks, whose length is determined by the period_length, which will be e.g. ‘30d’ for 30 days, or ‘1m’ for one month.
- Parameters
start_date (str, format YYYY-MM-DD) –
end_date (str, format YYYY-MM-DD) –
period_length (str, format '<integer><d|w|m|y>', e.g. 30d) –
- Returns
periods – each of which is the start and end of a sub-period
- Return type
list of lists of strings in format YYYY-MM-DD,
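The slicing can be sketched for the fixed-length units as follows. This sketch handles only the 'd' and 'w' suffixes, lets adjacent sub-periods share a boundary date, and drops any incomplete chunk at the end; all three are assumptions rather than documented pyveg behaviour, and `slice_by_days` is a hypothetical name.

```python
from datetime import date, timedelta


def slice_by_days(start_date, end_date, period_length):
    """Slice [start_date, end_date] into [start, end] pairs of ISO date strings."""
    count = int(period_length[:-1])
    unit = period_length[-1]
    if count < 1:
        raise ValueError("period length must be positive")
    if unit == "d":
        step = timedelta(days=count)
    elif unit == "w":
        step = timedelta(weeks=count)
    else:
        raise ValueError("this sketch only handles 'd' and 'w' units")
    end = date.fromisoformat(end_date)
    current = date.fromisoformat(start_date)
    periods = []
    # Keep only chunks that fit completely before end_date.
    while current + step <= end:
        periods.append([current.isoformat(), (current + step).isoformat()])
        current += step
    return periods


periods = slice_by_days("2020-01-01", "2020-01-31", "10d")
```

Month and year suffixes need calendar-aware arithmetic (for example `dateutil.relativedelta`), since those periods do not have a fixed number of days.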
pyveg.src.download_modules module¶
pyveg.src.file_utils module¶
-
pyveg.src.file_utils.
consolidate_json_to_list
(json_dir, output_dir=None, output_filename=None)[source]¶ Load all the json files (e.g. from individual sub-images), and return a list of dictionaries, to be written out into one json file.
- Parameters
json_dir (str, full path to directory containing temporary json files) –
output_dir (str, full path to desired output directory.) – Can be None, in which case no output written to disk.
output_filename (str, name of the output json file.) – Can be None, in which case no output written to disk.
- Returns
results
- Return type
list of dicts.
-
pyveg.src.file_utils.
construct_filename_from_metadata
(metadata, suffix)[source]¶ Given a dictionary of metadata, construct a filename. Will be used for the results summary json, and the summary stats csv as they are uploaded to Zenodo.
-
pyveg.src.file_utils.
construct_image_savepath
(output_dir, collection_name, coords, date_range, image_type)[source]¶ Function to abstract output image filename construction. Current approach is to create a new dir inside output_dir for the satellite, and then save date and coordinate stamped images in this dir.
-
pyveg.src.file_utils.
download_and_unzip
(url, output_tmpdir)[source]¶ Given a URL from GEE, download it (will be a zipfile) to a temporary directory, then extract archive to that same dir. Then find the base filename of the resulting .tif files (there should be one-file-per-band) and return that.
- Parameters
url (str, URL of zipfile on GEE server.) –
output_tmpdir (str, full path of directory into which to unpack zipfile.) –
- Returns
tif_filenames
- Return type
list of strings, the full paths to unpacked tif files.
-
pyveg.src.file_utils.
get_filepath_after_directory
(path, dirname, include_dirname=False)[source]¶ Return part of a filepath from a certain point onwards. e.g. if we have path /a/b/c/d/e/f and we say dirname=c, then this will return d/e/f if include_dirname==False, or c/d/e/f if it is True.
- Parameters
path (str, full filepath) –
dirname (str, delimiter, from where we will take the remaining filepath) –
include_dirname (bool, if True, the returned path will have dirname as its root.) –
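The /a/b/c/d/e/f example above can be reproduced with a short sketch. It assumes "/" separators and splits at the first path component equal to dirname; `path_after_directory` is a hypothetical name.

```python
def path_after_directory(path, dirname, include_dirname=False):
    """Return the part of a '/'-separated path that follows the component dirname."""
    parts = path.split("/")
    if dirname not in parts:
        raise ValueError(f"{dirname!r} is not a component of {path!r}")
    index = parts.index(dirname)  # first occurrence
    if not include_dirname:
        index += 1
    return "/".join(parts[index:])


tail = path_after_directory("/a/b/c/d/e/f", "c")
```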
-
pyveg.src.file_utils.
save_image
(image, output_dir, output_filename, verbose=False)[source]¶ Given a PIL.Image (list of pixel values), save to requested filename - note that the file extension will determine the output file type, can be .png, .tif, probably others…
pyveg.src.gee_interface module¶
pyveg.src.image_utils module¶
Modify and slice up tif and png images using the Python Imaging Library.
Needs a relatively recent version of pillow (fork of PIL):
`
pip install --upgrade pillow
`
-
pyveg.src.image_utils.
adaptive_threshold
(img)[source]¶ Threshold a grayscale image using the mean pixel value of a local area to set the threshold at each pixel location. Currently, pixels brighter than the local average are set to the maximum (255) and pixels darker than the local average are set to the minimum.
- Parameters
img – 2D numpy array representing a grayscale image.
- Returns
Thresholded image.
-
pyveg.src.image_utils.
check_image_ok
(rgb_image, black_pix_threshold=0.05)[source]¶ Check the quality of an RGB image. Currently checking if we have > X% pixels being masked. This indicates problems with cloud masking in previous steps.
- Parameters
rgb_image (Pillow.Image) – Input image to check the quality of
- Returns
True if image passes quality requirements, else False.
- Return type
bool
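The quality check can be sketched on a plain nested list of (r, g, b) tuples. It assumes that masked pixels appear as pure black, which the parameter name black_pix_threshold suggests but the text does not state; the function names are hypothetical and no Pillow calls are used.

```python
def fraction_black(pixels):
    """Fraction of pure-black pixels in rows of (r, g, b) tuples (hypothetical helper)."""
    flat = [pixel for row in pixels for pixel in row]
    if not flat:
        raise ValueError("image has no pixels")
    return sum(1 for pixel in flat if pixel == (0, 0, 0)) / len(flat)


def image_looks_ok(pixels, black_pix_threshold=0.05):
    """True if the share of black (assumed masked) pixels does not exceed the threshold."""
    return fraction_black(pixels) <= black_pix_threshold


image = [
    [(0, 0, 0), (10, 20, 30)],
    [(5, 5, 5), (200, 200, 200)],
]
ok = image_looks_ok(image)
```

Here one pixel in four is black, so the image fails the default 5% threshold.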
-
pyveg.src.image_utils.
combine_tif
(band_dict)[source]¶ Read tif files - one per specified band, and rescale and combine pixel values to r,g,b values between 0 and 255 in a combined output image.
- Parameters
band_dict (dict, format {'<r|g|b>': {'band': <band_name>, 'filename': <filename>}}) –
- Returns
new_img
- Return type
PIL Image, 8-bit rgb image.
-
pyveg.src.image_utils.
compare_binary_image_files
(filename1, filename2)[source]¶ Wrapper for compare_binary_images that opens and closes the image files.
-
pyveg.src.image_utils.
compare_binary_images
(image1, image2)[source]¶ Return the fraction of pixels that are the same in the two images.
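For two binary images represented as equally sized nested lists of pixel values, the comparison can be sketched as below. The list-of-lists representation and the name `fraction_matching` are assumptions for illustration; the pyveg function operates on image objects.

```python
def fraction_matching(image1, image2):
    """Fraction of pixel positions at which two equally sized images agree."""
    if len(image1) != len(image2) or any(
        len(row1) != len(row2) for row1, row2 in zip(image1, image2)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in image1)
    if total == 0:
        raise ValueError("images have no pixels")
    same = sum(
        1
        for row1, row2 in zip(image1, image2)
        for pix1, pix2 in zip(row1, row2)
        if pix1 == pix2
    )
    return same / total


similarity = fraction_matching([[0, 255], [255, 0]], [[0, 255], [0, 0]])
```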
-
pyveg.src.image_utils.
convert_to_bw
(input_image, threshold, invert=False)[source]¶ Given an RGB input, apply a threshold to each pixel. If pix(r,g,b)>threshold, set to 255,255,255, if <threshold, set to 0,0,0
-
pyveg.src.image_utils.
convert_to_rgb
(band_dict)[source]¶ If we are given three or more bands, interpret the first as red, the second as green, the third as blue, and scale them to be between 0 and 255 using the combine_tif function. If we are only given one band, use the scale_tif function to scale the range of input values to between 0 and 255 then apply this to all of r,g,b
- Parameters
band_dict (dict, format {'<r|g|b|rgb>': {'band': <band_name>, 'filename': <filename>}}) –
-
pyveg.src.image_utils.
create_gif_from_images
(directory_path, output_name, string_in_filename='')[source]¶ Loop through a directory and convert all images in it into a gif chronologically
- Parameters
directory_path – directory where all the files are.
output_name – name to be given to the output gif
string_in_filename – select only files that contain a particular string; the default is “”, which selects all files in the directory
- Returns
-
pyveg.src.image_utils.
crop_and_convert_all
(input_dir, output_dir, threshold=470, num_x=50, num_y=50)[source]¶ Loop through a whole directory and crop and convert to black+white all files within it.
-
pyveg.src.image_utils.
crop_and_convert_to_bw
(input_filename, output_dir, threshold=470, num_x=50, num_y=50)[source]¶ Open an image file, convert to monochrome, and crop into sub-images.
-
pyveg.src.image_utils.
crop_image_nparts
(input_image, n_parts_x, n_parts_y=None)[source]¶ Divide an image into n_parts_x*n_parts_y equal smaller sub-images.
-
pyveg.src.image_utils.
crop_image_npix
(input_image, n_pix_x, n_pix_y=None, region_size=None, coords=None)[source]¶ Divide an image into smaller sub-images with fixed pixel size. If region_size and coordinates are provided, we want to return the coordinates of the sub-images along with the sub-images themselves.
-
pyveg.src.image_utils.
hist_eq
(img, clip_limit=2)[source]¶ Perform contrast limited local histogram equalisation on an input image.
- Parameters
img – 2D numpy array representing a grayscale image.
clip_limit – Controls the strength of the equalisation.
- Returns
2D numpy array representing the equalised image.
-
pyveg.src.image_utils.
image_all_same_colour
(image, colour=(255, 255, 255), threshold=0.99)[source]¶ Return True if all (or nearly all) pixels are the same colour.
-
pyveg.src.image_utils.
image_file_all_same_colour
(image_filename, colour=(255, 255, 255), threshold=0.99)[source]¶ Wrapper for image_all_same_colour that opens and closes the image file.
-
pyveg.src.image_utils.
image_file_to_array
(input_filename)[source]¶ Read an image file and convert to a 2D numpy array, with values 0 for background pixels and 255 for signal. Assume that the input image has only two colours, and take the one with higher sum(r,g,b) to be “signal”.
-
pyveg.src.image_utils.
image_from_array
(input_array, output_size=None, sel_val=200)[source]¶ Convert a 2D numpy array of values into an image where each pixel has r,g,b set to the corresponding value in the array. If an output size is specified, rescale to this size.
-
pyveg.src.image_utils.
invert_binary_image
(image)[source]¶ Swap (255,255,255) with (0,0,0) for all pixels
-
pyveg.src.image_utils.
median_filter
(img, r=3)[source]¶ Convolve a median filter over the image.
- Parameters
img – 2D numpy array representing a grayscale image.
r – The size of the grid to convolve.
- Returns
2D numpy array representing the smoothed image.
-
pyveg.src.image_utils.
numpy_to_pillow
(numpy_image)[source]¶ Convert a 2D numpy array to a PIL Image object.
- Parameters
numpy_image – 2D numpy array to convert.
- Returns
PIL Image object.
-
pyveg.src.image_utils.
pillow_to_numpy
(pil_image)[source]¶ Convert a PIL Image object to a numpy array (used by openCV).
- Parameters
pil_image – PIL Image object to convert.
- Returns
2D or 3D numpy array (depending on input image).
-
pyveg.src.image_utils.
plot_band_values
(input_filebase, bands=['B4', 'B3', 'B2'])[source]¶ Plot histograms of the values in the chosen bands of the input image
pyveg.src.pattern_generation module¶
Translation of Matlab code to model patterned vegetation in semi-arid landscapes.
-
class
pyveg.src.pattern_generation.
PatternGenerator
[source]¶ Bases:
object
Class that can generate simulated vegetation patterns, optionally from a loaded starting pattern, and propagate through time according to various amounts of rainfall and/or surface and soil water density.
-
static
calc_plant_change
(plant_biomass, soil_water, uptake, uptake_saturation, growth_constant, senescence, grazing_loss)[source]¶ Change in plant biomass as a function of available soil water and various constants.
-
static
calc_soil_water_change
(soil_water, surface_water, plant_biomass, frac_surface_water_available, bare_soil_infilt, infilt_saturation, plant_growth, soil_water_evap, uptake_saturation)[source]¶ Change in soil water as a function of surface water, plant_biomass, and various constants.
-
static
calc_surface_water_change
(surface_water, plant_biomass, rainfall, frac_surface_water_available, bare_soil_infilt, infilt_saturation)[source]¶ Change in surface water as a function of rainfall, plant_biomass, and various constants.
-
make_binary
(threshold=None)[source]¶ if not given a threshold to use, look at the (max+min)/2 value - for anything below, set to zero, for anything above, set to 1
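The default thresholding rule can be sketched on a nested list of values. How values exactly equal to the threshold are treated is not stated, so this sketch sets them to 0 as an assumption; the standalone function form is hypothetical (the real method acts on the generator's own pattern).

```python
def make_binary(grid, threshold=None):
    """Binarise a 2D grid: values above the threshold become 1, all others 0."""
    flat = [value for row in grid for value in row]
    if threshold is None:
        # Default described in the docs: halfway between the extremes.
        threshold = (max(flat) + min(flat)) / 2
    return [[1 if value > threshold else 0 for value in row] for row in grid]


binary = make_binary([[0, 2], [5, 10]])
```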
-
static
pyveg.src.plotting module¶
Plotting code.
-
pyveg.src.plotting.
kendall_tau_histograms
(series_name, df, output_dir)[source]¶ Produce histograms of the Kendall tau distribution from surrogates for significance analysis.
- Parameters
series_name (str) – String containing data collection and time series variable.
df (Dataframe) – The output dataframe from the sensitivity analysis function.
output_dir – Path to the directory to save the produced figures
-
pyveg.src.plotting.
plot_autocorrelation_function
(df, output_dir, filename_suffix='')[source]¶ Given a time series DataFrame (constructed with make_time_series), plot the autocorrelation function of the relevant columns.
- Parameters
df (DataFrame) – Time series DataFrame.
output_dir (str) – Directory to save the plots in.
-
pyveg.src.plotting.
plot_correlation_mwa
(df, output_dir, filename_suffix='')[source]¶ Given a moving window time series DataFrame, plot the time series of veg-precip correlation.
- Parameters
df (DataFrame) – The time-series results for veg-precip correlation coeff and lag.
output_dir (str) – Directory to save the plot in.
filename_suffix (str) – Add suffix string to file name
-
pyveg.src.plotting.
plot_cross_correlations
(df, output_dir)[source]¶ Plot a scatterplot matrix showing correlations between vegetation and precipitation time series, with different lags. Additionally write out the correlations as a function of the lag for later use.
- Parameters
df (DataFrame) – Time-series data.
output_dir (str) – Directory to save the plot in.
-
pyveg.src.plotting.
plot_ews_resiliance
(series_name, EWSmetrics_df, Kendalltau_df, dates, output_dir)[source]¶ Make early warning signals resilience plots using the output from the ewstools package.
- Parameters
series_name (str) – String containing data collection and time series variable.
EWSmetrics_df (DataFrame) – DataFrame from ewstools containing ews time series.
Kendalltau_df (DataFrame) – DataFrame from ewstools containing Kendall tau values for EWSmetrics_df time series
output_dir (str) – Output dir to save plot in.
-
pyveg.src.plotting.
plot_feature_vector
(output_dir)[source]¶ Read feature vectors from csv (if they exist) and then make feature vector plots.
- Parameters
output_dir (str) – Directory to save the plot in.
-
pyveg.src.plotting.
plot_moving_window_analysis
(df, output_dir, filename_suffix='')[source]¶ Given a moving window time series DataFrame, plot the time series of AR1 and Variance.
- Parameters
df (DataFrame) – The time-series results for variance and AR1.
output_dir (str) – Directory to save the plot in.
filename_suffix (str) – Add suffix string to file name
-
pyveg.src.plotting.
plot_sensitivity_heatmap
(series_name, df, output_dir)[source]¶ Produce a heatmap plot for the sensitivity analysis.
- Parameters
df (Dataframe) – The output dataframe from the sensitivity analysis function.
output_dir – Path to the directory to save the produced figures
-
pyveg.src.plotting.
plot_stl_decomposition
(df, period, output_dir)[source]¶ Run the STL decomposition and plot the results for the network centrality and precipitation time series in df.
- Parameters
df (DataFrame) – The time-series results.
period (float) – Periodicity to model.
output_dir (str) – Directory to save the plot in.
pyveg.src.processor_modules module¶
Class for holding analysis modules that can be chained together to build a sequence.
-
class
pyveg.src.processor_modules.
NDVICalculator
(name=None)[source]¶ Bases:
pyveg.src.processor_modules.ProcessorModule
Class to look at NDVI on sub-images, and return the results as json. Note that the input directory is expected to be the level above the subdirectories for the date sub-ranges.
-
check_sub_image
(ndvi_filename, input_path)[source]¶ Check the RGB sub-image corresponding to this NDVI image looks OK.
-
process_single_date
(date_string)[source]¶ Each date will have a subdirectory called ‘SPLIT’ with ~400 NDVI sub-images.
-
-
class
pyveg.src.processor_modules.
NetworkCentralityCalculator
(name=None)[source]¶ Bases:
pyveg.src.processor_modules.ProcessorModule
Class to run network centrality calculation on small black+white images, and return the results as json. Note that the input directory is expected to be the level above the subdirectories for the date sub-ranges.
-
check_sub_image
(ndvi_filename, input_path)[source]¶ Check the RGB sub-image corresponding to this NDVI image looks OK.
-
-
class
pyveg.src.processor_modules.
ProcessorModule
(name)[source]¶ Bases:
pyveg.src.pyveg_pipeline.BaseModule
-
check_input_data_exists
(date_string)[source]¶ Processor modules will look for inputs in <input_location>/<date_string>/<input_location_subdirs> Check that the subdirs exist and are not empty.
- Parameters
date_string (str, format YYYY-MM-DD) –
- Returns
- Return type
True if input directories exist and are not empty, False otherwise.
-
check_output_data_exists
(date_string)[source]¶ Processor modules will write output to <output_location>/<date_string>/<output_location_subdirs> Check
- Parameters
date_string (str, format YYYY-MM-DD) –
- Returns
True if expected number of output files are already in output location, – AND self.replace_existing_files is set to False
False otherwise
-
get_dependent_batch_tasks
()[source]¶ When running in batch, we are likely to depend on tasks submitted by the previous Module in the Sequence. This Module should be in the “depends_on” attribute of this one.
Task dependencies will be a dict of format {“task_id”: <task_id>, “date_range”: [<dates>]}
-
run_batch
()[source]¶ Write a config json file for each set of dates. If this module depends on another module running in batch, we first get the tasks on which this module’s tasks will depend. If not, we look at the input dates subdirectories and divide them up amongst the number of batch nodes.
We want to create a list of dictionaries [{“task_id”: <task_id>, “config”: <config_dict>, “depends_on”: [<task_ids>]}] to pass to the batch_utils.submit_tasks function.
-
-
class
pyveg.src.processor_modules.
VegetationImageProcessor
(name=None)[source]¶ Bases:
pyveg.src.processor_modules.ProcessorModule
Class to convert tif files downloaded from GEE into png files that can be looked at or used as input to further analysis.
Current default is to output: 1) Full-size RGB image 2) Full-size NDVI image (greyscale) 3) Full-size black+white NDVI image (after processing, thresholding, …) 4) Many 50x50 pixel sub-images of RGB image 5) Many 50x50 pixel sub-images of black+white NDVI image.
-
construct_image_savepath
(date_string, coords_string, image_type='RGB')[source]¶ Function to abstract output image filename construction. Current approach is to create a ‘PROCESSED’ subdir inside the sub-directory corresponding to the mid-period of the date range for the full-size images and a ‘SPLIT’ subdirectory for the sub-images.
-
process_single_date
(date_string)[source]¶ For a single set of .tif files corresponding to a date range (normally a sub-range of the full date range for the pipeline), construct RGB, and NDVI greyscale images. Then do processing and thresholding to make black+white NDVI images. Split the RGB and black+white NDVI ones into small (50x50pix) sub-images.
- Parameters
date_string (str, format YYYY-MM-DD) –
- Returns
- Return type
True if everything was processed and saved OK, False otherwise.
-
save_rgb_image
(band_dict, date_string, coords_string)[source]¶ Merge the separate tif files for the R,G,B bands into one image, and save it.
-
set_default_parameters
()[source]¶ Set some basic defaults. Note that these might get overridden by a parent Sequence, or by calling configure() with a dict of values.
-
split_and_save_sub_images
(image, date_string, coords_string, image_type, npix=50)[source]¶ Split the full-size image into lots of small sub-images
- Parameters
image (pillow Image) –
date_string (str, format YYYY-MM-DD) –
coords_string (str, format long_lat) –
image_type (str, typically ‘RGB’ or ‘BWNDVI’) –
npix – Dimension in pixels of the side of each sub-image. Default is 50x50.
- Returns
True if all sub-images saved correctly.
-
-
class
pyveg.src.processor_modules.
WeatherImageToJSON
(name=None)[source]¶ Bases:
pyveg.src.processor_modules.ProcessorModule
Read the weather-related tif files downloaded from GEE, and write the temp and precipitation values out as a JSON file.
pyveg.src.pyveg_pipeline module¶
Definitions:¶
A PIPELINE is the whole analysis procedure for one set of coordinates. It will likely consist of a couple of SEQUENCES - e.g. one for vegetation data and one for weather data.
A SEQUENCE is composed of one or more MODULES, that each do specific tasks, e.g. download data, process images, calculate quantities from image.
A special type of MODULE may be placed at the end of a PIPELINE to combine the results of the different SEQUENCES into one output file.
-
class
pyveg.src.pyveg_pipeline.
BaseModule
(name=None)[source]¶ Bases:
object
A “Module” is a building block of a sequence - takes some input, does something (e.g. Downloads from GEE, processes some images, …) and produces some output. The working directory for all modules within a sequence will be given by the sequence - modules may write output to subdirectories of this (e.g. for different dates), but what we call “output_location” will be the base directory common to all modules, and will contain info about the image collection name, and the coordinates.
-
check_config
()[source]¶ Loop through list of parameters, which will each be a tuple (name, [allowed_types]) and check that the parameter exists, and is of the correct type.
-
check_for_existing_files
(location, num_files_expected)[source]¶ See if there are already num_files in the specified location. If “replace_existing_files” is set to True, always return False
-
configure
(config_dict=None)[source]¶ Order of preference for configuration: 1) config_dict 2) values held by the parent Sequence 3) default values. We set them in reverse order here, so that higher priorities override lower ones.
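The "set in reverse order" idea can be sketched with plain dictionaries: start from the lowest-priority values and let each higher-priority source overwrite them. The function name and the use of dictionaries for all three sources are assumptions; the real method sets attributes on the module.

```python
def resolve_config(defaults, sequence_values, config_dict=None):
    """Merge three configuration sources, highest priority last (hypothetical helper)."""
    resolved = dict(defaults)            # 3) default values, lowest priority
    resolved.update(sequence_values)     # 2) values held by the parent Sequence
    resolved.update(config_dict or {})   # 1) explicit config_dict, highest priority
    return resolved


settings = resolve_config(
    {"a": 1, "b": 2, "c": 3},
    {"b": 20, "c": 30},
    {"c": 300},
)
```

In this example "a" keeps its default, "b" takes the Sequence value, and "c" takes the value from config_dict.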
-
copy_to_output_location
(tmpdir, output_location, file_endings=[])[source]¶ Copy contents of a temporary directory to a specified output location.
- Parameters
tmpdir (str, location of temporary directory) –
output_location (str, either path to a local directory (if self.output_location_type is "local")) – or to Azure <container>/<blob_path> if self.output_location_type==”azure”)
file_endings (list of str, optional. If given, only files with those endings will be copied.) –
-
get_file
(filename, location_type)[source]¶ Just return the filename if location_type is “local”. Otherwise return a tempfile with the contents of a blob if the location is “azure”.
-
join_path
(*path_elements)[source]¶ If output_location_type is ‘local’, we will just use os.path.join, which puts a “/” separator in for posix, or “\” for windows. However, if output_location_type is ‘azure’, we always want “/”.
- Parameters
path_elements (list of strings. Directory-like path elements.) –
- Returns
path
- Return type
str, the path elements joined by “/” or “\”.
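The separator rule can be sketched as a standalone function. In pyveg this is a method that reads the location type from the module itself; passing it as an explicit `location_type` argument here is an assumption made to keep the example self-contained.

```python
import os


def join_path(location_type, *path_elements):
    """Join path elements with '/' for Azure, or the OS separator for local paths."""
    if location_type == "azure":
        # Blob paths always use forward slashes, whatever the local OS is.
        return "/".join(path_elements)
    return os.path.join(*path_elements)


blob_path = join_path("azure", "container", "2020-01-16", "SPLIT")
local_path = join_path("local", "output", "2020-01-16", "SPLIT")
```

Keeping this logic in one place means the rest of the code never needs to know which storage backend is in use.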
-
list_directory
(directory_path, location_type)[source]¶ List contents of a directory, either on local file system or Azure blob storage.
-
-
class
pyveg.src.pyveg_pipeline.
Pipeline
(name)[source]¶ Bases:
object
A Pipeline contains all the Sequences we want to run on a particular set of coordinates and a date range. e.g. there might be one Sequence for vegetation data and one for weather data.
-
class
pyveg.src.pyveg_pipeline.
Sequence
(name)[source]¶ Bases:
object
A Sequence is a collection of Modules where the output of one module is typically the input to the next one. It will typically correspond to a particular data collection, e.g. for vegetation imagery, we might have one module to download the images, one to process them, and one to analyze the processed images.
-
check_if_finished
()[source]¶ Only relevant when one or more modules are running in batch mode. Sequences that depend on this Sequence will call this function while they wait for all Modules to finish.
-
create_batch_job_if_needed
()[source]¶ If any modules in this sequence are to be run in batch mode, create a batch job for them.
-
join_path
(*path_elements)[source]¶ If output_location_type is ‘local’, we will just use os.path.join, which puts a “/” separator in for posix, or “\” for windows. However, if output_location_type is ‘azure’, we always want “/”.
- Parameters
path_elements (list of strings. Directory-like path elements.) –
- Returns
path
- Return type
str, the path elements joined by “/” or “\”.
-
print_run_status
()[source]¶ For all modules in the sequence, print out how many jobs succeeded or failed.
-
pyveg.src.subgraph_centrality module¶
Python version of mao_pollen.m matlab code to look at connectedness of pixels on a binary image, using “Subgraph Centrality” as described in:
Mander et al. “A morphometric analysis of vegetation patterns in dryland ecosystems”, R. Soc. open sci. (2017) https://royalsocietypublishing.org/doi/10.1098/rsos.160443
Mander et al. “Classification of grass pollen through the quantitative analysis of surface ornamentation and texture”, Proc R Soc B 280: 20131905. https://royalsocietypublishing.org/doi/pdf/10.1098/rspb.2013.1905
Estrada et al. “Subgraph Centrality in Complex Networks” https://arxiv.org/pdf/cond-mat/0504730.pdf
-
pyveg.src.subgraph_centrality.
calc_adjacency_matrix
(distance_matrix, include_diagonal_neighbours=False)[source]¶ Return a symmetric matrix of (n-pixels-over-threshold)x(n-pixel-over-threshold) where each element ij is 0 or 1 depending on whether the distance between pixel i and pixel j is < or > neighbour_threshold.
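The relationship between pixel coordinates, the distance matrix and the adjacency matrix can be sketched in pure Python. The threshold values used here (just over 1 for edge-sharing neighbours, just over the square root of 2 when diagonal neighbours are included) are assumptions about what neighbour_threshold means, and both function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Euclidean distances between every pair of (x, y) pixel coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def adjacency_matrix(distances, include_diagonal_neighbours=False):
    """Symmetric 0/1 matrix marking pairs of distinct pixels that are neighbours."""
    # Assumed thresholds: edge-sharing pixels are 1 apart, diagonal ones sqrt(2) apart.
    limit = (math.sqrt(2) if include_diagonal_neighbours else 1.0) + 1e-6
    n = len(distances)
    return [
        [1 if i != j and distances[i][j] <= limit else 0 for j in range(n)]
        for i in range(n)
    ]


coords = [(0, 0), (1, 0), (1, 1)]
adjacency = adjacency_matrix(distance_matrix(coords))
```

With diagonal neighbours excluded, (0, 0) and (1, 1) are not connected; setting include_diagonal_neighbours=True links all three pixels.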
-
pyveg.src.subgraph_centrality.
calc_and_sort_sc_indices
(adjacency_matrix)[source]¶ Given an input adjacency matrix, calculate eigenvalues and eigenvectors, calculate the subgraph centrality (ref: <== ADD REF), then sort.
-
pyveg.src.subgraph_centrality.
calc_distance_matrix
(signal_coords)[source]¶ calculate the distances between all signal pixels in the original image
-
pyveg.src.subgraph_centrality.
calc_euler_characteristic
(pix_indices, graph)[source]¶ Find the edges where both ends are within the pix_indices list
-
pyveg.src.subgraph_centrality.
crop_image_array
(input_image, x_range, y_range)[source]¶ return a new image from specified pixel range of input image
-
pyveg.src.subgraph_centrality.
feature_vector_metrics
(feature_vector, output_csv=None)[source]¶ Calculate different metrics for the feature vector
-
pyveg.src.subgraph_centrality.
fill_feature_vector
(pix_indices, coords, adj_matrix, num_quantiles=20)[source]¶ Given indices and coordinates of signal pixels ordered by SC value, put them into quantiles and calculate an element of a feature vector for each quantile, using the Euler characteristic.
- Will return:
selected_pixels, feature_vector
where selected_pixels is a vector of the pixel coordinates in each quantile, and a feature_vector is either num-connected-components or Euler characteristic, for each quantile.
-
pyveg.src.subgraph_centrality.
fill_sc_pixels
(sel_pixels, orig_image, val=200)[source]¶ Given an original 2D array where all the elements are 0 (background) or 255 (signal), fill in a selected subset of signal pixels with the grey value val (default 200).
-
pyveg.src.subgraph_centrality.
generate_sc_images
(sel_pixels, orig_image, val=200)[source]¶ Return a dict of images with the selected subsets of signal pixels filled in in cyan.
-
pyveg.src.subgraph_centrality.
get_signal_pixels
(input_array, threshold=255, lower_threshold=True, invert_y=False)[source]¶ Find coordinates of all pixels within the image that are > or < the threshold (require < threshold if lower_threshold==True). NOTE: if invert_y is set, the second coordinate is made negative.
-
pyveg.src.subgraph_centrality.
invert_y_coord
(coord_list)[source]¶ Convert [(x1,y1),(x2,y2),…] to [(x1,-y1),(x2,-y2),…]
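The conversion amounts to a single list comprehension; a sketch with the hypothetical name `invert_y_sketch`:

```python
def invert_y_sketch(coord_list):
    """Negate the second element of every (x, y) pair."""
    return [(x, -y) for x, y in coord_list]
```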
-
pyveg.src.subgraph_centrality.
make_graph
(adj_matrix)[source]¶ Use igraph to create a graph from our adjacency matrix
-
pyveg.src.subgraph_centrality.
save_sc_images
(image_dict, file_prefix)[source]¶ Saves images from dictionary.
-
pyveg.src.subgraph_centrality.
subgraph_centrality
(image, use_diagonal_neighbours=False, num_quantiles=20, threshold=255, lower_threshold=True, output_csv=None)[source]¶ Go through the whole calculation, from input image to output vector of pixels in each SC quantile, and feature vector (either connected-components or Euler characteristic).
-
pyveg.src.subgraph_centrality.
text_file_to_array
(input_filename)[source]¶ Read a csv-like representation of an image, where each row (representing a row of pixels in the image) is a comma-separated list of pixel values 0 (for black) or 255 (for white).
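Reading such a file could be sketched with the standard csv module as below. The helper names are hypothetical, and the result is a list of lists of ints; the array type returned by pyveg is not stated in this description.

```python
import csv
import io


def rows_to_array(text_stream):
    """Parse comma-separated pixel rows from an open text stream."""
    reader = csv.reader(text_stream)
    # Skip blank lines; int() tolerates surrounding whitespace.
    return [[int(value) for value in row] for row in reader if row]


def text_file_to_array_sketch(input_filename):
    """Open a csv-like image file and parse it into a list of pixel rows."""
    with open(input_filename, newline="") as handle:
        return rows_to_array(handle)


# In-memory usage example, so no file is needed to try it out.
example = rows_to_array(io.StringIO("0,255,0\n255,255,0\n"))
```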
pyveg.src.zenodo_utils module¶
Use the Zenodo API to deposit or retrieve data.
Needs an API token. To create one:
Sign in or create an account at https://zenodo.org
Create an API token by going to this page: https://zenodo.org/account/settings/applications/tokens/new/
Tick “deposit:actions” and “deposit:write” in the “Scopes” section and click Create.
Then copy the created token into a file called “zenodo_api_token” in the pyveg/configs/ directory.
OR, to use the “Sandbox” API for testing, follow the same steps but replace “zenodo.org” with “sandbox.zenodo.org” in the URLs, and put the token into a file named “zenodo_test_api_token”. Then call the functions in this module with the “test” argument set to True.
-
pyveg.src.zenodo_utils.
create_deposition
(test=False)[source]¶ Create a new, empty deposition.
- Parameters
test (bool, True if we will use the sandbox API, False otherwise) –
- Returns
r
- Return type
dict, response from the API with info about the newly created deposition
-
pyveg.src.zenodo_utils.
delete_file
(filename, deposition_id, test=False)[source]¶ Delete a file from a deposition.
- Parameters
filename (str, full path to the file to be deleted) –
deposition_id (int, ID of the deposition containing this file) –
test (bool, True if we will use the sandbox API, False otherwise) –
- Returns
- Return type
True if file was deleted OK, False otherwise.
-
pyveg.src.zenodo_utils.
download_file
(filename, deposition_id, destination_path='.', test=False)[source]¶ Download a file from a deposition.
- Parameters
filename (str, name of the file to be downloaded) –
deposition_id (int, ID of the deposition containing this file) –
destination_path (str, where to put the downloaded file) –
test (bool, True if we will use the sandbox API, False otherwise) –
- Returns
filepath
- Return type
str, location of downloaded file.
-
pyveg.src.zenodo_utils.
download_results_by_coord_id
(coords_id, json_or_csv='json', destination_path=None, deposition_id=None, test=False)[source]¶ Search the deposition (defined by the deposition_id in zenodo_config.py) for results_summary json or summary_stats csv files beginning with ‘coords_id’ and download the most recent one.
- Parameters
coords_id (str, two-digit string identifying the row of the location in coordinates.py) –
json_or_csv (str, if "json", download 'results_summary.json', otherwise download 'ts_summary_stats.csv'.) –
destination_path (str, directory to download to. If not given, put in temporary dir) –
deposition_id (str, deposition ID in Zenodo. If not given, use the one from zenodo_config.py) –
test (bool, if True, use the sandbox Zenodo repository) –
-
pyveg.src.zenodo_utils.
get_base_url_and_token
(test=False)[source]¶ Get the base URL for the API, and the API token, for use in requests.
- Parameters
test (bool, True if we will use the sandbox API, False otherwise) –
- Returns
base_url (str, the first part of the URL for the API)
api_token (str, the personal access token, read from a file.)
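The selection between the production and sandbox services could be sketched as follows. The token file names and the pyveg/configs/ directory come from the module description; the `/api` suffix on the base URL, the assumption that both token files live in the same directory, and the function names are not confirmed by the pyveg source.

```python
from pathlib import Path


def base_url_and_token_path(test=False, config_dir="pyveg/configs"):
    """Pick the API base URL and token file for the sandbox or production service."""
    if test:
        base_url = "https://sandbox.zenodo.org/api"
        token_name = "zenodo_test_api_token"
    else:
        base_url = "https://zenodo.org/api"
        token_name = "zenodo_api_token"
    # Assumption: both token files are kept in the same config directory.
    return base_url, Path(config_dir) / token_name


def read_token(token_path):
    """Read the token from its file, dropping any trailing newline."""
    return Path(token_path).read_text().strip()
```

Keeping the token in a file outside version control avoids hard-coding credentials in the source.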
-
pyveg.src.zenodo_utils.
get_bucket_url
(deposition_id, test=False)[source]¶ For a given deposition_id, find the URL needed to upload a file.
- Parameters
deposition_id (int, ID of the deposition.) –
test (bool, if True use the sandbox API, if False will use the real one.) –
- Returns
bucket_url
- Return type
str, the URL of the bucket for this deposition, or empty string if id not found
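In the Zenodo deposit API, the JSON describing a deposition carries a `links` object whose `bucket` entry is the upload URL. A minimal sketch of pulling it out of an already-fetched response, with the hypothetical name `bucket_url_from_info` and the empty-string fallback described above:

```python
def bucket_url_from_info(dep_info):
    """Return the bucket URL from a deposition-info dict, or '' if absent."""
    if not isinstance(dep_info, dict):
        return ""
    # Missing "links" or "bucket" keys fall back to an empty string.
    return dep_info.get("links", {}).get("bucket", "")
```

The example URL in any test data is a placeholder, not a real bucket.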
-
pyveg.src.zenodo_utils.
get_deposition_id
(json_or_csv='json', test=False)[source]¶ If we have previously created a deposition, we hopefully stored its ID in the zenodo_config.py file.
-
pyveg.src.zenodo_utils.
get_deposition_info
(deposition_id, test=False)[source]¶ Get the JSON object containing details of a deposition.
- Parameters
deposition_id (int, ID of the deposition.) –
test (bool, if True use the sandbox API, if False will use the real one.) –
- Returns
dep_info
- Return type
dict, information about the deposition
-
pyveg.src.zenodo_utils.
get_results_summary_json
(coords_string, collection, deposition_id, test=False)[source]¶ Assuming the zipfile is named following the convention results_<long>_<lat>_<collection>.zip download this from the deposition, and extract the results_summary.json.
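The naming convention and the extraction step can be sketched with the standard zipfile module. This sketch leaves out the download itself, assumes coords_string already has the form <long>_<lat>, and uses hypothetical helper names.

```python
import io
import json
import zipfile


def zip_name(coords_string, collection):
    """Build the expected zipfile name, assuming coords_string is '<long>_<lat>'."""
    return f"results_{coords_string}_{collection}.zip"


def extract_results_summary(zip_source):
    """Return the parsed results_summary.json from a zip path or file object."""
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            # Match on the file name so a leading directory does not matter.
            if member.endswith("results_summary.json"):
                with archive.open(member) as handle:
                    return json.load(handle)
    return None  # no summary file found in the archive


# In-memory usage example with made-up content.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr("results_summary.json", json.dumps({"n_points": 3}))
_buffer.seek(0)
summary = extract_results_summary(_buffer)
```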
-
pyveg.src.zenodo_utils.
list_depositions
(test=False)[source]¶ List all the depositions created by this account.
- Parameters
test (bool, True if we will use the sandbox API, False otherwise) –
- Returns
r
- Return type
list of dicts, response from the API with info about the depositions
-
pyveg.src.zenodo_utils.
list_files
(deposition_id, json_or_csv='json', test=False)[source]¶ List all the files in a deposition.
- Parameters
deposition_id (int, ID of the deposition on which to list files) –
json_or_csv (str, if 'json', list the deposition containing the results_summary.json) – otherwise list the one containing ts_summary_stats.csv
test (bool, True if using the sandbox API, False otherwise) –
- Returns
files
- Return type
list[str], list of all filenames in the deposition.
-
pyveg.src.zenodo_utils.
prepare_results_zipfile
(collection_name, png_location, png_location_type='local', json_location=None, json_location_type='local')[source]¶ Create a zipfile called <results_long_lat_collection> containing the ‘results_summary.json’, and the outputs of the analysis.
- Parameters
collection_name (str, typically "Sentinel2" or "Landsat8" or similar) –
png_location (str, directory containing analysis/ subdirectory) –
png_location_type (str, either "local" or "azure") –
json_location (str, directory containing "results_summary.json") – If not specified, assume same as png_location
json_location_type (str, either "local" or "azure") –
- Returns
zip_filename
- Return type
str, location of the produced zipfile
-
pyveg.src.zenodo_utils.
publish_deposition
(deposition_id, test=False)[source]¶ Submit the deposition, so it will be findable on Zenodo and have a DOI.
-
pyveg.src.zenodo_utils.
unlock_deposition
(deposition_id, test=False)[source]¶ Unlock a previously submitted deposition, so we can add to it.
-
pyveg.src.zenodo_utils.
upload_custom_metadata
(title, upload_type, description, creators, deposition_id, test=False)[source]¶ Upload a dict to the deposition containing metadata with the format:
{
    'metadata': {
        'title': 'My first upload',
        'upload_type': 'poster',
        'description': 'This is my first upload',
        'creators': [{'name': 'Doe, John', 'affiliation': 'Zenodo'}]
    }
}
- Parameters
title (str, title of the deposition) –
upload_type (str, type of upload, typically "dataset") –
description (str, description of the deposition) –
creators (dict, format {"name": <str:name>, "affiliation": <str:affiliation>}) –
- Returns
r
- Return type
dict, JSON response from the API.
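Assembling the metadata payload could look like the sketch below. It only builds the dict and does not contact the API. The name `build_metadata` is hypothetical, and wrapping a single creators dict in a list is an assumption made so that the result matches the list shown in the example format.

```python
def build_metadata(title, upload_type, description, creators):
    """Build the nested metadata dict in the format shown above."""
    # Accept either one creator dict or a list of them (assumption).
    if isinstance(creators, dict):
        creators = [creators]
    return {
        "metadata": {
            "title": title,
            "upload_type": upload_type,
            "description": description,
            "creators": list(creators),
        }
    }
```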
-
pyveg.src.zenodo_utils.
upload_file
(filename, deposition_id, test=False)[source]¶ Upload a file to a deposition.
- Parameters
filename (str, full path to the file to be uploaded) –
deposition_id (int, ID of the deposition to which we want to upload.) –
test (bool, True if we will use the sandbox API, False otherwise) –
- Returns
uploaded_ok
- Return type
bool, True if we get status code 200 from the API
-
pyveg.src.zenodo_utils.
upload_standard_metadata
(deposition_id, json_or_csv='json', test=False)[source]¶ Upload the metadata dict defined in zenodo_config.py to the specified deposition ID.
- Parameters
deposition_id (int, ID of the deposition to which to upload) –
json_or_csv (str, either 'json' to upload the metadata for results_summary.json, or 'csv' to upload the metadata for ts_summary_stats.csv) –
test (bool, if True, use the sandbox API, if False use the production one) –
- Returns
r
- Return type
dict, JSON response from the API.