@@ -33,7 +33,7 @@ class s3_details: #Variables for defining S3 file details and how much to
        with h5py.File(self.s3_endpoint+"/"+self.bucket_name+"/"+self.prefix+"_"+column_to_read+".h5", driver='ros3', secret_id=bytes(self.access_id, encoding='utf-8'), secret_key=bytes(self.access_key, encoding='utf-8'), aws_region=bytes(self.region, encoding='utf-8')) as f: # Encoding is important for h5py to properly understand the strings
            if self.dimension == 1: # 2d case
                data_part = f[column_to_read][0, self.target[1]-self.window_width:self.target[1]+self.window_width+1, self.target[2]-self.window_width:self.target[2]+self.window_width+1] # download target element and window_width around it only, for 2d case
-           elif self.dimension == 3: # 3d case
+           elif self.dimension > 1: # 3d case
                data_part = f[column_to_read][self.target[0]-self.window_width:self.target[0]+self.window_width+1, self.target[1]-self.window_width:self.target[1]+self.window_width+1, self.target[2]-self.window_width:self.target[2]+self.window_width+1] # download target element and window_width around it only, for 3d case
        return data_part
...
...
@@ -41,19 +41,19 @@ class s3_details: #Variables for defining S3 file details and how much to
pool = mp.Pool(len(var_list)) # Declare pool of workers, one for each variable to download
for count, prefix in enumerate(prefix_list):
    print("Prefix being read is " + prefix)
    target = target_list[count] # Set the target centre element
    s3file = s3_details(s3_endpoint, bucket_name, access_id, access_key, region, dimension, target, window_width, prefix) # Set details of download
    #start_time = time.time() # Start timing if benchmarking
    data_payload = np.array(pool.map(s3file.read_variable, var_list)) # Download all variables in parallel
    #print(f"\n elapsed time is %f" % (time.time()-start_time)) # Print timing if benchmarking
#print("number of elements is " +str(data_payload.size)) #Print timing if benchmarking
np.save(prefix+"_out",data_payload)#Save array to file, based on each original file name. All variables saved to one array in corrected order, constructed infirst half of workflow
deldata_payload#If data_payload is very large, it will need to be deleted between attempts to function properly
Two-part workflow for partial download of a specific dataset (SC) from CIAO-output HDF5 files. Developed in the context of CoEC. The solution consists of:
- CIAO_split_and_upload.py
- - Takes a folder of HDF5 files, splits the SC dataset into constituent variables (for faster file access) and uploads them to a specified S3 endpoint and bucket, along with metadata .json files
+ - Takes a folder of HDF5 files, splits the SC dataset into constituent variables (for faster file access) and uploads them to a specified S3 endpoint and bucket, along with metadata .json files.
- More performant than keeping the whole, monolithic file for each timestep, which is slower to access and not fast enough for remote post-processing, e.g. finding the highest-temperature element in a timestep.
- CIAO_s3_remote_read.py
- Takes the S3 bucket and saves a numpy array for each file, downloading a subsection of each file in a window around a target centre element, either defined manually by the user or targeted at the hottest element of each simulation timestep.
...
...
@@ -27,7 +28,7 @@ Two-part workflow for partial download of specific dataset (SC) from CIAO-output
- All files in the same folder should ideally be from the same simulation; the script assumes each file has the same structure. The top-level group is named after the experiment, and the SC dataset's constituent variables depend on the chemical species simulated.
- Define splitdir, the sub-directory within datadir to write split files to.
- The split files are not deleted automatically, in case they are useful; delete them if necessary.
- Edit "S3 Connection Details section:
- Edit "S3 Connection Details" section:
- Define S3 Endpoint - "https://s3-coec.jsc.fz-juelich.de" is likely to be the best endpoint for CoEC purposes, with region "us-east-1".
- The JUDAC object storage endpoint is also available, using `region = "just"`, but its performance in testing is lower than the s3-coec endpoint. Documentation for gaining access when one already has a JSC account can be found at https://apps.fz-juelich.de/jsc/hps/judac/object-storage.html
- Define `bucket_name` - the intended use of the script assumes a new bucket for each set of simulations to share.
...
...
@@ -41,3 +42,71 @@ Two-part workflow for partial download of specific dataset (SC) from CIAO-output
- `var_list.json` - List of variables in SC, generated from the first file in `prefix_list`, assumed to be the same for all files in folder/bucket
- `dim.json` - Dimension of each SC variable, generated from first file in `prefix_list`, assumed to be the same for all files in folder/bucket. Used by read script to determine whether the file is two or three dimensional.
- `T_max_arg_list.json` - List of elements with highest temperature for each timestep of CIAO simulation. Used by read script for targeting window to download.
- Edit "S3 Connection Details" section to details defined above.
- Edit "Target Window" section to details defined above.
- Define `window_width` in each direction around the target element to download.
- Choose `target_type` from three options and uncomment:
- `manual` - supply a single element that is static for all timesteps
- `manual_list` - supply a manually defined list of target elements, the same length as the number of timesteps
- `auto` - use `T_max_arg_list.json` to use the hottest element in each timestep as the target.
- Run CIAO_s3_remote_read.py
- metadata .json files should be present in the folder the script is run from.
- A numpy array dump is saved for each timestep, containing the target window in all variables (see the loading sketch below).
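As a minimal usage sketch, the saved dump can be loaded back with numpy; the prefix below is a placeholder, and the `.npy` suffix is appended by `np.save` in the read script:

```python
import numpy as np

# Hypothetical prefix of one of the original HDF5 files;
# np.save(prefix + "_out", ...) in the read script writes prefix + "_out.npy".
prefix = "flame_0001"

data_payload = np.load(prefix + "_out.npy")
print(data_payload.shape)  # (number of variables, window elements per axis, ...)
```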
## Section descriptions (to assist with modification and extension)
### CIAO_split_and_upload.py
This script is split into several sections:
- Folder Details
- S3 Connection Details
- Functional Section
- File List Construction
- File Structure Read
- Pre-processing
- Upload Section
For as-provided use of the script, only Folder Details and S3 Connection Details need to be modified.
In Folder Details the user sets datadir, the local folder containing the files, and splitdir, the sub-directory created within it for the split HDF5 files and the metadata .json files. The script does not currently delete those files after it finishes, in case they are useful to the user. They should be deleted to save space if not required, and if the same datadir is to be reused, the splitdir should be deleted first.
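A minimal sketch of what the Folder Details section might contain; the paths are placeholders, and the exact variable names should match the script:

```python
import os

datadir = "/path/to/ciao_hdf5_files"       # placeholder: local folder containing the HDF5 files
splitdir = os.path.join(datadir, "split")  # placeholder: sub-folder for split files and metadata .json files
```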
The S3 Connection Details section should be filled in with the details (as strings) of the endpoint and bucket to upload to; a sketch follows the list below.
- s3_endpoint should be set to the endpoint detailed above: https://s3-coec.jsc.fz-juelich.de.
- bucket_name should be set to the name of a new and empty bucket for most predictable results.
- access_id and access_key should be set to the access details for that user, with access_key being the S3 secret key.
- region should generally be set to us-east-1, but certain endpoints require other values; for example, the JUDAC endpoint requires region set to "JUST" to function correctly with an up-to-date signature version.
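A sketch of these assignments with the values recommended above; the credentials are placeholders:

```python
s3_endpoint = "https://s3-coec.jsc.fz-juelich.de"  # recommended CoEC endpoint
bucket_name = "my-simulation-set"                  # placeholder: a new, empty bucket per set of simulations
access_id = "YOUR_ACCESS_KEY_ID"                   # placeholder
access_key = "YOUR_SECRET_KEY"                     # placeholder: the S3 secret key
region = "us-east-1"                               # "JUST" for the JUDAC endpoint, as noted above
```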
For the intended use, the user should not have to modify further sections of the script, but the script was developed so it could be modified for the end-users desired workflow, or to other applications that output similarly-structured HDF5 files. To this end, most lines of code are commented with the intent to make it easy to modify and understand.
A general overview of each section and the logic behind it follows:
File List Construction scans datadir for HDF5 files and generates a list of their filenames, makes splitdir if it does not already exist, and saves the list of filenames to a json file named prefix_list.json. This filename is not given an experiment-specific prefix, as it is assumed that each experiment will be uploaded to a different bucket. Each metadata .json file is also uploaded to the bucket along with the HDF5 files, for later reference if necessary.
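A minimal sketch of this logic, assuming the HDF5 files use a `.h5` extension, that datadir and splitdir are set as in Folder Details, and that prefix_list.json is written into splitdir alongside the split files (exact variable names may differ from the script):

```python
import glob
import json
import os

# List the HDF5 files in datadir and keep only the filename prefixes (no extension)
prefix_list = [os.path.splitext(os.path.basename(path))[0]
               for path in sorted(glob.glob(os.path.join(datadir, "*.h5")))]

os.makedirs(splitdir, exist_ok=True)  # create splitdir if it does not already exist

with open(os.path.join(splitdir, "prefix_list.json"), "w") as handle:
    json.dump(prefix_list, handle)
```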
The File Structure Read section finds the name of the top-level Group/Dataset of the first file in the list constructed in the previous section. Because h5py reports the variable names in a different order than the "Column index" ordering that HDF5 uses internally to store them, the script constructs a list of variable names correctly ordered by their Column index in the HDF5 file, and saves this to var_list.json.
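A heavily hedged sketch of the ordering step: the layout below, in particular the idea that each variable under the SC group carries a "Column index" attribute, is an assumption for illustration rather than the exact CIAO file structure:

```python
import h5py
import json
import os

with h5py.File(os.path.join(datadir, prefix_list[0] + ".h5"), "r") as infile:
    top_group = list(infile.keys())[0]  # top-level group, named after the experiment
    sc = infile[top_group]["SC"]        # assumed location of the SC variables
    # Sort the variable names by their assumed "Column index" attribute so the list
    # matches HDF5's internal storage order rather than h5py's reported order
    var_list = sorted(sc.keys(), key=lambda name: int(sc[name].attrs["Column index"]))

with open(os.path.join(splitdir, "var_list.json"), "w") as handle:
    json.dump(list(var_list), handle)
```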
The dimensions of the first variable of the SC dataset are obtained so that the read script can later tell whether the experiment was in two dimensions or three; this is saved in dim.json.
In the Pre-processing section, the index of the hottest element of each file is recorded in a list, so that the download script can be targeted at these elements on the assumption that they mark the flame jet tip. This list is then saved in T_max_arg_list.json.
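A sketch of this pre-processing step, assuming the temperature variable in the split files is named "T"; that name, and reading from the split file (named `<prefix>_<variable>.h5`, with the variable as the dataset name) rather than the original, are assumptions:

```python
import h5py
import json
import os

import numpy as np

T_max_arg_list = []
for prefix in prefix_list:
    with h5py.File(os.path.join(splitdir, prefix + "_T.h5"), "r") as infile:
        temperature = infile["T"][...]
        # Coordinates of the hottest element, stored as plain ints so they serialise to JSON
        T_max_arg_list.append([int(i) for i in np.unravel_index(np.argmax(temperature), temperature.shape)])

with open(os.path.join(splitdir, "T_max_arg_list.json"), "w") as handle:
    json.dump(T_max_arg_list, handle)
```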
In the Upload section, an S3 client session is declared through the boto3 module, and all HDF5 files and .json files in splitdir are uploaded to the chosen S3 endpoint and bucket.
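A minimal sketch of the upload step with boto3, mirroring the described behaviour rather than reproducing the exact code:

```python
import glob
import os

import boto3

# S3 client pointed at the chosen endpoint, using the connection details set earlier
s3_client = boto3.client(
    "s3",
    endpoint_url=s3_endpoint,
    aws_access_key_id=access_id,
    aws_secret_access_key=access_key,
    region_name=region,
)

# Upload every split HDF5 file and every metadata .json file in splitdir, keyed by its bare filename
for path in sorted(glob.glob(os.path.join(splitdir, "*.h5")) + glob.glob(os.path.join(splitdir, "*.json"))):
    s3_client.upload_file(path, bucket_name, os.path.basename(path))
```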
### CIAO_s3_remote_read.py
This script consists of four major sections:
- Function Definition
- S3 Connection Details
- Target Window
- Functional Section
For as-provided use of the script, only S3 Connection Details and Target Window need to be modified.
Function Definition holds the definition of the s3_details class and its functions. This structure is required for the multiprocessing module to work properly: in the simplest case, multiprocessing maps a single function over a list, and that function can only take one argument, the argument to be varied. The class is initialised with the details the user provides in S3 Connection Details and Target Window, and with the local metadata files for the simulation. These metadata files can also be edited to target only a subset of the uploaded files, if required. The function read_variable takes the name of one of the variables in the original file's SC dataset, supplied from var_list.json, and extracts, from either a two-dimensional or three-dimensional set of output files, the target element and a window of user-defined size around it, returning this data.
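For orientation, a skeleton of the class that is consistent with the constructor call and the read_variable body shown in the diff excerpt above (the method body is elided here; this is not the verbatim script):

```python
class s3_details:
    """Bundle the S3 and window details so that read_variable only needs the
    variable name as its single argument, which is what pool.map requires."""

    def __init__(self, s3_endpoint, bucket_name, access_id, access_key,
                 region, dimension, target, window_width, prefix):
        self.s3_endpoint = s3_endpoint
        self.bucket_name = bucket_name
        self.access_id = access_id
        self.access_key = access_key
        self.region = region
        self.dimension = dimension      # 2D or 3D output, from dim.json
        self.target = target            # centre element of the download window
        self.window_width = window_width
        self.prefix = prefix            # original file name, used to build the object key

    def read_variable(self, column_to_read):
        # Opens <endpoint>/<bucket>/<prefix>_<column_to_read>.h5 with h5py's ros3
        # driver and returns the window around self.target (see the diff above).
        ...
```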
The section S3 Connection Details contains the s3_endpoint, bucket and user details. These should be filled in as described in the previous section of this document.
The section Target Window holds the definition of the target element and the window around it to download in each direction, i.e. the number of elements returned per variable will be the square or cube of 2*window_width+1, depending on dimensionality. window_width should be supplied as an integer; larger windows require more time to complete the request. Next, the user must choose target_type. Set to auto, as the script is supplied, it uses the contents of T_max_arg_list.json as the target for each timestep, i.e. the element with the highest temperature in each frame. Setting target_type to manual uses a single target element, target, for all timesteps. Setting target_type to manual_list allows a user-defined list to be supplied through the variable manual_target; this list must be the same length as the number of timesteps.
For use as-provided, the user should not have to modify further sections of the script, but as stated previously, the script was developed to be straightforward to adapt to the end-user's desired workflow, or to other applications that output similarly-structured HDF5 files. To this end, most lines of code are commented with the intent of making them easy to understand and modify. A general overview of each remaining section and the logic behind it follows:
The section Read Metadata loads information from the local copies of metadata files, and sets the dimension of the output being read.
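A sketch of the Read Metadata step, assuming the four .json files listed earlier sit in the working directory (variable names may differ from the script):

```python
import json

with open("prefix_list.json") as handle:
    prefix_list = json.load(handle)       # one entry per original file/timestep
with open("var_list.json") as handle:
    var_list = json.load(handle)          # SC variables in corrected column order
with open("dim.json") as handle:
    dimension = json.load(handle)         # used to pick the 2D or 3D read path
with open("T_max_arg_list.json") as handle:
    T_max_arg_list = json.load(handle)    # hottest element per timestep, for target_type "auto"
```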
The section Targeting performs the preparatory steps for whichever of the target_types detailed above has been selected.
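A sketch of that switch, following the target_type options described above (the list construction is illustrative, not the verbatim script):

```python
if target_type == "auto":
    target_list = T_max_arg_list                  # hottest element of each timestep
elif target_type == "manual":
    target_list = [target] * len(prefix_list)     # one static element reused for every timestep
elif target_type == "manual_list":
    # Must be the same length as the number of timesteps
    assert len(manual_target) == len(prefix_list)
    target_list = manual_target
```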
The Download section creates a multiprocessing pool and loops through each timestep, setting the base S3 and file details by initialising the s3_details class, then uses the pool to download each separated variable simultaneously. The split files are what make this multiprocessing approach fast, and they also allow post-processing, such as the temperature search done in the previous step, to be performed remotely as a post-process step; this is not possible with the monolithic form of the output, due to poor performance. The output is then saved as a binary numpy array, a commonly available format that can easily be read back in and written out in another format with numpy and a range of compatible python modules, satisfying the requirement for flexibility and portability of the data.