Commit e00acdfe authored by Martin Schultz

initial version
%% Cell type:markdown id: tags:
# ERA5 data re-organisation to investigate compression options
Idea: save a lot of disk space by "compressing" information on vertical levels and along the time dimension.
Previous attempts at weather and climate model output compression looked at standard compression algorithms, reduced numerical precision, ensemble statistics, or statistical aggregation on model levels. Here, we suggest investigating compression along model levels and the time axis instead.
Specifically, we will calculate tendencies of vertical profiles and along the time axis and try to define functions (possibly using machine learning) which encode these tendencies. It should be possible to designate individual model levels as anchor points: for these, the full information will be maintained, while other model levels and time steps shall be reconstructed with the learned functions (a toy sketch of this anchor-and-reconstruct idea follows after the lists below).
Advantages:
* A substantial reduction of output volume is expected. For example, the ECMWF model of 2020 has 137 model levels and produces hourly data. If we can achieve good reproducibility of the entire data with, say, 14 vertical levels and 6-hourly anchor points in time, the saving would be a factor of about 60 (137/14 × 6 ≈ 59), minus the additional space required for storing the functional mappings.
* The method could be implemented in the model output code, so that output would be much smaller from the start and post-processing could become much more efficient.
* Fields are preserved with no information loss at specific model levels (e.g. standard analysis levels at 1000, 900, 700, 500, 300 hPa), and lossy compression is only applied elsewhere, where precision generally matters less.
* Due to the functional mapping, interpolation of model results to any desired resolution (in pressure and time) will be straightforward and much more efficient than interpolation of uncompressed data. This could be particularly useful for visualisation applications.
Disadvantages:
* the method will require the definition of a new output format and the development of tools to manage it
Possible issues/questions to address:
* efficiency of encoding and decoding (time to re-generate a map at an arbitrary model level)
* error quantification and limitation (see Baker et al., 2017)
* how to determine anchor points in model level and time - we will probably need to add layers on top of user-specified model levels
* how to accomplish the functional mapping
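%% Cell type:markdown id: tags:
To make the anchor-and-reconstruct idea concrete, the next cell contains a minimal toy sketch, not the eventual method: a synthetic vertical profile is kept only at a few anchor levels and reconstructed everywhere else by cubic-spline interpolation, and the reconstruction error is quantified with simple max/RMS metrics. The profile shape, the anchor choice, and the spline (standing in for a possibly learned functional mapping) are all illustrative assumptions.
%% Cell type:code id: tags:
``` python
# Toy sketch of the anchor-level idea: store a profile only at a few
# anchor levels and reconstruct the remaining levels by interpolation.
# The cubic spline is a placeholder for the (possibly learned) mapping.
import numpy as np
from scipy.interpolate import CubicSpline

lev = np.arange(137)                                     # 137 model levels
profile = 220. + 70. * np.exp(-(lev - 136.)**2 / 2000.)  # synthetic temperature-like profile [K]

anchors = np.array([0, 20, 40, 60, 80, 100, 120, 136])   # levels kept at full precision
spline = CubicSpline(anchors, profile[anchors])
reconstructed = spline(lev)                              # reconstruct all 137 levels

# simple error metrics (see Baker et al., 2017 for rigorous error bounds)
abs_err = np.abs(reconstructed - profile)
print(f'stored {len(anchors)} of {len(lev)} levels')
print(f'max abs error: {abs_err.max():.4f} K')
print(f'RMSE:          {np.sqrt((abs_err**2).mean()):.4f} K')
```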
%% Cell type:markdown id: tags:
## Data location and pre-processing
ERA-5 data are stored on the JSC HPFS in /p/fastdata/slmet/slmet111/met_data/ecmwf/era5/grib/. They are stored in GRIB format, with 3 files per hour of day; all files of a day are contained in subdirectories YYYY/MM/. For our experiments we extracted model-level data (ml) for the three days from 15 to 17 February 2020, which include the storm event Dennis, a rather strong storm over Western Europe. Each ml file contains 13 variables and has a size of ~1.8 GBytes.
The GRIB files are first converted to netCDF and stored in $SCRATCH/schultz1 using the CDO command below (shown for 17 February; the other two days are converted analogously):
`for hh in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ; do cdo -f nc copy /p/fastdata/slmet/slmet111/met_data/ecmwf/era5/grib/2020/02/20200217${hh}_ml.grb 20200217${hh}_ml.nc ; echo Done with $hh; done`
The resulting netCDF files are about 4 GBytes each.
Next, we re-assemble the data so that we generate one file per variable including all 72 time steps of the 3 days of the study period:
`for vv in t z w q vo lnsp d u v o3 clwc ciwc cc; do ncrcat -v hyai,hybi,lev,lev_2,${vv} 2020021???_ml.nc era5_20200215-17_${vv}.nc; echo Done with ${vv}; done`
This creates 13 files of roughly 27 GBytes each (consistent with 72 time steps × 137 levels × 601 × 1200 grid points at 4 bytes per value), except for lnsp and z, which are only 207 MBytes.
With the code in this notebook we visualize a few model fields to illustrate their features, frequency distributions and variability.
Finally, we extract a few selected vertical columns for assessing the potential and developing the compression methods. The selected coordinates are:
*
*
The command for extracting column data is:
`for vv in t z w q vo lnsp d u v o3 clwc ciwc cc; do ncks -d lon,${LONVALUE} -d lat,${LATVALUE} era5_20200215-17_${vv}.nc era5_20200215-17_${vv}_${LATVALUE}_${LONVALUE}.nc; done`
where LONVALUE and LATVALUE must be set as environment variables beforehand (vv is set by the loop itself).
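Equivalently, a single column can be extracted directly in Python. A minimal netCDF4 sketch follows; the coordinate values are hypothetical placeholders for the study columns listed above, and nearest-neighbour index lookup is assumed to be adequate:
``` python
# Sketch: extract one vertical column (all times, all levels) with netCDF4.
# LATVALUE/LONVALUE are hypothetical placeholders for the coordinates above.
import numpy as np
import netCDF4 as nc

LATVALUE, LONVALUE = 50.7, 7.1   # hypothetical point; lon runs 0..360 in the raw files
path = '/p/scratch/deepacf/schultz1/'
with nc.Dataset(path + 'era5_20200215-17_t.nc') as f:
    j = np.abs(f.variables['lat'][:] - LATVALUE).argmin()  # nearest latitude index
    i = np.abs(f.variables['lon'][:] - LONVALUE).argmin()  # nearest longitude index
    column = f.variables['t'][:, :, j, i]                  # shape (time, lev)
print(column.shape)
```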
_Note:_
Before CDO and NCO commands can be used in the Jupyter-JSC lab, the corresponding module load commands must be issued. See `module spider CDO` for further information.
%% Cell type:code id: tags:
``` python
# Imports and settings for plotting some maps and frequency distributions
import numpy as np   # needed for the array re-arrangements further below
import netCDF4 as nc
PATH_TO_NC = '/p/scratch/deepacf/schultz1/'
```
%% Cell type:code id: tags:
``` python
# test vertical coordinates (print lev)
with nc.Dataset(PATH_TO_NC + 'era5_20200215-17_t.nc') as f:
    hyam = f.variables['hyam'][:]
    hybm = f.variables['hybm'][:]
for i in range(len(hyam)):
    print(f'{i:4d} : {hyam[i]+101300.*hybm[i]:.3f}')  # pressure of model level in Pa
```
%% Output
0 : 1.000
1 : 2.551
2 : 3.884
3 : 5.747
4 : 8.287
5 : 11.676
6 : 16.107
7 : 21.797
8 : 28.986
9 : 37.932
10 : 48.917
11 : 62.238
12 : 78.208
13 : 97.156
14 : 119.421
15 : 145.352
16 : 175.309
17 : 209.654
18 : 248.754
19 : 292.980
20 : 342.702
21 : 398.287
22 : 460.104
23 : 528.515
24 : 603.877
25 : 686.542
26 : 776.856
27 : 875.156
28 : 981.773
29 : 1097.027
30 : 1221.232
31 : 1354.690
32 : 1497.697
33 : 1650.536
34 : 1813.484
35 : 1986.808
36 : 2170.764
37 : 2365.601
38 : 2571.559
39 : 2788.870
40 : 3017.755
41 : 3258.432
42 : 3511.106
43 : 3775.979
44 : 4053.211
45 : 4342.874
46 : 4644.983
47 : 4959.522
48 : 5286.443
49 : 5625.668
50 : 5977.209
51 : 6341.510
52 : 6719.409
53 : 7111.872
54 : 7519.985
55 : 7944.959
56 : 8388.157
57 : 8851.116
58 : 9335.268
59 : 9841.629
60 : 10370.988
61 : 10924.153
62 : 11501.948
63 : 12105.214
64 : 12734.810
65 : 13391.613
66 : 14076.515
67 : 14790.427
68 : 15534.276
69 : 16309.004
70 : 17115.573
71 : 17954.960
72 : 18828.158
73 : 19736.176
74 : 20680.039
75 : 21660.784
76 : 22679.468
77 : 23737.161
78 : 24834.948
79 : 25973.926
80 : 27155.209
81 : 28379.922
82 : 29649.204
83 : 30964.208
84 : 32326.094
85 : 33736.039
86 : 35195.229
87 : 36704.863
88 : 38266.147
89 : 39880.299
90 : 41548.545
91 : 43272.117
92 : 45052.260
93 : 46890.226
94 : 48787.277
95 : 50742.193
96 : 52748.323
97 : 54793.841
98 : 56866.839
99 : 58957.355
100 : 61055.156
101 : 63149.942
102 : 65231.545
103 : 67290.129
104 : 69316.360
105 : 71301.555
106 : 73237.829
107 : 75118.187
108 : 76936.580
109 : 78687.968
110 : 80368.311
111 : 81974.546
112 : 83504.569
113 : 84957.157
114 : 86331.895
115 : 87629.107
116 : 88849.768
117 : 89995.398
118 : 91067.988
119 : 92069.912
120 : 93003.843
121 : 93872.682
122 : 94679.484
123 : 95427.418
124 : 96119.696
125 : 96759.529
126 : 97350.097
127 : 97894.526
128 : 98395.858
129 : 98857.021
130 : 99280.828
131 : 99669.967
132 : 100026.989
133 : 100354.299
134 : 100654.164
135 : 100928.721
136 : 101179.966
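%% Cell type:markdown id: tags:
The values printed above follow the standard hybrid sigma-pressure relation $p_k = a_k + b_k \, p_s$, evaluated with the mid-level coefficients `hyam` ($a_k$, in Pa) and `hybm` ($b_k$, dimensionless) and a nominal surface pressure $p_s = 101300$ Pa. Level 0 thus lies near the model top (1 Pa), while level 136 lies near the surface (~1012 hPa).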
%% Cell type:code id: tags:
``` python
# specify date, variable and model level
DATE = 30 # integer time index for simplicity (hour 30 of the 72-step series, i.e. 06 UTC on 16 Feb 2020)
VAR = 'u'
LEV = 132 # level index for simplicity (~1000 hPa according to the level table above)
```
%% Cell type:code id: tags:
``` python
filename = PATH_TO_NC + f'era5_20200215-17_{VAR}.nc'
with nc.Dataset(filename) as f:
    v = f.variables[VAR]
    val = v[DATE, LEV, :, :]
    name = getattr(v, 'long_name')
    lon = f.variables['lon'][:]
    lat = f.variables['lat'][:]
# convert lon coordinates to -180..180 and re-arrange data
lon[lon > 180.] -= 360.
l180 = 601 # first index with lon > 180 degrees (0.3 degree grid)
lon = np.concatenate([lon[l180:], lon[:l180]])
val = np.hstack([val[:, l180:], val[:, :l180]])
print(name, val.shape, val.min(), val.max(), val.mean())
print('lon: ', lon[0:10], lon[-10:])
print('lat: ', lat[0:10], lat[-10:])
```
%% Output
U component of wind (601, 1200) -27.909231 31.76069 0.19587262
lon: [-179.7 -179.4 -179.1 -178.8 -178.5 -178.2 -177.9 -177.6 -177.3 -177. ] [177.3 177.6 177.9 178.2 178.5 178.8 179.1 179.4 179.7 180. ]
lat: [90. 89.7 89.4 89.1 88.8 88.5 88.2 87.9 87.6 87.3] [-87.3 -87.6 -87.9 -88.2 -88.5 -88.8 -89.1 -89.4 -89.7 -90. ]
%% Cell type:code id: tags:
``` python
"""
!module load Stages/2020 GCC/9.3.0 ParaStationMPI/5.4.7-1 Cartopy
%matplotlib inline
import matplotlib.pyplot as plt
#import cartopy.crs as ccrs
import numpy as np
import cartopy
# prep some data for contourf plot
# extents: upper-right of the map
x = np.linspace(-65, -30, 30)
y = np.linspace(-30, 15, 30)
matrixLon, matrixLat = np.meshgrid(x, y)
matrixTec = 10*np.sin(matrixLon**2 + matrixLat**2)/(matrixLon**2 + matrixLat**2)
ax = plt.axes(projection=cartopy.crs.PlateCarree())
# prep increasing values of v covering values of Z (matrixTec)
v = np.arange(-0.15, 0.15, 0.025)
# plot with appropriate parameters
# zorder: put the filled-contour on top
# alpha: set transparency to allow some visibility of graphics below
cp = plt.contourf(matrixLon, matrixLat, matrixTec, v, \
transform=cartopy.crs.PlateCarree(), \
zorder=2, \
alpha=0.65, \
cmap=plt.cm.copper)
plt.colorbar(cp)
ax.add_feature(cartopy.feature.LAND)
ax.add_feature(cartopy.feature.OCEAN)
ax.add_feature(cartopy.feature.COASTLINE)
ax.add_feature(cartopy.feature.BORDERS, linestyle=':')
ax.set_extent([-85, -30, -60, 15])
plt.title('TEC Map')
plt.show()
"""
```
%% Output
"\n!module load Stages/2020 GCC/9.3.0 ParaStationMPI/5.4.7-1 Cartopy\n%matplotlib inline\nimport matplotlib.pyplot as plt\n#import cartopy.crs as ccrs\nimport numpy as np\nimport cartopy\n\n# prep some data for contourf plot\n# extents: upper-right of the map\nx = np.linspace(-65, -30, 30)\ny = np.linspace(-30, 15, 30)\nmatrixLon, matrixLat = np.meshgrid(x, y)\nmatrixTec = 10*np.sin(matrixLon**2 + matrixLat**2)/(matrixLon**2 + matrixLat**2)\n\nax = plt.axes(projection=cartopy.crs.PlateCarree())\n\n# prep increasing values of v covering values of Z (matrixTec)\nv = np.arange(-0.15, 0.15, 0.025)\n\n# plot with appropriate parameters\n# zorder: put the filled-contour on top\n# alpha: set transparency to allow some visibility of graphics below\ncp = plt.contourf(matrixLon, matrixLat, matrixTec, v, transform=cartopy.crs.PlateCarree(), zorder=2, alpha=0.65, cmap=plt.cm.copper)\nplt.colorbar(cp)\n\nax.add_feature(cartopy.feature.LAND)\nax.add_feature(cartopy.feature.OCEAN)\nax.add_feature(cartopy.feature.COASTLINE)\nax.add_feature(cartopy.feature.BORDERS, linestyle=':')\nax.set_extent([-85, -30, -60, 15])\nplt.title('TEC Map')\nplt.show()\n"
%% Cell type:code id: tags:
``` python
%matplotlib inline
import matplotlib.pyplot as plt
cs = plt.contourf(lon, lat, val)
plt.title(name)
plt.colorbar()
plt.show()
```
%% Output
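%% Cell type:markdown id: tags:
As a complement to the map, the frequency distribution of the same field can be shown with a histogram; a minimal sketch using the `val` and `name` loaded above (the bin count is chosen arbitrarily):
%% Cell type:code id: tags:
``` python
# histogram of the selected field to illustrate its frequency distribution
import matplotlib.pyplot as plt
plt.hist(val.ravel(), bins=100)
plt.xlabel(name)
plt.ylabel('number of grid points')
plt.show()
```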
%% Cell type:code id: tags:
``` python
# transform data to range 0..1
vmin = val.min()
vmax = val.max()
sval = (val - vmin) / (vmax - vmin)
# create locations vector
# locations must contain triples of lat, lon, value
locations = []
"""
# full-resolution variant (disabled)
for j in range(50, 550):
    for i in range(0, 1200):
        locations.append([lat[j], lon[i], sval[j, i]])
"""
# subsample every 3rd grid point in lat and lon
for j in range(50, 550, 3):
    for i in range(0, 1200, 3):
        locations.append([lat[j], lon[i], sval[j, i]])
```
%% Cell type:code id: tags:
``` python
from ipyleaflet import Map, basemaps, Heatmap
zoom = 2
center = [0., 0.]
# map = Map(basemap=basemaps.NASAGIBS.ViirsTrueColorCR, center=center, zoom=zoom) # loads current satellite image
# map = Map(basemap=basemaps.NASAGIBS.ViirsEarthAtNight2012, center=center, zoom=zoom)
map = Map(basemap=basemaps.CartoDB.Positron, center=center, zoom=zoom)
# add heatmap
# heatmap arguments (defaults in parentheses):
#   locations    ([])    list of center locations
#   min_opacity  (0.05)  minimum opacity the heat will start at
#   max_zoom     (18)    zoom level where max intensity is reached
#   max          (1.0)   maximum point intensity
#   radius       (25.0)  radius of each "point" of the heatmap
#   blur         (15.0)  amount of blur
#   gradient     ({0.4: 'blue', 0.6: 'cyan', 0.7: 'lime', 0.8: 'yellow', 1.0: 'red'})
heatmap = Heatmap(
    locations=locations,
    radius=3,
    max=1,
    max_zoom=3,
    blur=0.
)
map.add_layer(heatmap);
map
```
%% Output
%% Cell type:code id: tags:
``` python
import geojsoncontour
```
%% Output
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-3cf6325828cf> in <module>
----> 1 import geojsoncontour
ModuleNotFoundError: No module named 'geojsoncontour'
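%% Cell type:markdown id: tags:
_Note:_ `geojsoncontour` is apparently not available in the default kernel environment; it would presumably have to be installed first (e.g. with `pip install --user geojsoncontour`) before this import can succeed.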