Commit 191480dd authored by Bing Gong's avatar Bing Gong

update hickle package after removing egg info from gitignore

parent f0b39975
Metadata-Version: 2.1
Name: hickle
Version: 3.4.3
Summary: Hickle - a HDF5 based version of pickle
Author: Danny Price
License: UNKNOWN
Description:
Hickle is an HDF5 based clone of `pickle`, with a twist: instead of serializing to a pickle file,
Hickle dumps to a HDF5 file (Hierarchical Data Format). It is designed to be a "drop-in" replacement for pickle (for common data objects), but is
really an amalgam of `h5py` and `dill`/`pickle` with extended functionality.
That is: `hickle` is a neat little way of dumping python variables to HDF5 files that can be read in most programming
languages, not just Python. Hickle is fast, and allows for transparent compression of your data (LZF / GZIP).
## Why use Hickle?
While `hickle` is designed to be a drop-in replacement for `pickle` (or something like `json`), it works very differently.
Instead of serializing / json-izing, it stores the data using the excellent h5py module.
The main reasons to use hickle are:
1. It's faster than pickle and cPickle.
2. It stores data in HDF5.
3. You can easily compress your data.
The main reasons not to use hickle are:
1. You don't want to store your data in HDF5. While hickle can serialize arbitrary python objects, this functionality is provided only for convenience, and you're probably better off just using the pickle module.
2. You want to convert your data to human-readable JSON/YAML, in which case, you should do that instead.
So, if you want your data in HDF5, or if your pickling is taking too long, give hickle a try.
Hickle is particularly good at storing large numpy arrays, thanks to `h5py` running under the hood.
Documentation for hickle can be found online.
## Usage example
Hickle is nice and easy to use, and should look very familiar to those of you who have pickled before.
In short, `hickle` provides two methods: a `hickle.load` method for loading hickle files, and a `hickle.dump` method for dumping data into HDF5. Here's a complete example:

```python
import os
import hickle as hkl
import numpy as np

# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')

# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')

# Dump data, with compression
hkl.dump(array_obj, 'test_gzip.hkl', mode='w', compression='gzip')

# Compare filesizes
print('uncompressed: %i bytes' % os.path.getsize('test.hkl'))
print('compressed:   %i bytes' % os.path.getsize('test_gzip.hkl'))

# Load data
array_hkl = hkl.load('test_gzip.hkl')

# Check the two are the same
assert array_hkl.dtype == array_obj.dtype
assert np.all(array_hkl == array_obj)
```
### HDF5 compression options
A major benefit of `hickle` over `pickle` is that it allows fancy HDF5 features to
be applied, by passing on keyword arguments on to `h5py`. So, you can do things like:
```python
hkl.dump(array_obj, 'test_lzf.hkl', mode='w', compression='lzf', scaleoffset=0,
         chunks=(100, 100), shuffle=True, fletcher32=True)
```
A detailed explanation of these keywords is given in the h5py documentation, but we give a quick rundown below.
In HDF5, datasets are stored as B-trees, a tree data structure that has speed benefits over contiguous
blocks of data. In the B-tree, data are split into chunks,
which is leveraged to allow dataset resizing and
compression via filter pipelines. Filters such as
`shuffle` and `scaleoffset` move your data around to improve compression ratios, and `fletcher32` computes a checksum.
These file-level options are abstracted away from the data model.
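To build intuition for what `shuffle` does, here is a standalone sketch (using only `numpy` and `zlib`, not hickle's API) that byte-shuffles an array by hand and compares compressed sizes:

```python
import zlib

import numpy as np

# Byte-shuffle a float32 array by hand: group the i-th byte of every
# element together, as HDF5's shuffle filter does. For smoothly varying
# numeric data this places highly repetitive bytes (sign/exponent) next
# to each other, which generic compressors exploit.
data = np.linspace(0, 1, 65536, dtype='float32')
raw = data.tobytes()
shuffled = data.view(np.uint8).reshape(-1, data.itemsize).T.tobytes()

print('plain   :', len(zlib.compress(raw)), 'bytes')
print('shuffled:', len(zlib.compress(shuffled)), 'bytes')
```

On data like this the shuffled stream typically compresses noticeably smaller, which is why `shuffle=True` often pairs well with `compression='gzip'` or `'lzf'`.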
## Recent changes
* December 2018: Accepted to Journal of Open-Source Software (JOSS).
* June 2018: Major refactor and support for Python 3.
* Aug 2016: Added support for scipy sparse matrices `bsr_matrix`, `csr_matrix` and `csc_matrix`.
## Performance comparison
Hickle runs a lot faster than pickle with its default settings, and a little faster than pickle with `protocol=2` set:
```
In [1]: import numpy as np

In [2]: x = np.random.random((2000, 2000))

In [3]: import pickle

In [4]: f = open('foo.pkl', 'w')

In [5]: %time pickle.dump(x, f)  # slow by default
CPU times: user 2 s, sys: 274 ms, total: 2.27 s
Wall time: 2.74 s

In [6]: f = open('foo.pkl', 'w')

In [7]: %time pickle.dump(x, f, protocol=2)  # actually very fast
CPU times: user 18.8 ms, sys: 36 ms, total: 54.8 ms
Wall time: 55.6 ms

In [8]: import hickle

In [9]: f = open('foo.hkl', 'w')

In [10]: %time hickle.dump(x, f)  # a bit faster
dumping <type 'numpy.ndarray'> to file <HDF5 file "foo.hkl" (mode r+)>
CPU times: user 764 us, sys: 35.6 ms, total: 36.4 ms
Wall time: 36.2 ms
```
So if you do continue to use pickle, add the `protocol=2` keyword (thanks @mrocklin for pointing this out).
For storing python dictionaries of lists, hickle beats the python json encoder, but is slower than uJson. For a dictionary with 64 entries, each containing a 4096 length list of random numbers, the times are:
```
json took 2633.263 ms
uJson took 138.482 ms
hickle took 232.181 ms
```
It should be noted that these comparisons are of course not fair: storing in HDF5 will not help you convert something into JSON, nor will it help you serialize a string. But for quick storage of the contents of a python variable, it's a pretty good option.
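The shape of that benchmark can be sketched with the standard library alone. This is a rough sketch: `ujson` and `hickle` are left out since they may not be installed, and the timings will differ from those quoted above:

```python
import json
import pickle
import random
import time

# Build the benchmark payload described above: a dict with 64 entries,
# each a 4096-element list of random floats.
data = {str(i): [random.random() for _ in range(4096)] for i in range(64)}

def time_ms(dump, obj):
    """Time a single serialization call, in milliseconds."""
    t0 = time.perf_counter()
    dump(obj)
    return (time.perf_counter() - t0) * 1000.0

json_ms = time_ms(json.dumps, data)
pickle_ms = time_ms(lambda o: pickle.dumps(o, protocol=2), data)
print('json   took %8.3f ms' % json_ms)
print('pickle took %8.3f ms' % pickle_ms)
```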
## Installation guidelines (Linux and Mac OS)
### Easy method
Install with `pip` by running `pip install hickle` from the command line.
### Manual install
1. You should have Python 2.7 or above installed
2. Install h5py
3. Install HDF5
4. Download `hickle`:
   * via terminal: `git clone` the hickle repository
   * via manual download: go to the project page and use the `Download ZIP` link on the right-hand side
5. `cd` into your downloaded `hickle` directory
6. Run the following command in the `hickle` directory:
   `python setup.py install`
### Testing
Once installed from source, run `python setup.py test` to check it's all working.
## Bugs & contributing
Contributions and bugfixes are very welcome. Please check out our contribution guidelines
for more details on how to contribute to development.
## Referencing hickle
If you use `hickle` in academic research, we would be grateful if you could reference our paper in the Journal of Open-Source Software (JOSS):
Price et al., (2018). Hickle: A HDF5-based python pickle replacement. Journal of Open Source Software, 3(32), 1115,
Keywords: pickle,hdf5,data storage,data export
Platform: Cross platform (Linux, Mac OSX, Windows)
Requires-Python: >=2.7
Description-Content-Type: text/markdown
import re
import six


def get_type_and_data(h_node):
    """ Helper function to return the py_type and data block for a HDF node """
    py_type = h_node.attrs["type"][0]
    data = h_node[()]
    # if h_node.shape == ():
    #     data = h_node.value
    # else:
    #     data = h_node[:]
    return py_type, data


def get_type(h_node):
    """ Helper function to return the py_type for a HDF node """
    py_type = h_node.attrs["type"][0]
    return py_type
def sort_keys(key_list):
    """ Take a list of strings and sort it by integer value within string

    Args:
        key_list (list): List of keys

    Returns:
        key_list_sorted (list): List of keys, sorted by integer
    """
    # Py3 h5py returns an irritating KeysView object
    # Py3 also complains about bytes and strings, convert all keys to bytes
    if six.PY3:
        key_list2 = []
        for key in key_list:
            if isinstance(key, str):
                key = bytes(key, 'ascii')
            key_list2.append(key)
        key_list = key_list2

    # Check which keys contain a number
    numbered_keys = [re.search(br'\d+', key) for key in key_list]

    # Sort the keys on number if they have it, or normally if not
    if(len(key_list) and not numbered_keys.count(None)):
        to_int = lambda x: int(re.search(br'\d+', x).group(0))
        return(sorted(key_list, key=to_int))
    else:
        return(sorted(key_list))
def check_is_iterable(py_obj):
    """ Check whether a python object is iterable.

    Note: this treats unicode and string as NON ITERABLE

    Args:
        py_obj: python object to test

    Returns:
        iter_ok (bool): True if item is iterable, False if item is not
    """
    if six.PY2:
        string_types = (str, unicode)
    else:
        string_types = (str, bytes, bytearray)
    if isinstance(py_obj, string_types):
        return False
    try:
        iter(py_obj)
        return True
    except TypeError:
        return False
def check_is_hashable(py_obj):
    """ Check if a python object is hashable

    Note: this function is currently not used, but is useful for future
    development.

    Args:
        py_obj: python object to test

    Returns:
        (bool): True if item is hashable, False if not
    """
    try:
        hash(py_obj)
        return True
    except TypeError:
        return False
def check_iterable_item_type(iter_obj):
    """ Check if all items within an iterable are the same type.

    Args:
        iter_obj: iterable object

    Returns:
        iter_type: type of item contained within the iterable. If
        the iterable has many types, a boolean False is returned instead.
    """
    iseq = iter(iter_obj)
    try:
        first_type = type(next(iseq))
    except StopIteration:
        return False
    except Exception:
        return False
    return first_type if all((type(x) is first_type) for x in iseq) else False
# encoding: utf-8
"""
Created by Danny Price 2012-05-28.

Hickle is a HDF5 based clone of Pickle. Instead of serializing to a
pickle file, Hickle dumps to a HDF5 file. It is designed to be as similar
to pickle in usage as possible.

## Notes

This is a legacy handler, for hickle v1 files.
If V2 reading fails, this will be called as a fail-over.
"""
import os
import sys
import numpy as np
import h5py as h5

if sys.version_info.major == 3:
    NoneType = type(None)
else:
    from types import NoneType

__version__ = "1.3.0"
__author__ = "Danny Price"
## Error handling ##

class FileError(Exception):
    """ An exception raised if the file is fishy"""
    def __init__(self):
        return

    def __str__(self):
        return ("Error: cannot open file. Please pass either a filename "
                "string, a file object, or a h5py.File")


class NoMatchError(Exception):
    """ An exception raised if the object type is not understood (or supported)"""
    def __init__(self):
        return

    def __str__(self):
        return "Error: this type of python object cannot be converted into a hickle."


class ToDoError(Exception):
    """ An exception raised for non-implemented functionality"""
    def __init__(self):
        return

    def __str__(self):
        return "Error: this functionality hasn't been implemented yet."
class H5GroupWrapper(h5.Group):
    def create_dataset(self, *args, **kwargs):
        kwargs['track_times'] = getattr(self, 'track_times', True)
        return super(H5GroupWrapper, self).create_dataset(*args, **kwargs)

    def create_group(self, *args, **kwargs):
        group = super(H5GroupWrapper, self).create_group(*args, **kwargs)
        group.__class__ = H5GroupWrapper
        group.track_times = getattr(self, 'track_times', True)
        return group


class H5FileWrapper(h5.File):
    def create_dataset(self, *args, **kwargs):
        kwargs['track_times'] = getattr(self, 'track_times', True)
        return super(H5FileWrapper, self).create_dataset(*args, **kwargs)

    def create_group(self, *args, **kwargs):
        group = super(H5FileWrapper, self).create_group(*args, **kwargs)
        group.__class__ = H5GroupWrapper
        group.track_times = getattr(self, 'track_times', True)
        return group
def file_opener(f, mode='r', track_times=True):
    """ A file opener helper function with some error handling.

    This can open files through a file object, a h5py file, or just the filename.
    """
    # Were we handed a file object or just a file name string?
    # (note: `file` is the Python 2 built-in; this legacy path predates Py3)
    if isinstance(f, file):
        filename, mode = f.name, f.mode
        h5f = h5.File(filename, mode)
    elif isinstance(f, h5._hl.files.File):
        h5f = f
    elif isinstance(f, str):
        filename = f
        h5f = h5.File(filename, mode)
    else:
        raise FileError

    h5f.__class__ = H5FileWrapper
    h5f.track_times = track_times
    return h5f
## dumpers ##

def dump_ndarray(obj, h5f, **kwargs):
    """ dumps an ndarray object to h5py file"""
    h5f.create_dataset('data', data=obj, **kwargs)
    h5f.create_dataset('type', data=['ndarray'])


def dump_np_dtype(obj, h5f, **kwargs):
    """ dumps an np dtype object to h5py file"""
    h5f.create_dataset('data', data=obj)
    h5f.create_dataset('type', data=['np_dtype'])


def dump_np_dtype_dict(obj, h5f, **kwargs):
    """ dumps an np dtype object within a group"""
    h5f.create_dataset('data', data=obj)
    h5f.create_dataset('_data', data=['np_dtype'])


def dump_masked(obj, h5f, **kwargs):
    """ dumps a masked array object to h5py file"""
    h5f.create_dataset('data', data=obj, **kwargs)
    h5f.create_dataset('mask', data=obj.mask, **kwargs)
    h5f.create_dataset('type', data=['masked'])


def dump_list(obj, h5f, **kwargs):
    """ dumps a list object to h5py file"""
    # Check if there are any numpy arrays in the list
    contains_numpy = any(isinstance(el, np.ndarray) for el in obj)

    if contains_numpy:
        _dump_list_np(obj, h5f, **kwargs)
    else:
        h5f.create_dataset('data', data=obj, **kwargs)
        h5f.create_dataset('type', data=['list'])


def _dump_list_np(obj, h5f, **kwargs):
    """ Dump a list of numpy objects to file """
    np_group = h5f.create_group('data')