# Apache Arrow for variable-length data
In this guide, we present an example of storing arbitrary data with Python in the efficient and easy-to-use Apache Arrow file format. We use fake image classification data with varying image sizes to create and then read an Apache Arrow file.

Apache Arrow's API for writing and reading variable-length byte data takes some getting used to, but it strikes a great balance between usability and efficiency. We access Apache Arrow from Python through PyArrow.
While the principles of this guide apply to any type of data because everything is based on variable-length byte sequences, data reading, and especially random-access data reading, speeds up greatly if you can specify structured data. There is an example in fixlength_arrow.py for this. Please also check the PyArrow documentation or the PyArrow cookbook to find out more.
## Setup
There are two options for setting up the required packages for this guide:

- Either create a new Python venv using the included `requirements.txt`, like:

  ```shell
  python3 -m venv ./env
  source env/bin/activate
  python -m pip install -U pip
  python -m pip install -r requirements.txt
  ```

- Or use the provided modules from the module system. Note that the most recent software stage does not have the Arrow module, so we use an older software stage for this guide:

  ```shell
  source modules.sh
  ```

The rest of this guide assumes you used the first option. If you used the module system instead, replace every further occurrence of `source env/bin/activate` with `source modules.sh`.
## Data creation
To create a train and validation split of fake data in the directory `./data`, execute the following:

```shell
source env/bin/activate
python varlength_arrow.py ./data
```
This will create two files of roughly 120 MB each, each containing 1000 fake image samples.
## Data reading
To use the example PyTorch `Dataset` implementation for reading, try the following:

```shell
source env/bin/activate
python
```

```python
>>> import varlength_arrow as va
>>> dset = va.VarlengthArrow('./data/varlength-data-train.arrow')
>>> # Now you can use `dset` like a normal image classification dataset.
>>> pil_img, label = next(iter(dset))
>>> for (pil_img, label) in dset:
...     print(f'{pil_img.size = }, {label = }')
```