Apache Arrow for variable-length data

In this guide, we present an example of storing arbitrary data with Python using the efficient and easy-to-use Apache Arrow file format. We will use fake image classification data with varying image sizes to create and read from an Apache Arrow file.

In particular, Apache Arrow's API for writing and reading variable-length byte data is, while it takes some getting used to, a great trade-off between usability and efficiency. We use Apache Arrow from Python via PyArrow.

While the principles of this guide work for any type of data because they rely only on variable-length byte sequences, data reading, and especially random-access reading, will speed up greatly if you can specify structured data. There is an example of this in fixlength_arrow.py. Please also check the PyArrow documentation or the PyArrow cookbook to find out more.
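
To make the variable-length byte pattern concrete before running the scripts, below is a minimal sketch of writing encoded image bytes of different lengths into an Arrow IPC file. It is not the code from varlength_arrow.py; the column names image and label and the one-record-batch-per-sample layout are assumptions chosen for this illustration.

    import pyarrow as pa

    # Two columns: variable-length encoded image bytes and an integer class label.
    schema = pa.schema([('image', pa.binary()), ('label', pa.int64())])

    # Stand-in samples; in practice these would be real encoded images (e.g. PNG/JPEG bytes).
    samples = [(b'\x89PNG...first fake image...', 3), (b'\xff\xd8...second fake image...', 7)]

    with pa.OSFile('example.arrow', 'wb') as sink:
        with pa.ipc.new_file(sink, schema) as writer:
            for img_bytes, label in samples:
                # One record batch per sample keeps random access simple when reading.
                batch = pa.record_batch(
                    [pa.array([img_bytes], type=pa.binary()),
                     pa.array([label], type=pa.int64())],
                    schema=schema,
                )
                writer.write_batch(batch)

Because the image column has type pa.binary(), each row can hold a byte string of a different length.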

Setup

There are two options for setting up the required packages for this guide:

1. Either create a new Python venv using the included requirements.txt, for example:

       python3 -m venv ./env
       source env/bin/activate
       python -m pip install -U pip
       python -m pip install -r requirements.txt

2. Or use the provided modules from the module system. Note that the most recent software stage does not include the Arrow module, so we use an older software stage for this guide:

       source modules.sh

The rest of this guide assumes that you used the first option. If you used the module system instead, replace every occurrence of source env/bin/activate with source modules.sh.

Data creation

To create a train and validation split of fake data in the directory ./data, execute the following:

    source env/bin/activate
    python varlength_arrow.py ./data

This will create two files of roughly 120 MB each, with 1000 fake image samples per file.
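
To sanity-check the output without the Dataset class, you can open one of the generated files directly with PyArrow. The file name matches the train split used below; the exact schema that gets printed depends on how varlength_arrow.py lays out its columns.

    import pyarrow as pa

    with pa.memory_map('./data/varlength-data-train.arrow') as source:
        reader = pa.ipc.open_file(source)
        table = reader.read_all()
        print(table.schema)    # column names and types chosen by varlength_arrow.py
        print(table.num_rows)  # expected: 1000 samples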

Data reading

To use the example PyTorch Dataset implementation for reading, try the following:

    source env/bin/activate
    python
    >>> import varlength_arrow as va
    >>> dset = va.VarlengthArrow('./data/varlength-data-train.arrow')
    >>> # Now you can use `dset` like a normal image classification dataset.
    >>> pil_img, label = next(iter(dset))
    >>> for (pil_img, label) in dset:
    ...     print(f'{pil_img.size = }, {label = }')
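
For reference, here is a minimal sketch of what such a map-style Dataset could look like. It is not the implementation from varlength_arrow.py: it assumes one sample per record batch and the hypothetical column names image and label from the writing sketch above, and it decodes the stored bytes with Pillow.

    import io

    import pyarrow as pa
    from PIL import Image
    from torch.utils.data import Dataset


    class VarlengthArrowSketch(Dataset):
        """Hypothetical map-style Dataset over an Arrow IPC file of encoded images."""

        def __init__(self, path):
            # Memory-map the file so samples are read lazily instead of loading
            # the whole dataset into RAM up front.
            self._reader = pa.ipc.open_file(pa.memory_map(path))

        def __len__(self):
            # Assumes the writer stored exactly one sample per record batch.
            return self._reader.num_record_batches

        def __getitem__(self, idx):
            batch = self._reader.get_batch(idx)
            # Assumed columns: 'image' holds the encoded image bytes, 'label' an int.
            img_bytes = batch.column('image')[0].as_py()
            label = batch.column('label')[0].as_py()
            return Image.open(io.BytesIO(img_bytes)), label

Such a class can then be used exactly like the session above, e.g. dset = VarlengthArrowSketch('./data/varlength-data-train.arrow').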