import numpy as np
print(np.__version__)
np.show_config()  # prints the build configuration, including the linked BLAS/LAPACK libraries
NumPy - an HPC perspective
Python is an interpreted language and as such extremely flexible, allowing one to define everything, including code itself, at runtime. This means that the Python interpreter performs a lot of work behind the scenes (e.g. resolving types at runtime) to keep things as simple and productive as possible for the programmer. This flexibility comes at the price of a markedly reduced runtime speed compared to compiled languages such as C++. Many problems encoded in Python do not require the full range of flexibility and could easily trade it for improved runtime performance without giving up the development productivity that is key to Python's success.
Background
NumPy is a library, written in C, that enables fast numerical computing in Python. Since the mid-1990s, when its predecessor Numeric was introduced, it has become the first choice for numeric data processing in Python. The main sacrifice made to enable its high speed is the restriction to containers with a uniform memory layout, i.e., arrays whose items all have the same static data type.
Despite this apparently severe restriction, NumPy is quite versatile, and a full introduction to NumPy would require a course of its own. Hence we will focus on the concepts that have allowed its wide adoption by developers and HPC vendors alike. For an overview of the available functionality, the NumPy website https://www.numpy.org is a good starting point.
Note: NumPy is not part of the vanilla Python distribution and needs to be installed separately (e.g. via pip install numpy or your distribution's package manager). It can then be used like any other Python library.
The NumPy BLAS and LAPACK bindings
NumPy does not implement everything itself. Instead it exploits the fact that, with BLAS (low-level routines) and LAPACK (higher-level routines built on BLAS), two widely adopted interfaces for linear algebra routines exist, and that for almost all distributions and operating systems at least one library supporting the BLAS/LAPACK interface is available.
While not mandatory, NumPy can be linked against high-performance linear algebra libraries that follow the BLAS/LAPACK APIs, such as Intel's MKL, AMD's BLIS or IBM's ESSL. These libraries can give you speed-ups of up to two orders of magnitude compared to the reference LAPACK implementation. In fact, a number of distributions provide NumPy builds that are statically linked to Intel's MKL, and when building from source NumPy prefers the optimized libraries (MKL, BLIS) over the slower, more hardware-independent ones (OpenBLAS, ATLAS).
Notes:
- Not all routines of MKL are necessarily faster than those in OpenBLAS, e.g. random number generation.
- By default NumPy prefers Intel MKL over AMD BLIS, even on AMD processors. MKL works on AMD processors, but by default it does not use the fast AVX2 and AVX512 BLAS routines that it selects on detected Intel hardware; it falls back to the much slower SSE2 routines instead. While there are hacks to change that behaviour, it may therefore be more reasonable to change the order of preference for the different BLAS libraries on AMD platforms.
You can check which BLAS library has been linked during your build of NumPy by calling the show_config() function.
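To see the linked BLAS at work, you can time a matrix multiplication, which NumPy dispatches to the BLAS dgemm routine for float64 arrays. A minimal sketch (the matrix size n is an arbitrary choice; absolute timings depend entirely on your hardware and BLAS build):

import numpy as np
import time

n = 2000
a = np.random.rand(n, n)  # float64 by default
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b                 # dispatched to the linked BLAS (dgemm)
dt = time.perf_counter() - t0
# a matmul needs ~2*n**3 floating point operations
print(f"{n}x{n} matmul took {dt:.3f} s, ~{2 * n**3 / dt / 1e9:.1f} GFLOP/s")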
ndarray
The main datatype in NumPy is an n-dimensional array object called ndarray. By default it has C-array memory layout (last index fastest, 0-based indices) and a uniform fixed element type defined at instantiation of the ndarray. Element sizes range from 1 byte (bool, uint8) to 16 bytes (complex128), with the default being float64.
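A quick illustration of element types, sizes and the default memory layout (a minimal sketch; the array shapes are arbitrary):

import numpy as np

a = np.zeros((2, 3))                      # default element type is float64
print(a.dtype, a.dtype.itemsize)          # float64 8
b = np.zeros(4, dtype=np.uint8)           # 1-byte elements
print(b.dtype, b.dtype.itemsize)          # uint8 1
print(np.dtype(np.complex128).itemsize)   # 16
print(a.flags['C_CONTIGUOUS'])            # True: C memory layout by default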
introspection, shape
Several pieces of metadata of an ndarray can be obtained, among them a.ndim, a.size, a.nbytes, a.flags, and type(a).
The array interface enables in-depth introspection such as obtaining the address of an object, e.g. for debugging purposes.
myarr = np.array([[1, 2, 3], [4, 5, 6]])
print(f"myarr is a {myarr.ndim}-dimensional {type(myarr)} of type {myarr.dtype} and shape {myarr.shape}. It starts at address {hex(myarr.__array_interface__['data'][0])}.")
print(f"Each of its {myarr.size} {myarr.dtype}-elements has a size of {myarr.dtype.itemsize} bytes, therefore it has {myarr.nbytes} bytes in total and its strides (in bytes) are {myarr.strides}.")
print("The flags stored for this array are:")
print(myarr.flags)
NumPy's ndarray as a foundation
ndarrays support important concepts that enable vectorized processing of data, such as broadcasting, ufuncs, and reductions (see the sketch below). But why use this only for numeric data? Almost anything can be cast into an ndarray, and labeled access is possible as well.
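First, a minimal sketch of broadcasting, a ufunc and a reduction (the array values are arbitrary):

import numpy as np

m = np.arange(6.0).reshape(2, 3)   # 2x3 matrix
v = np.array([10.0, 20.0, 30.0])   # length-3 vector

print(m + v)          # broadcasting: v is virtually replicated along axis 0
print(np.sqrt(m))     # ufunc: applied elementwise, the loop runs in C
print(m.sum(axis=1))  # reduction: row sums

Labeled access works via structured data types, as the following example shows: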
# structured dtype: a 10-byte string, a 16-bit integer, and a nested record of four floats
dt = np.dtype([('name', 'S10'), ('income', np.int16),
               ('revenue', [('q1', np.float32), ('q2', np.float32),
                            ('q3', np.float32), ('q4', np.float32)])])
f = np.loadtxt('test.dat', dtype=dt, skiprows=2, delimiter=',')
print(f)
print(f"this array has {f.ndim} dimension(s)")
However, note that most functionality in NumPy is geared towards arrays of a uniform numerical type. The above data type means that each entire row is treated as a single (albeit very complicated) element, so the data in the table become a one-dimensional array. If you need spreadsheet-like functionality, NumPy's 'structured arrays' or 'record arrays', which are based on ndarrays, are not adequate. Luckily, ndarrays can be extended in functionality and even reimplemented in a compatible way.
Design features of the ndarray API
Due to their dual interface in Python and C and their often (though not always) regular memory layout, ndarrays can be mapped relatively easily to data types defined in other language contexts.
A couple of clever design choices add to the popularity of ndarrays as a standard API for high performance computing in Python:
- ndarrays support both C-style and Fortran-style memory layouts, as well as memory alignment and view interfaces.
- A mixin class, numpy.lib.mixins.NDArrayOperatorsMixin, provides all the necessary standard Python interfaces like __add__, __lt__ etc., making custom array containers more convenient to write.
- ndarray.__array_ufunc__ and __array_function__ are interfaces that separate the function API from the execution engine. By overriding them, a custom container can intercept calls to ufuncs and non-ufuncs and dispatch these calls to alternative implementations (see the sketch after this list). Several important packages already use this approach.
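A minimal sketch of such a custom container (the class LoggedArray and its behaviour are made up for illustration; it simply wraps an ndarray and logs every ufunc call before delegating to NumPy):

import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin

class LoggedArray(NDArrayOperatorsMixin):
    """Wraps an ndarray and logs every ufunc call (illustrative only)."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __repr__(self):
        return f"LoggedArray({self._data!r})"

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        print(f"intercepted {ufunc.__name__}.{method}")
        # unwrap any LoggedArray inputs, delegate to the real ufunc, re-wrap the result
        args = [x._data if isinstance(x, LoggedArray) else x for x in inputs]
        return LoggedArray(getattr(ufunc, method)(*args, **kwargs))

a = LoggedArray([1.0, 4.0, 9.0])
print(a + 1)       # NDArrayOperatorsMixin maps '+' to np.add, which is intercepted
print(np.sqrt(a))  # direct ufunc calls are intercepted as well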
Packages that implement the ndarray interface:
package/module | extension(s) | other HPC relevant characteristics
---|---|---
CuPy | CuPy array: subset of ndarray functionality | ndarrays on GPUs
pandas | pandas DataFrame: labeled data, non-numerical data, multi-indexing, grouping | fast CSV parser
cuDF | cuDF DataFrame: subset of pandas DataFrame functionality | pandas DataFrames on GPUs
Dask | Dask array: subset of ndarray functionality | tiled ndarrays larger than main memory, distributed processing on multiple nodes
Dask | Dask DataFrame: subset of pandas DataFrame functionality | tiled DataFrames larger than main memory, distributed processing on multiple nodes
dask-cuDF | dask-cuDF DataFrame: subset of pandas DataFrame functionality | tiled DataFrames on multiple GPUs and multiple nodes
sparse | sparse array: ndarray functionality on sparse arrays (COO layout) |
Notable packages that do not implement the ndarray interface are numpy.matrix, scipy.sparse, PyTorch and JAX, while TensorFlow has an experimental interface.
End