Introduction to Data Analysis and Plotting with Pandas

JSC Tutorial

Andreas Herten, Forschungszentrum Jülich, 26 February 2019

My Motivation

  • I like Python
  • I like plotting data
  • I like sharing
  • I think Pandas is awesome and you should use it too

Motto: »Pandas as early as possible!«

Tutorial Setup

  • 60 minutes (we might do this again for some advanced stuff if you want to)
    • Well, as it turns out, 60 minutes weren't nearly enough
    • We ended up spending nearly 2 hours on it and still had to rush through the material
  • Alternating between lecture and hands-on
  • Please give status of hands-ons via pollev.com/aherten538
  • Please open Jupyter Notebook of this session
    • … either on your local machine (pip install --user pandas seaborn)
    • … or on the JSC Jupyter service at https://jupyter-jsc.fz-juelich.de/
      Pandas and seaborn should already be there!
  • Tell me when you're done on pollev.com/aherten538

About Pandas

  • Python package (Python 2, Python 3)
  • For data analysis
  • Provides data structures (multi-dimensional tables, time series) and operations on them
  • Name from »Panel Data« (multi-dimensional time series in economics)
  • Since 2008
  • https://pandas.pydata.org/
  • Install via PyPI: pip install pandas

Pandas Cohabitation

First Steps

In [1]:
import pandas
In [2]:
import pandas as pd
In [3]:
pd.__version__
Out[3]:
'0.24.1'
In [4]:
%pdoc pd
Class docstring:
    pandas - a powerful data analysis and manipulation library for Python
    =====================================================================
    
    **pandas** is a Python package providing fast, flexible, and expressive data
    structures designed to make working with "relational" or "labeled" data both
    easy and intuitive. It aims to be the fundamental high-level building block for
    doing practical, **real world** data analysis in Python. Additionally, it has
    the broader goal of becoming **the most powerful and flexible open source data
    analysis / manipulation tool available in any language**. It is already well on
    its way toward this goal.
    
    Main Features
    -------------
    Here are just a few of the things that pandas does well:
    
      - Easy handling of missing data in floating point as well as non-floating
        point data.
      - Size mutability: columns can be inserted and deleted from DataFrame and
        higher dimensional objects
      - Automatic and explicit data alignment: objects can be explicitly aligned
        to a set of labels, or the user can simply ignore the labels and let
        `Series`, `DataFrame`, etc. automatically align the data for you in
        computations.
      - Powerful, flexible group by functionality to perform split-apply-combine
        operations on data sets, for both aggregating and transforming data.
      - Make it easy to convert ragged, differently-indexed data in other Python
        and NumPy data structures into DataFrame objects.
      - Intelligent label-based slicing, fancy indexing, and subsetting of large
        data sets.
      - Intuitive merging and joining data sets.
      - Flexible reshaping and pivoting of data sets.
      - Hierarchical labeling of axes (possible to have multiple labels per tick).
      - Robust IO tools for loading data from flat files (CSV and delimited),
        Excel files, databases, and saving/loading data from the ultrafast HDF5
        format.
      - Time series-specific functionality: date range generation and frequency
        conversion, moving window statistics, moving window linear regressions,
        date shifting and lagging, etc.

DataFrames

It's all about DataFrames

  • Main data containers of Pandas
    • Linear: Series
    • Multi Dimension: DataFrame
  • A Series is just a special case of a DataFrame
  • → We'll talk about DataFrames as the more general case (see the short sketch below)
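A minimal sketch of the relation between the two containers (the variable names are illustrative):

s = pd.Series([41, 56, 38], name="Ages")  # one-dimensional, labeled data
df = s.to_frame()                         # a Series converts into a one-column DataFrame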

DataFrames

Construction

  • To show features of DataFrame, let's construct one!
  • Many construction possibilities
    • From lists, dictionaries, numpy objects
    • From CSV, HDF5, JSON, Excel, HTML, fixed-width files
    • From pickled Pandas data
    • From clipboard
    • From Feather, Parquet, SAS, SQL, Google BigQuery, STATA

DataFrames

Examples, finally

In [5]:
ages  = [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]
In [6]:
pd.DataFrame(ages)
Out[6]:
0
0 41
1 56
2 56
3 57
4 39
5 59
6 43
7 56
8 38
9 60
In [7]:
df_ages = pd.DataFrame(ages)
df_ages.head(3)
Out[7]:
0
0 41
1 56
2 56
  • Let's add names to ages; put everything into a dict()
In [8]:
data = {
    "Names": ["Liu", "Rowland", "Rivers", "Waters", "Rice", "Fields", "Kerr", "Romero", "Davis", "Hall"],
    "Ages": ages
}
print(data)
{'Names': ['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr', 'Romero', 'Davis', 'Hall'], 'Ages': [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]}
In [9]:
df_sample = pd.DataFrame(data)
df_sample.head(4)
Out[9]:
Names Ages
0 Liu 41
1 Rowland 56
2 Rivers 56
3 Waters 57
  • Two columns now; one for names, one for ages
In [10]:
df_sample.columns
Out[10]:
Index(['Names', 'Ages'], dtype='object')
  • DataFrames always have an index; either auto-generated or custom
In [11]:
df_sample.index
Out[11]:
RangeIndex(start=0, stop=10, step=1)
  • Make Names the index with .set_index()
  • inplace=True will modify the parent frame (I don't like it; see the sketch below)
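A quick sketch of the variant I prefer, plain reassignment (not executed here, since the next cell uses inplace=True):

df_sample = df_sample.set_index("Names")  # returns a new frame, no hidden mutation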
In [12]:
df_sample.set_index("Names", inplace=True)
df_sample
Out[12]:
Ages
Names
Liu 41
Rowland 56
Rivers 56
Waters 57
Rice 39
Fields 59
Kerr 43
Romero 56
Davis 38
Hall 60
  • Some more operations
In [13]:
df_sample.describe()
Out[13]:
Ages
count 10.000000
mean 50.500000
std 9.009255
min 38.000000
25% 41.500000
50% 56.000000
75% 56.750000
max 60.000000
In [14]:
df_sample.T
Out[14]:
Names Liu Rowland Rivers Waters Rice Fields Kerr Romero Davis Hall
Ages 41 56 56 57 39 59 43 56 38 60
In [15]:
df_sample.T.columns
Out[15]:
Index(['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr',
       'Romero', 'Davis', 'Hall'],
      dtype='object', name='Names')
  • Also: Arithmetic operations
In [16]:
df_sample.multiply(2).head(3)
Out[16]:
Ages
Names
Liu 82
Rowland 112
Rivers 112
In [17]:
df_sample.reset_index().multiply(2).head(3)
Out[17]:
Names Ages
0 LiuLiu 82
1 RowlandRowland 112
2 RiversRivers 112
In [18]:
(df_sample / 2).head(3)
Out[18]:
Ages
Names
Liu 20.5
Rowland 28.0
Rivers 28.0
In [19]:
(df_sample * df_sample).head(3)
Out[19]:
Ages
Names
Liu 1681
Rowland 3136
Rivers 3136

Logical operations are allowed as well

In [20]:
df_sample > 40
Out[20]:
Ages
Names
Liu True
Rowland True
Rivers True
Waters True
Rice False
Fields True
Kerr True
Romero True
Davis False
Hall True

Task 1

  • Create data frame with
    • 10 names of dinosaurs,
    • their favourite prime number,
    • and their favourite color
  • Play around with the frame
  • Tell me on poll when you're done: pollev.com/aherten538
In [21]:
happy_dinos = {
    "Dinosaur Name": [],
    "Favourite Prime": [],
    "Favourite Color": []
}
#df_dinos = 
In [22]:
happy_dinos = {
    "Dinosaur Name": ["Aegyptosaurus", "Tyrannosaurus", "Panoplosaurus", "Isisaurus", "Triceratops", "Velociraptor"],
    "Favourite Prime": ["4", "8", "15", "16", "23", "42"],
    "Favourite Color": ["blue", "white", "blue", "purple", "violet", "gray"]
}
df_dinos = pd.DataFrame(happy_dinos).set_index("Dinosaur Name")
df_dinos.T
Out[22]:
Dinosaur Name Aegyptosaurus Tyrannosaurus Panoplosaurus Isisaurus Triceratops Velociraptor
Favourite Prime 4 8 15 16 23 42
Favourite Color blue white blue purple violet gray

Some more DataFrame examples

In [24]:
import numpy as np  # np is used in column C below but was not imported earlier

df_demo = pd.DataFrame({
    "A": 1.2,
    "B": pd.Timestamp('20180226'),
    "C": [(-1)**i * np.sqrt(i) + np.e * (-1)**(i-1) for i in range(5)],
    "D": pd.Categorical(["This", "column", "has", "entries", "entries"]),
    "E": "Same"
})
df_demo
Out[24]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
3 1.2 2018-02-26 0.986231 entries Same
4 1.2 2018-02-26 -0.718282 entries Same
In [25]:
df_demo.sort_values("C")
Out[25]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
2 1.2 2018-02-26 -1.304068 has Same
4 1.2 2018-02-26 -0.718282 entries Same
3 1.2 2018-02-26 0.986231 entries Same
1 1.2 2018-02-26 1.718282 column Same
In [26]:
df_demo.round(2).tail(2)
Out[26]:
A B C D E
3 1.2 2018-02-26 0.99 entries Same
4 1.2 2018-02-26 -0.72 entries Same
In [27]:
df_demo.round(2).sum()
Out[27]:
A                              6
C                          -2.03
D    Thiscolumnhasentriesentries
E           SameSameSameSameSame
dtype: object
In [28]:
print(df_demo.round(2).to_latex())
\begin{tabular}{lrlrll}
\toprule
{} &    A &          B &     C &        D &     E \\
\midrule
0 &  1.2 & 2018-02-26 & -2.72 &     This &  Same \\
1 &  1.2 & 2018-02-26 &  1.72 &   column &  Same \\
2 &  1.2 & 2018-02-26 & -1.30 &      has &  Same \\
3 &  1.2 & 2018-02-26 &  0.99 &  entries &  Same \\
4 &  1.2 & 2018-02-26 & -0.72 &  entries &  Same \\
\bottomrule
\end{tabular}

Reading External Data

(Links to documentation)

Example:

{
    "Character": ["Sawyer", "…", "Walt"],
    "Actor": ["Josh Holloway", "…", "Malcolm David Kelley"],
    "Main Cast": [true,  "…", false]
}
In [29]:
pd.read_json("lost.json").set_index("Character").sort_index()
Out[29]:
Actor Main Cast
Character
Hurley Jorge Garcia True
Jack Matthew Fox True
Kate Evangeline Lilly True
Locke Terry O'Quinn True
Sawyer Josh Holloway True
Walt Malcolm David Kelley False

Task 2

  • Read in nest-data.csv to DataFrame; call it df
    The data was produced with JUBE; Pandas works very well together with JUBE
  • Get to know it and play a bit with it
  • Tell me when you're done: pollev.com/aherten538
In [30]:
!cat nest-data.csv | head -3
id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay
5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5
5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5
In [31]:
df = pd.read_csv("nest-data.csv")
df.head()
Out[31]:
id Nodes Tasks/Node Threads/Task Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s ... Max. Init. Time / s Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay
0 5 1 2 4 420.42 10 True 0.29 88.12 88.18 ... 1.20 17.26 311.52 46560664.0 825499 7.48 112500 1265738500 1.5 1.5
1 5 1 4 4 200.84 10 True 0.15 46.03 46.34 ... 1.01 7.87 142.97 46903088.0 802865 7.03 112500 1265738500 1.5 1.5
2 5 1 2 8 202.15 10 True 0.28 47.98 48.48 ... 1.20 7.95 142.81 47699384.0 802865 7.03 112500 1265738500 1.5 1.5
3 5 1 4 8 89.57 10 True 0.15 20.41 23.21 ... 3.04 3.19 60.31 46813040.0 821491 7.23 112500 1265738500 1.5 1.5
4 5 2 2 4 164.16 10 True 0.20 40.03 41.09 ... 1.58 6.08 114.88 46937216.0 802865 7.03 112500 1265738500 1.5 1.5

5 rows × 21 columns

Read CSV Options

  • See also full API documentation
  • Important parameters (a combined example sketch follows below the signature)
    • sep: Set separator (for example : instead of ,)
    • header: Specify info about headers for columns; able to use multi-index for columns!
    • names: Alternative to header – provide your own column titles
    • usecols: Don't read the whole set of columns, but only these; works with any list, e.g. list(range(0, 20, 2))…
    • skiprows: Don't read in these rows
    • na_values: What string(s) to recognize as N/A values (which will be ignored during operations on data frame)
    • parse_dates: Try to parse dates in CSV; behaviour depends on the provided data structure; optionally used together with date_parser
    • compression: Treat input file as compressed file ("infer", "gzip", "zip", …)
    • decimal: Decimal point divider – for German data…
pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
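A combined sketch of several of these options; the file name and all parameter values are purely illustrative:

df = pd.read_csv(
    "measurements.csv.gz",
    sep=";",                  # semicolon-separated values
    skiprows=[1],             # skip, e.g., a units row right below the header
    na_values=["n/a", "-"],   # additional strings to treat as NaN
    parse_dates=["Date"],     # parse this (hypothetical) column as datetime
    compression="infer",      # detect gzip from the file suffix
    decimal=",",              # German-style decimal separator
)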

Slicing of Data Frames

Slicing Columns

  • Use square-bracket operators to slice data frame: []
    • Use column name to select column
    • Also: Slice horizontally
  • Example: Select only column C from df_demo
In [32]:
df_demo.head(3)
Out[32]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
In [33]:
df_demo["C"]
Out[33]:
0   -2.718282
1    1.718282
2   -1.304068
3    0.986231
4   -0.718282
Name: C, dtype: float64
  • Select more than one column by providing list [] to slice operator []
  • You usually end up forgetting one of the brackets…
  • Example: Select list of columns A and C, ["A", "C"] from df_demo
In [34]:
df_demo[["A", "C"]]
Out[34]:
A C
0 1.2 -2.718282
1 1.2 1.718282
2 1.2 -1.304068
3 1.2 0.986231
4 1.2 -0.718282
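Related, as a sketch: columns can also be selected by dtype rather than by name.

df_demo.select_dtypes(include="number")  # only the numerical columns (A and C)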

Slicing of Data Frames

Slicing rows

  • Use numerical values to slice rows
  • Use ranges just like with Python lists
In [35]:
df_demo[1:3]
Out[35]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
  • Get a certain range by position with .iloc[], as per the current sort order
In [36]:
df_demo.iloc[1:3]
Out[36]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
In [37]:
df_demo.iloc[1:6:2]
Out[37]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
3 1.2 2018-02-26 0.986231 entries Same
  • Attention: .iloc[] location might change after re-sorting!
In [38]:
df_demo.sort_values("C").iloc[1:3]
Out[38]:
A B C D E
2 1.2 2018-02-26 -1.304068 has Same
4 1.2 2018-02-26 -0.718282 entries Same
  • One more row-slicing option: .loc[]
  • See the difference with a proper index (and not the auto-generated default index from before)
In [39]:
df_demo_indexed = df_demo.set_index("D")
df_demo_indexed
Out[39]:
A B C E
D
This 1.2 2018-02-26 -2.718282 Same
column 1.2 2018-02-26 1.718282 Same
has 1.2 2018-02-26 -1.304068 Same
entries 1.2 2018-02-26 0.986231 Same
entries 1.2 2018-02-26 -0.718282 Same
In [40]:
df_demo_indexed.loc["entries"]
Out[40]:
A B C E
D
entries 1.2 2018-02-26 0.986231 Same
entries 1.2 2018-02-26 -0.718282 Same

Advanced Slicing: Logical Slicing

In [41]:
df_demo[df_demo["C"] > 0]
Out[41]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
3 1.2 2018-02-26 0.986231 entries Same
In [42]:
df_demo[(df_demo["C"] < 0) & (df_demo["D"] == "entries")]
Out[42]:
A B C D E
4 1.2 2018-02-26 -0.718282 entries Same
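The same kind of filtering can be written with .query() or .isin(); a sketch:

df_demo.query("C < 0 and D == 'entries'")       # expression-style filtering
df_demo[df_demo["D"].isin(["has", "entries"])]  # membership test on a column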

Adding to Existing Data Frame

  • Add new columns with frame["new col"] = something or .insert()
  • Add new rows with frame.append()
  • Combine data frames
    • Concat: Combine several data frames along an axis
    • Merge: Combine data frames on basis of common columns; database-style
    • (Join)
    • See user guide on merging
In [43]:
df_demo.head(3)
Out[43]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
In [44]:
df_demo["F"] = df_demo["C"] - df_demo["A"]
df_demo.head(3)
Out[44]:
A B C D E F
0 1.2 2018-02-26 -2.718282 This Same -3.918282
1 1.2 2018-02-26 1.718282 column Same 0.518282
2 1.2 2018-02-26 -1.304068 has Same -2.504068
In [45]:
df_demo.insert(df_demo.shape[1], "G", df_demo["C"] ** 2)
In [46]:
df_demo.tail(3)
Out[46]:
A B C D E F G
2 1.2 2018-02-26 -1.304068 has Same -2.504068 1.700594
3 1.2 2018-02-26 0.986231 entries Same -0.213769 0.972652
4 1.2 2018-02-26 -0.718282 entries Same -1.918282 0.515929
In [47]:
df_demo.append(
    {"A": 1.3, "B": pd.Timestamp("2018-02-27"), "C": -0.777, "D": "has it?", "E": "Same", "F": 23},
    ignore_index=True
)
Out[47]:
A B C D E F G
0 1.2 2018-02-26 -2.718282 This Same -3.918282 7.389056
1 1.2 2018-02-26 1.718282 column Same 0.518282 2.952492
2 1.2 2018-02-26 -1.304068 has Same -2.504068 1.700594
3 1.2 2018-02-26 0.986231 entries Same -0.213769 0.972652
4 1.2 2018-02-26 -0.718282 entries Same -1.918282 0.515929
5 1.3 2018-02-27 -0.777000 has it? Same 23.000000 NaN

Combining Frames

  • First, create some simpler data frames to show .concat() and .merge()
In [48]:
df_1 = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
df_1
Out[48]:
Key Value
0 First 1
1 Second 1
In [49]:
df_2 = pd.DataFrame({"Key": ["First", "Second"], "Value": [2, 2]})
df_2
Out[49]:
Key Value
0 First 2
1 Second 2
  • Concatenate a list of data frames vertically (axis=0)
In [50]:
pd.concat([df_1, df_2])
Out[50]:
Key Value
0 First 1
1 Second 1
0 First 2
1 Second 2
  • Same, but re-index
In [51]:
pd.concat([df_1, df_2], ignore_index=True)
Out[51]:
Key Value
0 First 1
1 Second 1
2 First 2
3 Second 2
  • Concat, but horizontally
In [52]:
pd.concat([df_1, df_2], axis=1)
Out[52]:
Key Value Key Value
0 First 1 First 2
1 Second 1 Second 2
  • Merge on common column
In [53]:
pd.merge(df_1, df_2, on="Key")
Out[53]:
Key Value_x Value_y
0 First 1 2
1 Second 1 2
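.merge() also supports database-style join types via how; a sketch (the suffixes are illustrative):

pd.merge(df_1, df_2, on="Key", how="outer", suffixes=("_left", "_right"))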

Task 3

  • Add a column to the Nest data frame called Virtual Processes which is the total number of threads across all nodes (i.e. the product of nodes, tasks per node, and threads per task)
  • Remember to tell me when you're done: pollev.com/aherten538
In [54]:
df["Virtual Processes"] = df["Nodes"] * df["Tasks/Node"] * df["Threads/Task"]
df.head()
Out[54]:
id Nodes Tasks/Node Threads/Task Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s ... Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay Virtual Processes
0 5 1 2 4 420.42 10 True 0.29 88.12 88.18 ... 17.26 311.52 46560664.0 825499 7.48 112500 1265738500 1.5 1.5 8
1 5 1 4 4 200.84 10 True 0.15 46.03 46.34 ... 7.87 142.97 46903088.0 802865 7.03 112500 1265738500 1.5 1.5 16
2 5 1 2 8 202.15 10 True 0.28 47.98 48.48 ... 7.95 142.81 47699384.0 802865 7.03 112500 1265738500 1.5 1.5 16
3 5 1 4 8 89.57 10 True 0.15 20.41 23.21 ... 3.19 60.31 46813040.0 821491 7.23 112500 1265738500 1.5 1.5 32
4 5 2 2 4 164.16 10 True 0.20 40.03 41.09 ... 6.08 114.88 46937216.0 802865 7.03 112500 1265738500 1.5 1.5 16

5 rows × 22 columns

In [55]:
df.columns
Out[55]:
Index(['id', 'Nodes', 'Tasks/Node', 'Threads/Task', 'Runtime Program / s',
       'Scale', 'Plastic', 'Avg. Neuron Build Time / s',
       'Min. Edge Build Time / s', 'Max. Edge Build Time / s',
       'Min. Init. Time / s', 'Max. Init. Time / s', 'Presim. Time / s',
       'Sim. Time / s', 'Virt. Memory (Sum) / kB', 'Local Spike Counter (Sum)',
       'Average Rate (Sum)', 'Number of Neurons', 'Number of Connections',
       'Min. Delay', 'Max. Delay', 'Virtual Processes'],
      dtype='object')

Aside: Plotting without Pandas

Matplotlib 101

  • Matplotlib: de-facto standard for plotting in Python
  • Main interface: pyplot; provides a MATLAB-like interface
  • Better: Use the object-oriented API with Figure and Axes
  • Great integration into Jupyter Notebooks
  • Since version 3: Python 3 only
  • https://matplotlib.org/
In [56]:
import matplotlib.pyplot as plt
%matplotlib inline
In [57]:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
In [58]:
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Use like this')
ax.set_xlabel("Numbers again");
ax.set_ylabel(r"$\sqrt{x}$");  # raw string avoids the invalid \s escape
  • Plot multiple lines into one canvas
  • Call ax.plot() multiple times
In [59]:
y2 = y/np.exp(y*1.5)
In [60]:
fig, ax = plt.subplots()
ax.plot(x, y, label="y")
ax.plot(x, y2, label="y2")
ax.legend()
ax.set_title("This plot makes no sense");

Task 4

  • Sort the data frame by the virtual processes
  • Plot "Presim. Time / s" and "Sim. Time / s" of our data frame df as a function of the virtual processes
  • Use a dashed, red line for "Presim. Time / s", a blue line for "Sim. Time / s" (see API description)
  • Don't forget to label your axes and to add a legend
  • Submit when you're done: pollev.com/aherten538
In [61]:
df.sort_values(["Virtual Processes", "Nodes", "Tasks/Node", "Threads/Task"], inplace=True)
In [62]:
fig, ax = plt.subplots()
ax.plot(df["Virtual Processes"], df["Presim. Time / s"], linestyle="dashed", color="red", label="Presim. Time / s")
ax.plot(df["Virtual Processes"], df["Sim. Time / s"], "-b", label="Sim. Time / s")
ax.set_xlabel("Virtual Processes")
ax.set_ylabel("Time / s")
ax.legend();

Plotting with Pandas

  • Each data frame has a .plot() function (see API)
  • Plots with Matplotlib
  • Important API options:
    • kind: line (default), bar[h], hist, box, kde, scatter, hexbin
    • subplots: Make a sub-plot for each column (good together with sharex, sharey)
    • figsize
    • grid: Add a grid to plot (use Matplotlib options)
    • style: Line style per column (accepts list or dict)
    • logx, logy, loglog: Logarithmic plots
    • xticks, yticks: Use values for ticks
    • xlim, ylim: Limits of axes
    • yerr, xerr: Add uncertainty to data points
    • stacked: Stack a bar plot
    • secondary_y: Use a secondary y axis for this plot
    • Labeling
      • title: Add title to plot (Use a list of strings if subplots=True)
      • legend: Add a legend
      • table: If true, add table of data under plot
    • **kwds: Every keyword not parsed by Pandas is passed through to Matplotlib's plotting methods
  • Either slice and plot…
In [63]:
df_demo["C"].plot(figsize=(10, 2));
  • … or plot and select
In [64]:
df_demo.plot(y="C", figsize=(10, 2));
  • I prefer slicing first, as it allows for further operations on the sliced data frame
In [65]:
df_demo["C"].plot(kind="bar");
  • There are pseudo-sub-functions for each of the plot kinds
  • I prefer to just call .plot(kind="smthng")
In [66]:
df_demo["C"].plot.bar();
In [67]:
df_demo["C"].plot(kind="bar", legend=True, figsize=(12, 4), ylim=(-1, 3), title="This is a C plot");

Task 5

Use the NEST data frame df to:

  1. Make the virtual processes the index of the data frame (.set_index())
  2. Plot "Presim. Time / s" and "Sim. Time / s" individually
  3. Plot them onto one common canvas!
  4. Make them have the same line colors and styles as before
  5. Add a legend, add missing labels

  6. Done? Tell me! pollev.com/aherten538

In [68]:
df.set_index("Virtual Processes", inplace=True)
In [69]:
df["Presim. Time / s"].plot(figsize=(10, 3));
In [70]:
df["Sim. Time / s"].plot(figsize=(10, 3));
In [71]:
df["Presim. Time / s"].plot();
df["Sim. Time / s"].plot();
In [72]:
ax = df[["Presim. Time / s", "Sim. Time / s"]].plot();
ax.set_ylabel("Time / s");

More Plotting with Pandas

Our first proper Pandas plot

In [73]:
df[["Presim. Time / s", "Sim. Time / s"]].plot();
  • That's why I think Pandas is great!
  • It has great defaults to quickly plot data
  • Plotting functionality is very versatile
  • Before plotting, data can be massaged within data frames, if needed (see the sketch below)
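For example, a sketch of such massaging: normalize both timing columns to their respective maximum before plotting.

timings = df[["Presim. Time / s", "Sim. Time / s"]]
(timings / timings.max()).plot();  # per-column normalization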

More Plotting with Pandas

Some versatility

In [74]:
df_demo[["A", "C", "F"]].plot(kind="bar", stacked=True);
In [75]:
df_demo[df_demo["F"] < 0][["A", "C", "F"]].plot(kind="bar", stacked=True);
In [76]:
df_demo[df_demo["F"] < 0][["A", "C", "F"]]\
    .plot(kind="barh", subplots=True, sharex=True, title="Subplots", figsize=(12, 4));
In [77]:
df_demo[df_demo["F"] < 0][["A", "F"]]\
    .plot(
        style=["-*r", "--ob"], 
        secondary_y="A", 
        figsize=(12, 6),
        table=True
    );
In [78]:
df_demo[df_demo["F"] < 0][["A", "F"]]\
    .plot(
        style=["-*r", "--ob"], 
        secondary_y="A", 
        figsize=(12, 6),
        yerr={
            "A": df_demo[df_demo["F"] < 0]["C"], 
            "F": 0.2
        }, 
        capsize=4,
        title="Bug: style is ignored with yerr",
        marker="P"
    );  

Combine Pandas with Matplotlib

  • Pandas shortcuts are very handy
  • But sometimes, one needs to access underlying Matplotlib functionality
  • No problemo!
  • Option 1: Pandas plotting always returns the Axes object
    • Use it to manipulate the canvas
    • Get the underlying figure with ax.get_figure() (e.g. for fig.savefig())
  • Option 2: Create figure and axes with Matplotlib, use when drawing
    • .plot(): Use ax option

Option 1: Pandas Returns Axis

In [79]:
ax = df_demo["C"].plot(figsize=(10, 4))
ax.set_title("Hello there!");
fig = ax.get_figure()
fig.suptitle("This title is super!");

Option 2: Draw on Matplotlib Axes

In [80]:
fig, ax = plt.subplots(figsize=(10, 4))
df_demo["C"].plot(ax=ax)
ax.set_title("Hello there!");
fig.suptitle("This title is super!");
  • We can also get fancy!
In [81]:
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 4))
for ax, column, color in zip([ax1, ax2], ["C", "F"], ["blue", "#b2e123"]):
    df_demo[column].plot(ax=ax, legend=True, color=color)

Aside: Seaborn

  • Python package on top of Matplotlib
  • Powerful API shortcuts for plotting of statistical data
  • Manipulate color palettes
  • Works well together with Pandas
  • Also: New, good-looking defaults for Matplotlib (IMHO)
  • https://seaborn.pydata.org/
In [82]:
import seaborn as sns
sns.set()
In [83]:
df_demo[["A", "C"]].plot();

Seaborn Color Palette Example

In [84]:
sns.palplot(sns.color_palette())
In [85]:
sns.palplot(sns.color_palette("hls", 10))
In [86]:
sns.palplot(sns.color_palette("hsv", 20))
In [87]:
sns.palplot(sns.color_palette("Paired", 10))
In [129]:
sns.palplot(sns.color_palette("cubehelix", 8))
In [131]:
sns.palplot(sns.color_palette("colorblind", 10))
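To make one of these palettes the default for all subsequent plots, a sketch:

sns.set_palette("colorblind")  # applies to Matplotlib (and thus Pandas) plots from here on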

Seaborn Plot Examples

  • Most of the time, I use a regression plot from Seaborn
In [89]:
with sns.color_palette("hls", 2):
    sns.regplot(x="C", y="F", data=df_demo);
    sns.regplot(x="C", y="G", data=df_demo);
  • A joint plot combines two plots relating to distribution of values into one
  • Very handy for showing a fuller picture of two-dimensionally scattered variables
In [90]:
x, y = np.random.multivariate_normal([0, 0], [[1, -.5], [-.5, 1]], size=300).T
In [91]:
sns.jointplot(x=x, y=y, kind="reg");

Task 6

  • To your df NEST data frame, add a column with the unaccounted time (Unaccounted Time / s), which is the program runtime minus average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time.
    (I know this is technically not super correct, but it will do for our example.)
  • Plot a stacked bar plot of all these columns (except for program runtime) over the virtual processes
  • Remember: pollev.com/aherten538
In [92]:
cols = [
    'Avg. Neuron Build Time / s', 
    'Min. Edge Build Time / s', 
    'Min. Init. Time / s', 
    'Presim. Time / s', 
    'Sim. Time / s'
]
df["Unaccounted Time / s"] = df['Runtime Program / s']
for entry in cols:
    df["Unaccounted Time / s"] = df["Unaccounted Time / s"] - df[entry]
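The loop can be condensed into one vectorized line; a sketch:

df["Unaccounted Time / s"] = df["Runtime Program / s"] - df[cols].sum(axis=1)  # subtract the row-wise sum of all listed columns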
In [93]:
df[["Runtime Program / s", "Unaccounted Time / s", *cols]].head(2)
Out[93]:
Runtime Program / s Unaccounted Time / s Avg. Neuron Build Time / s Min. Edge Build Time / s Min. Init. Time / s Presim. Time / s Sim. Time / s
Virtual Processes
8 420.42 2.09 0.29 88.12 1.14 17.26 311.52
16 202.15 2.43 0.28 47.98 0.70 7.95 142.81
In [94]:
df[["Unaccounted Time / s", *cols]].plot(kind="bar", stacked=True, figsize=(12, 4));
  • Make it relative to the total program run time
  • Slight complication: our virtual-process index values are not unique; we need a new, unique index
  • Let's use a multi index!
In [95]:
df_multind = df.set_index(["Nodes", "Tasks/Node", "Threads/Task"])
df_multind.head()
Out[95]:
id Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s Min. Init. Time / s Max. Init. Time / s Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay Unaccounted Time / s
Nodes Tasks/Node Threads/Task
1 2 4 5 420.42 10 True 0.29 88.12 88.18 1.14 1.20 17.26 311.52 46560664.0 825499 7.48 112500 1265738500 1.5 1.5 2.09
8 5 202.15 10 True 0.28 47.98 48.48 0.70 1.20 7.95 142.81 47699384.0 802865 7.03 112500 1265738500 1.5 1.5 2.43
4 4 5 200.84 10 True 0.15 46.03 46.34 0.70 1.01 7.87 142.97 46903088.0 802865 7.03 112500 1265738500 1.5 1.5 3.12
2 2 4 5 164.16 10 True 0.20 40.03 41.09 0.52 1.58 6.08 114.88 46937216.0 802865 7.03 112500 1265738500 1.5 1.5 2.45
1 2 12 6 141.70 10 True 0.30 32.93 33.26 0.62 0.95 5.41 100.16 50148824.0 813743 7.27 112500 1265738500 1.5 1.5 2.28
In [96]:
df_multind[["Unaccounted Time / s", *cols]]\
    .divide(df_multind["Runtime Program / s"], axis="index")\
    .plot(kind="bar", stacked=True, figsize=(14, 6), title="Relative Time Distribution");

Next Level: Hierarchical Data

  • A MultiIndex is only the first step
  • More powerful:
    • Grouping: .groupby() (API)
    • Pivoting: .pivot_table() (API); also .pivot() (API)
In [97]:
df.groupby("Nodes").mean()
Out[97]:
id Tasks/Node Threads/Task Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s Min. Init. Time / s ... Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay Unaccounted Time / s
Nodes
1 5.333333 3.0 8.0 185.023333 10.0 True 0.220000 42.040000 42.838333 0.583333 ... 7.226667 132.061667 4.806585e+07 816298.000000 7.215000 112500.0 1.265738e+09 1.5 1.5 2.891667
2 5.333333 3.0 8.0 73.601667 10.0 True 0.168333 19.628333 20.313333 0.191667 ... 2.725000 48.901667 4.975288e+07 818151.000000 7.210000 112500.0 1.265738e+09 1.5 1.5 1.986667
3 5.333333 3.0 8.0 43.990000 10.0 True 0.138333 12.810000 13.305000 0.135000 ... 1.426667 27.735000 5.511165e+07 820465.666667 7.253333 112500.0 1.265738e+09 1.5 1.5 1.745000
4 5.333333 3.0 8.0 31.225000 10.0 True 0.116667 9.325000 9.740000 0.088333 ... 1.066667 19.353333 5.325783e+07 819558.166667 7.288333 112500.0 1.265738e+09 1.5 1.5 1.275000
5 5.333333 3.0 8.0 24.896667 10.0 True 0.140000 7.468333 7.790000 0.070000 ... 0.771667 14.950000 6.075634e+07 815307.666667 7.225000 112500.0 1.265738e+09 1.5 1.5 1.496667
6 5.333333 3.0 8.0 20.215000 10.0 True 0.106667 6.165000 6.406667 0.051667 ... 0.630000 12.271667 6.060652e+07 815456.333333 7.201667 112500.0 1.265738e+09 1.5 1.5 0.990000

6 rows × 21 columns
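.groupby() can also aggregate selected columns with several statistics at once; a sketch:

df.groupby("Nodes")["Sim. Time / s"].agg(["mean", "std", "min", "max"])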

Pivoting

  • Combine categorically-similar columns
  • Creates hierarchical index
  • Respected during plotting!
  • A pivot table has three layers; if confused, think about these questions
    • index: »What's on the x axis?«
    • values: »What value do I want to plot?«
    • columns: »What categories do I want [to be in the legend]?«
  • All can be populated from base data frame
  • Values might be aggregated, if needed (default aggregation: mean)
In [98]:
df_demo["H"] = [(-1)**n for n in range(5)]
In [99]:
df_pivot = df_demo.pivot_table(
    index="F",
    values="G",
    columns="H"
)
df_pivot
Out[99]:
H -1 1
F
-3.918282 NaN 7.389056
-2.504068 NaN 1.700594
-1.918282 NaN 0.515929
-0.213769 0.972652 NaN
0.518282 2.952492 NaN
In [100]:
df_pivot.plot();

Task 7

  • Create a pivot table based on the NEST df data frame
  • Let the x axis show the number of nodes; display the values of the simulation time "Sim. Time / s" for the tasks-per-node and threads-per-task configurations
  • Please plot a bar plot
  • Done? pollev.com/aherten538
In [101]:
df.pivot_table(
    index=["Nodes"],
    columns=["Tasks/Node", "Threads/Task"],
    values="Sim. Time / s",
).plot(kind="bar", figsize=(12, 4));

  • Bonus task
    • Same pivot table as before (that is, x with nodes, and columns for Tasks/Node and Threads/Task)
    • But now, use Sim. Time / s and Presim. Time / s as values to show
    • Show them as a stack of those two values inside the pivot table (one possible sketch follows below)
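One possible sketch, untested against the actual data; note that stacked=True stacks all resulting columns into one bar per node count, so both values and all configurations end up in the same stack:

df.pivot_table(
    index=["Nodes"],
    columns=["Tasks/Node", "Threads/Task"],
    values=["Sim. Time / s", "Presim. Time / s"],
).plot(kind="bar", stacked=True, figsize=(12, 4));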

The End

  • Pandas works on data frames
  • Slice frames to your likings
  • Plot frames
    • Together with Matplotlib, Seaborn, others
  • Pivot tables are next level greatness
  • Remember: Pandas as early as possible!
  • Thanks for being here! 😍

Next slide: Further reading

Poll Results