Data Analysis and Plotting in Python with Pandas
Carolin Penke, Jülich Supercomputing Centre, Forschungszentrum Jülich, 23 October 2024 Based on material by Andreas Herten
Version: Slides
My Motivation
- I like Python
- I like plotting data
- I like sharing
- I think Pandas is awesome and you should use it too
- …but I'm no Python expert!
Motto: »Pandas as early as possible!«
Course Setup
- 3½ hours, including break around 10:30
- Alternating between lecture and hands-on
- Please give the status of the hands-ons via 👍 as BigBlueButton status
- The TAs and I in the room can help with issues, either in the public chat or in a 1:1 chat
- Please now open Jupyter Notebook of this session: https://go.fzj.de/jsc-pd
- Give thumbs up! 👍
About Pandas

- Python package
- For data analysis and manipulation
- With data structures (multi-dimensional tables, time series) and operations on them
- Name from »Panel Data« (multi-dimensional time series in economics)
- Since 2008
- Now at Pandas 2.2.3
- https://pandas.pydata.org/
- Install via PyPI:
pip install pandas
- Cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Pandas Cohabitation
- Pandas works great together with other established Python tools
- Jupyter Notebooks
- Plotting with matplotlib
- Numerical analysis with numpy
- Modelling with statsmodels, scikit-learn
- Nicer plots with seaborn, altair, plotly
- Performance enhancement with Cython, Numba, …
- Tools building on Pandas: cuDF (GPU-accelerated DataFrames in RAPIDS, now a drop-in replacement), pyarrow (Apache Arrow bindings in Python), …
- Faster alternatives with similar syntax: Polars, …
First Steps
import pandas as pd
pd.__version__
'2.1.4'
%pdoc pd
DataFrames
It's all about DataFrames

- Data containers of Pandas:
  - Linear: Series
  - Multi-dimensional: DataFrame
- A Series is only a special (1D) case of a DataFrame
- → We use DataFrames as the more general case here
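- Aside (not from the original slides): a minimal sketch of the Series/DataFrame relationship; the variable names are illustrative
import pandas as pd
s = pd.Series([41, 56, 38], name="Age")  # a Series: 1D, labelled data
df = s.to_frame()                        # the same data as a one-column DataFrame
type(df["Age"])                          # selecting a single column returns a Series again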
DataFrames
Construction
- To show features of DataFrame, let's construct one and show by example!
- Many construction possibilities:
  - From lists, dictionaries, numpy objects
  - From CSV, HDF5, JSON, Excel, HTML, fixed-width files
  - From pickled Pandas data
  - From clipboard
  - From Feather, Parquet, SAS, SQL, Google BigQuery, STATA
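- A small sketch of two of these construction routes (the example data is invented; more examples follow on the next slides):
import numpy as np
import pandas as pd
pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])  # from a numpy object
pd.DataFrame([{"a": 1, "b": 2}, {"a": 3}])                    # from a list of dicts; missing keys become NaN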
DataFrames
Examples, finally
ages = [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]
pd.DataFrame(ages)
 | 0 |
---|---|
0 | 41 |
1 | 56 |
2 | 56 |
3 | 57 |
4 | 39 |
5 | 59 |
6 | 43 |
7 | 56 |
8 | 38 |
9 | 60 |
df_ages = pd.DataFrame(ages)
df_ages.head(3)
 | 0 |
---|---|
0 | 41 |
1 | 56 |
2 | 56 |
- Let's add names to ages; put everything into a dict()
data = {
"Name": ["Liu", "Rowland", "Rivers", "Waters", "Rice", "Fields", "Kerr", "Romero", "Davis", "Hall"],
"Age": ages
}
print(data)
{'Name': ['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr', 'Romero', 'Davis', 'Hall'], 'Age': [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]}
df_sample = pd.DataFrame(data)
df_sample.head(4)
 | Name | Age |
---|---|---|
0 | Liu | 41 |
1 | Rowland | 56 |
2 | Rivers | 56 |
3 | Waters | 57 |
- Automatically creates columns from dictionary
- Two columns now; one for names, one for ages
df_sample.columns
Index(['Name', 'Age'], dtype='object')
- First column is the index: DataFrames always have an index; auto-generated or custom
df_sample.index
RangeIndex(start=0, stop=10, step=1)
- Make Name the index with .set_index(); inplace=True will modify the parent frame (I don't like it)
df_sample.set_index("Name", inplace=True)
df_sample
Name | Age |
---|---|
Liu | 41 |
Rowland | 56 |
Rivers | 56 |
Waters | 57 |
Rice | 39 |
Fields | 59 |
Kerr | 43 |
Romero | 56 |
Davis | 38 |
Hall | 60 |
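- Since inplace=True is discouraged above, here is the equivalent assignment-based form (a sketch reusing the data dict from before):
df_sample = pd.DataFrame(data).set_index("Name")  # returns a new frame instead of mutating in place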
- Some more operations
df_sample.describe()
 | Age |
---|---|
count | 10.000000 |
mean | 50.500000 |
std | 9.009255 |
min | 38.000000 |
25% | 41.500000 |
50% | 56.000000 |
75% | 56.750000 |
max | 60.000000 |
df_sample.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, Liu to Hall
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 10 non-null int64
dtypes: int64(1)
memory usage: 160.0+ bytes
df_sample.T
Name | Liu | Rowland | Rivers | Waters | Rice | Fields | Kerr | Romero | Davis | Hall |
---|---|---|---|---|---|---|---|---|---|---|
Age | 41 | 56 | 56 | 57 | 39 | 59 | 43 | 56 | 38 | 60 |
df_sample.T.columns
Index(['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr',
'Romero', 'Davis', 'Hall'],
dtype='object', name='Name')
- Also: Arithmetic operations
df_sample.multiply(2).head(3)
Name | Age |
---|---|
Liu | 82 |
Rowland | 112 |
Rivers | 112 |
df_sample.reset_index().multiply(2).head(3)
 | Name | Age |
---|---|---|
0 | LiuLiu | 82 |
1 | RowlandRowland | 112 |
2 | RiversRivers | 112 |
(df_sample / 2).head(3)
Name | Age |
---|---|
Liu | 20.5 |
Rowland | 28.0 |
Rivers | 28.0 |
(df_sample * df_sample).head(3)
Name | Age |
---|---|
Liu | 1681 |
Rowland | 3136 |
Rivers | 3136 |
def mysquare(number: float) -> float:
return number*number
df_sample.apply(mysquare).head()
# or: df_sample.apply(lambda x: x*x).head()
Name | Age |
---|---|
Liu | 1681 |
Rowland | 3136 |
Rivers | 3136 |
Waters | 3249 |
Rice | 1521 |
import numpy as np
df_sample.apply(np.square).head()
Name | Age |
---|---|
Liu | 1681 |
Rowland | 3136 |
Rivers | 3136 |
Waters | 3249 |
Rice | 1521 |
- Logical operations are allowed as well
df_sample > 40
Name | Age |
---|---|
Liu | True |
Rowland | True |
Rivers | True |
Waters | True |
Rice | False |
Fields | True |
Kerr | True |
Romero | True |
Davis | False |
Hall | True |
df_sample.apply(mysquare).head() == df_sample.apply(lambda x: x*x).head()
Name | Age |
---|---|
Liu | True |
Rowland | True |
Rivers | True |
Waters | True |
Rice | True |
happy_dinos = {
"Dinosaur Name": [],
"Favourite Prime": [],
"Favourite Color": []
}
#df_dinos =
happy_dinos = {
"Dinosaur Name": ["Aegyptosaurus", "Tyrannosaurus", "Panoplosaurus", "Isisaurus", "Triceratops", "Velociraptor"],
"Favourite Prime": ["4", "8", "15", "16", "23", "42"],
"Favourite Color": ["blue", "white", "blue", "purple", "violet", "gray"]
}
df_dinos = pd.DataFrame(happy_dinos).set_index("Dinosaur Name")
df_dinos.T
Dinosaur Name | Aegyptosaurus | Tyrannosaurus | Panoplosaurus | Isisaurus | Triceratops | Velociraptor |
---|---|---|---|---|---|---|
Favourite Prime | 4 | 8 | 15 | 16 | 23 | 42 |
Favourite Color | blue | white | blue | purple | violet | gray |
More DataFrame examples
df_demo = pd.DataFrame({
"A": 1.2,
"B": pd.Timestamp('20180226'),
"C": [(-1)**i * np.sqrt(i) + np.e * (-1)**(i-1) for i in range(5)],
"D": pd.Categorical(["This", "column", "has", "entries", "entries"]),
"E": "Same"
})
df_demo
 | A | B | C | D | E |
---|---|---|---|---|---|
0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
df_demo.sort_values("C")
 | A | B | C | D | E |
---|---|---|---|---|---|
0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
df_demo.round(2).tail(2)
 | A | B | C | D | E |
---|---|---|---|---|---|
3 | 1.2 | 2018-02-26 | 0.99 | entries | Same |
4 | 1.2 | 2018-02-26 | -0.72 | entries | Same |
df_demo.round(2)[["A", "C"]].sum()
A 6.00
C -2.03
dtype: float64
print(df_demo.round(2).to_latex())
\begin{tabular}{lrlrll}
\toprule
& A & B & C & D & E \\
\midrule
0 & 1.200000 & 2018-02-26 00:00:00 & -2.720000 & This & Same \\
1 & 1.200000 & 2018-02-26 00:00:00 & 1.720000 & column & Same \\
2 & 1.200000 & 2018-02-26 00:00:00 & -1.300000 & has & Same \\
3 & 1.200000 & 2018-02-26 00:00:00 & 0.990000 & entries & Same \\
4 & 1.200000 & 2018-02-26 00:00:00 & -0.720000 & entries & Same \\
\bottomrule
\end{tabular}
Reading External Data
(Links to documentation)
Example:
{
"Character": ["Sawyer", "…", "Walt"],
"Actor": ["Josh Holloway", "…", "Malcolm David Kelley"],
"Main Cast": [true, "…", false]
}
pd.read_json("data-lost.json").set_index("Character").sort_index()
Character | Actor | Main Cast |
---|---|---|
Hurley | Jorge Garcia | True |
Jack | Matthew Fox | True |
Kate | Evangeline Lilly | True |
Locke | Terry O'Quinn | True |
Sawyer | Josh Holloway | True |
Walt | Malcolm David Kelley | False |
Task 2
- Read in data-nest.csv to a DataFrame; call it df (the data was produced with JUBE)
- Get to know it and play a bit with it
- Tell me when you're done with status icon in BigBlueButton: 👍
!head data-nest.csv
id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay
5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5
5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5
5,1,2,8,202.15,10,true,0.28,47.98,48.48,0.70,1.20,7.95,142.81,47699384.00,802865,7.03,112500,1265738500,1.5,1.5
5,1,4,8,89.57,10,true,0.15,20.41,23.21,0.23,3.04,3.19,60.31,46813040.00,821491,7.23,112500,1265738500,1.5,1.5
5,2,2,4,164.16,10,true,0.20,40.03,41.09,0.52,1.58,6.08,114.88,46937216.00,802865,7.03,112500,1265738500,1.5,1.5
5,2,4,4,77.68,10,true,0.13,20.93,21.22,0.16,0.46,3.12,52.05,47362064.00,821491,7.23,112500,1265738500,1.5,1.5
5,2,2,8,79.60,10,true,0.20,21.63,21.91,0.19,0.47,2.98,53.12,46847168.00,821491,7.23,112500,1265738500,1.5,1.5
5,2,4,8,37.20,10,true,0.13,10.08,11.60,0.10,1.63,1.24,23.29,47065232.00,818198,7.33,112500,1265738500,1.5,1.5
5,3,2,4,96.51,10,true,0.15,26.54,27.41,0.36,1.22,3.33,64.28,52256880.00,813743,7.27,112500,1265738500,1.5,1.5
df = pd.read_csv("data-nest.csv")
df.head()
 | id | Nodes | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | ... | Max. Init. Time / s | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 1 | 2 | 4 | 420.42 | 10 | True | 0.29 | 88.12 | 88.18 | ... | 1.20 | 17.26 | 311.52 | 46560664.0 | 825499 | 7.48 | 112500 | 1265738500 | 1.5 | 1.5 |
1 | 5 | 1 | 4 | 4 | 200.84 | 10 | True | 0.15 | 46.03 | 46.34 | ... | 1.01 | 7.87 | 142.97 | 46903088.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 |
2 | 5 | 1 | 2 | 8 | 202.15 | 10 | True | 0.28 | 47.98 | 48.48 | ... | 1.20 | 7.95 | 142.81 | 47699384.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 |
3 | 5 | 1 | 4 | 8 | 89.57 | 10 | True | 0.15 | 20.41 | 23.21 | ... | 3.04 | 3.19 | 60.31 | 46813040.0 | 821491 | 7.23 | 112500 | 1265738500 | 1.5 | 1.5 |
4 | 5 | 2 | 2 | 4 | 164.16 | 10 | True | 0.20 | 40.03 | 41.09 | ... | 1.58 | 6.08 | 114.88 | 46937216.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 |
Read CSV Options
- See also full API documentation
- Important parameters:
  - sep: Set separator (for example ; instead of ,)
  - header: Specify info about headers for columns; able to use multi-index for columns!
  - names: Alternative to header – provide your own column titles
  - usecols: Don't read the whole set of columns, but only these; works with any list (e.g. range(0, 20, 2))…
  - skiprows: Don't read in these rows
  - na_values: What string(s) to recognize as N/A values (which will be ignored during operations on the data frame)
  - parse_dates: Try to parse dates in CSV; different behaviours depending on the provided data structure; optionally used together with date_parser
  - compression: Treat input file as compressed file ("infer", "gzip", "zip", …)
  - decimal: Decimal point divider – for German data…
pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)
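- A sketch combining several of these parameters; the file name, column names, and N/A markers are invented for illustration:
df_de = pd.read_csv(
    "messwerte.csv",            # hypothetical German-style CSV
    sep=";",                    # semicolon-separated columns
    decimal=",",                # comma as decimal point
    usecols=["Datum", "Wert"],  # read only these two columns
    parse_dates=["Datum"],      # try to parse the date column
    na_values=["n/a", "-"],     # additional strings treated as N/A
)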
Slicing of Data Frames
- Slicing: Select a sub-range / sub-set of entire data frame
- Pandas documentation: Detailed documentation, short documentation
Quick Slices
- Use square-bracket operators to slice the data frame quickly: []
  - Use a column name to select a column
  - Use numerical values to select rows
- Example: Select only column C from df_demo
df_demo.head(3)
 | A | B | C | D | E |
---|---|---|---|---|---|
0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo['C']
0 -2.718282
1 1.718282
2 -1.304068
3 0.986231
4 -0.718282
Name: C, dtype: float64
- Instead of the column name in quotes and square brackets: use the name of the column directly as an attribute
df_demo.C
0 -2.718282
1 1.718282
2 -1.304068
3 0.986231
4 -0.718282
Name: C, dtype: float64
- I'm not a fan of this, because no spaces are allowed
  (And »Pandas as early as possible« means labelling columns well and adding spaces)
- Select more than one column by providing a list to the slice operator []
- Example: Select the list of columns A and C, ['A', 'C'], from df_demo
my_slice = ['A', 'C']
df_demo[my_slice]
 | A | C |
---|---|---|
0 | 1.2 | -2.718282 |
1 | 1.2 | 1.718282 |
2 | 1.2 | -1.304068 |
3 | 1.2 | 0.986231 |
4 | 1.2 | -0.718282 |
- Use numerical values in brackets to slice along rows
- Use ranges just like with Python lists
df_demo[1:3]
 | A | B | C | D | E |
---|---|---|---|---|---|
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo[1:6:2]
 | A | B | C | D | E |
---|---|---|---|---|---|
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
- Attention: location might change after re-sorting!
df_demo[1:3]
 | A | B | C | D | E |
---|---|---|---|---|---|
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo.sort_values("C")[1:3]
 | A | B | C | D | E |
---|---|---|---|---|---|
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
Slicing of Data Frames
Better Slicing
- .iloc[] and .loc[]: Faster slicing interfaces with more options
df_demo.iloc[1:3]
 | A | B | C | D | E |
---|---|---|---|---|---|
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
- Also slice along columns (second argument)
df_demo.iloc[1:3, [0, 2]]
 | A | C |
---|---|---|
1 | 1.2 | 1.718282 |
2 | 1.2 | -1.304068 |
- .iloc[]: Slice by position (numerical/integer)
- .loc[]: Slice by label (named)
- See the difference with a proper index (and not the auto-generated default index from before)
df_demo_indexed = df_demo.set_index("D")
df_demo_indexed
D | A | B | C | E |
---|---|---|---|---|
This | 1.2 | 2018-02-26 | -2.718282 | Same |
column | 1.2 | 2018-02-26 | 1.718282 | Same |
has | 1.2 | 2018-02-26 | -1.304068 | Same |
entries | 1.2 | 2018-02-26 | 0.986231 | Same |
entries | 1.2 | 2018-02-26 | -0.718282 | Same |
df_demo_indexed.loc["entries"]
D | A | B | C | E |
---|---|---|---|---|
entries | 1.2 | 2018-02-26 | 0.986231 | Same |
entries | 1.2 | 2018-02-26 | -0.718282 | Same |
df_demo_indexed.loc[["has", "entries"], ["A", "C"]]
D | A | C |
---|---|---|
has | 1.2 | -1.304068 |
entries | 1.2 | 0.986231 |
entries | 1.2 | -0.718282 |
Slicing of Data Frames
Advanced Slicing: Logical Slicing
- Slice can also be array of booleans
df_demo[df_demo["C"] > 0]
 | A | B | C | D | E |
---|---|---|---|---|---|
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
df_demo["C"] > 0
0 False
1 True
2 False
3 True
4 False
Name: C, dtype: bool
df_demo[(df_demo["C"] < 0) & (df_demo["D"] == "entries")]
 | A | B | C | D | E |
---|---|---|---|---|---|
4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
Adding to Existing Data Frame
- Add new columns with frame["new col"] = something or .insert()
- Combine data frames:
  - Concat: Combine several data frames along an axis
  - Merge: Combine data frames on the basis of common columns; database-style
  - (Join)
- See the user guide on merging
df_demo.head(3)
 | A | B | C | D | E |
---|---|---|---|---|---|
0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo["F"] = df_demo["C"] - df_demo["A"]
df_demo.head(3)
 | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1.2 | 2018-02-26 | -2.718282 | This | Same | -3.918282 |
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same | 0.518282 |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same | -2.504068 |
- .insert() allows specifying the position of insertion
- .shape gives a tuple with the size of the data frame: (vertical, horizontal)
df_demo.insert(df_demo.shape[1] - 1, "E2", df_demo["C"] ** 2)
df_demo.head(3)
 | A | B | C | D | E | E2 | F |
---|---|---|---|---|---|---|---|
0 | 1.2 | 2018-02-26 | -2.718282 | This | Same | 7.389056 | -3.918282 |
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same | 2.952492 | 0.518282 |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same | 1.700594 | -2.504068 |
df_demo.tail(3)
 | A | B | C | D | E | E2 | F |
---|---|---|---|---|---|---|---|
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same | 1.700594 | -2.504068 |
3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same | 0.972652 | -0.213769 |
4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same | 0.515929 | -1.918282 |
Combining Frames
- First, create some simpler data frames to show .concat() and .merge()
df_1 = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
df_1
 | Key | Value |
---|---|---|
0 | First | 1 |
1 | Second | 1 |
df_2 = pd.DataFrame({"Key": ["First", "Second"], "Value": [2, 2]})
df_2
 | Key | Value |
---|---|---|
0 | First | 2 |
1 | Second | 2 |
- Concatenate a list of data frames vertically (axis=0)
pd.concat([df_1, df_2])
 | Key | Value |
---|---|---|
0 | First | 1 |
1 | Second | 1 |
0 | First | 2 |
1 | Second | 2 |
- Same, but re-index
pd.concat([df_1, df_2], ignore_index=True)
 | Key | Value |
---|---|---|
0 | First | 1 |
1 | Second | 1 |
2 | First | 2 |
3 | Second | 2 |
- Concat, but horizontally
pd.concat([df_1, df_2], axis=1)
 | Key | Value | Key | Value |
---|---|---|---|---|
0 | First | 1 | First | 2 |
1 | Second | 1 | Second | 2 |
- Merge on common column
pd.merge(df_1, df_2, on="Key")
 | Key | Value_x | Value_y |
---|---|---|---|
0 | First | 1 | 2 |
1 | Second | 1 | 2 |
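- pd.merge() defaults to an inner join; the how parameter selects other database-style joins ('left', 'right', 'outer', 'inner', 'cross'). A sketch with the frames from above:
pd.merge(df_1, df_2, on="Key", how="outer", suffixes=("_left", "_right"))  # keys match fully here, so outer equals inner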
- .concat() can also be used to append rows to a DataFrame:
pd.concat(
[
df_demo,
pd.DataFrame({"A": 1.3, "B": pd.Timestamp("2018-02-27"), "C": -0.777, "D": "has it?", "E": "Same", "F": 23}, index=[0])
], ignore_index=True
)
 | A | B | C | D | E | E2 | F |
---|---|---|---|---|---|---|---|
0 | 1.2 | 2018-02-26 | -2.718282 | This | Same | 7.389056 | -3.918282 |
1 | 1.2 | 2018-02-26 | 1.718282 | column | Same | 2.952492 | 0.518282 |
2 | 1.2 | 2018-02-26 | -1.304068 | has | Same | 1.700594 | -2.504068 |
3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same | 0.972652 | -0.213769 |
4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same | 0.515929 | -1.918282 |
5 | 1.3 | 2018-02-27 | -0.777000 | has it? | Same | NaN | 23.000000 |
df["Threads"] = df["Nodes"] * df["Tasks/Node"] * df["Threads/Task"]
df.head()
 | id | Nodes | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | ... | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Threads |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 1 | 2 | 4 | 420.42 | 10 | True | 0.29 | 88.12 | 88.18 | ... | 17.26 | 311.52 | 46560664.0 | 825499 | 7.48 | 112500 | 1265738500 | 1.5 | 1.5 | 8 |
1 | 5 | 1 | 4 | 4 | 200.84 | 10 | True | 0.15 | 46.03 | 46.34 | ... | 7.87 | 142.97 | 46903088.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 16 |
2 | 5 | 1 | 2 | 8 | 202.15 | 10 | True | 0.28 | 47.98 | 48.48 | ... | 7.95 | 142.81 | 47699384.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 16 |
3 | 5 | 1 | 4 | 8 | 89.57 | 10 | True | 0.15 | 20.41 | 23.21 | ... | 3.19 | 60.31 | 46813040.0 | 821491 | 7.23 | 112500 | 1265738500 | 1.5 | 1.5 | 32 |
4 | 5 | 2 | 2 | 4 | 164.16 | 10 | True | 0.20 | 40.03 | 41.09 | ... | 6.08 | 114.88 | 46937216.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 16 |
df.columns
Index(['id', 'Nodes', 'Tasks/Node', 'Threads/Task', 'Runtime Program / s',
'Scale', 'Plastic', 'Avg. Neuron Build Time / s',
'Min. Edge Build Time / s', 'Max. Edge Build Time / s',
'Min. Init. Time / s', 'Max. Init. Time / s', 'Presim. Time / s',
'Sim. Time / s', 'Virt. Memory (Sum) / kB', 'Local Spike Counter (Sum)',
'Average Rate (Sum)', 'Number of Neurons', 'Number of Connections',
'Min. Delay', 'Max. Delay', 'Threads'],
dtype='object')
Aside: Plotting without Pandas
Matplotlib 101
- Matplotlib: de-facto standard for plotting in Python
- Main interface: pyplot; provides a MATLAB-like interface
- Better: Use the object-oriented API with Figure and Axes
- Great integration into Jupyter Notebooks
- Since v. 3: Only support for Python 3
- → https://matplotlib.org/
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Use like this')
ax.set_xlabel("Numbers");
ax.set_ylabel(r"$\sqrt{x}$");  # raw string, so \s is not parsed as an (invalid) escape sequence
- Plot multiple lines into one canvas
- Call ax.plot() multiple times
y2 = y/np.exp(y*1.5)
fig, ax = plt.subplots()
ax.plot(x, y, label="y")
ax.plot(x, y2, label="y2")
ax.legend()
ax.set_title("This plot makes no sense");
- Matplotlib can also plot DataFrame data
- Because DataFrame data is only array-like data with stuff on top
fig, ax = plt.subplots()
ax.plot(df_demo.index, df_demo["C"], label="C")
ax.legend()
ax.set_title("Nope, no sense at all");
Task 4
- Sort the Nest data frame by threads
- Plot "Presim. Time / s" and "Sim. Time / s" of our data frame df as a function of threads
- Use a dashed, red line for "Presim. Time / s" and a blue line for "Sim. Time / s" (see API description)
- Don't forget to label your axes and to add a legend (1st rule of plotting)
- Tell me when you're done with status icon in BigBlueButton: 👍
df.sort_values(["Threads", "Nodes", "Tasks/Node", "Threads/Task"], inplace=True) # multi-level sort
fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(df["Threads"], df["Presim. Time / s"], linestyle="dashed", color="red", label="Presim. Time / s")
ax.plot(df["Threads"], df["Sim. Time / s"], "-b", label="Sim. Time / s")
ax.set_xlabel("Threads")
ax.set_ylabel("Time / s")
ax.legend(loc='best');
Plotting with Pandas
- Each data frame has a .plot() function (see API)
- Plots with Matplotlib
- Important API options:
  - kind: 'line' (default), 'bar[h]', 'hist', 'box', 'kde', 'scatter', 'hexbin'
  - subplots: Make a sub-plot for each column (good together with sharex, sharey)
  - figsize
  - grid: Add a grid to plot (use Matplotlib options)
  - style: Line style per column (accepts list or dict)
  - logx, logy, loglog: Logarithmic plots
  - xticks, yticks: Use values for ticks
  - xlim, ylim: Limits of axes
  - yerr, xerr: Add uncertainty to data points
  - stacked: Stack a bar plot
  - secondary_y: Use a secondary y axis for this plot
  - Labeling:
    - title: Add title to plot (use a list of strings if subplots=True)
    - legend: Add a legend
    - table: If true, add table of data under plot
  - **kwds: Non-parsed keywords passed to Matplotlib's plotting methods
- Either slice and plot…
df_demo["C"].plot(figsize=(10, 2));
- … or plot and select
df_demo.plot(y="C", figsize=(10, 2));
- I prefer slicing first:
→ Allows for further operations on the sliced data frame
df_demo["C"].plot(kind="bar");
- There are pseudo-sub-functions for each of the plot kinds
- I prefer to just call .plot(kind="smthng")
df_demo["C"].plot.bar();
df_demo["C"].plot(kind="bar", legend=True, figsize=(12, 4), ylim=(-1, 3), title="This is a C plot");
Task 5
Use the Nest data frame df to:
- Make threads the index of the data frame (.set_index())
- Plot "Presim. Time / s" and "Sim. Time / s" individually
- Plot them onto one common canvas!
- Make them have the same line colors and styles as before
- Add a legend, add missing axes labels
- Tell me when you're done with status icon in BigBlueButton: 👍
df.set_index("Threads", inplace=True)
df["Presim. Time / s"].plot(figsize=(10, 3), style="--", color="red");
df["Sim. Time / s"].plot(figsize=(10, 3), style="-b");
df["Presim. Time / s"].plot(style="--r", figsize=(10,3));
df["Sim. Time / s"].plot(style="-b", figsize=(10,3));
ax = df[["Presim. Time / s", "Sim. Time / s"]].plot(style=["--r", "-b"], figsize=(10,3));  # same styles as before: dashed red, solid blue
ax.set_ylabel("Time / s");
More Plotting with Pandas
Recap: Our first proper Pandas plot
df[["Presim. Time / s", "Sim. Time / s"]].plot(figsize=(10,3));
- That's why I think Pandas is great!
- It has great defaults to quickly plot data; basically publication-grade already
- Plotting functionality is very versatile
- Before plotting, data can be massaged within data frames, if needed
More Plotting with Pandas
Some versatility
df_demo[["A", "C", "F"]].plot(kind="bar", stacked=True, figsize=(10,3));
df_demo[df_demo["F"] < 0][["A", "C", "F"]].plot(kind="bar", stacked=True, figsize=(10,3));
df_demo[df_demo["F"] < 0][["A", "C", "F"]]\
.plot(kind="barh", subplots=True, sharex=True, title="Subplots Demo", figsize=(10, 4));
df_demo.loc[df_demo["F"] < 0, ["A", "F"]]\
.plot(
style=["-*r", "--ob"],
secondary_y="A",
figsize=(12, 6),
table=True
);
df_demo.loc[df_demo["F"] < 0, ["A", "F"]]\
.plot(
style=["-*r", "--ob"],
secondary_y="A",
figsize=(12, 6),
yerr={
"A": abs(df_demo[df_demo["F"] < 0]["C"]),
"F": 0.2
},
capsize=4,
title="Bug: style is ignored with yerr",
marker="P"
);
Combine Pandas with Matplotlib
- Pandas shortcuts are very handy
- But sometimes, one needs to access the underlying Matplotlib functionality
- No problemo!
- Option 1: Pandas always returns the axis
  - Use this to manipulate the canvas
  - Get the underlying figure with ax.get_figure() (for fig.savefig())
- Option 2: Create figure and axes with Matplotlib, use them when drawing
  - .plot(): Use the ax option
Option 1: Pandas Returns Axis
ax = df_demo["C"].plot(figsize=(10, 4))
ax.set_title("Hello There!");
fig = ax.get_figure()
fig.suptitle("This title is super (literally)!");
Option 2: Draw on Matplotlib Axes
fig, ax = plt.subplots(figsize=(10, 4))
df_demo["C"].plot(ax=ax)
ax.set_title("Hello There!");
fig.suptitle("This title is super (still, literally)!");
- We can also get fancy!
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 4))
for ax, column, color in zip([ax1, ax2], ["C", "F"], ["blue", "#b2e123"]):
df_demo[column].plot(ax=ax, legend=True, color=color)
Aside: Seaborn
- Python package on top of Matplotlib
- Powerful API shortcuts for plotting of statistical data
- Manipulate color palettes
- Works well together with Pandas
- Also: New, good-looking defaults for Matplotlib (IMHO)
- → https://seaborn.pydata.org/
import seaborn as sns
sns.set_theme() # set defaults
df_demo[["A", "C"]].plot(figsize=(10,3));
Seaborn Color Palette Example
sns.palplot(sns.color_palette())
sns.palplot(sns.color_palette("hls", 10))
sns.palplot(sns.color_palette("hsv", 20))
sns.palplot(sns.color_palette("Paired", 10))
sns.palplot(sns.color_palette("cubehelix", 8))
sns.palplot(sns.color_palette("colorblind", 10))
Seaborn Plot Examples
- Most of the time, I use a regression plot from Seaborn
with sns.color_palette("hls", 2):
    sns.regplot(x="C", y="F", data=df_demo);
    sns.regplot(x="C", y="E2", data=df_demo);
- A joint plot combines a bivariate plot with the marginal distributions of both variables in one figure
- Very handy for showing a fuller picture of two-dimensionally scattered variables
x, y = np.random.multivariate_normal([0, 0], [[1, -.5], [-.5, 1]], size=300).T
sns.jointplot(x=x, y=y, kind="reg");
Task 6
- To your df Nest data frame, add a column with the unaccounted time (Unaccounted Time / s), which is the program runtime minus average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time. (I know this is technically not super correct, but it will do for our example.)
- Plot a stacked bar plot of all these columns (except for program runtime) over the threads
- Tell me when you're done with status icon in BigBlueButton: 👍
cols = [
'Avg. Neuron Build Time / s',
'Min. Edge Build Time / s',
'Min. Init. Time / s',
'Presim. Time / s',
'Sim. Time / s'
]
df["Unaccounted Time / s"] = df['Runtime Program / s']
for entry in cols:
df["Unaccounted Time / s"] = df["Unaccounted Time / s"] - df[entry]
df[["Runtime Program / s", "Unaccounted Time / s", *cols]].head(2)
Threads | Runtime Program / s | Unaccounted Time / s | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Min. Init. Time / s | Presim. Time / s | Sim. Time / s |
---|---|---|---|---|---|---|---|
8 | 420.42 | 2.09 | 0.29 | 88.12 | 1.14 | 17.26 | 311.52 |
16 | 202.15 | 2.43 | 0.28 | 47.98 | 0.70 | 7.95 | 142.81 |
df[["Unaccounted Time / s", *cols]].plot(kind="bar", stacked=True, figsize=(12, 4));
- Make it relative to the total program run time
- Slight complication: Our threads as indexes are not unique; we need to find new unique indexes
- Could be anything, but we use a multi index!
df_multind = df.set_index(["Nodes", "Tasks/Node", "Threads/Task"])
df_multind.head()
Nodes | Tasks/Node | Threads/Task | id | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | Min. Init. Time / s | Max. Init. Time / s | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 4 | 5 | 420.42 | 10 | True | 0.29 | 88.12 | 88.18 | 1.14 | 1.20 | 17.26 | 311.52 | 46560664.0 | 825499 | 7.48 | 112500 | 1265738500 | 1.5 | 1.5 | 2.09 |
1 | 2 | 8 | 5 | 202.15 | 10 | True | 0.28 | 47.98 | 48.48 | 0.70 | 1.20 | 7.95 | 142.81 | 47699384.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 2.43 |
4 | 4 | 5 | 200.84 | 10 | True | 0.15 | 46.03 | 46.34 | 0.70 | 1.01 | 7.87 | 142.97 | 46903088.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 3.12 | |
2 | 2 | 4 | 5 | 164.16 | 10 | True | 0.20 | 40.03 | 41.09 | 0.52 | 1.58 | 6.08 | 114.88 | 46937216.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 2.45 |
1 | 2 | 12 | 6 | 141.70 | 10 | True | 0.30 | 32.93 | 33.26 | 0.62 | 0.95 | 5.41 | 100.16 | 50148824.0 | 813743 | 7.27 | 112500 | 1265738500 | 1.5 | 1.5 | 2.28 |
df_multind[["Unaccounted Time / s", *cols]]\
.divide(df_multind["Runtime Program / s"], axis="index")\
.plot(kind="bar", stacked=True, figsize=(14, 6), title="Relative Time Distribution");
Next Level: Hierarchical Data
- MultiIndex is only a first level
- More powerful:
  - Grouping: .groupby() ("split-apply-combine"; API, User Guide)
  - Pivoting: .pivot_table() (API, User Guide); also .pivot() (specialized version of .pivot_table(), API)
Grouping
- Group a frame by common values of column(s)
- Use operations on this group
- A grouped frame is not directly a new frame; it becomes one only through an applied operation
df.groupby("Nodes").groups
{1: [8, 16, 16, 24, 32, 48], 2: [16, 32, 32, 48, 64, 96], 3: [24, 48, 48, 72, 96, 144], 4: [32, 64, 64, 96, 128, 192], 5: [40, 80, 80, 120, 160, 240], 6: [48, 96, 96, 144, 192, 288]}
df.groupby("Nodes").get_group(4).head(3)
Threads | id | Nodes | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | ... | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | 5 | 4 | 2 | 4 | 66.58 | 10 | True | 0.13 | 18.86 | 19.65 | ... | 2.35 | 43.38 | 47361344.0 | 821491 | 7.23 | 112500 | 1265738500 | 1.5 | 1.5 | 1.70 |
64 | 5 | 4 | 2 | 8 | 34.09 | 10 | True | 0.14 | 10.60 | 10.83 | ... | 1.25 | 20.96 | 47074752.0 | 818198 | 7.33 | 112500 | 1265738500 | 1.5 | 1.5 | 1.03 |
64 | 5 | 4 | 4 | 4 | 32.49 | 10 | True | 0.09 | 9.98 | 10.31 | ... | 1.12 | 20.12 | 48081056.0 | 818198 | 7.33 | 112500 | 1265738500 | 1.5 | 1.5 | 1.09 |
df.groupby("Nodes").mean()
Nodes | id | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | Min. Init. Time / s | ... | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5.333333 | 3.0 | 8.0 | 185.023333 | 10.0 | 1.0 | 0.220000 | 42.040000 | 42.838333 | 0.583333 | ... | 7.226667 | 132.061667 | 4.806585e+07 | 816298.000000 | 7.215000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 2.891667 |
2 | 5.333333 | 3.0 | 8.0 | 73.601667 | 10.0 | 1.0 | 0.168333 | 19.628333 | 20.313333 | 0.191667 | ... | 2.725000 | 48.901667 | 4.975288e+07 | 818151.000000 | 7.210000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.986667 |
3 | 5.333333 | 3.0 | 8.0 | 43.990000 | 10.0 | 1.0 | 0.138333 | 12.810000 | 13.305000 | 0.135000 | ... | 1.426667 | 27.735000 | 5.511165e+07 | 820465.666667 | 7.253333 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.745000 |
4 | 5.333333 | 3.0 | 8.0 | 31.225000 | 10.0 | 1.0 | 0.116667 | 9.325000 | 9.740000 | 0.088333 | ... | 1.066667 | 19.353333 | 5.325783e+07 | 819558.166667 | 7.288333 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.275000 |
5 | 5.333333 | 3.0 | 8.0 | 24.896667 | 10.0 | 1.0 | 0.140000 | 7.468333 | 7.790000 | 0.070000 | ... | 0.771667 | 14.950000 | 6.075634e+07 | 815307.666667 | 7.225000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.496667 |
6 | 5.333333 | 3.0 | 8.0 | 20.215000 | 10.0 | 1.0 | 0.106667 | 6.165000 | 6.406667 | 0.051667 | ... | 0.630000 | 12.271667 | 6.060652e+07 | 815456.333333 | 7.201667 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 0.990000 |
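- Beyond .mean(), several aggregations can be computed in one go with .agg(); a sketch on the Nest frame, using the column names from above:
df.groupby("Nodes")["Sim. Time / s"].agg(["min", "mean", "max"])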
Pivoting
- Combine categorically-similar columns
- Creates hierarchical index
- Respected during plotting with Pandas!
- A pivot table has three layers; if confused, think about the related questions:
  - index: »What's on the x axis?«
  - values: »What value do I want to plot [on the y axis]?«
  - columns: »What categories do I want [to be in the legend]?«
- All can be populated from the base data frame
- Might be aggregated, if needed
df_demo["H"] = [(-1)**n for n in range(5)]
df_pivot = df_demo.pivot_table(
index="F",
values="E2",
columns="H"
)
df_pivot
F | -1 | 1 |
---|---|---|
-3.918282 | NaN | 7.389056 |
-2.504068 | NaN | 1.700594 |
-1.918282 | NaN | 0.515929 |
-0.213769 | 0.972652 | NaN |
0.518282 | 2.952492 | NaN |
df_pivot.plot(figsize=(10,3));
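- By default, .pivot_table() aggregates duplicate index/column combinations with the mean; other reductions can be requested via aggfunc. A sketch on the Nest frame:
df.pivot_table(index="Nodes", values="Sim. Time / s", aggfunc=["mean", "min"])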
Task 7
- Create a pivot table based on the Nest df data frame
- Let the x axis show the number of nodes; display the values of the simulation time "Sim. Time / s" for the tasks-per-node and threads-per-task configurations
- Please plot a bar plot
- Tell me when you're done with status icon in BigBlueButton: 👍
df.pivot_table(
index="Nodes",
columns=["Tasks/Node", "Threads/Task"],
values="Sim. Time / s",
).plot(kind="bar", figsize=(12, 4));
Pandas 2
- Pandas 2.0 was released in April 2023
- Only limited deprecations (i.e. an upgrade is probably safe)
- Key new feature: Apache Arrow support (via PyArrow)
- Fine-grained installation options:
python3 -m pip install 'pandas[performance, excel]'
- However: Currently [10/2024] there is a dependency mismatch in the default version
- Get a reasonably large data source (larger would be better, though)
- Example: Train stations as provided by Deutsche Bahn
data_db = 'db-bahnhoefe.csv' # source: https://web.archive.org/web/20231208211825/https://download-data.deutschebahn.com/static/datasets/stationsdaten/DBSuS-Uebersicht_Bahnhoefe-Stand2020-03.csv
%timeit pd.read_csv(data_db, sep=';')
10 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
import pyarrow
print(pyarrow.__version__)
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[234], line 1
----> 1 import pyarrow
2 print(pyarrow.__version__)
ModuleNotFoundError: No module named 'pyarrow'
pd.read_csv(data_db, sep=';', engine='pyarrow', dtype_backend='pyarrow')
Polars
import polars as ps
%timeit ps.read_csv(data_db, separator=';')
Large Data & Mangling
- Pandas can read data directly in tar form
- Pandas can read data directly from an online resource
- Let's combine that into an advanced task!
- It also works with the PyArrow backend (remember to download the online resource when testing; there is no cache!)
- IMDb provides data sets at datasets.imdbws.com
- Can directly be loaded like
pd.read_table('https://datasets.imdbws.com/dataset.tsv.gz', sep="\t", low_memory=False, na_values=["\\N", "nan"])
- Needed:
  - name.basics.tsv.gz (for names of actors and movies they are known for)
  - title.ratings.tsv.gz (for ratings of titles)
- Strategy suggestions:
  - Use df.apply() with a custom function
  - Custom function: Compute average rating and determine if this entry is eligible for plotting (this can be done at once, but does not need to be)
  - Average rating: Look up title IDs as listed in knownForTitles in the titles dataframe
- Use
df_names = pd.read_table('imdb-data/name.basics.tsv.gz', sep="\t", low_memory=False, na_values=["\\N", "nan"])
df_ratings = pd.read_table('https://datasets.imdbws.com/title.ratings.tsv.gz', sep="\t", low_memory=False, na_values=["\\N", "nan"])
df_names_i = df_names.set_index('nconst')
df_ratings_i = df_ratings.set_index('tconst')
# Append the computed columns (toPlot, avgRating) to the names frame;
# valid_and_avg_rating() is defined in the next cell
df_names_i = pd.concat(
    [
        df_names_i,
        df_names_i.apply(lambda line: valid_and_avg_rating(line), axis=1, result_type='expand')
    ],
    axis=1
)
df_names_i[df_names_i['toPlot'] == True]\
    .sort_values('avgRating', ascending=False)\
    .iloc[0:10]\
    .reset_index()\
    .set_index('primaryName')['avgRating']\
    .plot(kind='bar')
def valid_and_avg_rating(row):
    """Average the ratings of a person's known-for titles and flag entries to plot."""
    rating = 0
    ntitles = 0
    _titles = row['knownForTitles']
    _professions = row['primaryProfession']
    if not isinstance(_titles, str):      # guard against NaN entries
        _titles = str(_titles)
    if not isinstance(_professions, str):
        _professions = str(_professions)
    titles = _titles.split(',')
    professions = _professions.split(',')
    for title in titles:
        if title in df_ratings_i.index:   # only count titles that have a rating
            rating += df_ratings_i.loc[title]['averageRating']
            ntitles += 1
    if ntitles > 0:
        # plot only actors known for more than two rated titles
        plot = ntitles > 2 and 'actor' in professions
        return {'toPlot': plot, 'avgRating': rating / ntitles}
    else:
        return {'toPlot': False, 'avgRating': pd.NA}
Task 8B (Bonuseption)
All of the following are ideas for unique sub-tasks, which can be done individually
- In addition to Task 8, restrict the top titles to those with more than 10000 votes
- For 30 top-rated actors, plot rating vs. age
- For 30 top-rated actors, plot rating vs. average runtime of the known-for titles (using title.basics.tsv.gz)
Random Features Not Shown
These are all links:
- df.drop()
- df.corr()
- df.boxplot()
- pd.read_sql_query("SELECT * FROM purchases", con)
- df.duplicated() and df.drop_duplicates()
- Aliases for categorical data
- Working with time
  - ts.tz_convert
  - pd.period_range()
  - pd.period_range().asfreq()
Conclusion
- Pandas works with and on data frames, which are central
- Slice frames to your likings
- Plot frames
- Together with Matplotlib, Seaborn, others
- Pivot tables are next level greatness
- Remember: Pandas as early as possible!
- Thanks for being here! 😍
Feedback to a.herten@fz-juelich.de
Next slide: Further reading
Further Reading
- Pandas User Guide
- Matplotlib and LaTeX Plots
- towardsdatascience.com: