Motto: »Pandas as early as possible!«
pip install --user pandas seaborn
pip install pandas matplotlib statsmodels scikit-learn seaborn altair plotly
import pandas as pd
pd.__version__
'0.24.1'
%pdoc pd
Class docstring:
pandas - a powerful data analysis and manipulation library for Python
=====================================================================
**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.
Main Features
-------------
Here are just a few of the things that pandas does well:
- Easy handling of missing data in floating point as well as non-floating
point data.
- Size mutability: columns can be inserted and deleted from DataFrame and
higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned
to a set of labels, or the user can simply ignore the labels and let
`Series`, `DataFrame`, etc. automatically align the data for you in
computations.
- Powerful, flexible group by functionality to perform split-apply-combine
operations on data sets, for both aggregating and transforming data.
- Make it easy to convert ragged, differently-indexed data in other Python
and NumPy data structures into DataFrame objects.
- Intelligent label-based slicing, fancy indexing, and subsetting of large
data sets.
- Intuitive merging and joining data sets.
- Flexible reshaping and pivoting of data sets.
- Hierarchical labeling of axes (possible to have multiple labels per tick).
- Robust IO tools for loading data from flat files (CSV and delimited),
Excel files, databases, and saving/loading data from the ultrafast HDF5
format.
- Time series-specific functionality: date range generation and frequency
conversion, moving window statistics, moving window linear regressions,
date shifting and lagging, etc.
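The automatic data alignment mentioned above can be sketched with two Series that share only some index labels (the names here are illustrative):

```python
import pandas as pd

# Two Series with partially overlapping indexes
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on labels; labels present in only one Series yield NaN
total = s1 + s2
```

Only "b" and "c" get numeric sums; "a" and "d" become NaN instead of raising an error.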
ages = [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]
pd.DataFrame(ages)
| 0 | |
|---|---|
| 0 | 41 |
| 1 | 56 |
| 2 | 56 |
| 3 | 57 |
| 4 | 39 |
| 5 | 59 |
| 6 | 43 |
| 7 | 56 |
| 8 | 38 |
| 9 | 60 |
df_ages = pd.DataFrame(ages)
df_ages.head(3)
| 0 | |
|---|---|
| 0 | 41 |
| 1 | 56 |
| 2 | 56 |
A dict() is a convenient source for a DataFrame:
data = {
"Names": ["Liu", "Rowland", "Rivers", "Waters", "Rice", "Fields", "Kerr", "Romero", "Davis", "Hall"],
"Ages": ages
}
print(data)
{'Names': ['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr', 'Romero', 'Davis', 'Hall'], 'Ages': [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]}
df_sample = pd.DataFrame(data)
df_sample.head(4)
| Names | Ages | |
|---|---|---|
| 0 | Liu | 41 |
| 1 | Rowland | 56 |
| 2 | Rivers | 56 |
| 3 | Waters | 57 |
df_sample.columns
Index(['Names', 'Ages'], dtype='object')
DataFrames always have an index; auto-generated or custom
df_sample.index
RangeIndex(start=0, stop=10, step=1)
Make Names the index with .set_index(); inplace=True will modify the parent frame (I don't like it)
df_sample.set_index("Names", inplace=True)
df_sample
| Ages | |
|---|---|
| Names | |
| Liu | 41 |
| Rowland | 56 |
| Rivers | 56 |
| Waters | 57 |
| Rice | 39 |
| Fields | 59 |
| Kerr | 43 |
| Romero | 56 |
| Davis | 38 |
| Hall | 60 |
df_sample.describe()
| Ages | |
|---|---|
| count | 10.000000 |
| mean | 50.500000 |
| std | 9.009255 |
| min | 38.000000 |
| 25% | 41.500000 |
| 50% | 56.000000 |
| 75% | 56.750000 |
| max | 60.000000 |
df_sample.T
| Names | Liu | Rowland | Rivers | Waters | Rice | Fields | Kerr | Romero | Davis | Hall |
|---|---|---|---|---|---|---|---|---|---|---|
| Ages | 41 | 56 | 56 | 57 | 39 | 59 | 43 | 56 | 38 | 60 |
df_sample.T.columns
Index(['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr',
'Romero', 'Davis', 'Hall'],
dtype='object', name='Names')
df_sample.multiply(2).head(3)
| Ages | |
|---|---|
| Names | |
| Liu | 82 |
| Rowland | 112 |
| Rivers | 112 |
df_sample.reset_index().multiply(2).head(3)
| Names | Ages | |
|---|---|---|
| 0 | LiuLiu | 82 |
| 1 | RowlandRowland | 112 |
| 2 | RiversRivers | 112 |
(df_sample / 2).head(3)
| Ages | |
|---|---|
| Names | |
| Liu | 20.5 |
| Rowland | 28.0 |
| Rivers | 28.0 |
(df_sample * df_sample).head(3)
| Ages | |
|---|---|
| Names | |
| Liu | 1681 |
| Rowland | 3136 |
| Rivers | 3136 |
Logical operations are allowed as well
df_sample > 40
| Ages | |
|---|---|
| Names | |
| Liu | True |
| Rowland | True |
| Rivers | True |
| Waters | True |
| Rice | False |
| Fields | True |
| Kerr | True |
| Romero | True |
| Davis | False |
| Hall | True |
happy_dinos = {
"Dinosaur Name": [],
"Favourite Prime": [],
"Favourite Color": []
}
#df_dinos =
happy_dinos = {
"Dinosaur Name": ["Aegyptosaurus", "Tyrannosaurus", "Panoplosaurus", "Isisaurus", "Triceratops", "Velociraptor"],
"Favourite Prime": ["4", "8", "15", "16", "23", "42"],
"Favourite Color": ["blue", "white", "blue", "purple", "violet", "gray"]
}
df_dinos = pd.DataFrame(happy_dinos).set_index("Dinosaur Name")
df_dinos.T
| Dinosaur Name | Aegyptosaurus | Tyrannosaurus | Panoplosaurus | Isisaurus | Triceratops | Velociraptor |
|---|---|---|---|---|---|---|
| Favourite Prime | 4 | 8 | 15 | 16 | 23 | 42 |
| Favourite Color | blue | white | blue | purple | violet | gray |
Some more DataFrame examples
import numpy as np

df_demo = pd.DataFrame({
"A": 1.2,
"B": pd.Timestamp('20180226'),
"C": [(-1)**i * np.sqrt(i) + np.e * (-1)**(i-1) for i in range(5)],
"D": pd.Categorical(["This", "column", "has", "entries", "entries"]),
"E": "Same"
})
df_demo
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
| 3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
| 4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
df_demo.sort_values("C")
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
| 4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
| 3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
df_demo.round(2).tail(2)
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 3 | 1.2 | 2018-02-26 | 0.99 | entries | Same |
| 4 | 1.2 | 2018-02-26 | -0.72 | entries | Same |
df_demo.round(2).sum()
A                              6
C                          -2.03
D    Thiscolumnhasentriesentries
E           SameSameSameSameSame
dtype: object
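Note that summing a frame with mixed dtypes concatenates the string columns, as seen above. Newer pandas versions expect an explicit numeric_only flag to skip non-numeric columns — a minimal sketch:

```python
import pandas as pd

df_mixed = pd.DataFrame({"A": [1.2, 1.2], "E": ["Same", "Same"]})

# numeric_only=True restricts the reduction to numeric columns
totals = df_mixed.sum(numeric_only=True)
```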
print(df_demo.round(2).to_latex())
\begin{tabular}{lrlrll}
\toprule
{} & A & B & C & D & E \\
\midrule
0 & 1.2 & 2018-02-26 & -2.72 & This & Same \\
1 & 1.2 & 2018-02-26 & 1.72 & column & Same \\
2 & 1.2 & 2018-02-26 & -1.30 & has & Same \\
3 & 1.2 & 2018-02-26 & 0.99 & entries & Same \\
4 & 1.2 & 2018-02-26 & -0.72 & entries & Same \\
\bottomrule
\end{tabular}
(Links to documentation)
Example:
{
"Character": ["Sawyer", "…", "Walt"],
"Actor": ["Josh Holloway", "…", "Malcolm David Kelley"],
"Main Cast": [true, "…", false]
}
pd.read_json("lost.json").set_index("Character").sort_index()
| Actor | Main Cast | |
|---|---|---|
| Character | ||
| Hurley | Jorge Garcia | True |
| Jack | Matthew Fox | True |
| Kate | Evangeline Lilly | True |
| Locke | Terry O'Quinn | True |
| Sawyer | Josh Holloway | True |
| Walt | Malcolm David Kelley | False |
Read nest-data.csv into a DataFrame; call it df
!cat nest-data.csv | head -3
id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay
5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5
5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5
df = pd.read_csv("nest-data.csv")
df.head()
| id | Nodes | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | ... | Max. Init. Time / s | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 2 | 4 | 420.42 | 10 | True | 0.29 | 88.12 | 88.18 | ... | 1.20 | 17.26 | 311.52 | 46560664.0 | 825499 | 7.48 | 112500 | 1265738500 | 1.5 | 1.5 |
| 1 | 5 | 1 | 4 | 4 | 200.84 | 10 | True | 0.15 | 46.03 | 46.34 | ... | 1.01 | 7.87 | 142.97 | 46903088.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 |
| 2 | 5 | 1 | 2 | 8 | 202.15 | 10 | True | 0.28 | 47.98 | 48.48 | ... | 1.20 | 7.95 | 142.81 | 47699384.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 |
| 3 | 5 | 1 | 4 | 8 | 89.57 | 10 | True | 0.15 | 20.41 | 23.21 | ... | 3.04 | 3.19 | 60.31 | 46813040.0 | 821491 | 7.23 | 112500 | 1265738500 | 1.5 | 1.5 |
| 4 | 5 | 2 | 2 | 4 | 164.16 | 10 | True | 0.20 | 40.03 | 41.09 | ... | 1.58 | 6.08 | 114.88 | 46937216.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 |
5 rows × 21 columns
Some notable parameters of pd.read_csv():

- sep: Set separator (for example : instead of ,)
- header: Specify info about headers for columns; able to use multi-index for columns!
- names: Alternative to header – provide your own column titles
- usecols: Don't read whole set of columns, but only these; works with any list (range(0:20:2))…
- skiprows: Don't read in these rows
- na_values: What string(s) to recognize as N/A values (which will be ignored during operations on data frame)
- parse_dates: Try to parse dates in CSV; different behaviours as to provided data structure; optionally used together with date_parser
- compression: Treat input file as compressed file ("infer", "gzip", "zip", …)
- decimal: Decimal point divider – for German data…

Full signature:
pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
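A few of these options in action, using an in-memory buffer in place of a file (the CSV content below is made up for illustration):

```python
import io
import pandas as pd

csv_text = "id;val;note\n1;3,14;ok\n2;-;missing\n"

# sep=";" for a semicolon-separated file, decimal="," for German-style
# floats, na_values to treat "-" as missing data
df_opts = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    decimal=",",
    na_values=["-"],
)
```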
df_demo.head(3)
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo["C"]
0 -2.718282 1 1.718282 2 -1.304068 3 0.986231 4 -0.718282 Name: C, dtype: float64
Pass a list to the slice operator [] to select multiple columns, e.g. A and C via ["A", "C"], from df_demo
df_demo[["A", "C"]]
| A | C | |
|---|---|---|
| 0 | 1.2 | -2.718282 |
| 1 | 1.2 | 1.718282 |
| 2 | 1.2 | -1.304068 |
| 3 | 1.2 | 0.986231 |
| 4 | 1.2 | -0.718282 |
df_demo[1:3]
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo.iloc[1:3]
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo.iloc[1:6:2]
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
| 3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
.iloc[] location might change after re-sorting!
df_demo.sort_values("C").iloc[1:3]
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
| 4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
Label-based selection with .loc[]
df_demo_indexed = df_demo.set_index("D")
df_demo_indexed
| A | B | C | E | |
|---|---|---|---|---|
| D | ||||
| This | 1.2 | 2018-02-26 | -2.718282 | Same |
| column | 1.2 | 2018-02-26 | 1.718282 | Same |
| has | 1.2 | 2018-02-26 | -1.304068 | Same |
| entries | 1.2 | 2018-02-26 | 0.986231 | Same |
| entries | 1.2 | 2018-02-26 | -0.718282 | Same |
df_demo_indexed.loc["entries"]
| A | B | C | E | |
|---|---|---|---|---|
| D | ||||
| entries | 1.2 | 2018-02-26 | 0.986231 | Same |
| entries | 1.2 | 2018-02-26 | -0.718282 | Same |
df_demo[df_demo["C"] > 0]
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
| 3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same |
df_demo[(df_demo["C"] < 0) & (df_demo["D"] == "entries")]
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same |
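The same filter can also be expressed with .query(), which takes the condition as a string — a matter of taste; the frame below mimics the relevant columns of df_demo:

```python
import pandas as pd

df = pd.DataFrame({
    "C": [-2.718282, 1.718282, -1.304068, 0.986231, -0.718282],
    "D": ["This", "column", "has", "entries", "entries"],
})

# Equivalent to df[(df["C"] < 0) & (df["D"] == "entries")]
subset = df.query("C < 0 and D == 'entries'")
```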
Add columns with frame["new col"] = something or .insert(); add rows with frame.append()
df_demo.head(3)
| A | B | C | D | E | |
|---|---|---|---|---|---|
| 0 | 1.2 | 2018-02-26 | -2.718282 | This | Same |
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same |
df_demo["F"] = df_demo["C"] - df_demo["A"]
df_demo.head(3)
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 0 | 1.2 | 2018-02-26 | -2.718282 | This | Same | -3.918282 |
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same | 0.518282 |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same | -2.504068 |
df_demo.insert(df_demo.shape[1], "G", df_demo["C"] ** 2)
df_demo.tail(3)
| A | B | C | D | E | F | G | |
|---|---|---|---|---|---|---|---|
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same | -2.504068 | 1.700594 |
| 3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same | -0.213769 | 0.972652 |
| 4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same | -1.918282 | 0.515929 |
df_demo.append(
{"A": 1.3, "B": pd.Timestamp("2018-02-27"), "C": -0.777, "D": "has it?", "E": "Same", "F": 23},
ignore_index=True
)
| A | B | C | D | E | F | G | |
|---|---|---|---|---|---|---|---|
| 0 | 1.2 | 2018-02-26 | -2.718282 | This | Same | -3.918282 | 7.389056 |
| 1 | 1.2 | 2018-02-26 | 1.718282 | column | Same | 0.518282 | 2.952492 |
| 2 | 1.2 | 2018-02-26 | -1.304068 | has | Same | -2.504068 | 1.700594 |
| 3 | 1.2 | 2018-02-26 | 0.986231 | entries | Same | -0.213769 | 0.972652 |
| 4 | 1.2 | 2018-02-26 | -0.718282 | entries | Same | -1.918282 | 0.515929 |
| 5 | 1.3 | 2018-02-27 | -0.777000 | has it? | Same | 23.000000 | NaN |
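A forward-looking note: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; the same row addition works with pd.concat() — a sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.2, 1.2], "C": [-2.72, 1.72]})
new_row = pd.DataFrame([{"A": 1.3, "C": -0.777}])

# pd.concat() replaces the removed DataFrame.append()
df_extended = pd.concat([df, new_row], ignore_index=True)
```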
Combine frames with .concat() and .merge()
df_1 = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
df_1
| Key | Value | |
|---|---|---|
| 0 | First | 1 |
| 1 | Second | 1 |
df_2 = pd.DataFrame({"Key": ["First", "Second"], "Value": [2, 2]})
df_2
| Key | Value | |
|---|---|---|
| 0 | First | 2 |
| 1 | Second | 2 |
Concatenation appends frames along an axis (default: rows, axis=0)
pd.concat([df_1, df_2])
| Key | Value | |
|---|---|---|
| 0 | First | 1 |
| 1 | Second | 1 |
| 0 | First | 2 |
| 1 | Second | 2 |
pd.concat([df_1, df_2], ignore_index=True)
| Key | Value | |
|---|---|---|
| 0 | First | 1 |
| 1 | Second | 1 |
| 2 | First | 2 |
| 3 | Second | 2 |
pd.concat([df_1, df_2], axis=1)
| Key | Value | Key | Value | |
|---|---|---|---|---|
| 0 | First | 1 | First | 2 |
| 1 | Second | 1 | Second | 2 |
pd.merge(df_1, df_2, on="Key")
| Key | Value_x | Value_y | |
|---|---|---|---|
| 0 | First | 1 | 2 |
| 1 | Second | 1 | 2 |
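.merge() has more knobs than shown here; `how` selects the join type and `suffixes` renames clashing columns (the values below are illustrative):

```python
import pandas as pd

df_1 = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
df_2 = pd.DataFrame({"Key": ["Second", "Third"], "Value": [2, 2]})

# Outer join keeps keys from both frames; custom suffixes instead of _x/_y
merged = pd.merge(df_1, df_2, on="Key", how="outer", suffixes=("_left", "_right"))
```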
Add a column Virtual Processes, the total number of threads across all nodes (i.e. the product of threads per task, tasks per node, and nodes)
df["Virtual Processes"] = df["Nodes"] * df["Tasks/Node"] * df["Threads/Task"]
df.head()
| id | Nodes | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | ... | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Virtual Processes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 2 | 4 | 420.42 | 10 | True | 0.29 | 88.12 | 88.18 | ... | 17.26 | 311.52 | 46560664.0 | 825499 | 7.48 | 112500 | 1265738500 | 1.5 | 1.5 | 8 |
| 1 | 5 | 1 | 4 | 4 | 200.84 | 10 | True | 0.15 | 46.03 | 46.34 | ... | 7.87 | 142.97 | 46903088.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 16 |
| 2 | 5 | 1 | 2 | 8 | 202.15 | 10 | True | 0.28 | 47.98 | 48.48 | ... | 7.95 | 142.81 | 47699384.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 16 |
| 3 | 5 | 1 | 4 | 8 | 89.57 | 10 | True | 0.15 | 20.41 | 23.21 | ... | 3.19 | 60.31 | 46813040.0 | 821491 | 7.23 | 112500 | 1265738500 | 1.5 | 1.5 | 32 |
| 4 | 5 | 2 | 2 | 4 | 164.16 | 10 | True | 0.20 | 40.03 | 41.09 | ... | 6.08 | 114.88 | 46937216.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 16 |
5 rows × 22 columns
df.columns
Index(['id', 'Nodes', 'Tasks/Node', 'Threads/Task', 'Runtime Program / s',
'Scale', 'Plastic', 'Avg. Neuron Build Time / s',
'Min. Edge Build Time / s', 'Max. Edge Build Time / s',
'Min. Init. Time / s', 'Max. Init. Time / s', 'Presim. Time / s',
'Sim. Time / s', 'Virt. Memory (Sum) / kB', 'Local Spike Counter (Sum)',
'Average Rate (Sum)', 'Number of Neurons', 'Number of Connections',
'Min. Delay', 'Max. Delay', 'Virtual Processes'],
dtype='object')
pyplot provides a MATLAB-like interface; the object-oriented API builds on Figure and Axes
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Use like this')
ax.set_xlabel("Numbers again");
ax.set_ylabel("$\sqrt{x}$");
Call ax.plot() multiple times to draw several lines into the same axes
y2 = y/np.exp(y*1.5)
fig, ax = plt.subplots()
ax.plot(x, y, label="y")
ax.plot(x, y2, label="y2")
ax.legend()
ax.set_title("This plot makes no sense");
Plot "Presim. Time / s" and "Sim. Time / s" of our data frame df as a function of the virtual processes; use a dashed red line for "Presim. Time / s" and a blue line for "Sim. Time / s" (see API description)
df.sort_values(["Virtual Processes", "Nodes", "Tasks/Node", "Threads/Task"], inplace=True)
fig, ax = plt.subplots()
ax.plot(df["Virtual Processes"], df["Presim. Time / s"], linestyle="dashed", color="red", label="Presim. Time / s")
ax.plot(df["Virtual Processes"], df["Sim. Time / s"], "-b", label="Sim. Time / s")
ax.set_xlabel("Virtual Processes")
ax.set_ylabel("Time / s")
ax.legend();
Options of the .plot() function (see API):

- kind: line (default), bar[h], hist, box, kde, scatter, hexbin
- subplots: Make a sub-plot for each column (good together with sharex, sharey)
- figsize
- grid: Add a grid to plot (use Matplotlib options)
- style: Line style per column (accepts list or dict)
- logx, logy, loglog: Logarithmic plots
- xticks, yticks: Use values for ticks
- xlim, ylim: Limits of axes
- yerr, xerr: Add uncertainty to data points
- stacked: Stack a bar plot
- secondary_y: Use a secondary y axis for this plot
- title: Add title to plot (use a list of strings if subplots=True)
- legend: Add a legend
- table: If true, add table of data under plot
- **kwds: Every non-parsed keyword is passed through to Matplotlib's plotting methods

df_demo["C"].plot(figsize=(10, 2));
df_demo.plot(y="C", figsize=(10, 2));
df_demo["C"].plot(kind="bar");
Plot kinds can also be called directly: instead of .plot(kind="smthng"), use .plot.smthng()
df_demo["C"].plot.bar();
df_demo["C"].plot(kind="bar", legend=True, figsize=(12, 4), ylim=(-1, 3), title="This is a C plot");
Use the NEST data frame df to:
- Make "Virtual Processes" the index (.set_index())
- Plot "Presim. Time / s" and "Sim. Time / s" individually
- Add a legend, add missing labels
Done? Tell me! pollev.com/aherten538
df.set_index("Virtual Processes", inplace=True)
df["Presim. Time / s"].plot(figsize=(10, 3));
df["Sim. Time / s"].plot(figsize=(10, 3));
df["Presim. Time / s"].plot();
df["Sim. Time / s"].plot();
ax = df[["Presim. Time / s", "Sim. Time / s"]].plot();
ax.set_ylabel("Time / s");
df[["Presim. Time / s", "Sim. Time / s"]].plot();
df_demo[["A", "C", "F"]].plot(kind="bar", stacked=True);
df_demo[df_demo["F"] < 0][["A", "C", "F"]].plot(kind="bar", stacked=True);
df_demo[df_demo["F"] < 0][["A", "C", "F"]]\
.plot(kind="barh", subplots=True, sharex=True, title="Subplots", figsize=(12, 4));
df_demo[df_demo["F"] < 0][["A", "F"]]\
.plot(
style=["-*r", "--ob"],
secondary_y="A",
figsize=(12, 6),
table=True
);
df_demo[df_demo["F"] < 0][["A", "F"]]\
.plot(
style=["-*r", "--ob"],
secondary_y="A",
figsize=(12, 6),
yerr={
"A": df_demo[df_demo["F"] < 0]["C"],
"F": 0.2
},
capsize=4,
title="Bug: style is ignored with yerr",
marker="P"
);
Get the figure with ax.get_figure() (for fig.savefig()); to plot into an existing axes, pass the ax option to .plot()
ax = df_demo["C"].plot(figsize=(10, 4))
ax.set_title("Hello there!");
fig = ax.get_figure()
fig.suptitle("This title is super!");
fig, ax = plt.subplots(figsize=(10, 4))
df_demo["C"].plot(ax=ax)
ax.set_title("Hello there!");
fig.suptitle("This title is super!");
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 4))
for ax, column, color in zip([ax1, ax2], ["C", "F"], ["blue", "#b2e123"]):
df_demo[column].plot(ax=ax, legend=True, color=color)
import seaborn as sns
sns.set()
df_demo[["A", "C"]].plot();
sns.palplot(sns.color_palette())
sns.palplot(sns.color_palette("hls", 10))
sns.palplot(sns.color_palette("hsv", 20))
sns.palplot(sns.color_palette("Paired", 10))
sns.palplot(sns.color_palette("cubehelix", 8))
sns.palplot(sns.color_palette("colorblind", 10))
with sns.color_palette("hls", 2):
sns.regplot(x="C", y="F", data=df_demo);
sns.regplot(x="C", y="G", data=df_demo);
x, y = np.random.multivariate_normal([0, 0], [[1, -.5], [-.5, 1]], size=300).T
sns.jointplot(x=x, y=y, kind="reg");
To the df NEST data frame, add a column with the unaccounted time (Unaccounted Time / s): the program runtime minus average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time.
cols = [
'Avg. Neuron Build Time / s',
'Min. Edge Build Time / s',
'Min. Init. Time / s',
'Presim. Time / s',
'Sim. Time / s'
]
df["Unaccounted Time / s"] = df['Runtime Program / s']
for entry in cols:
df["Unaccounted Time / s"] = df["Unaccounted Time / s"] - df[entry]
df[["Runtime Program / s", "Unaccounted Time / s", *cols]].head(2)
| Runtime Program / s | Unaccounted Time / s | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Min. Init. Time / s | Presim. Time / s | Sim. Time / s | |
|---|---|---|---|---|---|---|---|
| Virtual Processes | |||||||
| 8 | 420.42 | 2.09 | 0.29 | 88.12 | 1.14 | 17.26 | 311.52 |
| 16 | 202.15 | 2.43 | 0.28 | 47.98 | 0.70 | 7.95 | 142.81 |
df[["Unaccounted Time / s", *cols]].plot(kind="bar", stacked=True, figsize=(12, 4));
df_multind = df.set_index(["Nodes", "Tasks/Node", "Threads/Task"])
df_multind.head()
| id | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | Min. Init. Time / s | Max. Init. Time / s | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nodes | Tasks/Node | Threads/Task | |||||||||||||||||||
| 1 | 2 | 4 | 5 | 420.42 | 10 | True | 0.29 | 88.12 | 88.18 | 1.14 | 1.20 | 17.26 | 311.52 | 46560664.0 | 825499 | 7.48 | 112500 | 1265738500 | 1.5 | 1.5 | 2.09 |
| 8 | 5 | 202.15 | 10 | True | 0.28 | 47.98 | 48.48 | 0.70 | 1.20 | 7.95 | 142.81 | 47699384.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 2.43 | ||
| 4 | 4 | 5 | 200.84 | 10 | True | 0.15 | 46.03 | 46.34 | 0.70 | 1.01 | 7.87 | 142.97 | 46903088.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 3.12 | |
| 2 | 2 | 4 | 5 | 164.16 | 10 | True | 0.20 | 40.03 | 41.09 | 0.52 | 1.58 | 6.08 | 114.88 | 46937216.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 2.45 |
| 1 | 2 | 12 | 6 | 141.70 | 10 | True | 0.30 | 32.93 | 33.26 | 0.62 | 0.95 | 5.41 | 100.16 | 50148824.0 | 813743 | 7.27 | 112500 | 1265738500 | 1.5 | 1.5 | 2.28 |
df_multind[["Unaccounted Time / s", *cols]]\
.divide(df_multind["Runtime Program / s"], axis="index")\
.plot(kind="bar", stacked=True, figsize=(14, 6), title="Relative Time Distribution");
df.groupby("Nodes").mean()
| id | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | Min. Init. Time / s | ... | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nodes | |||||||||||||||||||||
| 1 | 5.333333 | 3.0 | 8.0 | 185.023333 | 10.0 | True | 0.220000 | 42.040000 | 42.838333 | 0.583333 | ... | 7.226667 | 132.061667 | 4.806585e+07 | 816298.000000 | 7.215000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 2.891667 |
| 2 | 5.333333 | 3.0 | 8.0 | 73.601667 | 10.0 | True | 0.168333 | 19.628333 | 20.313333 | 0.191667 | ... | 2.725000 | 48.901667 | 4.975288e+07 | 818151.000000 | 7.210000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.986667 |
| 3 | 5.333333 | 3.0 | 8.0 | 43.990000 | 10.0 | True | 0.138333 | 12.810000 | 13.305000 | 0.135000 | ... | 1.426667 | 27.735000 | 5.511165e+07 | 820465.666667 | 7.253333 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.745000 |
| 4 | 5.333333 | 3.0 | 8.0 | 31.225000 | 10.0 | True | 0.116667 | 9.325000 | 9.740000 | 0.088333 | ... | 1.066667 | 19.353333 | 5.325783e+07 | 819558.166667 | 7.288333 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.275000 |
| 5 | 5.333333 | 3.0 | 8.0 | 24.896667 | 10.0 | True | 0.140000 | 7.468333 | 7.790000 | 0.070000 | ... | 0.771667 | 14.950000 | 6.075634e+07 | 815307.666667 | 7.225000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.496667 |
| 6 | 5.333333 | 3.0 | 8.0 | 20.215000 | 10.0 | True | 0.106667 | 6.165000 | 6.406667 | 0.051667 | ... | 0.630000 | 12.271667 | 6.060652e+07 | 815456.333333 | 7.201667 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 0.990000 |
6 rows × 21 columns
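Besides .mean(), a groupby can compute several aggregations in one go with .agg() — a minimal sketch (the column names are illustrative, not taken from the NEST data):

```python
import pandas as pd

df = pd.DataFrame({
    "Nodes": [1, 1, 2, 2],
    "Sim. Time": [4.0, 6.0, 2.0, 4.0],
})

# One mean and one max per group, computed in a single pass
stats = df.groupby("Nodes")["Sim. Time"].agg(["mean", "max"])
```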
Arguments of pivot_table():

- index: »What's on the x axis?«
- values: »What value do I want to plot?«
- columns: »What categories do I want [to be in the legend]?«

df_demo["H"] = [(-1)**n for n in range(5)]
df_pivot = df_demo.pivot_table(
index="F",
values="G",
columns="H"
)
df_pivot
| H | -1 | 1 |
|---|---|---|
| F | ||
| -3.918282 | NaN | 7.389056 |
| -2.504068 | NaN | 1.700594 |
| -1.918282 | NaN | 0.515929 |
| -0.213769 | 0.972652 | NaN |
| 0.518282 | 2.952492 | NaN |
df_pivot.plot();
Create a pivot table from the df data frame: the x axis shows the number of nodes; display the values of the simulation time "Sim. Time / s" for the tasks per node and threads per task configurations
df.pivot_table(
index=["Nodes"],
columns=["Tasks/Node", "Threads/Task"],
values="Sim. Time / s",
).plot(kind="bar", figsize=(12, 4));
Tell me what you think about this tutorial! a.herten@fz-juelich.de
Next slide: Further reading
