
    Data Analysis and Plotting in Python with Pandas

    Carolin Penke, Jülich Supercomputing Centre, Forschungszentrum Jülich, 23 October 2024. Based on material by Andreas Herten.

    Version: Slides

    My Motivation

    • I like Python
    • I like plotting data
    • I like sharing
    • I think Pandas is awesome and you should use it too
    • but I'm no Python expert!

    Motto: »Pandas as early as possible!«

    Course Setup

    • 3½ hours, including break around 10:30
    • Alternating between lecture and hands-on
    • Please signal your status during hands-ons via 👍 as BigBlueButton status
    • The TAs and I in the room can help with issues, either in public chat or in 1:1 chat

    About Pandas

    Pandas Cohabitation

    First Steps

    In [118]:
    import pandas
    In [119]:
    import pandas as pd
    In [120]:
    pd.__version__
    Out [120]:
    '2.1.4'
    In [121]:
    %pdoc pd

    DataFrames

    It's all about DataFrames

    • Data containers of Pandas:
      • Linear: Series
      • Multi-dimensional: DataFrame
    • A Series is only a special (1D) case of a DataFrame
    • → We use DataFrames as the more general case here
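The Series/DataFrame relationship can be sketched in two lines (a minimal illustration, not part of the original slides):

```python
import pandas as pd

df = pd.DataFrame({"Age": [41, 56]})

# Slicing a single column out of a DataFrame yields a Series (1D) ...
col = df["Age"]
print(type(col).__name__)   # Series

# ... and a Series can be promoted back to a one-column DataFrame.
back = col.to_frame()
print(type(back).__name__)  # DataFrame
```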

    DataFrames

    Construction

    • To show the features of DataFrame, let's construct one and explore it by example!
    • Many construction possibilities
      • From lists, dictionaries, numpy objects
      • From CSV, HDF5, JSON, Excel, HTML, fixed-width files
      • From pickled Pandas data
      • From clipboard
      • From Feather, Parquet, SAS, SQL, Google BigQuery, STATA
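A minimal sketch of three of these construction routes (illustrative only; the column names are made up):

```python
import numpy as np
import pandas as pd

# From a list: one unnamed column, auto-generated RangeIndex
df_list = pd.DataFrame([1, 2, 3])

# From a dict: keys become column names
df_dict = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# From a NumPy array: optionally pass column labels
df_np = pd.DataFrame(np.zeros((2, 3)), columns=["x", "y", "z"])

print(df_list.shape, df_dict.shape, df_np.shape)  # (3, 1) (2, 2) (2, 3)
```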

    DataFrames

    Examples, finally

    In [122]:
    ages  = [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]
    In [123]:
    pd.DataFrame(ages)
    Out [123]:
        0
    0  41
    1  56
    2  56
    3  57
    4  39
    5  59
    6  43
    7  56
    8  38
    9  60
    In [124]:
    df_ages = pd.DataFrame(ages)
    df_ages.head(3)
    Out [124]:
        0
    0  41
    1  56
    2  56
    • Let's add names to ages; put everything into a dict()
    In [125]:
    data = {
        "Name": ["Liu", "Rowland", "Rivers", "Waters", "Rice", "Fields", "Kerr", "Romero", "Davis", "Hall"],
        "Age": ages
    }
    print(data)
    Out [125]:
    {'Name': ['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr', 'Romero', 'Davis', 'Hall'], 'Age': [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]}
    
    In [126]:
    df_sample = pd.DataFrame(data)
    df_sample.head(4)
    Out [126]:
          Name  Age
    0      Liu   41
    1  Rowland   56
    2   Rivers   56
    3   Waters   57
    • Automatically creates columns from dictionary
    • Two columns now; one for names, one for ages
    In [127]:
    df_sample.columns
    Out [127]:
    Index(['Name', 'Age'], dtype='object')
    • First column is the index
    • DataFrames always have an index; auto-generated or custom
    In [128]:
    df_sample.index
    Out [128]:
    RangeIndex(start=0, stop=10, step=1)
    • Make Name the index with .set_index()
    • inplace=True will modify the parent frame (I don't like it)
    In [129]:
    df_sample.set_index("Name", inplace=True)
    df_sample
    Out [129]:
             Age
    Name
    Liu       41
    Rowland   56
    Rivers    56
    Waters    57
    Rice      39
    Fields    59
    Kerr      43
    Romero    56
    Davis     38
    Hall      60
    • Some more operations
    In [130]:
    df_sample.describe()
    Out [130]:
                 Age
    count  10.000000
    mean   50.500000
    std     9.009255
    min    38.000000
    25%    41.500000
    50%    56.000000
    75%    56.750000
    max    60.000000
    In [131]:
    df_sample.info()
    Out [131]:
    <class 'pandas.core.frame.DataFrame'>
    Index: 10 entries, Liu to Hall
    Data columns (total 1 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   Age     10 non-null     int64
    dtypes: int64(1)
    memory usage: 160.0+ bytes
    
    In [132]:
    df_sample.T
    Out [132]:
    Name  Liu  Rowland  Rivers  Waters  Rice  Fields  Kerr  Romero  Davis  Hall
    Age    41       56      56      57    39      59    43      56     38    60
    In [133]:
    df_sample.T.columns
    Out [133]:
    Index(['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr',
           'Romero', 'Davis', 'Hall'],
          dtype='object', name='Name')
    • Also: Arithmetic operations
    In [134]:
    df_sample.multiply(2).head(3)
    Out [134]:
             Age
    Name
    Liu       82
    Rowland  112
    Rivers   112
    In [135]:
    df_sample.reset_index().multiply(2).head(3)
    Out [135]:
                 Name  Age
    0          LiuLiu   82
    1  RowlandRowland  112
    2    RiversRivers  112
    In [136]:
    (df_sample / 2).head(3)
    Out [136]:
              Age
    Name
    Liu      20.5
    Rowland  28.0
    Rivers   28.0
    In [137]:
    (df_sample * df_sample).head(3)
    Out [137]:
              Age
    Name
    Liu      1681
    Rowland  3136
    Rivers   3136
    In [138]:
    def mysquare(number: float) -> float:
        return number*number
    
    df_sample.apply(mysquare).head()
    # or: df_sample.apply(lambda x: x*x).head()
    Out [138]:
              Age
    Name
    Liu      1681
    Rowland  3136
    Rivers   3136
    Waters   3249
    Rice     1521
    In [139]:
    import numpy as np
    In [140]:
    df_sample.apply(np.square).head()
    Out [140]:
              Age
    Name
    Liu      1681
    Rowland  3136
    Rivers   3136
    Waters   3249
    Rice     1521

    Logical operations allowed as well

    In [141]:
    df_sample > 40
    Out [141]:
               Age
    Name
    Liu       True
    Rowland   True
    Rivers    True
    Waters    True
    Rice     False
    Fields    True
    Kerr      True
    Romero    True
    Davis    False
    Hall      True
    In [142]:
    df_sample.apply(mysquare).head() == df_sample.apply(lambda x: x*x).head()
    Out [142]:
              Age
    Name
    Liu      True
    Rowland  True
    Rivers   True
    Waters   True
    Rice     True
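One subtlety worth knowing here (my addition, not from the slides): element-wise `==` returns a whole frame of booleans, while `.equals()` collapses the comparison to a single bool:

```python
import pandas as pd

a = pd.DataFrame({"Age": [41, 56]})
b = pd.DataFrame({"Age": [41, 56]})

# Element-wise comparison returns a DataFrame of booleans ...
mask = a == b
print(mask["Age"].tolist())  # [True, True]

# ... while .equals() gives a single bool (and, unlike ==,
# treats NaN in the same position as equal).
print(a.equals(b))  # True
```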

    Task 1

    TASK

    • Create data frame with
      • 6 names of dinosaurs,
      • their favourite prime number,
      • and their favorite color.
    • Play around with the frame
    • Tell me when you're done with status icon in BigBlueButton: 👍
    In [143]:
    happy_dinos = {
        "Dinosaur Name": [],
        "Favourite Prime": [],
        "Favourite Color": []
    }
    #df_dinos = 
    In [144]:
    happy_dinos = {
        "Dinosaur Name": ["Aegyptosaurus", "Tyrannosaurus", "Panoplosaurus", "Isisaurus", "Triceratops", "Velociraptor"],
        "Favourite Prime": ["4", "8", "15", "16", "23", "42"],
        "Favourite Color": ["blue", "white", "blue", "purple", "violet", "gray"]
    }
    df_dinos = pd.DataFrame(happy_dinos).set_index("Dinosaur Name")
    df_dinos.T
    Out [144]:
    Dinosaur Name   Aegyptosaurus Tyrannosaurus Panoplosaurus Isisaurus Triceratops Velociraptor
    Favourite Prime             4             8            15        16          23           42
    Favourite Color          blue         white          blue    purple      violet         gray

    More DataFrame examples

    In [145]:
    df_demo = pd.DataFrame({
        "A": 1.2,
        "B": pd.Timestamp('20180226'),
        "C": [(-1)**i * np.sqrt(i) + np.e * (-1)**(i-1) for i in range(5)],
        "D": pd.Categorical(["This", "column", "has", "entries", "entries"]),
        "E": "Same"
    })
    df_demo
    Out [145]:
         A          B         C        D     E
    0  1.2 2018-02-26 -2.718282     This  Same
    1  1.2 2018-02-26  1.718282   column  Same
    2  1.2 2018-02-26 -1.304068      has  Same
    3  1.2 2018-02-26  0.986231  entries  Same
    4  1.2 2018-02-26 -0.718282  entries  Same
    In [146]:
    df_demo.sort_values("C")
    Out [146]:
         A          B         C        D     E
    0  1.2 2018-02-26 -2.718282     This  Same
    2  1.2 2018-02-26 -1.304068      has  Same
    4  1.2 2018-02-26 -0.718282  entries  Same
    3  1.2 2018-02-26  0.986231  entries  Same
    1  1.2 2018-02-26  1.718282   column  Same
    In [147]:
    df_demo.round(2).tail(2)
    Out [147]:
         A          B     C        D     E
    3  1.2 2018-02-26  0.99  entries  Same
    4  1.2 2018-02-26 -0.72  entries  Same
    In [148]:
    df_demo.round(2)[["A", "C"]].sum()
    Out [148]:
    A    6.00
    C   -2.03
    dtype: float64
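Beyond `.sum()`, several reductions can be applied in one call with `.agg()` (an illustrative sketch with made-up values, not from the slides):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.2, 1.2, 1.2], "C": [-2.72, 1.72, -1.30]})

# .agg() applies several reductions in one go; the result's rows are
# the aggregation names, its columns stay the frame's columns.
summary = df.agg(["sum", "mean", "max"])
print(round(summary.loc["sum", "A"], 2))  # 3.6
print(summary.loc["max", "C"])            # 1.72
```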
    In [149]:
    print(df_demo.round(2).to_latex())
    Out [149]:
    \begin{tabular}{lrlrll}
    \toprule
     & A & B & C & D & E \\
    \midrule
    0 & 1.200000 & 2018-02-26 00:00:00 & -2.720000 & This & Same \\
    1 & 1.200000 & 2018-02-26 00:00:00 & 1.720000 & column & Same \\
    2 & 1.200000 & 2018-02-26 00:00:00 & -1.300000 & has & Same \\
    3 & 1.200000 & 2018-02-26 00:00:00 & 0.990000 & entries & Same \\
    4 & 1.200000 & 2018-02-26 00:00:00 & -0.720000 & entries & Same \\
    \bottomrule
    \end{tabular}
    
    

    Reading External Data

    (Links to documentation)

    Example:

    {
        "Character": ["Sawyer", "…", "Walt"],
        "Actor": ["Josh Holloway", "…", "Malcolm David Kelley"],
        "Main Cast": [true,  "…", false]
    }
    
    In [150]:
    pd.read_json("data-lost.json").set_index("Character").sort_index()
    Out [150]:
                              Actor  Main Cast
    Character
    Hurley                 Jorge Garcia       True
    Jack                    Matthew Fox       True
    Kate               Evangeline Lilly       True
    Locke                 Terry O'Quinn       True
    Sawyer                Josh Holloway       True
    Walt           Malcolm David Kelley      False

    Task 2

    TASK

    • Read in data-nest.csv to DataFrame; call it df
      (Data was produced with JUBE)
    • Get to know it and play a bit with it
    • Tell me when you're done with status icon in BigBlueButton: 👍
    In [151]:
    !head data-nest.csv
    Out [151]:
    id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay
    5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5
    5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5
    5,1,2,8,202.15,10,true,0.28,47.98,48.48,0.70,1.20,7.95,142.81,47699384.00,802865,7.03,112500,1265738500,1.5,1.5
    5,1,4,8,89.57,10,true,0.15,20.41,23.21,0.23,3.04,3.19,60.31,46813040.00,821491,7.23,112500,1265738500,1.5,1.5
    5,2,2,4,164.16,10,true,0.20,40.03,41.09,0.52,1.58,6.08,114.88,46937216.00,802865,7.03,112500,1265738500,1.5,1.5
    5,2,4,4,77.68,10,true,0.13,20.93,21.22,0.16,0.46,3.12,52.05,47362064.00,821491,7.23,112500,1265738500,1.5,1.5
    5,2,2,8,79.60,10,true,0.20,21.63,21.91,0.19,0.47,2.98,53.12,46847168.00,821491,7.23,112500,1265738500,1.5,1.5
    5,2,4,8,37.20,10,true,0.13,10.08,11.60,0.10,1.63,1.24,23.29,47065232.00,818198,7.33,112500,1265738500,1.5,1.5
    5,3,2,4,96.51,10,true,0.15,26.54,27.41,0.36,1.22,3.33,64.28,52256880.00,813743,7.27,112500,1265738500,1.5,1.5
    
    In [152]:
    df = pd.read_csv("data-nest.csv")
    df.head()
    Out [152]:
       id  Nodes  Tasks/Node  Threads/Task  Runtime Program / s  Scale  Plastic  ...  Min. Delay  Max. Delay
    0   5      1           2             4               420.42     10     True  ...         1.5         1.5
    1   5      1           4             4               200.84     10     True  ...         1.5         1.5
    2   5      1           2             8               202.15     10     True  ...         1.5         1.5
    3   5      1           4             8                89.57     10     True  ...         1.5         1.5
    4   5      2           2             4               164.16     10     True  ...         1.5         1.5

    [5 rows x 21 columns]

    Read CSV Options

    • See also full API documentation
    • Important parameters
      • sep: Set separator (for example : instead of ,)
      • header: Specify info about headers for columns; can use a multi-index for columns!
      • names: Alternative to header – provide your own column titles
      • usecols: Don't read the whole set of columns, but only these; works with any list (for example range(0, 20, 2))…
      • skiprows: Don't read in these rows
      • na_values: What string(s) to recognize as N/A values (which will be ignored during operations on the data frame)
      • parse_dates: Try to parse dates in the CSV; behaviour depends on the provided data structure; optionally used together with date_parser
      • compression: Treat input file as compressed file ("infer", "gzip", "zip", …)
      • decimal: Decimal point divider – for German data…
    pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)
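A small self-contained sketch of some of these parameters, using an inline string instead of a file (illustrative data, not the course's CSV):

```python
import io
import pandas as pd

# A tiny inline "file" using ';' as separator and 'n/a' for missing values
raw = "a;b;c\n1;2;n/a\n4;n/a;6\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",             # non-default separator
    na_values=["n/a"],   # extra strings to treat as NaN
    usecols=["a", "b"],  # read only these columns
)
print(df.shape)              # (2, 2)
print(df["b"].isna().sum())  # 1
```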
    

    Slicing of Data Frames

    Quick Slices

    • Use square-bracket operators to slice the data frame quickly: []
      • Use a column name to select a column
      • Use numerical values to select rows
    • Example: Select only column C from df_demo
    In [153]:
    df_demo.head(3)
    Out [153]:
    (Click to sort ascending)
    A
    (Click to sort ascending)
    B
    (Click to sort ascending)
    C
    (Click to sort ascending)
    D
    (Click to sort ascending)
    E
    (Click to sort ascending)
    01.22018-02-26-2.718282ThisSame
    11.22018-02-261.718282columnSame
    21.22018-02-26-1.304068hasSame
    In [154]:
    df_demo['C']
    Out [154]:
    0   -2.718282
    1    1.718282
    2   -1.304068
    3    0.986231
    4   -0.718282
    Name: C, dtype: float64
    • Instead of the column name in quotes and square brackets: use the column name directly as an attribute
    In [155]:
    df_demo.C
    Out [155]:
    0   -2.718282
    1    1.718282
    2   -1.304068
    3    0.986231
    4   -0.718282
    Name: C, dtype: float64
    • I'm not a fan of this, because it only works for column names without spaces
      (And »Pandas as early as possible« means labelling columns well, including spaces)
    • Select more than one column by providing list to slice operator []
    • Example: Select list of columns A and C, ['A', 'C'] from df_demo
    In [156]:
    my_slice = ['A', 'C']
    df_demo[my_slice]
    Out [156]:
         A         C
    0  1.2 -2.718282
    1  1.2  1.718282
    2  1.2 -1.304068
    3  1.2  0.986231
    4  1.2 -0.718282
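Note the difference in return type (a small sketch, not from the slides): a single label yields a Series, a list of labels yields a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.2, 1.2], "C": [-2.7, 1.7]})

# A single label returns a Series ...
print(type(df["C"]).__name__)         # Series

# ... a list of labels (even a one-element list) returns a DataFrame
print(type(df[["C"]]).__name__)       # DataFrame
print(type(df[["A", "C"]]).__name__)  # DataFrame
```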
    • Use numerical values in brackets to slice along rows
    • Use ranges just like with Python lists
    In [157]:
    df_demo[1:3]
    Out [157]:
         A          B         C       D     E
    1  1.2 2018-02-26  1.718282  column  Same
    2  1.2 2018-02-26 -1.304068     has  Same
    In [158]:
    df_demo[1:6:2]
    Out [158]:
         A          B         C        D     E
    1  1.2 2018-02-26  1.718282   column  Same
    3  1.2 2018-02-26  0.986231  entries  Same
    • Attention: location might change after re-sorting!
    In [159]:
    df_demo[1:3]
    Out [159]:
         A          B         C       D     E
    1  1.2 2018-02-26  1.718282  column  Same
    2  1.2 2018-02-26 -1.304068     has  Same
    In [160]:
    df_demo.sort_values("C")[1:3]
    Out [160]:
         A          B         C        D     E
    2  1.2 2018-02-26 -1.304068      has  Same
    4  1.2 2018-02-26 -0.718282  entries  Same

    Slicing of Data Frames

    Better Slicing

    • .iloc[] and .loc[]: Faster slicing interfaces with more options
    In [161]:
    df_demo.iloc[1:3]
    Out [161]:
         A          B         C       D     E
    1  1.2 2018-02-26  1.718282  column  Same
    2  1.2 2018-02-26 -1.304068     has  Same
    • Also slice along columns (second argument)
    In [162]:
    df_demo.iloc[1:3, [0, 2]]
    Out [162]:
         A         C
    1  1.2  1.718282
    2  1.2 -1.304068
    • .iloc[]: Slice by position (numerical/integer)
    • .loc[]: Slice by label (named)
    • See difference with a proper index (and not the auto-generated default index from before)
    In [163]:
    df_demo_indexed = df_demo.set_index("D")
    df_demo_indexed
    Out [163]:
               A          B         C     E
    D
    This     1.2 2018-02-26 -2.718282  Same
    column   1.2 2018-02-26  1.718282  Same
    has      1.2 2018-02-26 -1.304068  Same
    entries  1.2 2018-02-26  0.986231  Same
    entries  1.2 2018-02-26 -0.718282  Same
    In [164]:
    df_demo_indexed.loc["entries"]
    Out [164]:
               A          B         C     E
    D
    entries  1.2 2018-02-26  0.986231  Same
    entries  1.2 2018-02-26 -0.718282  Same
    In [165]:
    df_demo_indexed.loc[["has", "entries"], ["A", "C"]]
    Out [165]:
               A         C
    D
    has      1.2 -1.304068
    entries  1.2  0.986231
    entries  1.2 -0.718282
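The positional/label distinction in one toy example (illustrative, with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"Age": [41, 56, 56]}, index=["Liu", "Rowland", "Rivers"])

# .iloc addresses by integer position ...
print(df.iloc[0]["Age"])           # 41

# ... .loc addresses by index label
print(df.loc["Rowland", "Age"])    # 56

# Position slices exclude the end, label slices include it
print(len(df.iloc[0:2]))            # 2
print(len(df.loc["Liu":"Rivers"]))  # 3
```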

    Slicing of Data Frames

    Advanced Slicing: Logical Slicing

    • Slice can also be array of booleans
    In [166]:
    df_demo[df_demo["C"] > 0]
    Out [166]:
         A          B         C        D     E
    1  1.2 2018-02-26  1.718282   column  Same
    3  1.2 2018-02-26  0.986231  entries  Same
    In [167]:
    df_demo["C"] > 0
    Out [167]:
    0    False
    1     True
    2    False
    3     True
    4    False
    Name: C, dtype: bool
    In [168]:
    df_demo[(df_demo["C"] < 0) & (df_demo["D"] == "entries")]
    Out [168]:
         A          B         C        D     E
    4  1.2 2018-02-26 -0.718282  entries  Same
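A few related helpers for building boolean masks (my sketch; `.isin()`, `.between()`, and `.query()` are standard Pandas API, but this example is not from the slides):

```python
import pandas as pd

df = pd.DataFrame({
    "C": [-2.7, 1.7, -1.3, 0.9],
    "D": ["This", "column", "has", "entries"],
})

# .isin() builds a boolean mask from set membership
print(len(df[df["D"].isin(["has", "entries"])]))  # 2

# .between() is a shortcut for two chained comparisons
print(len(df[df["C"].between(-2, 1)]))            # 2

# .query() expresses the same logic as a string expression
print(len(df.query("C < 0 and D == 'has'")))      # 1
```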

    Adding to Existing Data Frame

    • Add new columns with frame["new col"] = something or .insert()
    • Combine data frames
      • Concat: Combine several data frames along an axis
      • Merge: Combine data frames on basis of common columns; database-style
      • (Join)
      • See user guide on merging
    In [169]:
    df_demo.head(3)
    Out [169]:
         A          B         C       D     E
    0  1.2 2018-02-26 -2.718282    This  Same
    1  1.2 2018-02-26  1.718282  column  Same
    2  1.2 2018-02-26 -1.304068     has  Same
    In [170]:
    df_demo["F"] = df_demo["C"] - df_demo["A"]
    df_demo.head(3)
    Out [170]:
         A          B         C       D     E         F
    0  1.2 2018-02-26 -2.718282    This  Same -3.918282
    1  1.2 2018-02-26  1.718282  column  Same  0.518282
    2  1.2 2018-02-26 -1.304068     has  Same -2.504068
    • .insert() allows specifying the position of insertion
    • .shape gives the size of the data frame as a tuple: (rows, columns)
    In [171]:
    df_demo.insert(df_demo.shape[1] - 1, "E2", df_demo["C"] ** 2)
    df_demo.head(3)
    Out [171]:
         A          B         C       D     E        E2         F
    0  1.2 2018-02-26 -2.718282    This  Same  7.389056 -3.918282
    1  1.2 2018-02-26  1.718282  column  Same  2.952492  0.518282
    2  1.2 2018-02-26 -1.304068     has  Same  1.700594 -2.504068
    In [172]:
    df_demo.tail(3)
    Out [172]:
         A          B         C        D     E        E2         F
    2  1.2 2018-02-26 -1.304068      has  Same  1.700594 -2.504068
    3  1.2 2018-02-26  0.986231  entries  Same  0.972652 -0.213769
    4  1.2 2018-02-26 -0.718282  entries  Same  0.515929 -1.918282

    Combining Frames

    • First, create some simpler data frame to show .concat() and .merge()
    In [173]:
    df_1 = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
    df_1
    Out [173]:
          Key  Value
    0   First      1
    1  Second      1
    In [174]:
    df_2 = pd.DataFrame({"Key": ["First", "Second"], "Value": [2, 2]})
    df_2
    Out [174]:
          Key  Value
    0   First      2
    1  Second      2
    • Concatenate a list of data frames vertically (axis=0)
    In [175]:
    pd.concat([df_1, df_2])
    Out [175]:
          Key  Value
    0   First      1
    1  Second      1
    0   First      2
    1  Second      2
    • Same, but re-index
    In [176]:
    pd.concat([df_1, df_2], ignore_index=True)
    Out [176]:
          Key  Value
    0   First      1
    1  Second      1
    2   First      2
    3  Second      2
    • Concat, but horizontally
    In [177]:
    pd.concat([df_1, df_2], axis=1)
    Out [177]:
          Key  Value     Key  Value
    0   First      1   First      2
    1  Second      1  Second      2
    • Merge on common column
    In [178]:
    pd.merge(df_1, df_2, on="Key")
    Out [178]:
          Key  Value_x  Value_y
    0   First        1        2
    1  Second        1        2
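`pd.merge()` also takes a `how` parameter controlling the join type, database-style (a small sketch with made-up frames, not from the slides):

```python
import pandas as pd

left = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
right = pd.DataFrame({"Key": ["Second", "Third"], "Value": [2, 2]})

# Inner join keeps only keys present in both frames ...
inner = pd.merge(left, right, on="Key", how="inner")
print(len(inner))  # 1

# ... outer join keeps everything, filling gaps with NaN
outer = pd.merge(left, right, on="Key", how="outer")
print(len(outer))  # 3
```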

    .concat() can also be used to append rows to a DataFrame:

    In [179]:
    pd.concat(
        [
            df_demo, 
            pd.DataFrame({"A": 1.3, "B": pd.Timestamp("2018-02-27"), "C": -0.777, "D": "has it?", "E": "Same", "F": 23}, index=[0])
        ], ignore_index=True
    )
    Out [179]:
         A          B         C        D     E        E2          F
    0  1.2 2018-02-26 -2.718282     This  Same  7.389056  -3.918282
    1  1.2 2018-02-26  1.718282   column  Same  2.952492   0.518282
    2  1.2 2018-02-26 -1.304068      has  Same  1.700594  -2.504068
    3  1.2 2018-02-26  0.986231  entries  Same  0.972652  -0.213769
    4  1.2 2018-02-26 -0.718282  entries  Same  0.515929  -1.918282
    5  1.3 2018-02-27 -0.777000  has it?  Same       NaN  23.000000

    Task 3

    TASK

    • Add a column to the Nest data frame from Task 2 called Threads which is the total number of threads across all nodes (i.e. the product of threads per task, tasks per node, and nodes)
    • Tell me when you're done with status icon in BigBlueButton: 👍
    In [180]:
    df["Threads"] = df["Nodes"] * df["Tasks/Node"] * df["Threads/Task"]
    df.head()
    Out [180]:
       id  Nodes  Tasks/Node  Threads/Task  Runtime Program / s  ...  Min. Delay  Max. Delay  Threads
    0   5      1           2             4               420.42  ...         1.5         1.5        8
    1   5      1           4             4               200.84  ...         1.5         1.5       16
    2   5      1           2             8               202.15  ...         1.5         1.5       16
    3   5      1           4             8                89.57  ...         1.5         1.5       32
    4   5      2           2             4               164.16  ...         1.5         1.5       16

    [5 rows x 22 columns]
    In [181]:
    df.columns
    Out [181]:
    Index(['id', 'Nodes', 'Tasks/Node', 'Threads/Task', 'Runtime Program / s',
           'Scale', 'Plastic', 'Avg. Neuron Build Time / s',
           'Min. Edge Build Time / s', 'Max. Edge Build Time / s',
           'Min. Init. Time / s', 'Max. Init. Time / s', 'Presim. Time / s',
           'Sim. Time / s', 'Virt. Memory (Sum) / kB', 'Local Spike Counter (Sum)',
           'Average Rate (Sum)', 'Number of Neurons', 'Number of Connections',
           'Min. Delay', 'Max. Delay', 'Threads'],
          dtype='object')
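An equivalent way to compute the new column (a sketch, not the slide's solution): take the row-wise product of the three columns with `.prod(axis=1)`:

```python
import pandas as pd

df = pd.DataFrame({"Nodes": [1, 2], "Tasks/Node": [2, 4], "Threads/Task": [4, 4]})

# Row-wise product over the three columns gives total threads per row
df["Threads"] = df[["Nodes", "Tasks/Node", "Threads/Task"]].prod(axis=1)
print(df["Threads"].tolist())  # [8, 32]
```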

    Aside: Plotting without Pandas

    Matplotlib 101

    • Matplotlib: de-facto standard for plotting in Python
    • Main interface: pyplot; provides MATLAB-like interface
    • Better: Use object-oriented API with Figure and Axis
    • Great integration into Jupyter Notebooks
    • Since v. 3: Only support for Python 3
    • https://matplotlib.org/
    In [182]:
    import matplotlib.pyplot as plt
    %matplotlib inline
    In [183]:
    x = np.linspace(0, 2*np.pi, 400)
    y = np.sin(x**2)
    In [184]:
    fig, ax = plt.subplots()
    ax.plot(x, y)
    ax.set_title('Use like this')
    ax.set_xlabel("Numbers")
    ax.set_ylabel(r"$\sqrt{x}$")  # raw string avoids an invalid-escape SyntaxWarning
    out [184]:
    • Plot multiple lines into one canvas
    • Call ax.plot() multiple times
    In [185]:
    y2 = y/np.exp(y*1.5)
    In [186]:
    fig, ax = plt.subplots()
    ax.plot(x, y, label="y")
    ax.plot(x, y2, label="y2")
    ax.legend()
    ax.set_title("This plot makes no sense");
    out [186]:
    • Matplotlib can also plot DataFrame data
    • Because DataFrame data is only array-like data with stuff on top
    In [187]:
    fig, ax = plt.subplots()
    ax.plot(df_demo.index, df_demo["C"], label="C")
    ax.legend()
    ax.set_title("Nope, no sense at all");
    out [187]:

    Task 4

    TASK

    • Sort the Nest data frame by threads
    • Plot "Presim. Time / s" and "Sim. Time / s" of our data frame df as a function of threads
    • Use a dashed, red line for "Presim. Time / s", a blue line for "Sim. Time / s" (see API description)
    • Don't forget to label your axes and to add a legend (1st rule of plotting)
    • Tell me when you're done with status icon in BigBlueButton: 👍
    In [188]:
    df.sort_values(["Threads", "Nodes", "Tasks/Node", "Threads/Task"], inplace=True)  # multi-level sort
    In [189]:
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.plot(df["Threads"], df["Presim. Time / s"], linestyle="dashed", color="red", label="Presim. Time / s")
    ax.plot(df["Threads"], df["Sim. Time / s"], "-b", label="Sim. Time / s")
    ax.set_xlabel("Threads")
    ax.set_ylabel("Time / s")
    ax.legend(loc='best');
    out [189]:

    Plotting with Pandas

    • Each data frame has a .plot() function (see API)
    • Plots with Matplotlib
    • Important API options:
      • kind: 'line' (default), 'bar[h]', 'hist', 'box', 'kde', 'scatter', 'hexbin'
      • subplots: Make a sub-plot for each column (good together with sharex, sharey)
      • figsize
      • grid: Add a grid to plot (use Matplotlib options)
      • style: Line style per column (accepts list or dict)
      • logx, logy, loglog: Logarithmic plots
      • xticks, yticks: Use values for ticks
      • xlim, ylim: Limits of axes
      • yerr, xerr: Add uncertainty to data points
      • stacked: Stack a bar plot
      • secondary_y: Use a secondary y axis for this plot
      • Labeling
        • title: Add title to plot (Use a list of strings if subplots=True)
        • legend: Add a legend
        • table: If true, add table of data under plot
      • **kwds: Non-parsed keywords passed on to Matplotlib's plotting methods
    • Either slice and plot…
    In [190]:
    df_demo["C"].plot(figsize=(10, 2));
    Out [190]:
    • … or plot and select
    In [191]:
    df_demo.plot(y="C", figsize=(10, 2));
    Out [191]:
    • I prefer slicing first:
      → Allows for further operations on the sliced data frame
    In [192]:
    df_demo["C"].plot(kind="bar");
    Out [192]:
    • There are pseudo-sub-functions for each of the plot kinds
    • I prefer to just call .plot(kind="smthng")
    In [193]:
    df_demo["C"].plot.bar();
    Out [193]:
    In [194]:
    df_demo["C"].plot(kind="bar", legend=True, figsize=(12, 4), ylim=(-1, 3), title="This is a C plot");
    Out [194]:

    Task 5

    TASK

    Use the Nest data frame df to:

    1. Make threads index of the data frame (.set_index())
    2. Plot "Presim. Time / s" and "Sim. Time / s" individually
    3. Plot them onto one common canvas!
    4. Make them have the same line colors and styles as before
    5. Add a legend, add missing axes labels
    6. Tell me when you're done with status icon in BigBlueButton: 👍
    In [195]:
    df.set_index("Threads", inplace=True)
    In [196]:
    df["Presim. Time / s"].plot(figsize=(10, 3), style="--", color="red");
    Out [196]:
    In [197]:
    df["Sim. Time / s"].plot(figsize=(10, 3), style="-b");
    Out [197]:
    In [198]:
    df["Presim. Time / s"].plot(style="--r", figsize=(10,3));
    df["Sim. Time / s"].plot(style="-b", figsize=(10,3));
    Out [198]:
    In [199]:
    ax = df[["Presim. Time / s", "Sim. Time / s"]].plot(style=["--r", "-b"], figsize=(10,3));
    ax.set_ylabel("Time / s");
    Out [199]:

    More Plotting with Pandas

    Recap: Our first proper Pandas plot

    In [200]:
    df[["Presim. Time / s", "Sim. Time / s"]].plot(figsize=(10,3));
    Out [200]:
    • That's why I think Pandas is great!
    • It has great defaults to quickly plot data; basically publication-grade already
    • Plotting functionality is very versatile
    • Before plotting, data can be massaged within data frames, if needed

    More Plotting with Pandas

    Some versatility

    In [201]:
    df_demo[["A", "C", "F"]].plot(kind="bar", stacked=True, figsize=(10,3));
    Out [201]:
    In [202]:
    df_demo[df_demo["F"] < 0][["A", "C", "F"]].plot(kind="bar", stacked=True, figsize=(10,3));
    Out [202]:
    In [203]:
    df_demo[df_demo["F"] < 0][["A", "C", "F"]]\
        .plot(kind="barh", subplots=True, sharex=True, title="Subplots Demo", figsize=(10, 4));
    Out [203]:
    In [204]:
    df_demo.loc[df_demo["F"] < 0, ["A", "F"]]\
        .plot(
            style=["-*r", "--ob"], 
            secondary_y="A", 
            figsize=(12, 6),
            table=True
        );
    Out [204]:
    In [205]:
    df_demo.loc[df_demo["F"] < 0, ["A", "F"]]\
        .plot(
            style=["-*r", "--ob"], 
            secondary_y="A", 
            figsize=(12, 6),
            yerr={
                "A": abs(df_demo[df_demo["F"] < 0]["C"]), 
                "F": 0.2
            }, 
            capsize=4,
            title="Bug: style is ignored with yerr",
            marker="P"
        );  
    Out [205]:

    Combine Pandas with Matplotlib

    • Pandas shortcuts very handy
    • But sometimes, one needs to access underlying Matplotlib functionality
    • No problemo!
    • Option 1: Pandas always returns axis
      • Use this to manipulate the canvas
      • Get underlying figure with ax.get_figure() (for fig.savefig())
    • Option 2: Create figure and axes with Matplotlib, use when drawing
      • .plot(): Use ax option

    Option 1: Pandas Returns Axis

    In [206]:
    ax = df_demo["C"].plot(figsize=(10, 4))
    ax.set_title("Hello There!");
    fig = ax.get_figure()
    fig.suptitle("This title is super (literally)!");
    Out [206]:

    Option 2: Draw on Matplotlib Axes

    In [207]:
    fig, ax = plt.subplots(figsize=(10, 4))
    df_demo["C"].plot(ax=ax)
    ax.set_title("Hello There!");
    fig.suptitle("This title is super (still, literally)!");
    Out [207]:
    • We can also get fancy!
    In [208]:
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 4))
    for ax, column, color in zip([ax1, ax2], ["C", "F"], ["blue", "#b2e123"]):
        df_demo[column].plot(ax=ax, legend=True, color=color)
    Out [208]:

    Aside: Seaborn

    • Python package on top of Matplotlib
    • Powerful API shortcuts for plotting of statistical data
    • Manipulate color palettes
    • Works well together with Pandas
    • Also: new, good-looking defaults for Matplotlib (IMHO)
    • https://seaborn.pydata.org/
    In [209]:
    import seaborn as sns
    sns.set_theme()  # set defaults
    In [210]:
    df_demo[["A", "C"]].plot(figsize=(10,3));
    Out [210]:

    Seaborn Color Palette Example

    In [211]:
    sns.palplot(sns.color_palette())
    Out [211]:
    In [212]:
    sns.palplot(sns.color_palette("hls", 10))
    Out [212]:
    In [213]:
    sns.palplot(sns.color_palette("hsv", 20))
    Out [213]:
    In [214]:
    sns.palplot(sns.color_palette("Paired", 10))
    Out [214]:
    In [215]:
    sns.palplot(sns.color_palette("cubehelix", 8))
    Out [215]:
    In [216]:
    sns.palplot(sns.color_palette("colorblind", 10))
    Out [216]:

    Seaborn Plot Examples

    • Most of the time, I use a regression plot from Seaborn
    In [217]:
    with sns.color_palette("hls", 2):
        sns.regplot(x="C", y="F", data=df_demo);
        sns.regplot(x="C", y="E2", data=df_demo);
    Out [217]:
    • A joint plot combines two plots relating to distribution of values into one
    • Very handy for showing a fuller picture of two-dimensionally scattered variables
    In [218]:
    x, y = np.random.multivariate_normal([0, 0], [[1, -.5], [-.5, 1]], size=300).T
    In [219]:
    sns.jointplot(x=x, y=y, kind="reg");
    Out [219]:

    Task 6

    TASK

    • To your df Nest data frame, add a column with the unaccounted time (Unaccounted Time / s): the program runtime minus average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time.
      (I know this is technically not super correct, but it will do for our example.)
    • Plot a stacked bar plot of all these columns (except for program runtime) over the threads
    • Tell me when you're done with status icon in BigBlueButton: 👍
    In [220]:
    cols = [
        'Avg. Neuron Build Time / s', 
        'Min. Edge Build Time / s', 
        'Min. Init. Time / s', 
        'Presim. Time / s', 
        'Sim. Time / s'
    ]
    df["Unaccounted Time / s"] = df['Runtime Program / s']
    for entry in cols:
        df["Unaccounted Time / s"] = df["Unaccounted Time / s"] - df[entry]
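    The subtraction loop can also be written as one vectorized operation. A minimal sketch with a toy frame (the column names and numbers here are invented, not the Nest data):

```python
import pandas as pd

# Toy frame standing in for the Nest data (hypothetical numbers)
toy = pd.DataFrame({
    "Runtime / s": [100.0, 50.0],
    "Build / s": [10.0, 5.0],
    "Sim / s": [80.0, 40.0],
})

# Sum the accounted-for phases row-wise, then subtract once
toy["Unaccounted / s"] = toy["Runtime / s"] - toy[["Build / s", "Sim / s"]].sum(axis=1)
print(toy["Unaccounted / s"].tolist())  # [10.0, 5.0]
```

    `.sum(axis=1)` adds up the selected columns per row, so no explicit Python loop over columns is needed.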
    In [221]:
    df[["Runtime Program / s", "Unaccounted Time / s", *cols]].head(2)
    Out [221]:
    Threads | Runtime Program / s | Unaccounted Time / s | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Min. Init. Time / s | Presim. Time / s | Sim. Time / s
    8 | 420.42 | 2.09 | 0.29 | 88.12 | 1.14 | 17.26 | 311.52
    16 | 202.15 | 2.43 | 0.28 | 47.98 | 0.70 | 7.95 | 142.81
    In [222]:
    df[["Unaccounted Time / s", *cols]].plot(kind="bar", stacked=True, figsize=(12, 4));
    Out [222]:
    • Make it relative to the total program run time
    • Slight complication: our thread counts are not unique as an index; we need to find new, unique indexes
    • Could be anything, but we use a multi-index!
    In [223]:
    df_multind = df.set_index(["Nodes", "Tasks/Node", "Threads/Task"])
    df_multind.head()
    Out [223]:
    Nodes | Tasks/Node | Threads/Task | id | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | Min. Init. Time / s | Max. Init. Time / s | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s
    1 | 2 | 4 | 5 | 420.42 | 10 | True | 0.29 | 88.12 | 88.18 | 1.14 | 1.20 | 17.26 | 311.52 | 46560664.0 | 825499 | 7.48 | 112500 | 1265738500 | 1.5 | 1.5 | 2.09
    1 | 2 | 8 | 5 | 202.15 | 10 | True | 0.28 | 47.98 | 48.48 | 0.70 | 1.20 | 7.95 | 142.81 | 47699384.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 2.43
    1 | 4 | 4 | 5 | 200.84 | 10 | True | 0.15 | 46.03 | 46.34 | 0.70 | 1.01 | 7.87 | 142.97 | 46903088.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 3.12
    2 | 2 | 4 | 5 | 164.16 | 10 | True | 0.20 | 40.03 | 41.09 | 0.52 | 1.58 | 6.08 | 114.88 | 46937216.0 | 802865 | 7.03 | 112500 | 1265738500 | 1.5 | 1.5 | 2.45
    1 | 2 | 12 | 6 | 141.70 | 10 | True | 0.30 | 32.93 | 33.26 | 0.62 | 0.95 | 5.41 | 100.16 | 50148824.0 | 813743 | 7.27 | 112500 | 1265738500 | 1.5 | 1.5 | 2.28
    In [224]:
    df_multind[["Unaccounted Time / s", *cols]]\
        .divide(df_multind["Runtime Program / s"], axis="index")\
        .plot(kind="bar", stacked=True, figsize=(14, 6), title="Relative Time Distribution");
    Out [224]:
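    The row-wise normalization used above relies on .divide(..., axis="index"), which divides every column by a Series aligned on the row index. A minimal, self-contained sketch (toy numbers, not the Nest data):

```python
import pandas as pd

parts = pd.DataFrame({"a": [2.0, 3.0], "b": [6.0, 1.0]})
total = pd.Series([4.0, 2.0])  # e.g. a total runtime per row

# Divide each row of `parts` by the corresponding entry of `total`
rel = parts.divide(total, axis="index")
print(rel)
#      a    b
# 0  0.5  1.5
# 1  1.5  0.5
```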

    Next Level: Hierarchical Data

    • MultiIndex only a first level
    • More powerful:
      • Grouping: .groupby() ("Split-apply-combine", API, User Guide)
      • Pivoting: .pivot_table() (API, User Guide); also .pivot() (specialized version of .pivot_table(), API)

    Grouping

    • Group a frame by common values of column(s)
    • Use operations on this group
    • A grouped frame is not itself a new data frame; a frame emerges only once an operation is applied to the groups
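    A self-contained sketch of the split-apply-combine pattern (toy data, not the Nest frame):

```python
import pandas as pd

runs = pd.DataFrame({
    "Nodes": [1, 1, 2, 2],
    "Time / s": [10.0, 12.0, 6.0, 8.0],
})

# Split by "Nodes", apply mean() per group, combine into a new Series
means = runs.groupby("Nodes")["Time / s"].mean()
print(means.to_dict())  # {1: 11.0, 2: 7.0}
```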
    In [225]:
    df.groupby("Nodes").groups
    Out [225]:
    {1: [8, 16, 16, 24, 32, 48], 2: [16, 32, 32, 48, 64, 96], 3: [24, 48, 48, 72, 96, 144], 4: [32, 64, 64, 96, 128, 192], 5: [40, 80, 80, 120, 160, 240], 6: [48, 96, 96, 144, 192, 288]}
    In [226]:
    df.groupby("Nodes").get_group(4).head(3)
    Out [226]:
    Threads | id | Nodes | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | ... | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s
    32 | 5 | 4 | 2 | 4 | 66.58 | 10 | True | 0.13 | 18.86 | 19.65 | ... | 2.35 | 43.38 | 47361344.0 | 821491 | 7.23 | 112500 | 1265738500 | 1.5 | 1.5 | 1.70
    64 | 5 | 4 | 2 | 8 | 34.09 | 10 | True | 0.14 | 10.60 | 10.83 | ... | 1.25 | 20.96 | 47074752.0 | 818198 | 7.33 | 112500 | 1265738500 | 1.5 | 1.5 | 1.03
    64 | 5 | 4 | 4 | 4 | 32.49 | 10 | True | 0.09 | 9.98 | 10.31 | ... | 1.12 | 20.12 | 48081056.0 | 818198 | 7.33 | 112500 | 1265738500 | 1.5 | 1.5 | 1.09
    In [227]:
    df.groupby("Nodes").mean()
    Out [227]:
    Nodes | id | Tasks/Node | Threads/Task | Runtime Program / s | Scale | Plastic | Avg. Neuron Build Time / s | Min. Edge Build Time / s | Max. Edge Build Time / s | Min. Init. Time / s | ... | Presim. Time / s | Sim. Time / s | Virt. Memory (Sum) / kB | Local Spike Counter (Sum) | Average Rate (Sum) | Number of Neurons | Number of Connections | Min. Delay | Max. Delay | Unaccounted Time / s
    1 | 5.333333 | 3.0 | 8.0 | 185.023333 | 10.0 | 1.0 | 0.220000 | 42.040000 | 42.838333 | 0.583333 | ... | 7.226667 | 132.061667 | 4.806585e+07 | 816298.000000 | 7.215000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 2.891667
    2 | 5.333333 | 3.0 | 8.0 | 73.601667 | 10.0 | 1.0 | 0.168333 | 19.628333 | 20.313333 | 0.191667 | ... | 2.725000 | 48.901667 | 4.975288e+07 | 818151.000000 | 7.210000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.986667
    3 | 5.333333 | 3.0 | 8.0 | 43.990000 | 10.0 | 1.0 | 0.138333 | 12.810000 | 13.305000 | 0.135000 | ... | 1.426667 | 27.735000 | 5.511165e+07 | 820465.666667 | 7.253333 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.745000
    4 | 5.333333 | 3.0 | 8.0 | 31.225000 | 10.0 | 1.0 | 0.116667 | 9.325000 | 9.740000 | 0.088333 | ... | 1.066667 | 19.353333 | 5.325783e+07 | 819558.166667 | 7.288333 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.275000
    5 | 5.333333 | 3.0 | 8.0 | 24.896667 | 10.0 | 1.0 | 0.140000 | 7.468333 | 7.790000 | 0.070000 | ... | 0.771667 | 14.950000 | 6.075634e+07 | 815307.666667 | 7.225000 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 1.496667
    6 | 5.333333 | 3.0 | 8.0 | 20.215000 | 10.0 | 1.0 | 0.106667 | 6.165000 | 6.406667 | 0.051667 | ... | 0.630000 | 12.271667 | 6.060652e+07 | 815456.333333 | 7.201667 | 112500.0 | 1.265738e+09 | 1.5 | 1.5 | 0.990000

    Pivoting

    • Combine categorically-similar columns
    • Creates hierarchical index
    • Respected during plotting with Pandas!
    • A pivot table has three layers; if confused, think about the related questions
      • index: »What's on the x axis?«
      • values: »What value do I want to plot [on the y axis]?«
      • columns: »What categories do I want [to be in the legend]?«
    • All can be populated from base data frame
    • Might be aggregated, if needed
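    The three questions map directly onto the three arguments of .pivot_table(). A minimal sketch with invented toy data:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "x":   [1, 1, 2, 2],              # index:   what's on the x axis?
    "cat": ["a", "b", "a", "b"],      # columns: what categories go in the legend?
    "val": [10.0, 20.0, 30.0, 40.0],  # values:  what do I want to plot?
})

pt = df_toy.pivot_table(index="x", columns="cat", values="val")
print(pt)
# cat     a     b
# x
# 1    10.0  20.0
# 2    30.0  40.0
```

    If several rows share the same (index, columns) combination, pivot_table aggregates them (mean by default); here every combination is unique, so the values pass through unchanged.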
    In [228]:
    df_demo["H"] = [(-1)**n for n in range(5)]
    In [229]:
    df_pivot = df_demo.pivot_table(
        index="F",
        values="E2",
        columns="H"
    )
    df_pivot
    Out [229]:
    F | H = -1 | H = 1
    -3.918282 | NaN | 7.389056
    -2.504068 | NaN | 1.700594
    -1.918282 | NaN | 0.515929
    -0.213769 | 0.972652 | NaN
    0.518282 | 2.952492 | NaN
    In [230]:
    df_pivot.plot(figsize=(10,3));
    Out [230]:

    Task 7

    TASK

    • Create a pivot table based on the Nest df data frame
    • Let the x axis show the number of nodes; display the values of the simulation time "Sim. Time / s" for the tasks per node and threads per task configurations
    • Please plot a bar plot
    • Tell me when you're done with status icon in BigBlueButton: 👍
    In [231]:
    df.pivot_table(
        index="Nodes",
        columns=["Tasks/Node", "Threads/Task"],
        values="Sim. Time / s",
    ).plot(kind="bar", figsize=(12, 4));
    Out [231]:

    Task 7B (like Bonus)

    TASK

    • Same pivot table as before (that is, x with nodes, and columns for Tasks/Node and Threads/Task)
    • But now, use Sim. Time / s and Presim. Time / s as values to show
    • Show them as a stack of those two values inside the pivot table
    • Use Panda's functionality as much as possible!

    Pandas 2

    • Pandas 2.0 was released in April 2023

    • Only limited deprecations (i.e. an upgrade is probably safe)

    • Key new feature: Apache Arrow support (via PyArrow)

    • Fine-grained installation options: python3 -m pip install 'pandas[performance, excel]'

    • However: currently [10/2024] there is a dependency mismatch in the default version

    In [232]:
    data_db = 'db-bahnhoefe.csv'  # source: https://web.archive.org/web/20231208211825/https://download-data.deutschebahn.com/static/datasets/stationsdaten/DBSuS-Uebersicht_Bahnhoefe-Stand2020-03.csv
    In [233]:
    %timeit pd.read_csv(data_db, sep=';')
    Out [233]:
    10 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [234]:
    import pyarrow
    print(pyarrow.__version__)
    Out [234]:
    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    Cell In[234], line 1
    ----> 1 import pyarrow
          2 print(pyarrow.__version__)
    
    ModuleNotFoundError: No module named 'pyarrow'
    
    pd.read_csv(data_db, sep=';', engine='pyarrow', dtype_backend='pyarrow')

    Polars

    import polars as ps
    %timeit ps.read_csv(data_db, separator=';')

    Large Data & Mangling

    • Pandas can read data directly in tar form
    • Pandas can read data directly from online resource
    • Let's combine that to an advanced task!
    • It works also with the PyArrow backend (remember to download the online resource when testing; there is no cache!)
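    A minimal, offline sketch of the compressed-reading part: Pandas reads gzip-compressed CSV/TSV transparently, and the same read_csv/read_table calls accept URLs directly. The in-memory buffer below stands in for a downloaded .gz file; the URL in the comment is hypothetical:

```python
import gzip
import io

import pandas as pd

# Build a small gzip-compressed CSV in memory ...
buf = io.BytesIO()
with gzip.open(buf, "wt") as f:
    f.write("a,b\n1,2\n3,4\n")
buf.seek(0)

# ... and read it back; for file paths, compression is inferred from the suffix
df_gz = pd.read_csv(buf, compression="gzip")
print(df_gz.shape)  # (2, 2)

# The same works with a remote resource, e.g. (hypothetical URL):
# pd.read_table("https://example.org/data.tsv.gz", sep="\t")
```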

    Task 8 (Super Bonus)

    TASK

    • Create bar chart of top 10 actors (on x) and average ratings of their top movies (y) based on IMDb data (only if they play in at least two movies)
    pd.read_table('https://datasets.imdbws.com/dataset.tsv.gz', sep="\t", low_memory=False, na_values=["\\N","nan"])
    
    • Needed:
      • name.basics.tsv.gz (for names of actors and movies they are known for)
      • title.ratings.tsv.gz (for ratings of titles)
    • Strategy suggestions:
      • Use df.apply() with custom function
      • Custom function: Compute average rating and determine if this entry is eligible for plotting (this can be done at once, but does not need to be)
      • Average rating: Look up title IDs as listed in knownForTitles in titles dataframe
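    The result_type='expand' mechanic suggested above, shown in isolation: a function that returns a dict per row becomes one new column per key. The frame and helper here are invented toy stand-ins, not the IMDb data:

```python
import pandas as pd

# Toy stand-in: comma-separated ratings per person
people = pd.DataFrame({"ratings": ["8.0,6.0", "7.0"]})

def summarize(row):
    vals = [float(v) for v in row["ratings"].split(",")]
    # Returning a dict + result_type='expand' yields one column per key
    return {"avg": sum(vals) / len(vals), "n": len(vals)}

extra = people.apply(summarize, axis=1, result_type="expand")
combined = pd.concat([people, extra], axis=1)
print(combined["avg"].tolist())  # [7.0, 7.0]
```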
    df_names = pd.read_table('imdb-data/name.basics.tsv.gz', sep="\t", low_memory=False, na_values=["\\N","nan"])
    df_ratings = pd.read_table('https://datasets.imdbws.com/title.ratings.tsv.gz', sep="\t", low_memory=False, na_values=["\\N","nan"])
                               
    df_names_i = df_names.set_index('nconst')
    df_ratings_i = df_ratings.set_index('tconst')
                               
    df_names_i = pd.concat(
        [
            df_names_i, 
            df_names_i.apply(lambda line: valid_and_avg_rating(line), axis=1, result_type='expand')
        ]
        , axis=1
    )
    df_names_i[df_names_i['toPlot'] == True]\
        .sort_values('avgRating', ascending=False)\
        .iloc[0:10]\
        .reset_index()\
        .set_index('primaryName')['avgRating']\
        .plot(kind='bar')
    
    def valid_and_avg_rating(row):
        rating = 0
        ntitles = 0
        _titles = row['knownForTitles']
        _professions = row['primaryProfession']
        if not isinstance(_titles, str):
            _titles = str(_titles)
        if not isinstance(_professions, str):
            _professions = str(_professions)
        titles = _titles.split(',')
        professions = _professions.split(',')
        for title in titles:
            if title in df_ratings_i.index:
                rating += df_ratings_i.loc[title]['averageRating']
                ntitles += 1
        if ntitles > 0:
            plot = False
            if ntitles >= 2:  # "at least two movies", as the task requires
                if 'actor' in professions:
                    plot = True
            return {'toPlot': plot, 'avgRating': rating / ntitles}
        else:
            return {'toPlot': False, 'avgRating': pd.NA}
    

    Task 8B (Bonuseption)

    TASK

    All of the following are ideas for unique sub-tasks, which can be done individually

    • In addition to Task 8, restrict the top titles to those with more than 10000 votes
    • For 30 top-rated actors, plot rating vs. age
    • For 30 top-rated actors, plot rating vs. average runtime of the known-for-titles (using title.basics.tsv.gz)

    Random Features Not Shown

    These are all links:

    Conclusion

    • Pandas works with and on data frames, which are central
    • Slice frames to your likings
    • Plot frames
      • Together with Matplotlib, Seaborn, others
    • Pivot tables are next level greatness
    • Remember: Pandas as early as possible!
    • Thanks for being here! 😍

    Next slide: Further reading