large-data-lecture

    Large data lecture

    Synopsis

    Big data is already a thing of the past again [1], but that does not mean we should forget what we learned about managing big data, or, as we will call it for now, large data. Data transmission, storage and management are increasingly seen as bottlenecks or even roadblocks for simulation science and AI. At the same time, ever larger experimental facilities and satellites produce more data than we can handle. Self-driving vehicles and autonomous robot systems are further examples.

    This lecture is an introductory course on large data handling. This repository was set up specifically for a 15-minute introduction to "Design patterns for large-scale data handling" (held in German at the University of Cologne on Nov 16th, 2022). The plan is to develop this into a complete course or a course segment in the coming year.

    Training goals

    1. understand the key issues of modern large data handling
    2. understand and adopt the FAIR principles of open data and the challenges of applying these principles to large data
    3. get to know important concepts and technologies for working with large data
    4. learn how to design efficient large data analysis workflows
    5. get to know some advanced tools for large data management

    In the current version, this course focuses on objective 3.

    Pre-requisites

    Although the current sample lecture is in German, I would prefer to develop this course in English. Hence, you will need to be able to follow technical and scientific content in English.

    While there are relatively few specific code examples, it helps to have at least some basic knowledge of Python. Reasonable fluency in Python is mandatory if you also plan to participate in the exercises.

    Finally, this course is provided as a set of Jupyter notebooks. To make full use of it, you should have an account on a system that allows you to open, edit and run these notebooks.

    Author

    Martin G. Schultz (m.schultz@fz-juelich.de)
    Jülich Supercomputing Centre
    Forschungszentrum Jülich
    52425 Jülich
    Germany

    References

    [1] Joe Reis, Matt Housley (2022): Fundamentals of Data Engineering. O'Reilly. ISBN: 9781098108304