
# JSC HPC Support Corner

:::info
**Topic** (short presentation at the beginning): Project quota overview (https://www.fz-juelich.de/de/jsc/services/unterstuetzung-der-nutzer-innen/jsc-hpc-support-corner/jsc_hsc_2025_03_24-compite_time_usage_and_allocation_periods___zjupa)
**When**: 24.03.2025, 10-11am
**Where**: Zoom, https://go.fzj.de/jsc_hpc_support_corner
:::

## What is "JSC HPC Support Corner"?

JSC HPC Support Corner is a monthly interactive support session that provides live assistance on HPC-related topics. The aim is to offer a structured yet flexible space where users can get help with troubleshooting, software installation, debugging and more, while engaging directly with HPC experts.

## Session structure

* Each session begins with **a brief talk on a broad HPC-related topic**.
* The presentation is followed by an interactive **Q&A session** where participants can ask questions related to the topic of the presentation.
* Afterwards, the Q&A session is open to any other HPC-related topic participants would like to discuss. Participants can also share their screens for live discussion and debugging.

## Presentation topic

**Project quota overview**: Most JSC compute projects end at the end of April, so many JSC users will want to use their remaining compute time before then. In our short presentation, we will cover how to estimate your current quota and explain what happens after a project expires.

## Feel free to ask your questions in advance

:::info
To edit this document you need to:
- click on the **pencil** icon in the top left corner
- log in, if necessary, with your **[JuDoor](https://judoor.fz-juelich.de/login) credentials**
:::

:::danger
We consider questions here as **anonymous** by default. If you would like us to contact you privately, please provide your contact details (e.g., name and email), and we will reach out to you.
:::

1. Debugging: Illegal memory access in GPUs for problems beyond a certain size.

:::success
You can use NVIDIA compute-sanitizer memcheck (https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html#memcheck-tool) to analyze your GPU code for memory errors. If you need further debugging capabilities, you can try TotalView or DDT on the JSC systems or CUDA-GDB on your workstation. Contact the ATML Accelerated Devices (https://www.fz-juelich.de/en/ias/jsc/about-us/structure/atml/atml-x-dev) for further support.
:::

2. Quota: Inode shortage and how to deal with it. (+ it seems like storing files in SCRATCH and symbolic-linking them to PROJECT counts them into the PROJECT quota :eyes: how does that work?)

:::success
**What is an inode?**

An inode (index node) is a data structure used in Unix file systems to store metadata about files and directories. It serves as a unique identifier for each file/directory on the file system.

**Key points about inodes**

* Each file has an associated inode that stores its metadata, such as file type, permissions, ownership, timestamps, and pointers to the file’s data blocks on disk. Each directory also has at least one associated inode that contains similar information.
* Inodes are identified by a number that is unique within a file system. The operating system uses this inode number to access the file’s metadata and data blocks.
* Many file systems have a fixed number of inodes determined at creation, which limits the maximum number of files they can hold. Running out of inodes can cause issues even if disk space is still available. This is one of several reasons why large numbers of small files or complex directory trees are not ideal.
* More inodes in a system can mean a longer time to look up files.
* A soft (symbolic) link is a way of pointing to another file/directory. The link is itself a small file with its own inode, separate from the inode of the file it points to, so each link consumes one inode on the file system where the link is stored.
* In general, large numbers of small files lead to performance issues.

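
The points about soft links and inode counts can be checked with a small Python sketch. It uses only the standard library and shows general POSIX behavior; it makes no claim about how quota accounting is implemented on the JSC file systems. The directory names `scratch` and `project` are hypothetical stand-ins created inside a temporary directory, and `count_inodes` and `demo` are illustrative helper names, not JSC tools.

```python
import os
import tempfile


def count_inodes(root):
    """Count distinct inodes under `root`, including `root` itself.

    Symbolic links are not followed: a link is counted as its own inode.
    Hard links to the same file are counted once.
    """
    st = os.lstat(root)
    seen = {(st.st_dev, st.st_ino)}
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            st = os.lstat(os.path.join(dirpath, name))  # lstat: do not follow links
            seen.add((st.st_dev, st.st_ino))
    return len(seen)


def demo():
    """Create a file in 'scratch', link it from 'project', and compare inodes."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = os.path.join(tmp, "scratch")  # hypothetical stand-in for SCRATCH
        project = os.path.join(tmp, "project")  # hypothetical stand-in for PROJECT
        os.mkdir(scratch)
        os.mkdir(project)

        target = os.path.join(scratch, "data.bin")
        with open(target, "wb") as f:
            f.write(b"x" * 1024)

        link = os.path.join(project, "data.bin")
        os.symlink(target, link)

        return {
            "link_inode": os.lstat(link).st_ino,    # inode of the link itself
            "target_inode": os.stat(link).st_ino,   # inode of the file it points to
            "project_entries": count_inodes(project),
            "scratch_entries": count_inodes(scratch),
        }


if __name__ == "__main__":
    for key, value in demo().items():
        print(f"{key}: {value}")
```

In this toy setup the link and its target report different inode numbers, and the `project` directory holds two inodes (the directory and the link) even though the data lives in `scratch`. Running `count_inodes` on your own directories gives a rough idea of where most inodes are used; on very large trees the walk itself can take a while.
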
**Potential strategies to deal with inode shortage**

* Remove/archive (e.g. tar/zip) unused files.
* Use containers for complex software installations.
:::

3. Why is training with a single H100 GPU so slow for fine-tuning large models? For example, using the same program, training on an A100 takes about one minute, but on the H100 it takes 10 minutes. Could this be due to a connectivity issue?

:::success
The issue was investigated in the breakout room, and there was a follow-up discussion afterwards.
:::

## Next session

- [23.04.2025](https://gitlab.jsc.fz-juelich.de/hedgedoc/y1Ed1uvmRJyd8FIbUAIEFA?)