SIONlib issueshttps://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues2021-08-25T15:20:29+02:00https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/172`_sion_free_filedesc_all_localranks` is dead code but should probably be call...2021-08-25T15:20:29+02:00Benedikt Steinbuschb.steinbusch@fz-juelich.de`_sion_free_filedesc_all_localranks` is dead code but should probably be called somewhereWhile working on #157 (dead code removal), I did not remove `_sion_free_filedesc_all_localranks`, because the corresponding allocation function is indeed called and the `all_localranks` field is used. Find the right spot to insert calls ...While working on #157 (dead code removal), I did not remove `_sion_free_filedesc_all_localranks`, because the corresponding allocation function is indeed called and the `all_localranks` field is used. Find the right spot to insert calls to `_sion_free_filedesc_all_localranks`.https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/171Public functions without tests2021-08-25T15:20:31+02:00Benedikt Steinbuschb.steinbusch@fz-juelich.dePublic functions without testsThese symbols were not removed while working on #157 (dead code removal), because they appear in public headers and are not prefix by `_`.
```
sion_free_io_info
sion_set_fp_closed
sion_get_dfile
sion_isdebug
sion_debug_off
sion_generic...These symbols were not removed while working on #157 (dead code removal), because they appear in public headers and are not prefix by `_`.
```
sion_free_io_info
sion_set_fp_closed
sion_get_dfile
sion_isdebug
sion_debug_off
sion_generic_register_get_multi_filename_cb
```
They should probably be called from the test suite.https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/170_sion_errorprint(_, _SION_ERROR_RETURN, ...) without returning2019-12-02T16:03:43+01:00Benedikt Steinbuschb.steinbusch@fz-juelich.de_sion_errorprint(_, _SION_ERROR_RETURN, ...) without returningAs a reaction to #169, I dug out my knowledge of regexen and searched the code for additional instances of `_sion_errorprint(_, _SION_ERROR_RETURN, ...)` that do not obviously return. Here is a list:
- [src/parlib/sion_generic_buddy.c@2...As a reaction to #169, I dug out my knowledge of regexen and searched the code for additional instances of `_sion_errorprint(_, _SION_ERROR_RETURN, ...)` that do not obviously return. Here is a list:
- [src/parlib/sion_generic_buddy.c@2173#L795](src/parlib/sion_generic_buddy.c@2173#L795)
- [src/parlib/sion_generic_buddy.c@2173#L897](src/parlib/sion_generic_buddy.c@2173#L897)
- [src/parlib/sion_generic_collective.c@2173#L188](src/parlib/sion_generic_collective.c@2173#L188)
- [src/parlib/sion_ompi_coll_cb_gen.c@2173#L158](src/parlib/sion_ompi_coll_cb_gen.c@2173#L158)
- [src/parlib/sion_ompi_coll_cb_gen.c@2173#L218](src/parlib/sion_ompi_coll_cb_gen.c@2173#L218)
- [src/parlib/sion_ompi_coll_cb_gen.c@2173#L364](src/parlib/sion_ompi_coll_cb_gen.c@2173#L364)
- [src/parlib/sion_ompi_coll_cb_gen.c@2173#L419](src/parlib/sion_ompi_coll_cb_gen.c@2173#L419)
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/169Collective writes larger than the chunk size deadlock2021-08-25T15:20:30+02:00Benedikt Steinbuschb.steinbusch@fz-juelich.deCollective writes larger than the chunk size deadlockReported by Michael Bareford of EPCC (m.bareford@epcc.ed.ac.uk):
> So each process makes two calls to a `write_vector` method that then calls `sion_coll_fwrite(rhs.data(), sizeof(T), rhs.size(), _sid)`, where `T` is either `unsigned int...Reported by Michael Bareford of EPCC (m.bareford@epcc.ed.ac.uk):
> So each process makes two calls to a `write_vector` method that then calls `sion_coll_fwrite(rhs.data(), sizeof(T), rhs.size(), _sid)`, where `T` is either `unsigned int` or `double`. At present, it seems that writing the string data and the element ids causes no problem, only when the element data (the vectors of `double`s) are also written do some processes fail to return from `sion_coll_fwrite`.
And later:
> Just to let you know, I have fixed my problem. It was caused by specifying an inadequate chunk size in the call to sion_paropen_mpi.
This probably can and should be detected and reported as an error.https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/186_sion_filedesc is probably a god object2019-12-02T16:03:43+01:00Benedikt Steinbuschb.steinbusch@fz-juelich.de_sion_filedesc is probably a god object... or rather, it would be one, if C was an object oriented language. It contains everything and at the same time encapsulates almost nothing. It represents all of these:
- a single physical file with a single logical file opened,
- a s...... or rather, it would be one, if C was an object oriented language. It contains everything and at the same time encapsulates almost nothing. It represents all of these:
- a single physical file with a single logical file opened,
- a single physical file with multiple (or all) of its logical files opened,
- multiple physical files with mulitple (or all) of their logical files opened.
To make matters worse, at any given time, only a single logical file in a single physical file out of these collections is in focus. When switching focus (which commonly happens when seeking, but also during open and close), the meta-data of the newly selected file (be it logical and/or physical) is copied to special fields of `_sion_filedesc` and the meta-data of the no longer in focus file is written back to the fields it was previously copied from. There are however no functions that do this. These copies are done ad-hoc, every time, see, e.g., [sion_seek](https://trac.version.fz-juelich.de/SIONlib/browser/trunk/src/lib/sion_internal_seek.c?rev=2345#L57) or [sion_generic_paropen_mapped](https://trac.version.fz-juelich.de/SIONlib/browser/trunk/src/parlib/sion_generic_mapped.c?rev=2345#L474).
Separating the different aspects of `_sion_filedesc` into multiple types and encapsulating copy operations (or not copying in the first place) would help cut down on redundant code and clarify responsibilities.https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/185Error handling by deadlock2019-12-02T16:03:43+01:00Benedikt Steinbuschb.steinbusch@fz-juelich.deError handling by deadlockWhen return codes of fallible functions are not ignored (see #181) they are often handled by immediately performing clean up (or not, see #184) and returning early. This is better than ignoring the errors, but has serious drawbacks in th...When return codes of fallible functions are not ignored (see #181) they are often handled by immediately performing clean up (or not, see #184) and returning early. This is better than ignoring the errors, but has serious drawbacks in the parallel parts of SIONlib. It is not necessarily the case that all errors will always manifest globally (i.e. on all processes). This leads to some (but not all) processes returning early from functions that later try to perform collective operations leading to a deadlock. Considering that in the context of HPC, users are often billed for compute time by the wall clock, this is a particularly poor error handling strategy.
Better strategies would be to make error handling collective (if any process encounters an error, all return early) or abort the program (harsh, but probably saves a lot of wasted compute time).https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/188sion_flush does not flush buffers2021-08-25T15:20:29+02:00Benedikt Steinbuschb.steinbusch@fz-juelich.desion_flush does not flush buffers`sion_flush()` does not flush the I/O buffers as one would expect from its name but rather updates the internal blocksize array which is being removed as part of SIONlib 2.0. Should `sion_flush()` be re-dedicated to flushing buffers or s...`sion_flush()` does not flush the I/O buffers as one would expect from its name but rather updates the internal blocksize array which is being removed as part of SIONlib 2.0. Should `sion_flush()` be re-dedicated to flushing buffers or should it be removed now that it is a no-op?https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/163Incorrect intent declarations in Fortran bindings2019-12-02T16:03:43+01:00Benedikt Steinbuschb.steinbusch@fz-juelich.deIncorrect intent declarations in Fortran bindingsAnke Kreuzer recently reported a segmentation fault originating in a call to `fsion_paropen_mpi`.
> Hallo Benedikt,
>
> also der Code rund um das FSION_PAROPEN_MPI sieht so aus aus:
>
> fsblksize = -1
> chunksize = INT(16+nt*(96+...Anke Kreuzer recently reported a segmentation fault originating in a call to `fsion_paropen_mpi`.
> Hallo Benedikt,
>
> also der Code rund um das FSION_PAROPEN_MPI sieht so aus aus:
>
> fsblksize = -1
> chunksize = INT(16+nt*(96+120*ndl),KIND=8)
>
> IF(myid==0) THEN
> WRITE(*,*) " Before create_Com"
> ENDIF
> color = crank/24
> key = modulo(crank,24);
> ! WRITE(*,*) "My id: ",myid , "color: ", color, "key: ", key
>
> CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, color, myid,
> & lcomm, ierr)
> IF(myid==0) THEN
> WRITE(*,*) " After create_Com, before SION_PAROPEN_MPI"
> ENDIF
>
> So, bis hier hin laeuft das Programm durch und dann kommt:
>
> CALL FSION_PAROPEN_MPI('NAM_CPs',
> & 'bw',-1,icomm0,lcomm,
> & chunksize, fsblksize, myid, newfname, sid)
>
> IF(myid==0) THEN
> WRITE(*,*) " After paropen"
> ENDIF
>
> Und diese letzte Ausgabe wird nie ausgegeben, da vorher der Segmentation
> fault passiert.
The problem is that SIONlib tries to return the number of files it actually opened (based on the number of distinct local communicators) in the `nfiles` argument. However, the special value `-1` is specified as an integer literal and as the dummy argument `nfiles` is declared with `intent(in)` is probably placed in a read-only part of memory.
A quick inspection reveals that the `intent` declarations on the Fortran subroutine and the argument const-ness on the C function that is called underneath differ in several places.
```
subroutine fsion_paropen_mpi(fname,file_mode,nfiles,fgcomm,flcomm,chunksizes,fsblksize,&
& globalrank,newfname,sid)
implicit none
character(len=*), intent(in) :: fname
character(len=*), intent(in) :: file_mode
integer, intent(in) :: nfiles
integer, intent(in) :: fgcomm
integer, intent(inout) :: flcomm
integer*8, intent(inout) :: chunksizes
integer*4, intent(inout) :: fsblksize
integer, intent(in) :: globalrank
character(len=*), intent(out) :: newfname
integer, intent(out) :: sid
call fsion_paropen_mpi_c(fname,file_mode,nfiles,fgcomm,flcomm,chunksizes,fsblksize,&
& globalrank,newfname,sid)
end subroutine fsion_paropen_mpi
```
```
void fsion_paropen_mpi_c(char *fname,
char *file_mode,
int *numFiles,
MPI_Fint * fgComm,
MPI_Fint * flComm,
sion_int64 *chunksize,
sion_int32 *fsblksize,
int *globalrank,
char *newfname,
int *sid,
int fname_len,
int file_mode_len,
int newfname_len);
```https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/184Audit open and close functions for memory leaks2019-12-02T16:03:43+01:00Benedikt Steinbuschb.steinbusch@fz-juelich.deAudit open and close functions for memory leaksAddress sanitizer indicates that some open and close functions leak memory on the error paths (e.g. when called with incorrect parameters, as some tests do). A more thorough audit should be performed.Address sanitizer indicates that some open and close functions leak memory on the error paths (e.g. when called with incorrect parameters, as some tests do). A more thorough audit should be performed.https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/183Apply hints when doing collective I/O2021-08-25T15:20:43+02:00Benedikt Steinbuschb.steinbusch@fz-juelich.deApply hints when doing collective I/OBefore resolving #159, hints were applied by calling `sion_ensure_free_space` on every task. This is currently not done anymore. One possible way to re-enable this would be to extend the protocol of the iterators in `sion_internal_positi...Before resolving #159, hints were applied by calling `sion_ensure_free_space` on every task. This is currently not done anymore. One possible way to re-enable this would be to extend the protocol of the iterators in `sion_internal_positioning.{c|h}` to return an `enum` rather than `bool` and use that to indicate a jump to a new block. (Or keep the boolean and return a true tagged union `enum ??? { Remainder { length: i64 }, Jump { position: i64, length: i64 } }`.) Then on every jump to a different block either the sender or the collector can take action (apply hints).https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/182_sion_file_set_position_ansi and _sion_file_get_position_ansi use POSIX funct...2021-08-25T15:20:29+02:00Benedikt Steinbuschb.steinbusch@fz-juelich.de_sion_file_set_position_ansi and _sion_file_get_position_ansi use POSIX functions on some platformsE.g., on Linux, said functions reach around the C standard `FILE*` via `fileno()` and modify/inspect the position of the POSIX file descriptor directly via `lseek()`. This leads to confusing results by making the buffering that C standar...E.g., on Linux, said functions reach around the C standard `FILE*` via `fileno()` and modify/inspect the position of the POSIX file descriptor directly via `lseek()`. This leads to confusing results by making the buffering that C standard I/O performs observable, e.g. in a sequence `read -> set_position -> read` the second `read` can return bytes that were read by the first `read` and still remain in the I/O buffer.
This behavior is not consistent with POSIX (where no buffering is performed) or the C standard library (where `fseek` transparently flushes buffer as necessary) or even with itself (on mac OS (Darwin) `_sion_file_set_position_ansi` is implemented via `fseek`).
To paper over the inconsistency most (but not all) uses of `_sion_file_set_position()` and `_sion_file_get_position()` in SIONlib are prefixed with an explicit `_sion_file_flush()`.
Changing the implementation of the functions in question to use `fseek` and `ftello` like on mac OS does not break anything. In fact, after doing so, it is possible to remove `_sion_file_flush` from all sequences `_sion_file_flush -> _sion_file_{get|set}_position` without breaking any tests.
== Question
Is there a reason for `_sion_file_{get|set}_position_ansi` being implemented the way they are or can they be changed to be more consistent?https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/136Use more recent fortran for f90 interface2019-12-02T16:03:43+01:00Kay ThustUse more recent fortran for f90 interfaceE.g. integer*8 is f77 style, use selected_int_kind insteadE.g. integer*8 is f77 style, use selected_int_kind insteadhttps://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/133Clean up debugging information2019-12-02T16:03:57+01:00Kay ThustClean up debugging information- `DFUNCTION` should be used consistently
- "Enter" debug messages should be the first
- "Exit" should be the last, if possible also for cases where the function does not exit successful- `DFUNCTION` should be used consistently
- "Enter" debug messages should be the first
- "Exit" should be the last, if possible also for cases where the function does not exit successfulhttps://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/129Add sion_paropen_mapped_omp2019-04-12T14:18:44+02:00Kay ThustAdd sion_paropen_mapped_ompmapped omp is still missingmapped omp is still missingWolfgang FringsWolfgang Fringshttps://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/115Add platform library for adapting SIONlib's behavior to the platform it runs on2019-04-12T14:16:26+02:00Kay ThustAdd platform library for adapting SIONlib's behavior to the platform it runs onFormally known as hardware libFormally known as hardware libWolfgang FringsWolfgang Fringshttps://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/193Hybrid API too simple for heterogeneous systems(?)2021-02-12T08:03:35+01:00Benedikt Steinbuschb.steinbusch@fz-juelich.deHybrid API too simple for heterogeneous systems(?)The hybrid API currently flattens the 2D identifiers `(mpi_rank, omp_thread_num)` into a 1D space as follows:
```
sionlib_rank = mpi_rank * num_threads + omp_thread_num
```
This assumes that there is a single `num_threads` that is vali...The hybrid API currently flattens the 2D identifiers `(mpi_rank, omp_thread_num)` into a 1D space as follows:
```
sionlib_rank = mpi_rank * num_threads + omp_thread_num
```
This assumes that there is a single `num_threads` that is valid across all MPI processes. On heterogeneous systems (or in heterogeneous applications, think MPMD) this is not necessarily the case.
As a first step, this should be mentioned in the documentation. In the future, this limitation should be lifted by allowing arbitrary numbers of threads on each process.https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/110Check debug levels and usage of DFUNCTION2019-12-02T16:03:57+01:00Kay ThustCheck debug levels and usage of DFUNCTIONhttps://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/145Split up Fortran interface libraries2019-12-02T16:03:42+01:00Benedikt Steinbuschb.steinbusch@fz-juelich.deSplit up Fortran interface librariesCurrently, the OpenMP part of the Fortran interface is included in the serial library, while the hybrid part is included in the MPI library.
Split up the Fortran interface into eight libraries: `(F77, F90) x (serial, omp, mpi, ompi)`Currently, the OpenMP part of the Fortran interface is included in the serial library, while the hybrid part is included in the MPI library.
Split up the Fortran interface into eight libraries: `(F77, F90) x (serial, omp, mpi, ompi)`https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/141Raising an error in _sion_paropen_generic_one_file lets buddy checkpointing hang2019-12-02T16:03:42+01:00Kay ThustRaising an error in _sion_paropen_generic_one_file lets buddy checkpointing hangRaising an error, e.g. line 159 in [src/parlib/sion_generic_internal.c@2030](src/parlib/sion_generic_internal.c@2030) causes tests for buddy checkpointing to hang.Raising an error, e.g. line 159 in [src/parlib/sion_generic_internal.c@2030](src/parlib/sion_generic_internal.c@2030) causes tests for buddy checkpointing to hang.https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/140Add checks for malloc(0)2019-12-02T16:03:42+01:00Kay ThustAdd checks for malloc(0)Check all circumstances where `malloc(0)` could occur and handle them properly since it is implementation dependent.
Usually this should not be an issue, since zero sized allocs should be prevented by other logic, e.g. size of global co...Check all circumstances where `malloc(0)` could occur and handle them properly since it is implementation dependent.
Usually this should not be an issue, since zero sized allocs should be prevented by other logic, e.g. size of global communicator > 0.