SIONlib issues
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues
2021-08-25T15:20:29+02:00

Deprecate or remove sion_get_current_position etc.
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/213
2021-08-25T15:20:29+02:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

These functions expose internal fields; remove them in favor of `sion_tell` (and possibly more functions to be implemented).

2.0.0-rc.4

Buddy checkpointing does not work with hybrid API
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/190
2021-08-25T15:20:29+02:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

There are no tests for the combination of the hybrid (MPI + OpenMP) API and buddy checkpointing, but one of the existing tests (e.g. `test_buddy___1`) can be adapted by changing all occurrences of `paropen_mpi` and `parclose_mpi` to `paropen_ompi` and `parclose_ompi`. The resulting test deadlocks, because the `gather_execute` and `execute_scatter` implementations for the hybrid API do not make the same three-branch comparison (collector, sender, no-op) as the MPI implementation (the no-op branch is missing).

_sion_filedesc is probably a god object
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/186
2019-12-02T16:03:43+01:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

... or rather, it would be one, if C were an object-oriented language. It contains everything and at the same time encapsulates almost nothing. It represents all of these:
- a single physical file with a single logical file opened,
- a single physical file with multiple (or all) of its logical files opened,
- multiple physical files with multiple (or all) of their logical files opened.
To make matters worse, at any given time, only a single logical file in a single physical file out of these collections is in focus. When switching focus (which commonly happens when seeking, but also during open and close), the meta-data of the newly selected file (be it logical and/or physical) is copied to special fields of `_sion_filedesc` and the meta-data of the file that is no longer in focus is written back to the fields it was previously copied from. There are, however, no functions that do this. The copies are made ad hoc every time, see e.g. [sion_seek](https://trac.version.fz-juelich.de/SIONlib/browser/trunk/src/lib/sion_internal_seek.c?rev=2345#L57) or [sion_generic_paropen_mapped](https://trac.version.fz-juelich.de/SIONlib/browser/trunk/src/parlib/sion_generic_mapped.c?rev=2345#L474).
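A minimal sketch of what encapsulating the focus switch could look like, assuming a simplified descriptor. The types `file_meta` and `descriptor` and the function `switch_focus` are hypothetical and do not reflect the actual layout of `_sion_filedesc`. The sketch only illustrates putting the write-back and the copy-in into one function instead of repeating them at every call site.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file meta-data; NOT SIONlib's real fields. */
typedef struct {
    long position;   /* current position within the file */
    long blocksize;  /* block size used by the file */
} file_meta;

/* Hypothetical descriptor: stored meta-data plus a working copy. */
typedef struct {
    file_meta files[4];  /* stored meta-data, one slot per file */
    size_t    nfiles;    /* number of slots in use */
    size_t    focus;     /* index of the file currently in focus */
    file_meta current;   /* working copy of the focused file */
} descriptor;

/* Write the working copy back to its slot, then load the new file.
 * Returns 0 on success, -1 if new_focus is out of range. */
int switch_focus(descriptor *d, size_t new_focus)
{
    assert(d != NULL);
    if (new_focus >= d->nfiles)
        return -1;                     /* descriptor is left unchanged */
    d->files[d->focus] = d->current;   /* write back the old focus */
    d->current = d->files[new_focus];  /* copy in the new focus */
    d->focus = new_focus;
    return 0;
}
```

With a helper like this, seek, open and close would call one function instead of each carrying its own copy loop, and the range check would live in a single place.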
Separating the different aspects of `_sion_filedesc` into multiple types and encapsulating copy operations (or not copying in the first place) would help cut down on redundant code and clarify responsibilities.

Error handling by deadlock
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/185
2019-12-02T16:03:43+01:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

When return codes of fallible functions are not ignored (see #181), they are often handled by immediately performing clean-up (or not, see #184) and returning early. This is better than ignoring the errors, but has serious drawbacks in the parallel parts of SIONlib. Errors do not necessarily manifest globally (i.e. on all processes). As a result, some (but not all) processes return early from functions that later try to perform collective operations, which leads to a deadlock. Considering that in the context of HPC, users are often billed for compute time by the wall clock, this is a particularly poor error handling strategy.
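The failure mode can be illustrated without MPI by simulating the per-process status codes in an array. This is only a sketch: `agree_on_status` and `ranks_reaching_collective` are hypothetical names, and in a real MPI program the agreement step would presumably be a reduction such as `MPI_Allreduce` over the local status.

```c
#include <assert.h>
#include <stddef.h>

/* Combine per-process status codes (0 = ok, nonzero = error) into one
 * shared verdict, mimicking a reduction across all processes. */
int agree_on_status(const int *local_status, size_t nprocs)
{
    assert(local_status != NULL);
    int global = 0;
    for (size_t i = 0; i < nprocs; ++i)
        if (local_status[i] != 0)
            global = 1;
    return global;
}

/* Count the processes that do NOT return early and therefore enter the
 * next collective operation. With collective_errors == 0 each process
 * decides on its own status; otherwise all use the agreed status. */
size_t ranks_reaching_collective(const int *local_status, size_t nprocs,
                                 int collective_errors)
{
    size_t reached = 0;
    int global = agree_on_status(local_status, nprocs);
    for (size_t i = 0; i < nprocs; ++i) {
        int status = collective_errors ? global : local_status[i];
        if (status == 0)
            ++reached;
    }
    return reached;
}
```

If one of four processes fails, the local decision leaves three processes waiting in the collective for a fourth that never arrives. With the agreed status, the count is either zero or all of them, so the collective is never entered by only part of the group.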
Better strategies would be to make error handling collective (if any process encounters an error, all return early) or to abort the program (harsh, but probably saves a lot of wasted compute time).

Audit open and close functions for memory leaks
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/184
2019-12-02T16:03:43+01:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

Address sanitizer indicates that some open and close functions leak memory on the error paths (e.g. when called with incorrect parameters, as some tests do). A more thorough audit should be performed.

Check return codes of fallible operations
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/181
2021-08-25T15:20:43+02:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

In many places, return codes of fallible operations such as `_sion_file_flush` and `_sion_file_set_position` are not checked.

Simplify `globalranks` vs. `ranks`?
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/177
2019-12-02T16:03:44+01:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

A SIONlib file container contains an arbitrary number `n` of logical files. Each one of these `n` files has two identifying numbers attached to it:
- an implicit one in the range `0...n-1`, `siondump` calls this "Task",
- an explicit one that is freely assigned by the user, called "globalrank".
Depending on how the file was opened (e.g. serial open vs. mapped open) either the implicit or explicit rank numbers are used to refer to a logical file when doing a `sion_seek`.
- This is confusing. Why are there two rank numbers for a logical file? Why does `sion_seek` behave differently, depending on how a file was opened?
- The current implementation is incomplete. The serial open functions do not perform any validation of the `globalranks` making it possible to create files that trip up a later mapped open (see e.g. #156 and #174).
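The missing validation could take roughly the following shape. This is a sketch under a loud assumption: the issue does not say which `globalranks` values trip up a mapped open, so the check below simply assumes that valid ranks are non-negative and unique. The function name `globalranks_are_valid` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every rank is non-negative and occurs only once, else 0.
 * ASSUMPTION: "valid" means non-negative and unique; the real
 * constraints of a mapped open may differ. */
int globalranks_are_valid(const int *globalranks, size_t n)
{
    assert(globalranks != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (globalranks[i] < 0)
            return 0;                            /* negative rank */
        for (size_t j = i + 1; j < n; ++j)
            if (globalranks[i] == globalranks[j])
                return 0;                        /* duplicate rank */
    }
    return 1;
}
```

The quadratic scan keeps the sketch short; for containers with many logical files, sorting a copy of the array first would be the more economical choice.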
Is the additional identifying number really needed or can the mechanism be simplified?

2.0.0

`_sion_free_filedesc_all_localranks` is dead code but should probably be called somewhere
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/172
2021-08-25T15:20:29+02:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

While working on #157 (dead code removal), I did not remove `_sion_free_filedesc_all_localranks`, because the corresponding allocation function is indeed called and the `all_localranks` field is used. Find the right spot to insert calls to `_sion_free_filedesc_all_localranks`.

Incorrect intent declarations in Fortran bindings
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/163
2019-12-02T16:03:43+01:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

Anke Kreuzer recently reported a segmentation fault originating in a call to `fsion_paropen_mpi`.
> Hello Benedikt,
>
> so the code around the FSION_PAROPEN_MPI call looks like this:
>
> fsblksize = -1
> chunksize = INT(16+nt*(96+120*ndl),KIND=8)
>
> IF(myid==0) THEN
> WRITE(*,*) " Before create_Com"
> ENDIF
> color = crank/24
> key = modulo(crank,24);
> ! WRITE(*,*) "My id: ",myid , "color: ", color, "key: ", key
>
> CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, color, myid,
> & lcomm, ierr)
> IF(myid==0) THEN
> WRITE(*,*) " After create_Com, before SION_PAROPEN_MPI"
> ENDIF
>
> So, the program runs up to this point, and then comes:
>
> CALL FSION_PAROPEN_MPI('NAM_CPs',
> & 'bw',-1,icomm0,lcomm,
> & chunksize, fsblksize, myid, newfname, sid)
>
> IF(myid==0) THEN
> WRITE(*,*) " After paropen"
> ENDIF
>
> And this last message is never printed, because the segmentation
> fault occurs before it.
The problem is that SIONlib tries to return the number of files it actually opened (based on the number of distinct local communicators) in the `nfiles` argument. However, the special value `-1` is passed as an integer literal and, because the dummy argument `nfiles` is declared with `intent(in)`, the compiler probably places it in a read-only part of memory, so the write faults.
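The in/out contract can be sketched on the C side as follows. The function `open_files` and its second parameter are hypothetical stand-ins, not part of SIONlib. The sketch assumes that `-1` on input asks the library to choose the number of files, and that the result is written back through the same pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an open call with an in/out file count.
 * On input, *nfiles is the request (ASSUMPTION: -1 means "let the
 * library decide"); on output, it holds the number of files opened. */
int open_files(int *nfiles, int ndistinct_comms)
{
    assert(nfiles != NULL);
    if (*nfiles == -1)
        *nfiles = ndistinct_comms;  /* this write needs writable storage */
    return 0;
}
```

Because the callee writes through the pointer, the caller has to supply a variable rather than a constant. On the Fortran side, the analogous workaround would be to assign `-1` to an integer variable and pass that variable instead of the literal, until the `intent` declaration is corrected.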
A quick inspection reveals that the `intent` declarations on the Fortran subroutine and the argument const-ness on the C function that is called underneath differ in several places.
```
subroutine fsion_paropen_mpi(fname,file_mode,nfiles,fgcomm,flcomm,chunksizes,fsblksize,&
& globalrank,newfname,sid)
implicit none
character(len=*), intent(in) :: fname
character(len=*), intent(in) :: file_mode
integer, intent(in) :: nfiles
integer, intent(in) :: fgcomm
integer, intent(inout) :: flcomm
integer*8, intent(inout) :: chunksizes
integer*4, intent(inout) :: fsblksize
integer, intent(in) :: globalrank
character(len=*), intent(out) :: newfname
integer, intent(out) :: sid
call fsion_paropen_mpi_c(fname,file_mode,nfiles,fgcomm,flcomm,chunksizes,fsblksize,&
& globalrank,newfname,sid)
end subroutine fsion_paropen_mpi
```
```
void fsion_paropen_mpi_c(char *fname,
char *file_mode,
int *numFiles,
MPI_Fint * fgComm,
MPI_Fint * flComm,
sion_int64 *chunksize,
sion_int32 *fsblksize,
int *globalrank,
char *newfname,
int *sid,
int fname_len,
int file_mode_len,
int newfname_len);
```

Error in sion_generic_register_nam_restore_file_cb
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/155
2017-02-09T14:03:16+01:00, Kay Thust

DEEP-ER, Wolfgang Frings

Split up Fortran interface libraries
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/145
2019-12-02T16:03:42+01:00, Benedikt Steinbusch (b.steinbusch@fz-juelich.de)

Currently, the OpenMP part of the Fortran interface is included in the serial library, while the hybrid part is included in the MPI library.
Split up the Fortran interface into eight libraries: `(F77, F90) x (serial, omp, mpi, ompi)`

Raising an error in _sion_paropen_generic_one_file lets buddy checkpointing hang
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/141
2019-12-02T16:03:42+01:00, Kay Thust

Raising an error, e.g. at line 159 in [src/parlib/sion_generic_internal.c@2030](src/parlib/sion_generic_internal.c@2030), causes tests for buddy checkpointing to hang.

Buddy checkpointing hangs when main file is missing
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/138
2016-07-05T10:38:32+02:00, Kay Thust

test_fitest_1 does not pass TEST H, so it is deactivated and the log is adapted to make it pass (see [1990]).

DEEP-ER, Wolfgang Frings

Add sion_paropen_mapped_omp
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/129
2019-04-12T14:18:44+02:00, Kay Thust

mapped omp is still missing

Wolfgang Frings

Optimization for Scalasca
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/113
2015-03-25T11:02:34+01:00, Kay Thust

- investigation on currently poor I/O performance

v1.6, Wolfgang Frings

SION_DEBUG=stdout does not work
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/109
2015-03-10T17:35:21+01:00, Wolfgang Frings

SION_DEBUG=stdout does not work: it writes to a file instead of to stdout.

v1.6, Wolfgang Frings

Problems with POSIX I/O and buffering
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/103
2019-12-02T16:03:56+01:00, Wolfgang Frings

From John Donners:
I've also experimented a lot with the POSIX interface and the internal
buffering for SIONlib. The SIONlib buffer seems not flushed at the
moment that the file is closed. I've added an extra routine to flush the
buffer, which uses the same code as in sion_fwrite. I'm not sure where
to insert the code in the different SIONlib close calls.

v1.6

Error codes for unsuccessful read / write calls
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/80
2021-08-25T15:20:43+02:00, Kay Thust

reported by Bert Wesarg
> The problem is rather that we would like to distinguish between 'successful read, even if it returned too little' and 'error while reading'. With the ANSI API this can be distinguished easily with ferror(). But with sion_fread_key() I currently do not see a way to do that.
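The distinction that the ANSI API allows can be sketched as follows. `read_result` and `read_classified` are hypothetical names introduced for illustration; only `fread` and `ferror` are standard C. The report states that `sion_fread_key()` currently offers no equivalent of the `ferror()` check.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { READ_OK, READ_SHORT_EOF, READ_ERROR } read_result;

/* Read up to n bytes and classify the outcome. The byte count alone
 * cannot tell a short read from an error; the stream's error flag can. */
read_result read_classified(FILE *fp, void *buf, size_t n, size_t *nread)
{
    assert(fp != NULL && buf != NULL && nread != NULL);
    *nread = fread(buf, 1, n, fp);
    if (*nread == n)
        return READ_OK;          /* got everything that was requested */
    if (ferror(fp))
        return READ_ERROR;       /* short count caused by an I/O error */
    return READ_SHORT_EOF;       /* short count because the data ran out */
}
```

A caller would treat `READ_SHORT_EOF` as a successful read of `*nread` bytes and only `READ_ERROR` as a failure.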
And later
> Even with ANSI, one cannot decide from the return value whether it is an error or an end-of-file. Whether the number of bytes or the number of elements is returned makes no difference. So I do not currently see how this information could have been available before. POSIX, on the other hand, returns -1 in case of an error.

2.0.0

Handling of SION_BUFFERSIZE
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/71
2019-12-02T16:03:56+01:00, Wolfgang Frings

Interestingly enough, the POSIX open
call does not show this behaviour. So I changed to the POSIX interface.
To get the buffering, I also added the 'buffered' option when opening the
file. Is that option mature enough for production? I just have one remark:
I had to set the environment variable SION_BUFFERSIZE to actually get
the buffering to work. I think that is because of line 60 in sion_buffer.c, which
shouldn't be inside the if-construct, so the buffer size gets set to the block size
by default.

v1.6

Download-Panel HTML-Problems
https://gitlab.jsc.fz-juelich.de/cstao-public/SIONlib/SIONlib/-/issues/63
2019-04-12T14:18:59+02:00, Wolfgang Frings

Wolfgang Frings