1. Introduction and Goals
Following describes the architecture of eFlows4HPC Data Catalog. The service will provide information about data sets used in the project. The catalog will store info about locations, schemas, and additional metadata.
Main features:
-
keep track of data sources
-
enable registration of new data sources
-
provide user-view as well as simple API to access the information
1.1. Requirements Overview
ID |
Requirement |
Explanation |
R1 |
View data sources |
View the list of data sources and details on particular ones (web page + api) |
R2 |
Register data sets |
Authenticated users should be able register/change data sets with additional metadata |
R3 |
No schema MD |
We don’t impose a schema for the metadata (not know what is relevant) |
R4 |
Documented API |
Swagger/OpenAPI |
1.2. Quality Goals
ID |
Prio |
Quality |
Explanation |
Q1 |
1 |
Extensibility |
Possibility to add new metadata to existing rows |
Q2 |
2 |
Interoperability |
The service should work with Data Logistics |
Q3 |
2 |
Deployability |
Quick/automatic deployment |
2. Architecture Constraints
Constraint |
Explanation |
Authentication |
OAuth-based for admin users |
Deployment |
We shall use CI/CD, the project will also be a playing field to setup this and test before the Data Logistics |
Docker-based Deployment |
This technology will be used in the project anyways |
3. System Scope and Context
3.1. Business Context
3.2. Technical Context
Mapping Input/Output to Channels
User → Data Catalog: simple (static?) web page view
Data Logistics → Data catalog HTTP/API read-only
Admin → Data Catalog: either a web page or CLI
4. Solution Strategy
4.1. Speed and flexibility
This product will not be very mission critical, we want to keep it simple. A solution even without a backend database would be possible. API with Swagger/OpenAPI (e.g. fastAPI). Frontend static page with JavaScript calls to the API.
4.2. Automatic Deployment
-
Code in Gitlab
-
Resources on HDF Cloud
-
Automatic deployment with Docker + docker-compose, OpenStack API
We use docker image repository in gitlab to generate new images.
4.3. Structure
Main data model is based on JSON and uses pydantic. Resources in the Catalog are of two storage classes (sources and targets). The number of classes can change in the future.
The actual storage of the information in the catalog is done through an abstract interface which in the first attempt stores the data in a file, other backends can be added.
API uses a backend abstraction to mange the informations
Web front-end are static html files generated from templates. This gives a lot of flexibility and allows for easy scalability if required.