Select Git revision
datalogistics-arch.html
datalogistics-arch.html 33.11 KiB
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 9.1.0" />
<title></title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overridden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install(2);
/*]]>*/
</script>
</head>
<body class="article">
<div id="header">
<div id="toc">
<div id="toctitle">Table of Contents</div>
<noscript><p><b>JavaScript must be enabled in your browser to display the table of contents.</b></p></noscript>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="section-introduction-and-goals">1. Introduction and Goals</h2>
<div class="sectionbody">
<div class="paragraph"><p>Following describes the architecture of eFlows4HPC Data Logistics Service. The service
provides scientific users with a means to prepare, conduct, and monitor data
movement and transformations. The primary use case is to facilitate the data ingestion
process.</p></div>
<div class="paragraph"><p>Main features:</p></div>
<div class="ulist"><ul>
<li>
<p>
keep track of data sources (Data Catalog)
</p>
</li>
<li>
<p>
detect availability of new data in data sources
</p>
</li>
<li>
<p>
move the data from sources into target systems for further processing
</p>
</li>
<li>
<p>
monitor the movement and quality assurance
</p>
</li>
</ul></div>
<div class="sect2">
<h3 id="_requirements_overview">1.1. Requirements Overview</h3>
<div class="tableblock">
<table rules="all"
width="100%"
frame="border"
cellspacing="0" cellpadding="4">
<col width="14%" />
<col width="28%" />
<col width="57%" />
<tbody>
<tr>
<td align="left" valign="top"><p class="table"><strong>ID</strong></p></td>
<td align="left" valign="top"><p class="table"><strong>Requirement</strong></p></td>
<td align="left" valign="top"><p class="table"><strong>Explanation</strong></p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">R1</p></td>
<td align="left" valign="top"><p class="table">Orchestrate movement of data to and from target locations</p></td>
<td align="left" valign="top"><p class="table">Ensure that the data required for computation is moved from external repoistories to target locations</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">R2</p></td>
<td align="left" valign="top"><p class="table">Transfer computation results</p></td>
<td align="left" valign="top"><p class="table">Results of computations should be put in target repositories</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">R3</p></td>
<td align="left" valign="top"><p class="table">Integrate user tools</p></td>
<td align="left" valign="top"><p class="table">There are already some codes to download, upload, and do QA</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">R4</p></td>
<td align="left" valign="top"><p class="table">Integrate new data sources and targets</p></td>
<td align="left" valign="top"><p class="table">New sources and targets can be added at the later stage</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">R5</p></td>
<td align="left" valign="top"><p class="table">Self-service</p></td>
<td align="left" valign="top"><p class="table">In the long-term move from developers to data scientists</p></td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="sect2">
<h3 id="_quality_goals">1.2. Quality Goals</h3>
<div class="tableblock">
<table rules="all"
width="100%"
frame="border"
cellspacing="0" cellpadding="4">
<col width="14%" />
<col width="14%" />
<col width="14%" />
<col width="57%" />
<tbody>
<tr>
<td align="left" valign="top"><p class="table"><strong>ID</strong></p></td>
<td align="left" valign="top"><p class="table"><strong>Prio</strong></p></td>
<td align="left" valign="top"><p class="table"><strong>Quality</strong></p></td>
<td align="left" valign="top"><p class="table"><strong>Explanation</strong></p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">Q1</p></td>
<td align="left" valign="top"><p class="table">1</p></td>
<td align="left" valign="top"><p class="table">Performance/Reliablity</p></td>
<td align="left" valign="top"><p class="table">Required data should be moved to target in a timely manner (accounting for retransfers upon errors)</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">Q2</p></td>
<td align="left" valign="top"><p class="table">3</p></td>
<td align="left" valign="top"><p class="table">Extensibility/Interoperability</p></td>
<td align="left" valign="top"><p class="table">New sources (up to 10 at the same time) and targets (model evolution) can be added. Integration with project ML libraries</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">Q3</p></td>
<td align="left" valign="top"><p class="table">2</p></td>
<td align="left" valign="top"><p class="table">Transparency/Reproducibility</p></td>
<td align="left" valign="top"><p class="table">Other researchers should be able to verify and redo the data movement and processing steps</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">Q4</p></td>
<td align="left" valign="top"><p class="table">4</p></td>
<td align="left" valign="top"><p class="table">Elasticity</p></td>
<td align="left" valign="top"><p class="table">Future changes in workfload should be accounted for</p></td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="section-architecture-constraints">2. Architecture Constraints</h2>
<div class="sectionbody">
<div class="tableblock">
<table rules="all"
width="100%"
frame="border"
cellspacing="0" cellpadding="4">
<col width="20%" />
<col width="80%" />
<tbody>
<tr>
<td align="left" valign="top"><p class="table"><strong>Constraint</strong></p></td>
<td align="left" valign="top"><p class="table"><strong>Explanation</strong></p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">Local orientation</p></td>
<td align="left" valign="top"><p class="table">The user shall have an impression of owning the local process and codes. The data should be brought to Data Logistics Service</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">Python-compatible</p></td>
<td align="left" valign="top"><p class="table">There are some modules and codes in Python and scripts, they should be reuse/integrated</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">Reuse infra</p></td>
<td align="left" valign="top"><p class="table">Take advantage of existing services (Splunk, Grafana, Jenkins, Gitlab)</p></td>
</tr>
<tr>
<td align="left" valign="top"><p class="table">External developer</p></td>
<td align="left" valign="top"><p class="table">Some of the code developed outside should be included in the transformation step, in not intrusible way</p></td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="section-system-scope-and-context">3. System Scope and Context</h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="_business_context">3.1. Business Context</h3>
<div class="paragraph"><p><span class="image">
<img src="./images/business_context.png" alt="Business view" />
</span></p></div>
</div>
<div class="sect2">
<h3 id="_technical_context">3.2. Technical Context</h3>
<div class="paragraph"><p><span class="image">
<img src="./images/technical_context.png" alt="Technical L1 view" />
</span></p></div>
<div class="paragraph"><p><strong>Mapping Input/Output to Channels</strong></p></div>
<div class="paragraph"><p>Scheduler → Worker: asynchornous though redis messaging</p></div>
<div class="paragraph"><p>Worker → Data Source (real-time): HTTP-based, synchron</p></div>
<div class="paragraph"><p>Worker → Data Source (initial replication): API (S3 for OpenAQ)</p></div>
<div class="paragraph"><p>Worker → Local Store: API in first attempt, direct Postgres access could also be an option</p></div>
<div class="paragraph"><p>Worker → Model Repository: HTTP API through CLI/ BashOperator</p></div>
<div class="paragraph"><p>DAG Repository → {Workers, Scheduler}: Jenkins-based CI/CD pipeline with local Docker image repository</p></div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="section-solution-strategy">4. Solution Strategy</h2>
<div class="sectionbody">
<div class="paragraph"><p>Fundamental decisions and solutions strategies.</p></div>
<div class="sect2">
<h3 id="_async_communication">4.1. Async Communication</h3>
<div class="paragraph"><p>We are not planning to address the streaming use-cases, thus the communication between
scheduler and workers will be done async with a message bus. Airflow offers this.</p></div>
<div class="paragraph"><p>This is one-leader async replication use-case.</p></div>
</div>
<div class="sect2">
<h3 id="_airflow">4.2. Airflow</h3>
<div class="paragraph"><p>The pipelines in Airflow are defined in Python (rather than in special workflow language), in long
term the data scientists should be able to create their own pipelines.</p></div>
<div class="paragraph"><p>Airflow pipelines are executed on workflows but can interact with storages their in-storage processing
capabilities. This can be beneficial when moving to Spark for more performance-requiring jobs.</p></div>
</div>
<div class="sect2">
<h3 id="_pure_functions_workflows">4.3. "Pure functions" workflows</h3>
<div class="paragraph"><p>The idea is to build the workflows as pure functions, no orthogonal concerns should be
included (invocation/scheduling/input-output locations). The primary reason for this
is the testablility outside of Airflow.</p></div>
<div class="paragraph"><p>Also the idea of the functions is to achieve idempotentce. In case we have to rerun
the pipeline, the results should be the same, e.g., a measurement series is recreated
rather than new measurements are added. Motivation for this the ability to re-run
the workflows. Also relevant for continuous deployment. If some tasks are lost, we
can restart them. In worst case, the work will be done twice.</p></div>
<div class="paragraph"><p>In general, we want to have a possibility to recreate full data sets.</p></div>
</div>
<div class="sect2">
<h3 id="_deployment_with_jenkins">4.4. Deployment with Jenkins</h3>
<div class="paragraph"><p>Reuse our CD infrastructure. See <a href="#Deployment">[Deployment]</a> for more details.</p></div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="section-building-block-view">5. Building Block View</h2>
<div class="sectionbody">
<div class="paragraph"><p><span class="image">
<img src="./images/technical_context.png" alt="Technical L1 view" />
</span></p></div>
<div class="sect2">
<h3 id="_data_logistics_ui">5.1. Data Logistics UI</h3>
<div class="paragraph"><p>Web service offering view of the defined pipelines, execution details, performance metrics.
Also offers view of the data sources. The view can be extended with plugins.</p></div>
<div class="paragraph"><p><em>Interfaces</em>
Primary interface is the web view, but there is also CLI used for deployment and defining
data sources.</p></div>
<div class="paragraph"><p><em>Fulfilled Requirements</em></p></div>
<div class="ulist"><ul>
<li>
<p>
Self-service (to some extend)
</p>
</li>
</ul></div>
<div class="paragraph"><p>Qualities:</p></div>
<div class="ulist"><ul>
<li>
<p>
Monitoring/Transparency
</p>
</li>
</ul></div>
<div class="paragraph"><p>Extensibility: DAG in Python</p></div>
<div class="paragraph"><p><em>Open Issues/Problems/Risks</em>
For the proper working it requires to have DAGs injected (locally). This becomes an issue
when DAGs use e.g. <code>PythonOperators</code> and thus require additional libraries for the runtime.
Either built the pipelines in such a way that this is not necessary (imports) or rebuild
the UI on each change (here again conflicts).</p></div>
</div>
<div class="sect2">
<h3 id="_data_logistics_scheduler_executor">5.2. Data Logistics Scheduler + Executor</h3>
<div class="paragraph"><p>Not directly accessible to the users. It keeps track of scheduling tasks on workers. Require
connection to workers (can be indirect through messaging system). Also keeps track of the
executions (through metadata store) and restarts failed tasks.</p></div>
<div class="paragraph"><p><em>Interfaces</em>
Stand-alone system, read DAGs (local) and metadata store to find out if new instantiations
are required. The communication with Workers uses two queuing backends:</p></div>
<div class="ulist"><ul>
<li>
<p>
Queue Broker (put the commands to be executed)
</p>
</li>
<li>
<p>
Result backend (gets the status of the completed tasks)
</p>
</li>
</ul></div>
<div class="paragraph"><p>Multi queue deployment for initial snap shot and updates.</p></div>
<div class="listingblock">
<div class="title">Operator Example</div>
<div class="content"><!-- Generator: GNU source-highlight
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt> t3 <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #000000">BashOperator</span></span><span style="color: #990000">(</span>
task_id<span style="color: #990000">=</span><span style="color: #FF0000">'print_host'</span><span style="color: #990000">,</span>
bash_command<span style="color: #990000">=</span><span style="color: #FF0000">'hostname >> /tmp/dag_output.log'</span><span style="color: #990000">,</span>
queue<span style="color: #990000">=</span><span style="color: #FF0000">'local_queue'</span><span style="color: #990000">,</span>
dag<span style="color: #990000">=</span>dag<span style="color: #990000">)</span></tt></pre></div></div>
<div class="paragraph"><p><em>Fulfilled Requirements</em></p></div>
<div class="paragraph"><p>Quality:</p></div>
<div class="ulist"><ul>
<li>
<p>
Extensibility
</p>
</li>
<li>
<p>
Elasticity
</p>
</li>
</ul></div>
<div class="paragraph"><p><em>Open Issues/Problems/Risks</em>
To some extend the scheduler is a single point of failure. It is relative hard to restart upon
rebuild.</p></div>
<div class="paragraph"><p>Not so stateless.</p></div>
</div>
<div class="sect2">
<h3 id="_data_logistics_worker">5.3. Data Logistics Worker</h3>
<div class="paragraph"><p>Workers are responsible for executing task from pipelines. The instantiate the operators
of different type. Operators that are executed on the worker need to have their
dependencies met in that context. The worker needs to have access to its DAGS_FOLDER,
and you need to synchronize the filesystems by your own means. We use git for that.</p></div>
<div class="paragraph"><p>The number of workers can be changed to enable some elasticity in the system.</p></div>
<div class="paragraph"><p>Operator scaffoldings.</p></div>
<div class="paragraph"><p><em>Interfaces</em>
Not directly accessible to the users. The commands are received from scheduler (through)
messaging system. Workers get task execution context and can access MD to get information
about required connections (directly from MD DB).</p></div>
<div class="paragraph"><p>Each worker can listen on multiple queues of tasks. This can be used if the workers are
heterogeneous, e.g., separate queue for Spark jobs or differentiate between different loads.</p></div>
<div class="paragraph"><p>The workers do not communicate directly with each other. Small amounts of data can be exchanged
through 3rd party in form of XCOM. Tested to work with a shared storage, can be an option for
deployment.</p></div>
<div class="paragraph"><p>The worker offers and interface to access log files this is used by the UI.</p></div>
<div class="paragraph"><p><em>Quality/Performance Characteristics</em></p></div>
<div class="ulist"><ul>
<li>
<p>
Integrate user tools
</p>
</li>
<li>
<p>
New sources
</p>
</li>
</ul></div>
<div class="paragraph"><p>Extensibility: multiple workers, multiple queues</p></div>
<div class="paragraph"><p>Interoperability: Operators to access new data sources</p></div>
<div class="paragraph"><p>Self-service: possiblity to create and publish own pipelines</p></div>
</div>
<div class="sect2">
<h3 id="_dag_repo">5.4. DAG Repo</h3>
<div class="paragraph"><p>This repo will store the information about the pipelines in form of DAGs. The
pipeline definition comprises: code (in Python), dependencies (if required),
config metadata (frequency, backfill, etc).</p></div>
<div class="paragraph"><p>To store, share, etc we use DAG Repo to store locally available GitLab of the
requester research group.</p></div>
<div class="paragraph"><p>For some parts of DAG we will have unittests.</p></div>
<div class="paragraph"><p>The deployment of the DAGs is done through the Continuous Deployment gitlab.</p></div>
<div class="paragraph"><p><em>Interfaces</em>
Standard gitlab access.</p></div>
<div class="paragraph"><p><em>Fulfilled Requirements</em>
Extensibility, Self-service</p></div>
<div style="page-break-after:always"></div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="section-runtime-view">6. Runtime View</h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="_scheduling">6.1. Scheduling</h3>
<div class="paragraph"><p><span class="image">
<img src="./images/scheduling.png" alt="Scheduling process" />
</span></p></div>
</div>
<div class="sect2">
<h3 id="_deployment">6.2. Deployment</h3>
<div class="paragraph" id="Deployment"><p><span class="image">
<img src="./images/deployment.png" alt="Deployment and image using process" />
</span></p></div>
</div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated
2022-02-10 12:11:11 CET
</div>
</div>
</body>
</html>