<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://hpc-wiki.info/hpc/index.php?action=history&amp;feed=atom&amp;title=Machine_and_Deep_Learning_Frameworks</id>
	<title>Machine and Deep Learning Frameworks - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://hpc-wiki.info/hpc/index.php?action=history&amp;feed=atom&amp;title=Machine_and_Deep_Learning_Frameworks"/>
	<link rel="alternate" type="text/html" href="https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;action=history"/>
	<updated>2026-05-26T11:14:42Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.35.9</generator>
	<entry>
		<id>https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5062&amp;oldid=prev</id>
		<title>Jannis-klinkenberg-0962@rwth-aachen.de at 07:27, 1 July 2024</title>
		<link rel="alternate" type="text/html" href="https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5062&amp;oldid=prev"/>
		<updated>2024-07-01T07:27:24Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left diff-editfont-monospace&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:27, 1 July 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l109&quot; &gt;Line 109:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 109:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Consult the according documentation on how to run/execute a container depending&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Consult the according documentation on how to run/execute a container depending&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;on the used software and check for flags that might be required on the desired HPC systems.&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;on the used software and check for flags that might be required on the desired HPC systems.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;includes ensuring &lt;/del&gt;GPU availabilty inside the container, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;via &lt;/del&gt;e.g.&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt;+&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;might include the following entries&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;code&amp;gt;--nv&amp;lt;/code&amp;gt; for Apptainer, or &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;checking &lt;/del&gt;the set environment variables that&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt;+&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;* Ensuring &lt;/ins&gt;GPU availabilty inside the container, e.g.&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, via &lt;/ins&gt;&amp;lt;code&amp;gt;--nv&amp;lt;/code&amp;gt; for Apptainer&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;might get carried into the container environment and therefore have to be&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt;+&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;* Mapping additional user-specific directories during container usage&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;where files can be accessed &lt;/ins&gt;or &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;manipulated, e.g, via &amp;lt;code&amp;gt;--bind&amp;lt;/code&amp;gt; for Apptainer. Remember: Directories that are part of the container are typically read-only.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;either cleaned or expanded when requiring file paths that are not part of the&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt;+&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;* Checking &lt;/ins&gt;the set environment variables that might get carried into the container environment and therefore have to be either cleaned or expanded when requiring file paths that are not part of the container.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;container.&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In some cases it might be required to build own containers or build upon&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In some cases it might be required to build own containers or build upon&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key hpc_wiki:diff::1.12:old-5059:rev-5062 --&gt;
&lt;/table&gt;</summary>
		<author><name>Jannis-klinkenberg-0962@rwth-aachen.de</name></author>
	</entry>
	<entry>
		<id>https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5059&amp;oldid=prev</id>
		<title>Jannis-klinkenberg-0962@rwth-aachen.de at 07:05, 1 July 2024</title>
		<link rel="alternate" type="text/html" href="https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5059&amp;oldid=prev"/>
		<updated>2024-07-01T07:05:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left diff-editfont-monospace&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:05, 1 July 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l281&quot; &gt;Line 281:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 281:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For more detailed information on distributed training for specific frameworks,&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For more detailed information on distributed training for specific frameworks,&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;consult the pages below:&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;consult the pages below:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[PyTorch#&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Distributed_training&lt;/del&gt;|Distributed training with PyTorch]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt;+&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[PyTorch#&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Distributed training&lt;/ins&gt;|Distributed training with PyTorch]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[TensorFlow#&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Distributed_training&lt;/del&gt;|Distributed training with TensorFlow]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt;+&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[TensorFlow#&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Distributed training&lt;/ins&gt;|Distributed training with TensorFlow]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Inference ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Inference ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key hpc_wiki:diff::1.12:old-5058:rev-5059 --&gt;
&lt;/table&gt;</summary>
		<author><name>Jannis-klinkenberg-0962@rwth-aachen.de</name></author>
	</entry>
	<entry>
		<id>https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5058&amp;oldid=prev</id>
		<title>Jannis-klinkenberg-0962@rwth-aachen.de at 07:03, 1 July 2024</title>
		<link rel="alternate" type="text/html" href="https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5058&amp;oldid=prev"/>
		<updated>2024-07-01T07:03:43Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left diff-editfont-monospace&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:03, 1 July 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l359&quot; &gt;Line 359:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 359:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [https://github.com/webdataset/webdataset webdataset]&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [https://github.com/webdataset/webdataset webdataset]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [https://datadings.readthedocs.io/en/stable/ datadings]&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [https://datadings.readthedocs.io/en/stable/ datadings]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt;−&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Improving I/O performance for parallel single node runs ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&#039;diff-marker&#039;&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Improving I/O performance for parallel single node runs ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key hpc_wiki:diff::1.12:old-5055:rev-5058 --&gt;
&lt;/table&gt;</summary>
		<author><name>Jannis-klinkenberg-0962@rwth-aachen.de</name></author>
	</entry>
	<entry>
		<id>https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5055&amp;oldid=prev</id>
		<title>Jannis-klinkenberg-0962@rwth-aachen.de: Created page with &quot;Category:HPC-Developer Category:HPC-User Frameworks for machine learning (ML) and deep learning (DL) provide many tools to facilitate the building, training and infere...&quot;</title>
		<link rel="alternate" type="text/html" href="https://hpc-wiki.info/hpc/index.php?title=Machine_and_Deep_Learning_Frameworks&amp;diff=5055&amp;oldid=prev"/>
		<updated>2024-07-01T06:55:40Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;&lt;a href=&quot;/hpc/Category:HPC-Developer&quot; title=&quot;Category:HPC-Developer&quot;&gt;Category:HPC-Developer&lt;/a&gt; &lt;a href=&quot;/hpc/Category:HPC-User&quot; title=&quot;Category:HPC-User&quot;&gt;Category:HPC-User&lt;/a&gt; Frameworks for machine learning (ML) and deep learning (DL) provide many tools to facilitate the building, training and infere...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;[[Category:HPC-Developer]] [[Category:HPC-User]]&lt;br /&gt;
Frameworks for machine learning (ML) and deep learning (DL) provide many tools&lt;br /&gt;
to facilitate the building, training and inference process of different machine&lt;br /&gt;
learning models. This article aims to provide an overview about common&lt;br /&gt;
frameworks and the surrounding execution environments, as well as some details&lt;br /&gt;
about the underlying strategies.&lt;br /&gt;
&lt;br /&gt;
== Existing frameworks ==&lt;br /&gt;
&lt;br /&gt;
As there is already an extensive number of ML/DL frameworks available and new&lt;br /&gt;
ones targeting more specialized use-cases are actively being developed, this&lt;br /&gt;
article only lists some of them and provides some classification as a basic&lt;br /&gt;
overview.&lt;br /&gt;
The choice of a suitable framework depends on multiple factors:&lt;br /&gt;
* the type of machine learning model: e.g. classification, regression, neural networks, large language models, evolutionary algorithms&lt;br /&gt;
* the training method: e.g. (un-)supervised learning, reinforcement learning, auto-regressive&lt;br /&gt;
* the targeted hardware: CPU, GPU, CPU+GPU, or other accelerators&lt;br /&gt;
* the used programming model: e.g. CUDA for Nvidia GPUs, ROCm via HIP for AMD GPUs, etc.&lt;br /&gt;
* the used programming language: C/C++, Python, Julia, Fortran, etc.&lt;br /&gt;
* others&lt;br /&gt;
&lt;br /&gt;
In the following, the focus lies on ML/DL frameworks for the Python programming&lt;br /&gt;
language, although some of them also support other programming languages&lt;br /&gt;
like C++.&lt;br /&gt;
&lt;br /&gt;
=== scikit-learn ===&lt;br /&gt;
&lt;br /&gt;
[https://scikit-learn.org/stable/ scikit-learn] is a Python framework for&lt;br /&gt;
shallow machine learning. It provides both supervised and unsupervised machine&lt;br /&gt;
learning models like regression, support vector machines, neural networks&lt;br /&gt;
and clustering. scikit-learn only supports execution on the CPU.&lt;br /&gt;
&lt;br /&gt;
=== PyTorch ===&lt;br /&gt;
&lt;br /&gt;
[https://pytorch.org/ PyTorch] is a Python framework for machine and deep&lt;br /&gt;
learning. It is built upon the [http://torch.ch/ torch] library, which also&lt;br /&gt;
provides a C++ interface. Both CPU and GPU execution are supported for&lt;br /&gt;
single-node and multi-node systems. Distributed model training is possible&lt;br /&gt;
through a PyTorch native implementation as well as [https://horovod.ai/ horovod] and can be&lt;br /&gt;
extended with additional distributed strategies and algorithms through&lt;br /&gt;
libraries and frameworks like [https://www.deepspeed.ai/ DeepSpeed] or&lt;br /&gt;
[https://lightning.ai/docs/pytorch/stable/ PyTorch Lightning].&lt;br /&gt;
&lt;br /&gt;
For more information on general setup and (distributed) machine learning, check&lt;br /&gt;
out: [[PyTorch|PyTorch in HPC]]&lt;br /&gt;
&lt;br /&gt;
=== TensorFlow ===&lt;br /&gt;
&lt;br /&gt;
[https://www.tensorflow.org/ TensorFlow] is a machine learning framework with&lt;br /&gt;
focus on deep neural networks, supporting CPU and GPU execution. It uses Keras&lt;br /&gt;
as a high-level API to help the user in constructing neural network models.&lt;br /&gt;
Distributed model training is possible through a TensorFlow native implementation and&lt;br /&gt;
[https://horovod.ai/ horovod].&lt;br /&gt;
&lt;br /&gt;
For more information on general setup and (distributed) machine learning, check&lt;br /&gt;
out: [[TensorFlow|TensorFlow in HPC]]&lt;br /&gt;
&lt;br /&gt;
=== Others ===&lt;br /&gt;
&lt;br /&gt;
Another framework that was popular in the past was&lt;br /&gt;
[https://mxnet.apache.org/versions/1.9.1/ MXNet], which is no longer in&lt;br /&gt;
development. Many other machine and deep learning frameworks exist, some&lt;br /&gt;
of which are tailored more specifically towards different fields of application.&lt;br /&gt;
[https://colossalai.org/ Colossal-AI] and [https://docs.mosaicml.com/en/latest/ Mosaic ML]&lt;br /&gt;
are other noteworthy mentions as frameworks for neural networks in&lt;br /&gt;
general, while [https://github.com/NVIDIA/Megatron-LM Megatron-LM] is a&lt;br /&gt;
framework meant for transformer-based large language models (LLM).&lt;br /&gt;
&lt;br /&gt;
== General setup and software environment ==&lt;br /&gt;
&lt;br /&gt;
In most cases installing a framework through a package manager like pip, when&lt;br /&gt;
using Python, is enough to get started. When GPU support is required,&lt;br /&gt;
additional software and libraries are necessary (which sometimes will be&lt;br /&gt;
installed as requirements, if not found). For Nvidia support, this includes at&lt;br /&gt;
least [https://developer.nvidia.com/cuda-toolkit CUDA] for the backend and&lt;br /&gt;
[https://developer.nvidia.com/nccl NCCL] for communication and sometimes&lt;br /&gt;
additional libraries like [https://developer.nvidia.com/cudnn cuDNN] for deep&lt;br /&gt;
neural networks and others. Similar libraries exist for other types of&lt;br /&gt;
accelerators like AMD GPUs and different ML/DL frameworks. As these libraries&lt;br /&gt;
tend to be large in size, it is advised to use pre-installed versions if&lt;br /&gt;
available. On HPC systems these are often available through the provided module&lt;br /&gt;
system and should be loaded before installing an applicable framework and each&lt;br /&gt;
time before executing a workload using that framework. &lt;br /&gt;
&lt;br /&gt;
=== Containers ===&lt;br /&gt;
&lt;br /&gt;
As dependencies between package, library and framework versions can be an issue,&lt;br /&gt;
it is sometimes a challenge to find a working combination of those. This is&lt;br /&gt;
where containers excel. Containers offer the option to provide a&lt;br /&gt;
pre-configured/pre-built copy of a configuration. One source for container&lt;br /&gt;
images is the [https://catalog.ngc.nvidia.com/containers Nvidia GPU Cloud (NGC)&lt;br /&gt;
Catalog], which offers many container images for different software to use&lt;br /&gt;
with Nvidia hardware. This also includes working environments for frameworks&lt;br /&gt;
like TensorFlow or PyTorch that are packed together with other tools and&lt;br /&gt;
libraries that improve the (GPU) performance of certain workloads within these&lt;br /&gt;
frameworks or, in the case of e.g. TensorFlow, contain a Horovod installation for&lt;br /&gt;
distributed execution.&lt;br /&gt;
&lt;br /&gt;
To make use of these containers, check which containerization software is&lt;br /&gt;
available on the cluster you are using. Possible software includes&lt;br /&gt;
[https://apptainer.org/ Apptainer], [https://github.com/NERSC/shifter Shifter]&lt;br /&gt;
or [https://www.docker.com/ Docker]. On HPC systems, the latter is most likely only&lt;br /&gt;
available through [https://github.com/NVIDIA/enroot Enroot] as an unprivileged&lt;br /&gt;
container or managed via SLURM by the [https://github.com/NVIDIA/pyxis Pyxis]&lt;br /&gt;
plugin due to security considerations. Note that the other&lt;br /&gt;
mentioned containerization tools typically support converting existing Docker&lt;br /&gt;
containers to their own container image type.&lt;br /&gt;
&lt;br /&gt;
Consult the corresponding documentation on how to run/execute a container with&lt;br /&gt;
the software in use and check for flags that might be required on the target HPC system.&lt;br /&gt;
This includes ensuring GPU availability inside the container, e.g. via&lt;br /&gt;
&amp;lt;code&amp;gt;--nv&amp;lt;/code&amp;gt; for Apptainer, or checking the set environment variables that&lt;br /&gt;
might get carried into the container environment and therefore have to be&lt;br /&gt;
either cleaned or expanded when file paths are required that are not part of the&lt;br /&gt;
container.&lt;br /&gt;
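&lt;br /&gt;
As a rough illustration, the following Python sketch assembles an Apptainer call with the flags discussed here.&lt;br /&gt;
It only builds and prints the command line; the image name, the script and the bound directories are hypothetical,&lt;br /&gt;
and the flags actually required depend on the HPC system in use.&lt;br /&gt;
&lt;br /&gt;
```python
import shlex


def build_apptainer_command(image, command, use_gpu=True, binds=(), clean_env=False):
    """Assemble an `apptainer exec` call as an argument list."""
    args = ["apptainer", "exec"]
    if use_gpu:
        args.append("--nv")  # make the host's Nvidia GPUs available in the container
    if clean_env:
        args.append("--cleanenv")  # do not carry the host environment into the container
    for host_path, container_path in binds:
        # map a host directory to a path inside the container
        args += ["--bind", f"{host_path}:{container_path}"]
    args.append(image)
    args += list(command)
    return args


cmd = build_apptainer_command(
    image="pytorch.sif",                 # hypothetical image file
    command=["python", "train.py"],      # hypothetical training script
    binds=[("/scratch/data", "/data")],  # hypothetical host and container paths
)
print(shlex.join(cmd))
```
&lt;br /&gt;
The resulting list can be passed to &amp;lt;code&amp;gt;subprocess.run&amp;lt;/code&amp;gt; or copied into a job script.&lt;br /&gt;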
&lt;br /&gt;
In some cases it might be required to build your own containers or build upon&lt;br /&gt;
existing ones, but it is strongly recommended to use containers provided on&lt;br /&gt;
the HPC systems to avoid unnecessary duplication of container images on file&lt;br /&gt;
systems. Consult the applicable guidelines regarding the use and availability of&lt;br /&gt;
containers on the HPC system in use.&lt;br /&gt;
&lt;br /&gt;
==== Expanding containers without rebuilding ====&lt;br /&gt;
&lt;br /&gt;
If a container in use does not include all required packages, different options&lt;br /&gt;
exist that do not require rebuilding the image. One option provided by tools&lt;br /&gt;
like Apptainer is persistent overlays. While containers are typically&lt;br /&gt;
read-only file systems, persistent overlays are sandboxed file systems layered&lt;br /&gt;
on top of the container, making additional software and packages&lt;br /&gt;
available to the containerized software. Read more about persistent overlays&lt;br /&gt;
[https://apptainer.org/docs/user/main/persistent_overlays.html here].&lt;br /&gt;
&lt;br /&gt;
In the case of Python, a second option is available through the additional use of&lt;br /&gt;
virtual environments. If packages are missing inside the container, they can be&lt;br /&gt;
installed in a separate virtual environment. Ensure that the Python version&lt;br /&gt;
used to create the environment matches the version inside the container;&lt;br /&gt;
otherwise compatibility issues are possible. The path to the virtual&lt;br /&gt;
environment can then be appended to the &amp;lt;code&amp;gt;PYTHONPATH&amp;lt;/code&amp;gt; environment&lt;br /&gt;
variable and passed to the container when executing it. This allows the&lt;br /&gt;
containerized software to find packages installed in the virtual&lt;br /&gt;
environment.&lt;br /&gt;
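&lt;br /&gt;
The following Python sketch shows one way to derive such a &amp;lt;code&amp;gt;PYTHONPATH&amp;lt;/code&amp;gt; value.&lt;br /&gt;
It assumes the usual POSIX layout of a virtual environment (lib/pythonX.Y/site-packages); the environment path&lt;br /&gt;
and the Python version are hypothetical, and how the variable is handed to the container depends on the container tool.&lt;br /&gt;
&lt;br /&gt;
```python
import os
from pathlib import Path


def venv_site_packages(venv_dir, python_version):
    """Return the site-packages directory of a POSIX virtual environment."""
    return Path(venv_dir) / "lib" / f"python{python_version}" / "site-packages"


def extend_pythonpath(site_packages, current=""):
    """Append a directory to an existing PYTHONPATH value."""
    parts = [p for p in current.split(os.pathsep) if p]
    parts.append(str(site_packages))
    return os.pathsep.join(parts)


# Hypothetical environment location; the version must match the container's Python.
site_packages = venv_site_packages("/home/user/venvs/extra", "3.10")
pythonpath = extend_pythonpath(site_packages, os.environ.get("PYTHONPATH", ""))
print(pythonpath)
```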
&lt;br /&gt;
=== Virtual environments ===&lt;br /&gt;
&lt;br /&gt;
Virtual environments allow separating different package installations to&lt;br /&gt;
account for dependencies between package versions, enabling separation of&lt;br /&gt;
framework installations for better maintenance and compatibility. The following&lt;br /&gt;
will cover virtual environments for Python installations. Virtualenv is a tool&lt;br /&gt;
that allows creating virtual environments. A version with a reduced, but for&lt;br /&gt;
most cases sufficient, feature set is integrated in the Python&lt;br /&gt;
&amp;lt;code&amp;gt;venv&amp;lt;/code&amp;gt; module. Before creating a virtual environment ensure that the&lt;br /&gt;
desired Python version is loaded.&lt;br /&gt;
A virtual environment can be created and activated using the following commands:&lt;br /&gt;
&lt;br /&gt;
    $ python -m venv path/to/venv          # create the venv&lt;br /&gt;
    $ source path/to/venv/bin/activate     # activate the venv&lt;br /&gt;
&lt;br /&gt;
Once the environment is activated, all package installations are performed&lt;br /&gt;
inside the virtual environment. If applicable, ensure that required&lt;br /&gt;
dependencies such as CUDA libraries are loaded before installing packages. To&lt;br /&gt;
run a job with a virtual environment, load all necessary modules, activate the&lt;br /&gt;
virtual environment with &amp;lt;code&amp;gt;source&amp;lt;/code&amp;gt; and execute the desired command,&lt;br /&gt;
all inside the job script.&lt;br /&gt;
&lt;br /&gt;
Be aware that virtual environments create overhead in the form of around 50,000&lt;br /&gt;
files, which may make them unsuitable for file systems like Lustre, where file&lt;br /&gt;
quotas are often enforced. Where suitable, prefer the provided containers to&lt;br /&gt;
avoid unnecessary duplication of packages.&lt;br /&gt;
&lt;br /&gt;
== Possible workloads ==&lt;br /&gt;
&lt;br /&gt;
This section provides rough guidelines for selecting hardware suited to the&lt;br /&gt;
desired workload.&lt;br /&gt;
&lt;br /&gt;
=== Training and fine-tuning ===&lt;br /&gt;
&lt;br /&gt;
Training and fine-tuning (referred to simply as training in this section) of&lt;br /&gt;
machine learning models, especially (deep) neural networks, are compute- and&lt;br /&gt;
memory-intensive tasks. They involve loading (large) datasets and are well&lt;br /&gt;
suited to GPUs, as they benefit from accelerated matrix-matrix&lt;br /&gt;
multiplications, which are at the core of many neural network computations.&lt;br /&gt;
Depending on the field of application, such as computer vision or natural&lt;br /&gt;
language processing, models vary heavily in the number of trainable parameters&lt;br /&gt;
and therefore in the amount of GPU memory needed to fit on a device. Methods to&lt;br /&gt;
run models that exceed the available memory of a single GPU are covered in the&lt;br /&gt;
sub-section on distributed training. The required amount of GPU memory depends&lt;br /&gt;
on the chosen model, the optimizer and the precision (FP32, FP16, BF16, or&lt;br /&gt;
mixed FP32+FP16 precision). For example, training a Llama 2 7B large language&lt;br /&gt;
model with 7 billion parameters in FP32 precision requires roughly 112 GB,&lt;br /&gt;
depending on the optimizer: with Adam, each parameter needs 4 bytes for the&lt;br /&gt;
weight, 4 bytes for the gradient and 8 bytes for the optimizer states.&lt;br /&gt;
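&lt;br /&gt;
The 112 GB figure can be reproduced with a short back-of-the-envelope&lt;br /&gt;
calculation. The sketch below assumes FP32 weights and gradients (4 bytes each)&lt;br /&gt;
and an Adam-style optimizer that keeps two FP32 values per parameter (8 bytes).&lt;br /&gt;
The function name is hypothetical, and activations and batch data are&lt;br /&gt;
deliberately left out, so the result is a lower bound.&lt;br /&gt;

```python
GB = 1e9  # decimal gigabyte, matching the rough figures in the text


def training_memory_gb(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Rough lower bound: weights + gradients + optimizer state, no activations."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / GB


# Assumption: an Adam-style optimizer keeps two FP32 states per parameter.
adam_fp32 = training_memory_gb(7e9)
# Assumption: plain SGD without momentum keeps no optimizer state.
sgd_fp32 = training_memory_gb(7e9, optimizer_bytes=0)

print(f"Adam, FP32: {adam_fp32:.0f} GB")
print(f"SGD,  FP32: {sgd_fp32:.0f} GB")
```

Changing the byte counts shows how strongly the choice of optimizer and&lt;br /&gt;
precision affects the estimate.&lt;br /&gt;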
&lt;br /&gt;
Training is performed in so-called batches: multiple training samples are fed&lt;br /&gt;
into the model and the gradients of the entire batch are aggregated before the&lt;br /&gt;
parameters are updated, which increases both training throughput and model&lt;br /&gt;
accuracy. While a dataset is most commonly first loaded into the main memory of&lt;br /&gt;
the system, the data samples of each training batch need to be copied to the&lt;br /&gt;
GPU and therefore must also be taken into account when estimating the required&lt;br /&gt;
memory. It is often necessary to experiment with different batch sizes to find&lt;br /&gt;
a balance between memory usage, training speed and the accuracy of the final&lt;br /&gt;
model. If a certain batch size is required for a desired outcome but does not&lt;br /&gt;
fit into memory, techniques like gradient accumulation can be used to trade&lt;br /&gt;
additional computational overhead for improved model performance by&lt;br /&gt;
aggregating the gradients of multiple smaller batches before updating the&lt;br /&gt;
parameters, instead of updating after each batch.&lt;br /&gt;
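&lt;br /&gt;
Gradient accumulation can be illustrated without any framework. The sketch&lt;br /&gt;
below uses a toy one-parameter model y = w * x with a mean squared error loss;&lt;br /&gt;
the model, the data and the function names are assumptions made for&lt;br /&gt;
illustration. With equally sized micro-batches, averaging their gradients gives&lt;br /&gt;
the same gradient as the full batch, while only one micro-batch has to be held&lt;br /&gt;
in memory at a time.&lt;br /&gt;

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, micro_batch_size):
    """Average the gradients of equally sized micro-batches before one update."""
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # Only one micro-batch needs to be resident in device memory at a time.
        total += mse_gradient(w, micro_batch)
    return total / len(micro_batches)


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 toy samples of y = 3x
w = 0.5
full = mse_gradient(w, data)
accumulated = accumulated_gradient(w, data, micro_batch_size=2)
w_new = w - 0.01 * accumulated  # a single parameter update for the whole batch
```

In a real framework the same idea is usually expressed by calling the backward&lt;br /&gt;
pass several times before a single optimizer step.&lt;br /&gt;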
&lt;br /&gt;
For considerations regarding dataset storage and loading, refer to&lt;br /&gt;
[[Machine_and_Deep_Learning_Frameworks#Handling_datasets|dataset handling]].&lt;br /&gt;
&lt;br /&gt;
==== Distributed training/fine-tuning ====&lt;br /&gt;
&lt;br /&gt;
If the model to be trained does not fit into the memory of a single GPU, or if&lt;br /&gt;
training takes too much time, the work can be distributed over multiple CPUs or&lt;br /&gt;
GPUs. While distributed training with multiple CPUs is possible, the remainder&lt;br /&gt;
of this section only considers multi-GPU use cases.&lt;br /&gt;
&lt;br /&gt;
Training over multiple devices is mainly classified into two categories, model&lt;br /&gt;
parallel and data parallel, which also can be combined to allow even better&lt;br /&gt;
usage of distributed resources.&lt;br /&gt;
These concepts will be briefly explained for the use-cases of neural networks.&lt;br /&gt;
More detailed explanations can be found at&lt;br /&gt;
[https://huggingface.co/docs/transformers/v4.15.0/en/parallelism Hugging Face]&lt;br /&gt;
and [https://colossalai.org/docs/concepts/paradigms_of_parallelism/ Colossal AI],&lt;br /&gt;
among others.&lt;br /&gt;
&lt;br /&gt;
===== Model parallelism =====&lt;br /&gt;
&lt;br /&gt;
Model parallelism addresses the problem of fitting the model parameters into&lt;br /&gt;
GPU memory. Distributing the parameters among the available GPUs reduces the&lt;br /&gt;
memory required for the model parameters per GPU and frees memory to train&lt;br /&gt;
larger models or to use larger batch sizes.&lt;br /&gt;
&lt;br /&gt;
Splitting a network vertically distributes the layers among the GPUs, so each&lt;br /&gt;
GPU only needs to store the parameters of a subset of layers. This requires&lt;br /&gt;
communication between the GPUs in both the forward and the backward pass. It&lt;br /&gt;
also leaves GPUs idle while they wait for other GPUs to finish the computations&lt;br /&gt;
on their layers and exchange information. To increase device utilization,&lt;br /&gt;
pipeline parallelism splits the batch into micro-batches, performs the&lt;br /&gt;
calculations on those micro-batches and passes results on to other GPUs while&lt;br /&gt;
still working on the remaining micro-batches.&lt;br /&gt;
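&lt;br /&gt;
The effect of micro-batches on idle time can be estimated with a simplified&lt;br /&gt;
schedule. The sketch below is an illustration under stated assumptions, not a&lt;br /&gt;
description of a specific framework: it only models the forward pass, assumes&lt;br /&gt;
that every stage needs one time step per micro-batch, and ignores communication&lt;br /&gt;
cost. All function names are hypothetical.&lt;br /&gt;

```python
def pipeline_schedule(stages, micro_batches):
    """Return {time_step: [(stage, micro_batch), ...]} for a forward pass."""
    schedule = {}
    for stage in range(stages):
        for mb in range(micro_batches):
            # Stage s can start micro-batch k once stage s-1 has finished it.
            schedule.setdefault(stage + mb, []).append((stage, mb))
    return schedule


def utilization(stages, micro_batches):
    """Fraction of stage time slots that do useful work."""
    steps = len(pipeline_schedule(stages, micro_batches))
    return micro_batches / steps


no_split = utilization(stages=4, micro_batches=1)   # whole batch at once
pipelined = utilization(stages=4, micro_batches=8)  # batch split into 8 parts
print(f"without micro-batches: {no_split:.2f}, with 8 micro-batches: {pipelined:.2f}")
```

Under these assumptions four GPUs are busy a quarter of the time without&lt;br /&gt;
micro-batches, and the utilization rises as the number of micro-batches grows.&lt;br /&gt;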
&lt;br /&gt;
Tensor parallelism offers another approach to reducing the memory requirements.&lt;br /&gt;
By splitting a tensor along one of its dimensions, it can be distributed among&lt;br /&gt;
multiple GPUs, reducing the memory required for the tensor on each GPU. The&lt;br /&gt;
partial results from each GPU are combined into one final result tensor at the&lt;br /&gt;
end.&lt;br /&gt;
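&lt;br /&gt;
As a small illustration, the sketch below splits a weight matrix column-wise&lt;br /&gt;
into two shards, multiplies the input with each shard separately and&lt;br /&gt;
concatenates the partial results. Plain Python lists stand in for device&lt;br /&gt;
tensors, and the helper names are hypothetical; a column split is only one of&lt;br /&gt;
several possible ways to partition a tensor.&lt;br /&gt;

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(matrix, parts):
    """Split a matrix into `parts` column blocks of equal width."""
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix] for p in range(parts)]


x = [[1, 2], [3, 4]]                  # activations, replicated on every device
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # weight matrix to be sharded

shards = split_columns(w, parts=2)    # each device stores half of the columns
partials = [matmul(x, shard) for shard in shards]  # computed independently

# Gather step: concatenate the column blocks of the partial results.
y = [sum((partial[r] for partial in partials), []) for r in range(len(x))]
```

The gathered result is identical to multiplying with the unsplit matrix, while&lt;br /&gt;
each device only has to store its own shard of the weights.&lt;br /&gt;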
&lt;br /&gt;
===== Data parallelism =====&lt;br /&gt;
&lt;br /&gt;
The main purpose of data parallelism is to accelerate the training process of a&lt;br /&gt;
machine learning model. By distributing the training samples across multiple&lt;br /&gt;
GPUs, each GPU needs to process fewer batches, resulting in a lower training&lt;br /&gt;
time per epoch. After each batch, all GPUs exchange their gradients through an&lt;br /&gt;
all-reduce pattern to calculate the gradients for the weight updates. Because&lt;br /&gt;
every GPU contributes its own samples, the effective batch size is scaled by&lt;br /&gt;
the number of GPUs used. This might require adjusting the per-GPU batch size to&lt;br /&gt;
still achieve the required model performance, but it can also help to reach&lt;br /&gt;
batch sizes that would not be possible on a single device. Specialized&lt;br /&gt;
optimizers like LARS (Layer-wise Adaptive Rate Scaling) are designed to perform&lt;br /&gt;
well with the large batch sizes that are reached through data parallel&lt;br /&gt;
training. Basic distributed data parallelism is supported by most ML/DL&lt;br /&gt;
frameworks, such as PyTorch and TensorFlow.&lt;br /&gt;
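&lt;br /&gt;
The gradient exchange can be sketched in plain Python. The example below is a&lt;br /&gt;
simulation on a toy model y = w * x, not a use of a real communication library:&lt;br /&gt;
the function &amp;lt;code&amp;gt;all_reduce_mean&amp;lt;/code&amp;gt; is a hypothetical stand-in for an&lt;br /&gt;
all-reduce, and the four workers are simply list entries.&lt;br /&gt;

```python
def local_gradient(w, shard):
    """Mean-squared-error gradient of y = w * x on one worker's mini-batch."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


samples = [(float(x), 2.0 * x) for x in range(1, 9)]
n_workers, per_worker_batch = 4, 2
shards = [samples[r * per_worker_batch:(r + 1) * per_worker_batch]
          for r in range(n_workers)]

w = 1.0
grads = all_reduce_mean([local_gradient(w, shard) for shard in shards])
effective_batch_size = n_workers * per_worker_batch
single_device = local_gradient(w, samples)  # same batch processed on one device
```

Every worker ends up with the gradient that a single device would have computed&lt;br /&gt;
on a batch of eight samples, which is why the effective batch size grows with&lt;br /&gt;
the number of workers.&lt;br /&gt;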
&lt;br /&gt;
===== Hybrid parallelism =====&lt;br /&gt;
&lt;br /&gt;
Often some degrees of model and data parallelism are combined to achieve better&lt;br /&gt;
training performance.&lt;br /&gt;
&lt;br /&gt;
[https://www.deepspeed.ai/tutorials/zero/ ZeRO], the zero redundancy optimizer,&lt;br /&gt;
partitions optimizer states and, at higher stages, also gradients and&lt;br /&gt;
parameters across GPUs, and can offload them to CPU memory, to both accelerate&lt;br /&gt;
training and lower memory requirements. The different stages have a different&lt;br /&gt;
impact on communication overhead, training time and memory requirements.&lt;br /&gt;
&lt;br /&gt;
[https://engineering.fb.com/2021/07/15/open-source/fsdp/ FSDP], fully sharded&lt;br /&gt;
data parallelism, is another approach to enable the training of large models&lt;br /&gt;
across multiple GPUs by sharding parameters across multiple devices. FSDP is&lt;br /&gt;
available in PyTorch.&lt;br /&gt;
&lt;br /&gt;
===== Further readings =====&lt;br /&gt;
&lt;br /&gt;
For more detailed information on distributed training for specific frameworks,&lt;br /&gt;
consult the pages below:&lt;br /&gt;
* [[PyTorch#Distributed_training|Distributed training with PyTorch]]&lt;br /&gt;
* [[TensorFlow#Distributed_training|Distributed training with TensorFlow]]&lt;br /&gt;
&lt;br /&gt;
=== Inference ===&lt;br /&gt;
&lt;br /&gt;
Inference uses a trained model to create predictions for previously unseen&lt;br /&gt;
data. It requires significantly less computational power and memory than&lt;br /&gt;
training and is therefore suited to both CPU and GPU systems, although GPU&lt;br /&gt;
systems still clearly outperform CPUs in the number of samples processed per&lt;br /&gt;
unit of time. Depending on the use case of a trained model, CPU inference might&lt;br /&gt;
be sufficient on smaller deployment systems if there is not a large amount of&lt;br /&gt;
data or if inference latency is not critical. As a reference for memory&lt;br /&gt;
requirements, a Llama 2 7B model requires about 28 GB of memory in FP32&lt;br /&gt;
precision (7 billion parameters at 4 bytes each), which is significantly less&lt;br /&gt;
than for training, as optimizer states and gradients do not have to be stored.&lt;br /&gt;
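&lt;br /&gt;
The weight memory for other precisions follows from the number of bytes per&lt;br /&gt;
parameter. The sketch below is a rough estimate that covers the model weights&lt;br /&gt;
only; the function name and the table of precisions are assumptions made for&lt;br /&gt;
illustration, and activations or caches would come on top.&lt;br /&gt;

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2}


def weight_memory_gb(n_params, precision):
    """Memory for the model weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(7e9, precision):.0f} GB")
```

Halving the precision halves the memory needed for the weights, which can&lt;br /&gt;
decide whether a model fits on a given device.&lt;br /&gt;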
&lt;br /&gt;
Inference can be sped up by using multiple devices. As long as the model fits&lt;br /&gt;
on a single device, this, unlike distributed training, does not require&lt;br /&gt;
communication or specialized algorithms/strategies. Distributed inference is&lt;br /&gt;
performed by supplying separate model instances with different data samples to&lt;br /&gt;
work on.&lt;br /&gt;
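&lt;br /&gt;
A minimal sketch of this idea is shown below. The round-robin split, the&lt;br /&gt;
placeholder &amp;lt;code&amp;gt;predict&amp;lt;/code&amp;gt; function and the sequential loop over ranks&lt;br /&gt;
are assumptions made for illustration; in practice each rank would be a&lt;br /&gt;
separate process with its own model instance on its own device.&lt;br /&gt;

```python
def shard_for_worker(samples, rank, world_size):
    """Round-robin split: worker `rank` takes every `world_size`-th sample."""
    return samples[rank::world_size]


def predict(sample):
    """Placeholder for the forward pass of a real model."""
    return sample * 2


samples = list(range(10))
world_size = 3
results = {}
for rank in range(world_size):            # in practice: one process per device
    for sample in shard_for_worker(samples, rank, world_size):
        results[sample] = predict(sample)
```

Because the shards do not overlap, the workers never need to exchange data&lt;br /&gt;
until the predictions are collected.&lt;br /&gt;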
&lt;br /&gt;
To deploy trained models, frameworks like PyTorch and TensorFlow provide&lt;br /&gt;
inference servers, [https://pytorch.org/serve/ TorchServe] and&lt;br /&gt;
[https://www.tensorflow.org/tfx/guide/serving TensorFlow Serving], that are&lt;br /&gt;
suited to production environments. Nvidia also provides an inference server,&lt;br /&gt;
[https://developer.nvidia.com/triton-inference-server Triton], which is&lt;br /&gt;
optimized for Nvidia GPUs.&lt;br /&gt;
&lt;br /&gt;
Additional libraries like [https://developer.nvidia.com/tensorrt TensorRT] can&lt;br /&gt;
increase the inference throughput by pre-compiling a trained model with&lt;br /&gt;
optimizations before deploying it in a target environment.&lt;br /&gt;
&lt;br /&gt;
== Handling datasets ==&lt;br /&gt;
&lt;br /&gt;
High-bandwidth access to the training dataset is crucial for high resource&lt;br /&gt;
utilization during model training. As datasets can require multiple TB of&lt;br /&gt;
storage and span millions of files, choosing an adequate file system is&lt;br /&gt;
important on HPC systems. HPC systems use shared file systems to provide data&lt;br /&gt;
storage to their users. Accessing datasets on those file systems can cause many&lt;br /&gt;
I/O operations and can have a significant performance impact on other users.&lt;br /&gt;
Points to consider include:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Storage quota:&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The large size of datasets presents a challenge for the storage space available&lt;br /&gt;
to a user. A space-efficient approach is to provide datasets to the cluster&lt;br /&gt;
users from central storage, which removes the need for each user to keep a copy&lt;br /&gt;
of the same dataset. Due to licensing and availability reasons, some datasets&lt;br /&gt;
cannot be provided to all users and may require special permissions or access&lt;br /&gt;
groups. If the targeted user group is too small, it may not be feasible to&lt;br /&gt;
store a dataset centrally. Additionally, users may require different&lt;br /&gt;
preprocessing steps or file formats for their datasets, which again creates the&lt;br /&gt;
need for additional copies and the space to store them.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;File systems:&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
HPC systems often provide multiple types of storage and file systems. To reduce&lt;br /&gt;
the load on those systems, it is advised to reduce the number of operations on&lt;br /&gt;
individual files. One way to achieve this is to use archives/tarballs, which&lt;br /&gt;
present the file system with a single file operation when copied to a&lt;br /&gt;
destination. The archived data then needs an additional step before it can be&lt;br /&gt;
accessed. If node-local storage is available, the archives can be copied onto&lt;br /&gt;
the nodes and unpacked, any required preprocessing can be applied, and the&lt;br /&gt;
training can then be started. If data samples must be available to multiple&lt;br /&gt;
nodes, on-demand file systems like BeeOND can improve performance and help to&lt;br /&gt;
reduce network load. The lowest network load is achieved if the datasets or&lt;br /&gt;
data samples required by the different workers fit onto the node-local storage&lt;br /&gt;
and are only required by the devices of that node. Depending on the disk space,&lt;br /&gt;
the required number of nodes and the number of available devices, this may not&lt;br /&gt;
be possible. This approach also needs to be adapted to the needs of the&lt;br /&gt;
training. If sampling, shuffling or other operations on the data samples are&lt;br /&gt;
required, it might not be possible to simply copy and unpack archives on&lt;br /&gt;
different nodes, as this might not provide the required distribution.&lt;br /&gt;
Libraries that may help in these cases include:&lt;br /&gt;
* [https://github.com/mxmlnkn/ratarmount ratarmount]&lt;br /&gt;
* [https://github.com/webdataset/webdataset webdataset]&lt;br /&gt;
* [https://datadings.readthedocs.io/en/stable/ datadings]&lt;br /&gt;
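&lt;br /&gt;
The archive approach described above can be sketched with the Python standard&lt;br /&gt;
library alone. In the example below, temporary directories stand in for the&lt;br /&gt;
shared file system and the node-local storage, and the function names, the file&lt;br /&gt;
names and the number of samples are hypothetical. It does not use any of the&lt;br /&gt;
libraries listed above.&lt;br /&gt;

```python
import tarfile
import tempfile
from pathlib import Path


def pack_dataset(sample_dir, archive_path):
    """Bundle a directory of many small files into a single tar archive."""
    with tarfile.open(str(archive_path), "w") as archive:
        archive.add(str(sample_dir), arcname="dataset")
    return archive_path


def stage_dataset(archive_path, local_dir):
    """Unpack the archive on node-local storage and return the dataset path."""
    with tarfile.open(str(archive_path), "r") as archive:
        archive.extractall(str(local_dir))
    return Path(local_dir) / "dataset"


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp) / "shared" / "samples"   # stands in for the shared file system
    shared.mkdir(parents=True)
    for i in range(100):
        (shared / f"sample_{i:03d}.txt").write_text(str(i))

    # One archive is transferred instead of 100 individual files.
    archive = pack_dataset(shared, Path(tmp) / "shared" / "dataset.tar")
    local = stage_dataset(archive, Path(tmp) / "node_local")
    n_staged = len(list(local.glob("*.txt")))
```

On a real system the packing would typically be done once in advance, and only&lt;br /&gt;
the staging step would run at the start of each job.&lt;br /&gt;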
&lt;br /&gt;
&lt;br /&gt;
=== Improving I/O performance for parallel single node runs ===&lt;br /&gt;
&lt;br /&gt;
The following assumes a GPU node with multiple GPUs and a user who runs many&lt;br /&gt;
trainings simultaneously in different jobs, each using one GPU. When training&lt;br /&gt;
multiple instances of the same model, or different models on the same dataset,&lt;br /&gt;
it can be beneficial to consolidate multiple jobs into a single job. This makes&lt;br /&gt;
it possible to use local SSDs (if available) through on-demand file systems&lt;br /&gt;
like BeeOND. The dataset only needs to be transferred over the network onto the&lt;br /&gt;
local storage once and can then be accessed by multiple training instances.&lt;br /&gt;
This reduces the load on the network and also improves the access time to the&lt;br /&gt;
data samples. For a node with four GPUs, this means submitting one four-GPU job&lt;br /&gt;
instead of four one-GPU jobs, requesting the available on-demand file system,&lt;br /&gt;
and launching multiple independent training instances on the same dataset, each&lt;br /&gt;
on a single GPU.&lt;/div&gt;</summary>
		<author><name>Jannis-klinkenberg-0962@rwth-aachen.de</name></author>
	</entry>
</feed>