<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>infrastructure | DAILY ZSOCIAL MEDIA NEWS</title>
	<atom:link href="https://dailyzsocialmedianews.com/tag/infrastructure/feed/" rel="self" type="application/rss+xml" />
	<link>https://dailyzsocialmedianews.com</link>
	<description>ALL ABOUT DAILY ZSOCIAL MEDIA NEWS</description>
	<lastBuildDate>Tue, 12 Mar 2024 15:22:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.7.1</generator>

<image>
	<url>https://dailyzsocialmedianews.com/wp-content/uploads/2020/12/cropped-DAILY-ZSOCIAL-MEDIA-NEWS-e1607166156946-32x32.png</url>
	<title>infrastructure | DAILY ZSOCIAL MEDIA NEWS</title>
	<link>https://dailyzsocialmedianews.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Constructing Meta’s GenAI Infrastructure &#8211; Engineering at Meta</title>
		<link>https://dailyzsocialmedianews.com/constructing-metas-genai-infrastructure-engineering-at-meta/</link>
		
		<dc:creator><![CDATA[]]></dc:creator>
		<pubDate>Tue, 12 Mar 2024 15:22:58 +0000</pubDate>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Building]]></category>
		<category><![CDATA[Engineering]]></category>
		<category><![CDATA[GenAI]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[Meta]]></category>
		<category><![CDATA[Metas]]></category>
		<guid isPermaLink="false">https://dailyzsocialmedianews.com/?p=24857</guid>

					<description><![CDATA[<div style="margin-bottom:20px;"><img width="1023" height="733" src="https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta.png" class="attachment-post-thumbnail size-post-thumbnail wp-post-image" alt="Building Meta’s GenAI Infrastructure - Engineering at Meta" decoding="async" fetchpriority="high" srcset="https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta.png 1023w, https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta-300x215.png 300w, https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta-768x550.png 768w" sizes="(max-width: 1023px) 100vw, 1023px" /></div><p>Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We are strongly committed to open [&#8230;]</p>
The post <a href="https://dailyzsocialmedianews.com/constructing-metas-genai-infrastructure-engineering-at-meta/">Constructing Meta’s GenAI Infrastructure – Engineering at Meta</a> first appeared on <a href="https://dailyzsocialmedianews.com">DAILY ZSOCIAL MEDIA NEWS</a>.]]></description>
										<content:encoded><![CDATA[<div style="margin-bottom:20px;"><img width="1023" height="733" src="https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta.png" class="attachment-post-thumbnail size-post-thumbnail wp-post-image" alt="Building Meta’s GenAI Infrastructure - Engineering at Meta" decoding="async" srcset="https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta.png 1023w, https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta-300x215.png 300w, https://social-media-news.s3.amazonaws.com/wp-content/uploads/2024/03/12152256/Building-Metas-GenAI-Infrastructure-Engineering-at-Meta-768x550.png 768w" sizes="(max-width: 1023px) 100vw, 1023px" /></div><p></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">We are strongly committed to open compute and open source. We built these clusters on top of </span><span style="font-weight: 400;">Grand Teton</span><span style="font-weight: 400;">, </span><span style="font-weight: 400;">OpenRack</span><span style="font-weight: 400;">, and </span><span style="font-weight: 400;">PyTorch</span><span style="font-weight: 400;"> and continue to push open innovation across the industry.</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">This announcement is one step in our ambitious infrastructure roadmap. By the end of 2024, we aim to continue growing our infrastructure build-out, which will include 350,000 NVIDIA H100 GPUs as part of a portfolio featuring compute power equivalent to nearly 600,000 H100s.</span></li>
</ul>
<p><span style="font-weight: 400;">To lead in developing AI means leading investments in hardware infrastructure. Hardware infrastructure plays an important role in AI’s future. Today, we’re sharing details on two versions of our </span><span style="font-weight: 400;">24,576-GPU data-center-scale cluster at Meta. These clusters support our current and next-generation AI models, including Llama 3, the successor to</span> <span style="font-weight: 400;">Llama 2</span><span style="font-weight: 400;">, our publicly released LLM, as well as AI research and development across GenAI and other areas.</span></p>
<h2><span style="font-weight: 400;">A peek into Meta’s large-scale AI clusters</span></h2>
<p><span style="font-weight: 400;">Meta’s long-term vision is to build artificial general intelligence (AGI) that is open and built responsibly so that it can be widely available for everyone to benefit from. As we work towards AGI, we have also worked on scaling our clusters to power this ambition. The progress we make towards AGI creates new products,</span> <span style="font-weight: 400;">new AI features for our family of apps</span><span style="font-weight: 400;">, and new AI-centric computing devices. </span></p>
<p><span style="font-weight: 400;">While we’ve had a long history of building AI infrastructure, we first shared details on our </span><span style="font-weight: 400;">AI Research SuperCluster (RSC)</span><span style="font-weight: 400;">, featuring 16,000 NVIDIA A100 GPUs, in 2022. RSC has accelerated our open and responsible AI research by helping us build our first generation of advanced AI models. It played and continues to play an important role in the development of </span><span style="font-weight: 400;">Llama</span><span style="font-weight: 400;"> and </span><span style="font-weight: 400;">Llama 2</span><span style="font-weight: 400;">, as well as advanced AI models for applications ranging from computer vision, NLP, and speech recognition, to</span> <span style="font-weight: 400;">image generation</span><span style="font-weight: 400;">, and even</span> <span style="font-weight: 400;">coding</span><span style="font-weight: 400;">.</span></p>
<h2><span style="font-weight: 400;">Under the hood</span></h2>
<p><span style="font-weight: 400;">Our newer AI clusters build upon the successes and lessons learned from RSC. We focused on building end-to-end AI systems with a major emphasis on researcher and developer experience and productivity. The efficiency of the high-performance network fabrics within these clusters, some of the key storage decisions, combined with the 24,576 NVIDIA Tensor Core H100 GPUs in each, allow both cluster versions to support models larger and more complex than could be supported in RSC, and pave the way for advancements in GenAI product development and AI research.</span></p>
<h3><span style="font-weight: 400;">Network</span></h3>
<p><span style="font-weight: 400;">At Meta, we handle hundreds of trillions of AI model executions per day. Delivering these services at a large scale requires a highly advanced and flexible infrastructure. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently. </span></p>
<p><span style="font-weight: 400;">With this in mind, we built one cluster with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution based on the </span><span style="font-weight: 400;">Arista 7800</span><span style="font-weight: 400;"> with </span><span style="font-weight: 400;">Wedge400</span><span style="font-weight: 400;"> and </span><span style="font-weight: 400;">Minipack2</span><span style="font-weight: 400;"> OCP rack switches. The other cluster features an </span><span style="font-weight: 400;">NVIDIA Quantum2 InfiniBand</span><span style="font-weight: 400;"> fabric. Both of these solutions interconnect 400 Gbps endpoints. With these two, we are able to assess the suitability and scalability of these </span><span style="font-weight: 400;">different types of interconnect for large-scale training,</span><span style="font-weight: 400;"> giving us more insights that will help inform how we design and build even larger, scaled-up clusters in the future. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.</span></p>
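<p>As a rough, back-of-envelope illustration of what a 400 Gbps endpoint implies for training traffic (the model size and efficiency factor below are illustrative assumptions, not measured figures from these clusters):</p>

```python
def transfer_time_seconds(payload_bytes: float, link_gbps: float,
                          efficiency: float = 1.0) -> float:
    """Time to move a payload over one link at a given nominal line rate.

    `efficiency` discounts protocol overhead; real RoCE/InfiniBand
    efficiency depends on MTU, congestion control, and the collective
    algorithm, so 1.0 here is an idealized assumption.
    """
    bits = payload_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency)

# Illustration: one full set of bf16 gradients for a 70B-parameter model
# (2 bytes/param = 140 GB) pushed over a single 400 Gbps link.
grad_bytes = 70e9 * 2
t = transfer_time_seconds(grad_bytes, link_gbps=400)
print(f"{t:.2f} s per full gradient exchange over one 400 Gbps link")  # 2.80 s
```

<p>In practice, communication is overlapped with compute and spread across many links at once, which is why fabric efficiency and collective-algorithm design, not raw line rate, dominate at cluster scale.</p>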
<h3><span style="font-weight: 400;">Compute</span></h3>
<p><span style="font-weight: 400;">Both clusters are built using</span> <span style="font-weight: 400;">Grand Teton</span><span style="font-weight: 400;">, our in-house-designed, open GPU hardware platform that we’ve contributed to the Open Compute Project (OCP). Grand Teton builds on the many generations of AI systems that integrate power, control, compute, and fabric interfaces into a single chassis for better overall performance, signal integrity, and thermal performance. It provides rapid scalability and flexibility in a simplified design, allowing it to be quickly deployed into data center fleets and easily maintained and scaled. Combined with other in-house innovations like our</span> <span style="font-weight: 400;">Open Rack</span><span style="font-weight: 400;"> power and rack architecture, Grand Teton allows us to build new clusters in a way that is purpose-built for current and future applications at Meta.</span></p>
<p><span style="font-weight: 400;">We have been openly designing our GPU hardware platforms beginning with our </span><span style="font-weight: 400;">Big Sur platform in 2015</span><span style="font-weight: 400;">.</span></p>
<h3><span style="font-weight: 400;">Storage</span></h3>
<p><span style="font-weight: 400;">Storage plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of image, video, and text data, the need for data storage grows rapidly. The need to fit all that data storage into a performant, yet power-efficient footprint doesn’t go away though, which makes the problem more interesting.</span></p>
<p><span style="font-weight: 400;">Our storage deployment addresses the data and checkpointing needs of the AI clusters via a home-grown Linux Filesystem in Userspace (FUSE) API backed by a version of Meta’s </span><span style="font-weight: 400;">‘Tectonic’ distributed storage solution</span><span style="font-weight: 400;"> optimized for Flash media. This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a </span><span style="font-weight: 400;">challenge</span><span style="font-weight: 400;"> for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading.</span></p>
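<p>The "synchronized checkpoints across thousands of GPUs" requirement can be sketched with a toy sharded layout: one file per rank per step, written atomically, with a checkpoint counting as complete only once every rank's shard has landed. This is an illustrative sketch of the general pattern only; the actual Tectonic/FUSE layout is not public, and the path scheme here is hypothetical.</p>

```python
import os
import tempfile

def shard_path(root: str, step: int, rank: int) -> str:
    # One file per rank per step -- a common sharded-checkpoint layout.
    return os.path.join(root, f"step_{step:08d}", f"rank_{rank:05d}.ckpt")

def save_shard(root: str, step: int, rank: int, payload: bytes) -> None:
    path = shard_path(root, step, rank)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Write-then-rename so a reader never observes a partially written shard.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
    os.replace(tmp, path)

def checkpoint_complete(root: str, step: int, world_size: int) -> bool:
    # A checkpoint is usable only when every rank's shard exists.
    return all(os.path.exists(shard_path(root, step, r))
               for r in range(world_size))

root = tempfile.mkdtemp()
for rank in range(4):
    save_shard(root, step=100, rank=rank, payload=b"\x00" * 16)
print(checkpoint_complete(root, 100, world_size=4))  # True
```

<p>At real scale the hard part is making thousands of such writes land in a tight window without saturating the storage fabric, which is exactly where a Flash-optimized backend matters.</p>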
<p><span style="font-weight: 400;">We have also partnered with </span><span style="font-weight: 400;">Hammerspace</span><span style="font-weight: 400;"> to co-develop and land a parallel network file system (NFS) deployment to meet the developer experience requirements for this AI cluster. Among other benefits, Hammerspace enables engineers to perform interactive debugging for jobs using thousands of GPUs as code changes are immediately accessible to all nodes within the environment. When paired together, the combination of our Tectonic distributed storage solution and Hammerspace enable fast iteration velocity without compromising on scale.     </span></p>
<p><span style="font-weight: 400;">The storage deployments in our GenAI clusters, both Tectonic- and Hammerspace-backed, are based on the </span><span style="font-weight: 400;">YV3 Sierra Point server platform</span><span style="font-weight: 400;">, upgraded with the latest high-capacity E1.S SSD we can procure in the market today. Aside from the higher SSD capacity, the number of servers per rack was customized to achieve the right balance of throughput capacity per server, rack count reduction, and associated power efficiency. Utilizing the OCP servers as Lego-like building blocks, our storage layer is able to flexibly scale to future requirements in this cluster as well as in future, bigger AI clusters, while being fault-tolerant to day-to-day infrastructure maintenance operations.</span></p>
<h3><span style="font-weight: 400;">Performance</span></h3>
<p><span style="font-weight: 400;">One of the principles we have in building our large-scale AI clusters is to maximize performance and ease of use simultaneously without compromising one for the other. This is an important principle in creating the best-in-class AI models. </span></p>
<p><span style="font-weight: 400;">As we push the limits of AI systems, the best way we can test our ability to scale-up our designs is to simply build a system, optimize it, and actually test it (while simulators help, they only go so far). In this design journey, we compared the performance seen in our small clusters and with large clusters to see where our bottlenecks are. In the graph below, AllGather collective performance is shown (as normalized bandwidth on a 0-100 scale) when a large number of GPUs are communicating with each other at message sizes where roofline performance is expected. </span></p>
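<p>The "normalized bandwidth" in plots like this typically follows the nccl-tests convention of reporting bus bandwidth: the measured algorithm bandwidth is scaled so it can be compared directly against a link's peak, regardless of which collective is running. A minimal sketch, assuming the nccl-tests definitions rather than any Meta-specific metric:</p>

```python
def allgather_bus_bandwidth(out_bytes: float, n_ranks: int,
                            elapsed_s: float) -> float:
    """Bus bandwidth (GB/s) for an AllGather, per the nccl-tests
    convention: algBW = output_size / time, busBW = algBW * (n-1)/n.
    busBW is hardware-comparable, so it is what roofline-style
    "fraction of peak" plots usually report."""
    alg_bw = out_bytes / elapsed_s            # bytes/s over the full output
    bus_bw = alg_bw * (n_ranks - 1) / n_ranks # scale by data each link carries
    return bus_bw / 1e9

# Example: 8 ranks gather a 1 GiB combined output in 25 ms.
print(round(allgather_bus_bandwidth(2**30, 8, 0.025), 1))  # 37.6
```

<p>Dividing that busBW by the fabric's per-endpoint peak gives the 0-100 normalized scale described above.</p>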
<p><span style="font-weight: 400;">Our out-of-box performance for large clusters was initially poor and inconsistent compared to optimized small-cluster performance. To address this, we made several changes to how our internal job scheduler schedules jobs with network topology awareness – this resulted in latency benefits and minimized the amount of traffic going to upper layers of the network. We also optimized our network routing strategy in combination with NVIDIA Collective Communications Library (NCCL) changes to achieve optimal network utilization. This helped our large clusters achieve the expected performance, on par with our small clusters.</span></p>
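<p>The topology-aware scheduling idea can be illustrated with a toy greedy placement: packing a job's ranks into as few racks as possible keeps most collective traffic below the upper network layers. This is a sketch of the general technique, not Meta's scheduler; rack names and capacities are hypothetical.</p>

```python
def place_job(n_ranks: int, free_slots: dict) -> dict:
    """Greedy topology-aware placement: fill the racks with the most
    free GPU slots first, so the job spans the fewest racks and
    cross-rack (spine) traffic is minimized."""
    placement = {}
    remaining = n_ranks
    # Prefer racks that can absorb the most ranks.
    for rack, slots in sorted(free_slots.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(slots, remaining)
        if take:
            placement[rack] = take
            remaining -= take
    if remaining:
        raise RuntimeError("not enough free GPUs")
    return placement

# 16 ranks over racks with uneven headroom: two racks suffice.
print(place_job(16, {"rack_a": 4, "rack_b": 12, "rack_c": 8}))
# -> {'rack_b': 12, 'rack_c': 4}
```

<p>Real schedulers also weigh fragmentation, fault domains, and job priorities, but the core trade-off, locality versus bin-packing flexibility, is the same.</p>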
<p><img decoding="async" class="size-large wp-image-21048" src="https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png?w=1024" alt="" width="1024" height="768" srcset="https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png 1999w, https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png?resize=916,687 916w, https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png?resize=768,576 768w, https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png?resize=1024,768 1024w, https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png?resize=1536,1152 1536w, https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png?resize=96,72 96w, https://engineering.fb.com/wp-content/uploads/2024/03/Meta-24K-GenAi-clusters-performance.png?resize=192,144 192w" sizes="(max-width: 992px) 100vw, 62vw"/>In the figure we see that small cluster performance (overall communication bandwidth and utilization) reaches 90%+ out of the box, but an unoptimized large cluster performance has very poor utilization, ranging from 10% to 90%. After we optimize the full system (software, network, etc.), we see large cluster performance return to the ideal 90%+ range.</p>
<p><span style="font-weight: 400;">In addition to software changes targeting our internal infrastructure, we worked closely with teams authoring training frameworks and models to adapt to our evolving infrastructure. For example, NVIDIA H100 GPUs open the possibility of leveraging new data types such as 8-bit floating point (FP8) for training. Fully utilizing larger clusters required investments in additional parallelization techniques, and new storage solutions provided opportunities to optimize checkpointing across thousands of ranks so that it runs in hundreds of milliseconds.</span></p>
<p><span style="font-weight: 400;">We also recognize debuggability as one of the major challenges in large-scale training. Identifying a problematic GPU that is stalling an entire training job becomes very difficult at a large scale. We’re building tools such as desync debug, or a distributed collective flight recorder, to expose the details of distributed training and help identify issues faster and more easily.</span></p>
<p><span style="font-weight: 400;">Finally, we’re continuing to evolve PyTorch, the foundational AI framework powering our AI workloads, to make it ready for training on tens, or even hundreds, of thousands of GPUs. We have identified multiple bottlenecks in process group initialization, and reduced the startup time from sometimes hours down to minutes.</span></p>
<h2><span style="font-weight: 400;">Commitment to open AI innovation</span></h2>
<p><span style="font-weight: 400;">Meta maintains its commitment to open innovation in AI software and hardware. We believe open-source hardware and software will always be a valuable tool to help the industry solve problems at large scale.</span></p>
<p><span style="font-weight: 400;">Today, we continue to support</span> <span style="font-weight: 400;">open hardware innovation</span><span style="font-weight: 400;"> as a founding member of OCP, where we make designs like Grand Teton and Open Rack available to the OCP community. We also continue to be the largest and primary contributor to </span><span style="font-weight: 400;">PyTorch</span><span style="font-weight: 400;">, the AI software framework that is powering a large chunk of the industry.</span></p>
<p><span style="font-weight: 400;">We also continue to be committed to open innovation in the AI research community. We’ve launched the</span> <span style="font-weight: 400;">Open Innovation AI Research Community</span><span style="font-weight: 400;">, a partnership program for academic researchers to deepen our understanding of how to responsibly develop and share AI technologies – with a particular focus on LLMs.</span></p>
<p><span style="font-weight: 400;">An open approach to AI is not new for Meta. We’ve also launched the </span><span style="font-weight: 400;">AI Alliance</span><span style="font-weight: 400;">, a group of leading organizations across the AI industry focused on accelerating responsible innovation in AI within an open community. Our AI efforts are built on a philosophy of open science and cross-collaboration. An open ecosystem brings transparency, scrutiny, and trust to AI development and leads to innovations that everyone can benefit from that are built with safety and responsibility top of mind. </span></p>
<h2><span style="font-weight: 400;">The future of Meta’s AI infrastructure</span></h2>
<p><span style="font-weight: 400;">These two AI training cluster designs are a part of our larger roadmap for the future of AI. By the end of 2024, we aim to continue growing our infrastructure build-out, which will include 350,000 NVIDIA H100s as part of a portfolio featuring compute power equivalent to nearly 600,000 H100s.</span></p>
<p><span style="font-weight: 400;">As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow’s needs. That’s why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond. Our goal is to create systems that are flexible and reliable to support the fast-evolving new models and research.  </span></p>The post <a href="https://dailyzsocialmedianews.com/constructing-metas-genai-infrastructure-engineering-at-meta/">Constructing Meta’s GenAI Infrastructure – Engineering at Meta</a> first appeared on <a href="https://dailyzsocialmedianews.com">DAILY ZSOCIAL MEDIA NEWS</a>.]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
