How to Build AI or ML Farms – Without Breaking the Bank

Written by Richard Jonker, VP Commercial Business Development

Artificial Intelligence (AI) and Machine Learning (ML) are among the biggest trends in information technology. While the benefits are clear, the complexity of building a fast AI setup can be overwhelming. NETGEAR can help you with crucial components for building an optimized AI cluster.

The cluster is where things can get really expensive, limit AI performance, or both. Let’s run through the typical hardware components and determine where you can optimize for the best performance-to-cost ratio.

Servers

These are the workhorses of an AI server farm, and you need a lot of them. These servers typically have powerful CPUs and are often equipped with GPUs or specialized accelerators like TPUs (Tensor Processing Units) that are designed specifically for the parallel processing tasks common in machine learning and deep learning. There is no compromising here: the raw compute power of the servers will make or break your AI cluster.

Storage Systems

AI applications often require access to large datasets. Storage solutions in a server farm can include SSDs for fast access, HDDs for larger, less frequently accessed data, and network-attached storage (NAS) or storage area networks (SAN) for shared storage. Fortunately, these systems have become commoditized, and there are plenty of choices for any budget.

Networking Hardware

High-bandwidth, low-latency switches are crucial for handling the intense traffic demands of a server farm. They are often the bottleneck for data transport in an AI setup. Since you can’t compromise on throughput or latency, this is an area where you may find NETGEAR’s new M4350 series of 10GbE/100GbE network switches to be a particular lifesaver.

Not only do these switches run on the most modern low-latency, high-performance silicon, they are also built with simple management and cost-conscious customers in mind. Typical enterprise data center switches are often priced out of reach, which makes them a poor fit for budget-conscious builds.
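To put the difference between 10GbE and 100GbE in perspective, here is a minimal back-of-the-envelope sketch in Python. It computes the ideal time to move a dataset over a single link at line rate. The function name `transfer_seconds`, the `efficiency` parameter, and the 1 TB dataset size are assumptions chosen for illustration; real transfers will be slower because of protocol overhead, storage speed, and contention.

```python
def transfer_seconds(dataset_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Ideal time in seconds to move dataset_bytes over one link.

    link_gbps is the nominal link speed in gigabits per second.
    efficiency (0 < efficiency <= 1) is a hypothetical fudge factor for overhead.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = dataset_bytes * 8  # bytes to bits
    return bits / (link_gbps * 1e9 * efficiency)


ONE_TB = 1e12  # illustrative dataset size: one decimal terabyte

for speed in (10, 100):
    print(f"{speed}GbE: {transfer_seconds(ONE_TB, speed):.0f} s")
```

At line rate, the same 1 TB dataset takes 800 seconds on a 10GbE link and 80 seconds on a 100GbE link, which is why link speed matters so much when servers exchange large datasets.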

Routers manage traffic between the server farm and the wider internet or other networks. The same applies here: they can become the bottleneck for traffic to and from the internet. Cost-effective alternatives, such as NETGEAR’s PR60X Professional Router with multi-gig/10gig WAN/LAN performance, are available.

Network interface cards (NICs), possibly with 10GbE/100GbE throughput, are essential for fast communication between servers. They need to be optimized for the switches you choose. NETGEAR’s engineering team can help you design an optimal network setup with these components.
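The bottleneck idea that runs through this section can be sketched in a few lines of Python: the end-to-end throughput of a path is capped by its slowest hop. The hop names and speeds below are hypothetical figures chosen for illustration, not specifications of any particular product.

```python
from typing import Mapping


def path_throughput_gbps(hops: Mapping[str, float]) -> float:
    """Upper bound on end-to-end throughput: the speed of the slowest hop."""
    if not hops:
        raise ValueError("at least one hop is required")
    return min(hops.values())


def bottleneck_hop(hops: Mapping[str, float]) -> str:
    """Name of the hop that limits the path."""
    if not hops:
        raise ValueError("at least one hop is required")
    return min(hops, key=lambda name: hops[name])


# Hypothetical path from a server to the internet (speeds in Gbps).
example_path = {
    "server NIC": 100.0,
    "switch port": 100.0,
    "router WAN": 10.0,
}

print(bottleneck_hop(example_path), path_throughput_gbps(example_path))
```

In this made-up path, upgrading the NICs or switch ports further would not help traffic to and from the internet, because the 10 Gbps WAN hop sets the ceiling.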

Network management software is often forgotten but is crucial for managing network configurations, monitoring network performance, troubleshooting, and ensuring network security. NETGEAR offers a free controller, called NETGEAR Engage, to manage and monitor small or large numbers of NETGEAR Fully Managed Switches.

Examples

NETGEAR switches are in use in world-class AI/ML applications. Two examples:

  • AI/ML cluster run by a third-party data analysis company to analyze thousands of concurrent camera feeds from self-driving cars.
  • AI/ML cluster to gather, scan, analyze and combine drone camera footage for the military of a NATO member state.

With this overview, we hope to have given you a high-level idea of the prime considerations in designing an AI/ML setup and where we can help you optimize performance and cost.

When you have a draft proposal for the architecture of your cluster, please reach out to us to discuss the network design. We will design your network for free and guarantee that it functions correctly. Read more about our M4350 series of switches, suitable for AI & ML deployments.