BLOCKCHAIN CLOUD PROJECTS FOR AI


The computing needs of Machine Learning are beginning to outpace the capacity of modern-day microchips. 
Decentralized cloud projects may be the answer.  

IS MOORE’S LAW DEAD, OR DYING?

For nearly half a century, Moore’s Law, the observation that the number of transistors on an integrated circuit doubles roughly every two years with minimal rise in cost, has held true in the semiconductor industry. In a sense, this exponential growth in computing capacity is one of the main factors that brought on the AI boom we see today. The theoretical basis for AI had been around for decades, but there were two major roadblocks to its implementation: (1) the massive amounts of data needed to train and test models and (2) the tools (computing power) needed to run the math on this data.

Today’s post focuses on the latter half of this two-part equation. To provide some perspective on the amount of compute required to train an AI model, OpenAI’s GPT-3 requires more than 5 billion billion (and yes, that’s not a typo) operations per second of computation, as well as 3 trillion bytes of memory capacity. Nowadays, a single transistor in a microchip is as small as 4 nm, and it is typical for a GPU (the preferred processing unit for AI computing and crypto mining) to house billions of transistors. These incredible leaps in semiconductor technology have made it possible to meet the aforementioned needs of artificial intelligence.

However, in recent times, the increase in computing power used to train AI systems has outpaced the growth rate of Moore’s Law. At the same time, Moore’s Law itself, long considered a force of nature governing the semiconductor industry, is hitting several barriers, including quantum tunneling, heat production, and diseconomies of scale in size reduction. Gaining access to the compute power needed to train AI models is a major challenge for many companies in the space. All of these phenomena point to a single conclusion:

Recent AI computation needs are outpacing the computing capacity offered by modern microchips. The same technology that brought forth the AI boom has become its biggest bottleneck.

BREAKING FREE OF MOORE’S LAW

There are many ways that companies within the AI space are attempting to circumvent these constraints. Many of these strategies move away from shrinking each individual transistor to fit more onto a single microchip. Some notable strategies include:

  1. Parallelization: splitting large-scale AI training operations across a large number of distributed processors (a minimal sketch of this idea follows this list)

  2. Custom architectures: custom building chips solely for AI operations, rather than general purpose chips such as CPUs or GPUs

  3. Quantum computing: using special computer hardware and algorithms based on quantum mechanical phenomena to speed up computations
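To make the parallelization strategy concrete, below is a minimal, self-contained sketch of data parallelism: a batch is split across several “workers,” each computes a gradient on its own shard, and the averaged gradient drives a single weight update. The toy linear model, the worker count, and the NumPy-only setup are illustrative assumptions; real systems distribute the shards across physical accelerators using frameworks such as PyTorch or JAX.

    # Toy illustration of data parallelism: each "worker" (here just a loop
    # iteration standing in for a separate GPU) computes a gradient on its
    # shard of the batch, and the gradients are averaged before one update.
    import numpy as np

    rng = np.random.default_rng(0)

    # A linear model y = X @ w trained with mean-squared error.
    w = np.zeros(3)
    X = rng.normal(size=(64, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=64)

    def gradient(w, X_shard, y_shard):
        """Gradient of the MSE loss on one worker's shard of the batch."""
        residual = X_shard @ w - y_shard
        return 2 * X_shard.T @ residual / len(y_shard)

    n_workers = 4
    lr = 0.1
    for step in range(100):
        # Conceptually, each worker handles one shard in parallel.
        shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
        grads = [gradient(w, X_s, y_s) for X_s, y_s in shards]
        # The gradients are averaged (an "all-reduce") and applied once.
        w -= lr * np.mean(grads, axis=0)

    print(w)  # converges toward the true weights [1.0, -2.0, 0.5]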

Although these methods could allow players in the AI space to “keep up” with the computational needs of their projects, some of these technologies (quantum computing especially) are arguably far from the implementation stage, and all of them require massive deployments of capital, a luxury that only the biggest technology companies can afford.

This exacerbates the tilted “playing field” which is already a prominent issue in the AI race. “Ever-larger deployments of computational power for the training and inference of today’s largest and most powerful models” means big technology companies have an advantage over startups in the race to capture value from AI. Big Tech enjoys privileged access to computing power and the economies of scale of large data centers. Since startups do not have the capital to implement such large-scale projects, it is significantly more difficult to break into the AI scene under the current circumstances.

With current efforts to break free of Moore’s Law, startups will fall even further behind Big Tech in the race to break into AI.

THE RISE OF BLOCKCHAIN-BASED MARKETS

Of course, the aforementioned trends are mostly bad news for the AI boom, but there is good news as well: there may be an unlikely hero in this crisis, blockchain-based markets. Despite the shortage of GPUs and computing power that has hindered many AI companies, the fact of the matter is that a typical data center in the US operates at only 12-18% utilization. This means that there are massive amounts of untapped computing power which could potentially be directed towards the AI firms that need it. Several decentralized cloud projects are underway, creating a secure, lucrative environment that “encourages global participants to contribute computing power… fostering innovation and widespread adoption of AI technologies.”

So how exactly does the decentralized cloud work, and why is it such a promising solution to the computing power shortage of the AI boom? Let’s examine two companies, the Akash Network and Gensyn, which are leading efforts to improve access to compute power for AI research through the blockchain.

THE AKASH NETWORK

The Akash Network is a decentralized cloud computing platform built and implemented on the Cosmos blockchain. Its “Supercloud” connects clients, or “tenants” (AI researchers and smaller companies who need computing power for their workloads), with providers (GPU owners who have idle processing power).

When computing resources are made available, they are split into smaller modules, or “containers.” When a tenant submits a request to use these containers, providers bid on it; the lowest bid for the requested resources wins the lease. The entire transaction occurs on the Akash blockchain. This “reverse auction” system encourages competitive pricing, which in turn makes Akash an attractive alternative for customers looking for access to the most cutting-edge GPUs without the five-figure price tag.
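To illustrate the reverse-auction mechanic, here is a toy sketch in which providers bid on a tenant’s order and the lowest bid wins the lease. The class names, fields, and pricing units are invented for illustration; they are not Akash’s actual API or on-chain data structures.

    # Toy model of a reverse auction for compute leases: a tenant posts an
    # order, providers bid, and the cheapest bid wins the lease. Names and
    # fields are illustrative only, not the actual Akash protocol.
    from dataclasses import dataclass

    @dataclass
    class Bid:
        provider: str
        price_per_hour: float  # in some unit of account, e.g. USD

    def award_lease(bids: list) -> Bid:
        """Select the winning bid: the cheapest offer for the requested resources."""
        if not bids:
            raise ValueError("no bids received for this order")
        return min(bids, key=lambda b: b.price_per_hour)

    bids = [
        Bid("provider-a", 1.45),
        Bid("provider-b", 1.10),
        Bid("provider-c", 2.00),
    ]
    winner = award_lease(bids)
    print(f"lease awarded to {winner.provider} at ${winner.price_per_hour}/hour")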

GENSYN

Whereas transactions on the Akash Network are for compute time on a GPU or other processing unit, Gensyn is working to build a system that offers a more advanced service. Customers submit deep learning training tasks, which other actors on the platform run on their behalf using idle computing power from GPUs, phones, PCs, and so on. From the user’s perspective, there is no training they need to perform themselves.

A standard training task on Gensyn goes through the following process:

  1. The user submits a task, detailing three pieces of information: (1) metadata describing the task and its hyperparameters, (2) a model binary, or skeleton structure, and (3) publicly accessible, pre-processed training data.

  2. The training task is profiled and moved to a common task pool. Tasks are the smallest unit of machine learning work on the protocol; larger workloads are split into smaller tasks which are distributed throughout the network (made possible through parallelization, a concept mentioned earlier).

  3. The Solver performs the task according to the metadata submitted by the Submitter, using the model and training data supplied. As it works through the task, the Solver creates checkpoints at scheduled intervals, storing metadata from the training process at that point in time. These checkpoints are called proof-of-learning.

  4. Once the task is completed, the Solver registers it with the chain. The proof-of-learning is accessed by Verifiers, who re-run portions of the proof to ensure that the Solver actually completed the entirety of the machine learning work instead of cutting corners for easy money (a toy sketch of this checkpoint-and-verify idea follows this list).
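The checkpoint-and-verify idea in steps 3 and 4 can be sketched in a few lines: the Solver records a fingerprint of the model state at fixed intervals, and a Verifier re-runs one randomly chosen segment to confirm that it reproduces the recorded state. Everything here (the toy model, the hashing, the fixed checkpoint interval) is an illustrative assumption, not Gensyn’s actual proof-of-learning construction.

    # Simplified sketch of proof-of-learning: the solver checkpoints model
    # state while training, and a verifier re-runs one randomly chosen
    # segment to check that it reproduces the recorded checkpoint.
    import hashlib
    import numpy as np

    def train_segment(w, seed, steps, lr=0.1):
        """Deterministically run `steps` of toy SGD starting from weights `w`."""
        rng = np.random.default_rng(seed)
        w = w.copy()
        for _ in range(steps):
            X = rng.normal(size=(8, 3))
            y = X @ np.array([1.0, -2.0, 0.5])
            w -= lr * (2 * X.T @ (X @ w - y) / len(y))
        return w

    def fingerprint(w):
        return hashlib.sha256(w.tobytes()).hexdigest()

    # Solver: train and record a checkpoint after every segment.
    interval, n_segments = 50, 4
    w = np.zeros(3)
    checkpoints = [(w.copy(), fingerprint(w))]
    for seg in range(n_segments):
        w = train_segment(w, seed=seg, steps=interval)
        checkpoints.append((w.copy(), fingerprint(w)))

    # Verifier: re-run one randomly chosen segment and compare fingerprints.
    seg = int(np.random.default_rng(123).integers(n_segments))
    start_w, _ = checkpoints[seg]
    _, claimed = checkpoints[seg + 1]
    recomputed = fingerprint(train_segment(start_w, seed=seg, steps=interval))
    print("segment", seg, "verified:", recomputed == claimed)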

One of the Solvers on the platform performs the task on the user’s behalf. Gensyn uses a layer-1 trustless protocol which doesn’t require an administrative overseer or legal enforcement. Rather, smart contracts facilitate task distribution and payments programmatically.
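As a rough picture of what it means for payments to be handled programmatically, here is a plain-Python toy of an escrow that releases payment to the Solver only once verification succeeds. The state machine, names, and amounts are hypothetical and greatly simplified relative to any real smart contract.

    # Toy escrow: payment is locked when a task is posted and released to
    # the solver only after a verifier signs off on the proof-of-learning.
    # A hypothetical sketch, not Gensyn's actual contract logic.
    from enum import Enum, auto

    class TaskState(Enum):
        OPEN = auto()
        SOLVED = auto()
        PAID = auto()

    class Escrow:
        def __init__(self, submitter, payment):
            self.submitter, self.payment = submitter, payment
            self.solver = None
            self.state = TaskState.OPEN

        def submit_solution(self, solver, proof_of_learning):
            assert self.state is TaskState.OPEN
            self.solver, self.proof = solver, proof_of_learning
            self.state = TaskState.SOLVED

        def release(self, verified):
            assert self.state is TaskState.SOLVED
            if verified:
                self.state = TaskState.PAID
                return (self.solver, self.payment)  # pay the solver
            self.state = TaskState.OPEN             # rejected: reopen the task
            return None

    escrow = Escrow(submitter="alice", payment=100)
    escrow.submit_solution(solver="bob", proof_of_learning="0xabc...")
    print(escrow.release(verified=True))  # ('bob', 100)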

At this point, you might be asking, “how can a smart contract check that the Solvers on the platform aren’t cutting corners, and that they actually completed their training task in full?”

THE VERIFICATION CHALLENGE

Gensyn’s system has four main participants: Submitters, who post training tasks; Solvers, who perform the work; Verifiers, who re-run portions of the proof-of-learning; and Whistleblowers, who check the Verifiers’ work.

In 5 years’ time, a developer won’t think about where their model will be trained, how the server is configured, how many GPUs they can access, or any other system administration details. They’ll simply define their model architecture and hyperparameters and send it out to a protocol, where it could be trained on a single GPU, a cluster of TPUs, a billion iPhones, or any combination thereof.

So what would this actually look like in real life? Consider the example of Ishan Dhanani, a computer science graduate student at Columbia University. When he was trying to experiment with Meta’s Llama 2 open-source AI model, he realized that he couldn’t obtain access to GPU compute power through traditional cloud computing providers, such as Amazon Web Services; they were always sold out.

Using Akash, however, he was able to rent 7 hours of processing on a $15,000 Nvidia A100 at a price of $1.10 per hour. The total cost he incurred to carry out his task, roughly $7.70, was “about the cost of a beer.”