An overview of a computation’s lifecycle in Golem
After the introduction published a few weeks ago, it's time to dive deeper into our blog post series. As mentioned, this series will offer insights into the computational side of the Golem platform, along with its requirements, constraints and the challenges that emerge from them.
In its current form, Golem is a platform for non-interactive computation: a user defines a task, submits it to Golem and, after some time, collects the results. As simple as that.
The user's computer does not need to remain idle during this process; the user can submit other tasks or keep using their machine in any way they prefer.
While all of this sounds dead simple from the user's point of view, under the hood quite a lot of events and processes are happening.
Running Trusted Computations in Untrusted Scenarios
Modern hardware has become so reliable that almost no fault tolerance is necessary, provided that you run your computations locally. However, if you want to run such computations in a Golem-like setting, things start to get complex.
In short: we strive to build a nearly perfect platform for computations on an untrusted P2P network, out of many imperfect, untrusted parts. This applies to all modules, layers and logic of the computation process (e.g., transport, result collection or verification).
Throughout this post, we're going to focus on the computational side of things, setting aside transport-related issues.
So this is what happens within the requestor’s computer:
As you can see, before any results are collected, Golem has to go through two additional phases: task broadcasting and negotiation. Only after that can Golem proceed to the actual computation, which consists of two further states: the execution phase and partial results collection. These intermediary states are almost transparent to the user (requestor).
In that graphic there is a missing piece: the Golem Network. Let's add it and see how the requestor's states fit into the network. From left to right (horizontally) are the states on the requestor's side; network communication is illustrated with the vertical arrows.
Apart from a few technical nuances, this captures an idealized picture of what happens in Golem. To clarify, network communication ultimately means that the requestor communicates with individual participants (providers).
Now for a more detailed breakdown of the states/phases:
In the first image, the first state corresponds to a user preparing the task description:
This boils down to specifying a few essential points: what kind of task it is (e.g., a rendering task with Blender, where all scene and final output parameters have to be set); the resources required for the task (e.g., the number of providers and their machine caps such as CPU, RAM, HDD and bandwidth); and finally, timeouts and pricing. Pricing is not of interest in this blog post, but it is nevertheless part of the task specification.
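To make this more concrete, here is a minimal sketch of what such a task description could look like. The structure and field names (TaskDescription, num_providers, scene_file and so on) are purely illustrative assumptions made for this post and do not reflect Golem's actual API:

```python
from dataclasses import dataclass

@dataclass
class ResourceRequirements:
    # Minimal caps a provider's machine must satisfy (illustrative units).
    num_providers: int
    min_cpu_cores: int
    min_ram_gb: int
    min_disk_gb: int
    min_bandwidth_mbps: int

@dataclass
class TaskDescription:
    # What kind of task it is and what to compute.
    task_type: str             # e.g. "blender_render"
    scene_file: str            # large binary asset, pushed separately
    output_format: str         # e.g. "PNG"
    frames: range              # the slice of work to distribute
    resources: ResourceRequirements
    timeout_seconds: int
    max_price_per_hour: float  # pricing, not discussed further in this post

# Example: a Blender rendering task spread across up to 10 providers.
task = TaskDescription(
    task_type="blender_render",
    scene_file="scene.blend",
    output_format="PNG",
    frames=range(1, 101),
    resources=ResourceRequirements(
        num_providers=10,
        min_cpu_cores=4,
        min_ram_gb=8,
        min_disk_gb=20,
        min_bandwidth_mbps=10,
    ),
    timeout_seconds=3600,
    max_price_per_hour=0.5,
)
```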
After this point, Golem takes over the task and starts its core job. During the broadcast phase, the requestor sends the task description to interested providers via the Golem P2P protocol (denoted in the image with a dotted arrow).
In practice, these steps may happen in parallel: the broadcast phase may still be active while some fast providers have already started the execution phase or even delivered partial results. This is an important detail, but we will not present it this way, as doing so would only blur the picture, and it does not affect the logic described in Pic 3. We'll stick to the idealized model with ordered states as shown in the picture, since logically the outcome is identical; if the real implementation serialized the states like this, efficiency would suffer.
The result of the negotiation phase is a subset of providers that are ready to compute parts of the main task. The main task is split into smaller subtasks, which the providers will compute in parallel.
This is what happens during the next phase, the execution phase, which consists of two activities.
First, the requestor starts pushing large binary assets to the network so that they are available to providers. A dedicated large binary asset transport layer is used for this purpose (denoted with a large, dotted arrow in the image).
The distinction between the Golem P2P protocol and the large asset transport layer is deliberate, even though both may be implemented using the same underlying logic. The Golem protocol is a lightweight protocol for sending messages between nodes in the network, keeping their states updated and triggering required events quickly, whereas the transport layer is responsible for moving large binary data either between selected nodes or to a group of nodes, and should not congest the main network. Besides, this layer may use transport-only components such as relays, which may be implemented without any dependency on, or even knowledge of, the Golem P2P protocol.
Large binary asset transfer happens in parallel with sending specific subtasks to the selected providers, who are notified via the Golem P2P protocol. Each subtask carries the task description presented earlier plus the slice of the computation assigned to that provider, so that every provider computes results consistent with the others. In the case of an animation, for example, each provider may be assigned a single frame to render, which may result in a many-fold increase in computation speed.
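As a rough illustration of the animation example, the sketch below assigns frames to the negotiated providers in a round-robin fashion; the assign_frames helper is hypothetical and is not part of the Golem protocol:

```python
def assign_frames(frames, providers):
    """Map each frame to a provider in round-robin fashion.

    Returns a list of (provider, frame) pairs, i.e. one subtask per frame.
    With as many providers as frames, every provider gets exactly one
    frame and the subtasks can run fully in parallel.
    """
    return [(providers[i % len(providers)], frame)
            for i, frame in enumerate(frames)]

# Example: 6 frames spread over 3 negotiated providers.
subtasks = assign_frames(range(1, 7), ["provider-a", "provider-b", "provider-c"])
# [('provider-a', 1), ('provider-b', 2), ('provider-c', 3),
#  ('provider-a', 4), ('provider-b', 5), ('provider-c', 6)]
```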
Each provider sends back its partial result after finishing the computation, denoted by the COLLECT state. This again happens via both the Golem P2P protocol and the large binary data transport layer, as the provider needs a way to notify the requestor that the partial result is ready and to send a potentially large binary result back.
Once all partial results are collected, Golem builds the final result out of them and passes it to the user.
This is, in principle, a computation's lifecycle in Golem from the requestor's perspective.
If we skip the details for a while and assume that the network consists of honest nodes, this logic is equivalent to:
The image above is an infographic of a single step in a map-reduce framework: a requestor sends smaller tasks to selected workers (providers), and the providers return partial results, which the requestor uses to compose the final result.
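Stripping away all networking, such a single map-reduce step can be sketched as follows; compute_on_provider and reduce_results are placeholders for remote execution and result composition, not real Golem calls:

```python
def map_reduce_step(task, providers, compute_on_provider, reduce_results):
    """One idealized map-reduce round with honest providers.

    compute_on_provider(provider, subtask) stands in for remote execution;
    reduce_results(partials) stands in for composing the final result.
    """
    subtasks = list(enumerate(task))               # split the main task
    partials = {}
    for i, subtask in subtasks:                    # "map": dispatch to providers
        provider = providers[i % len(providers)]
        partials[i] = compute_on_provider(provider, subtask)
    ordered = [partials[i] for i, _ in subtasks]   # COLLECT: gather partial results
    return reduce_results(ordered)                 # "reduce": build the final result

# Example: summing squares, "remotely" computed by three providers.
result = map_reduce_step(
    task=[1, 2, 3, 4, 5, 6],
    providers=["p1", "p2", "p3"],
    compute_on_provider=lambda provider, x: x * x,
    reduce_results=sum,
)
print(result)  # 91
```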
If we wanted to picture a single requestor-provider interaction in a bit more detail, the visualization would be:
In an honest network, this would be as simple as it looks in the graphic. Technical challenges would still exist, but they would be easier to tackle than the ones that have to be solved in an untrusted P2P setting. If we could build a trusted subnetwork, things would be much easier to deal with.
But as mentioned earlier, all requestor states are there for a reason. We are in an untrusted P2P domain, which means that “here be dragons.” So let’s discuss the challenges of such a setting for a while.
In general, a requestor has three goals:
- Getting valid results
- Getting the job done in an efficient manner
- Paying as little as possible
We’re not going to discuss the third point in this blog post, yet it is worth noting that in a game theoretical setting, the price also plays a role. However, it does not impact the general reasoning presented below.
The first goal means the requestor wants to get valid results, regardless of how the provider generates them (if a provider could pull them out of thin air, that would be fine as long as validity is preserved).
In practice, computers need to do actual computations to deliver a result, and we have to find a trade-off between the generality of the tasks that can be computed and the feasibility of validating the results.
In general, the broader the class of tasks to be computed, the harder it is to come up with a general solution for verifying them.
For example, there are classes of tasks for which a proof of work exists. One such task is prime factorization, which is hard to compute and easy to verify, provided that a single provider has to deliver the factorization of a specific number.
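Prime factorization shows this asymmetry nicely: finding the factors is expensive, but checking a claimed factorization only takes a few multiplications and primality tests. A minimal, self-contained sketch:

```python
def is_prime(k):
    """Simple trial-division primality test (fine for an illustration)."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def verify_factorization(n, claimed_factors):
    """Cheaply verify an expensive-to-produce result: multiply the claimed
    prime factors back together and check the primality of each one."""
    product = 1
    for f in claimed_factors:
        if not is_prime(f):
            return False
        product *= f
    return product == n

print(verify_factorization(7_919 * 104_729, [7_919, 104_729]))  # True
print(verify_factorization(100, [4, 25]))                       # False: 4 and 25 are not prime
```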
On the other hand, there are non-deterministic tasks, such as rendering, for which the results cannot be easily verified and may be different (and still valid) even if computed with the same input parameters.
The best we can do in such cases is to make sure that the requested computation is performed correctly.
To explain further, we will stay within the broad class of non-deterministic tasks. If we can find a generic approach for this class, it will also be easily applicable to deterministic computations or to computations for which a PoW exists (even though the PoW itself could be used as well).
There are a few possible approaches here, of which Golem is focusing on three:
- keeping track of each provider’s reputation
- different verification methods
- computing with Trusted Execution Environments
As this series is devoted to SGX, we're not going to focus on the first two, but we will briefly recap them below before explaining the third.
Keeping track of each provider’s reputation:
If trusted reputation values can be obtained for each node, the requestor can pick the nodes with the highest chances of delivering valid results. A nice side effect of this approach is that reputation can be associated not only with the validity of the results but also with the provider's efficiency. The caveat is that valid reputation values have to be produced, tracked and updated, which is not a straightforward task, especially as reputation can also originate from outside the Golem Network.
In the main picture (3a), reputation is used during the negotiation phase to select only the best candidate nodes/providers.
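As a simple illustration (not Golem's actual selection logic), picking providers by reputation during negotiation could boil down to ranking the candidates that already satisfy the resource requirements; producing trustworthy reputation values is the hard part and is not shown here:

```python
def select_providers(candidates, reputation, needed):
    """Pick the `needed` candidates with the highest reputation.

    `reputation` maps a provider id to a score; unknown providers get a
    neutral default so that newcomers are not locked out entirely.
    """
    ranked = sorted(candidates,
                    key=lambda p: reputation.get(p, 0.5),
                    reverse=True)
    return ranked[:needed]

reputation = {"p1": 0.97, "p2": 0.40, "p3": 0.88, "p4": 0.92}
print(select_providers(["p1", "p2", "p3", "p4", "p5"], reputation, needed=3))
# ['p1', 'p4', 'p3']
```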
Different verification methods:
Verification methods can be used to automatically accept only valid results with some tunable probability, which may result in a longer computation (for details, please refer to link to London presentation, verification section) but does not involve user interaction. If the number of invalid results is small, this method may be good enough. The main problem here is the verification algorithm itself, as it has to be handcrafted for each new task class.
Automated verification is run during the collection phase when the partial results are delivered.
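One common shape of such a verification method, sketched here purely as an illustration rather than Golem's actual algorithm, is to recompute a small random sample of the delivered partial results locally and flag providers whose results do not match; the sampling rate is the tunable knob that trades extra computation for a higher probability of catching invalid results:

```python
import random

def spot_check(partial_results, recompute, sample_rate=0.1, rng=random):
    """Verify a random sample of partial results by recomputing them.

    partial_results: dict mapping subtask -> result delivered by a provider.
    recompute: trusted local function producing the expected result.
    Returns the set of sampled subtasks whose delivered result did not match.
    """
    sample_size = max(1, int(len(partial_results) * sample_rate))
    sampled = rng.sample(list(partial_results), sample_size)
    return {s for s in sampled if recompute(s) != partial_results[s]}

# Example: one provider returned a wrong square for subtask 7.
delivered = {i: i * i for i in range(10)}
delivered[7] = 0
bad = spot_check(delivered, recompute=lambda i: i * i, sample_rate=0.5)
print(bad)  # {7} if subtask 7 happened to be sampled, otherwise an empty set
```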
The approaches mentioned above can be combined in a hybrid solution: verification is used to update the reputation values, and reputation is used to select the best providers for the task.
This happens during the negotiation phase and the partial results collection phase.
Trusted Execution Environments (SGX):
But with the advent of Trusted Execution Environments, a wide array of additional solutions has become available.
In this case, we refer directly to Intel SGX technology (we will dive deeper into it in the following blog post of this series, so this is just a quick walk-through).
Intel SGX is promising in many respects. The most important of its promises, from the Golem perspective, is the ability to carry out a verifiable computation on an untrusted machine. If you look back at the general requirement we stated for the requestor ("make sure that the requested computation is carried out correctly"), you can already see that SGX matches this requirement exactly in an untrusted setting.
If a requestor can choose provider nodes with this kind of Trusted Execution Environment enabled, they effectively gain access to a trusted subnetwork in which partial results do not have to be verified at all.
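In terms of the negotiation phase, this amounts to filtering offers down to providers that can present a valid enclave attestation. The sketch below only captures the idea; verify_attestation stands in for the real remote-attestation protocol (covered in the next post) and is not an actual API:

```python
def trusted_subnetwork(offers, verify_attestation):
    """Keep only providers whose enclave attestation checks out.

    `offers` maps provider id -> attestation evidence (a quote, in SGX terms);
    `verify_attestation` is a placeholder for the real remote-attestation
    check. Results from this subset would not need further verification.
    """
    return [provider for provider, evidence in offers.items()
            if verify_attestation(evidence)]

# Toy example with a trivial stand-in for the attestation check.
offers = {"p1": "quote-1", "p2": None, "p3": "quote-3"}
print(trusted_subnetwork(offers, verify_attestation=lambda q: q is not None))
# ['p1', 'p3']
```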
Such a trusted subnetwork corresponds directly to Pic 5a, and the detailed requestor-provider interaction is almost identical to the one presented before.
Trusted Execution Environments are not a silver bullet, though. First of all, there is an efficiency overhead associated with using them. For Intel SGX the numbers look good: we have run a few performance tests, and the results are very promising. Still, we need to be aware that this solution comes with certain costs.
Even though Intel SGX can be used to ensure verifiable computations, there is no way to force the provider to carry out the computation efficiently.
With that caveat aside, Intel SGX is the most promising of the solutions listed above.
And with its set of features, it can fit into the bigger picture in more than one way (e.g., local verification could probably be carried out directly on the provider's machine, alongside the regular computation of the subtask on that same machine), which imposes less overhead on the requestor.
We understand, however, that this solution brings certain hardware limitations: we cannot expect all providers to be equipped with Trusted Execution Environment-enabled hardware. This means the network will consist of nodes with different trust guarantees. But even with some percentage of Intel SGX nodes, things get significantly better.
For example, a hybrid approach using both reputation and Intel SGX for computation would give much better quality guarantees than a network with regular nodes only. Reputation can help select providers who indeed share the promised amount of resources.
Besides, Intel SGX nodes can be used to verify the work of regular nodes, making the hybrid network more reliable.
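One way to picture that, again as a sketch rather than a concrete Golem mechanism, is to route a random fraction of the results delivered by regular nodes to SGX nodes for recomputation and fold the outcome back into the reputation scores:

```python
import random

def sgx_audit(partial_results, sgx_recompute, reputation, audit_rate=0.2, rng=random):
    """Use trusted (SGX) nodes to audit results from regular nodes.

    partial_results: dict mapping (provider, subtask) -> delivered result.
    sgx_recompute: recomputation performed inside a trusted enclave (placeholder).
    reputation: dict of provider -> score, nudged up or down by each audit.
    """
    keys = list(partial_results)
    audited = rng.sample(keys, max(1, int(len(keys) * audit_rate)))
    for provider, subtask in audited:
        ok = sgx_recompute(subtask) == partial_results[(provider, subtask)]
        delta = 0.01 if ok else -0.10
        reputation[provider] = min(1.0, max(0.0, reputation.get(provider, 0.5) + delta))
    return reputation

# Toy example: one regular node ("p2") delivered a wrong result.
partials = {("p1", i): i * i for i in range(5)}
partials[("p2", 5)] = 999
rep = sgx_audit(partials, sgx_recompute=lambda i: i * i,
                reputation={"p1": 0.8, "p2": 0.8}, audit_rate=1.0)
print(rep)  # p1 nudged upward, p2 penalized
```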
There are other possible use cases, but we’ll address them in the following blog posts.
Thanks for reading this high-level, if rather long, guide to computations in Golem. We hope it sheds some light on our findings with ITL, and we encourage you to ask questions and discuss them with the community on the #trustedcomputations channel on Rocket Chat.
In the next installment, as mentioned, we will present SGX technology from the Golem perspective, and further down the road the Graphene-ng solution awaits. Stay tuned!