The transition to a cloud-based high-performance computing (HPC) system offers exciting opportunities for scalability, flexibility, and operational efficiency. However, understanding
the cost implications of such a move is critical to determining its feasibility for AWI. Unlike traditional on-premises systems with fixed capital expenditures and predictable
operating costs, cloud HPC introduces a dynamic cost structure based on resource consumption, storage, and data transfer. This flexibility can be both a benefit and a challenge, as costs
may fluctuate significantly depending on computational demand, storage requirements, and long-term usage patterns. In this section, we will analyze the cost components of cloud HPC, including
compute resources, storage, data egress, and additional factors such as training and workflow adaptation, to provide a clear picture of its financial viability.
To generate an upper bound for the cost estimate, we propose a system that would replicate our current hardware and assume constant usage at maximum capacity. The cloud-based system's cost
is calculated based on the following components:
* Compute Nodes
* Storage
* Data Retrieval
For demonstration purposes (and ease of use), we base the examples on [Amazon Web Services' "Cloud HPC" concept](https://aws.amazon.com/hpc/solution-components/).
## Compute Nodes
Amazon uses [Elastic Compute Cloud (EC2) instances](https://aws.amazon.com/ec2/) as the compute nodes. For cloud-based HPC applications, these need to be properly orchestrated;
the AWS sub-service offering this is the [AWS Parallel Computing Service (AWS PCS)](https://aws.amazon.com/pcs/?refid=9aef7b0d-16dc-4679-bf76-6d20b57a4500). According to the [pricing page](https://aws.amazon.com/pcs/pricing/),
a cluster of the current size of Albedo (240 compute nodes) would require a Medium size deployment:
> Table 1: Slurm Controller sizes
| Slurm Controller Size | Number of Instances Orchestrated | Number of Active & Queued Jobs |
**In addition**, we also need to pay for the actual compute machines. Specifically, we would use `hpc6a.48xlarge` instances, as recommended [by AWS](https://aws.amazon.com/hpc/solution-components/) (see the section **Compute Intensive** under **Compute & Networking**), which have [the following characteristics](https://aws.amazon.com/ec2/instance-types/hpc6a/):
> Table 2: Compute node characteristics
| Instance | AWS | Albedo |
| --- | --- | --- |
| Name | hpc6a.48xlarge | prod-[001-240] |
| Physical Cores | 192 | 128 |
| Memory (GiB) | 384 | 256 |
| EFA Network Bandwidth (Gbps) | 100 | |
| Network Bandwidth (Gbps) | 25 | |
These instances run on AMD EPYC 7003 processors. For comparison, Albedo compute nodes use AMD EPYC 7702. Unfortunately, there does not seem to be any direct pricing information for `hpc6a.48xlarge` instances. A reasonable estimate seems to be around 8€ / hour, based on [other "large" or "xlarge" instances](https://aws.amazon.com/ec2/pricing/on-demand/). There is also a "savings" price for paying for compute in advance rather than on-demand, with a typical reduction of [about 30%](https://aws.amazon.com/savingsplans/compute-pricing/). The following calculation uses this "savings" variant, with ~6€ / hour per compute node, priced using the `c6a.48xlarge` instance (note: without the `hpc` prefix!).
```python
Markdown(f"For {instances.magnitude} compute nodes and controller costs, we would have a total monthly bill of {total_compute_cost:,.2f} for compute resources. Breaking this down per node and per month this would be: {total_compute_cost_per_node:,.2f} / month, or {total_compute_cost_per_node/30:,.2f} / day")
```
%% Output
For 240 compute nodes and controller costs, we would have a total monthly bill of 1,050,495.62 euro for compute resources. Breaking this down per node and per month this would be: 4,377.07 euro / node / month, or 145.90 euro / node / day
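
The arithmetic behind such a compute estimate can be sketched in a few lines. This is a minimal illustration, not the notebook's original cell: the function and class names are hypothetical, the 730 hours per month is an assumed average, and the Slurm controller fee defaults to zero because its price is not stated in the text.

```python
from dataclasses import dataclass

# Assumption: average number of hours in a month (8760 h / 12).
HOURS_PER_MONTH = 730


@dataclass
class ComputeEstimate:
    total_per_month: float
    per_node_per_month: float
    per_node_per_day: float


def estimate_compute_cost(
    nodes=240,                      # Albedo-sized cluster, from the text
    hourly_rate_eur=6.0,            # "savings" rate per node, from the text
    controller_eur_per_month=0.0,   # assumption: controller fee not given here
    hours_per_month=HOURS_PER_MONTH,
    days_per_month=30,
):
    """Upper-bound monthly compute cost, assuming all nodes run constantly."""
    node_cost = nodes * hourly_rate_eur * hours_per_month
    total = node_cost + controller_eur_per_month
    per_node = total / nodes
    return ComputeEstimate(total, per_node, per_node / days_per_month)


estimate = estimate_compute_cost()
print(f"Total: {estimate.total_per_month:,.2f} euro / month")
print(f"Per node: {estimate.per_node_per_month:,.2f} euro / month")
print(f"Per node: {estimate.per_node_per_day:,.2f} euro / day")
```

With these assumed inputs the sketch gives roughly 1.05 million euro per month, which is in the same range as the notebook output above. The exact figure depends on the hours-per-month convention and on the controller fee.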
Amazon recommends ["Elastic Block Storage" (EBS)](https://aws.amazon.com/ebs/pricing/) as main storage (General Purpose SSD: gp3, or HDD-backed storage: sc1). We assume 0.0952 euro per gigabyte-month for storage, 0.0006 euro per IOPS-month, and 0.0448 euro per MB/s-month. We provision 10,000 IOPS and 500 MB/s for this volume. The first 3,000 IOPS and 125 MB/s are included free of charge in the baseline performance.
Note that this calculation is probably an "upper limit": it assumes that 100% of the data sits on SSDs and that the storage is filled to capacity. Real costs would be lower, since one only pays for the storage actually allocated, and we would likely keep only part of the data on SSD, with the majority on HDD. However, considerations such as snapshots have not been factored in, and these would cost extra.
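
As a rough sketch of how the gp3 pricing components combine, the following function applies the per-GB, per-IOPS, and per-MB/s rates quoted above and subtracts the free baseline. The function name is hypothetical, and the 1,000 GB capacity in the example is purely illustrative, since the actual capacity to be provisioned is not stated here.

```python
def gp3_monthly_cost(
    capacity_gb,
    provisioned_iops=10_000,    # from the text
    provisioned_mbps=500,       # from the text
    price_per_gb=0.0952,        # euro per GB-month, from the text
    price_per_iops=0.0006,      # euro per IOPS-month, from the text
    price_per_mbps=0.0448,      # euro per MB/s-month, from the text
    free_iops=3_000,            # baseline included at no charge
    free_mbps=125,              # baseline included at no charge
):
    """Monthly cost in euro for one volume, assuming it is filled to capacity."""
    storage = capacity_gb * price_per_gb
    # Only performance provisioned above the free baseline is billed.
    iops = max(provisioned_iops - free_iops, 0) * price_per_iops
    throughput = max(provisioned_mbps - free_mbps, 0) * price_per_mbps
    return storage + iops + throughput


# Illustrative capacity only (assumption, not Albedo's real storage size).
example_cost = gp3_monthly_cost(capacity_gb=1_000)
print(f"{example_cost:,.2f} euro / month")
```

At realistic capacities the per-GB term dominates: the provisioned IOPS and throughput above the baseline add only about 21 euro per volume per month with the rates assumed here.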
**Still to be clarified**
+ [ ] Does one need S3 "on top"?
+ [ ] Is [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/) extra, or a replacement for this?
In addition, we need to consider that transferring data into and out of AWS incurs fees. According to [the documentation](https://aws.amazon.com/s3/pricing/?nc=sn&loc=4):
* First 10 TB/Month: 0.09 Euro/GB
* Next 40 TB/Month: 0.085 Euro/GB
* Next 100 TB/Month: 0.07 Euro/GB
* Greater than 150 TB/Month: 0.05 Euro/GB
Assuming the total average user load (all users of the machine combined) requires 50 TB of data to be accessed locally on AWI machines, this would be:
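
The tiered rates listed above can be applied with a short calculation. This is a minimal sketch rather than the notebook's original cell: the function name is hypothetical, and it assumes 1 TB = 1,000 GB (a binary convention of 1,024 GB would raise the result slightly).

```python
# Tier sizes in TB and prices in euro per GB, taken from the list above.
# None marks the open-ended final tier.
EGRESS_TIERS = [
    (10, 0.09),
    (40, 0.085),
    (100, 0.07),
    (None, 0.05),
]

GB_PER_TB = 1_000  # assumption: decimal terabytes


def egress_cost(volume_tb):
    """Monthly data transfer cost in euro for the given volume in TB."""
    remaining = volume_tb
    cost = 0.0
    for tier_size_tb, price_per_gb in EGRESS_TIERS:
        if remaining <= 0:
            break
        # Bill as much of the remaining volume as fits into this tier.
        billed_tb = remaining if tier_size_tb is None else min(remaining, tier_size_tb)
        cost += billed_tb * GB_PER_TB * price_per_gb
        remaining -= billed_tb
    return cost


print(f"{egress_cost(50):,.2f} euro")
```

For 50 TB, the first 10 TB are billed at 0.09 euro/GB and the remaining 40 TB at 0.085 euro/GB, which comes to about 4,300 euro under the decimal-terabyte assumption.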
A preliminary "click through" of the pricing can be found here: https://calculator.aws/#/estimate?id=e1067ac67b8108402c80dcbd0dadef5eb61d6b3d. The exact sum differs, but it is of the same order of magnitude.
Even accounting for this being an upper-bound estimate, replicating an HPC cluster comparable in scope to Albedo would result in costs that are **10 to 15 times higher** than maintaining an on-site local HPC system.