Verified Commit cb7a539d authored by Paul Gierz
%% Cell type:markdown id:8dbfa184-eab1-4239-be76-3ccc0396c2ea tags:
# Preliminary Cost Calculation
%% Cell type:markdown id:55019cca-0984-4a6a-8d71-7d410e2a9377 tags:
The transition to a cloud-based high-performance computing (HPC) system offers exciting opportunities for scalability, flexibility, and operational efficiency. However, understanding
the cost implications of such a move is critical to determining its feasibility for AWI. Unlike traditional on-premises systems with fixed capital expenditures and predictable
operating costs, cloud HPC introduces a dynamic cost structure based on resource consumption, storage, and data transfer. This flexibility can be both a benefit and a challenge, as costs
may fluctuate significantly depending on computational demand, storage requirements, and long-term usage patterns. In this section, we will analyze the cost components of cloud HPC, including
compute resources, storage, data egress, and additional factors such as training and workflow adaptation, to provide a clear picture of its financial viability.
To generate an upper bound for the cost estimate, we propose a system that would replicate our current hardware and assume constant usage at maximum capacity. The cloud-based system's cost
is calculated based on the following components:
* Compute Nodes
* Storage
* Data Retrieval
For demonstration purposes (and ease of use), we will base the examples on [Amazon Web Services' "Cloud HPC" concept](https://aws.amazon.com/hpc/solution-components/).
## Compute Nodes
Amazon uses [Elastic Compute Cloud (designated as EC2 instances)](https://aws.amazon.com/ec2/) as the compute nodes. For cloud-based HPC applications, these need to be properly orchestrated,
and the AWS sub-service offering this orchestration is the [AWS Parallel Computing Service (AWS PCS)](https://aws.amazon.com/pcs/?refid=9aef7b0d-16dc-4679-bf76-6d20b57a4500). Based on the current size of Albedo (240 compute nodes), [pricing](https://aws.amazon.com/pcs/pricing/) would fall into the Medium deployment tier:
> Table 1: Slurm Controller sizes
| Slurm Controller Size | Number of Instances Orchestrated | Number of Active & Queued Jobs |
|-----------------------|----------------------------------|--------------------------------|
| Small | Up to 32 | Up to 256 |
| **Medium** | **Up to 512** | **Up to 8192** |
| Large | Up to 2048 | Up to 16384 |
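%% Cell type:markdown tags:
The tier thresholds in Table 1 map directly to a selection rule. A minimal sketch (the helper name `controller_size` is ours, not part of AWS PCS; the thresholds are taken from the table):
%% Cell type:code tags:
``` python
def controller_size(n_nodes: int) -> str:
    """Return the smallest AWS PCS Slurm controller tier that can
    orchestrate n_nodes instances (thresholds from Table 1)."""
    if n_nodes <= 32:
        return "Small"
    if n_nodes <= 512:
        return "Medium"
    if n_nodes <= 2048:
        return "Large"
    raise ValueError("exceeds the largest documented controller size")

controller_size(240)  # an Albedo-sized cluster lands in the Medium tier
```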
%% Cell type:code id:498d34e8-e1ec-4419-8531-b7bd264977f9 tags:
``` python
import pint
from IPython.display import Markdown
import matplotlib.pyplot as plt
```
%% Cell type:code id:771fd440-c2bc-45a5-bccc-2fa30f6094d4 tags:
``` python
ureg = pint.UnitRegistry(preprocessors=[lambda s: s.replace("€", "EUR")])
ureg.define("euro = [currency] = € = EUR")
ureg.define("node = 1")
hours_per_month = 30 * 24 * ureg.hour
```
%% Cell type:code id:d653d717-cecc-4e2f-aec4-72d6d65e2c0b tags:
``` python
instances = 240 * ureg("node")
```
%% Cell type:markdown id:970d5fd1-2060-4e83-a2c0-28b6b3a36520 tags:
If we were to use the Regional pricing for Frankfurt (AWS designation `eu-central-1`):
%% Cell type:code id:2a05c101-1b6c-49ea-b866-3ae3348ab16f tags:
``` python
controller_fee = 4.3841 * ureg("€ / hour")
node_management_fee = 0.1077 * ureg("€ / hour / node")
controller_cost = (controller_fee * hours_per_month) + (instances * node_management_fee * hours_per_month)
Markdown(f"For {instances.magnitude} compute nodes we would need {controller_cost:,.2f} / month in AWS PCS costs")
```
%% Output
For 240 compute nodes we would need 21,767.11 euro / month in AWS PCS costs
<IPython.core.display.Markdown object>
%% Cell type:markdown id:3695a0e5-15de-4fd3-8db2-d89728f6839c tags:
**In addition** we also need to pay for the actual compute machines. Specifically, we would use `hpc6a.48xlarge`, as recommended [by AWS](https://aws.amazon.com/hpc/solution-components/) (see the section **Compute Intensive** under **Compute & Networking**) which have [the following characteristics](https://aws.amazon.com/ec2/instance-types/hpc6a/):
> Table 2: Compute node characteristics
| Instance | AWS | Albedo |
| --- | --- | --- |
| Name | hpc6a.48xlarge | prod-[001-240] |
| Physical Cores | 192 | 128 |
| Memory (GiB) | 384 | 256 |
| EFA Network Bandwidth (Gbps) | 100 | — |
| Network Bandwidth (Gbps) | 25 | — |
These run on AMD EPYC 7003 processors; for comparison, Albedo compute nodes use AMD EPYC 7702. Unfortunately, there does not appear to be direct pricing information for `hpc6a.48xlarge` instances; however, a good estimate is around 8 €/hour, based on [other "large" or "xlarge" instances](https://aws.amazon.com/ec2/pricing/on-demand/). There is also a "savings" price for committing to compute usage in advance rather than paying on demand, with a typical reduction of about [30%](https://aws.amazon.com/savingsplans/compute-pricing/). The following calculation uses this "savings" variant, at roughly 6 €/hour, with the `c6a.48xlarge` instance (note: without the `hpc` prefix!) standing in as a compute node.
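%% Cell type:markdown tags:
As a rough cross-check of the hourly rate used below, we can apply the typical ~30% Savings Plans reduction to the assumed 8 €/hour on-demand figure (both numbers are estimates from the text, not published `hpc6a.48xlarge` prices):
%% Cell type:code tags:
``` python
on_demand_rate = 8.0      # EUR per node-hour, assumed estimate
savings_discount = 0.30   # typical Compute Savings Plans reduction
savings_rate = on_demand_rate * (1 - savings_discount)
print(f"{savings_rate:.2f} EUR per node-hour")  # ~5.60, in line with the ~6 EUR/hour used below
```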
%% Cell type:code id:8de69f25-f1ec-4cc3-85b6-66ea5dd51ef5 tags:
``` python
instance_fee = 5.95329 * ureg("€ / (hour * node)")
```
%% Cell type:markdown id:359624e6-dd61-45f2-8673-d8a03a87e64f tags:
Assuming we want a cluster with a similar size as Albedo:
%% Cell type:code id:0b50087c-f059-40e2-8f67-752a5d496cd0 tags:
``` python
compute_cost = hours_per_month * instance_fee * instances
Markdown(f"Compute costs for {instances.magnitude} nodes would be {compute_cost:,.2f} per month in EC2 costs")
```
%% Output
Compute costs for 240 nodes would be 1,028,728.51 euro per month in EC2 costs
<IPython.core.display.Markdown object>
%% Cell type:code id:6d198c1e-589b-40e6-b047-70ede86a4081 tags:
``` python
total_compute_cost = compute_cost + controller_cost
total_compute_cost_per_node = total_compute_cost / instances
Markdown(f"For {instances.magnitude} compute nodes and controller costs, we would have a total monthly bill of {total_compute_cost:,.2f} for compute resources. Breaking this down per node and per month this would be: {total_compute_cost_per_node:,.2f} / month, or {total_compute_cost_per_node/30:,.2f} / day")
```
%% Output
For 240 compute nodes and controller costs, we would have a total monthly bill of 1,050,495.62 euro for compute resources. Breaking this down per node and per month this would be: 4,377.07 euro / node / month, or 145.90 euro / node / day
<IPython.core.display.Markdown object>
%% Cell type:markdown id:93d85ce7-9e69-47ee-9495-dc2044598412 tags:
## Storage
There are several elements that need to be considered for storage. For all calculations, we assume a total filesystem of 5 PB (similar to Albedo).
%% Cell type:code id:4b453e4e-50d8-4a91-957e-ce5d66f723cf tags:
``` python
ureg.define("iops = 1")
ureg.define("mbs = 1")
```
%% Cell type:code id:07d2b3b9-d280-4222-9090-4f3f2e6d5e93 tags:
``` python
data_amount = 5 * ureg("petabyte")
iops_provisioned = 10_000 * ureg("iops")
throughput_provisioned = 500 * ureg("mbs")
```
%% Cell type:markdown id:8754d19e-a175-4fe0-92be-69e8ceb47583 tags:
Amazon recommends ["Amazon Elastic Block Store" (EBS)](https://aws.amazon.com/ebs/pricing/) as the main storage (General Purpose SSD: gp3, or HDD-backed storage: sc1). We assume 0.0952 euro per GB-month for storage, 0.006 euro per provisioned IOPS-month, and 0.048 euro per MB/s-month of throughput. We provision 10,000 IOPS and 500 MB/s for this volume; the first 3,000 IOPS and 125 MB/s are included in the baseline performance at no charge.
%% Cell type:code id:077bff95-35e7-4bdc-bf92-0b7adc5391e0 tags:
``` python
volume_charge = 0.0952 * ureg("euro") / (ureg("gigabyte") * ureg("month"))
iops_charge = 0.006 * ureg("euro") / (ureg("iops") * ureg("month"))
throughput_charge = 0.048 * ureg("euro") / (ureg("mbs") * ureg("month"))
```
%% Cell type:code id:dddbbc0d-0dd2-4fc1-a9ea-912fd3c361c3 tags:
``` python
base_storage = (data_amount.to("gigabyte") * volume_charge) \
+ (iops_charge * (iops_provisioned - 3_000 * ureg("iops"))) \
+ (throughput_charge * (throughput_provisioned - 125*ureg("mbs")))
Markdown(f"Estimated costs for storage assuming 100% SSD usage: {base_storage:,.2f}")
```
%% Output
Estimated costs for storage assuming 100% SSD usage: 476,060.00 euro / month
<IPython.core.display.Markdown object>
%% Cell type:markdown id:5cfa487a-4720-47f5-95fd-9fffb6fde0b4 tags:
Note that this calculation is probably an "upper limit": it assumes 100% of the data sits on SSDs and that the storage is at full capacity. Real costs would be lower, since one only pays for the storage actually allocated, and the majority of the data would likely reside on HDD rather than SSD. However, considerations such as snapshots have not been factored in, and these would cost extra.
**Still to be clarified**
+ [ ] Does one need S3 "on top"?
+ [ ] Is [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/) extra, or a replacement for this?
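%% Cell type:markdown tags:
To illustrate how much a mixed SSD/HDD layout could reduce the bill, here is a hypothetical split (the 20/80 split and the ~0.015 €/GB-month sc1 rate are assumptions for illustration, not figures from the pricing page):
%% Cell type:code tags:
``` python
total_gb = 5_000_000               # 5 PB expressed in GB (decimal)
gp3_rate = 0.0952                  # EUR per GB-month (gp3 SSD, as above)
sc1_rate = 0.015                   # EUR per GB-month (sc1 HDD, assumed)
ssd_fraction = 0.2                 # assume only 20% of data on SSD
mixed_cost = (ssd_fraction * total_gb * gp3_rate
              + (1 - ssd_fraction) * total_gb * sc1_rate)
print(f"{mixed_cost:,.0f} EUR / month")  # roughly a third of the all-SSD estimate
```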
%% Cell type:markdown id:6fe42666-b2d0-454b-bd5c-3bcd3bc5e476 tags:
## Network
Networking appears to be configured automatically via AWS PCS, so we assume no additional costs here.
%% Cell type:markdown id:ac7f1050-2818-4a84-a880-b8a69826f120 tags:
## Data Retrieval
In addition, we need to consider that transferring data into and out of AWS incurs fees. According [to the documentation](https://aws.amazon.com/s3/pricing/?nc=sn&loc=4):
* First 10 TB/month: 0.09 Euro/GB
* Next 40 TB/month: 0.085 Euro/GB
* Next 100 TB/month: 0.07 Euro/GB
* Above 150 TB/month: 0.05 Euro/GB
Assuming the total user load (all users of the machine combined) needs to retrieve 50 TB of data to local AWI machines per month, this would cost:
%% Cell type:code id:a076d16b-c5f0-4932-8795-c44bb9502b1e tags:
``` python
total_access = 50*ureg("terabyte")
first_rate = 0.09 * ureg("euro / gigabyte")
second_rate = 0.085 * ureg("euro / gigabyte")
data_transfer_cost = ((10*ureg("terabyte").to("gigabyte")*first_rate) + (40*ureg("terabyte").to("gigabyte")*second_rate))
Markdown(f"With very crude assumptions, we would spend {data_transfer_cost:,.2f} transferring data")
```
%% Output
With very crude assumptions, we would spend 4,300.00 euro transferring data
<IPython.core.display.Markdown object>
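%% Cell type:markdown tags:
The tiered rates generalize to a small calculator for arbitrary transfer volumes (a sketch; the function name is ours, rates as listed above, volumes in decimal GB):
%% Cell type:code tags:
``` python
# (tier size in GB, rate in EUR/GB), applied in order
EGRESS_TIERS = [(10_000, 0.09), (40_000, 0.085), (100_000, 0.07), (float("inf"), 0.05)]

def egress_cost(gigabytes: float) -> float:
    """Cost of transferring `gigabytes` out of AWS under the tiered rates."""
    cost, remaining = 0.0, gigabytes
    for size, rate in EGRESS_TIERS:
        chunk = min(remaining, size)
        cost += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return cost

egress_cost(50_000)  # 50 TB -> 4,300 EUR, matching the calculation above
```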
%% Cell type:code id:0bf7422b-880f-4275-a118-9222579daa8f tags:
``` python
total = (base_storage*1*ureg("month")) + total_compute_cost + data_transfer_cost
print(f"{total/30.:,.2f} per day")
print(f"{total:,.2f} per month")
print(f"{total*12:,.2f} per year")
```
%% Output
51,028.52 euro per day
1,530,855.62 euro per month
18,370,267.49 euro per year
%% Cell type:code id:9df69459-30ed-4511-b2b7-2917450e6485 tags:
``` python
plt.pie([base_storage.magnitude, total_compute_cost.magnitude, data_transfer_cost.magnitude], labels=["Storage", "Compute", "Data Access"]);
```
%% Cell type:markdown id:6e3356bd-b423-4f6b-9e9d-f6811d260f8f tags:
## Some other comparisons
* Gregor Knorr kindly provided an estimate of costs he had used for EU Funding proposals: 9.70 Euros/day/node on DKRZ, without storage considerations.
* AWI's Controlling calculates 10 Euros/day/node for Albedo, also without storage considerations.
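%% Cell type:markdown tags:
Putting these per-node figures next to the cloud compute estimate from above (145.90 €/node/day; storage is excluded in all three numbers):
%% Cell type:code tags:
``` python
cloud_per_node_day = 145.90   # EUR, AWS compute + controller estimate from above
albedo_per_node_day = 10.0    # EUR, AWI Controlling estimate
dkrz_per_node_day = 9.70      # EUR, estimate used in EU funding proposals
print(f"{cloud_per_node_day / albedo_per_node_day:.1f}x the Albedo rate")
print(f"{cloud_per_node_day / dkrz_per_node_day:.1f}x the DKRZ rate")
```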
%% Cell type:markdown id:f778eea0-5f98-4476-84f5-00a5267aed61 tags:
## AWS Pricing Calculator
A preliminary "click through" of the pricing can be found here: https://calculator.aws/#/estimate?id=e1067ac67b8108402c80dcbd0dadef5eb61d6b3d. The exact sum is different, but on the same order of magnitude.
%% Cell type:markdown id:4adcbb60-210c-4201-a4aa-95db35bc3438 tags:
```{admonition} Bottom Line
Even accounting for this being an upper-bound estimate, replicating an HPC cluster comparable in scope to Albedo would result in costs that are **10 to 15 times higher** than maintaining an on-site local HPC system.
```