{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8dbfa184-eab1-4239-be76-3ccc0396c2ea",
   "metadata": {},
   "source": [
    "# Preliminary Cost Calculation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55019cca-0984-4a6a-8d71-7d410e2a9377",
   "metadata": {},
   "source": [
    "The transition to a cloud-based high-performance computing (HPC) system offers exciting opportunities for scalability, flexibility, and operational efficiency. However, understanding\n",
    "the cost implications of such a move is critical to determining its feasibility for AWI. Unlike traditional on-premises systems with fixed capital expenditures and predictable\n",
    "operating costs, cloud HPC introduces a dynamic cost structure based on resource consumption, storage, and data transfer. This flexibility can be both a benefit and a challenge, as costs\n",
    "may fluctuate significantly depending on computational demand, storage requirements, and long-term usage patterns. In this section, we will analyze the cost components of cloud HPC, including\n",
    "compute resources, storage, data egress, and additional factors such as training and workflow adaptation, to provide a clear picture of its financial viability.\n",
    "\n",
    "To generate an upper bound for the cost estimate, we propose a system that would replicate our current hardware and assume constant usage at maximum capacity. The cloud-based system's cost\n",
    "is calculated based on the following components:\n",
    "\n",
    "* Compute Nodes\n",
    "* Storage\n",
    "* Data Retrieval\n",
    "\n",
    "For demostration purposes (and ease of use) we will base the examples on [Amazon Web Services' \"Cloud HPC\" concept](https://aws.amazon.com/hpc/solution-components/). \n",
    "\n",
    "## Compute Nodes\n",
    "Amazon uses [Elastic Compute Cloud (designated as EC2 instances)](https://aws.amazon.com/ec2/) as the compute nodes. For cloud-based HPC applications, these need to be properly orchestrated,\n",
    "and the AWS sub-service offering this is the [AWS Parallel Computing Service (AWS PCS)](https://aws.amazon.com/pcs/?refid=9aef7b0d-16dc-4679-bf76-6d20b57a4500). [Pricing](https://aws.amazon.com/pcs/pricing/) \n",
    "based on the current size of Albedo (240 compute nodes) would be for a Medium size deployment:\n",
    "\n",
    "> Table 1: Slurm Controller sizes\n",
    "\n",
    "| Slurm Controller Size | Number of Instances Orchestrated | Number of Active & Queued Jobs |\n",
    "|-----------------------|----------------------------------|--------------------------------|\n",
    "| Small | Up to 32 | Up to 256 |\n",
    "| **Medium** | **Up to 512** | **Up to 8192** |\n",
    "| Large | Up to 2048 | Up to 16384 |"
   ]
  },
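  {
   "cell_type": "markdown",
   "id": "7c3e9f20-5a1b-4c8d-9e2f-1a2b3c4d5e6f",
   "metadata": {},
   "source": [
    "As a small illustration of Table 1, the sketch below (just the thresholds quoted above, not an AWS API) maps an instance count to the corresponding Slurm controller size; with Albedo's 240 nodes this lands in the Medium tier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d4f0a31-6b2c-4d9e-af30-2b3c4d5e6f70",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: pick the AWS PCS Slurm controller size from the thresholds in Table 1\n",
    "def controller_size(n_instances):\n",
    "    if n_instances <= 32:\n",
    "        return \"Small\"\n",
    "    if n_instances <= 512:\n",
    "        return \"Medium\"\n",
    "    if n_instances <= 2048:\n",
    "        return \"Large\"\n",
    "    raise ValueError(\"exceeds what a single Large controller orchestrates\")\n",
    "\n",
    "controller_size(240)  # Albedo's current node count -> 'Medium'"
   ]
  },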
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "498d34e8-e1ec-4419-8531-b7bd264977f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pint\n",
    "from IPython.display import Markdown\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "771fd440-c2bc-45a5-bccc-2fa30f6094d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "ureg = pint.UnitRegistry(preprocessors = [lambda s: s.replace(\"€\", \"EUR\")])\n",
    "ureg.define(\"euro = [currency] = € = EUR\")\n",
    "ureg.define(\"node = 1\")\n",
    "hours_per_month = 30 * 24 * ureg.hour"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d653d717-cecc-4e2f-aec4-72d6d65e2c0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "instances = 240 * ureg(\"node\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "970d5fd1-2060-4e83-a2c0-28b6b3a36520",
   "metadata": {},
   "source": [
    "If we were to use the Regional pricing for Frankfurt (AWS designation `eu-central-1`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2a05c101-1b6c-49ea-b866-3ae3348ab16f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "For 240 compute nodes we would need 21,767.11 euro / month in AWS PCS costs"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "controller_fee = 4.3841 * ureg(\"€ / hour\")\n",
    "node_management_fee = 0.1077 * ureg(\"€ / hour / node\")\n",
    "controller_cost = (controller_fee * hours_per_month) + (instances * node_management_fee * hours_per_month)\n",
    "Markdown(f\"For {instances.magnitude} compute nodes we would need {controller_cost:,.2f} / month in AWS PCS costs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3695a0e5-15de-4fd3-8db2-d89728f6839c",
   "metadata": {},
   "source": [
    "**In addition** we also need to pay for the actual compute machines. Specifically, we would use `hpc6a.48xlarge`, as recommended [by AWS](https://aws.amazon.com/hpc/solution-components/) (see the section **Compute Intensive** under **Compute & Networking**) which have [the following characteristics](https://aws.amazon.com/ec2/instance-types/hpc6a/):\n",
    "> Table 2: Compute node characteristics\n",
    "\n",
    "| Instance | AWS | Albedo |\n",
    "| --- | --- | --- |\n",
    "| Name | hpc6a.48xlarge | prod-[001-240] |\n",
    "| Physical Cores | 192 | 128 |\n",
    "| Memory (GiB) | 384 | 256 |\n",
    "| EFA Network Bandwidth (Gbps) | 100 |\n",
    "| Network Bandwidth (Gbps) | 25 |\n",
    "\n",
    "\n",
    "These run on AMD EPYC 7003. For comparison, Albedo compute nodes use AMD EPYC 7702. Unfortunately, there does not seem to be any pricing information for `hpc6a.48xlarge` instances directly, however, a good estimate seems to be around 8€ / hour based on [other \"large\" or \"xlarge\" instances](https://aws.amazon.com/ec2/pricing/on-demand/). There is also a \"savings\" price for paying for compute in advance rather than on-demand, with typical reduction of about [±30%](https://aws.amazon.com/savingsplans/compute-pricing/). The following calculation uses this \"savings\" variant, with ~6€ / hour for compute nodes using the `c6a.48xlarge` instance as a compute node (note: without the `hpc` prefix!)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8de69f25-f1ec-4cc3-85b6-66ea5dd51ef5",
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_fee = 5.95329 * ureg(\"€ / (hour * node)\")"
   ]
  },
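  {
   "cell_type": "markdown",
   "id": "9e5a1b42-7c3d-4eaf-b041-3c4d5e6f7081",
   "metadata": {},
   "source": [
    "As a rough cross-check of this rate (using the ~8 € / hour on-demand estimate and the ~30% savings-plan discount quoted above; both are approximations, not official prices):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af6b2c53-8d4e-4fb0-a152-4d5e6f708192",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cross-check: an estimated ~8 €/hour on-demand rate with a ~30% savings-plan\n",
    "# discount should land close to the instance_fee assumed above.\n",
    "on_demand_estimate = 8.0 * ureg(\"€ / (hour * node)\")\n",
    "savings_discount = 0.30  # approximate reduction with AWS Savings Plans\n",
    "on_demand_estimate * (1 - savings_discount), instance_fee"
   ]
  },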
  {
   "cell_type": "markdown",
   "id": "359624e6-dd61-45f2-8673-d8a03a87e64f",
   "metadata": {},
   "source": [
    "Assuming we want a cluster with a similar size as Albedo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0b50087c-f059-40e2-8f67-752a5d496cd0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Compute costs for 240 would be 1,028,728.51 euro per month in EC2 costs"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compute_cost = hours_per_month * instance_fee * instances\n",
    "Markdown(f\"Compute costs for {instances.magnitude} would be {compute_cost:,.2f} per month in EC2 costs\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6d198c1e-589b-40e6-b047-70ede86a4081",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "For 240 compute nodes and controller costs, we would have a total monthly bill of 1,050,495.62 euro for compute resources. Breaking this down per node and per month this would be: 4,377.07 euro / node / month, or 145.90 euro / node / day"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total_compute_cost = compute_cost + controller_cost\n",
    "total_compute_cost_per_node = total_compute_cost / instances\n",
    "Markdown(f\"For {instances.magnitude} compute nodes and controller costs, we would have a total monthly bill of {total_compute_cost:,.2f} for compute resources. Breaking this down per node and per month this would be: {total_compute_cost_per_node:,.2f} / month, or {total_compute_cost_per_node/30:,.2f} / day\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93d85ce7-9e69-47ee-9495-dc2044598412",
   "metadata": {},
   "source": [
    "## Storage\n",
    "\n",
    "There are several elements that need to considered for storage. For all calculations, we assume a total filesystem of 5Pb (similar to Albedo). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4b453e4e-50d8-4a91-957e-ce5d66f723cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "ureg.define(\"iops = 1\")\n",
    "ureg.define(\"mbs = 1\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "07d2b3b9-d280-4222-9090-4f3f2e6d5e93",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_amount = 5 * ureg(\"petabyte\")\n",
    "iops_provisioned = 10_000 * ureg(\"iops\")\n",
    "throughput_provisioned = 500 * ureg(\"mbs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8754d19e-a175-4fe0-92be-69e8ceb47583",
   "metadata": {},
   "source": [
    "Amazon recommends the [\"Elastic Block Storage\" (EBS)](https://aws.amazon.com/ebs/pricing/) as main storage (General Purpose SSD: gp3 or HDD backed storage: sc 1). We assume that we have 0.0952 euro/gigabyte-month for storage cost, 0.0006 euro per IOPS-month and 0.0448 euro per MB/s-month. We provision 10,000 IOPS and 500 MB/s for this volume. The first 3000 IOPs and 125 MB/s are free in the baseline performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "077bff95-35e7-4bdc-bf92-0b7adc5391e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "volume_charge = 0.0952 * ureg(\"euro\") / (ureg(\"gigabyte\") * ureg(\"month\"))\n",
    "iops_charge = 0.006 * ureg(\"euro\") / (ureg(\"iops\") * ureg(\"month\"))\n",
    "throughput_charge = 0.048 * ureg(\"euro\") / (ureg(\"mbs\") * ureg(\"month\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "dddbbc0d-0dd2-4fc1-a9ea-912fd3c361c3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Estimated costs for storage assuming 100% SSD usage: 476,060.00 euro / month"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "base_storage = (data_amount.to(\"gigabyte\") * volume_charge) \\\n",
    "+ (iops_charge * (iops_provisioned - 3_000 * ureg(\"iops\"))) \\\n",
    "+ (throughput_charge * (throughput_provisioned - 125*ureg(\"mbs\")))\n",
    "Markdown(f\"Estimated costs for storage assuming 100% SSD usage: {base_storage:,.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cfa487a-4720-47f5-95fd-9fffb6fde0b4",
   "metadata": {},
   "source": [
    "Note that this calculation is probably an \"upper limit\", since we would not have 100% of the data on SSDs, and it assumes that the storage is at capacity. Real costs would be lower, since one only pays for actively blocked storage amount, and we would likely only have part of the data on SSD, with a majority on HDD. However, considerations such as snapshots have not been factored in, and these would cost extra. \n",
    "\n",
    "**Still to be clarified** \n",
    "+ [ ] Does one need S3 \"on top\"?\n",
    "+ [ ] Is [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/) extra, or a replacement for this?"
   ]
  },
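  {
   "cell_type": "markdown",
   "id": "b07c3d64-9e5f-4ac1-b263-5e6f70819203",
   "metadata": {},
   "source": [
    "The following is a parametrized sketch of such a mixed SSD/HDD scenario. The SSD rate is the gp3 volume charge from above; the HDD rate and the 20/80 split are **placeholders only** and would need to be replaced with actual sc1 pricing and a realistic data distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18d4e75-af60-4bd2-8374-6f7081920314",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch of a mixed SSD/HDD storage estimate; the HDD rate and the split are\n",
    "# placeholder assumptions, not actual AWS pricing.\n",
    "ssd_fraction = 0.2  # assumed share of \"hot\" data kept on SSD (gp3)\n",
    "hdd_volume_charge = 0.02 * ureg(\"euro\") / (ureg(\"gigabyte\") * ureg(\"month\"))  # placeholder, check sc1 pricing\n",
    "mixed_storage = (\n",
    "    data_amount.to(\"gigabyte\") * ssd_fraction * volume_charge\n",
    "    + data_amount.to(\"gigabyte\") * (1 - ssd_fraction) * hdd_volume_charge\n",
    "    + iops_charge * (iops_provisioned - 3_000 * ureg(\"iops\"))\n",
    "    + throughput_charge * (throughput_provisioned - 125 * ureg(\"mbs\"))\n",
    ")\n",
    "Markdown(f\"Illustrative mixed SSD/HDD storage estimate: {mixed_storage:,.2f}\")"
   ]
  },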
  {
   "cell_type": "markdown",
   "id": "6fe42666-b2d0-454b-bd5c-3bcd3bc5e476",
   "metadata": {},
   "source": [
    "## Network\n",
    "Network appears to be automatically configured via the Amazon PCS, so there are no additional costs here."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac7f1050-2818-4a84-a880-b8a69826f120",
   "metadata": {},
   "source": [
    "## Data Retrival\n",
    "\n",
    "In additon, we need to consider that transfeing data into and out of AWS incures fees. According [to the documentation](https://aws.amazon.com/s3/pricing/?nc=sn&loc=4):\n",
    "* First 10 Tb/Month: 0.09 Euro/GB\n",
    "* Next 40 Tb/Month: 0.085 Euro/GB\n",
    "* Next 100 Tb/Month: 0.07 Euro/GB\n",
    "* Greater 150 TB/Month: 0.05 Euro/GB\n",
    "\n",
    "Assuming the total average user load -- so all users of the machine in sum -- might need to access 50 Tb of their data locally on AWI machines, this would be:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a076d16b-c5f0-4932-8795-c44bb9502b1e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "With very crude assumptions, we would use 4,300.00 euro for transfering data"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total_access = 50*ureg(\"terabyte\")\n",
    "first_rate = 0.09 * ureg(\"euro / gigabyte\")\n",
    "second_rate = 0.085 * ureg(\"euro / gigabyte\")\n",
    "\n",
    "data_transfer_cost = ((10*ureg(\"terabyte\").to(\"gigabyte\")*first_rate) + (40*ureg(\"terabyte\").to(\"gigabyte\")*second_rate))\n",
    "Markdown(f\"With very crude assumptions, we would use {data_transfer_cost:,.2f} for transfering data\")"
   ]
  },
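  {
   "cell_type": "markdown",
   "id": "d29e5f86-b071-4ce3-9485-708192031425",
   "metadata": {},
   "source": [
    "The cell above hard-codes the first two tiers. For other transfer volumes, a small helper that walks the tier table quoted above gives the same result (a sketch of the tiering logic, not AWS's billing code; rates in euro per GB):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3af6097-c182-4df4-a596-819203142536",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generic tiered egress cost based on the rate table above\n",
    "def egress_cost(volume_gb):\n",
    "    tiers = [\n",
    "        (10_000, 0.09),        # first 10 TB\n",
    "        (40_000, 0.085),       # next 40 TB\n",
    "        (100_000, 0.07),       # next 100 TB\n",
    "        (float(\"inf\"), 0.05),  # above 150 TB\n",
    "    ]\n",
    "    cost, remaining = 0.0, volume_gb\n",
    "    for width, rate in tiers:\n",
    "        in_tier = min(remaining, width)\n",
    "        cost += in_tier * rate\n",
    "        remaining -= in_tier\n",
    "        if remaining <= 0:\n",
    "            break\n",
    "    return cost * ureg(\"euro\")\n",
    "\n",
    "egress_cost(total_access.to(\"gigabyte\").magnitude)  # the 50 TB example from above"
   ]
  },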
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "0bf7422b-880f-4275-a118-9222579daa8f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "51,028.52 euro per day\n",
      "1,530,855.62 euro per month\n",
      "18,370,267.49 euro per year\n"
     ]
    }
   ],
   "source": [
    "total =  (base_storage*1*ureg(\"month\")) + total_compute_cost + data_transfer_cost\n",
    "print(f\"{total/30.:,.2f} per day\")\n",
    "print(f\"{total:,.2f} per month\")\n",
    "print(f\"{total*12:,.2f} per year\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9df69459-30ed-4511-b2b7-2917450e6485",
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'plt' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn[1], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m plt\u001b[38;5;241m.\u001b[39mpie([base_storage\u001b[38;5;241m.\u001b[39mmagnitude, total_compute_cost\u001b[38;5;241m.\u001b[39mmagnitude, data_transfer_cost\u001b[38;5;241m.\u001b[39mmagnitude], labels\u001b[38;5;241m=\u001b[39m[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mStorage\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCompute\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mData Access\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n",
      "\u001b[1;31mNameError\u001b[0m: name 'plt' is not defined"
     ]
    }
   ],
   "source": [
    "plt.pie([base_storage.magnitude, total_compute_cost.magnitude, data_transfer_cost.magnitude], labels=[\"Storage\", \"Compute\", \"Data Access\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e3356bd-b423-4f6b-9e9d-f6811d260f8f",
   "metadata": {},
   "source": [
    "## Some other comparisons\n",
    "* Gregor Knorr kindly provided an estimate of costs he had used for EU Funding proposals: 9.70 Euros/day/node on DKRZ, without storage considerations.\n",
    "* AWI's Controlling calculates 10 Euros/day/node for Albedo, also without storage considerations."
   ]
  },
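  {
   "cell_type": "markdown",
   "id": "f4b07128-d293-4e05-b6a7-920314253647",
   "metadata": {},
   "source": [
    "Relating the per-node compute figure from above to these on-site numbers (all of which exclude storage) gives a rough multiplier, consistent with the bottom line below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05c18239-e3a4-4f16-87b8-031425364758",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare the estimated cloud compute cost per node and day with the on-site figures quoted above\n",
    "cloud_per_node_day = (total_compute_cost_per_node / 30).magnitude  # euro per node and day\n",
    "for label, onsite_rate in [(\"Albedo (AWI Controlling)\", 10.0), (\"DKRZ (EU proposal estimate)\", 9.70)]:\n",
    "    print(f\"vs. {label}: roughly {cloud_per_node_day / onsite_rate:.1f}x more expensive\")"
   ]
  },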
  {
   "cell_type": "markdown",
   "id": "f778eea0-5f98-4476-84f5-00a5267aed61",
   "metadata": {},
   "source": [
    "## AWS Pricing Calculator\n",
    "A preliminary \"click through\" of the pricing can be found here: https://calculator.aws/#/estimate?id=e1067ac67b8108402c80dcbd0dadef5eb61d6b3d. The exact sum is different, but on the same order of magnitude."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4adcbb60-210c-4201-a4aa-95db35bc3438",
   "metadata": {},
   "source": [
    "```{admonition} Bottom Line\n",
    "Even accounting for this being an upper-bound estimate, replicating an HPC cluster comparable in scope to Albedo would result in costs that are **10 to 15 times higher** than maintaining an on-site local HPC system.\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Paul Sandbox",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}