Overview of HPC@Mines

Mines has two distinct high performance computing platforms. This page provides an overview of high performance computing and Mines’ platforms, and points to additional information.

The term “high performance computer” has evolved over the years and continues to evolve. A calculation that required a high performance computer 10 years ago might today be done on a laptop. Cell phones have the computing power of early generations of HPC platforms.

Today’s definition of an HPC platform usually involves parallelism: most HPC platforms gang many processing cores together to work on a single problem. A processing core is what most people think of as the central part of a computer, a chip or set of chips with the capability to access memory and perform calculations. Most modern computers contain more than one computing core, and often more than one core on a single chip. For example, the Intel Core i5 chip found in many low-end laptops contains two computing cores.

A computing node encapsulates the cores, memory, networking, and related technologies; a node may or may not have video output capability. A high performance computer is one that can effectively use multiple cores, on a single node or across a collection of nodes, to perform a calculation by distributing the work among those cores. The HPC platforms at Mines add a high speed network connecting the nodes to make this distribution efficient. The nodes are housed in a collection of racks, with tens of nodes per rack. The individual cores of Mines’ HPC platforms may be no more powerful than the cores of a recent laptop; there are, however, thousands of them available.
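For example, on any Linux machine, including Mines’ login nodes, you can see how many cores are available with standard commands:

  nproc                                      # number of processing cores visible to the operating system
  lscpu | grep -E 'Model name|Socket|Core'   # processor model, sockets, and cores per socket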

About the Environment

Operating System

Mines’ HPC platforms run variants of CentOS, a Linux-based operating system, with a command line interface. Users need to be familiar with Linux to make effective use of Mines’ machines. For users new to Linux, we recommend working through our New Users Guide.
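As a small taste of the command line (all standard Linux commands; the directory name is just an example):

  pwd                  # show which directory you are currently in
  ls -l                # list the files in that directory
  mkdir my_project     # create a new directory
  cd my_project        # move into it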

Logging On

All access to Mines’ HPC clusters is via SSH. An SSH client is included with macOS- and Linux-based machines; Windows users will need additional software to connect to Mines’ HPC resources. Options include Windows Subsystem for Linux (WSL), MobaXTerm, and Cygwin.
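For example, from a macOS or Linux terminal (or a WSL shell on Windows), logging in looks like the following; the username and hostname are placeholders, so use the address provided with your account:

  # 'username' and the hostname below are placeholders, not the actual login address
  ssh username@wendian.mines.edu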

Runtime Environment

Much of the runtime environment is managed via a module system, an approach in common use in many HPC environments. You can learn more about our HPC systems on our General Information page.
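As a quick illustration, the commands below are common to most module systems; the module name ‘gcc’ is only an example, and the exact names on our machines will differ:

  module avail         # list the software modules available
  module load gcc      # add a module to your environment
  module list          # show the modules currently loaded
  module unload gcc    # remove a module from your environment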

Scheduling and Running in Parallel

Running parallel applications on Mines’ HPC resources is managed via a scheduler. The same scheduling software is used on all machines; an assortment of links to information and tutorials can be accessed under Slurm Guides.
Running a parallel application requires first creating a script. The script contains a request for resources and the commands to run the job on those resources. The script is submitted to the scheduler and will run when the requested resources become available. A quick example of a SLURM submit script can be found here.
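As a minimal sketch of such a script (the resource requests and program name below are placeholders, not recommended values for Wendian or Mio):

  #!/bin/bash
  # Request one node, four tasks, 8 GB of memory, and a one-hour time limit
  #SBATCH --job-name=example
  #SBATCH --nodes=1
  #SBATCH --ntasks=4
  #SBATCH --mem=8G
  #SBATCH --time=01:00:00

  # Commands to run on the allocated resources; 'my_program' is a placeholder
  srun ./my_program

The script would then be handed to the scheduler with a command such as sbatch submit.sh.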

Wendian is the newest high performance computing platform at Mines. Wendian came online in the fall of 2018. It contains 87 compute nodes; Skylake Intel processors comprise the bulk of the system, with nine NVIDIA GPU nodes and two OpenPower nodes. The maximum performance rating is over 350 teraflops. Additionally, Wendian has three administration nodes and six file system nodes holding up to 1,152 terabytes of raw storage with over 10 gigabytes/second transfer speeds. Wendian currently runs CentOS version 7, a community-driven computing platform that is functionally compatible with Red Hat Enterprise Linux. Simulations are managed as jobs using the SLURM scheduler. Commonly used programming languages and parallel programming models include, but are not limited to: C, C++, Fortran, Python, OpenMP, OpenACC, CUDA, and MPI.
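For example, a typical workflow for building and running a small parallel program looks roughly like the following; the module names and file names are illustrative, not the exact names installed on Wendian:

  module load gcc openmpi                  # load a compiler and an MPI library (example module names)
  mpicc hello_mpi.c -o hello_mpi           # compile an MPI program written in C
  gcc -fopenmp hello_omp.c -o hello_omp    # compile an OpenMP program written in C
  srun -N 2 -n 8 ./hello_mpi               # launch 8 MPI tasks across 2 nodes through the scheduler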

Processor                          | Cores | Memory (GB) | Nodes / Cards | Cores Total | Memory Total (GB)
Skylake 6154                       | 36    | 192         | 39            | 1,404       | 7,488
Skylake 6154                       | 36    | 384         | 39            | 1,404       | 14,976
Skylake 5118                       | 24    | 192         | 5             | 120         | 960
Skylake 5218                       | 32    | 192         | 4             | 128         | 768
GPU cards for 5118s (NVIDIA V100)  |       | 32          | 20            |             | 640
GPU cards for 5218s (NVIDIA A100)  |       | 40          | 16            |             | 640
OpenPower 8                        | 16    | 256         | 2             | 32          | 512
OpenPower 9                        | 16    | 256         | 2             | 32          | 512
Totals                             |       |             | 123           | 3,112       | 26,496

Wendian Details

78 Relion XO1132g Server – Skylake Nodes

  • 1OU (1/3rd Width) w/ 2x 2.5″ Fixed 12Gb SATA Bay
  • Dual Intel Xeon 6154 (18C, 3.0GHz, 200W)
  • 39 nodes with 192GB RAM, DDR4-2666MHz REG, ECC, 1R (12 x 16GB)
  • 39 nodes with 384GB RAM, DDR4-2666MHz REG, ECC, 1R (12 x 32GB)
  • 256 GB SSD
  • Integrated AHCI, Intel C621, 6Gb SATA: Linux RAID 0/1/5/6/10/50/60
  • Integrated NIC, Intel I350, 2x RJ-45/GbE (1-Port Shared with BMC for IPMI)
  • HCA, Mellanox ConnectX-4, 1x QSFP28/EDR
  • Preload, CentOS, Version 7
  • Processors water cooled
  • 3-Year Standard Warranty

5 Relion XO1114GTS Server GPU nodes

  • 1OU (Full Width) w/ 4x 2.5″ Hot Swap 12Gb SAS Bay
  • Dual Intel Xeon Gold 5118 CPU (12C, 2.30GHz, 105W)
  • 192GB RAM, DDR4-2666MHz REG, ECC, 2R (12 x 16GB)
  • Integrated AHCI, Intel C621, 6Gb SATA: Linux RAID 0/1/5/6/10/50/60
  • 256GB SSD, 2.5″, SATA, 6Gbps, 0.2 DWPD, 3D TLC (Micron 1100)
  • Integrated NIC, Intel I350, 2x RJ-45/GbE (1-Port Shared with BMC for IPMI)
  • PBB, 96 Lanes, 1x PCIE Gen3 x16 to 5x PCIE Gen3 x16 (4x GPU + 1x PCIE)
  • HCA, Mellanox ConnectX-5, 1x QSFP28/100Gb VPI
  • 4 x Accelerator, NVIDIA Tesla V100-SXM2, 32GB HBM2, 5120 CUDA, 640 Tensor, 300W
  • Preload, CentOS, Version 7
  • Standard 3-Year Warranty
  • 3-Year On-Site Service, 8×5 Next Business Day

4 Relion XO1114GT GPU Server Nodes 

  • 4 x A100 GPUs
  • 1OU (Full Width) w/ 4x 2.5″ Hot Swap 12Gb SAS Bay
  • Dual Intel Xeon Gold 5218 (16C, 2.30GHz, 125W)
  • 192GB RAM, DDR4-2933MHz REG, ECC, 1Rx4 (12 x 16GB)
  • 256GB SSD, 2.5″, SATA, 6Gbps, 0.1 DWPD, 3D TLC (Micron 1300)
  • Integrated AHCI, Intel C621, 6Gb SATA: Linux RAID 0/1/5/6/10/50/60
  • 4 x NVIDIA A100-PCIe, 40GB HBM2, Passive
  • HCA, Mellanox ConnectX-5 VPI, PCIE3 x16, 1x QSFP28/EDR/100GbE, LP Support, Mellanox HCAs/NICs, Silver, 3Yrs
  • Integrated NIC, Intel I350, 2x RJ-45/GbE (1-Port Shared with BMC for IPMI)
  • 1OU Full-width Tray & Brackets
  • Preload, CentOS, Version 7
  • Service, Warranty, 3 Year, (Standard)
  • Service, On-Site, US, 3 Year, NBD (GPU SVR)

2 Magna 2002S Server – OpenPower8 Nodes

  • 2U, 2x 2.5″ Hot Swap 6Gb SATA Bay w/ 2x 1300W Hot Swap PSU
  • Dual IBM POWER8 Murano 00UL670 CPU (8C/64T, 3.2GHz, 190W)
  • 8 x Memory Module, 4 x DDR4 Slot
  • 256GB RAM, DDR4-2400, REG, ECC, (32 x 8GB)
  • Integrated AHCI, Marvell 88SE9235 6Gb SATA: Linux RAID 0/1/5/6/10/50/60
  • Integrated NIC, 2x RJ-45/GbE (1-Port Shared with BMC for IPMI)
  • HCA, Mellanox ConnectX-4, 1x QSFP28/EDR
  • Preload, Ubuntu 16.04
  • Standard 3-Year Warranty
  • 3-Year On-Site Service, 8×5 Next Business Day

2 Magna 2xxx Server – OpenPower9 Nodes

  • To be installed when they become available
  • Details to follow

File System

  • 960TB Usable Capacity @ 10GB/s
  • Relion 1900 Server – running the BeeGFS MDS
  • 2 x 150GB SSD, 2.5″, SATA, 6Gbps, 1 DWPD, 3D MLC
  • 4 x 400GB SSD, 2.5″, SATA, 6Gbps, 3 DWPD, MLC
  • IceBreaker 4936 Server – running the BeeGFS OSS
  • 2 x 150GB SSD, 2.5″, SATA, 6Gbps, 1 DWPD, 3D MLC
  • 4 x 36 x 8TB HDD, 3.5″, SAS, 12Gbps – 1,152TB raw
  • Ability to create parallel file systems from local disk on the fly

Cooling

The XO1132g and XO1114GTS servers have on-board water cooling for the CPUs. These are all fed water from a coolant distribution unit (CDU), which removes about 60% of the total heat generated. The water supplied to the compute resources runs in a closed loop; the CDU contains a heat exchanger in which heat from the closed loop warms chilled water from central facilities. The remaining heat from these servers, and the heat generated by the other nodes, is removed by two in-row coolers. The equipment list includes two APC ACRC301S In-Row Coolers and a MOTIVAIR MCDU25 Coolant Distribution Unit.

Mio is currently closed to new node purchases from faculty. For new faculty interested in using Mines HPC resources, please refer to the “Wendian” tab for more information.

About

“Mio” is currently the oldest running HPC system @ Mines. It is a 120+ Tflop HPC cluster for Mines student and faculty research use.

The name “Mio” is a play on words. It is a Spanish translation of the word “mine,” as in “belongs to me.” The phrase “The computer is mine” can be translated as “El ordenador es mío.”

Students

Students have already purchased some access to Mio with Tech Fee funds—usable for general research, class projects, and learning HPC techniques. Students may also at times use Mio nodes purchased by their academic advisor or other professors. The HPC Group offers assistance to students (and faculty) to get up and running on Mio. Individual consultations and workshops are available.

Faculty

Mio holds many advantages for professors:

  • There’s no need to manage their own HPC resources
  • Professors can access other professors’ resources when allowed
  • Mines supplies high-quality Infiniband network infrastructure, which greatly improves the scalability of multinode applications

Hardware description

  • 8–28 compute cores per node
  • 2.4–3.06 GHz processors
  • 24–256 GB of memory per node
  • Infiniband Interconnect
  • 2 GPU nodes – 7.23 Tflops
  • 240 TB parallel file system
  • 2 Power8 w/GPU nodes

More Information

  • Configuration – Describes Mio’s node configuration and current node owners
  • Panasas – Details of the Mio file system upgrade with its Panasas servers

 

AuN (“Golden”) is a traditional HPC platform built with standard Intel processors on the IBM iDataPlex platform. It comprises 144 compute nodes connected by a high-speed network; each node contains 16 Intel Sandy Bridge compute cores and 64 GB of memory, for a total of 2,304 cores and 9,216 GB of memory. AuN is rated at 50 Tflops and is housed in two double-wide racks with 72 nodes in each rack. It was designed to run jobs that require high memory per core. AuN is now deprecated, and its nodes are available to Wendian users by adding the following line to their SLURM preamble:

#SBATCH -p aun
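Equivalently, the partition can be requested at submission time on the command line (the script name here is just a placeholder):

  sbatch --partition=aun submit.sh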
 

Our HPC Systems @ Mines

File System Usage Information

File System Names and Usage

  • Home: Home directories have a low quota. (Primary use is system files)
  • Scratch: Scratch directories are purged regularly and are less stable. (Good for dumping large input and output files)
  • Bins: Bins directories are a special file system designed for holding programs. (Good for storing source and binary files, not input files)

File System Quotas

System  | $SCRATCH        | $HOME + $BINS (Combined Total)
Wendian | 2,000,000 Files | 20 GB
Mio     | 2,000,000 Files | 20 GB

Most Unix-style file systems see a performance decrease as the number of files per directory increases, but this only becomes noticeable once the number of files per directory reaches the hundreds or thousands.

Getting Around the Various File Systems

When you log in to Wendian or Mio, you will see that you have the following directories:

Wendian:

bins  scratch

Mio:

bins runs scratch
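For example, a common pattern (sketched here with placeholder file and directory names) is to stage large job files in scratch and keep the program itself in bins:

  cd ~/scratch                   # move into your scratch directory
  mkdir my_run && cd my_run      # create a working directory for a job
  cp ~/bins/my_program .         # copy a program kept in bins (illustrative path)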

Policies

NOTE: Policy numbers 1–5 are currently suspended. The mechanism for ‘priority access’ on Wendian is being re-evaluated. Please submit a request via our help desk to discuss your needs and determine the best way to move forward.

Policy numbers 6 and above continue to provide accurate guidance for running on Wendian.

Wendian policies:
  1. Nodes on Wendian are available for purchase.
  2. Node purchases are Mines-subsidized for a per-node cost of $8500 / $10427 (high mem).
  3. When the number of PI-purchased nodes reaches 75% of the total, new nodes will be added.
  4. The percent of nodes kept from private (PI) purchase will remain at or above 25% of the total.
  5. New nodes acquired at policy-dictated intervals will be of the current generation.
  6. Queue management of this hybrid environment is outlined below, with periodic re-evaluation as HPC community experience evolves:
    • There are two types of queue:
      • Group: Comprises QoSes of nodes purchased by PIs;
      • Full: Comprises all nodes on machine (including purchased nodes).
  7. Allocations on Wendian are by proposal.
  8. Allocations are awarded in fixed core-hours; upon expiration or depletion, priority for running jobs will decrease.
  9. Allocations will not be debited if users run jobs on their set of purchased nodes.
  10. Allocations will be charged for 36 cores per node for jobs run as ‘exclusive’, regardless of the number of cores used (see the worked example after this list).
  11. A default amount of memory is set per job. Users are encouraged to request only the amount of memory needed by a job.
  12. For nonexclusive jobs, allocations will be charged based on the higher of two metrics: the number of cores or the amount of memory used.
  13. Wendian will have approximately 1 Pbyte (1000 Tbytes) of storage with the majority in scratch.  Additional storage is available for research groups to purchase.  Files stored in owned storage will not be purged, and are NOT BACKED UP.
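
As a worked example of policy 10 (the arithmetic only; actual accounting details are set by the HPC Group): a job submitted as ‘exclusive’ on 2 nodes for 10 hours is charged 2 nodes × 36 cores × 10 hours = 720 core-hours, even if it only uses a few cores on each node.
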
Mio policies:
  1. Tech Fee allows students who are not supported by a researcher to run on Mio. Use of Mio in this capacity precludes faculty from authorship of papers based on associated research.
  2. Research group members (students and faculty) may run on Mio if their research group has purchased nodes.
  3. Running on nodes outside of a research group’s partition (running in compute) exposes those jobs to the risk of preemption.
  4. Nodes on Mio that fail outside of warranty will be retired.
  5. Neither new nodes nor new research groups will be added to Mio.

Data and storage policies:
  1. Directories of users who leave Mines will be deleted after 3 months. It is the PI’s responsibility to archive any desired data before that time.
  2. Directories inactive and not accessed for 1 year are subject to deletion. (Mines retains the right to remove directories should circumstances mandate).
  3. Wendian will have approximately 1 Pbyte (1000 Tbytes) of storage with a majority of the storage in scratch. Research groups may “purchase” additional storage. Files stored in owned storage will not expire and are not backed up.

Machine Status Information

Ganglia

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
Mio Ganglia
Wendian Ganglia

Running Jobs

The following links show web pages displaying the running jobs (the same information as the command line tool).

Mio Jobs
Wendian Jobs
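On the machines themselves, the same information is available from SLURM’s squeue command:

  squeue                 # all running and queued jobs
  squeue -u $USER        # only your own jobs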

Node Usage

The following links show web pages displaying each node’s status (the same information as the command line tool).

Mio Nodes
Wendian Nodes
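On the machines themselves, node status is available from SLURM’s sinfo command:

  sinfo                  # summary of partitions and node states
  sinfo -N -l            # one line per node with state and resources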