Preserve

Overview

Researchers should ensure that all research data, regardless of format, are stored securely and backed up or copied regularly. During and after a research project, it is the responsibility of the PI, via a sound Data Management Plan, to specify which data need to be archived and preserved and where they should be preserved.

Sustainable Data Formats

Overview

The file format in which data are stored and archived is a primary factor in the ability to use data in the future. As the custodian of the primary data, the researcher should adopt an orderly system of data organization and should communicate the chosen system to all members of a research group and, where applicable, to the appropriate administrative personnel.
File formats and file naming according to standards are necessary to ensure that data can be uniquely identified and made accessible for future uses.

When selecting tools for storing your data and preparing it for archiving, pay special attention to the output formats of your data. Data stored in a proprietary or obsolete format may be unusable to other researchers.


Accessible Formats

Formats more likely to be accessible in the future are:

  • Non-proprietary
  • Open, documented standard
  • Common usage by research community
  • Standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed
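
As a rough illustration of the "standard representation" criterion above, the following sketch checks whether a file's bytes decode cleanly as ASCII or UTF-8. The function name `text_encoding` and the choice to read the whole file into memory are assumptions made for this example; it is a quick screening aid, not a format validator.

```python
from pathlib import Path
from typing import Optional


def text_encoding(path) -> Optional[str]:
    """Return 'ascii' or 'utf-8' if the file decodes cleanly, else None."""
    raw = Path(path).read_bytes()
    # Try the stricter encoding first so pure ASCII files are reported as such.
    for encoding in ("ascii", "utf-8"):
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None
```

A result of `None` suggests the file is binary or uses another encoding, and may deserve a closer look before archiving.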

Preferred Formats (general)

  • PDF/A – not Microsoft Word
  • ASCII – not Microsoft Excel
  • MPEG-4 – not Quicktime
  • TIFF or JPEG2000 – not GIF or JPG
  • XML or RDF – not RDBMS
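
The general preferences above could be applied to a list of file names with a small screening script. This is a minimal sketch: the extension-to-format mapping (including the choice of CSV as an ASCII representation for spreadsheets) and the function name `flag_for_conversion` are assumptions for illustration, not an official list.

```python
from pathlib import Path

# Hypothetical mapping from common file extensions to the preferred
# alternative named in the list above. The extensions are assumptions.
SUGGESTED_FORMAT = {
    ".doc": "PDF/A",
    ".docx": "PDF/A",
    ".xls": "ASCII text (for example, CSV)",
    ".xlsx": "ASCII text (for example, CSV)",
    ".mov": "MPEG-4",
    ".gif": "TIFF or JPEG2000",
    ".jpg": "TIFF or JPEG2000",
    ".jpeg": "TIFF or JPEG2000",
}


def flag_for_conversion(filenames):
    """Return {filename: suggested format} for files in a less-preferred format."""
    flagged = {}
    for name in filenames:
        suffix = Path(name).suffix.lower()  # compare extensions case-insensitively
        if suffix in SUGGESTED_FORMAT:
            flagged[name] = SUGGESTED_FORMAT[suffix]
    return flagged
```

Files that are not flagged are not guaranteed to be suitable for preservation; the check only catches the extensions listed in the mapping.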

Preferred Formats (detailed)

Below are tables that list different data types and the preferred formats for long-term preservation of the data. Other acceptable formats are listed, but these may or may not ensure the long-term preservation of the data. Not all of these formats are accepted in the Mines institutional repository.


Digital Audio Data

Feature             Value
Teraflop rating     154 teraflops (roughly 7x RA)
Memory              17.4 terabytes
Nodes               656
Cores               10,496
Disk                480 terabytes

Digital Image Data

Specification         Features
Blue Gene Q           New Architecture
PowerPC A2 17 Core    Designed for large core count jobs
512 Nodes             Highly Scaleable
8,192 Cores           Multilevel parallelism - Direction of HPC
8,192 Gbytes          Room to Grow
104 Tflops            Future looking machine

Digital Video Data

Specification                 Feature
iDataPlex                     Latest Generation Intel Processors
Intel 8x2 core SandyBridge    Large Memory / Node
144 Nodes                     Common architecture
2,304 Cores                   Similar user environment to RA and Mio
9,216 Gbytes                  Quickly get researchers up and running
50 Tflops

Chemistry Data:

Linux for HPC                                        Linux
HPC Overview                                         HPC-Overview.pdf
Overview of BlueM                                    newblue.pdf
MPI Part 1                                           mpi01.pdf
MPI Part 2                                           mpi02.pdf
Finite Difference Code in MPI (description)          stoma.pdf
Finite Difference Code in MPI (basic versions)       stomb.pdf
Finite Difference Code in MPI (advanced versions)    stomc.pdf
OpenMP                                               openmp.pdf
Hybrid MPI/OpenMP                                    hybrid.pdf
Source Code for the Above Tutorials                  examples
Full List of Tutorial Examples                       Examples
  (very large - growing list)
Fortran 90 for Fortran .le. 77 Programmers           Fortran 90
Batch Scripting for Parallel Systems                 Batch
  Updated to include SLURM examples
Connecting to Mio/AuN/Mc2                            Connecting
  Setting up Keys

Geospatial Data:

Script for Mio Power Nodes

Explanation of the differences
#!/bin/bash 
#SBATCH --job-name="hybrid"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks=4
##SBATCH --exclusive
#SBATCH --time=00:05:00
#SBATCH -p ppc
#SBATCH --export=NONE
#SBATCH --get-user-env=10L


# Go to the directory from
# which our job was launched
cd $SLURM_SUBMIT_DIR

module purge
module load StdEnv

srun --mpi=pmi2 --export=ALL ./helloc

#SBATCH -p ppc

Forces the job to run on the Power nodes.

#SBATCH --export=NONE

Prevents the environment variables set on Mio from being used in the script.

#SBATCH --get-user-env=10L

Sets up an environment similar to what you would get if you logged on to the node.

--mpi=pmi2

Required to get the proper setting for MPI.

--export=ALL

Makes all of the Power nodes' variables available to your program.

Qualitative Data:

Owner                     Department                        Reference   Nodes
Brennecka, Geoff          Metallurgical & Materials Eng.    gbrennec    compute[198-201]
Brune, Juergen            Mining Eng.                       jbrune      compute[032-033], compute[036-037], compute[100-101]
Carr, Lincoln             Physics                           lcarr       compute024, compute[062-067], compute[073-077], compute[128-129], compute[172-173], compute196
Ciobanu, Cristian         Mechanical Eng.                   cciobanu    compute054, compute[090-091]
Durfee, Chip              Physics                           cdurfee     compute[176-177]
Eberhart, Mark            Chemistry                         meberhar    compute[194-195]
Ganesh, Mahadevan         Applied Mathematics & Statistics  mganesh     compute[056-059], compute061, compute[160-167], gpu003
Gomez Gualdron, Diego     Chemical & Biological Eng.        gualdron    compute[180-191], compute197
Gregg, Karen (Leiderman)  Applied Mathematics & Statistics  kleiderman  compute025, compute[178-179]
Kaiser, Tim               HPC                               hpc         compute[078-079], compute[084-089], compute[192-193], ppc[001-002]
Kappes, Branden           Mechanical Eng.                   bkappes     compute[174-175]
Kazemi, Hossein           Petroleum Eng.                    hkazemi     compute080
Lusk, Mark                Physics                           mlusk       compute[038-039], compute[092-093], compute[126-127]
Monney, Mike              Civil & Environmental Eng.        mooney      compute[049-050]
Newman, Alexandra         Mechanical Eng.                   anewman     compute055
Packard, Corinne          Metallurgical & Materials Eng.    cpackard    compute125
Pankavich, Stephen        Applied Mathematics & Statistics  pankavic    compute124
Pankavich, Stephen        Applied Mathematics & Statistics  pankavic    compute026, compute124
Sava, Paul                Geophysics                        psava       compute083, compute103, compute[105-112], compute[114-121], compute[136-159]
Shragge, Jeffrey          GEOP                              geop        compute[000-011]
Sullivan, Neal            Mechanical Eng.                   nsulliva    compute[122-123], compute[132-135]
Sum, Amadeu               Chemical & Biological Eng.        asum        compute[051-052], compute[094-099]
Taylor, Pat               Metallurgical & Materials Eng.    prtaylor    gpu004
Thomas, Brian             Mechanical Eng.                   bgthomas    compute[168-171]
Tilton, Nils              Mechanical Eng.                   ntilton     compute[130-131], compute[202-203]
Tucker, Garritt           Mechanical Eng.                   tucker      compute[204-219]
Tura, Ali                 RCP                               rcp         compute[012-023]
Vyas, Shubham             Chemistry                         svyas       compute[040-041], compute[043-045], compute[068-072]
Zimmerman, Jeramy         Physics                           jdzimmer    compute027

Quantitative Data:

Warranty Expires  Servers  Model     Cores/node  Nodes/Unit  Nodes  compute  Node Owners
2013-03-09        2        R1702     8           2           4      0-3      jbrune
2013-04-21        6        R1702     8           2           12     4-15     tkaiser, mlusk, zhiwu
2013-05-14        5        R1702     8           2           10     16-25    cmmaupin
2013-08-31        3        R1702     8           2           6      26-31    lcarr, hkazemi
2013-10-06        8        R1752     12          2           16     32-47    psava
2014-06-24        1        R1752     12          2           2      49-50    mooney
2014-06-24        1        R1752     12          2           2      51-52    asum
2014-07-12        4        R1752     12          2           8      54-61    cciobanu, mganesh, anewman
2014-11-04        12       R1752     12          2           24     62-83    psava
2014-11-02        3        R1752     12          2           6      84-91    tkaiser, cciobanu
2015-03-26        5        R1752     12          2           10     92-101   asum, mlusk, jbrune
2015-09-28        5        R2840     16          4           20     102-121  ireimani, psava
2016-08-26        1        D12 -16   16          2           2      122-123  nsulliva
2017-03-15        1        D12 -16   16          2           2      124-125  pconstan
2017-06-15        1        D12 -20   20          2           2      126-127  musk
2017-06-20        1        D12 -20   20          2           2      128-129  lcarr
2017-06-17        1        D12 -20   20          2           2      130-131  ntilton
2018-04-01        7        NTS-6028  24          4           28     132-159  nsulliva, psava
2018-08-01        2        NTS-6028  24          4           8      160-167  mganesh
2019-02-01        1        NTS-6028  24          4           4      168-171  bgthomas

Quantitative Data:

Line Item  Item Description (1U Two Node Server)                                 QTY
1          1U NexlinkD12 Server with 1280W 80+ PSU (2 Nodes Per 1U)              1
2          Intel 8 Core Xeon E5 2680 2.7GHz S2011                                4
3          8GB DDR3 1600MHz Memory Reg ECC SR 1.5V (64GB per node)               16
4          2TB Enterprise Class HDD SATA 7200RPM                                 2
5          Intel QDR InfiniBand HCA                                              2
6          Standard 3 Year Return to Depot Warranty, 3 Year Advanced Cross-ship  1

Scripts and Computer Code

Preferred Formats
Work directly with Research Support Services for latest information

Documentation

          $SCRATCH           $HOME + $BINS (Combined Total)
BlueM     2,000,000 Files    20 GB
Mio       2,000,000 Files    20 GB


Above tables adapted from:

Physical Samples

Overview

Under the NSF research results dissemination and sharing guidelines, the definition of research data includes samples and physical collections. The researcher is responsible for management of physical samples during the research project. Mines is required to archive physical samples if they are needed to verify or reproduce research results or to extend the research in new directions.


Physical Storage

Some departments have physical storage spaces for samples used during student research. Check with your department. If no physical department storage is available, then arrangements may be possible with the Mines Geology museum. The researcher needs to consult with the Museum Director about storage after a project ends and any related budget issues. Ideally, this has been considered before writing a proposal data management plan. The project budget must include the total cost of archival storage including specific equipment and facilities needed for the proposed storage time. Additionally, sample storage requirements that are beyond the present capabilities of existing facilities must include plans needed to develop cost accounting and implement budgeting procedures.


Documentation for Physical Collections and Samples

Create a document that can be kept on file with the proposal and that includes the following:

  • A general description of sample type(s), such as polished sections of metallurgical alloys; microscope slides with mounted sections of tissue; concrete samples; ampoules of liquid; etc.
  • A general description of sample size
  • An estimate of the number of samples that will be generated during the study. An order-of-magnitude estimate (tens, hundreds, or thousands of samples) is sufficient
  • Special conditions for storage should be described, such as temperature control, vacuum, isolation, etc.
  • Special security or access issues to samples should be described
  • The time period for archival storage must be specified. In some cases, only an estimate may be possible, but the conditions for extension of archival times should be clearly described
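
One way to keep these items together is a small structured record that can be filed alongside the proposal. The sketch below is only an illustration: the class name `SampleArchiveRecord`, its field names, and the `open_items` check are hypothetical and not a Mines template.

```python
from dataclasses import dataclass
from typing import List, Optional

# Order-of-magnitude labels taken from the list above.
MAGNITUDES = ("tens", "hundreds", "thousands")


@dataclass
class SampleArchiveRecord:
    """Hypothetical record mirroring the documentation items listed above."""

    sample_types: str                    # general description of sample type(s)
    sample_size: str                     # general description of sample size
    estimated_count: str                 # one of MAGNITUDES
    storage_conditions: str = ""         # e.g. temperature control, vacuum, isolation
    access_restrictions: str = ""        # special security or access issues
    archive_years: Optional[int] = None  # archival period; an estimate is acceptable
    extension_conditions: str = ""       # when the archival period may be extended

    def open_items(self) -> List[str]:
        """Return the items that still need attention before filing."""
        items = []
        if self.estimated_count not in MAGNITUDES:
            items.append("order-of-magnitude sample count")
        if self.archive_years is None:
            items.append("archival period")
        return items


record = SampleArchiveRecord(
    sample_types="polished sections of metallurgical alloys",
    sample_size="small mounted sections",  # illustrative value
    estimated_count="hundreds",
)
print(record.open_items())  # ['archival period']
```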

Sponsor Requirements for Preservation

Overview

Many federal agencies and other funders expect researchers to share their data after a research project ends, and some journals and societies now require data archiving. Depending on the type of research, various subject domain and funding agency requirements exist for how soon data are expected to be made available and for how long.


Mines Policy

Mines recommends that research data should be archived for a minimum of three years after the final project close-out, with original data retained wherever possible. The researcher should review funder/sponsor requirements.


Federal Agency Expectations

Most funding agency data sharing policies ask that data from projects be shared in a timely manner, understanding that what constitutes a “timely manner” will vary from project to project. Many funding agencies allow for embargo periods for political, commercial, or patent reasons, as long as they are explained in the Data Management Plan.

When retention periods are specified, it is important to understand when the clock starts ticking. OMB Circular A-110 states that the retention period is three years from the date the final financial report is submitted. NIH uses the same language. The NSF General Grant Conditions, however, state that records must be retained for three years after submission of all required reports. Researchers should check the retention requirements of each sponsor they work with. Here are some examples of funding agency data availability and retention periods:

  • NSF Engineering Directorate: Accessible for a minimum of three years after the end of the project or public release, whichever is later. Release “at the earliest reasonable time.”
  • NSF Earth Sciences Division: Made openly available no later than two years after data collection
  • NSF Ocean Sciences Division: Made openly available no later than two years after data collection
  • NOAA: Available no later than two years after data collection
  • NIH: Available no later than the acceptance for publication of main findings from final data set
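
The point about when the retention clock starts can be made concrete with a short date calculation. This is a minimal sketch under stated assumptions: the function names and the example dates are hypothetical, and the three-year default reflects only the periods quoted above, so the sponsor's actual terms always take precedence.

```python
from datetime import date


def retention_end(clock_start: date, years: int = 3) -> date:
    """Return the date `years` after `clock_start`.

    A 29 February start falls back to 28 February when the target
    year is not a leap year.
    """
    try:
        return clock_start.replace(year=clock_start.year + years)
    except ValueError:
        return clock_start.replace(year=clock_start.year + years, day=28)


def clock_start_all_reports(report_dates) -> date:
    """Clock start when it depends on submission of all required reports:
    the date of the last report submitted."""
    return max(report_dates)


# Illustrative dates only.
financial_report = date(2021, 3, 31)
technical_report = date(2021, 6, 30)

# Clock starts at the final financial report.
print(retention_end(financial_report))  # 2024-03-31
# Clock starts once all required reports are in.
print(retention_end(clock_start_all_reports([financial_report, technical_report])))  # 2024-06-30
```

For the same project, the two rules give end dates three months apart, which is why the starting event matters.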

Dos and Don'ts

Overview

This page provides some quick dos and don'ts to make preserving your data easier.


Planning

  • Create a sound data management plan addressing both funder requirements and the expectations of the field with regard to data collection, management, and sharing
  • Estimate the amount of data required for your project as early as possible
  • Include costs for data storage (including storage of back-up copies) in proposal budgets
  • Notify Research Support Services and CCIT of upcoming storage requirements so they can help with planning and avoid delays
  • If access to data needs to be restricted, be sure a clear rationale is provided to funders and describe any limitations or permissions that may exist for access
  • Be sure to address access to the data after the project ends
  • Keep in mind that a bit more “up-front” planning may actually mean less work in the long-term

Project Work

  • Enact a robust backup plan to ensure data are not lost and that they are shared with other project researchers securely and safely
  • Establish a data organization structure and file-naming convention and use it consistently
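
A file-naming convention is easier to apply consistently when it can be checked automatically. The sketch below assumes a hypothetical convention of the form `project_YYYYMMDD_description_vNN.ext` in lowercase; the pattern and the function name `nonconforming` are illustrative, and a real project would substitute its own rule.

```python
import re

# Hypothetical convention: project_YYYYMMDD_description_vNN.ext
# (lowercase letters and digits; hyphens allowed inside the description).
NAME_PATTERN = re.compile(r"[a-z0-9]+_\d{8}_[a-z0-9-]+_v\d{2}\.[a-z0-9]+")


def nonconforming(filenames):
    """Return the file names that do not follow the convention."""
    return [name for name in filenames if not NAME_PATTERN.fullmatch(name)]


print(nonconforming([
    "flume_20240115_raw-pressure_v01.csv",
    "Final data (2).xlsx",
]))  # ['Final data (2).xlsx']
```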

End of Project

  • Based on the data management plan, determine which data should be preserved and work with Research Support Services to deposit them in the institutional repository or another appropriate repository
  • Have three copies of your data—the original master file, a local backup (e.g., on an external hard drive) and an external backup (e.g., on a managed networked drive or on a web-based storage service).
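
Keeping three copies only helps if the copies stay identical to the master. One common way to check this is to compare checksums; the sketch below uses SHA-256 from the Python standard library. The function names and the idea of passing the copies as file paths are assumptions for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(master, *copies):
    """Return True when every copy has the same digest as the master file."""
    expected = sha256_of(master)
    return all(sha256_of(copy) == expected for copy in copies)
```

Running such a check on a schedule, and after any transfer between storage locations, gives early warning of a corrupted or incomplete copy.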

Sharing Data

  • Ensure data do not become inaccessible if someone leaves the project
  • Determine whether any data need to be restricted and work with Research Support Services to enable gatekeeping