The latest Advanced RISC Machine (ARM) processors are said to be faster and more economical than x86, but exactly how fast and economical? Where’s the optimal cost-performance sweet spot? Typical bioinformatics and high-performance computing (HPC) researchers don’t have the time or experience to systematically benchmark their own applications on all available processors or to calculate analysis cost on the public cloud.
This blog fills these gaps with a systematic job runtime benchmark and cloud cost estimate using the popular award-winning life sciences HPC application, Sentieon DNAseq, on the latest ARM and x86 instances offered by Oracle Cloud Infrastructure (OCI).
Genomic sequence analysis is an important component in most segments of the life sciences and healthcare industries and is key to numerous technology breakthroughs, including the latest Covid-19 vaccines. Among several popular genomic sequence analysis applications, Sentieon DNAseq distinguishes itself as a drop-in replacement for the gold-standard BWA-GATK pipeline. This professionally developed software not only provides identical results to BWA-GATK, but is also easy to set up and runs 5–20 times faster than GATK. For these reasons, we selected the latest Sentieon DNAseq v202112.01 release for this benchmark exercise.
Treat the computing cost quoted in this report as a rough estimate of the minimum cloud resource cost for WGS analysis with Sentieon DNAseq. The complete WGS analysis process can include other steps, such as setting up job execution scripts, uploading input data sets, and downloading output files. The cloud cost of these steps is shared by all the analysis jobs performed on the instance and isn’t included in the estimated cost in Table 2. The costs of other cloud resources required for production operation, such as FastConnect networking and object storage, and of the Sentieon DNAseq license also aren’t included.
Benchmarked OCI instance shapes
This benchmark covered the latest processors commonly offered by public clouds. Table 1 summarizes the OCI instance shape and processor specifications.
Table 1. Specification of benchmarked OCI instance shapes and processors

| # | Processor | OCI instance shape | Processor specification | Max # of OCPU/vCPU in VM shape | Processor clock rate (MHz) |
|---|---|---|---|---|---|
| 1 | Intel | VM.Standard3.Flex | Xeon(R) Platinum 8358 (Ice Lake) | 32/64 | 2660 |
| 2 | AMD | VM.Standard.E4.Flex | AMD EPYC 7J13 (Milan) | 64/128 | 2445 |
| 3 | ARM | VM.Standard.A1.Flex | Ampere Altra A1 | 80/80 | 3000 |
We selected Oracle Cloud’s virtual machine (VM) shapes for this work because they can be flexibly configured with CPU core count and amount of memory, allowing you to customize resource usage to your application. In contrast, bare metal shapes are designed for resource-hungry workloads. The genomic analyses studied here have lower resource requirements, so we excluded bare metal shapes from this study because they’re unlikely to provide an optimal cost-performance configuration.
Benchmarked test cases
The whole genome sequencing (WGS) secondary analysis pipeline implemented by Sentieon DNAseq matches the industry-standard BWA-GATK best practices pipeline. We benchmarked and analyzed seven sets of WGS data in this exercise and found all data sets demonstrated similar runtime and cost characteristics. For simplicity, this blog focuses on two data sets: 30X HG002 PCR-free samples sequenced on Illumina’s HiSeq X and NovaSeq systems.
We used GRCh38 as the reference, including the GRCh38 primary contigs and other decoy contigs, without ALT contigs or HLA genes. GRCh38 has significant improvements compared to the previous generation GRCh37 and is commonly adopted by the human genetics community.
The benchmarked pipeline takes FASTQ files as input and produces VCF files as output. It includes alignment, sorting, deduplication, base quality score recalibration (BQSR), and variant calling.
Benchmarked results
Table 2. Sentieon DNAseq genomic analysis job execution time and cost per sample

| Processor shape size | vCPU / Memory (GB) / Storage (GB) | Instance cost (USD/hour) | Boot volume cost (USD/hour) | Total cost (USD/hour) | NovaSeq runtime (hours) | NovaSeq total cost (USD) | HiSeq X runtime (hours) | HiSeq X total cost (USD) |
|---|---|---|---|---|---|---|---|---|
| Intel-M | 64/48/500 | 1.352 | 0.01713 | 1.369 | 1.7 | 2.327 | 1.88 | 2.574 |
| Intel-L | 64/64/1000 | 1.376 | 0.0457 | 1.422 | 1.49 | 2.119 | 1.5 | 2.133 |
| Intel-X | 64/128/1000 | 1.472 | 0.0457 | 1.518 | 1.46 | 2.216 | 1.52 | 2.307 |
| AMD-S | 64/48/500 | 0.872 | 0.01713 | 0.889 | 1.46 | 1.298 | 1.68 | 1.494 |
| AMD-M | 64/64/1000 | 0.896 | 0.0457 | 0.942 | 1.19 | 1.121 | 1.27 | 1.196 |
| AMD-L | 128/64/1000 | 1.696 | 0.0457 | 1.742 | 0.8 | 1.394 | 0.97 | 1.69 |
| AMD-X | 128/128/1000 | 1.792 | 0.0457 | 1.838 | 0.752 | 1.382 | 0.88 | 1.617 |
| ARM-S | 40/40/500 | 0.3834 | 0.01713 | 0.401 | 2.25 | 0.902 | 2.42 | 0.97 |
| ARM-M | 48/48/1000 | 0.4754 | 0.0457 | 0.521 | 1.83 | 0.953 | 1.85 | 0.964 |
| ARM-L | 80/80/1000 | 0.8434 | 0.064 | 0.907 | 1.06 | 0.961 | 1.09 | 0.99 |
| ARM-X | 80/128/1000 | 0.9154 | 0.064 | 0.979 | 1.03 | 1.008 | 1.05 | 1.03 |
Hourly cost data was obtained from the OCI product cost estimator. An extra $0.0171 or $0.0457 was added to reflect the hourly cost of the 500-GB or 1,000-GB storage subsystem. Here, we used a balanced block volume (VPU: 10) with an IOPS target of 25,000 and a throughput target of 240 MB/s or 480 MB/s.
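The per-sample totals in Table 2 follow from simple arithmetic: the hourly instance rate plus the hourly boot volume rate, multiplied by the job runtime. The sketch below reproduces that calculation for three rows using the NovaSeq runtimes. The function and dictionary names are illustrative and not part of the original benchmark, and the results match the table only to within rounding.

```python
# Hourly rates (USD) and NovaSeq runtimes (hours) copied from Table 2.
# The dictionary layout and names are illustrative assumptions.
CONFIGS = {
    "Intel-L": {"instance": 1.376, "boot_volume": 0.0457, "runtime_h": 1.49},
    "AMD-M": {"instance": 0.896, "boot_volume": 0.0457, "runtime_h": 1.19},
    "ARM-L": {"instance": 0.8434, "boot_volume": 0.064, "runtime_h": 1.06},
}


def cost_per_sample(instance, boot_volume, runtime_h):
    """Per-sample cost = (instance rate + boot volume rate) * runtime in hours."""
    return (instance + boot_volume) * runtime_h


for name, cfg in CONFIGS.items():
    print(f"{name}: ${cost_per_sample(**cfg):.3f} per NovaSeq sample")
```

The same function can be applied to the HiSeq X runtimes by swapping in the corresponding runtime column.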
Figure 1. On-demand compute cost of cloud computing resource per data sample
Cost performance sweet spot
Figure 2 plots the on-demand cost of cloud computing resources against analysis job runtime. From it, you can identify the sweet spot for optimal cost-performance, as well as the configurations with the lowest cost and the shortest runtime.
Figure 2. Sentieon DNAseq genomic analysis job runtime versus on-demand cost per sample
Figure 2 shows that AMD-M (64-core) and ARM-L/X (80-core) can finish the analysis in about one hour at a cost of about $1 or less. So, these configurations are the optimal sweet spots for this type of workload. Configured with the same number of vCPU cores, the Intel and AMD shapes delivered the same level of job run speed. The Intel shape is limited by the maximum number of vCPU cores available (64 for VM.Standard3.Flex) compared to the AMD (128) or ARM (80) shapes, and Intel instances also have a higher price per vCPU. Benchmarks using the HiSeq X dataset demonstrated cost-performance characteristics similar to those in Figure 2.
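One way to make the sweet-spot selection explicit is to filter the NovaSeq results by a runtime ceiling and a cost ceiling. The sketch below does this with data copied from Table 2. The two thresholds (1.25 hours and $1.15) are illustrative assumptions chosen to approximate "about one hour" and "about $1"; they aren't values from the benchmark, and the function name is hypothetical.

```python
# (runtime in hours, cost in USD) per NovaSeq sample, copied from Table 2.
NOVASEQ = {
    "Intel-M": (1.7, 2.327),
    "Intel-L": (1.49, 2.119),
    "Intel-X": (1.46, 2.216),
    "AMD-S": (1.46, 1.298),
    "AMD-M": (1.19, 1.121),
    "AMD-L": (0.8, 1.394),
    "AMD-X": (0.752, 1.382),
    "ARM-S": (2.25, 0.902),
    "ARM-M": (1.83, 0.953),
    "ARM-L": (1.06, 0.961),
    "ARM-X": (1.03, 1.008),
}

# Illustrative thresholds (assumptions, not from the benchmark).
MAX_RUNTIME_H = 1.25
MAX_COST_USD = 1.15


def sweet_spots(results, max_runtime_h, max_cost_usd):
    """Return the names of configurations within both the runtime and cost limits."""
    return sorted(
        name
        for name, (runtime_h, cost_usd) in results.items()
        if runtime_h <= max_runtime_h and cost_usd <= max_cost_usd
    )


print(sweet_spots(NOVASEQ, MAX_RUNTIME_H, MAX_COST_USD))
# ['AMD-M', 'ARM-L', 'ARM-X']
```

Tightening or loosening either threshold changes the selection, so choose limits that reflect your own turnaround and budget requirements.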
Stay tuned for our next blog, which focuses on the unique RDMA network capability provided only by OCI’s Intel shapes and the benefit to applications such as Sentieon DNAseq, which can run parallel jobs across multiple instances.
Deliver optimal cloud economy to customers
The OCI sweet spot configurations ensured optimal computing resource utilization and delivered optimal cloud economy to customers. Figure 3 shows the job performance metrics captured by the OCI control panel during a typical benchmark run on a VM.Standard.A1.Flex instance.
Sentieon DNAseq achieved nearly 100% CPU utilization, except when the system was handling heavy I/O. This result is consistent with our expectation that alignment and haplotype variant calling are CPU-intensive stages, while sorting and deduplication are I/O-intensive. Memory utilization was up to 90%, while the peak disk read and write bytes reached the target limit of 240 MB/s during the most I/O-intensive stage. We also observed narrower I/O peaks, that is, a shorter runtime during the I/O-intensive stage, when we increased the disk size from 500 GB to 1,000 GB, which raised the I/O performance target from 240 MB/s to 480 MB/s. In many test cases, this increase reduced the total runtime by about 5–8 minutes.
Figure 3. Performance metrics from a VM.Standard.A1.Flex (OCPU=40) during benchmark runs.
Discussion
CPU performance
To further improve runtime, Sentieon DNAseq can utilize more vCPUs by running on multiple instances in parallel. This capability is beyond the scope of this study, which focuses on testing single-VM configurations with the latest processors. Our next blog focuses on parallel job runs across multiple instances, where we plan to benchmark Sentieon DNAseq’s capability in this area.
In our observation, the job runtime on x86 benefited from hyperthreading. Although ARM processors don’t support hyperthreading, they delivered competitive performance against x86 processors.
Memory performance
Sentieon DNAseq is also efficient in memory management. All stages except for alignment use less than 10 GB of memory. This efficiency allows for handling much larger datasets without resorting to much more expensive instance shapes. The alignment stage benefits slightly from more memory, but it reaches diminishing returns at 128 GB of memory with the X-size shapes.
I/O performance
To accurately represent production analysis use cases and to best evaluate the I/O capability of the benchmarked instance shapes, we included all key output files commonly required by WGS analysis, resulting in about 90 GB of result files for NovaSeq samples and 120 GB for HiSeq X samples. All instances used in this study have a single boot volume as the storage subsystem. We observed an I/O speed improvement when the boot volume size (and the corresponding throughput performance target) was increased from 500 GB (240 MB/s) to 1,000 GB (480 MB/s).
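As a rough back-of-the-envelope check, the sketch below estimates the minimum time needed just to write the result files at each throughput target. It assumes sustained transfer at the full target rate and decimal units (1 GB = 1,000 MB), and it ignores input reads and intermediate files, so it gives only a lower bound on I/O time rather than a measured figure. The function name is illustrative.

```python
def transfer_minutes(size_gb, throughput_mb_s):
    """Minimum minutes to stream size_gb at a sustained rate (1 GB = 1,000 MB)."""
    return size_gb * 1000 / throughput_mb_s / 60


# Result file sizes (GB) quoted in the text for each sample type.
for label, size_gb in (("NovaSeq", 90), ("HiSeq X", 120)):
    slow = transfer_minutes(size_gb, 240)  # 500-GB volume target
    fast = transfer_minutes(size_gb, 480)  # 1,000-GB volume target
    print(f"{label}: {slow:.1f} min at 240 MB/s, {fast:.1f} min at 480 MB/s")
```

Doubling the throughput target halves this lower bound, which is consistent in direction with the runtime reduction observed when moving to the larger boot volume.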
Throughput and scalability
If we define the throughput of a Compute instance as the number of 30X whole genomes processed per day, then throughput = 24 / (runtime in hours needed to process a single 30X dataset).
Figure 4 plots the throughput against the number of vCPUs for the benchmarked instance shapes. The AMD-X shape can process 32 30X datasets each day, while the ARM-S shape can only process 10 datasets during the same period.
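The throughput definition above translates directly into code. This minimal sketch applies it to the NovaSeq runtimes of the AMD-X and ARM-S shapes from Table 2; the function name is an illustrative choice.

```python
def genomes_per_day(runtime_h):
    """Throughput = 24 hours / runtime (hours) for one 30X dataset."""
    return 24.0 / runtime_h


# NovaSeq runtimes (hours) copied from Table 2.
NOVASEQ_RUNTIME_H = {"AMD-X": 0.752, "ARM-S": 2.25}

for shape, runtime_h in NOVASEQ_RUNTIME_H.items():
    print(f"{shape}: {genomes_per_day(runtime_h):.1f} genomes/day")
```

For AMD-X the result rounds to 32 genomes per day, and for ARM-S it falls between 10 and 11, so 10 complete datasets finish within a day.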
This figure also demonstrates the runtime scalability of Sentieon DNAseq analysis on OCI. The throughput scales roughly in proportion to the number of vCPU cores, demonstrating that VM shapes of different CPU architectures on OCI deliver optimal speed performance and that Sentieon DNAseq maximally utilizes the available computing resources.
Figure 4. Number of 30X WGS processed per day versus number of vCPUs
Source: oracle.com