Missing SSD slot access and graphics card misconfiguration

10-28-2021 04:41 AM
Please have a look at the PCIe-related remarks in the report below.
The NVIDIA Validation Suite (NVVS) is the system administrator and cluster manager's tool for detecting and troubleshooting common problems affecting NVIDIA® Tesla™ GPUs in high performance computing environments. NVVS focuses on software and system configuration issues, diagnostics, topological concerns, and relative performance.
The NVIDIA Validation Suite is designed to:
- Provide a system-level tool, in production environments, to assess cluster readiness levels before a workload is deployed.
- Facilitate multiple run modes:
  - Interactive via an administrator or user in plain text.
  - Scripted via another tool with easily parseable output.
- Provide multiple test timeframes to facilitate different preparedness or failure conditions:
  - Quick tests to use as a readiness metric
  - Medium tests to use as an epilogue on failure
  - Long tests to be run by an administrator as post-mortem
- Integrate the following concepts into a single tool to discover deployment, system software and hardware configuration issues, basic diagnostics, integration issues, and relative system performance:
  - Deployment and Software Issues
    - NVML library access and versioning
    - CUDA library access and versioning
    - Software conflicts
  - Hardware Issues and Diagnostics
    - Pending Page Retirements
    - PCIe interface checks
    - NVLink interface checks
    - Framebuffer and memory checks
    - Compute engine checks
  - Integration Issues
    - PCIe replay counter checks
    - Topological limitations
    - Permissions, driver, and cgroups checks
    - Basic power and thermal constraint checks
  - Stress Checks
    - Power and thermal stress
    - Throughput stress
    - Constant relative system performance
    - Maximum relative system performance
    - Memory Bandwidth
- Provide troubleshooting help
- Easily integrate into Cluster Scheduler and Cluster Management applications
- Reduce downtime and failed GPU jobs
Beyond the Scope of the NVIDIA Validation Suite
NVVS is not designed to:
- Provide comprehensive hardware diagnostics
- Actively fix problems
- Replace the field diagnosis tools. Please refer to http://docs.nvidia.com/deploy/hw-field-diag/index.html for that process.
- Facilitate any RMA process. Please refer to http://docs.nvidia.com/deploy/rma-process/index.html for those procedures.
Dependencies
- NVVS requires an NVIDIA Linux driver to be installed. Both the standard display driver and Tesla Recommended Driver will work. You can obtain a driver from http://www.nvidia.com/object/unix.html.
- NVVS requires the standard C++ runtime library with GLIBCXX version 3.4.5 or later.
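One hedged way to confirm the GLIBCXX requirement is to list the versioned symbols exported by the system libstdc++; the library path below is an assumption and varies by distribution:
user@hostname
$ strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX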
The NVIDIA Validation Suite supports Tesla GPUs running on 64-bit Linux (bare metal) operating systems. NVIDIA® Tesla™ Line:
- All Kepler, Maxwell, Pascal, and Volta architecture GPUs
The various command line options of NVVS are designed to control general execution parameters, whereas detailed changes to execution behavior are contained within the configuration files detailed in the next chapter.
The various options for NVVS are as follows:
- --statspath: Write the plugin statistics to a given path rather than the current directory.
- -a, --appendLog: When generating a debug logfile, do not overwrite the contents of a current log. Used in conjunction with the -d and -l options.
- -c, --config: Specify the configuration file to be used. The default is /etc/nvidia-validation-suite/nvvs.conf
- --configless: Run NVVS in a configless mode. Executes a "long" test on all supported GPUs.
- -d, --debugLevel: Specify the debug level for the output log. The range is 0 to 5 with 5 being the most verbose. Used in conjunction with the -l flag.
- -g, --listGpus: List the GPUs available and exit. This will only list GPUs that are supported by NVVS.
- -i, --indexes: Comma separated list of indexes to run NVVS on.
- -j, --jsonOutput: Instructs nvvs to format the output as JSON.
- -l, --debugLogFile: Specify the logfile for debug information. This will produce an encrypted log file intended to be returned to NVIDIA for post-run analysis after an error.
- --quiet: No console output given. See logs and return code for errors.
- -p, --pluginpath: Specify a custom path for the NVVS plugins.
- -s, --scriptable: Produce output in a colon-separated, more script-friendly and parseable format.
- --specifiedtest: Run a specific test in a configless mode. Multiple word tests should be in quotes, and if more than one test is specified it should be comma-separated.
- --parameters: Specify test parameters via the command-line. For example: --parameters "sm stress.test_duration=300" would set the test duration for the SM Stress test to 300 seconds.
- --statsonfail: Output statistic logs only if a test failure is encountered.
- -t, --listTests: List the tests available to be executed through NVVS and exit. This will list only the readily loadable tests given the current path and library conditions.
- -v, --verbose: Enable verbose reporting.
- --version: Displays the version information and exits.
- -h, --help: Display usage information and exit.
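For instance, the configless options above can be combined to run a single named test with a parameter override and JSON output (a hedged example built only from the option descriptions above):
user@hostname
$ nvvs --specifiedtest "SM Stress" --parameters "sm stress.test_duration=300" -j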
To display the list of GPUs available on the system.
user@hostname
$ nvvs -g
NVIDIA Validation Suite (version 352.00)
Supported GPUs available:
[0000:01:00.0] -- Tesla K40c
[0000:05:00.0] -- Tesla K20c
[0000:06:00.0] -- Tesla K20c
An example "quick" test (explained later) using a custom configuration file.
user@hostname
$ nvvs -c Tesla_K40c_quick.conf
NVIDIA Validation Suite (version 352.00)
Software
Blacklist ......................................... PASS
NVML Library ...................................... PASS
CUDA Main Library ................................. PASS
CUDA Toolkit Libraries ............................ PASS
Permissions and OS-related Blocks ................. PASS
Persistence Mode .................................. PASS
Environmental Variables ........................... PASS
To output an encrypted debug file at the highest debug level to send to NVIDIA for analysis after a problem.
user@hostname
$ nvvs -c Tesla_K40c_medium.conf -d 5 -l debug.log
NVIDIA Validation Suite (version 352.00)
Software
Blacklist ......................................... PASS
NVML Library ...................................... PASS
CUDA Main Library ................................. PASS
CUDA Toolkit Libraries ............................ PASS
Permissions and OS-related Blocks ................. PASS
Persistence Mode .................................. PASS
Environmental Variables ........................... PASS
Hardware
Memory GPU0 ....................................... PASS
Integration
PCIe .............................................. FAIL
*** GPU 0 is running at PCI link width 8X, which is below the minimum allowed link width of 16X (parameter: min_pci_width)
The output file, debug.log, would then be returned to NVIDIA.
The NVVS configuration file is a YAML-formatted text file (a human-readable superset of JSON) with three main stanzas controlling the various tests and their execution.
The general format of a configuration file consists of:
%YAML 1.2
---
globals:
  key1: value
  key2: value
test_suite_name:
  - test_class_name1:
      test_name1:
        key1: value
        key2: value
        subtests:
          subtest_name1:
            key1: value
            key2: value
      test_name2:
        key1: value
        key2: value
  - test_class_name2:
      test_name3:
        key1: value
        key2: value
gpus:
  - gpuset: name
    properties:
      key1: value
      key2: value
    tests:
      name: test_suite_name
There are three distinct sections: globals, test_suite_name, and gpus, each with its own subsection of parameters. As with any YAML document, indentation is significant; if errors are generated from your own configuration files, please refer to this example for indentation reference.
- logfile (String): The prefix for all detailed test data able to be used for post-processing.
- logfile_type (String): Can be json, text, or binary. Used in conjunction with the logfile global parameter. Default is JSON.
- scriptable (Boolean): Accepts true or false. Produces a script-friendly, colon-separated output and is identical to the -s command line parameter.
- serial_override (Boolean): Accepts true or false. Some tests are designed to run in parallel if multiple GPUs are given. This parameter overrides that behavior, serializing execution across all tests.
- require_persistence_mode (Boolean): Accepts true or false. Persistence mode is a prerequisite for some tests; this global overrides that requirement and should only be used if it is not possible to activate persistence mode on your system.
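For example, a globals stanza that writes JSON-formatted stats with a custom prefix and produces scriptable console output might look like the following (a minimal sketch using only the keywords above):
globals:
  logfile: nvvs_stats
  logfile_type: json
  scriptable: true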
The gpus stanza may consist of one or more gpusets, each of which will match zero or more GPUs on the system based on their properties (a match of zero will produce an error).
GPUs are matched based on the following criteria with their configuration file keywords in parenthesis:
- Name of the GPU, i.e. Tesla K40c (name)
- Brand of the GPU, i.e. Tesla (brand)
- A comma separated list of indexes (index)
- The GPU UUID (uuid)
- or the PCIe Bus ID (busid)
The matching rules are based on exclusion. First, the list of supported GPUs is taken; if no properties tag is given, then all GPUs will be used in the test. Because a UUID or PCIe Bus ID can only match a single GPU, if those properties are given then only that GPU will be used, if found. The remaining properties, index, brand, and name, work in an "AND" fashion such that, if specified, the result must match at least one GPU on the system for a test to be performed.
For example, if name is set to "Tesla K40c" and index is set to "0" NVVS will error if index 0 is not a Tesla K40c. By specifying both brand and index a user may limit a test to specific "Tesla" cards for example. In this version of NVVS, all matching GPUs must be homogeneous.
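For example, a gpuset that limits a run to Tesla-brand cards at indexes 0 and 1 could look like the following (a hedged sketch; the property keys follow the list above, and the suite name refers to the test suites described below):
gpus:
  - gpuset: first_two_teslas
    properties:
      brand: Tesla
      index: 0,1
    tests:
      name: quick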
The second identifier for a gpuset is tests. This parameter specifies either the suite of tests that a user wishes to run or the test itself.
At present the following suites are available:
- Quick -- meant as a pre-run sanity check to ensure that the GPUs are ready for a job. Currently runs the Deployment tests described in the next chapter.
- Medium -- meant as a quick, post-error check to make sure that nothing very obvious such as ECC enablement or double-bit errors have occurred on a GPU. Currently runs the Deployment, Memory/Hardware, and PCIe/Bandwidth tests. The Hardware tests are meant to be relatively short to find obvious issues.
- Long -- meant as a more extensive check to find potential power and/or performance problems within a cluster. Currently runs an extensive test that involves Deployment, Memory/Hardware, PCI/Bandwidth, Power, Stress, and Memory Bandwidth. The Hardware tests will run in a longer-term iterative mode that are meant to try and capture transient failures as well as obvious issues.
An individual test can also be specified. Currently the keywords are: Memory, Diagnostic, Targeted Stress, Targeted Power, PCIe, SM Stress, and Memory Bandwidth. Please see the "custom" section in the next subchapter to configure and tweak the parameters when this method is used.
The format of the NVVS configuration file is designed for extensibility. Each test suite above can be customized in a number of ways described in detail in the following chapter for each test. Individual tests belong to a specific class of functionality which, when wanting to customize specific parameters, must also be specified.
The classes and the respective tests they perform are as follows:
- Software
  - Deployment: Checks for various runtime libraries, persistence mode, permissions, environmental variables, and blacklisted drivers.
- Hardware
  - Diagnostic: Execute a series of hardware diagnostics meant to exercise a GPU or GPUs to their factory specified limits.
- Integration
  - PCIe: Test host to GPU, GPU to host, and P2P (if possible) bandwidth. P2P between GPUs occurs over NvLink (if possible) or PCIe.
- Stress
  - Targeted Stress: Sustain a specific targeted stress level for a given amount of time.
  - Targeted Power: Sustain a specific targeted power level for a given amount of time.
  - SM Stress: Sustain a workload on the Streaming Multiprocessors (SMs) of the GPU for a given amount of time.
  - Memory Bandwidth: Verify that a certain memory bandwidth can be achieved on the framebuffer of the GPU.
Some tests also have subtests that can be enabled by using the subtests keyword and then hierarchically adding the subtest parameters desired beneath. An example would be the PCIe Bandwidth test which may have a section that looks similar to this:
long:
  - integration:
      pcie:
        test_unpinned: false
        subtests:
          h2d_d2h_single_pinned:
            min_bandwidth: 20
            min_pci_width: 16
When only a specific test is given in the GPU set portion of the configuration file, both the suite and class of the test are custom. For example:
%YAML 1.2
---
globals:
  logfile: nvvs.log
custom:
  - custom:
      targeted stress:
        test_duration: 60
gpus:
  - gpuset: all_K40c
    properties:
      name: Tesla K40c
    tests:
      - name: targeted stress
The NVIDIA Validation Suite consists of a series of plugins that are each designed to accomplish a different goal.
The deployment plugin's purpose is to verify the compute environment is ready to run Cuda applications and is able to load the NVML library.
Preconditions
- LD_LIBRARY_PATH must include the path to the cuda libraries, which for version X.Y of Cuda is normally /usr/local/cuda-X.Y/lib64, which can be set by running export LD_LIBRARY_PATH=/usr/local/cuda-X.Y/lib64
- The Linux nouveau driver must not be running, and should be blacklisted since it will conflict with the NVIDIA driver.
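A minimal shell sketch for satisfying both preconditions; the CUDA version placeholder X.Y and the modprobe blacklist file location are assumptions that vary by system and distribution:
$ export LD_LIBRARY_PATH=/usr/local/cuda-X.Y/lib64
$ lsmod | grep nouveau
$ echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf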
Configuration Parameters
None at this time.
Stat Outputs
None at this time.
Failure
The plugin will fail if:
- The corresponding device nodes for the target GPU(s) are being blocked by the operating system (e.g. cgroups) or exist without r/w permissions for the current user.
- The NVML library libnvidia-ml.so cannot be loaded
- The Cuda runtime libraries cannot be loaded
- The nouveau driver is found to be loaded
- Any pages are pending retirement on the target GPU(s)
- Any other graphics processes are running on the target GPU(s) while the plugin runs
The HW Diagnostic Plugin is designed to identify HW failures on GPU silicon and board-level components, extending out to the PCIE and NVLINK interfaces. It is not intended to identify HW or system level issues beyond the NVIDIA-provided HW. Nor is it intended to identify SW level issues above the HW, e.g. in the NVIDIA driver stack. The plugin runs a series of tests that target GPU computational correctness, GDDR/HBM memory resiliency, GPU and SRAM high power operation, SM stress and NVLINK/PCIE correctness. The plugin can run with several combinations of tests corresponding to medium and long NVVS operational modes. This plugin will take about three minutes to execute.
The plugin produces a simple pass/fail output. A failing output means that a potential HW issue has been found. However, the NVVS HW Diagnostic Plugin is not by itself a justification for GPU RMA. Any failure in the plugin should be followed by execution of the full NVIDIA Field Diagnostic after the machine has been taken offline. Only a failure of the Field Diagnostic tool constitutes grounds for RMA. Since the NVVS HW Diagnostic Plugin is a functional subset of the Field Diagnostic a failure in the plugin is a strong indicator of a future Field Diagnostic failure.
Preconditions
- No other GPU processes can be running.
Configuration Parameters
- test_duration (Float, default 180.0, range 30.0 - 3600.0): How long the performance test should run, in seconds. It is recommended to set this to at least 30 seconds to make sure you actually get some stress from the test.
- use_doubles (Boolean, default False): If set to true, the test uses double precision in its calculations. By default it is false and the test uses single precision.
- temperature_max (Float, default 100.0, range 30.0 - 120.0): The maximum temperature in C that the card is allowed to reach during the test. Use nvidia-smi -q to see the normal temperature limits of your device.
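A hedged configuration sketch for adjusting these parameters within the long suite; the lowercase hardware/diagnostic keys are an assumption that follows the convention of the integration/pcie example shown earlier, and the values are illustrative only:
long:
  - hardware:
      diagnostic:
        test_duration: 300.0
        use_doubles: true
        temperature_max: 95.0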
Stat Outputs
- power_usage (GPU, time series Float): Per second power usage of each GPU in watts. Note that for multi-GPU boards, each GPU gets a fraction of the power budget of the board.
- graphics_clock (GPU, time series Float): Per second clock rate of each GPU in MHz
- memory_clock (GPU, time series Float): Per second clock rate of the GPU's memory in MHz
- nvml_events (GPU, time series Int64): Any events that were read with nvmlEventSetWait - including single or double bit errors or XID errors - during the test.
- power_violation (GPU, time series Float): Percentage of time this GPU was violating power constraints.
- gpu_temperature (GPU, time series Float): Per second temperature of the GPU in degrees C
- thermal_violation (GPU, time series Float): Percentage of time this GPU was violating thermal constraints.
- perf_gflops (GPU, time series Float): The per second reading of average gflops since the test began.
Failure
The plugin will fail if:
- The corresponding device nodes for the target GPU(s) are being blocked by the operating system (e.g. cgroups) or exist without r/w permissions for the current user.
- Other GPU processes are running
- A hardware issue has been detected. This is not an RMA actionable failure but rather an indication that more investigation is required.
- The temperature reaches unacceptable levels during the test.
- GPU double bit ECC errors occur or the configured number of SBE errors occurs.
- A critical XID occurs.
The GPU bandwidth plugin's purpose is to measure the bandwidth and latency to and from the GPUs and the host.
Preconditions
None
Sub Tests
The plugin consists of several self-tests that each measure a different aspect of bandwidth or latency. Each subtest has either a pinned/unpinned pair or a p2p enabled/p2p disabled pair of identical tests. Pinned/unpinned tests use either pinned or unpinned memory when copying data between the host and the GPUs.
This plugin will use NVLink to communicate between GPUs when possible. Otherwise, communication between GPUs will occur over PCIe.
Each sub test is represented with a tag that is used both for specifying configuration parameters for the sub test and for outputting stats for the sub test. P2p enabled/p2p disabled tests enable or disable GPUs on the same card talking to each other directly rather than through the PCIe bus.
- h2d_d2h_single_pinned (pinned): Device <-> Host bandwidth, one GPU at a time
- h2d_d2h_single_unpinned (unpinned): Device <-> Host bandwidth, one GPU at a time
- h2d_d2h_concurrent_pinned (pinned): Device <-> Host bandwidth, all GPUs concurrently
- h2d_d2h_concurrent_unpinned (unpinned): Device <-> Host bandwidth, all GPUs concurrently
- h2d_d2h_latency_pinned (pinned): Device <-> Host latency, one GPU at a time
- h2d_d2h_latency_unpinned (unpinned): Device <-> Host latency, one GPU at a time
- p2p_bw_p2p_enabled (P2P enabled): Device <-> Device bandwidth, one GPU pair at a time
- p2p_bw_p2p_disabled (P2P disabled): Device <-> Device bandwidth, one GPU pair at a time
- p2p_bw_concurrent_p2p_enabled (P2P enabled): Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1
- p2p_bw_concurrent_p2p_disabled (P2P disabled): Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1
- 1d_exch_bw_p2p_enabled (P2P enabled): Device <-> Device bandwidth, concurrently, with every GPU either sending to the GPU with the next higher index (l2r) or to the GPU with the next lower index (r2l)
- 1d_exch_bw_p2p_disabled (P2P disabled): Device <-> Device bandwidth, concurrently, with every GPU either sending to the GPU with the next higher index (l2r) or to the GPU with the next lower index (r2l)
- p2p_latency_p2p_enabled (P2P enabled): Device <-> Device latency, one GPU pair at a time
- p2p_latency_p2p_disabled (P2P disabled): Device <-> Device latency, one GPU pair at a time
Configuration Parameters - Global
- test_pinned (Bool, default True): Include subtests that test using pinned memory.
- test_unpinned (Bool, default True): Include subtests that test using unpinned memory.
- test_p2p_on (Bool, default True): Include subtests that require peer to peer (P2P) memory transfers between cards to occur.
- test_p2p_off (Bool, default True): Include subtests that do not require peer to peer (P2P) memory transfers between cards to occur.
- max_pcie_replays (Float, default 80.0, range 1.0 - 1000000.0): Maximum number of PCIe replays to allow per GPU for the duration of this plugin. This is based on an expected replay rate <8 per minute for PCIe Gen 3.0, assuming this plugin will run for less than a minute and allowing 10x as many replays before failure.
Configuration Parameters - Sub Test
- min_bandwidth (default Null, range 0.0 - 100.0; affects h2d_d2h_single_pinned, h2d_d2h_single_unpinned, h2d_d2h_concurrent_pinned, h2d_d2h_concurrent_unpinned): Minimum bandwidth in GB/s that must be reached for this sub-test to pass.
- max_latency (default 100,000.0, range 0.0 - 1,000,000.0; affects h2d_d2h_latency_pinned, h2d_d2h_latency_unpinned): Latency in microseconds that cannot be exceeded for this sub-test to pass.
- min_pci_generation (default 1.0, range 1.0 - 3.0; affects h2d_d2h_single_pinned, h2d_d2h_single_unpinned): Minimum allowed PCI generation that the GPU must be at or exceed for this sub-test to pass.
- min_pci_width (default 1.0, range 1.0 - 16.0; affects h2d_d2h_single_pinned, h2d_d2h_single_unpinned): Minimum allowed PCI link width that the GPU must be at or exceed for this sub-test to pass. For example, 16x = 16.0.
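As a hedged sketch, these sub-test parameters slot in under the subtests keyword in the same way as the earlier long-suite example; the threshold values below are purely illustrative:
long:
  - integration:
      pcie:
        subtests:
          h2d_d2h_latency_pinned:
            max_latency: 50000.0
          h2d_d2h_single_pinned:
            min_pci_generation: 3.0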
Stat Outputs - Global
- pcie_replay_count (GPU, Float): The per second reading of PCIe replays that have occurred since the start of the GPU Bandwidth plugin.
Stat Outputs - Sub Test
Stats for the GPU Bandwidth test are also output on a test by test basis, using the sub test name as the group name key. The following stats sections are organized by sub test.
h2d_d2h_single_pinned/h2d_d2h_single_unpinned
- N_h2d (Float): Average bandwidth from host to device for device N
- N_d2h (Float): Average bandwidth from device to host for device N
- N_bidir (Float): Average bandwidth from device to host and host to device at the same time for device N
h2d_d2h_concurrent_pinned/h2d_d2h_concurrent_unpinned
- N_h2d (Float): Average bandwidth from host to device for device N
- N_d2h (Float): Average bandwidth from device to host for device N
- N_bidir (Float): Average bandwidth from device to host and host to device at the same time for device N
- sum_bidir (Float): Sum of the average bandwidth from device to host and host to device for all devices.
- sum_h2d (Float): Sum of the average bandwidth from host to device for all devices.
- sum_d2h (Float): Sum of the average bandwidth from device to host for all devices.
h2d_d2h_latency_pinned/h2d_d2h_latency_unpinned
- N_h2d (Float): Average latency from host to device for device N
- N_d2h (Float): Average latency from device to host for device N
- N_bidir (Float): Average latency from device to host and host to device at the same time for device N
p2p_bw_p2p_enabled/p2p_bw_p2p_disabled
- N_M_onedir (Float): Average bandwidth from device N to device M, copying one direction at a time.
- N_M_bidir (Float): Average bandwidth from device N to device M, copying both directions at the same time.
p2p_bw_concurrent_p2p_enabled/p2p_bw_concurrent_p2p_disabled
- l2r_N_M (Float): Average bandwidth from device N to device M
- r2l_N_M (Float): Average bandwidth from device M to device N
- bidir_N_M (Float): Average bandwidth between device N and device M, copying concurrently in both directions
- r2l_sum (Float): Sum of average bandwidth for all right (M) to left (N) copies
- bidir_sum (Float): Sum of average bidirectional bandwidth for all right (M) to left (N) and left to right copies
1d_exch_bw_p2p_enabled/1d_exch_bw_p2p_disabled
- l2r_N (Float): Average bandwidth from device N to device N+1
- r2l_N (Float): Average bandwidth from device N to device N-1
- l2r_sum (Float): Sum of all l2r average bandwidth stats
- r2l_sum (Float): Sum of all r2l average bandwidth stats
p2p_latency_p2p_enabled/p2p_latency_p2p_disabled
- N_M (Float): Average latency from device N to device M
Failure
The plugin will fail if:
- The latency exceeds the configured threshold for relevant tests.
- The bandwidth cannot reach the configured minimum for relevant tests.
- The number of PCIe retransmits exceeds a user-provided threshold.
The purpose of the Memory Bandwidth plugin is to validate that the bandwidth of the framebuffer of the GPU is above a preconfigured threshold.
Preconditions
This plugin only runs on GV100 GPUs at this time.
Configuration Parameters
- minimum_bandwidth (Float, default differs per GPU, range 1.0 - 1000000.0): Minimum framebuffer bandwidth threshold, in MB/s, that must be achieved in order to pass this test.
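A hedged sketch of running this plugin as an individual test with an explicit threshold; the lowercase test key is assumed to follow the convention of the earlier targeted stress example, and the value is illustrative only since the appropriate minimum differs per GPU:
custom:
  - custom:
      memory bandwidth:
        minimum_bandwidth: 500000.0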
Stat Outputs
- power_usage (GPU, time series Float): Per second power usage of each GPU in watts. Note that for multi-GPU boards, each GPU gets a fraction of the power budget of the board.
- memory_clock (GPU, time series Float): Per second clock rate of the GPU's memory in MHz
- nvml_events (GPU, time series Int64): Any events that were read with nvmlEventSetWait during the test, along with the timestamp at which each was read.
Failure
The plugin will fail if:
- The minimum bandwidth specified in minimum_bandwidth cannot be achieved.
- GPU double bit ECC errors occur or the configured number of SBE errors occurs.
- A critical XID occurs.
The SM performance plugin's purpose is to bring the Streaming Multiprocessors (SMs) of the target GPU(s) to a target performance level in gigaflops by doing large matrix multiplications using cublas. Unlike the Targeted Stress plugin, the SM stress plugin does not copy the source arrays to the GPU before every matrix multiplication, so its performance is not capped by the bandwidth between host and device. The plugin calculates how many matrix operations per second are necessary to achieve the configured performance target and fails if it cannot achieve that target.
This plugin should be used to watch for thermal, power and related anomalies while the target GPU(s) are under realistic load conditions. By setting the appropriate parameters a user can ensure that all GPUs in a node or cluster reach desired performance levels. Further analysis of the generated stats can also show variations in the required power, clocks or temperatures to reach these targets, and thus highlight GPUs or nodes that are operating less efficiently.
Preconditions
None
Configuration Parameters
- test_duration (Float, default 90.0, range 30.0 - 3600.0): How long the performance test should run for in seconds. It is recommended to set this to at least 30 seconds for performance to stabilize.
- temperature_max (Float, default Null, range 30.0 - 120.0): The maximum temperature in C the card is allowed to reach during the test. Note that this check is disabled by default. Use nvidia-smi -q to see the normal temperature limits of your device.
- target_stress (Float, default Null, SKU dependent): The maximum relative performance each card will attempt to achieve.
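A hedged sketch for customizing this test as an individual run; the lowercase "sm stress" key matches the --parameters example earlier, and the target_stress value is illustrative only since it is SKU dependent:
custom:
  - custom:
      sm stress:
        test_duration: 300.0
        target_stress: 5000.0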
Stat Outputs
- power_usage (GPU, time series Float): Per second power usage of each GPU in watts. Note that for multi-GPU boards, each GPU gets a fraction of the power budget of the board.
- graphics_clock (GPU, time series Float): Per second clock rate of each GPU in MHz
- memory_clock (GPU, time series Float): Per second clock rate of the GPU's memory in MHz
- nvml_events (GPU, time series Int64): Any events that were read with nvmlEventSetWait - including single or double bit errors or XID errors - during the test.
- power_violation (GPU, time series Float): Percentage of time this GPU was violating power constraints.
- gpu_temperature (GPU, time series Float): Per second temperature of the GPU in degrees C
- perf_gflops (GPU, time series Float): The per second reading of average gflops since the test began.
- flops_per_op (GPU, Float): Flops (floating point operations) per operation queued to the GPU stream. One operation is one call to cublasSgemm or cublasDgemm
- bytes_copied_per_op (GPU, Float): How many bytes are copied to and from the GPU per operation
- num_cuda_streams (GPU, Float): How many CUDA streams were used per GPU to queue operations to the GPUs
- try_ops_per_sec (GPU, Float): Calculated number of ops/second necessary to achieve target gigaflops
Failure
The plugin will fail if:
- The GPU temperature exceeds a user-provided threshold.
- Thermal violation counters increase.
- The target performance level cannot be reached.
- GPU double bit ECC errors occur or the configured number of SBE errors occurs.
- A critical XID occurs.
The Targeted Stress plugin’s purpose is to bring the GPU to a target performance level in gigaflops by doing large matrix multiplications using cublas. The plugin calculates how many matrix operations per second are necessary to achieve the configured performance target and fails if it cannot achieve that target.
This plugin should be used to watch for thermal, power and related anomalies while the target GPU(s) are under realistic load conditions. By setting the appropriate parameters a user can ensure that all GPUs in a node or cluster reach desired performance levels. Further analysis of the generated stats can also show variations in the required power, clocks or temperatures to reach these targets, and thus highlight GPUs or nodes that are operating less efficiently.
Preconditions
None
Configuration Parameters
- test_duration (Float, default 120.0, range 30.0 - 3600.0): How long the Targeted Stress test should run for in seconds. It is recommended to set this to at least 30 seconds for performance to stabilize.
- temperature_max (Float, default Null, range 30.0 - 120.0): The maximum temperature in C the card is allowed to reach during the test. Note that this check is disabled by default. Use nvidia-smi -q to see the normal temperature limits of your device.
- target_stress (Float, default Null, SKU dependent): The maximum relative stress each card will attempt to achieve.
- max_pcie_replays (Float, default 160.0, range 1.0 - 1000000.0): Maximum number of PCIe replays to allow per GPU for the duration of this plugin. This is based on an expected replay rate <8 per minute for PCIe Gen 3.0, assuming this plugin will run for 2 minutes (configurable) and allowing 10x as many replays before failure.
Stat Outputs
- power_usage (GPU, time series Float): Per second power usage of each GPU in watts. Note that for multi-GPU boards, each GPU gets a fraction of the power budget of the board.
- graphics_clock (GPU, time series Float): Per second clock rate of each GPU in MHz
- memory_clock (GPU, time series Float): Per second clock rate of the GPU's memory in MHz
- nvml_events (GPU, time series Int64): Any events that were read with nvmlEventSetWait during the test, along with the timestamp at which each was read.
- power_violation (GPU, time series Float): Percentage of time this GPU was violating power constraints.
- gpu_temperature (GPU, time series Float): Per second temperature of the GPU in degrees C
- perf_gflops (GPU, time series Float): The per second reading of average gflops since the test began.
- flops_per_op (GPU, Float): Flops (floating point operations) per operation queued to the GPU stream. One operation is one call to cublasSgemm or cublasDgemm
- bytes_copied_per_op (GPU, Float): How many bytes are copied to and from the GPU per operation
- num_cuda_streams (GPU, Float): How many CUDA streams were used per GPU to queue operations to the GPUs
- try_ops_per_sec (GPU, Float): Calculated number of ops/second necessary to achieve target gigaflops
- pcie_replay_count (GPU, Float): The per second reading of PCIe replays that have occurred since the start of the Targeted Stress plugin.
Failure
The plugin will fail if:
- The GPU temperature exceeds a user-provided threshold.
- Temperature violation counters increase.
- The target stress level cannot be reached.
- GPU double bit ECC errors occur or the configured number of SBE errors occurs.
- The number of PCIe retransmits exceeds a user-provided threshold.
- A critical XID occurs.
The purpose of the power plugin is to bring the GPUs to a preconfigured power level in watts by gradually increasing the compute load on the GPUs until the desired power level is achieved. This verifies that the GPUs can sustain a power level for a reasonable amount of time without problems like thermal violations arising.
Preconditions
None
Configuration Parameters
- test_duration (Float, default 120.0, range 30.0 - 3600.0): How long the performance test should run for in seconds. It is recommended to set this to at least 60 seconds for performance to stabilize.
- temperature_max (Float, default Null, range 30.0 - 120.0): The maximum temperature in C the card is allowed to reach during the test. Note that this check is disabled by default. Use nvidia-smi -q to see the normal temperature limits of your device.
- target_power (Float, default and range differ per GPU; defaults to TDP - 1 watt): The power level, in watts, that the test should try to maintain. If this is set to greater than the enforced power limit of the GPU, the plugin will try to run the device at its power cap.
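A hedged sketch for customizing this test as an individual run; the lowercase "targeted power" key is assumed to follow the convention of the earlier targeted stress example, and target_power is omitted so the per-GPU default of TDP - 1 watt applies:
custom:
  - custom:
      targeted power:
        test_duration: 180.0
        temperature_max: 85.0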
Stat Outputs
- power_usage (GPU, time series Float): Per second power usage of each GPU in watts. Note that for multi-GPU boards, each GPU gets a fraction of the power budget of the board.
- graphics_clock (GPU, time series Float): Per second clock rate of each GPU in MHz
- memory_clock (GPU, time series Float): Per second clock rate of the GPU's memory in MHz
- nvml_events (GPU, time series Int64): Any events that were read with nvmlEventSetWait during the test, along with the timestamp at which each was read.
- power_violation (GPU, time series Float): Percentage of time this GPU was violating power constraints.
- gpu_temperature (GPU, time series Float): Per second temperature of the GPU in degrees C
Failure
The plugin will fail if:
- The GPU temperature exceeds a user-provided threshold.
- Temperature violation counters increase.
- The target performance level cannot be reached.
- GPU double bit ECC errors occur or the configured number of SBE errors occurs.
- A critical XID occurs.
The output of tests can be collected by setting the "logfile" global parameter which represents the prefix for the detailed outputs produced by each test. The default type of output is JSON but text and binary outputs are available as well. The latter two are meant more for parsing and direct reading by custom consumers respectively so this portion of the document will focus on the JSON output.
The JSON output format is keyed off of the "stats" keys given in each test overview from Chapter 3. These standard JSON files can be processed in any number of ways, but two example Python scripts have been provided in the default installation directory to aid in visualization. The first is a JSON to comma-separated value script (json2csv.py) which can be used to import key values into a graphing spreadsheet. Proper usage would be:
user@hostname
$ python json2csv.py -i stats_targeted_performance.json -o stats.csv -k gpu_temperature,power_usage
Also provided is an example Python script that uses the pygal library to generate readily viewable scalable vector graphics (SVG) charts (json2svg.py), able to be opened in any browser. Proper usage would be:
user@hostname
$ python json2svg.py -i stats_targeted_performance.json -o stats.svg -k gpu_temperature,power_usage
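For custom post-processing beyond the two provided scripts, the stats files are ordinary JSON and can be inspected directly. The short Python sketch below only assumes the file parses as JSON; it prints the top-level keys and the size of each value, which is usually enough to locate the per-GPU time series of interest. The file name matches the examples above:
#!/usr/bin/env python
# List the top-level keys in an NVVS JSON stats file and the size of each value.
import json
import sys

def main(path):
    with open(path) as f:
        data = json.load(f)
    # The stats documented for each plugin above appear as keys in this file;
    # only a dict or list top level is assumed here.
    items = data.items() if isinstance(data, dict) else enumerate(data)
    for key, value in items:
        size = len(value) if hasattr(value, "__len__") else 1
        print("%s: %s (%d entries)" % (key, type(value).__name__, size))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "stats_targeted_performance.json")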
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
© 2014-2020 NVIDIA Corporation. All rights reserved.