9. Profiling Features

The 3DS GPU includes profiling features that let you benchmark your application and fine-tune its performance on the hardware.

The following table lists the profiling features.

Table 9-1. List of Profiling Features
Feature Description
Busy Counter Compares the number of times each module outputs a busy signal within a set interval and determines which module output the most busy signals. This comparison is repeated a specified number of times, and the counter records how many times each module was determined to have output the most busy signals.
Shader Execution Clock Counter For each of the four shader processors, counts the number of program counter transitions and the number of clock cycles spent stalled.
Vertex Cache Input Vertex Counter Counts the number of vertices input into the post-vertex cache.
Input/Output Polygon Counter Counts the number of vertices and polygons input into, and the number of polygons output from, the triangle setup module.
Input Fragment Counter Counts the number of fragments input into the per-fragment operation module.
Memory Access Counter Counts the number of times that memory, such as VRAM or the vertex buffer, has been accessed.
Note:

Do not call the profiling feature functions while a command list is executing. Doing so does not cause an error, but it can cause the GPU to perform illegal operations.

9.1. Starting and Stopping the Profiling Features

Some of the profiling features must be explicitly started or stopped by calling the following functions from your application.

Code 9-1. Starting and Stopping the Profiling Features
void nngxStartProfiling(GLenum item);
void nngxStopProfiling(GLenum item); 

For the item parameter, specify one of the profiling feature values defined in the following table. Specifying any value that is not defined in this table causes a GL_ERROR_80A2_DMP error on calls to nngxStartProfiling(), and a GL_ERROR_80A3_DMP error on calls to nngxStopProfiling().

Table 9-2. Defined Values for Starting and Stopping the Profiling Features
Defined Value Profiling Feature
NN_GX_PROFILING_BUSY Busy Counter. (Make sure that you set the parameters to specify the counter’s measurement period.)
NN_GX_PROFILING_VERTEX_CACHE Vertex Cache Input Vertex Counter. (Running this can increase power consumption. Make sure that you stop this counter when not using it.)

9.2. Profiling Feature Parameters

Call the following function to set parameters for some of the profiling features.

Code 9-2. Setting Parameters for the Profiling Features
void nngxSetProfilingParameter(GLenum pname, GLuint param); 

The following table shows how the values you can specify for param differ depending on the defined value specified for pname.

Table 9-3. List of Profiling Feature Parameters
Defined Value Specified for pname Value Specified for param
NN_GX_PROFILING_BUSY_SAMPLING_TIME Specify the measurement period for each busy counter measurement (number of GPU clock cycles) as a nonzero 16-bit value.
NN_GX_PROFILING_BUSY_SAMPLING_TIME_MICRO_SECOND Specify the same measurement period as NN_GX_PROFILING_BUSY_SAMPLING_TIME, but in units of microseconds. The value converted to GPU clock cycles must fit into 16 bits, so specify a value in the range from 1 through 244.
NN_GX_PROFILING_BUSY_SAMPLING_TIME_NANO_SECOND Specify the same measurement period as NN_GX_PROFILING_BUSY_SAMPLING_TIME, but in units of nanoseconds. The value converted to GPU clock cycles must fit into 16 bits, so specify a value in the range from 1 through 244537.

NN_GX_PROFILING_BUSY_SAMPLING_COUNT Specify the number of busy counter measurements as a 16-bit value. If you specify 0, the busy counter keeps measuring until it is explicitly stopped, repeating measurement periods of the length specified by NN_GX_PROFILING_BUSY_SAMPLING_TIME. If you specify any other value, the busy counter stops when it is explicitly stopped or when (measurement period × measurement count) clock cycles have elapsed since it started, whichever comes first. If you specify 1 or greater and get the results after all measurements have completed, the sum of the busy counter results for all modules equals the measurement count.

Table 9-4. Errors Generated by the nngxSetProfilingParameter() Function
Error Cause
GL_ERROR_80A5_DMP An invalid value was specified for pname.
GL_ERROR_80A6_DMP An invalid value was specified for param.
Note:

Measured times can be converted to GPU clock cycles using a frequency of 268 MHz (268,000,000 Hz).
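
As a rough check of the ranges given in Table 9-3, the conversion from a time to GPU clock cycles can be sketched as follows. The helper names are hypothetical, and the sketch assumes that any fractional cycle is discarded; the rounding that the library actually applies is not stated here. Only the 16-bit upper bound is checked.

```c
#include <stdbool.h>
#include <stdint.h>

/* GPU clock frequency from the note above: 268 MHz. */
#define GPU_CLOCK_HZ 268000000ULL

/* Converts nanoseconds to GPU clock cycles. Discarding the fractional
 * cycle is an assumption, not documented library behavior. */
static uint64_t ns_to_gpu_cycles(uint64_t ns)
{
    return ns * GPU_CLOCK_HZ / 1000000000ULL;
}

/* Converts microseconds to GPU clock cycles (268 cycles per microsecond). */
static uint64_t us_to_gpu_cycles(uint64_t us)
{
    return us * GPU_CLOCK_HZ / 1000000ULL;
}

/* True when the cycle count fits into an unsigned 16-bit value. */
static bool fits_in_16_bits(uint64_t cycles)
{
    return cycles <= UINT16_MAX;
}
```

Under these assumptions, 244 microseconds and 244537 nanoseconds are the largest values whose converted cycle counts still fit into 16 bits, which matches the upper limits in the table.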

Note:

The parameter initial values are undefined, so you must set values for all relevant parameters.

9.3. Getting Profiling Results

Call the following function to get the results of profiling.

Code 9-3. Function for Getting Profiling Results
void nngxGetProfilingResult(GLenum item, GLuint* result); 

For the item parameter, specify the profiling feature for which to get results. Specifying an invalid value for item generates a GL_ERROR_80A4_DMP error.

The result parameter stores the results of the specified profiling feature. Note that the size of the buffer required to hold these results changes depending on the profiling feature specified.

For more information about the profiling features you can specify for item and the results stored in result, see the following sections.

Note:

All profiling features that do not require explicit starting and stopping run constantly. These profiling feature counters are reset when the hardware is booted. Consequently, when getting profiling results, use the difference between results obtained at the start of the measurement period and results obtained at the end of this period.

Profiling counters restart at zero when they overflow.
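
Because the constantly running counters are read as differences and restart at zero on overflow, the subtraction is easiest to do in unsigned 32-bit arithmetic. The following is a minimal sketch with hypothetical helper names. It uses uint32_t as a stand-in for GLuint and gives the correct difference only if the counter wrapped at most once between the two reads.

```c
#include <stddef.h>
#include <stdint.h>

/* Difference between two reads of a 32-bit counter. Unsigned subtraction
 * is performed modulo 2^32, so a single wraparound is handled. */
static uint32_t counter_delta(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Applies counter_delta() to every element of a result buffer. */
static void counter_delta_array(const uint32_t *start, const uint32_t *end,
                                uint32_t *delta, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        delta[i] = counter_delta(start[i], end[i]);
    }
}
```

For example, a counter read as 0xFFFFFFF0 at the start of a period and 0x00000010 at the end has advanced by 0x20.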

9.3.1. Busy Counter

Specify NN_GX_PROFILING_BUSY for item to get the results of the busy counter. For result, specify a GLuint array with NN_GX_PROFILING_RESULT_BUFSIZE_BUSY elements.

The busy counter is reset to zero when it is started by a call to the nngxStartProfiling() function. Until the counter is stopped, it compares the number of busy signals that each module outputs in each measurement period and counts how many times each module was the busiest. The counter is also reset to zero each time you get the measurement results. As a result, if you start profiling and then get the results before profiling has stopped, the results you get are those from the previous profiling run.

The results for each module are stored as 16-bit values. These results are stored for each module in the following order.

Table 9-5. Busy Counter Data Storage Order
Data Module
result[0] bits [31:16] Shader processor 0 (shared with the geometry processor)
result[0] bits [15:0] Command buffer and vertex array load module
result[1] bits [31:16] Rasterization module
result[1] bits [15:0] Triangle setup
result[2] bits [31:16] Fragment lighting
result[2] bits [15:0] Texture units
result[3] bits [31:16] Per-fragment operation module
result[3] bits [15:0] Texture combiners

The busy counter result is the number of times a module output the busy signal the most often, not the number of times the busy signal was output.

When the busy counter measurement is started, the number of busy signals each module outputs in each measurement period is compared, and the counter for the module that output the most busy signals is incremented. This measurement is repeated a specified number of times. If multiple modules are tied for the largest number of busy signals in a measurement period, the counter for the tied module at the earliest stage is incremented. When the busy signal count for all modules is zero, the counter for the command buffer and vertex array load module, which is the earliest stage, is incremented. For this reason, if the number of measurements is set to one or more and the results are obtained after measurements have completed, the sum of the values for all modules equals the total number of measurements.
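
The counting rule described above can be modeled in software as follows. This is only an illustrative sketch of the rule, not the hardware implementation. The name count_period is hypothetical, and the sketch assumes that the modules are indexed in pipeline order, with index 0 being the command buffer and vertex array load module.

```c
#include <stddef.h>
#include <stdint.h>

#define MODULE_COUNT 8

/* Models one measurement period. busy[i] is the number of busy signals that
 * module i output during the period; index 0 is the earliest stage
 * (an assumed ordering). Exactly one counter is incremented per period. */
static void count_period(const uint32_t busy[MODULE_COUNT],
                         uint16_t counters[MODULE_COUNT])
{
    size_t winner = 0;
    for (size_t i = 1; i < MODULE_COUNT; ++i) {
        /* The strict comparison keeps the earliest stage on a tie,
         * and leaves index 0 selected when every count is zero. */
        if (busy[i] > busy[winner]) {
            winner = i;
        }
    }
    counters[winner]++;
}
```

Because exactly one counter is incremented in every period, the sum of all counters equals the number of periods measured.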

You can assume that the modules with the largest values in the measurement results are acting as the bottleneck for the largest amounts of time. Focus your optimization efforts on these modules to fine-tune performance.
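
The packed layout in Table 9-5 can be unpacked into one value per module before looking for the largest one. The following is a minimal sketch. The enumerator and function names are hypothetical, and uint32_t stands in for GLuint.

```c
#include <stdint.h>

/* Hypothetical module identifiers, ordered by position in the result
 * buffer: low half of each word first, then the high half. */
enum BusyModule {
    BUSY_COMMAND_VERTEX_LOAD, /* result[0] bits [15:0]  */
    BUSY_SHADER0,             /* result[0] bits [31:16] */
    BUSY_TRIANGLE_SETUP,      /* result[1] bits [15:0]  */
    BUSY_RASTERIZATION,       /* result[1] bits [31:16] */
    BUSY_TEXTURE_UNITS,       /* result[2] bits [15:0]  */
    BUSY_FRAGMENT_LIGHTING,   /* result[2] bits [31:16] */
    BUSY_TEXTURE_COMBINERS,   /* result[3] bits [15:0]  */
    BUSY_PER_FRAGMENT_OP,     /* result[3] bits [31:16] */
    BUSY_MODULE_COUNT
};

/* Splits the four 32-bit result words into eight 16-bit counts. */
static void unpack_busy_result(const uint32_t result[4],
                               uint16_t counts[BUSY_MODULE_COUNT])
{
    for (int i = 0; i < 4; ++i) {
        counts[2 * i]     = (uint16_t)(result[i] & 0xFFFFu);
        counts[2 * i + 1] = (uint16_t)(result[i] >> 16);
    }
}

/* Returns the module with the largest count (first one wins on a tie). */
static int busiest_module(const uint16_t counts[BUSY_MODULE_COUNT])
{
    int best = 0;
    for (int i = 1; i < BUSY_MODULE_COUNT; ++i) {
        if (counts[i] > counts[best]) {
            best = i;
        }
    }
    return best;
}
```

The module returned by busiest_module() is the first candidate to examine when tuning performance.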

9.3.1.1. Resolving Bottlenecks in Triangle Setup/Rasterization Module

You can use the busy counter results to analyze whether triangle setup (TS) or the rasterization module (RAS) is becoming a bottleneck.

TS is affected by the polygon count and RAS is affected by both the polygon count and the number of generated pixels. For this reason, they have the following characteristics.

  • As long as the processed polygon count, the polygon coordinates, and the number of generated pixels do not change, it does not matter to the performance of TS and RAS whether the polygons are generated by loading vertex data or using a geometry shader.
  • As long as the processed polygon count, the polygon coordinates, and the number of generated pixels do not change, it does not matter to the performance of TS and RAS how many times the draw function is called and rendered.
  • Changes in the number of vertex attributes (such as the number of texture coordinates) do not affect the performance of TS and RAS.

If you suspect that TS or RAS is a bottleneck, check whether performance changes when you insert about 30 to 40 nop instructions into the vertex shader, or when you enable scissoring for the entire screen. If performance does not change in either case, TS or RAS is very likely a bottleneck. To reduce the processing load on TS, reduce the polygon count, for example by applying LOD to the rendered object or by clipping objects on the CPU. Reducing the load on TS also reduces the load on RAS.

9.3.2. Shader Execution Clock Counter

To get the results of the shader execution clock counter, specify NN_GX_PROFILING_VERTEX_SHADERn for item, where n is the number (0 through 3) of the shader processor to query. For result, specify a GLuint array with NN_GX_PROFILING_RESULT_BUFSIZE_VERTEX_SHADERn elements, using the same value of n. Shader processor 0 is shared with the geometry processor.

The shader execution clock counter is reset to zero when the hardware is booted, and it runs constantly.

The results are stored as 32-bit values. The following table shows the storage order and the corresponding information.

Table 9-6. Shader Execution Clock Counter Data Storage Order
Data Description
result[0] Program counter transition count (same as the number of executed shader assembler commands)
result[1] Clock count for stalls due to shader assembler command dependencies (for NN_GX_PROFILING_VERTEX_SHADER0 only, this also includes cases where the geometry shader is enabled and the later modules are busy, delaying the issue of a vout instruction)
result[2] Clock count for stalls due to address register updates (mova commands issued)
result[3] Clock count for stalls due to status register updates (cmp commands issued)
result[4] Clock count for stalls due to program pre-fetch misses (such as when the program counter makes a nonsequential transition due to a branching command)
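
Because this counter runs constantly, the values in Table 9-6 are normally used as differences between two reads. The following sketch gives the five values names and adds up the stall clocks for one measurement period. The structure and function names are hypothetical, and uint32_t stands in for GLuint.

```c
#include <stdint.h>

/* Named view of the five result values for one shader processor. */
typedef struct {
    uint32_t instructions;      /* result[0]: program counter transitions */
    uint32_t stall_dependency;  /* result[1]: command dependency stalls   */
    uint32_t stall_address_reg; /* result[2]: address register updates    */
    uint32_t stall_status_reg;  /* result[3]: status register updates     */
    uint32_t stall_prefetch;    /* result[4]: program pre-fetch misses    */
} ShaderClockStats;

/* Builds the per-period statistics from reads taken at the start and end
 * of the period. Unsigned subtraction tolerates one counter wraparound. */
static ShaderClockStats shader_stats_delta(const uint32_t start[5],
                                           const uint32_t end[5])
{
    ShaderClockStats s;
    s.instructions      = end[0] - start[0];
    s.stall_dependency  = end[1] - start[1];
    s.stall_address_reg = end[2] - start[2];
    s.stall_status_reg  = end[3] - start[3];
    s.stall_prefetch    = end[4] - start[4];
    return s;
}

/* Sum of all stall clock counts for the period. */
static uint64_t total_stall_clocks(const ShaderClockStats *s)
{
    return (uint64_t)s->stall_dependency + s->stall_address_reg
         + s->stall_status_reg + s->stall_prefetch;
}
```

Comparing the individual stall fields shows which kind of stall dominates for a given shader program.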

9.3.3. Vertex Cache Input Vertex Counter

Specify NN_GX_PROFILING_VERTEX_CACHE for item to get the results of the vertex cache input vertex counter. For result, specify a GLuint array with NN_GX_PROFILING_RESULT_BUFSIZE_BUSY elements.

The vertex cache input vertex counter is reset to zero when rendering that uses the vertex buffer begins (when a 1 is written to register 0x022F). Consequently, you can only get the measurement results for the last executed render. Compare this count against the total number of vertex indices used in rendering to estimate the efficiency of the post-vertex cache. Note, however, that even when rendering the same vertex data, the count may be slightly high or low depending on when vertex indices are loaded and on whether later modules are busy. In addition, you can get profiling results as if the counter were running even without calling the nngxStartProfiling() function to start it, but such results are likely to be incorrect. To get correct results, make sure that you call the nngxStartProfiling() function to start measurements.

The results are stored as 32-bit values. The following table shows the storage order and the corresponding information.

Table 9-7. Vertex Cache Input Vertex Counter Data Storage Order
Data Description
result[0] Count of vertices input into the post-vertex cache
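
One way to make the comparison with the index count concrete is to compute the ratio of the two values. The helper below is a hypothetical sketch. Its interpretation assumes that a vertex is input into the post-vertex cache only when it could not be reused, so that a lower ratio suggests better cache efficiency.

```c
#include <stdint.h>

/* Ratio of vertices input into the post-vertex cache (result[0]) to the
 * number of vertex indices used by the render. Returns 0.0 when no
 * indices were used, to avoid dividing by zero. */
static double vertex_cache_input_ratio(uint32_t cache_input_vertices,
                                       uint32_t index_count)
{
    if (index_count == 0) {
        return 0.0;
    }
    return (double)cache_input_vertices / (double)index_count;
}
```

Given the small variations in the count noted above, treat the ratio as an estimate rather than an exact figure.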

9.3.4. Input/Output Polygon Counter

Specify NN_GX_PROFILING_POLYGON for item to get the results of the input/output polygon counter. For result, specify a GLuint array with NN_GX_PROFILING_RESULT_BUFSIZE_BUSY elements.

The input/output polygon counter is reset to zero when the hardware is booted, and it runs constantly.

The results are stored as 32-bit values. The following table shows the storage order and the corresponding information.

Table 9-8. Input/Output Polygon Counter Data Storage Order
Data Description
result[0] Number of vertices input into the triangle setup module
result[1] Number of polygons input into the triangle setup module
result[2] Number of polygons output from the triangle setup module

The number of output polygons is the value of the input polygon count minus the number of polygons stripped out by clipping and culling. Any polygon that intersects the clipping volume is output as multiple polygons, but it is only counted as one polygon.
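
The relationship between the input and output counts can be sketched as follows, using differences between two reads because the counter runs constantly. The structure and function names are hypothetical, and uint32_t stands in for GLuint.

```c
#include <stdint.h>

/* Named view of the three result values for one measurement period. */
typedef struct {
    uint32_t vertices_in;  /* result[0]: vertices input into triangle setup    */
    uint32_t polygons_in;  /* result[1]: polygons input into triangle setup    */
    uint32_t polygons_out; /* result[2]: polygons output from triangle setup   */
} PolygonStats;

/* Builds the per-period statistics from reads taken at the start and end
 * of the period. Unsigned subtraction tolerates one counter wraparound. */
static PolygonStats polygon_stats_delta(const uint32_t start[3],
                                        const uint32_t end[3])
{
    PolygonStats s;
    s.vertices_in  = end[0] - start[0];
    s.polygons_in  = end[1] - start[1];
    s.polygons_out = end[2] - start[2];
    return s;
}

/* Polygons removed by clipping and culling during the period. */
static uint32_t polygons_removed(const PolygonStats *s)
{
    return s->polygons_in - s->polygons_out;
}
```

A large removed count relative to the input count suggests that many polygons are being submitted only to be discarded in triangle setup.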

9.3.5. Input Fragment Counter

Specify NN_GX_PROFILING_FRAGMENT for item to get the results of the input fragment counter. For result, specify a GLuint array with NN_GX_PROFILING_RESULT_BUFSIZE_FRAGMENT elements.

The input fragment counter is reset to zero when the hardware is booted, and runs constantly.

The results are stored as 32-bit values. The following table shows the storage order and the corresponding information.

Table 9-9. Input Fragment Counter Data Storage Order
Data Description
result[0] Number of fragments input into the per-fragment operation module

The number of fragments does not include fragments discarded by clipping, or by scissor or early depth tests. Fragments are counted before the alpha, stencil, and depth tests, so these test results have no effect on the count.
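
One possible use of this count is a rough estimate of how many fragments reach the per-fragment operation module for each pixel of the render target. The helper below is a hypothetical sketch. It assumes that the fragment count is the difference between two reads that bracket a single full-target render, and that the target dimensions are supplied by the caller.

```c
#include <stdint.h>

/* Average number of fragments per render-target pixel. Returns 0.0 when
 * the target has no pixels, to avoid dividing by zero. */
static double fragments_per_pixel(uint32_t fragment_count,
                                  uint32_t width, uint32_t height)
{
    uint64_t pixels = (uint64_t)width * height;
    if (pixels == 0) {
        return 0.0;
    }
    return (double)fragment_count / (double)pixels;
}
```

For example, 192,000 fragments on a 400 × 240 target average two fragments per pixel.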

9.3.6. Memory Access Counter

Specify NN_GX_PROFILING_MEMORY_ACCESS for item to get the results of the memory access counter. For result, specify a GLuint array with NN_GX_PROFILING_RESULT_BUFSIZE_MEMORY_ACCESS elements.

The memory access counter is reset to zero when the hardware is booted, and it runs constantly.

The results are stored as 32-bit values. The following table shows the storage order and the corresponding information.

Table 9-10. Memory Access Counter Data Storage Order
Data Description
result[0] GPU reads of VRAM (A channel)
result[1] GPU writes to VRAM (A channel)
result[2] GPU reads of VRAM (B channel)
result[3] GPU writes to VRAM (B channel)
result[4] Reads of command buffers, vertex arrays, and index arrays by modules loading command buffers and vertex arrays
result[5] Texture unit reads of texture memory
result[6] Per-fragment operation module reads of depth and stencil buffers
result[7] Per-fragment operation module writes to depth and stencil buffers
result[8] Per-fragment operation module reads of color buffer
result[9] Per-fragment operation module writes to color buffer
result[10] Upper LCD controller reads of the display buffer
result[11] Lower LCD controller reads of the display buffer
result[12] Post-transfer module reads (transfers by functions such as nngxTransferRenderImage and glCopyTexImage2D)
result[13] Post-transfer module writes (transfers by functions such as nngxTransferRenderImage and glCopyTexImage2D)
result[14] Memory fill module channel 0 buffer writes (buffer clears using functions like glClear)
result[15] Memory fill module channel 1 buffer writes (buffer clears using functions like glClear)
result[16] CPU reads of VRAM (using functions like glReadPixels)
result[17] CPU (DMA transfer) writes to VRAM (DMA transfers via functions like nngxAddVramDmaCommand)
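
Related entries in this table can be grouped by summing their differences over a measurement period. The following is a minimal sketch with a hypothetical helper name, using uint32_t as a stand-in for GLuint. The index range is inclusive and follows the storage order above. For example, indices 6 through 9 cover all per-fragment operation module accesses.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of values in the memory access result, per the table above. */
#define MEMORY_ACCESS_RESULT_COUNT 18

/* Sums the per-period access counts for result[first] through
 * result[last], inclusive. Each difference is taken in 32-bit unsigned
 * arithmetic so that one counter wraparound is tolerated. */
static uint64_t sum_access_deltas(const uint32_t *start, const uint32_t *end,
                                  size_t first, size_t last)
{
    uint64_t total = 0;
    for (size_t i = first; i <= last && i < MEMORY_ACCESS_RESULT_COUNT; ++i) {
        total += (uint32_t)(end[i] - start[i]);
    }
    return total;
}
```

Passing the same index for first and last returns the difference for a single entry.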
