8. Appendix

8.1. Profiling CPU and GPU Operations

If the GPU runs while the CPU is processing, the CPU receives interrupts from the GPU and competes with it for access to main memory. These delays make it impossible to measure execution time accurately.

To accurately measure the execution time of the CPU and GPU, you must wait for an operation on the CPU to finish completely before running an operation on the GPU.

The following procedure provides a specific example of how to do this.

  1. Call nngxClearCmdlist to clear the command list.
  2. Call nngxStopCmdlist to stop execution of the GPU.
  3. Call nngxBindCmdlist to bind the command list.
  4. Start measuring the CPU time.
  5. Issue commands from the CPU.
  6. Stop measuring the CPU time.
  7. Call nngxSplitCmdlist to issue a command request (render command request).
  8. Start measuring the GPU time.
  9. Call nngxRunCmdlist to execute the command list on the GPU.
  10. Call nngxWaitCmdlistDone to wait for the GPU to finish its operations.
  11. Stop measuring the GPU time.

 

Related Functions:

nngxClearCmdlist, nngxStopCmdlist, nngxBindCmdlist, nngxRunCmdlist, nngxWaitCmdlistDone

8.2. Reducing the Processing Load to Display Model Data (Important)

The processing load to display a model is split between the CPU, vertex shader (GPU), and fragment shader (GPU). To reduce the processing load, you must determine which process is the bottleneck when you create model data. Also, if you place texture data and vertex data in main memory, note that access conflicts may cause the CPU performance to decline.

The following figure shows a procedure for reducing the processing load to display model data.

Figure 8-1. Reducing the Processing Load to Display Model Data

Is processing more CPU-intensive or more GPU-intensive?

  • CPU-intensive:
    • Combine materials (reduce setting commands).
    • Reduce meshes.
    • Reduce the number of bones (joints).
    • If still CPU-intensive, you must remove lights or take other measures to reduce CPU processing.
  • GPU-intensive: does GPU processing decrease significantly when the model is no longer visible from the camera?
    • No: reduce the vertex count. If that is not enough, you must reduce the number of vertex lights or other processing within the vertex shader.
    • Yes: check the following.
      • Enable back-face culling.
      • Reduce the number of fragment lights.
      • Reduce the number of textures.
      • Decrease the texture resolution.
      • Use mipmaps.
      • Change the texture format (color: ETC1; color/alpha: ETC1A4; normals: HILO8).
      • Change the layer configuration to one that is cheaper to process.

8.3. Hardware Configuration

This section describes the configuration of the main hardware related to graphics.

When an application runs as a standard application on SNAKE, the hardware configuration is identical to CTR.

Figure 8-2. Hardware Configuration (Standard Application)

[Figure: the LCD (upper screen, 800 x 240 px; lower screen, 320 x 240 px) is driven by the GPU, which contains the P3D, PPF, PSC, and PDC modules, VRAM A (3 MB), and VRAM B (3 MB) on the GPU internal bus. The main bus connects the GPU, the CPU (ARM11, 268 MHz, 2 cores), main memory (64 MB), DMA, and other modules.]

Hardware Configuration for Extended Applications

When an application runs as an extended application on SNAKE, the CPU performance and main memory change as illustrated below.

Figure 8-3. Hardware Configuration (Extended Application)

8.3.1. CPU

The CPU generates graphics commands and conveys instructions to the GPU. One of the CPU cores is used exclusively by applications as the application core, while the remaining core is used as the system core. The CPU is connected to the GPU, main memory, and other modules via the main bus.

When an application runs as an extended application on SNAKE, the CPU clock rate increases to 804 MHz and the 2 MB L2 cache shared by all CPU cores becomes available.

8.3.2. Main Memory

Many kinds of data can be placed in main memory. A maximum of 64 MB of main memory can be used from the application. This main memory is connected to the main bus.

When an application runs as an extended application on SNAKE, up to 124 MB of memory is available for use by the application.

8.3.3. DMA

This is used by command request DMA transfer commands. Its uses include the transfer of texture and vertex data from main memory to VRAM.

8.3.4. Main Bus

This bus connects the CPU, the GPU, main memory and other devices. Many kinds of data are exchanged primarily via this bus. To adjust the priority of modules that use the main bus, use the nngxSetMemAccessPrioMode function.

8.3.5. GPU

The GPU comprises the P3D, PPF, PSC, PDC, VRAM A, and VRAM B modules, and the internal bus that connects those modules.

Table 8-1. The GPU Modules

Module

Description

P3D

This module references the 3D commands accumulated in the command buffer and performs the actual rendering.

PPF

This module performs block-to-linear conversion based on command request post-transfer commands and transfers data.

PSC

This module performs memory fill commands (clearing) for command requests.

PDC

This module transfers the content of the display buffer to the LCD.

VRAM A/VRAM B

Memory where vertices and textures are placed. VRAM A and VRAM B can operate separately.

GPU internal bus

The bus that connects the modules in the GPU. It is connected to the main bus.

Note:

The GPU internal bus has around twice the bandwidth of the main bus. For data that is frequently accessed by the GPU, you can boost the access speed by placing the data in VRAM, which also reduces the load on the main bus.

When data is exchanged between the GPU internal bus and the main bus, it is the main bus that is the rate-limiting factor.

8.3.6. LCD

This refers to the two 3DS LCD screens. They are connected to the PDC module within the GPU. When images are being displayed, the PDC gets one line of data from the display buffer for each scan line on the LCD. When the display buffer is located in main memory, the fetching of data places a periodic load on the main bus.

8.4. GPU Profiling Feature

You can use the GPU profiling feature to measure processing in each hardware module in the GPU. The following table lists what information can be obtained using this feature.

Table 8-2. Information Obtainable With the Profiling Feature

Information

Description

Busy clock

You can get the number of busy clock cycles that occur as a result of the vertex shader and fragment lighting in the various GPU modules.

Shader execution clock

You can get the number of execution clock cycles and stall clock cycles for each vertex shader processor.

Number of vertices entered in vertex cache

You can get the number of vertices entered into the post-vertex cache. By taking the difference from the number of vertices actually used for rendering, you can determine how many vertices were effectively supplied by the post-vertex cache.

Number of input/output polygons

You can get the number of polygons input for triangle setup and the number of polygons output after clipping.

Number of input fragments

You can get the number of fragments input to the per-fragment operation module.

Number of accesses to memory

You can get the number of times memory is accessed by each GPU hardware module.

 

The busy-clock value obtained for each hardware module includes not only the cycles spent actually processing data, but also the time the module spends waiting, unable to output data because later modules are busy. It is therefore effective to optimize not only the modules with particularly high busy-clock values, but also the final-stage module among those with high values.

Note:

For more information on the profiling feature, see the CTR Programming Manual: Advanced Graphics.

8.5. Notes for Using the DMPGL

The following sections provide notes about performance when using the DMPGL.

8.5.1. Maintaining Internal State Consistency

The internal state refers to local data saved by the DMPGL driver; it can be thought of as a mirror of the hardware settings. Because DMPGL function calls update the internal state while commands configure the hardware settings, the driver's internal state may become inconsistent with the hardware settings when commands are issued directly or when command caches are used.

If there is a discrepancy between the internal state and the hardware settings, you can restore consistency by forcing the hardware settings to be validated (that is, by issuing complete command packets). This validation occurs after nngxUpdateState(NN_GX_STATE_ALL) is run.

Commands are then issued only for the hardware settings that were changed by the command cache or by directly issued commands, rather than for all settings, which trims the command-issuing process. However, the application must carefully track which states correspond to, and which states depend on, the commands that have been used.

When nngxSetCommandGenerationMode(NN_GX_CMDGEN_MODE_UNCONDITIONAL) is called, commands are issued regardless of the comparison results for the internal state. Only the following settings are affected by this mode.

  • Uniform settings for the reserved fragment shader.
  • Integer uniform settings for the vertex shader.
  • LUT data settings.
  • DMPGL functions associated with NN_GX_STATE_OTHERS.

 

Related Functions:

nngxUpdateState, nngxValidateState, glDrawElements, glDrawArrays, nngxSetCommandGenerationMode

8.5.2. Removing the Use of glGetUniformLocation (Important)

One way to configure a shader uniform using DMPGL involves using the glGetUniformLocation function to get its location. Because it performs processor-intensive operations like string comparison, heavy use of glGetUniformLocation is not recommended.

To get location values for the reserved fragment shader, use the constants defined for that purpose. The location of each uniform in a program object is guaranteed not to change until the program object is either destroyed by the glDeleteProgram function or relinked by the glLinkProgram function.

Note:

Macros for the Location values are defined in $CTR_SDK/include/nn/gx/CTR/gx_UniformLocationForFragmentShader.h.

 

Related Functions:

glGetUniformLocation, glDeleteProgram, glLinkProgram

8.5.3. Cost of Switching Programs With glUseProgram

When glUseProgram is called, a shader binary must be loaded during validation if the new program object and the old program object link to different shader binaries.

It is not recommended to render by frequently switching program objects that link to separate shader binaries. As an effective alternative, consider using a conditional branch instruction to allow your vertex shader binaries to be shared by multiple program objects, or adjust the render order to minimize the number of calls to glUseProgram.

 

Related Functions:

glUseProgram

8.5.4. Cost of Calling glUseProgram(0)

Calling glUseProgram(0) causes all validation flags to be set the next time glUseProgram is called, making shared shader binaries ineffective. DMPGL is designed to monitor differential updates, so there is no need to call glUseProgram(0) explicitly.

 

Related Functions:

glUseProgram

8.5.5. Notes About the transpose Parameter of glUniformMatrix

Matrices are treated as column-major in OpenGL ES. The transpose parameter of the glUniformMatrix functions, which indicates whether a matrix is transposed before it is loaded, can only be set to GL_FALSE. In contrast, DMPGL treats matrices as row-major within the graphics driver, and allows GL_TRUE to be specified as the transpose parameter of the glUniformMatrix functions.

Due to internal differences between DMPGL and OpenGL ES, matrices are implicitly transposed if GL_FALSE is specified in the transpose parameter of the glUniformMatrix functions.

Matrices generated using the MATH library of the CTR-SDK, on the other hand, are row-major just like DMPGL drivers. By specifying GL_TRUE in the transpose parameter of the glUniformMatrix functions, you can set matrices without transposing them.

 

Related Functions:

glUniformMatrix

8.5.6. Using Textures in the Native PICA Format

The current CTR-SDK specifications support textures in the standard OpenGL format. Using the standard format requires that the textures be converted to the native format when glTexImage2D is called (this conversion is done automatically). The conversion can be skipped by keeping textures in the native format instead.

 

Related Functions:

glTexImage2D

8.5.7. Setting Uniforms for Vertex Shaders

Overwriting the uniforms in a vertex shader is more efficient if the registers being overwritten are contiguous.

  • Case 1: Overwrite all registers ([c0...c15]).
  • Case 2: Overwrite non-contiguous blocks of registers ([c0...c3], [c8...c11], [c16...c19], [c24...c27]).

In both cases, the number of registers being overwritten is the same, but Case 1 is more efficient because it allows transfer commands to be consolidated.

 

Related Functions:

glUniform

8.5.8. Updating Buffers

The glBufferData and glTexImage2D functions, which rewrite vertex and texture data, access data in main memory. For these changes to reach the GPU, the CPU must apply them, and because this carries relatively large overhead, the overall processor load increases in proportion to the number of calls to such functions.

We recommend that you call these functions in advance when the application is being initialized, load vertex buffer and texture data early, and avoid loading data during per-frame operations whenever possible.

The same type of overhead is involved in using the glBufferSubData function to partially update vertex data. To reduce this overhead, you might consider gathering all required data for a partial update into a single chunk and processing it all with a single call to glBufferSubData.

 

Related Functions:

glBufferData, glTexImage2D, glCompressedTexImage2D, glTexImage1D, glBufferSubData, glTexSubImage1D

8.5.9. Validation

With DMPGL, configuration changes made through GL API calls are applied to the hardware by an operation called validation, which writes the actual commands to the command buffer. Validation occurs when the following functions are called.

  • glDrawArrays
  • glDrawElements
  • nngxValidateState

Each setting is divided into several categories, and a state update flag is set for each category that is updated. During validation, these flags determine which categories to update; updates are applied one category at a time. The processing load of glDrawArrays and glDrawElements increases with the number of categories to update.

If you use glUseProgram to switch programs, state update flags are only set for categories that differ from the previous program. Details about the functions for each category are shown below. If functions like glUniform and glEnable are used to change the parameters of a particular category, the update flag of that category is set.

Table 8-3. Categories Updated by Validation

Category

Functions Used

Framebuffers

  • glBindFramebuffer
  • glBindRenderbuffer
  • glDeleteFramebuffers
  • glDeleteRenderbuffers
  • glFramebufferRenderbuffer
  • glFramebufferTexture2D
  • glRenderbufferStorage
  • glReadPixels
  • glClear

Vertex buffers

  • glBindBuffer
  • glBufferData
  • glBufferSubData
  • glDeleteBuffers

Triangles

  • glEnable
  • glDisable
  • glUseProgram
  • glDepthRangef
  • glPolygonOffset

Lighting LUTs

  • glUseProgram
  • glUniform*
    dmp_LightEnv.lutEnabledSP
    dmp_LightEnv.lutEnabledD0
    dmp_LightEnv.lutEnabledD1
    dmp_LightEnv.fresnelSelector
    dmp_LightEnv.lutEnabledRefl
    dmp_FragmentMaterial.sampler{D0,D1,SP,FR,RB,RG,RR}
    dmp_FragmentLighting.enabled
  • glUniformsDMP
  • glRestoreProgramsDMP

Fog LUTs

  • glUseProgram
  • glUniform*
    dmp_Fog.mode (GL_FOG or GL_GAS_DMP)
    dmp_Fog.sampler
  • glUniformsDMP
  • glRestoreProgramsDMP

Procedural texture LUTs

  • glUseProgram
  • glUniform*
    dmp_Texture[3].ptSampler{Rgb,Alpha,Noise,R,G,B,A}
    dmp_Texture[3].ptNoiseEnable
    dmp_Texture[3].ptAlphaSeparate
    dmp_Texture[3].samplerType
  • glUniformsDMP
  • glRestoreProgramsDMP

Vertex arrays

  • glBindBuffer
  • glEnableVertexAttribArray
  • glDisableVertexAttribArray
  • glVertexAttribPointer

Current vertices

  • glBindBuffer

Framebuffer access

  • glEnable
  • glDisable
  • glDepthFunc
  • glEarlyDepthFuncDMP
  • glColorMask
  • glDepthMask
  • glStencilMask
  • glUseProgram
  • glUniform*
    dmp_FragOperation.mode
  • glUniformsDMP
  • glRestoreProgramsDMP

Scissor/viewport

  • glEnable
  • glDisable
  • glScissor
  • glViewport

Texture 0

  • glUseProgram
  • glUniform*
    dmp_Texture[0].samplerType
  • glUniformsDMP
  • glRestoreProgramsDMP
  • glBindTexture
  • glDeleteTextures
  • glCompressedTexImage2D
  • glCopyTexImage2D
  • glCopyTexSubImage2D
  • glTexImage2D
  • glTexParameteriv
  • glTexParameterfv

Texture 1

  • Same as Texture 0
  • glUniform*
    dmp_Texture[1].samplerType

Texture 2

  • Same as Texture 0
  • glUniform*
    dmp_Texture[2].samplerType

Texture 3

  • glUseProgram
  • glRestoreProgramsDMP
  • glBindTexture
  • glDeleteTextures
  • glCompressedTexImage2D
  • glCopyTexImage2D
  • glCopyTexSubImage2D
  • glTexImage2D
  • glTexParameteriv
  • glTexParameterfv

Texture LUTs

  • glBindTexture
  • glDeleteTextures
  • glTexImage1D
  • glTexSubImage1D

Program

  • glUseProgram

Shader uniforms

  • glUseProgram
  • glUniform*
    Uniforms for vertex shaders and geometry shaders
  • glUniformsDMP

Rasterization

  • glDrawArrays
  • glDrawElements
  • glUniform*
    Uniforms for reserved fragment shaders
  • glUniformsDMP
  • glRestoreProgramsDMP

Shader binaries

  • glUseProgram

Vertex shader binaries

  • glUseProgram

Geometry shader binaries

  • glUseProgram

Geometry shader attachments

  • glUseProgram

Geometry shader detachments

  • glUseProgram

Gas LUTs

  • glUseProgram
  • glUniform*
    dmp_Fog.mode (GL_GAS_DMP)
    dmp_Gas.sampler{TR, TG, TB}
  • glUniformsDMP

8.5.10. Functions That Issue Command Requests

The GPU begins executing commands when nngxRunCmdlist is called. Rendering instructions accumulated in the command list are processed one command request at a time.

The following functions issue command requests to a command list.

Table 8-4. Functions That Issue Command Requests

Function

Condition

Command Request Added

nngxSplitDrawCmdlist

Always

Render command request

nngxTransferRenderImage

Always

Render command request

Post transfer command request

glClear

Always

Render command request

Memory fill command request

glBufferData

When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument

DMA transfer command request

glBufferSubData

When glBufferData meets the condition above

DMA transfer command request

glTexImage2D

When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument

DMA transfer command request

glCopyTexImage2D

Always

Render command request

Copy texture command request

glCopyTexSubImage2D

Always

Render command request

Copy texture command request

glRestoreTextureCollectionsDMP

When conditions are met by the functions that generate commands to be restored (glBufferData, glBufferSubData, glTexImage2D).

DMA transfer command request

glRestoreVertexStateCollectionsDMP

glDrawArrays

When the reserved uniform dmp_Gas.autoAcc is GL_TRUE and the function is called for the first time after the reserved uniform dmp_FragOperation.mode has its value changed from GL_FRAGOP_MODE_GAS_ACC_DMP.

Render command request

glDrawElements

Note 1: Not added when nngxSplitDrawCmdlist is called in advance.

Note 2: One added per color and depth (stencil) buffer. Two are added when both are specified.

8.6. Functions That Cause System Blocking

If you measure performance using the CPU profiler, a large proportion of the measurements may be attributed to system blocking (SYSTEM_WAITING). System blocking indicates that profiling is being performed on multiple threads and none of those threads is active: a system call from some thread is blocking, but the profiler cannot determine which thread it is.

You can use the CPU profiler’s “select thread” and “reverse call tree” features to trace the process calculated as system blocking.

Time is attributed to system blocking when the profiler cannot identify the last active thread or function. You can eliminate this by using the select-thread feature to limit which threads are profiled. If you then find a function that was previously attributed to system blocking, use the reverse call tree to find where in the application that function is called, and trace it from there.

Most CTR-SDK functions have the latent potential to cause the system to block. If possible, Nintendo recommends using the CPU profiler to check performance at crucial points in your implementation.

8.7. FS Library Performance Details

This section provides reference data on access speeds for each archive type when the CTR-SDK FS library is used. Specifically, it presents file system performance measurements for ROM archives, save data (backup region), and extended save data (SD Card), using the FS library from the CTR-SDK 3.x series or later. For ROM archives and save data, the measurements presume card applications (Nintendo 3DS Game Cards) or downloadable applications (SD Card).

Although the development card speed for card-based applications' ROM archives can be set to either "Fast" or "Slow", almost no retail devices perform at the "Slow" level. ROM archive performance does, however, degrade with repeated access: when reading a 512 KB file, for example, "Slow"-like performance at first appears only momentarily (about one or two internal command cycles), but as conditions worsen it gradually becomes more common.

Because performance of 3DS Game Cards varies depending on the memory chips inside, use the data provided here as a source of reference. Also note that the performance of CARD2 varies slightly depending on the size of the save region.

To measure the performance of SD Cards, we used media formatted with a formatter that complies with the SD File System Specification. Performance may differ even within the same class of media, depending on capacity and manufacturer. Because SD card performance varies by media and by individual unit, use the data provided here only as a reference.

Also note that the data for CTR-SDK 3.x series and later versions is treated the same because the results were almost identical.

 

The following environment was used for the measurements.

  • Release build
  • Wait simulation function OFF
  • Development card speed setting “Fast”
  • Normal access priority (PRIORITY_APP_NORMAL)

 

Note:

If you access the file system with real-time priority, base your performance design on the data in the documentation bundled with the CTR-SDK, rather than on actual measurements.

For more information about access priority, see the CTR-SDK documentation.

Note:

The performance of ROM archives under emulation in the PARTNER-CTR Debugger has been tuned to nearly match the listed figures (performance of development cards).

The performance of retail cards is almost the same as the performance of development cards.

8.7.1. ROM Archive

The benchmark data presented in this section is reference data from measuring access to ROM archives on development cards (CTR flash cards) and CARD2. Because it was collected using a sample device whose sole purpose was measurement testing, there is no guarantee that processing will complete within the times presented.

8.7.1.1. Batch Read

The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, the sequence TryInitialize, TryRead, Finalize was performed 50 times.

To prevent cached data from skewing the results, we attempted to only read unique files whenever possible. The blue line is the CTR flash card and the red line is CARD2.

Figure 8-4. Transfer Rates (Batch Read)

[Figure: transfer rate (MB/s, 0.0 to 8.0) plotted against read size, 1 KB to 8 MB]

The tables below show the time taken by operations other than those of the TryRead function.

Table 8-5. Time Taken for Processes Other Than the TryRead Function

Process

Average

Best Score

Worst Score

TryInitialize

0.045 ms

0.021 ms

0.164 ms

Finalize

0.006 ms

0.002 ms

0.024 ms

8.7.1.2. Stream Reads

The figures below show the time required by the TryRead function to read files having the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function.

To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache. The blue line is the CTR flash card and the red line is CARD2.

Figure 8-5. Transfer Rates (Stream Read)

[Figure: transfer rate (MB/s, 0.0 to 8.0) plotted against buffer size, 1 KB to 8 MB]

8.7.2. Save Data

The benchmark data presented in this section is reference data from measuring access to the backup region on development cards (CTR flash cards) and CARD2. Because it was collected using a sample device whose sole purpose was measurement testing, there is no guarantee that processing will complete within the times presented.

8.7.2.1. Read

The figures below show the results of batch-reading save data of the specified size, written to a development card (512 KB of backup memory) or CARD2, 50 times. In short, the sequence TryInitialize, TryRead, Finalize was performed 50 times. The red line shows results with no automatic redundancy; the blue line shows results with automatic redundancy enabled.

The backup memory set for CARD2 was 256 MB. CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.

Figure 8-6. Save Data Read Rates (Average Values)

[Figure: rate (KB/s) plotted against data size (KB)]

Figure 8-7. Comparison of CARD2 and Save Data Read Rates (Average Values)

[Figure: rate (KB/s) plotted against data size (KB)]

To minimize the impact of caching, the measurement program varied the number of files used according to capacity. When reading files of 4 KB or less, every file read was unique, but for larger sizes caching may have had an effect. Note that even with automatic redundancy enabled, caching had little effect for files of 16 KB or more, because the effect of caching changes depending on factors such as whether a file is read immediately after being written or whether the same file is read again.

The tables below show the time taken by operations other than those of the TryRead function.

Table 8-6. Time Required for Operations Other Than the TryRead Function (Save Data)

Process

Measurement Conditions

Average

Best Score

Worst Score

TryInitialize

No automatic redundancy

1.276 ms

0.708 ms

7.194 ms

Automatic redundancy enabled

1.104 ms

0.715 ms

7.031 ms

Finalize

No automatic redundancy

0.480 ms

0.345 ms

1.910 ms

Automatic redundancy enabled

0.395 ms

0.277 ms

2.034 ms

8.7.2.2. Writing

This section shows the time, measured on a test unit, required to execute the TryWrite and TryFlush functions when writing save data of the specified size 50 times. In short, the sequence TryInitialize, TryWrite, TryFlush, Finalize was performed 50 times. Benchmarks were recorded for two types of save data: saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.

The backup memory set for CARD2 was 256 MB.

No Automatic Redundancy

The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).

CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.

Figure 8-8. Save Data Write Rates (No Automatic Redundancy)

[Figure: rate (KB/s) plotted against data size (KB)]

Figure 8-9. Comparison of CARD2 and Save Data Write Rates (No Automatic Redundancy)

[Figure: rate (KB/s) plotted against data size (KB)]

There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 350 ms) is due to the time taken by the TryFlush function.

The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.

Table 8-7. Time Taken by the TryInitialize and Finalize Functions (No Automatic Redundancy)

Process

Measurement Conditions

Average

Best Score

Worst Score

TryInitialize

No file

620.133 ms

478.054 ms

926.344 ms

Overwrite file of the same size

1.269 ms

0.709 ms

3.889 ms

Finalize

No file

0.489 ms

0.337 ms

2.084 ms

Overwrite file of the same size

0.483 ms

0.337 ms

2.203 ms

The TryInitialize operation takes significantly more time when creating a new file than when overwriting an existing file.

Automatic Redundancy Enabled

The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).

CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.

Figure 8-10. Save Data Write Rates (Automatic Redundancy Enabled)

[Figure: rate (KB/s) plotted against data size (KB)]

Figure 8-11. Comparison of CARD2 and Save Data Write Rates (Automatic Redundancy Enabled)

[Figure: rate (KB/s) plotted against data size (KB)]

The benchmark data shows no major difference in processing time between creating and overwriting a file, other than that the CommitSaveData operations are around 100 ms faster when data of the same size is being overwritten.

The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.

Table 8-8. Time Taken by the TryInitialize and Finalize Functions (Automatic Redundancy Enabled)

Process

Measurement Conditions

Average

Best Score

Worst Score

TryInitialize

No file

1.267 ms

0.813 ms

3.365 ms

Overwrite file of the same size

1.997 ms

0.667 ms

11.511 ms

Finalize

No file

0.368 ms

0.259 ms

1.807 ms

Overwrite file of the same size

0.384 ms

0.260 ms

1.713 ms

Some processes were dramatically slower when a file was overwritten rather than created; otherwise, the results showed almost no differences between the two measurement conditions.

The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”

Figure 8-12. Save Data Write Rates (Comparison)

[Figure: rate (KB/s) plotted against data size (KB)]

Figure 8-13. CARD2 and Save Data Write Rates (Comparison)

[Figure: rate (KB/s) plotted against data size (KB)]

If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.

8.7.2.3. Mounting

The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the development card was 512 KB.

Table 8-9. Time Required for Operations Related to Mounting Save Data

| Process        | Measurement Conditions                       | Average     | Best Score  | Worst Score |
|----------------|----------------------------------------------|-------------|-------------|-------------|
| FormatSaveData | No automatic redundancy (factory state)      | 2857.726 ms | 2843.064 ms | 2872.938 ms |
| FormatSaveData | No automatic redundancy                      | 8535.636 ms | 8339.167 ms | 8661.502 ms |
| FormatSaveData | Automatic redundancy enabled (factory state) | 2476.430 ms | 2462.935 ms | 2495.924 ms |
| FormatSaveData | Automatic redundancy enabled                 | 8186.308 ms | 8015.401 ms | 8278.390 ms |
| MountSaveData  | No automatic redundancy                      | 26.955 ms   | 25.682 ms   | 29.150 ms   |
| MountSaveData  | Automatic redundancy enabled                 | 30.495 ms   | 28.150 ms   | 33.798 ms   |
| Unmount        | No automatic redundancy                      | 1.476 ms    | 1.081 ms    | 3.318 ms    |
| Unmount        | Automatic redundancy enabled                 | 1.124 ms    | 0.873 ms    | 2.324 ms    |

It takes much more processing time to format a backup region when data has been written to it than when the backup region is in factory condition. The time it takes to format the region also depends on the amount of data written.

8.7.2.4. Deleting Files

The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size of the development card was 512 KB.

Table 8-10. Time Required for the TryDeleteFile Function

| Process       | Measurement Conditions       | Average    | Best Score | Worst Score |
|---------------|------------------------------|------------|------------|-------------|
| TryDeleteFile | No automatic redundancy      | 658.566 ms | 528.313 ms | 946.492 ms  |
| TryDeleteFile | Automatic redundancy enabled | 1.465 ms   | 0.705 ms   | 4.004 ms    |

8.7.3. Extended Save Data

This section shows benchmark data for accessing the extended save data region on media with the same specifications and capacity as the SD Card packaged with the CTR system. Use this data only as a reference: it was collected on a sample device dedicated to measurement testing, so there is no guarantee that processing will complete within the times presented.

Note:

The media used for measurements exhibited a 2.5% drop in read performance and a 5.2% drop in write performance in the transfer rate of a 505 KB file versus a 504 KB file. However, the degree of performance degradation and the file size at which it occurs may differ depending on the media’s capacity, manufacturer, or other factors, even for media of the same class. Because performance varies with the media used, be sure to treat the data listed here only as a reference.

8.7.3.1. Mounting

The tables below show the time required, as measured on a test unit, for the following sequence of operations: mounting extended save data with the MountExtSaveData function, deleting it with DeleteExtSaveData, creating it with CreateExtSaveData, remounting it with MountExtSaveData, and unmounting it with Unmount.

Table 8-11. Time Required for Extended Save Data Operations

| Process                    | Average     | Best Score  | Worst Score |
|----------------------------|-------------|-------------|-------------|
| MountExtSaveData           | 40.844 ms   | 39.278 ms   | 42.136 ms   |
| DeleteExtSaveData          | 105.296 ms  | 77.204 ms   | 165.070 ms  |
| CreateExtSaveData          | 2346.382 ms | 2055.237 ms | 2468.994 ms |
| MountExtSaveData (remount) | 38.256 ms   | 36.832 ms   | 40.576 ms   |
| Unmount                    | 1.842 ms    | 1.619 ms    | 2.319 ms    |

8.7.3.2. Creating Files

The tables below show the time required, as measured on a test unit, for the TryCreateFile function to execute.

Table 8-12. Time Required to Create Files in Extended Save Data

| Process       | Average    | Best Score | Worst Score |
|---------------|------------|------------|-------------|
| TryCreateFile | 766.200 ms | 363.559 ms | 2125.721 ms |

8.7.3.3. Loading Files

The figures below show the results of batch-reading extended save data files 50 times; that is, the sequence TryInitialize → TryRead → Finalize was performed 50 times. The time required by the TryRead function was measured on a test unit.

Figure 8-14. Extended Save Data File Read Rates (Average Values)

[Axes: Rate (KB/s) vs. Data Size (KB)]
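The read-rate figures are derived by timing only the read step of each iteration. A minimal stand-in using ordinary file I/O in place of the CTR SDK calls (the file size and helper name are illustrative):

```python
import os
import tempfile
import time

def measure_read_rate_kbs(path, iterations=50):
    """Return the average batch-read rate in KB/s over `iterations`
    reads, timing only the read itself -- mirroring the
    TryInitialize -> TryRead -> Finalize loop in which only TryRead
    is timed, with ordinary file I/O standing in for the SDK calls."""
    size_kb = os.path.getsize(path) / 1024.0
    total_s = 0.0
    for _ in range(iterations):
        with open(path, "rb") as f:  # open/close stand in for init/finalize
            start = time.perf_counter()
            f.read()                 # stands in for TryRead
            total_s += time.perf_counter() - start
    return size_kb * iterations / total_s

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 64 * 1024)   # 64 KB test file
rate = measure_read_rate_kbs(tmp.name)
os.unlink(tmp.name)
```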

To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For capacities up to 8 MB, every read operation was to a different, unique file. For 16 MB capacity, the testing involved 30 files, and for 32 MB the testing involved 16 files.
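The file-rotation rule just described can be restated as a small lookup; a sketch, assuming the 50-read loop above (the function name and the 50-read default are illustrative):

```python
def files_for_capacity(capacity_mb, reads=50):
    """Number of distinct files the measurement program rotates through
    for a given capacity, per the cache-avoidance rules described
    above."""
    if capacity_mb <= 8:
        return reads  # every read goes to a different, unique file
    if capacity_mb == 16:
        return 30
    if capacity_mb == 32:
        return 16
    raise ValueError("capacity not covered by the measurements")

counts = {c: files_for_capacity(c) for c in (8, 16, 32)}
```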

The tables below show the time taken by operations other than those of the TryRead function.

Table 8-13. Time Taken for Processes Other Than the TryRead Function

| Process       | Average   | Best Score | Worst Score |
|---------------|-----------|------------|-------------|
| TryInitialize | 14.098 ms | 11.246 ms  | 21.259 ms   |
| Finalize      | 1.433 ms  | 1.050 ms   | 3.150 ms    |

8.7.3.4. Writing Files

The figures below show the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing files to extended save data 50 times; that is, the sequence TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times.

Figure 8-15. Extended Save Data File Write Rates (Average Values)

[Axes: Rate (KB/s) vs. Data Size (KB)]

To prevent cached data from skewing the results, the number of files used for writing depended on the media capacity. For capacities up to 8 MB, every write operation was to a different, unique file. For 16 MB capacity, the testing used 30 files, and for 32 MB it used 16 files.

The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.

Table 8-14. Time Taken by the TryInitialize and Finalize Functions

| Process       | Average   | Best Score | Worst Score |
|---------------|-----------|------------|-------------|
| TryInitialize | 15.406 ms | 12.528 ms  | 23.478 ms   |
| Finalize      | 1.451 ms  | 1.040 ms   | 3.259 ms    |

8.7.3.5. Deleting Files

The tables below show the time required, as measured on a test unit, for the TryDeleteFile function to delete files in extended save data.

Table 8-15. Time Required to Delete Files in Extended Save Data

| Process       | Average    | Best Score | Worst Score |
|---------------|------------|------------|-------------|
| TryDeleteFile | 196.322 ms | 108.301 ms | 470.286 ms  |

8.7.4. ROM Archives (Downloadable Applications)

The benchmark data presented in this section resulted from measuring access to downloadable applications’ ROM archives on media with the same specifications and capacity as the SD Card bundled with the CTR system. Because the data was collected on a sample device dedicated to measurement testing, there is no guarantee that processing will complete within the times presented.

8.7.4.1. Batch Read

The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times; that is, the sequence TryInitialize → TryRead → Finalize was performed 50 times. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).

Figure 8-16. Downloadable Applications Transfer Rates (Batch Read)

[Axes: Transfer Rate (MB/s) vs. Read Size]

To prevent cached data from skewing the results, we attempted to read only unique files whenever possible.

The tables below show the time taken by operations other than those of the TryRead function.

Table 8-16. Downloadable Applications Time Required for Operations Other Than the TryRead Function

| Process       | Average  | Best Score | Worst Score |
|---------------|----------|------------|-------------|
| TryInitialize | 0.046 ms | 0.022 ms   | 0.220 ms    |
| Finalize      | 0.006 ms | 0.002 ms   | 0.023 ms    |

8.7.4.2. Stream Reads

The figures below show the time required by the TryRead function to read files having the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).

To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache.

Figure 8-17. Downloadable Applications Transfer Rates (Stream Read)

[Axes: Transfer Rate (MB/s) vs. Buffer Size]
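The stream-read measurement reads a fixed-size file in chunks of the chosen buffer size and derives a transfer rate. A minimal stand-in using ordinary file I/O in place of the CTR SDK calls (the 1 MB test file is illustrative; the measurements above used 8 MB files):

```python
import os
import tempfile
import time

def stream_read_rate_mbs(path, buffer_size):
    """Read `path` to the end in `buffer_size` chunks and return the
    transfer rate in MB/s -- the pattern behind the stream-read
    figure, with ordinary file I/O standing in for the SDK calls."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return (total / (1024 * 1024)) / elapsed

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * (1024 * 1024))  # 1 MB stand-in file
rates = {size: stream_read_rate_mbs(tmp.name, size) for size in (4096, 65536)}
os.unlink(tmp.name)
```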

8.7.5. Save Data (Downloadable Applications)

The benchmark data presented in this section resulted from measuring access to a downloadable application's backup region on media with the same specifications and capacity as the SD Card bundled with the CTR system. Use this data only as a reference: it was collected on a sample device dedicated to measurement testing, so there is no guarantee that processing will complete within the times presented.

8.7.5.1. Read

The figures below show the results of batch-reading the specified amount of save data written to a downloadable application’s backup region (set to 512 KB), with the operation performed a total of 50 times; that is, the sequence TryInitialize → TryRead → Finalize was performed 50 times. The time required by the TryRead function was measured on a test unit. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.

Figure 8-18. Downloadable Applications Save Data Read Rates (Average Values)

[Axes: Rate (KB/s) vs. Data Size (KB)]

To minimize the impact of caching, the measurement program changed the number of files used based on the size of the data being read. When reading files of 4 KB or less, every file read was different; for larger sizes, caching may have affected the results. Also note that the effect of caching changes when reading immediately after writing or when reading the same file more than once.

The tables below show the time taken by operations other than those of the TryRead function.

Table 8-17. Downloadable Applications Time Required for Operations Other Than the TryRead Function (Save Data)

| Process       | Measurement Conditions       | Average  | Best Score | Worst Score |
|---------------|------------------------------|----------|------------|-------------|
| TryInitialize | No automatic redundancy      | 1.259 ms | 0.715 ms   | 7.613 ms    |
| TryInitialize | Automatic redundancy enabled | 1.242 ms | 0.714 ms   | 5.633 ms    |
| Finalize      | No automatic redundancy      | 0.492 ms | 0.349 ms   | 2.099 ms    |
| Finalize      | Automatic redundancy enabled | 0.398 ms | 0.278 ms   | 2.015 ms    |

8.7.5.2. Writing

This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data 50 times; that is, the sequence TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. The downloadable application’s backup memory size was 512 KB. Benchmarks were recorded for save data with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.

No Automatic Redundancy

The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).

Figure 8-19. Downloadable Applications Save Data Write Rates (No Automatic Redundancy)

[Axes: Rate (KB/s) vs. Data Size (KB)]

There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten; the difference in processing speed (approximately 200 ms) is due to the time taken by the TryFlush function.

The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.

Table 8-18. Downloadable Applications Time Taken by the TryInitialize and Finalize Functions (No Automatic Redundancy)

| Process       | Measurement Conditions          | Average    | Best Score | Worst Score |
|---------------|---------------------------------|------------|------------|-------------|
| TryInitialize | No file                         | 198.837 ms | 55.597 ms  | 304.459 ms  |
| TryInitialize | Overwrite file of the same size | 1.232 ms   | 0.720 ms   | 4.068 ms    |
| Finalize      | No file                         | 0.467 ms   | 0.333 ms   | 2.152 ms    |
| Finalize      | Overwrite file of the same size | 0.466 ms   | 0.337 ms   | 2.134 ms    |

The TryInitialize function takes more time when a file is created than when an existing file is overwritten.

Automatic Redundancy Enabled

The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).

Figure 8-20. Downloadable Applications Save Data Write Rates (Automatic Redundancy Enabled)

[Axes: Rate (KB/s) vs. Data Size (KB)]

The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.

Table 8-19. Downloadable Applications Time Taken by the TryInitialize and Finalize Functions (Automatic Redundancy Enabled)

| Process       | Measurement Conditions          | Average  | Best Score | Worst Score |
|---------------|---------------------------------|----------|------------|-------------|
| TryInitialize | No file                         | 1.256 ms | 0.810 ms   | 3.412 ms    |
| TryInitialize | Overwrite file of the same size | 1.535 ms | 0.662 ms   | 8.886 ms    |
| Finalize      | No file                         | 0.396 ms | 0.260 ms   | 2.002 ms    |
| Finalize      | Overwrite file of the same size | 0.385 ms | 0.260 ms   | 2.033 ms    |

Apart from some dramatically slower worst-case times when a file was overwritten rather than created, the results showed almost no differences between the two measurement conditions.

The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”

Figure 8-21. Downloadable Applications Save Data Write Rates (Comparison)

[Axes: Rate (KB/s) vs. Data Size (KB)]

If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because only a limited amount of time is available to finish saving before the system powers off.

8.7.5.3. Mounting

The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the downloadable application was 512 KB.

Table 8-20. Downloadable Applications Time Required for Operations Related to Mounting Save Data

| Process        | Measurement Conditions                           | Average    | Best Score | Worst Score |
|----------------|--------------------------------------------------|------------|------------|-------------|
| FormatSaveData | No automatic redundancy (right after importing)  | 430.314 ms | 360.571 ms | 490.037 ms  |
| FormatSaveData | No automatic redundancy                          | 436.077 ms | 428.289 ms | 446.790 ms  |
| FormatSaveData | Automatic redundancy enabled (right after importing) | 206.321 ms | 181.559 ms | 253.866 ms  |
| FormatSaveData | Automatic redundancy enabled                     | 197.499 ms | 180.707 ms | 215.180 ms  |
| MountSaveData  | No automatic redundancy                          | 24.079 ms  | 22.886 ms  | 26.050 ms   |
| MountSaveData  | Automatic redundancy enabled                     | 24.674 ms  | 22.770 ms  | 25.738 ms   |
| Unmount        | No automatic redundancy                          | 1.694 ms   | 1.298 ms   | 3.496 ms    |
| Unmount        | Automatic redundancy enabled                     | 1.338 ms   | 0.895 ms   | 2.622 ms    |

For these measurements, the state immediately after importing was reproduced by deleting the backup region with SaveDataFiler.

8.7.5.4. Deleting Files

The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size for the downloadable application was 512 KB.

Table 8-21. Downloadable Applications Time Required for the TryDeleteFile Function

| Process       | Measurement Conditions       | Average    | Best Score | Worst Score |
|---------------|------------------------------|------------|------------|-------------|
| TryDeleteFile | No automatic redundancy      | 171.093 ms | 55.857 ms  | 299.145 ms  |
| TryDeleteFile | Automatic redundancy enabled | 1.546 ms   | 0.712 ms   | 4.581 ms    |

8.7.6. Comparison of CTR Flash Cards and CARD2

The following sections compare CARD2 performance with that of CTR flash cards.

8.7.6.1. ROM Archive

ROM archive performance is approximately equal between CTR flash cards and CARD2.

8.7.6.2. Save Data

No Automatic Redundancy

CARD2 file read performance improved by up to approximately 5x.

File write performance improved by up to approximately 30x.

Automatic Redundancy Enabled

CARD2 file read performance improved by up to approximately 5x.

File write performance improved by up to approximately 33x.

