8.1. Profiling CPU and GPU Operations
If you run the GPU during CPU processing, the CPU receives interrupts from the GPU and competes with the GPU to access main memory. These delays make it impossible to accurately measure execution time.
To accurately measure the execution time of the CPU and GPU, you must wait for an operation on the CPU to finish completely before running an operation on the GPU.
The following procedure provides a specific example of how to do this.
- Call nngxClearCmdlist to clear the command list.
- Call nngxStopCmdlist to stop execution of the GPU.
- Call nngxBindCmdlist to bind the command list.
- Start measuring the CPU time.
- Issue commands from the CPU.
- Stop measuring the CPU time.
- Call nngxSplitCmdlist to issue a command request (render command request).
- Start measuring the GPU time.
- Call nngxRunCmdlist to run the command list on the GPU.
- Call nngxWaitCmdlistDone to wait for the GPU to finish its operations.
- Stop measuring the GPU time.
Related Functions:
nngxClearCmdlist, nngxStopCmdlist, nngxBindCmdlist, nngxRunCmdlist, nngxWaitCmdlistDone
8.2. Reducing the Processing Load to Display Model Data (Important)
The processing load to display a model is split between the CPU, vertex shader (GPU), and fragment shader (GPU). To reduce the processing load, you must determine which process is the bottleneck when you create model data. Also, if you place texture data and vertex data in main memory, note that access conflicts may cause the CPU performance to decline.
The following figure shows a procedure for reducing the processing load to display model data.
8.3. Hardware Configuration
This section describes the configuration of the main hardware related to graphics.
When an application runs as a standard application on SNAKE, the hardware configuration is identical to CTR.
Hardware Configuration for Extended Applications
When an application runs as an extended application on SNAKE, the CPU performance and main memory change as illustrated below.
8.3.1. CPU
The CPU generates graphics commands and conveys instructions to the GPU. One of the CPU cores is used exclusively by applications as the application core, while the remaining core is used as the system core. The CPU is connected to the GPU, main memory, and other modules via the main bus.
When an application runs as an extended application on SNAKE, the CPU clock rate increases to 804 MHz and the 2 MB L2 cache shared by all CPU cores becomes available.
8.3.2. Main Memory
Many kinds of data can be placed in main memory. A maximum of 64 MB of main memory can be used from the application. This main memory is connected to the main bus.
When an application runs as an extended application on SNAKE, up to 124 MB of memory is available for use by the application.
8.3.3. DMA
The DMA module executes the DMA transfer commands contained in command requests. Its uses include the transfer of texture and vertex data from main memory to VRAM.
8.3.4. Main Bus
This bus connects the CPU, the GPU, main memory, and other devices. Many kinds of data are exchanged primarily via this bus. To adjust the priority of modules that use the main bus, use the nngxSetMemAccessPrioMode function.
8.3.5. GPU
The GPU comprises the P3D, PPF, PSC, PDC, VRAM A, and VRAM B modules, and the internal bus that connects those modules.
- P3D: This module references the 3D commands accumulated in the command buffer and performs the actual rendering.
- PPF: This module performs block-to-linear conversion based on command request post-transfer commands and transfers data.
- PSC: This module performs memory fill commands (clearing) for command requests.
- PDC: This module transfers the content of the display buffer to the LCD.
- VRAM A/VRAM B: Memory where vertices and textures are placed. VRAM A and VRAM B can operate separately.
- GPU internal bus: The bus that connects the modules in the GPU. It is connected to the main bus.
Note:
The GPU internal bus is around twice the bandwidth of the main bus. For data that is frequently accessed in the GPU, you can boost the access speed by placing the data in VRAM, which also lifts the load off the main bus.
When data is exchanged between the GPU internal bus and the main bus, it is the main bus that is the rate-limiting factor.
8.3.6. LCD
This refers to the two 3DS LCD screens. They are connected to the PDC module within the GPU. When images are being displayed, the PDC gets one line of data from the display buffer for each scan line on the LCD. When the display buffer is located in main memory, the fetching of data places a periodic load on the main bus.
8.4. GPU Profiling Feature
You can use the GPU profiling feature to measure processing in each hardware module in the GPU. The following table lists what information can be obtained using this feature.
- Busy clock: The number of busy clock cycles that occur as a result of the vertex shader and fragment lighting in the various GPU modules.
- Shader execution clock: The number of execution clock cycles and stall clock cycles for each vertex shader processor.
- Number of vertices entered in vertex cache: The number of vertices entered into the post vertex cache. By taking the difference from the number of vertices actually used for rendering, you can determine the effective number of vertices in the post vertex cache.
- Number of input/output polygons: The number of polygons input for triangle setup and the number of polygons output after clipping.
- Number of input fragments: The number of fragments input to the per-fragment operation module.
- Number of accesses to memory: The number of times memory is accessed by each GPU hardware module.
The busy clock count obtained for each hardware module includes not only the cycles spent actually processing data, but also the time the module spends waiting, unable to output data because later modules are busy. It is therefore effective to optimize not only modules with particularly high busy-clock values, but also the final-stage module among those with high busy values.
Note:
For more information on the profiling feature, see the CTR Programming Manual: Advanced Graphics.
8.5. Notes for Using the DMPGL
The following sections provide notes about performance when using the DMPGL.
8.5.1. Maintaining Internal State Consistency
The internal state refers to local data that is saved by the DMPGL driver; it could also be called a mirror of the hardware settings. Because DMPGL function calls update the internal state and commands configure the hardware settings, the DMPGL driver’s internal state may become inconsistent with the hardware settings when commands are issued directly or when command caches are used.
If there is a discrepancy between the internal state and the hardware settings, you can restore consistency by forcing the hardware settings to be validated (by issuing complete command packets). Validation actually occurs after nngxUpdateState(NN_GX_STATE_ALL) is run.
You can also issue commands only for the hardware settings that were changed by the command cache or by directly issued commands, rather than for all settings, which trims the process of issuing commands. However, the application must then carefully keep track of which states correspond to, and depend on, the commands that have been used.
When nngxSetCommandGenerationMode(NN_GX_CMDGEN_MODE_UNCONDITIONAL) is called, commands are issued regardless of the comparison results for the internal state. Only the following settings are affected by this mode.
- Uniform settings for the reserved fragment shader.
- Integer uniform settings for the vertex shader.
- LUT data settings.
- DMPGL functions associated with NN_GX_STATE_OTHERS.
Related Functions:
nngxUpdateState, nngxValidateState, glDrawElements, glDrawArrays, nngxSetCommandGenerationMode
8.5.2. Removing the Use of glGetUniformLocation (Important)
One way to configure a shader uniform using DMPGL involves using the glGetUniformLocation function to get its location. Because it performs processor-intensive operations like string comparison, heavy use of glGetUniformLocation is not recommended.
To get Location values for the fixed fragment shader, use the constants defined for that purpose. The location of each uniform of a program object is guaranteed not to change until the program object is either destroyed by the glDeleteProgram function or relinked by the glLinkProgram function.
Note:
Macros for the Location values are defined in $CTR_SDK/include/nn/gx/CTR/gx_UniformLocationForFragmentShader.h.
Related Functions:
glGetUniformLocation, glDeleteProgram, glLinkProgram
8.5.3. Cost of Switching Programs With glUseProgram
When glUseProgram is called, a shader binary must be loaded during validation if the new program object and the old program object link to different shader binaries.
It is not recommended to render by frequently switching program objects that link to separate shader binaries. As an effective alternative, consider using a conditional branch instruction to allow your vertex shader binaries to be shared by multiple program objects, or adjust the render order to minimize the number of calls to glUseProgram.
Related Functions:
glUseProgram
8.5.4. Cost of Calling glUseProgram(0)
Calling glUseProgram(0) causes all validation flags to be set the next time glUseProgram is called, making shared shader binaries ineffective. DMPGL is designed to monitor differential updates, so there is no need to call glUseProgram(0) explicitly.
Related Functions:
glUseProgram
8.5.5. Notes About the transpose Parameter of glUniformMatrix
Matrices are treated as column-major in OpenGL ES. The transpose parameter of the glUniformMatrix functions, which indicates whether a matrix is transposed before it is loaded, can only be set to GL_FALSE. In contrast, DMPGL treats matrices as row-major within the graphics driver, and allows GL_TRUE to be specified as the transpose parameter of the glUniformMatrix functions.
Due to internal differences between DMPGL and OpenGL ES, matrices are implicitly transposed if GL_FALSE is specified in the transpose parameter of the glUniformMatrix functions.
Matrices generated using the MATH library of the CTR-SDK, on the other hand, are row-major just like the DMPGL driver. By specifying GL_TRUE in the transpose parameter of the glUniformMatrix functions, you can set such matrices without transposing them.
Related Functions:
glUniformMatrix
8.5.6. Using Textures in the Native PICA Format
The current CTR-SDK specifications support textures in standard OpenGL format. Using the standard format requires that the textures be converted to the native format when glTexImage2D is called. (This conversion is done automatically.) This conversion can be omitted by maintaining textures in native format instead.
Related Functions:
glTexImage2D
8.5.7. Setting Uniforms for Vertex Shaders
Overwriting the uniforms in a vertex shader is more efficient if the registers being overwritten are contiguous.
- Case 1: Overwrite all registers ([c0...c15]).
- Case 2: Overwrite non-contiguous blocks of registers ([c0...c3], [c8...c11], [c16...c19], [c24...c27]).
In both cases, the number of registers being overwritten is the same, but Case 1 is more efficient because it allows transfer commands to be consolidated.
Related Functions:
glUniform
8.5.8. Updating Buffers
The glBufferData and glTexImage2D functions, which rewrite vertex and texture data, access data in main memory. Accessing this data requires the CPU to apply the changes to the GPU. Because this process puts a relatively large overhead on the system, the overall processor load increases proportionally to the number of calls to such functions.
We recommend that you call these functions in advance when the application is being initialized, load vertex buffer and texture data early, and avoid loading data during per-frame operations whenever possible.
The same type of overhead is involved in using the glBufferSubData function to partially update vertex data. To reduce this overhead, you might consider gathering all required data for a partial update into a single chunk and processing it all with a single call to glBufferSubData.
Related Functions:
glBufferData, glTexImage2D, glCompressedTexImage2D, glTexImage1D, glBufferSubData, glTexSubImage1D
8.5.9. Validation
With DMPGL, configuration changes made by calls to the GL API are applied to the hardware by an operation called validation, which actually writes the commands to the command buffer. Validation occurs when the following functions are called.
- glDrawArrays
- glDrawElements
- nngxValidateState
Each setting is divided into several categories, and a state update flag is set for each category that is updated. During validation, these flags are used to determine which categories to update. Updates are applied one category at a time. The processing load for glDrawArrays and glDrawElements increases with the number of categories to update.
If you use glUseProgram to switch programs, state update flags are only set for categories that differ from the previous program. Details about the functions for each category are shown below. If functions like glUniform and glEnable are used to change the parameters of a particular category, the update flag of that category is set.
Category: Framebuffers
- glBindFramebuffer
- glBindRenderbuffer
- glDeleteFramebuffers
- glDeleteRenderbuffers
- glFramebufferRenderbuffer
- glFramebufferTexture2D
- glRenderbufferStorage
- glReadPixels
- glClear
Category: Vertex buffers
- glBindBuffer
- glBufferData
- glBufferSubData
- glDeleteBuffers
Category: Triangles
- glEnable
- glDisable
- glUseProgram
- glDepthRangef
- glPolygonOffset
Category: Lighting LUTs
- glUseProgram
- glUniform* for the following reserved uniforms:
  dmp_LightEnv.lutEnabledSP
  dmp_LightEnv.lutEnabledD0
  dmp_LightEnv.lutEnabledD1
  dmp_LightEnv.fresnelSelector
  dmp_LightEnv.lutEnabledRefl
  dmp_FragmentMaterial.sampler{D0,D1,SP,FR,RB,RG,RR}
  dmp_FragmentLighting.enabled
- glUniformsDMP
- glRestoreProgramsDMP
Category: Fog LUTs
- glUseProgram
- glUniform* for the following reserved uniforms:
  dmp_Fog.mode (GL_FOG or GL_GAS_DMP)
  dmp_Fog.sampler
- glUniformsDMP
- glRestoreProgramsDMP
Category: Procedural texture LUTs
- glUseProgram
- glUniform* for the following reserved uniforms:
  dmp_Texture[3].ptSampler{Rgb,Alpha,Noise,R,G,B,A}
  dmp_Texture[3].ptNoiseEnable
  dmp_Texture[3].ptAlphaSeparate
  dmp_Texture[3].samplerType
- glUniformsDMP
- glRestoreProgramsDMP
Category: Vertex arrays
- glBindBuffer
- glEnableVertexAttribArray
- glDisableVertexAttribArray
- glVertexAttribPointer
Category: Current vertices
- glBindBuffer
Category: Framebuffer access
- glEnable
- glDisable
- glDepthFunc
- glEarlyDepthFuncDMP
- glColorMask
- glDepthMask
- glStencilMask
- glUseProgram
- glUniform* for the following reserved uniform:
  dmp_FragOperation.mode
- glUniformsDMP
- glRestoreProgramsDMP
Category: Scissor/viewport
- glEnable
- glDisable
- glScissor
- glViewport
Category: Texture 0
- glUseProgram
- glUniform* for the following reserved uniform:
  dmp_Texture[0].samplerType
- glUniformsDMP
- glRestoreProgramsDMP
- glBindTexture
- glDeleteTextures
- glCompressedTexImage2D
- glCopyTexImage2D
- glCopyTexSubImage2D
- glTexImage2D
- glTexParameteriv
- glTexParameterfv
Category: Texture 1
- Same as Texture 0
- glUniform* for the following reserved uniform:
  dmp_Texture[1].samplerType
Category: Texture 2
- Same as Texture 0
- glUniform* for the following reserved uniform:
  dmp_Texture[2].samplerType
Category: Texture 3
- glUseProgram
- glRestoreProgramsDMP
- glBindTexture
- glDeleteTextures
- glCompressedTexImage2D
- glCopyTexImage2D
- glCopyTexSubImage2D
- glTexImage2D
- glTexParameteriv
- glTexParameterfv
Category: Texture LUTs
- glBindTexture
- glDeleteTextures
- glTexImage1D
- glTexSubImage1D
Category: Program
- glUseProgram
Category: Shader uniforms
- glUseProgram
- glUniform*
Category: Vertex shaders and uniforms for geometry shaders
- glUniformsDMP
Category: Rasterization
- glDrawArrays
- glDrawElements
- glUniform*
Category: Uniforms for reserved fragment shaders
- glUniformsDMP
- glRestoreProgramsDMP
Category: Shader binaries
- glUseProgram
Category: Vertex shader binaries
- glUseProgram
Category: Geometry shader binaries
- glUseProgram
Category: Geometry shader attachments
- glUseProgram
Category: Geometry shader detachments
- glUseProgram
Category: Gas LUTs
- glUseProgram
- glUniform* for the following reserved uniforms:
  dmp_Fog.mode (GL_GAS_DMP)
  dmp_Gas.sampler{TR, TG, TB}
- glUniformsDMP
8.5.10. Functions That Issue Command Requests
The GPU begins executing commands when nngxRunCmdlist is called. Rendering instructions are processed one command request (accumulated in the command list) at a time.
The following functions issue command requests to a command list.
- nngxSplitDrawCmdlist
  Condition: Always
  Command requests added: Render command request
- nngxTransferRenderImage
  Condition: Always
  Command requests added: Render command request, post transfer command request
- glClear
  Condition: Always
  Command requests added: Render command request, memory fill command request
- glBufferData
  Condition: When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument
  Command requests added: DMA transfer command request
- glBufferSubData
  Condition: When glBufferData meets the condition above
  Command requests added: DMA transfer command request
- glTexImage2D
  Condition: When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument
  Command requests added: DMA transfer command request
- glCopyTexImage2D
  Condition: Always
  Command requests added: Render command request, copy texture command request
- glCopyTexSubImage2D
  Condition: Always
  Command requests added: Render command request, copy texture command request
- glRestoreTextureCollectionsDMP, glRestoreVertexStateCollectionsDMP
  Condition: When conditions are met by the functions that generate the commands to be restored (glBufferData, glBufferSubData, glTexImage2D)
  Command requests added: DMA transfer command request
- glDrawArrays, glDrawElements
  Condition: When the reserved uniform dmp_Gas.autoAcc is GL_TRUE and the function is called for the first time after the reserved uniform dmp_FragOperation.mode has its value changed from GL_FRAGOP_MODE_GAS_ACC_DMP
  Command requests added: Render command request
Note 1: Not added when nngxSplitDrawCmdlist is called in advance.
Note 2: One added per color and depth (stencil) buffer. Two are added when both are specified.
8.6. Functions That Cause System Blocking
If you measure performance using the CPU profiler, a large proportion of the measurements may be attributed to system blocking (SYSTEM_WAITING). System blocking indicates that profiling is being performed on multiple threads and none of those threads is active: a system call from some thread is blocking, but the profiler cannot determine which thread it is.
You can use the CPU profiler's select-thread and reverse-call-tree features to trace processing that is attributed to system blocking.
When the profiler traces multiple threads, time is attributed to system blocking because the profiler cannot identify the last active thread or function. You can eliminate this by using the select-thread feature to limit the threads that are profiled. If you then find a function that was previously attributed to system blocking, you can use the reverse call tree to find out where in the application that function is called and trace it from there.
Most CTR-SDK functions have the latent potential to cause the system to block. If possible, Nintendo recommends using the CPU profiler to check performance at crucial points in your implementation.
8.7. FS Library Performance Details
This section provides reference data on the access rates to each archive when the CTR-SDK FS library is used. Specifically, it describes file system performance measurements for ROM archives, save data (backup region), and extended save data (SD Card) when the FS library from the CTR-SDK 3.x series or later is used. For ROM archives and save data, measurements were made presuming the use of card applications (Nintendo 3DS Game Cards) or downloadable applications (SD Card).
Although the development card speed can be set to either "Fast" or "Slow" for the ROM archives of card-based applications, almost no retail devices exhibit "Slow"-level performance. However, ROM archive performance degrades with repeated access. For example, when reading a 512 KB file, "Slow"-level performance initially appears only momentarily (for about one or two internal command cycles), but as degradation progresses, "Slow"-level performance gradually becomes more common.
Because performance of 3DS Game Cards varies depending on the memory chips inside, use the data provided here as a source of reference. Also note that the performance of CARD2 varies slightly depending on the size of the save region.
To measure the performance of SD Cards, we use media that have been formatted with a formatter that complies with the SD File System Specification. Depending on the media capacity and the media maker, the performance may differ even for the same class of media. Because performance of SD cards varies depending on the media and the actual unit, use the data provided here as a source of reference.
Also note that the data for CTR-SDK 3.x series and later versions is treated the same because the results were almost identical.
The following environment was used for the measurements.
- Release build
- Wait simulation function OFF
- Development card speed setting “Fast”
- Normal access priority (PRIORITY_APP_NORMAL)
Note:
If you use this feature to access the file system with real-time priority, design your performance based on the data in the documentation bundled with the CTR-SDK, rather than by relying on actual measurements.
For more information about access priority, see the CTR-SDK documentation.
Note:
The performance of ROM archives under emulation in the PARTNER-CTR Debugger has been tuned to nearly match the listed figures (performance of development cards).
The performance of retail cards is almost the same as the performance of development cards.
8.7.1. ROM Archive
The benchmark data presented in this section is reference data that resulted from measuring access to ROM archives on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device whose sole purpose was measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.1.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times.
To prevent cached data from skewing the results, we attempted to read only unique files whenever possible. The blue line is the CTR flash card and the red line is CARD2.
The tables below show the time taken by operations other than those of the TryRead function.

Process         Average     Best Score    Worst Score
TryInitialize   0.045 ms    0.021 ms      0.164 ms
Finalize        0.006 ms    0.002 ms      0.024 ms
8.7.1.2. Stream Reads
The figures below show the time required by the TryRead function to read files having the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function.
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache. The blue line is the CTR flash card and the red line is CARD2.
8.7.2. Save Data
The benchmark data presented in this section is reference data that resulted from measuring access to the backup region on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device whose sole purpose was measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.2.1. Read
The figures below show the results for batch-reading save data of the specified size, written on a development card or CARD2 that has 512 KB of backup memory, 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
The backup memory set for CARD2 was 256 MB. CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For reads of 4 KB or less, every file read was different, but for larger sizes caching may have had an effect. Note, however, that even with automatic redundancy there was little caching effect for files of 16 KB or more, because the effect of caching depends on factors such as whether a file is read immediately after being written or whether the same file is read again.
The tables below show the time taken by operations other than those of the TryRead function.

Process         Measurement Conditions          Average     Best Score   Worst Score
TryInitialize   No automatic redundancy         1.276 ms    0.708 ms     7.194 ms
                Automatic redundancy enabled    1.104 ms    0.715 ms     7.031 ms
Finalize        No automatic redundancy         0.480 ms    0.345 ms     1.910 ms
                Automatic redundancy enabled    0.395 ms    0.277 ms     2.034 ms
8.7.2.2. Writing
This section shows the results measured on a test unit for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data of the specified size 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
The backup memory set for CARD2 was 256 MB.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 350 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Process         Measurement Conditions            Average      Best Score    Worst Score
TryInitialize   No file                           620.133 ms   478.054 ms    926.344 ms
                Overwrite file of the same size   1.269 ms     0.709 ms      3.889 ms
Finalize        No file                           0.489 ms     0.337 ms      2.084 ms
                Overwrite file of the same size   0.483 ms     0.337 ms      2.203 ms
The TryInitialize operation takes significantly more time when creating a new file than when overwriting an existing file.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created ("No file" below), and the blue line shows results when the data overwrote a file of the same size ("Overwrite same size" below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
The benchmark data shows that whether a file is created or overwritten results in no major difference in processing time, other than the fact that the CommitSaveData operations are around 100 ms faster when data of the same size is being overwritten.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Process         Measurement Conditions            Average     Best Score   Worst Score
TryInitialize   No file                           1.267 ms    0.813 ms     3.365 ms
                Overwrite file of the same size   1.997 ms    0.667 ms     11.511 ms
Finalize        No file                           0.368 ms    0.259 ms     1.807 ms
                Overwrite file of the same size   0.384 ms    0.260 ms     1.713 ms
When a file was overwritten as opposed to created, some processes were dramatically slower. Otherwise, the results showed almost no differences under the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.2.3. Mounting
The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the development card was 512 KB.
Process          Measurement Conditions                        Average       Best Score    Worst Score
FormatSaveData   No automatic redundancy (factory state)       2857.726 ms   2843.064 ms   2872.938 ms
                 No automatic redundancy                       8535.636 ms   8339.167 ms   8661.502 ms
                 Automatic redundancy enabled (factory state)  2476.430 ms   2462.935 ms   2495.924 ms
                 Automatic redundancy enabled                  8186.308 ms   8015.401 ms   8278.390 ms
MountSaveData    No automatic redundancy                       26.955 ms     25.682 ms     29.150 ms
                 Automatic redundancy enabled                  30.495 ms     28.150 ms     33.798 ms
Unmount          No automatic redundancy                       1.476 ms      1.081 ms      3.318 ms
                 Automatic redundancy enabled                  1.124 ms      0.873 ms      2.324 ms
It takes much more processing time to format a backup region when data has been written to it than when the backup region is in factory condition. The time it takes to format the region also depends on the amount of data written.
8.7.2.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size of the development card was 512 KB.
Process         Measurement Conditions          Average      Best Score   Worst Score
TryDeleteFile   No automatic redundancy         658.566 ms   528.313 ms   946.492 ms
                Automatic redundancy enabled    1.465 ms     0.705 ms     4.004 ms
8.7.3. Extended Save Data
This section shows benchmark data for accessing the extended save data region on media having the same specifications and capacity as the SD Card packaged with the CTR system. Use this data as a reference. The benchmark data was collected using a sample device whose sole purpose was measurement testing, so there is no guarantee that processing will complete within the times presented.
Note:
The media used for measurements exhibited a 2.5% drop in performance when reading and a 5.2% drop when writing for the transfer rate of a 505 KB file versus that of a 504 KB file. However, the degree of degradation of performance and the file size at which performance degrades might differ depending on the media’s capacity, manufacturer, or other factors, even for media of the same class. Because there are variations in performance depending on the media being used, be sure to treat the data listed here as a source of reference.
8.7.3.1. Mounting
The tables below show the time required, as measured on a test unit, to execute the MountExtSaveData function for mounting extended save data, the DeleteExtSaveData function for deleting it, the CreateExtSaveData function for creating it, the MountExtSaveData function for remounting it, and the Unmount function for unmounting it.
Process                      Average       Best Score    Worst Score
MountExtSaveData (mount)     40.844 ms     39.278 ms     42.136 ms
DeleteExtSaveData            105.296 ms    77.204 ms     165.070 ms
CreateExtSaveData            2346.382 ms   2055.237 ms   2468.994 ms
MountExtSaveData (remount)   38.256 ms     36.832 ms     40.576 ms
Unmount                      1.842 ms      1.619 ms      2.319 ms
8.7.3.2. Creating Files
The tables below show the time required, as measured on a test unit, for the TryCreateFile function to execute.
Process         Average      Best Score   Worst Score
TryCreateFile   766.200 ms   363.559 ms   2125.721 ms
8.7.3.3. Loading Files
The figures below show the results for batch-reading extended save data files 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function's processing was measured on a test unit.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For capacities up to 8 MB, every read operation was to a different, unique file. For 16 MB capacity, the testing involved 30 files, and for 32 MB the testing involved 16 files.
The tables below show the time taken by operations other than those of the TryRead function.

| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| TryInitialize | 14.098 ms | 11.246 ms | 21.259 ms |
| Finalize | 1.433 ms | 1.050 ms | 3.150 ms |
8.7.3.4. Writing Files
The figures below show the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing files to extended save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times.
To prevent cached data from skewing the results, the number of files used for writing was dependent on the media capacity. For capacities up to 8 MB, every write operation was to a different, unique file. For 16 MB capacity, the testing used 30 files and for 32 MB, the testing used 16 files.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.

| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| TryInitialize | 15.406 ms | 12.528 ms | 23.478 ms |
| Finalize | 1.451 ms | 1.040 ms | 3.259 ms |
8.7.3.5. Deleting Files
8.7.4. ROM Archives (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to a downloadable application's ROM archives on media with the same specifications and capacity as the SD Card bundled with the CTR system. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.4.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to only read unique files whenever possible.
The tables below show the time taken by operations other than those of the TryRead function.

| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| TryInitialize | 0.046 ms | 0.022 ms | 0.220 ms |
| Finalize | 0.006 ms | 0.002 ms | 0.023 ms |
8.7.4.2. Stream Reads
The figures below show the time required by the TryRead function to read files of the same size 50 times while varying the buffer size. The time required to execute the TryRead function was measured on a test unit. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache.
8.7.5. Save Data (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to a downloadable application's backup region on media with the same specifications and capacity as the SD Card bundled with the CTR system. Use this data as a reference. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.5.1. Read
The figures below show the results for batch-reading the specified size of save data written to a downloadable application's save data backup region (set to 512 KB), with this operation performed a total of 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function to execute was measured on a test unit. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different; for larger sizes, there was the possibility that caching would have an impact. Also note that the effect of caching changes when reading immediately after writing or when reading the same file more than once.
The tables below show the time taken by operations other than those of the TryRead function.
Table 8-17. Downloadable Applications: Time Required for Operations Other Than the TryRead Function (Save Data)

| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryInitialize | No automatic redundancy | 1.259 ms | 0.715 ms | 7.613 ms |
| TryInitialize | Automatic redundancy enabled | 1.242 ms | 0.714 ms | 5.633 ms |
| Finalize | No automatic redundancy | 0.492 ms | 0.349 ms | 2.099 ms |
| Finalize | Automatic redundancy enabled | 0.398 ms | 0.278 ms | 2.015 ms |
8.7.5.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. The downloadable application backup memory size was 512 KB. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (approximately 200 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-18. Downloadable Applications: Time Taken by the TryInitialize and Finalize Functions (No Automatic Redundancy)

| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryInitialize | No file | 198.837 ms | 55.597 ms | 304.459 ms |
| TryInitialize | Overwrite file of the same size | 1.232 ms | 0.720 ms | 4.068 ms |
| Finalize | No file | 0.467 ms | 0.333 ms | 2.152 ms |
| Finalize | Overwrite file of the same size | 0.466 ms | 0.337 ms | 2.134 ms |
The TryInitialize function takes more time when a file is created than when an existing file is overwritten.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-19. Downloadable Applications: Time Taken by the TryInitialize and Finalize Functions (Automatic Redundancy Enabled)

| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryInitialize | No file | 1.256 ms | 0.810 ms | 3.412 ms |
| TryInitialize | Overwrite file of the same size | 1.535 ms | 0.662 ms | 8.886 ms |
| Finalize | No file | 0.396 ms | 0.260 ms | 2.002 ms |
| Finalize | Overwrite file of the same size | 0.385 ms | 0.260 ms | 2.033 ms |
Apart from some worst-case measurements, which were dramatically slower when a file was overwritten rather than created, the results showed almost no difference between the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.5.3. Mounting
The tables below show the time required, as measured on a test unit, for the FormatSaveData, MountSaveData, and Unmount functions. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| FormatSaveData | No automatic redundancy (right after importing) | 430.314 ms | 360.571 ms | 490.037 ms |
| FormatSaveData | No automatic redundancy | 436.077 ms | 428.289 ms | 446.790 ms |
| FormatSaveData | Automatic redundancy enabled (right after importing) | 206.321 ms | 181.559 ms | 253.866 ms |
| FormatSaveData | Automatic redundancy enabled | 197.499 ms | 180.707 ms | 215.180 ms |
| MountSaveData | No automatic redundancy | 24.079 ms | 22.886 ms | 26.050 ms |
| MountSaveData | Automatic redundancy enabled | 24.674 ms | 22.770 ms | 25.738 ms |
| Unmount | No automatic redundancy | 1.694 ms | 1.298 ms | 3.496 ms |
| Unmount | Automatic redundancy enabled | 1.338 ms | 0.895 ms | 2.622 ms |
For these measurements, the state immediately after importing was reproduced by using SaveDataFiler to delete the backup region.
8.7.5.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryDeleteFile | No automatic redundancy | 171.093 ms | 55.857 ms | 299.145 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.546 ms | 0.712 ms | 4.581 ms |
8.7.6. Comparison of CTR Flash Cards and CARD2
The following sections show how CARD2 performance compares with the CTR flash card.
8.7.6.1. ROM Archive
ROM archive performance is approximately equal for CTR flash cards and CARD2.
8.7.6.2. Save Data
No Automatic Redundancy
CARD2 file read performance is up to approximately 5 times faster than the CTR flash card. File write performance is up to approximately 30 times faster.
Automatic Redundancy Enabled
CARD2 file read performance is up to approximately 5 times faster than the CTR flash card. File write performance is up to approximately 33 times faster.
8.3. Hardware Configuration
This section describes the configuration of the main hardware related to graphics.
When an application runs as a standard application on SNAKE, the hardware configuration is identical to CTR.
Hardware Configuration for Extended Applications
When an application runs as an extended application on SNAKE, the CPU performance and main memory change as illustrated below.
8.3.1. CPU
The CPU generates graphics commands and conveys instructions to the GPU. One of the CPU cores is used exclusively by applications as the application core, while the remaining core is used as the system core. The CPU is connected to the GPU, main memory, and other modules via the main bus.
When an application runs as an extended application on SNAKE, the CPU clock rate increases to 804 MHz and the 2 MB L2 cache shared by all CPU cores becomes available.
8.3.2. Main Memory
Many kinds of data can be placed in main memory. A maximum of 64 MB of main memory can be used by the application. Main memory is connected to the main bus.
When an application runs as an extended application on SNAKE, up to 124 MB of memory is available for use by the application.
8.3.3. DMA
DMA is used by command request DMA transfer commands, for example to transfer texture and vertex data from main memory to VRAM.
8.3.4. Main Bus
This bus connects the CPU, the GPU, main memory, and other devices. Many kinds of data are exchanged primarily via this bus. To adjust the priority of modules that use the main bus, use the nngxSetMemAccessPrioMode function.
8.3.5. GPU
The GPU comprises the P3D, PPF, PSC, PDC, VRAM A, and VRAM B modules, and the internal bus that connects them.
| Module | Description |
| --- | --- |
| P3D | References the 3D commands accumulated in the command buffer and performs the actual rendering. |
| PPF | Performs block-to-linear conversion based on command request post-transfer commands and transfers data. |
| PSC | Performs memory fill commands (clearing) for command requests. |
| PDC | Transfers the content of the display buffer to the LCD. |
| VRAM A/VRAM B | Memory where vertices and textures are placed. VRAM A and VRAM B can operate separately. |
| GPU internal bus | The bus that connects the modules in the GPU. It is connected to the main bus. |
Note:
The GPU internal bus has around twice the bandwidth of the main bus. For data that the GPU accesses frequently, you can boost access speed by placing the data in VRAM, which also takes load off the main bus.
When data is exchanged between the GPU internal bus and the main bus, it is the main bus that is the rate-limiting factor.
8.3.6. LCD
This refers to the two 3DS LCD screens. They are connected to the PDC module within the GPU. When images are being displayed, the PDC gets one line of data from the display buffer for each scan line on the LCD. When the display buffer is located in main memory, the fetching of data places a periodic load on the main bus.
8.4. GPU Profiling Feature
You can use the GPU profiling feature to measure processing in each hardware module in the GPU. The following table lists what information can be obtained using this feature.
| Information | Description |
| --- | --- |
| Busy clock | You can get the number of busy clock cycles that occur as a result of the vertex shader and fragment lighting in the various GPU modules. |
| Shader execution clock | You can get the number of execution clock cycles and stall clock cycles for each vertex shader processor. |
| Number of vertices entered in vertex cache | You can get the number of vertices entered into the post vertex cache. By taking the difference from the number of vertices actually used for rendering, you can determine the effective number of vertices in the post vertex cache. |
| Number of input/output polygons | You can get the number of polygons input for triangle setup and the number of polygons output after clipping. |
| Number of input fragments | You can get the number of fragments input to the per-fragment operation module. |
| Number of accesses to memory | You can get the number of times memory is accessed by each GPU hardware module. |
The busy clock obtained for each hardware module includes not only the busy clock cycles spent actually processing data, but also the time the module spends waiting, unable to output data because later modules are busy. For this reason, it is effective to optimize not only the modules with particularly high busy-clock values, but also the final-stage module among those with high busy values, because its backlog propagates upstream.
Note:
For more information on the profiling feature, see the CTR Programming Manual: Advanced Graphics.
8.5. Notes for Using the DMPGL
The following sections provide notes about performance when using the DMPGL.
8.5.1. Maintaining Internal State Consistency
The internal state refers to local data that is saved by the DMPGL driver; it could also be called a mirror of the hardware settings. Because DMPGL function calls update the internal state and commands configure the hardware settings, the DMPGL driver’s internal state may become inconsistent with the hardware settings when commands are issued directly or when command caches are used.
If there is a discrepancy between the internal state and the hardware settings, you can preserve consistency by forcing the hardware settings to be validated (by issuing complete command packets). Validation actually occurs after nngxUpdateState(NN_GX_STATE_ALL) is run.
This issues commands only for hardware settings that have been changed by the command cache or by commands that were issued directly, rather than for all settings, allowing you to trim the process of issuing commands. However, applications must carefully keep track of which states correspond to and which states are dependent on the commands that have been used.
When nngxSetCommandGenerationMode(NN_GX_CMDGEN_MODE_UNCONDITIONAL) is called, commands are issued regardless of the comparison results for the internal state. Only the following settings are affected by this mode.
- Uniform settings for the reserved fragment shader.
- Integer uniform settings for the vertex shader.
- LUT data settings.
- DMPGL functions associated with NN_GX_STATE_OTHERS.
Related Functions:
nngxUpdateState, nngxValidateState, glDrawElements, glDrawArrays, nngxSetCommandGenerationMode
8.5.2. Removing the Use of glGetUniformLocation (Important)
One way to configure a shader uniform using DMPGL involves using the glGetUniformLocation function to get its location. Because it performs processor-intensive operations such as string comparison, heavy use of glGetUniformLocation is not recommended.
To get Location values for the fixed fragment shader, use the constants defined for that purpose. The locations of each uniform of a program object are guaranteed not to change until the program object is either destroyed by the glDeleteProgram function or relinked by the glLinkProgram function.
Note:
Macros for the Location values are defined in $CTR_SDK/include/nn/gx/CTR/gx_UniformLocationForFragmentShader.h.
Related Functions:
glGetUniformLocation, glDeleteProgram, glLinkProgram
8.5.3. Cost of Switching Programs With glUseProgram
When glUseProgram is called, a shader binary must be loaded during validation if the new program object and the old program object link to different shader binaries.
Rendering while frequently switching between program objects that link to separate shader binaries is therefore not recommended. As an effective alternative, consider using conditional branch instructions so that multiple program objects can share a single vertex shader binary, or adjust the render order to minimize the number of calls to glUseProgram.
Related Functions:
glUseProgram
8.5.4. Cost of Calling glUseProgram(0)
Calling glUseProgram(0) causes all validation flags to be set the next time glUseProgram is called, making shared shader binaries ineffective. DMPGL is designed to monitor differential updates, so there is no need to call glUseProgram(0) explicitly.
Related Functions:
glUseProgram
8.5.5. Notes About the transpose Parameter of glUniformMatrix
Matrices are treated as column-major in OpenGL ES. The transpose parameter of the glUniformMatrix functions, which indicates whether a matrix is transposed before it is loaded, can only be set to GL_FALSE. In contrast, DMPGL treats matrices as row-major within the graphics driver and allows GL_TRUE to be specified as the transpose parameter of the glUniformMatrix functions.
Due to these internal differences between DMPGL and OpenGL ES, matrices are implicitly transposed if GL_FALSE is specified in the transpose parameter of the glUniformMatrix functions.
Matrices generated using the MATH library of the CTR-SDK, on the other hand, are row-major just like the DMPGL driver. By specifying GL_TRUE in the transpose parameter of the glUniformMatrix functions, you can set such matrices without transposing them.
Related Functions:
glUniformMatrix
8.5.6. Using Textures in the Native PICA Format
The current CTR-SDK specifications support textures in standard OpenGL format. Using the standard format requires that textures be converted to the native format when glTexImage2D is called (this conversion is done automatically). The conversion can be omitted by maintaining textures in the native format instead.
Related Functions:
glTexImage2D
8.5.7. Setting Uniforms for Vertex Shaders
Overwriting the uniforms in a vertex shader is more efficient if the registers being overwritten are contiguous.
- Case 1: Overwrite all registers ([c0...c15]).
- Case 2: Overwrite non-contiguous blocks of registers ([c0...c3], [c8...c11], [c16...c19], [c24...c27]).
In both cases, the number of registers being overwritten is the same, but Case 1 is more efficient because it allows transfer commands to be consolidated.
Related Functions:
glUniform
8.5.8. Updating Buffers
The glBufferData and glTexImage2D functions, which rewrite vertex and texture data, access data in main memory. Accessing this data requires the CPU to apply the changes to the GPU. Because this process puts a relatively large overhead on the system, the overall processor load increases in proportion to the number of calls to such functions.
We recommend that you call these functions in advance when the application is being initialized, load vertex buffer and texture data early, and avoid loading data during per-frame operations whenever possible.
The same type of overhead is involved in using the glBufferSubData function to partially update vertex data. To reduce this overhead, consider gathering all the data required for a partial update into a single chunk and applying it with a single call to glBufferSubData.
Related Functions:
glBufferData, glTexImage2D, glCompressedTexImage2D, glTexImage1D, glBufferSubData, glTexSubImage1D
8.5.9. Validation
With DMPGL, configuration changes made by calls to the GL API are applied to the hardware by an operation called validation, which writes the actual commands into the command buffer. Validation occurs when the following functions are called.
- glDrawArrays
- glDrawElements
- nngxValidateState
Each setting is divided into several categories, and a state update flag is set for each category that is updated. During validation, these flags are used to determine which categories to update; updates are applied one category at a time. The processing load of glDrawArrays and glDrawElements increases with the number of categories to update.
If you use glUseProgram to switch programs, state update flags are set only for the categories that differ from the previous program. The functions associated with each category are shown below. If a function such as glUniform or glEnable changes a parameter belonging to a particular category, that category's update flag is set.
Framebuffers:
- glBindFramebuffer
- glBindRenderbuffer
- glDeleteFramebuffers
- glDeleteRenderbuffers
- glFramebufferRenderbuffer
- glFramebufferTexture2D
- glRenderbufferStorage
- glReadPixels
- glClear

Vertex buffers:
- glBindBuffer
- glBufferData
- glBufferSubData
- glDeleteBuffers

Triangles:
- glEnable
- glDisable
- glUseProgram
- glDepthRangef
- glPolygonOffset

Lighting LUTs:
- glUseProgram
- glUniform* (dmp_LightEnv.lutEnabledSP, dmp_LightEnv.lutEnabledD0, dmp_LightEnv.lutEnabledD1, dmp_LightEnv.fresnelSelector, dmp_LightEnv.lutEnabledRefl, dmp_FragmentMaterial.sampler{D0,D1,SP,FR,RB,RG,RR}, dmp_FragmentLighting.enabled)
- glUniformsDMP
- glRestoreProgramsDMP

Fog LUTs:
- glUseProgram
- glUniform* (dmp_Fog.mode (GL_FOG or GL_GAS_DMP), dmp_Fog.sampler)
- glUniformsDMP
- glRestoreProgramsDMP

Procedural texture LUTs:
- glUseProgram
- glUniform* (dmp_Texture[3].ptSampler{Rgb,Alpha,Noise,R,G,B,A}, dmp_Texture[3].ptNoiseEnable, dmp_Texture[3].ptAlphaSeparate, dmp_Texture[3].samplerType)
- glUniformsDMP
- glRestoreProgramsDMP

Vertex arrays:
- glBindBuffer
- glEnableVertexAttribArray
- glDisableVertexAttribArray
- glVertexAttribPointer

Current vertices:
- glBindBuffer

Framebuffer access:
- glEnable
- glDisable
- glDepthFunc
- glEarlyDepthFuncDMP
- glColorMask
- glDepthMask
- glStencilMask
- glUseProgram
- glUniform* (dmp_FragOperation.mode)
- glUniformsDMP
- glRestoreProgramsDMP

Scissor/viewport:
- glEnable
- glDisable
- glScissor
- glViewport

Texture 0:
- glUseProgram
- glUniform* (dmp_Texture[0].samplerType)
- glUniformsDMP
- glRestoreProgramsDMP
- glBindTexture
- glDeleteTextures
- glCompressedTexImage2D
- glCopyTexImage2D
- glCopyTexSubImage2D
- glTexImage2D
- glTexParameteriv
- glTexParameterfv

Texture 1:
- Same as Texture 0, with glUniform* (dmp_Texture[1].samplerType)

Texture 2:
- Same as Texture 0, with glUniform* (dmp_Texture[2].samplerType)

Texture 3:
- glUseProgram
- glRestoreProgramsDMP
- glBindTexture
- glDeleteTextures
- glCompressedTexImage2D
- glCopyTexImage2D
- glCopyTexSubImage2D
- glTexImage2D
- glTexParameteriv
- glTexParameterfv

Texture LUTs:
- glBindTexture
- glDeleteTextures
- glTexImage1D
- glTexSubImage1D

Program:
- glUseProgram

Shader uniforms:
- glUseProgram
- glUniform*

Vertex shaders and uniforms for geometry shaders:
- glUniformsDMP

Rasterization:
- glDrawArrays
- glDrawElements
- glUniform*

Uniforms for reserved fragment shaders:
- glUniformsDMP
- glRestoreProgramsDMP

Shader binaries:
- glUseProgram

Vertex shader binaries:
- glUseProgram

Geometry shader binaries:
- glUseProgram

Geometry shader attachments:
- glUseProgram

Geometry shader detachments:
- glUseProgram

Gas LUTs:
- glUseProgram
- glUniform* (dmp_Fog.mode (GL_GAS_DMP), dmp_Gas.sampler{TR, TG, TB})
- glUniformsDMP
8.5.10. Functions That Issue Command Requests
The GPU begins executing commands when nngxRunCmdlist is called. Rendering instructions are processed one command request at a time, as accumulated in the command list.
The following functions issue command requests to a command list.
| Function | Condition | Command Request Added |
| --- | --- | --- |
| nngxSplitDrawCmdlist | Always | Render command request |
| nngxTransferRenderImage | Always | Render command request; Post transfer command request |
| glClear | Always | Render command request; Memory fill command request |
| glBufferData | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request |
| glBufferSubData | When glBufferData meets the condition above | DMA transfer command request |
| glTexImage2D | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request |
| glCopyTexImage2D | Always | Render command request; Copy texture command request |
| glCopyTexSubImage2D | Always | Render command request; Copy texture command request |
| glRestoreTextureCollectionsDMP, glRestoreVertexStateCollectionsDMP | When conditions are met by the functions that generate the commands to be restored (glBufferData, glBufferSubData, glTexImage2D) | DMA transfer command request |
| glDrawArrays, glDrawElements | When the reserved uniform dmp_Gas.autoAcc is GL_TRUE and the function is called for the first time after the reserved uniform dmp_FragOperation.mode has its value changed from GL_FRAGOP_MODE_GAS_ACC_DMP | Render command request |

Note 1: Not added when nngxSplitDrawCmdlist is called in advance.
Note 2: One added per color and depth (stencil) buffer. Two are added when both are specified.
8.6. Functions That Cause System Blocking
If you measure performance using the CPU profiler, a large proportion of the measurements may be made while the system is blocking (SYSTEM_WAITING). System blocking indicates that profiling is being performed on multiple threads and none of those threads is active: a system call from some thread is blocking, but the profiler cannot determine which thread it is.
You can use the CPU profiler's “select thread” and “reverse call tree” features to trace processing that is counted as system blocking.
Even when the profiler is tracing the active thread, time is counted as system blocking when the profiler cannot identify the last active thread or function. You can eliminate this by using the select-thread feature to limit the threads that are profiled. If you then find a function that had been counted as system blocking, you can use the reverse call tree to find out where in the application that function is called and trace it.
Most CTR-SDK functions have the latent potential to cause the system to block. If possible, Nintendo recommends using the CPU profiler to check performance at crucial points in your implementation.
8.7. FS Library Performance Details
This section provides reference data on access rates for each archive type when the CTR-SDK FS library is used. Specifically, it presents file system performance measurements for ROM archives, save data (backup region), and extended save data (SD Card) using the FS library from the CTR-SDK 3.x series or later. For ROM archives and save data, measurements presume the use of card applications (Nintendo 3DS Game Cards) or downloadable applications (SD Card).
Although development cards can be set to either “Fast” or “Slow” for card-based application ROM archives, almost no retail devices perform at the “Slow” level. ROM archive performance does, however, worsen with repeated access. For example, when reading a 512 KB file, “Slow”-like performance appears only momentarily at first (for about one or two internal command cycles), but it gradually becomes more common as the degradation progresses.
Because performance of 3DS Game Cards varies depending on the memory chips inside, use the data provided here as a source of reference. Also note that the performance of CARD2 varies slightly depending on the size of the save region.
To measure the performance of SD Cards, we used media formatted with a formatter that complies with the SD File System Specification. Even for media of the same class, performance may differ depending on the media capacity and manufacturer. Because SD card performance varies depending on the media and the individual unit, use the data provided here as a source of reference.
Also note that the data for CTR-SDK 3.x series and later versions is treated the same because the results were almost identical.
The following environment was used for the measurements.
- Release build
- Wait simulation function OFF
- Development card speed setting “Fast”
- Normal access priority (PRIORITY_APP_NORMAL)
Note:
If you use the access priority feature to access the file system with real-time priority, design your performance based on the data in the documentation bundled with the CTR-SDK, rather than by relying on actual measurements.
For more information about access priority, see the CTR-SDK documentation.
Note:
The performance of ROM archives under emulation in the PARTNER-CTR Debugger has been tuned to nearly match the listed figures (performance of development cards).
The performance of retail cards is almost the same as the performance of development cards.
8.7.1. ROM Archive
The benchmark data presented in this section is reference data that resulted from measuring access to ROM archives on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.1.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times.
To prevent cached data from skewing the results, we attempted to read only unique files whenever possible. The blue line is the CTR flash card and the red line is CARD2.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 0.045 ms | 0.021 ms | 0.164 ms |
| Finalize | 0.006 ms | 0.002 ms | 0.024 ms |
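The average, best, and worst figures in tables like the one above can be produced by timing each of the 50 iterations and summarizing the samples. The following is a minimal, generic sketch in standard C++; BenchStats and Summarize are illustrative names, not CTR-SDK API, and the per-iteration timings (for example, of one TryInitialize → TryRead → Finalize pass) are assumed to be collected by the caller.

```cpp
#include <algorithm>
#include <vector>

// Summarizes per-iteration timings (in milliseconds) the way the tables
// in this chapter do: average, best (minimum), and worst (maximum).
struct BenchStats {
    double average;
    double best;
    double worst;
};

// Computes average/best/worst over one benchmark run.
// samplesMs must be non-empty; each entry is the time of one iteration.
inline BenchStats Summarize(const std::vector<double>& samplesMs) {
    BenchStats s{0.0, samplesMs.front(), samplesMs.front()};
    for (double v : samplesMs) {
        s.average += v;
        s.best = std::min(s.best, v);
        s.worst = std::max(s.worst, v);
    }
    s.average /= static_cast<double>(samplesMs.size());
    return s;
}
```

In a real measurement, each sample would be taken with the system tick timer around the operation being profiled.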
8.7.1.2. Stream Reads
The figures below show the time required by the TryRead function to read files having the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function.
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache. The blue line is the CTR flash card and the red line is CARD2.
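The stream-read pattern measured here, repeated TryRead calls with a fixed-size buffer, can be sketched with standard C++ file I/O. std::ifstream stands in for the CTR-SDK file stream purely for illustration, and StreamRead is a hypothetical helper, not SDK API.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads a file from start to end using a fixed-size buffer: the access
// pattern of the stream-read benchmark. Each loop iteration models one
// TryRead call requesting bufferSize bytes. Returns total bytes read.
inline std::size_t StreamRead(const std::string& path, std::size_t bufferSize) {
    std::ifstream file(path, std::ios::binary);
    std::vector<char> buffer(bufferSize);
    std::size_t total = 0;
    // A short final read leaves the stream failed but with gcount() > 0,
    // so the partial chunk is still counted before the loop exits.
    while (file.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
           file.gcount() > 0) {
        total += static_cast<std::size_t>(file.gcount());
    }
    return total;
}
```

Varying bufferSize in such a loop, while timing each call, reproduces the shape of the buffer-size sweep shown in the figures.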
8.7.2. Save Data
The benchmark data presented in this section is reference data that resulted from measuring access to the backup region on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.2.1. Read
The figures below show the results for batch-reading save data of the specified size, written on a development card or CARD2 that has 512 KB of backup memory, 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
The backup memory set for CARD2 was 256 MB. CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different, but for larger sizes, caching may have had an effect. Note that even with automatic redundancy enabled, caching had little effect for files of 16 KB or more; the effect of caching depends on factors such as whether a file is read immediately after writing and whether the same file is read more than once.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No automatic redundancy | 1.276 ms | 0.708 ms | 7.194 ms |
| TryInitialize | Automatic redundancy enabled | 1.104 ms | 0.715 ms | 7.031 ms |
| Finalize | No automatic redundancy | 0.480 ms | 0.345 ms | 1.910 ms |
| Finalize | Automatic redundancy enabled | 0.395 ms | 0.277 ms | 2.034 ms |
8.7.2.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data of the specified size 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
The backup memory set for CARD2 was 256 MB.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 350 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 620.133 ms | 478.054 ms | 926.344 ms |
| TryInitialize | Overwrite file of the same size | 1.269 ms | 0.709 ms | 3.889 ms |
| Finalize | No file | 0.489 ms | 0.337 ms | 2.084 ms |
| Finalize | Overwrite file of the same size | 0.483 ms | 0.337 ms | 2.203 ms |
The TryInitialize operation takes significantly more time when creating a new file than when overwriting an existing file.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
The benchmark data shows that whether a file is created or overwritten makes no major difference in processing time, other than the CommitSaveData operations being around 100 ms faster when data of the same size is overwritten.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 1.267 ms | 0.813 ms | 3.365 ms |
| TryInitialize | Overwrite file of the same size | 1.997 ms | 0.667 ms | 11.511 ms |
| Finalize | No file | 0.368 ms | 0.259 ms | 1.807 ms |
| Finalize | Overwrite file of the same size | 0.384 ms | 0.260 ms | 1.713 ms |
Some operations were dramatically slower when a file was overwritten rather than created. Otherwise, the results showed almost no difference between the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
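One way to act on this recommendation is to estimate the write time from a measured transfer rate and check it against the time available before shutdown. This is an illustrative sketch only; EstimateWriteMs and FitsAutoSaveBudget are hypothetical helpers, and any rate or budget values are assumptions to be replaced with your own measurements.

```cpp
// Estimated time (ms) to write `bytes` of save data at a measured
// transfer rate of `rateKBPerSec` (in KB/s, from your own benchmarks).
inline double EstimateWriteMs(double bytes, double rateKBPerSec) {
    return bytes / 1024.0 / rateKBPerSec * 1000.0;
}

// True if a save of `bytes` is expected to finish within `budgetMs`,
// the time the application has to flush data before power-off.
inline bool FitsAutoSaveBudget(double bytes, double rateKBPerSec, double budgetMs) {
    return EstimateWriteMs(bytes, rateKBPerSec) <= budgetMs;
}
```

For example, at an assumed rate of 512 KB/s, writing 512 KB takes about one second, so a save that size would not fit a 200 ms budget.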
8.7.2.3. Mounting
The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the development card was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| FormatSaveData | No automatic redundancy (factory state) | 2857.726 ms | 2843.064 ms | 2872.938 ms |
| FormatSaveData | No automatic redundancy | 8535.636 ms | 8339.167 ms | 8661.502 ms |
| FormatSaveData | Automatic redundancy enabled (factory state) | 2476.430 ms | 2462.935 ms | 2495.924 ms |
| FormatSaveData | Automatic redundancy enabled | 8186.308 ms | 8015.401 ms | 8278.390 ms |
| MountSaveData | No automatic redundancy | 26.955 ms | 25.682 ms | 29.150 ms |
| MountSaveData | Automatic redundancy enabled | 30.495 ms | 28.150 ms | 33.798 ms |
| Unmount | No automatic redundancy | 1.476 ms | 1.081 ms | 3.318 ms |
| Unmount | Automatic redundancy enabled | 1.124 ms | 0.873 ms | 2.324 ms |
It takes much more processing time to format a backup region when data has been written to it than when the backup region is in factory condition. The time it takes to format the region also depends on the amount of data written.
8.7.2.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size of the development card was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryDeleteFile | No automatic redundancy | 658.566 ms | 528.313 ms | 946.492 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.465 ms | 0.705 ms | 4.004 ms |
8.7.3. Extended Save Data
This section shows benchmark data for accessing the extended save data region on media having the same specifications and capacity as the SD Card packaged with the CTR system. Use this data as a reference. The benchmark data provided was collected using a sample device with its sole purpose being measurement testing, so there is no guarantee that processing will complete within the times presented.
Note:
The media used for measurements exhibited a 2.5% drop in performance when reading and a 5.2% drop when writing for the transfer rate of a 505 KB file versus that of a 504 KB file. However, the degree of degradation of performance and the file size at which performance degrades might differ depending on the media’s capacity, manufacturer, or other factors, even for media of the same class. Because there are variations in performance depending on the media being used, be sure to treat the data listed here as a source of reference.
8.7.3.1. Mounting
The tables below show the time required, as measured on a test unit, to execute the MountExtSaveData function for mounting extended save data, the DeleteExtSaveData function for deleting it, the CreateExtSaveData function for creating it, the MountExtSaveData function for remounting it, and the Unmount function for unmounting it.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| MountExtSaveData | 40.844 ms | 39.278 ms | 42.136 ms |
| DeleteExtSaveData | 105.296 ms | 77.204 ms | 165.070 ms |
| CreateExtSaveData | 2346.382 ms | 2055.237 ms | 2468.994 ms |
| MountExtSaveData (remount) | 38.256 ms | 36.832 ms | 40.576 ms |
| Unmount | 1.842 ms | 1.619 ms | 2.319 ms |
8.7.3.2. Creating Files
The tables below show the time required, as measured on a test unit, for the TryCreateFile function to execute.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryCreateFile | 766.200 ms | 363.559 ms | 2125.721 ms |
8.7.3.3. Loading Files
The figures below show the results for batch-reading extended save data files 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function’s processing was measured on a test unit.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For capacities up to 8 MB, every read operation was to a different, unique file. For 16 MB, the testing involved 30 files, and for 32 MB, 16 files.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 14.098 ms | 11.246 ms | 21.259 ms |
| Finalize | 1.433 ms | 1.050 ms | 3.150 ms |
8.7.3.4. Writing Files
The figures below show the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing files to extended save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times.
To prevent cached data from skewing the results, the number of files used for writing depended on the media capacity. For capacities up to 8 MB, every write operation was to a different, unique file. For 16 MB, the testing used 30 files, and for 32 MB, 16 files.
The tables below show the time taken by the TryInitialize function and the Finalize function.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 15.406 ms | 12.528 ms | 23.478 ms |
| Finalize | 1.451 ms | 1.040 ms | 3.259 ms |
8.7.3.5. Deleting Files
8.7.4. ROM Archives (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to the ROM archives of downloadable applications on media with the same specifications and capacity as the SD Card bundled with the CTR system. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.4.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to read only unique files whenever possible.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 0.046 ms | 0.022 ms | 0.220 ms |
| Finalize | 0.006 ms | 0.002 ms | 0.023 ms |
8.7.4.2. Stream Reads
The figures below show the time required by the TryRead function to read files having the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache.
8.7.5. Save Data (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to a downloadable application's backup region on media with the same specifications and capacity as the SD Card bundled with the CTR system. Use this data as a reference. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.5.1. Read
The figures below show the results for batch-reading save data of the specified size, written to a downloadable application’s backup region set to 512 KB, 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function to execute was measured on a test unit. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different, but for larger sizes, caching may have had an impact. Also note that the effect of caching changes when reading immediately after writing or when reading the same file more than once.
The tables below show the time taken by operations other than those of the TryRead function.
Table 8-17. Downloadable Applications Time Required for Operations Other Than the TryRead Function (Save Data)
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No automatic redundancy | 1.259 ms | 0.715 ms | 7.613 ms |
| TryInitialize | Automatic redundancy enabled | 1.242 ms | 0.714 ms | 5.633 ms |
| Finalize | No automatic redundancy | 0.492 ms | 0.349 ms | 2.099 ms |
| Finalize | Automatic redundancy enabled | 0.398 ms | 0.278 ms | 2.015 ms |
8.7.5.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. The downloadable application backup memory size was 512 KB. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 200 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-18. Downloadable Applications Time Taken by the TryInitialize and Finalize Functions (No Automatic Redundancy)
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 198.837 ms | 55.597 ms | 304.459 ms |
| TryInitialize | Overwrite file of the same size | 1.232 ms | 0.720 ms | 4.068 ms |
| Finalize | No file | 0.467 ms | 0.333 ms | 2.152 ms |
| Finalize | Overwrite file of the same size | 0.466 ms | 0.337 ms | 2.134 ms |
The TryInitialize function takes more time when a file is created than when an existing file is overwritten.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-19. Downloadable Applications Time Taken by the TryInitialize and Finalize Functions (Automatic Redundancy Enabled)
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 1.256 ms | 0.810 ms | 3.412 ms |
| TryInitialize | Overwrite file of the same size | 1.535 ms | 0.662 ms | 8.886 ms |
| Finalize | No file | 0.396 ms | 0.260 ms | 2.002 ms |
| Finalize | Overwrite file of the same size | 0.385 ms | 0.260 ms | 2.033 ms |
Some operations were dramatically slower when a file was overwritten rather than created. Otherwise, the results showed almost no difference between the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.5.3. Mounting
The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| FormatSaveData | No automatic redundancy (right after importing) | 430.314 ms | 360.571 ms | 490.037 ms |
| FormatSaveData | No automatic redundancy | 436.077 ms | 428.289 ms | 446.790 ms |
| FormatSaveData | Automatic redundancy enabled (right after importing) | 206.321 ms | 181.559 ms | 253.866 ms |
| FormatSaveData | Automatic redundancy enabled | 197.499 ms | 180.707 ms | 215.180 ms |
| MountSaveData | No automatic redundancy | 24.079 ms | 22.886 ms | 26.050 ms |
| MountSaveData | Automatic redundancy enabled | 24.674 ms | 22.770 ms | 25.738 ms |
| Unmount | No automatic redundancy | 1.694 ms | 1.298 ms | 3.496 ms |
| Unmount | Automatic redundancy enabled | 1.338 ms | 0.895 ms | 2.622 ms |
For the measurement, the state immediately after importing was reproduced by deleting the backup region using SaveDataFiler.
8.7.5.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryDeleteFile | No automatic redundancy | 171.093 ms | 55.857 ms | 299.145 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.546 ms | 0.712 ms | 4.581 ms |
8.7.6. Comparison of CTR Flash Cards and CARD2
The following sections compare CARD2 performance to that of the CTR flash card.
8.7.6.1. ROM Archive
ROM archive performance is approximately equal between CTR flash cards and CARD2.
8.7.6.2. Save Data
No Automatic Redundancy
CARD2 file read performance increased by a factor of up to approximately 5.
File write performance increased by a factor of up to approximately 30.
Automatic Redundancy Enabled
CARD2 file read performance increased by a factor of up to approximately 5.
File write performance increased by a factor of up to approximately 33.
8.3.1. CPU
The CPU generates graphics commands and conveys instructions to the GPU. One of the CPU cores is used exclusively by applications as the application core, while the remaining core is used as the system core. The CPU is connected to the GPU, main memory, and other modules via the main bus.
When an application runs as an extended application on SNAKE, the CPU clock rate increases to 804 MHz and the 2 MB L2 cache shared by all CPU cores becomes available.
8.3.2. Main Memory
Many kinds of data can be placed in main memory. A maximum of 64 MB of main memory can be used by the application. Main memory is connected to the main bus.
When an application runs as an extended application on SNAKE, up to 124 MB of memory is available for use by the application.
8.3.3. DMA
The DMA module is used by command request DMA transfer commands. Its uses include transferring texture and vertex data from main memory to VRAM.
8.3.4. Main Bus
This bus connects the CPU, the GPU, main memory, and other devices. Many kinds of data are exchanged primarily via this bus. To adjust the priority of modules that use the main bus, use the nngxSetMemAccessPrioMode function.
8.3.5. GPU
The GPU comprises the P3D, PPF, PSC, PDC, VRAM A, and VRAM B modules, and the internal bus that connects those modules.

| Module | Description |
|---|---|
| P3D | This module references the 3D commands accumulated in the command buffer and performs the actual rendering. |
| PPF | This module performs block-to-linear conversion based on command request post-transfer commands and transfers data. |
| PSC | This module performs memory fill commands (clearing) for command requests. |
| PDC | This module transfers the content of the display buffer to the LCD. |
| VRAM A/VRAM B | Memory where vertices and textures are placed. VRAM A and VRAM B can operate separately. |
| GPU internal bus | The bus that connects the modules in the GPU. It is connected to the main bus. |
Note:
The GPU internal bus is around twice the bandwidth of the main bus. For data that is frequently accessed in the GPU, you can boost the access speed by placing the data in VRAM, which also lifts the load off the main bus.
When data is exchanged between the GPU internal bus and the main bus, it is the main bus that is the rate-limiting factor.
8.3.6. LCD
This refers to the two 3DS LCD screens. They are connected to the PDC module within the GPU. When images are being displayed, the PDC gets one line of data from the display buffer for each scan line on the LCD. When the display buffer is located in main memory, the fetching of data places a periodic load on the main bus.
8.4. GPU Profiling
You can use the GPU profiling feature to measure processing in each hardware module in the GPU. The following table lists what information can be obtained using this feature.
| Information | Description |
|---|---|
| Busy clock | You can get the number of busy clock cycles that occur as a result of the vertex shader and fragment lighting in the various GPU modules. |
| Shader execution clock | You can get the number of execution clock cycles and stall clock cycles for each vertex shader processor. |
| Number of vertices entered in vertex cache | You can get the number of vertices entered into the post-vertex cache. By taking the difference from the number of vertices actually used for rendering, you can determine the effective number of vertices in the post-vertex cache. |
| Number of input/output polygons | You can get the number of polygons input for triangle setup and the number of polygons output after clipping. |
| Number of input fragments | You can get the number of fragments input to the per-fragment operation module. |
| Number of accesses to memory | You can get the number of times memory is accessed by each GPU hardware module. |
The busy clock count obtained for each hardware module includes not only the cycles spent actually processing data, but also the time the module spends waiting, unable to output data because later modules are busy. It is therefore effective to optimize not only the modules with particularly high busy clock values, but also the final-stage module among those with high busy values.
For more information on the profiling feature, see the CTR Programming Manual: Advanced Graphics.
8.5. Notes for Using the DMPGL
The following sections provide notes about performance when using the DMPGL.
8.5.1. Maintaining Internal State Consistency
The internal state refers to local data that is saved by the DMPGL driver; it could also be called a mirror of the hardware settings. Because DMPGL function calls update the internal state and commands configure the hardware settings, the DMPGL driver’s internal state may become inconsistent with the hardware settings when commands are issued directly or when command caches are used.
If there is a discrepancy between the internal state and the hardware settings, you can preserve consistency by forcing the hardware settings to be validated (by issuing complete command packets). Validation actually occurs after nngxUpdateState(NN_GX_STATE_ALL) is run.
This issues commands only for hardware settings that have been changed by the command cache or by commands that were issued directly, rather than for all settings, allowing you to trim the process of issuing commands. However, applications must carefully keep track of which states correspond to, and which states depend on, the commands that have been used.
When nngxSetCommandGenerationMode(NN_GX_CMDGEN_MODE_UNCONDITIONAL) is called, commands are issued regardless of the results of comparison against the internal state. Only the following settings are affected by this mode.
- Uniform settings for the reserved fragment shader.
- Integer uniform settings for the vertex shader.
- LUT data settings.
- DMPGL functions associated with NN_GX_STATE_OTHERS.
Related Functions:
nngxUpdateState, nngxValidateState, glDrawElements, glDrawArrays, nngxSetCommandGenerationMode
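The mechanism described above can be modeled as a dirty-flag mirror of the hardware state. The sketch below is a simplified, hypothetical Python model, not the SDK API (the class and method names are illustrative): an update that issues commands only for settings that differ from the mirror, plus an unconditional mode that bypasses the comparison, analogous to NN_GX_CMDGEN_MODE_UNCONDITIONAL.

```python
class StateMirror:
    """Simplified model of the driver-side mirror of hardware settings."""

    def __init__(self):
        self.mirror = {}            # driver's copy of the hardware state
        self.unconditional = False  # models NN_GX_CMDGEN_MODE_UNCONDITIONAL

    def set_state(self, name, value):
        """Issue a command only if the value differs from the mirror,
        unless unconditional mode forces the command out anyway."""
        issued = []
        if self.unconditional or self.mirror.get(name) != value:
            issued.append((name, value))  # stands in for a command packet
            self.mirror[name] = value
        return issued

mirror = StateMirror()
assert mirror.set_state("depth_func", "LESS") == [("depth_func", "LESS")]
assert mirror.set_state("depth_func", "LESS") == []   # unchanged: no command
mirror.unconditional = True
assert mirror.set_state("depth_func", "LESS") != []   # forced out regardless
```

If commands are issued behind the mirror's back (directly or from a command cache), the mirror's comparison skips commands the hardware actually needs, which is why a forced full validation is required to resynchronize.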
8.5.2. Removing the Use of glGetUniformLocation (Important)
One way to configure a shader uniform with DMPGL is to get its location with the glGetUniformLocation function. Because this function performs processor-intensive operations such as string comparisons, heavy use of glGetUniformLocation is not recommended.
To get location values for the reserved fragment shader, use the constants defined for that purpose. The location of each uniform in a program object is guaranteed not to change until the program object is either destroyed by the glDeleteProgram function or relinked by the glLinkProgram function.
Note:
Macros for the location values are defined in $CTR_SDK/include/nn/gx/CTR/gx_UniformLocationForFragmentShader.h.
Related Functions:
glGetUniformLocation, glDeleteProgram, glLinkProgram
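The guarantee above means each uniform can be looked up once (for example, at initialization or right after linking) and the stored location reused every frame. A minimal, hypothetical sketch of such a cache; `slow_lookup` merely stands in for glGetUniformLocation and counts how often the expensive string search runs:

```python
def make_location_cache(lookup):
    """Wrap an expensive name->location lookup so each name is resolved once."""
    cache = {}
    def get(name):
        if name not in cache:
            cache[name] = lookup(name)  # expensive string comparison happens here
        return cache[name]
    return get

calls = []
def slow_lookup(name):            # stands in for glGetUniformLocation
    calls.append(name)
    return hash(name) % 256       # dummy location value

get_location = make_location_cache(slow_lookup)
first = get_location("uProjection")
assert get_location("uProjection") == first  # served from the cache
assert calls == ["uProjection"]              # the expensive lookup ran only once
```

Remember to invalidate such a cache when the program object is relinked or destroyed, since the guarantee ends at that point.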
8.5.3. Cost of Switching Programs With glUseProgram
When glUseProgram is called, a shader binary must be loaded during validation if the new program object and the old program object link to different shader binaries.
Rendering while frequently switching between program objects that link to separate shader binaries is not recommended. As an effective alternative, consider using a conditional branch instruction so that one vertex shader binary can be shared by multiple program objects, or adjust the render order to minimize the number of calls to glUseProgram.
Related Functions:
glUseProgram
8.5.4. Cost of Calling glUseProgram(0)
Calling glUseProgram(0) causes all validation flags to be set the next time glUseProgram is called, defeating the benefit of shared shader binaries. DMPGL is designed to track differential updates, so there is no need to call glUseProgram(0) explicitly.
Related Functions:
glUseProgram
8.5.5. Notes About the transpose Parameter of glUniformMatrix
Matrices are treated as column-major in OpenGL ES. The transpose parameter of the glUniformMatrix functions, which indicates whether a matrix is transposed before it is loaded, can only be set to GL_FALSE. In contrast, DMPGL treats matrices as row-major inside the graphics driver and allows GL_TRUE to be specified as the transpose parameter of the glUniformMatrix functions.
Because of this internal difference between DMPGL and OpenGL ES, matrices are implicitly transposed when GL_FALSE is specified in the transpose parameter of the glUniformMatrix functions.
Matrices generated with the CTR-SDK MATH library, on the other hand, are row-major, just like the DMPGL driver. By specifying GL_TRUE in the transpose parameter of the glUniformMatrix functions, you can set such matrices without transposing them.
Related Functions:
glUniformMatrix
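The row-major versus column-major distinction is easy to verify directly: laying a matrix out row by row and then reading that same buffer column by column yields the transposed matrix. The short sketch below demonstrates this equivalence (pure Python, no GL calls), which is why a row-major matrix can be handed over with transpose set to GL_TRUE instead of being transposed by hand:

```python
def flatten_row_major(m):
    """Lay a 2-D matrix out row by row, as a row-major library does."""
    return [x for row in m for x in row]

def flatten_column_major(m):
    """Lay a 2-D matrix out column by column, as OpenGL ES expects."""
    return [m[r][c] for c in range(len(m[0])) for r in range(len(m))]

def transpose(m):
    return [list(col) for col in zip(*m)]

m = [[1, 2], [3, 4]]
# Reading a row-major buffer as column-major equals transposing first:
assert flatten_row_major(m) == flatten_column_major(transpose(m))
```

In other words, a row-major buffer and the column-major buffer of its transpose are byte-for-byte identical, so no copy or rearrangement is needed.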
8.5.6. Using Textures in the Native PICA Format
The current CTR-SDK specifications support textures in the standard OpenGL format. Using the standard format requires the textures to be converted to the native format when glTexImage2D is called. (This conversion is done automatically.) The conversion can be skipped by keeping textures in the native format instead.
Related Functions:
glTexImage2D
8.5.7. Setting Uniforms for Vertex Shaders
Overwriting the uniforms in a vertex shader is more efficient if the registers being overwritten are contiguous.
- Case 1: Overwrite all registers ([c0...c15]).
- Case 2: Overwrite non-contiguous blocks of registers ([c0...c3], [c8...c11], [c16...c19], [c24...c27]).
In both cases, the number of registers being overwritten is the same, but Case 1 is more efficient because it allows transfer commands to be consolidated.
Related Functions:
glUniform
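The consolidation described above can be modeled by counting how many contiguous runs a set of register indices forms; each run can then be sent as one consolidated transfer. This is an illustrative sketch only (the one-command-per-run assumption is a simplification of the actual command format):

```python
def count_transfer_runs(registers):
    """Count maximal runs of consecutive register indices.
    Each run is assumed to map to one consolidated transfer command."""
    regs = sorted(set(registers))
    runs = 0
    previous = None
    for r in regs:
        if previous is None or r != previous + 1:
            runs += 1  # a gap in the indices starts a new transfer
        previous = r
    return runs

case1 = list(range(0, 16))                                  # [c0..c15]
case2 = [r for s in (0, 8, 16, 24) for r in range(s, s + 4)]
assert len(case1) == len(case2) == 16      # same number of registers
assert count_transfer_runs(case1) == 1     # one consolidated transfer
assert count_transfer_runs(case2) == 4     # four separate transfers
```

Both cases touch 16 registers, but Case 1 forms a single run while Case 2 forms four, which mirrors why the contiguous overwrite is more efficient.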
8.5.8. Updating Buffers
The glBufferData and glTexImage2D functions, which rewrite vertex and texture data, access data in main memory. Accessing this data requires the CPU to apply the changes to the GPU. Because this process carries relatively large overhead, the overall processor load increases in proportion to the number of calls to these functions.
We recommend calling these functions in advance while the application is initializing, loading vertex buffer and texture data early, and avoiding data loads during per-frame operations whenever possible.
The same type of overhead is involved in using the glBufferSubData function to partially update vertex data. To reduce this overhead, consider gathering all the data required for a partial update into a single chunk and processing it with a single call to glBufferSubData.
Related Functions:
glBufferData, glTexImage2D, glCompressedTexImage2D, glTexImage1D, glBufferSubData, glTexSubImage1D
8.5.9. Validation
With DMPGL, configuration changes made through calls to the GL API are applied to the hardware by an operation called validation, which writes the actual commands to the command buffer. Validation occurs when the following functions are called.
- glDrawArrays
- glDrawElements
- nngxValidateState
Each setting is divided into several categories, and a state update flag is set for each category that is updated. During validation, these flags are used to determine which categories to update. Updates are applied one category at a time, so the processing load for glDrawArrays and glDrawElements increases with the number of categories to update.
If you use glUseProgram to switch programs, state update flags are set only for the categories that differ from the previous program. The functions associated with each category are shown below. When functions such as glUniform and glEnable change the parameters of a particular category, that category's update flag is set.
Category | Functions Used
Framebuffers | glBindFramebuffer, glBindRenderbuffer, glDeleteFramebuffers, glDeleteRenderbuffers, glFramebufferRenderbuffer, glFramebufferTexture2D, glRenderbufferStorage, glReadPixels, glClear
Vertex buffers | glBindBuffer, glBufferData, glBufferSubData, glDeleteBuffers
Triangles | glEnable, glDisable, glUseProgram, glDepthRangef, glPolygonOffset
Lighting LUTs | glUseProgram; glUniform* (dmp_LightEnv.lutEnabledSP, dmp_LightEnv.lutEnabledD0, dmp_LightEnv.lutEnabledD1, dmp_LightEnv.fresnelSelector, dmp_LightEnv.lutEnabledRefl, dmp_FragmentMaterial.sampler{D0,D1,SP,FR,RB,RG,RR}, dmp_FragmentLighting.enabled); glUniformsDMP; glRestoreProgramsDMP
Fog LUTs | glUseProgram; glUniform* (dmp_Fog.mode (GL_FOG or GL_GAS_DMP), dmp_Fog.sampler); glUniformsDMP; glRestoreProgramsDMP
Procedural texture LUTs | glUseProgram; glUniform* (dmp_Texture[3].ptSampler{Rgb,Alpha,Noise,R,G,B,A}, dmp_Texture[3].ptNoiseEnable, dmp_Texture[3].ptAlphaSeparate, dmp_Texture[3].samplerType); glUniformsDMP; glRestoreProgramsDMP
Vertex arrays | glBindBuffer, glEnableVertexAttribArray, glDisableVertexAttribArray, glVertexAttribPointer
Current vertices | glBindBuffer
Framebuffer access | glEnable, glDisable, glDepthFunc, glEarlyDepthFuncDMP, glColorMask, glDepthMask, glStencilMask, glUseProgram; glUniform* (dmp_FragOperation.mode); glUniformsDMP; glRestoreProgramsDMP
Scissor/viewport | glEnable, glDisable, glScissor, glViewport
Texture 0 | glUseProgram; glUniform* (dmp_Texture[0].samplerType); glUniformsDMP; glRestoreProgramsDMP; glBindTexture, glDeleteTextures, glCompressedTexImage2D, glCopyTexImage2D, glCopyTexSubImage2D, glTexImage2D, glTexParameteriv, glTexParameterfv
Texture 1 | Same as Texture 0, except glUniform* (dmp_Texture[1].samplerType)
Texture 2 | Same as Texture 0, except glUniform* (dmp_Texture[2].samplerType)
Texture 3 | glUseProgram; glRestoreProgramsDMP; glBindTexture, glDeleteTextures, glCompressedTexImage2D, glCopyTexImage2D, glCopyTexSubImage2D, glTexImage2D, glTexParameteriv, glTexParameterfv
Texture LUTs | glBindTexture, glDeleteTextures, glTexImage1D, glTexSubImage1D
Program | glUseProgram
Shader uniforms | glUseProgram; glUniform*
Uniforms for vertex shaders and geometry shaders | glUniformsDMP
Rasterization | glDrawArrays, glDrawElements, glUniform*
Uniforms for reserved fragment shaders | glUniformsDMP; glRestoreProgramsDMP
Shader binaries | glUseProgram
Vertex shader binaries | glUseProgram
Geometry shader binaries | glUseProgram
Geometry shader attachments | glUseProgram
Geometry shader detachments | glUseProgram
Gas LUTs | glUseProgram; glUniform* (dmp_Fog.mode (GL_GAS_DMP), dmp_Gas.sampler{TR,TG,TB}); glUniformsDMP
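The per-category flag mechanism above can be sketched as a set of dirty categories: changing a parameter marks its category, validation processes exactly the marked categories, and switching programs marks only the categories whose settings differ between the two programs. A hypothetical Python model (the names are illustrative, not SDK API):

```python
def categories_to_update(old_program, new_program):
    """Return the categories whose settings differ between two programs;
    only these get their state update flags set on a program switch."""
    keys = set(old_program) | set(new_program)
    return {k for k in keys if old_program.get(k) != new_program.get(k)}

prog_a = {"shader binaries": "bin0", "lighting LUTs": "lutA", "fog LUTs": "fog0"}
prog_b = {"shader binaries": "bin0", "lighting LUTs": "lutB", "fog LUTs": "fog0"}

# Shared shader binary: switching A -> B dirties only the lighting LUT
# category, so validation in the next draw call has less work to do.
assert categories_to_update(prog_a, prog_b) == {"lighting LUTs"}
```

This is also why sharing shader binaries across program objects (Section 8.5.3) pays off: identical categories produce no update flags, keeping the per-draw validation cost low.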
8.5.10. Functions That Issue Command Requests
The GPU begins executing commands when nngxRunCmdlist is called. Rendering instructions are processed one command request at a time, in the order in which the requests accumulated in the command list.
The following functions issue command requests to a command list.
Function | Condition | Command Request Added
nngxSplitDrawCmdlist | Always | Render command request
nngxTransferRenderImage | Always | Render command request; post transfer command request
glClear | Always | Render command request; memory fill command request
glBufferData | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request
glBufferSubData | When glBufferData meets the condition above | DMA transfer command request
glTexImage2D | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request
glCopyTexImage2D | Always | Render command request; copy texture command request
glCopyTexSubImage2D | Always | Render command request; copy texture command request
glRestoreTextureCollectionsDMP | When conditions are met by the functions that generate the commands to be restored (glBufferData, glBufferSubData, glTexImage2D) | DMA transfer command request
glRestoreVertexStateCollectionsDMP | Same as above | DMA transfer command request
glDrawArrays | When the reserved uniform dmp_Gas.autoAcc is GL_TRUE and the function is called for the first time after the reserved uniform dmp_FragOperation.mode has its value changed from GL_FRAGOP_MODE_GAS_ACC_DMP | Render command request
glDrawElements | Same as above | Render command request
Note 1: Not added when nngxSplitDrawCmdlist is called in advance.
Note 2: One is added per color and depth (stencil) buffer; two are added when both are specified.
8.6. Functions That Cause System Blocking
If you measure performance using the CPU profiler, a large proportion of the measurements may be attributed to system blocking (SYSTEM_WAITING). System blocking indicates that multiple threads are being profiled and none of them is active: a system call from some thread is blocking, but the profiler cannot determine which thread it is.
You can use the CPU profiler's select-thread and reverse-call-tree features to trace processing that is counted as system blocking.
Time is counted as system blocking when the profiler cannot identify the last active thread or function. You can eliminate this by using the select-thread feature to limit which threads are profiled. If a function that had been counted as system blocking then appears in the results, use the reverse call tree to find where in the application that function is called, and trace it from there.
Most CTR-SDK functions can potentially cause the system to block. If possible, Nintendo recommends using the CPU profiler to check performance at crucial points in your implementation.
8.7. FS Library Performance Details
This section provides reference data on access speeds for each archive type when the CTR-SDK FS library is used. Specifically, it describes file system performance measurements for ROM archives, save data (backup region), and extended save data (SD Card) when the FS library from the CTR-SDK 3.x series or later is used. For ROM archives and save data, measurements assume the use of card applications (Nintendo 3DS Game Cards) or downloadable applications (SD Card).
Although the development card speed can be set to either “Fast” or “Slow” for the ROM archives of card-based applications, almost no retail devices perform at the “Slow” setting. However, ROM archive performance degrades with repeated access. For example, when reading a 512 KB file, “Slow”-like performance initially occurs only momentarily (for about one or two internal command cycles), but as degradation progresses, “Slow”-like performance gradually becomes more common.
Because the performance of 3DS Game Cards varies with the memory chips inside them, use the data provided here only as a reference. Also note that CARD2 performance varies slightly depending on the size of the save region.
To measure SD Card performance, we used media formatted with a formatter that complies with the SD File System Specification. Performance may differ even within the same class of media, depending on the media capacity, the manufacturer, and the individual unit, so use the data provided here only as a reference.
Also note that data for the CTR-SDK 3.x series and later versions is treated as a single data set because the results were almost identical.
The following environment was used for the measurements.
- Release build
- Wait simulation function OFF
- Development card speed setting “Fast”
- Normal access priority (PRIORITY_APP_NORMAL)
Note:
If you use this feature to access the file system with real-time priority, design your performance based on the data in the documentation bundled with the CTR-SDK, rather than relying on actual measurements.
For more information about access priority, see the CTR-SDK documentation.
Note:
The performance of ROM archives under emulation in the PARTNER-CTR Debugger has been tuned to nearly match the listed figures (performance of development cards).
The performance of retail cards is almost the same as the performance of development cards.
8.7.1. ROM Archive
The benchmark data presented in this section is reference data that resulted from measuring access to ROM archives on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.1.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times.
To prevent cached data from skewing the results, we attempted to read only unique files whenever possible. The blue line is the CTR flash card, and the red line is CARD2.
The tables below show the time taken by operations other than those of the TryRead function.
Process | Average | Best Score | Worst Score
TryInitialize | 0.045 ms | 0.021 ms | 0.164 ms
Finalize | 0.006 ms | 0.002 ms | 0.024 ms
8.7.1.2. Stream Reads
The figures below show the time required by the TryRead function to read files of the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function.
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache. The blue line is the CTR flash card and the red line is CARD2.
8.7.2. Save Data
The benchmark data presented in this section is reference data that resulted from measuring access to the backup region on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.2.1. Read
The figures below show the results for batch-reading save data of the specified size, written to a development card or CARD2 with 512 KB of backup memory, 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
The backup memory set for CARD2 was 256 MB. CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different, but for larger sizes, caching may have had an effect. Note that even with automatic redundancy enabled, caching had little effect for files of 16 KB or more, because the effect of caching varies with factors such as whether a file is read immediately after being written or whether the same file is read more than once.
The tables below show the time taken by operations other than those of the TryRead function.
Process | Measurement Conditions | Average | Best Score | Worst Score
TryInitialize | No automatic redundancy | 1.276 ms | 0.708 ms | 7.194 ms
TryInitialize | Automatic redundancy enabled | 1.104 ms | 0.715 ms | 7.031 ms
Finalize | No automatic redundancy | 0.480 ms | 0.345 ms | 1.910 ms
Finalize | Automatic redundancy enabled | 0.395 ms | 0.277 ms | 2.034 ms
8.7.2.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data of the specified size 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. Benchmarks were recorded with two types of save data: data saved with automatic redundancy enabled, and data saved without it. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
The backup memory set for CARD2 was 256 MB.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (approximately 350 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Process | Measurement Conditions | Average | Best Score | Worst Score
TryInitialize | No file | 620.133 ms | 478.054 ms | 926.344 ms
TryInitialize | Overwrite file of the same size | 1.269 ms | 0.709 ms | 3.889 ms
Finalize | No file | 0.489 ms | 0.337 ms | 2.084 ms
Finalize | Overwrite file of the same size | 0.483 ms | 0.337 ms | 2.203 ms
The TryInitialize operation takes significantly more time when creating a new file than when overwriting an existing one.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
The benchmark data shows no major difference in processing time between creating and overwriting a file, other than the CommitSaveData operations being around 100 ms faster when data of the same size is overwritten.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Process | Measurement Conditions | Average | Best Score | Worst Score
TryInitialize | No file | 1.267 ms | 0.813 ms | 3.365 ms
TryInitialize | Overwrite file of the same size | 1.997 ms | 0.667 ms | 11.511 ms
Finalize | No file | 0.368 ms | 0.259 ms | 1.807 ms
Finalize | Overwrite file of the same size | 0.384 ms | 0.260 ms | 1.713 ms
When a file was overwritten as opposed to created, some processes were dramatically slower. Otherwise, the results showed almost no differences under the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
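As a rough planning aid, the per-save cost seen in these benchmarks can be modeled as a fixed overhead (initialize, flush, commit, finalize) plus a size-dependent transfer term. The sketch below is purely illustrative; the overhead and throughput values are hypothetical placeholders, not measured figures, and should be replaced with numbers measured on your own target media:

```python
def estimate_save_time_ms(size_kb, fixed_overhead_ms, throughput_kb_per_ms):
    """Rough model: fixed per-save overhead plus size-proportional transfer time."""
    return fixed_overhead_ms + size_kb / throughput_kb_per_ms

# Hypothetical placeholder values -- substitute your own measurements:
overhead_ms = 400.0   # initialize + flush + commit + finalize, combined
throughput = 0.5      # KB written per millisecond

small = estimate_save_time_ms(64, overhead_ms, throughput)
large = estimate_save_time_ms(512, overhead_ms, throughput)
assert large - small == 896.0  # extra data directly extends the save window
```

Such an estimate makes it easy to check whether an auto-save of a given size fits inside the limited window available before the system powers off.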
8.7.2.3. Mounting
The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the development card was 512 KB.
Process | Measurement Conditions | Average | Best Score | Worst Score
FormatSaveData | No automatic redundancy (factory state) | 2857.726 ms | 2843.064 ms | 2872.938 ms
FormatSaveData | No automatic redundancy | 8535.636 ms | 8339.167 ms | 8661.502 ms
FormatSaveData | Automatic redundancy enabled (factory state) | 2476.430 ms | 2462.935 ms | 2495.924 ms
FormatSaveData | Automatic redundancy enabled | 8186.308 ms | 8015.401 ms | 8278.390 ms
MountSaveData | No automatic redundancy | 26.955 ms | 25.682 ms | 29.150 ms
MountSaveData | Automatic redundancy enabled | 30.495 ms | 28.150 ms | 33.798 ms
Unmount | No automatic redundancy | 1.476 ms | 1.081 ms | 3.318 ms
Unmount | Automatic redundancy enabled | 1.124 ms | 0.873 ms | 2.324 ms
It takes much more processing time to format a backup region when data has been written to it than when the backup region is in factory condition. The time it takes to format the region also depends on the amount of data written.
8.7.2.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size of the development card was 512 KB.
Process | Measurement Conditions | Average | Best Score | Worst Score
TryDeleteFile | No automatic redundancy | 658.566 ms | 528.313 ms | 946.492 ms
TryDeleteFile | Automatic redundancy enabled | 1.465 ms | 0.705 ms | 4.004 ms
8.7.3. Extended Save Data
This section shows benchmark data for accessing the extended save data region on media having the same specifications and capacity as the SD Card packaged with the CTR system. Use this data as a reference. The benchmark data provided was collected using a sample device with its sole purpose being measurement testing, so there is no guarantee that processing will complete within the times presented.
Note:
The media used for measurements exhibited a 2.5% drop in performance when reading and a 5.2% drop when writing for the transfer rate of a 505 KB file versus that of a 504 KB file. However, the degree of degradation of performance and the file size at which performance degrades might differ depending on the media’s capacity, manufacturer, or other factors, even for media of the same class. Because there are variations in performance depending on the media being used, be sure to treat the data listed here as a source of reference.
8.7.3.1. Mounting
The tables below show the time required, as measured on a test unit, to execute the MountExtSaveData function for mounting extended save data, the DeleteExtSaveData function for deleting it, the CreateExtSaveData function for creating it, the MountExtSaveData function for remounting it, and the Unmount function for unmounting it.
Process | Average | Best Score | Worst Score
MountExtSaveData | 40.844 ms | 39.278 ms | 42.136 ms
DeleteExtSaveData | 105.296 ms | 77.204 ms | 165.070 ms
CreateExtSaveData | 2346.382 ms | 2055.237 ms | 2468.994 ms
MountExtSaveData (remount) | 38.256 ms | 36.832 ms | 40.576 ms
Unmount | 1.842 ms | 1.619 ms | 2.319 ms
8.7.3.2. Creating Files
The tables below show the time required, as measured on a test unit, for the TryCreateFile function to execute.
Process | Average | Best Score | Worst Score
TryCreateFile | 766.200 ms | 363.559 ms | 2125.721 ms
8.7.3.3. Loading Files
The figures below show the results for batch-reading extended save data files 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function's processing was measured on a test unit.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For capacities up to 8 MB, every read operation was to a different, unique file. For 16 MB capacity, the testing involved 30 files, and for 32 MB the testing involved 16 files.
The tables below show the time taken by operations other than those of the TryRead function.
Process | Average | Best Score | Worst Score
TryInitialize | 14.098 ms | 11.246 ms | 21.259 ms
Finalize | 1.433 ms | 1.050 ms | 3.150 ms
8.7.3.4. Writing Files
The figures below show the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing files to extended save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times.
To prevent cached data from skewing the results, the number of files used for writing depended on the media capacity. For capacities up to 8 MB, every write operation was to a different, unique file. For 16 MB capacity, the testing used 30 files, and for 32 MB, the testing used 16 files.
The tables below show the time taken by the TryInitialize function and the Finalize function.
Process | Average | Best Score | Worst Score
TryInitialize | 15.406 ms | 12.528 ms | 23.478 ms
Finalize | 1.451 ms | 1.040 ms | 3.259 ms
8.7.3.5. Deleting Files
8.7.4. ROM Archives (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to downloadable applications' ROM archives on media with the same specifications and capacity as the SD Card bundled with the CTR system. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.4.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to read only unique files whenever possible.
The tables below show the time taken by operations other than those of the TryRead function.
Process | Average | Best Score | Worst Score
TryInitialize | 0.046 ms | 0.022 ms | 0.220 ms
Finalize | 0.006 ms | 0.002 ms | 0.023 ms
8.7.4.2. Stream Reads
The figures below show the time required by the TryRead function to read files of the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache.
8.7.5. Save Data (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to a downloadable application's backup region on media with the same specifications and capacity as the SD Card bundled with the CTR system. Use this data as a reference. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.5.1. Read
The figures below show the results for batch-reading save data of the specified size, written to a downloadable application's backup region set to 512 KB, 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function to execute was measured on a test unit. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different, but for larger sizes, caching may have had an impact. Also note that the effect of caching changes when a file is read immediately after being written or when the same file is read more than once.
The tables below show the time taken by operations other than those of the TryRead function.
Table 8-17. Downloadable Applications: Time Required for Operations Other Than the TryRead Function (Save Data)
Process | Measurement Conditions | Average | Best Score | Worst Score
TryInitialize | No automatic redundancy | 1.259 ms | 0.715 ms | 7.613 ms
TryInitialize | Automatic redundancy enabled | 1.242 ms | 0.714 ms | 5.633 ms
Finalize | No automatic redundancy | 0.492 ms | 0.349 ms | 2.099 ms
Finalize | Automatic redundancy enabled | 0.398 ms | 0.278 ms | 2.015 ms
8.7.5.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. The downloadable application's backup memory size was 512 KB. Benchmarks were recorded with two types of save data: data saved with automatic redundancy enabled, and data saved without it. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (approximately 200 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-18. Downloadable Applications: Time Taken by the TryInitialize and Finalize Functions (No Automatic Redundancy)
Process | Measurement Conditions | Average | Best Score | Worst Score
TryInitialize | No file | 198.837 ms | 55.597 ms | 304.459 ms
TryInitialize | Overwrite file of the same size | 1.232 ms | 0.720 ms | 4.068 ms
Finalize | No file | 0.467 ms | 0.333 ms | 2.152 ms
Finalize | Overwrite file of the same size | 0.466 ms | 0.337 ms | 2.134 ms
The TryInitialize function takes more time when a file is created than when an existing file is overwritten.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-19. Downloadable Applications: Time Taken by the TryInitialize and Finalize Functions (Automatic Redundancy Enabled)

| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryInitialize | No file | 1.256 ms | 0.810 ms | 3.412 ms |
| TryInitialize | Overwrite file of the same size | 1.535 ms | 0.662 ms | 8.886 ms |
| Finalize | No file | 0.396 ms | 0.260 ms | 2.002 ms |
| Finalize | Overwrite file of the same size | 0.385 ms | 0.260 ms | 2.033 ms |
Overwriting an existing file was dramatically slower than creating a new one in some cases; otherwise, the results showed almost no difference between the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.5.3. Mounting
The table below shows the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| FormatSaveData | No automatic redundancy (right after importing) | 430.314 ms | 360.571 ms | 490.037 ms |
| FormatSaveData | No automatic redundancy | 436.077 ms | 428.289 ms | 446.790 ms |
| FormatSaveData | Automatic redundancy enabled (right after importing) | 206.321 ms | 181.559 ms | 253.866 ms |
| FormatSaveData | Automatic redundancy enabled | 197.499 ms | 180.707 ms | 215.180 ms |
| MountSaveData | No automatic redundancy | 24.079 ms | 22.886 ms | 26.050 ms |
| MountSaveData | Automatic redundancy enabled | 24.674 ms | 22.770 ms | 25.738 ms |
| Unmount | No automatic redundancy | 1.694 ms | 1.298 ms | 3.496 ms |
| Unmount | Automatic redundancy enabled | 1.338 ms | 0.895 ms | 2.622 ms |
For the measurements, the state immediately after importing was reproduced by deleting the backup region with SaveDataFiler.
8.7.5.4. Deleting Files
The table below shows the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryDeleteFile | No automatic redundancy | 171.093 ms | 55.857 ms | 299.145 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.546 ms | 0.712 ms | 4.581 ms |
8.7.6. Comparison of CTR Flash Cards and CARD2
The following sections compare CARD2 performance with that of CTR flash cards.
8.7.6.1. ROM Archive
ROM archive performance is approximately equal for CTR flash cards and CARD2.
8.7.6.2. Save Data
No Automatic Redundancy
CARD2 file read performance increased by up to approximately 5x, and file write performance by up to approximately 30x.
Automatic Redundancy Enabled
CARD2 file read performance increased by up to approximately 5x, and file write performance by up to approximately 33x.
The following sections provide notes about performance when using DMPGL.
8.5.1. Maintaining Internal State Consistency
The internal state refers to local data that is saved by the DMPGL driver; it could also be called a mirror of the hardware settings. Because DMPGL function calls update the internal state and commands configure the hardware settings, the DMPGL driver’s internal state may become inconsistent with the hardware settings when commands are issued directly or when command caches are used.
If there is a discrepancy between the internal state and the hardware settings, you can restore consistency by forcing the hardware settings to be validated (by issuing complete command packets). Validation actually occurs after nngxUpdateState(NN_GX_STATE_ALL) is run.
This issues commands only for the hardware settings that have been changed by the command cache or by directly issued commands, rather than for all settings, which trims the work of issuing commands. However, the application must carefully track which states the issued commands correspond to and which states depend on them.
When nngxSetCommandGenerationMode(NN_GX_CMDGEN_MODE_UNCONDITIONAL) is called, commands are issued regardless of the comparison results for the internal state. Only the following settings are affected by this mode.
- Uniform settings for the reserved fragment shader.
- Integer uniform settings for the vertex shader.
- LUT data settings.
- DMPGL functions associated with NN_GX_STATE_OTHERS.
Related Functions:
nngxUpdateState, nngxValidateState, glDrawElements, glDrawArrays, nngxSetCommandGenerationMode
8.5.2. Removing the Use of glGetUniformLocation (Important)
One way to configure a shader uniform with DMPGL is to call the glGetUniformLocation function to get its location. Because it performs processor-intensive operations such as string comparison, heavy use of glGetUniformLocation is not recommended.
To get location values for the fixed fragment shader, use the constants defined for that purpose. The location of each uniform in a program object is guaranteed not to change until the program object is either destroyed by the glDeleteProgram function or relinked by the glLinkProgram function.
Note: Macros for the location values are defined in $CTR_SDK/include/nn/gx/CTR/gx_UniformLocationForFragmentShader.h.
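Because the location values are stable between glLinkProgram and glDeleteProgram, the usual approach is to look each location up once and cache it. The sketch below is illustrative only: the uniform names are hypothetical, and a counting stub stands in for the real glGetUniformLocation so the pattern is self-contained.

```c
/* Sketch: cache uniform locations once after linking instead of calling
 * glGetUniformLocation every frame. The stub below stands in for the real
 * driver call (which does string comparison); uniform names are hypothetical. */
#include <string.h>

typedef int GLint;
typedef unsigned int GLuint;

static int g_lookupCalls = 0;          /* counts expensive string lookups */

/* Stand-in for the real glGetUniformLocation. */
static GLint glGetUniformLocation_stub(GLuint program, const char* name)
{
    (void)program;
    ++g_lookupCalls;
    return (GLint)strlen(name);        /* dummy location for the sketch */
}

typedef struct UniformCache {
    GLint projMtx;
    GLint modelView;
} UniformCache;

/* Do the costly lookups exactly once, right after glLinkProgram. */
static void InitUniformCache(UniformCache* c, GLuint program)
{
    c->projMtx   = glGetUniformLocation_stub(program, "uProjection");
    c->modelView = glGetUniformLocation_stub(program, "uModelView");
}
```

Per-frame code then reads the cached fields directly, so no further string lookups occur until the program object is relinked or destroyed.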
Related Functions:
glGetUniformLocation, glDeleteProgram, glLinkProgram
8.5.3. Cost of Switching Programs With glUseProgram
When glUseProgram is called, a shader binary must be loaded during validation if the new program object and the old program object link to different shader binaries.
Rendering while frequently switching between program objects that link to separate shader binaries is not recommended. As an effective alternative, consider using conditional branch instructions so that a vertex shader binary can be shared by multiple program objects, or adjust the render order to minimize the number of calls to glUseProgram.
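One way to adjust the render order is to sort the frame's queued draws by program object before submission, so each program is bound at most once per frame. This is only a sketch: DrawItem and its fields are hypothetical, and the actual GL calls are left as comments.

```c
/* Sketch: sort queued draws by program object so consecutive draws share a
 * program, minimizing glUseProgram calls. DrawItem is hypothetical. */
#include <stdlib.h>

typedef struct DrawItem {
    unsigned program;   /* program object used by this draw */
    int      mesh;      /* hypothetical mesh id */
} DrawItem;

static int CompareByProgram(const void* a, const void* b)
{
    unsigned pa = ((const DrawItem*)a)->program;
    unsigned pb = ((const DrawItem*)b)->program;
    return (pa > pb) - (pa < pb);
}

/* Sorts the queue and returns how many glUseProgram calls it would need. */
static int SortAndCountSwitches(DrawItem* items, size_t count)
{
    int switches = 0;
    unsigned current = 0;    /* 0 = no program bound yet */
    size_t i;

    qsort(items, count, sizeof(DrawItem), CompareByProgram);
    for (i = 0; i < count; ++i) {
        if (items[i].program != current) {
            current = items[i].program;   /* glUseProgram(current); */
            ++switches;
        }
        /* glDrawElements(...); */
    }
    return switches;
}
```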
Related Functions:
glUseProgram
8.5.4. Cost of Calling glUseProgram(0)
Calling glUseProgram(0) causes all validation flags to be set the next time glUseProgram is called, which defeats the benefit of shared shader binaries. Because DMPGL is designed to monitor differential updates, there is no need to call glUseProgram(0) explicitly.
Related Functions:
glUseProgram
8.5.5. Notes About the transpose Parameter of glUniformMatrix
Matrices are treated as column-major in OpenGL ES. The transpose parameter of the glUniformMatrix functions, which indicates whether a matrix is transposed before it is loaded, can only be set to GL_FALSE. In contrast, DMPGL treats matrices as row-major within the graphics driver and allows GL_TRUE to be specified as the transpose parameter of the glUniformMatrix functions.
Due to this internal difference between DMPGL and OpenGL ES, matrices are implicitly transposed if GL_FALSE is specified in the transpose parameter of the glUniformMatrix functions.
Matrices generated using the MATH library of the CTR-SDK, on the other hand, are row-major just like the DMPGL driver. By specifying GL_TRUE in the transpose parameter of the glUniformMatrix functions, you can set such matrices without transposing them.
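The implicit step can be illustrated with a plain transpose routine: conceptually, this is the extra work performed when GL_FALSE is passed, and the work avoided by passing GL_TRUE with an already row-major matrix (such as one from the CTR-SDK MATH library). The routine below is purely illustrative, not driver code.

```c
/* Sketch: a 4x4 transpose, conceptually what the driver performs when
 * GL_FALSE is passed to glUniformMatrix4fv under DMPGL. Matrices are
 * stored as 16 floats; in[r*4 + c] is row r, column c. */
static void Transpose4x4(const float in[16], float out[16])
{
    int r, c;
    for (r = 0; r < 4; ++r)
        for (c = 0; c < 4; ++c)
            out[c * 4 + r] = in[r * 4 + c];
}
```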
Related Functions:
glUniformMatrix
8.5.6. Using Textures in the Native PICA Format
The current CTR-SDK specifications support textures in the standard OpenGL format. Using the standard format requires the textures to be converted to the native PICA format when glTexImage2D is called (this conversion is done automatically). The conversion can be omitted by maintaining textures in the native format instead.
Related Functions:
glTexImage2D
8.5.7. Setting Uniforms for Vertex Shaders
Overwriting the uniforms in a vertex shader is more efficient if the registers being overwritten are contiguous.
- Case 1: Overwrite all registers ([c0...c15]).
- Case 2: Overwrite non-contiguous blocks of registers ([c0...c3], [c8...c11], [c16...c19], [c24...c27]).
In both cases, the number of registers being overwritten is the same, but Case 1 is more efficient because it allows transfer commands to be consolidated.
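Under the assumption that each contiguous run of registers maps to one consolidated transfer command, the difference between the two cases can be sketched as follows (the range representation is hypothetical):

```c
/* Sketch: the number of transfer commands needed is roughly the number of
 * contiguous runs of registers being overwritten. Ranges are hypothetical
 * inclusive [first, last] register indices, assumed sorted. */
typedef struct RegRange { int first, last; } RegRange;

/* Counts contiguous runs across the sorted ranges -- one transfer each. */
static int CountTransferRuns(const RegRange* ranges, int count)
{
    int runs = 0, i;
    for (i = 0; i < count; ++i) {
        if (i == 0 || ranges[i].first != ranges[i - 1].last + 1)
            ++runs;   /* a gap starts a new transfer command */
    }
    return runs;
}
```

For Case 1 the helper reports one run; for Case 2 it reports four, which is why the non-contiguous overwrite is less efficient despite touching the same number of registers.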
Related Functions:
glUniform
8.5.8. Updating Buffers
The glBufferData and glTexImage2D functions, which rewrite vertex and texture data, access data in main memory, and the CPU must then apply those changes to the GPU. Because this process places a relatively large overhead on the system, the overall processor load increases in proportion to the number of calls to such functions.
We recommend calling these functions in advance while the application is being initialized, loading vertex buffer and texture data early, and avoiding data loads during per-frame operations whenever possible.
The same type of overhead is involved in using the glBufferSubData function to partially update vertex data. To reduce this overhead, consider gathering all the data required for a partial update into a single chunk and processing it with a single call to glBufferSubData.
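One way to batch the update, assuming the dirty ranges sit reasonably close together, is to compute a single span covering all of them and upload that span with one call. The helper below is a sketch with hypothetical types; note that it also re-uploads the untouched bytes between ranges, so it only pays off when the ranges are near one another.

```c
/* Sketch: gather several dirty sub-ranges of a vertex buffer into one span
 * so a single glBufferSubData call can upload them. Offsets and sizes are
 * in bytes; the GL call itself is left as a comment. */
typedef struct DirtyRange { long offset, size; } DirtyRange;

/* Computes the one span covering every dirty range (zero span if none). */
static DirtyRange MergeDirtyRanges(const DirtyRange* r, int count)
{
    DirtyRange span = { 0, 0 };
    int i;
    if (count > 0) {
        long lo = r[0].offset, hi = r[0].offset + r[0].size;
        for (i = 1; i < count; ++i) {
            if (r[i].offset < lo) lo = r[i].offset;
            if (r[i].offset + r[i].size > hi) hi = r[i].offset + r[i].size;
        }
        span.offset = lo;
        span.size   = hi - lo;
    }
    /* glBufferSubData(GL_ARRAY_BUFFER, span.offset, span.size, staging); */
    return span;
}
```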
Related Functions:
glBufferData, glTexImage2D, glCompressedTexImage2D, glTexImage1D, glBufferSubData, glTexSubImage1D
8.5.9. Validation
With DMPGL, configuration changes made through calls to the GL API are applied to the hardware by an operation called validation, which writes the actual commands to the command buffer. Validation occurs when the following functions are called.
- glDrawArrays
- glDrawElements
- nngxValidateState
The settings are divided into several categories, and a state update flag is set for each category that is updated. During validation, these flags determine which categories to update; updates are applied one category at a time. The processing load for glDrawArrays and glDrawElements increases with the number of categories to update.
If you use glUseProgram to switch programs, state update flags are set only for the categories that differ from the previous program. The functions associated with each category are shown below. If functions such as glUniform and glEnable change the parameters of a particular category, that category's update flag is set.
Framebuffers:
- glBindFramebuffer
- glBindRenderbuffer
- glDeleteFramebuffers
- glDeleteRenderbuffers
- glFramebufferRenderbuffer
- glFramebufferTexture2D
- glRenderbufferStorage
- glReadPixels
- glClear

Vertex buffers:
- glBindBuffer
- glBufferData
- glBufferSubData
- glDeleteBuffers

Triangles:
- glEnable
- glDisable
- glUseProgram
- glDepthRangef
- glPolygonOffset

Lighting LUTs:
- glUseProgram
- glUniform* (dmp_LightEnv.lutEnabledSP, dmp_LightEnv.lutEnabledD0, dmp_LightEnv.lutEnabledD1, dmp_LightEnv.fresnelSelector, dmp_LightEnv.lutEnabledRefl, dmp_FragmentMaterial.sampler{D0,D1,SP,FR,RB,RG,RR}, dmp_FragmentLighting.enabled)
- glUniformsDMP
- glRestoreProgramsDMP

Fog LUTs:
- glUseProgram
- glUniform* (dmp_Fog.mode (GL_FOG or GL_GAS_DMP), dmp_Fog.sampler)
- glUniformsDMP
- glRestoreProgramsDMP

Procedural texture LUTs:
- glUseProgram
- glUniform* (dmp_Texture[3].ptSampler{Rgb,Alpha,Noise,R,G,B,A}, dmp_Texture[3].ptNoiseEnable, dmp_Texture[3].ptAlphaSeparate, dmp_Texture[3].samplerType)
- glUniformsDMP
- glRestoreProgramsDMP

Vertex arrays:
- glBindBuffer
- glEnableVertexAttribArray
- glDisableVertexAttribArray
- glVertexAttribPointer

Current vertices:
- glBindBuffer

Framebuffer access:
- glEnable
- glDisable
- glDepthFunc
- glEarlyDepthFuncDMP
- glColorMask
- glDepthMask
- glStencilMask
- glUseProgram
- glUniform* (dmp_FragOperation.mode)
- glUniformsDMP
- glRestoreProgramsDMP

Scissor/viewport:
- glEnable
- glDisable
- glScissor
- glViewport

Texture 0:
- glUseProgram
- glUniform* (dmp_Texture[0].samplerType)
- glUniformsDMP
- glRestoreProgramsDMP
- glBindTexture
- glDeleteTextures
- glCompressedTexImage2D
- glCopyTexImage2D
- glCopyTexSubImage2D
- glTexImage2D
- glTexParameteriv
- glTexParameterfv

Texture 1:
- Same as Texture 0
- glUniform* (dmp_Texture[1].samplerType)

Texture 2:
- Same as Texture 0
- glUniform* (dmp_Texture[2].samplerType)

Texture 3:
- glUseProgram
- glRestoreProgramsDMP
- glBindTexture
- glDeleteTextures
- glCompressedTexImage2D
- glCopyTexImage2D
- glCopyTexSubImage2D
- glTexImage2D
- glTexParameteriv
- glTexParameterfv

Texture LUTs:
- glBindTexture
- glDeleteTextures
- glTexImage1D
- glTexSubImage1D

Program:
- glUseProgram

Shader uniforms:
- glUseProgram
- glUniform*

Uniforms for vertex and geometry shaders:
- glUniformsDMP

Rasterization:
- glDrawArrays
- glDrawElements
- glUniform*

Uniforms for reserved fragment shaders:
- glUniformsDMP
- glRestoreProgramsDMP

Shader binaries:
- glUseProgram

Vertex shader binaries:
- glUseProgram

Geometry shader binaries:
- glUseProgram

Geometry shader attachments:
- glUseProgram

Geometry shader detachments:
- glUseProgram

Gas LUTs:
- glUseProgram
- glUniform* (dmp_Fog.mode (GL_GAS_DMP), dmp_Gas.sampler{TR, TG, TB})
- glUniformsDMP
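The flag scheme described above can be sketched as a simple bitmask: state-setting calls mark a category dirty, and validation walks only the dirty categories and then clears the mask. Category names and bit assignments here are hypothetical.

```c
/* Sketch of the per-category state-update flags: state changes set a
 * category's flag, and validation processes and clears only the flagged
 * categories. Category names and bit positions are hypothetical. */
enum {
    CAT_FRAMEBUFFERS    = 1u << 0,
    CAT_VERTEX_BUFFERS  = 1u << 1,
    CAT_TRIANGLES       = 1u << 2,
    CAT_SHADER_UNIFORMS = 1u << 3
};

static unsigned g_dirty = 0;   /* one bit per category */

/* Called by state-setting entry points, e.g. glUniform* or glEnable. */
static void MarkDirty(unsigned categoryBit) { g_dirty |= categoryBit; }

/* Called from glDrawArrays/glDrawElements/nngxValidateState. Returns how
 * many categories had to be updated; the draw-call cost grows with it. */
static int Validate(void)
{
    int updated = 0;
    unsigned bit;
    for (bit = 1; bit != 0; bit <<= 1) {
        if (g_dirty & bit) {
            /* issue the commands for this category here */
            ++updated;
        }
    }
    g_dirty = 0;   /* all flags cleared once validation completes */
    return updated;
}
```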
8.5.10. Functions That Issue Command Requests
The GPU begins executing commands when nngxRunCmdlist is called. Rendering instructions are processed one command request at a time, in the order the requests accumulated in the command list.
The following functions issue command requests to a command list.
| Function | Condition | Command Request Added |
| --- | --- | --- |
| nngxSplitDrawCmdlist | Always | Render command request |
| nngxTransferRenderImage | Always | Render command request; post transfer command request |
| glClear | Always | Render command request; memory fill command request |
| glBufferData | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request |
| glBufferSubData | When glBufferData meets the condition above | DMA transfer command request |
| glTexImage2D | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request |
| glCopyTexImage2D | Always | Render command request; copy texture command request |
| glCopyTexSubImage2D | Always | Render command request; copy texture command request |
| glRestoreTextureCollectionsDMP, glRestoreVertexStateCollectionsDMP | When conditions are met by the functions that generate the commands to be restored (glBufferData, glBufferSubData, glTexImage2D) | DMA transfer command request |
| glDrawArrays, glDrawElements | When the reserved uniform dmp_Gas.autoAcc is GL_TRUE and the function is called for the first time after the reserved uniform dmp_FragOperation.mode has its value changed from GL_FRAGOP_MODE_GAS_ACC_DMP | Render command request |

Note 1: Not added when nngxSplitDrawCmdlist is called in advance.
Note 2: One is added per color and depth (stencil) buffer; two are added when both are specified.
Always
Render command request
Copy texture command request
glCopyTexSubImage2D
Always
Render command request
Copy texture command request
glRestoreTextureCollectionsDMP
When conditions are met by the functions that generate commands to be restored (glBufferData
, glBufferSubData
, glTexImage2D
).
DMA transfer command request
glRestoreVertexStateCollectionsDMP
glDrawArrays
When the reserved uniform dmp_Gas.autoAcc
is GL_TRUE
and the function is called for the first time after the reserved uniform dmp_FragOperation.mode
has its value changed from GL_FRAGOP_MODE_GAS_ACC_DMP
.
Render command request
glDrawElements
Note 1: Not added when nngxSplitDrawCmdlist
is called in advance.
Note 2: One added per color and depth (stencil) buffer. Two are added when both are specified.
Matrices are treated as column-major in OpenGL ES. The transpose parameter of the glUniformMatrix functions, which indicates whether a matrix is transposed before it is loaded, can only be set to GL_FALSE. In contrast, DMPGL treats matrices as row-major within the graphics driver and allows GL_TRUE to be specified as the transpose parameter of the glUniformMatrix functions.
Due to this internal difference between DMPGL and OpenGL ES, matrices are implicitly transposed when GL_FALSE is specified as the transpose parameter of the glUniformMatrix functions.
Matrices generated using the MATH library of the CTR-SDK, on the other hand, are row-major just like the DMPGL driver. By specifying GL_TRUE as the transpose parameter of the glUniformMatrix functions, you can set such matrices without transposing them.
Related Functions:
glUniformMatrix
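As a sketch of the point above (the uniform name "uModelView" is hypothetical; glGetUniformLocation and glUniformMatrix4fv are standard OpenGL ES calls), a row-major MATH-library matrix can be loaded without a CPU-side transpose by passing GL_TRUE:

```cpp
// 'program' is a linked DMPGL shader program; "uModelView" is a hypothetical
// uniform name. 'rowMajor' points at 16 floats in row-major order, as
// produced by the CTR-SDK MATH library.
void LoadRowMajorMatrix(GLuint program, const GLfloat* rowMajor)
{
    GLint loc = glGetUniformLocation(program, "uModelView");

    // Standard OpenGL ES requires transpose == GL_FALSE; DMPGL also accepts
    // GL_TRUE, which loads a row-major matrix with no CPU-side transposition.
    glUniformMatrix4fv(loc, 1, GL_TRUE, rowMajor);
}
```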
The current CTR-SDK specifications support textures in the standard OpenGL format. Using the standard format requires that textures be converted to the native format when glTexImage2D is called. (This conversion is done automatically.) The conversion can be omitted by maintaining textures in native format instead.
Related Functions:
glTexImage2D
8.5.7. Setting Uniforms for Vertex Shaders
Overwriting the uniforms in a vertex shader is more efficient if the registers being overwritten are contiguous.
- Case 1: Overwrite all registers ([c0...c15]).
- Case 2: Overwrite non-contiguous blocks of registers ([c0...c3], [c8...c11], [c16...c19], [c24...c27]).
In both cases, the number of registers being overwritten is the same, but Case 1 is more efficient because it allows transfer commands to be consolidated.
Related Functions:
glUniform
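As an illustration of the two cases above (the uniform name "uBlock" and its mapping to the registers c0..c15 are assumptions made for this sketch):

```cpp
// Assume the vertex shader declares "uniform vec4 uBlock[16];" and the
// shader assembler has mapped it to the contiguous registers c0..c15.
void UpdateUniforms(GLuint program, const GLfloat* data /* 16 * 4 floats */)
{
    GLint loc = glGetUniformLocation(program, "uBlock");

    // Case 1: a single call covering the contiguous range [c0..c15];
    // the driver can emit one consolidated transfer command.
    glUniform4fv(loc, 16, data);

    // Case 2 (for contrast): four calls over non-contiguous ranges such as
    // [c0..c3], [c8..c11], [c16..c19], [c24..c27] would write the same
    // number of registers but need separate transfer commands,
    // which is why Case 1 is cheaper.
}
```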
8.5.8. Updating Buffers
The glBufferData and glTexImage2D functions, which rewrite vertex and texture data, access data in main memory. Applying those changes to the GPU falls to the CPU, and because this carries a relatively large overhead, the overall processor load grows in proportion to the number of such calls.
We recommend calling these functions in advance, during application initialization: load vertex buffer and texture data early, and avoid loading data during per-frame operations whenever possible.
The same overhead applies when the glBufferSubData function is used to partially update vertex data. To reduce it, consider gathering all the data required for a partial update into a single chunk and applying it with a single call to glBufferSubData.
Related Functions:
glBufferData, glTexImage2D, glCompressedTexImage2D, glTexImage1D, glBufferSubData, glTexSubImage1D
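The single-chunk idea can be sketched as plain application-side logic (this is not a CTR-SDK API): collect the dirty byte ranges of a buffer, merge the ones that touch or overlap, and issue one update per merged chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// (offset, size) of a dirty region, in bytes.
using Range = std::pair<std::size_t, std::size_t>;

// Merge touching/overlapping dirty regions into the fewest possible chunks.
std::vector<Range> CoalesceRanges(std::vector<Range> dirty)
{
    std::sort(dirty.begin(), dirty.end());
    std::vector<Range> merged;
    for (const Range& r : dirty) {
        if (!merged.empty() &&
            r.first <= merged.back().first + merged.back().second) {
            // Extend the previous chunk to cover this range.
            std::size_t end = std::max(merged.back().first + merged.back().second,
                                       r.first + r.second);
            merged.back().second = end - merged.back().first;
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}

// Each merged chunk would then be applied with a single call, e.g.:
//   glBufferSubData(GL_ARRAY_BUFFER, chunk.first, chunk.second,
//                   cpuShadowCopy + chunk.first);
```

With this approach the number of glBufferSubData calls, and hence the per-call overhead, scales with the number of merged chunks rather than with the number of individual edits.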
8.5.9. Validation
With DMPGL, configuration changes made through GL API calls must be applied to the hardware. The operation that actually writes the corresponding commands to the command buffer is called validation. Validation occurs when the following functions are called.
- glDrawArrays
- glDrawElements
- nngxValidateState
Each setting is divided into several categories, and a state update flag is set for each category that is updated. During validation, these flags determine which categories to update, and updates are applied one category at a time. The processing load of glDrawArrays and glDrawElements therefore increases with the number of categories that must be updated.
When you switch programs with glUseProgram, state update flags are set only for the categories that differ from the previous program. The functions belonging to each category are shown below. When functions such as glUniform and glEnable change the parameters of a particular category, that category's update flag is set.
| Category | Functions Used |
|---|---|
| Framebuffers | glBindFramebuffer, glBindRenderbuffer, glDeleteFramebuffers, glDeleteRenderbuffers, glFramebufferRenderbuffer, glFramebufferTexture2D, glRenderbufferStorage, glReadPixels, glClear |
| Vertex buffers | glBindBuffer, glBufferData, glBufferSubData, glDeleteBuffers |
| Triangles | glEnable, glDisable, glUseProgram, glDepthRangef, glPolygonOffset |
| Lighting LUTs | glUseProgram; glUniform* (dmp_LightEnv.lutEnabledSP, dmp_LightEnv.lutEnabledD0, dmp_LightEnv.lutEnabledD1, dmp_LightEnv.fresnelSelector, dmp_LightEnv.lutEnabledRefl, dmp_FragmentMaterial.sampler{D0,D1,SP,FR,RB,RG,RR}, dmp_FragmentLighting.enabled); glUniformsDMP; glRestoreProgramsDMP |
| Fog LUTs | glUseProgram; glUniform* (dmp_Fog.mode (GL_FOG or GL_GAS_DMP), dmp_Fog.sampler); glUniformsDMP; glRestoreProgramsDMP |
| Procedural texture LUTs | glUseProgram; glUniform* (dmp_Texture[3].ptSampler{Rgb,Alpha,Noise,R,G,B,A}, dmp_Texture[3].ptNoiseEnable, dmp_Texture[3].ptAlphaSeparate, dmp_Texture[3].samplerType); glUniformsDMP; glRestoreProgramsDMP |
| Vertex arrays | glBindBuffer, glEnableVertexAttribArray, glDisableVertexAttribArray, glVertexAttribPointer |
| Current vertices | glBindBuffer |
| Framebuffer access | glEnable, glDisable, glDepthFunc, glEarlyDepthFuncDMP, glColorMask, glDepthMask, glStencilMask, glUseProgram; glUniform* (dmp_FragOperation.mode); glUniformsDMP; glRestoreProgramsDMP |
| Scissor/viewport | glEnable, glDisable, glScissor, glViewport |
| Texture 0 | glUseProgram; glUniform* (dmp_Texture[0].samplerType); glUniformsDMP; glRestoreProgramsDMP; glBindTexture, glDeleteTextures, glCompressedTexImage2D, glCopyTexImage2D, glCopyTexSubImage2D, glTexImage2D, glTexParameteriv, glTexParameterfv |
| Texture 1 | Same as Texture 0, except glUniform* (dmp_Texture[1].samplerType) |
| Texture 2 | Same as Texture 0, except glUniform* (dmp_Texture[2].samplerType) |
| Texture 3 | glUseProgram, glRestoreProgramsDMP, glBindTexture, glDeleteTextures, glCompressedTexImage2D, glCopyTexImage2D, glCopyTexSubImage2D, glTexImage2D, glTexParameteriv, glTexParameterfv |
| Texture LUTs | glBindTexture, glDeleteTextures, glTexImage1D, glTexSubImage1D |
| Program | glUseProgram |
| Shader uniforms | glUseProgram, glUniform* |
| Uniforms for vertex shaders and geometry shaders | glUniformsDMP |
| Rasterization | glDrawArrays, glDrawElements, glUniform* |
| Uniforms for reserved fragment shaders | glUniformsDMP, glRestoreProgramsDMP |
| Shader binaries | glUseProgram |
| Vertex shader binaries | glUseProgram |
| Geometry shader binaries | glUseProgram |
| Geometry shader attachments | glUseProgram |
| Geometry shader detachments | glUseProgram |
| Gas LUTs | glUseProgram; glUniform* (dmp_Fog.mode (GL_GAS_DMP), dmp_Gas.sampler{TR, TG, TB}); glUniformsDMP |
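Because switching programs with glUseProgram sets flags only for the categories that differ from the previous program, grouping draws that share a program reduces how many categories must be revalidated per draw. A minimal application-side sketch (the DrawItem structure is hypothetical, not a DMPGL type):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-draw record; only the program id matters here.
struct DrawItem {
    unsigned program;  // GL program object name
    int      meshId;   // application-side payload, stands in for real draw data
};

// Keep draws that use the same program adjacent, so glUseProgram switches
// (and the category update flags they set) happen as few times as possible.
void SortByProgram(std::vector<DrawItem>& items)
{
    // stable_sort preserves the submission order of draws within one program.
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.program < b.program;
                     });
}
```

Sorting keys other than the program (for example, blend state) can be folded into the comparison in the same way when those categories dominate the validation cost.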
8.5.10. Functions That Issue Command Requests
The GPU begins executing commands when nngxRunCmdlist is called. Rendering instructions are processed one command request at a time, in the order the requests were accumulated in the command list.
The following functions add command requests to a command list.
| Function | Condition | Command Request Added |
|---|---|---|
| nngxSplitDrawCmdlist | Always | Render command request |
| nngxTransferRenderImage | Always | Render command request, post transfer command request |
| glClear | Always | Render command request, memory fill command request |
| glBufferData | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request |
| glBufferSubData | When the corresponding glBufferData call met the condition above | DMA transfer command request |
| glTexImage2D | When NN_GX_MEM_VRAMA or NN_GX_MEM_VRAMB is specified as the first argument | DMA transfer command request |
| glCopyTexImage2D | Always | Render command request, copy texture command request |
| glCopyTexSubImage2D | Always | Render command request, copy texture command request |
| glRestoreTextureCollectionsDMP, glRestoreVertexStateCollectionsDMP | When the conditions are met by the functions that generated the commands being restored (glBufferData, glBufferSubData, glTexImage2D) | DMA transfer command request |
| glDrawArrays, glDrawElements | When the reserved uniform dmp_Gas.autoAcc is GL_TRUE and the function is called for the first time after the reserved uniform dmp_FragOperation.mode has its value changed from GL_FRAGOP_MODE_GAS_ACC_DMP | Render command request |

Note 1: Not added when nngxSplitDrawCmdlist is called in advance.
Note 2: One added per color and depth (stencil) buffer; two are added when both are specified.
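Putting the pieces of this chapter together, a minimal per-frame flow might look like the following sketch. The nngx function names are those used in this manual; their exact signatures are simplified assumptions here, so consult the nngx reference before use.

```cpp
void RenderFrame(GLuint cmdlist)
{
    nngxBindCmdlist(cmdlist);   // make this command list current
    nngxClearCmdlist();         // discard the previous frame's contents

    // ... state changes and glDrawArrays/glDrawElements calls accumulate
    //     rendering commands in the bound command list ...

    nngxSplitDrawCmdlist();     // close the accumulated commands into a
                                // render command request
    nngxRunCmdlist();           // the GPU starts executing queued requests
    nngxWaitCmdlistDone();      // block until the GPU has finished
}
```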
If you measure performance using the CPU profiler, a large proportion of the measurements may be attributed to system blocking (SYSTEM_WAITING). System blocking indicates that, of the multiple threads being profiled, none was active: some thread was blocking in a system call, but the profiler cannot determine which one.
You can use the CPU profiler's select-thread and reverse-call-tree features to trace time attributed to system blocking.
Time is counted as system blocking whenever the profiler cannot identify the last active thread or function. You can eliminate this by using the select-thread feature to limit which threads are profiled. If a function then emerges from what was previously counted as system blocking, use the reverse call tree to find where in the application that function is called and trace it from there.
Most CTR-SDK functions have the latent potential to cause the system to block. If possible, Nintendo recommends using the CPU profiler to check performance at crucial points in your implementation.
8.7. FS Library Performance Details
This section provides reference data on the access rates to each archive when the CTR-SDK FS library is used. Specifically, it presents file system performance measurements for ROM archives, save data (backup region), and extended save data (SD Card), using the FS library from the CTR-SDK 3.x series or later. For ROM archives and save data, measurements assume card applications (Nintendo 3DS Game Cards) or downloadable applications (SD Card).
Although the development card speed for card-based application ROM archives can be set to either "Fast" or "Slow", almost no retail devices perform at the "Slow" level. ROM archive performance does degrade with repeated access, however: when reading a 512 KB file, for example, "Slow"-like performance at first occurs only momentarily (about one or two internal command cycles), but as degradation progresses it gradually becomes more common.
Because the performance of 3DS Game Cards varies depending on the memory chips inside, use the data provided here as a source of reference. Also note that CARD2 performance varies slightly depending on the size of the save region.
To measure the performance of SD Cards, we used media formatted with a formatter that complies with the SD File System Specification. Even for media of the same class, performance may differ depending on the capacity and manufacturer, as well as on the individual unit, so treat the data provided here as a source of reference.
The data for the CTR-SDK 3.x series and later versions is presented together because the results were almost identical.
The following environment was used for the measurements.
- Release build
- Wait simulation function OFF
- Development card speed setting "Fast"
- Normal access priority (PRIORITY_APP_NORMAL)
Note:
If you access the file system with real-time priority, design your performance targets based on the data in the documentation bundled with the CTR-SDK, rather than by relying on actual measurements. For more information about access priority, see the CTR-SDK documentation.
Note:
The performance of ROM archives under emulation in the PARTNER-CTR Debugger has been tuned to nearly match the listed figures (performance of development cards).
The performance of retail cards is almost the same as the performance of development cards.
8.7.1. ROM Archive
The benchmark data presented in this section is reference data that resulted from measuring access to ROM archives on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.1.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times.
To prevent cached data from skewing the results, we attempted to read only unique files whenever possible. The blue line is the CTR flash card and the red line is CARD2.
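One iteration of the loop described above can be sketched roughly as follows. The class name nn::fs::FileInputStream and the exact signatures of TryInitialize, TryRead, and Finalize are assumptions based on the function names used in this section; consult the nn::fs reference for the real interface.

```cpp
#include <nn/fs.h>

// One benchmark iteration: TryInitialize -> TryRead -> Finalize.
// This was repeated 50 times, reading a different file each time so the
// cache would not skew the results.
void ReadOnce(const char* path, void* buffer, size_t size)
{
    nn::fs::FileInputStream file;
    file.TryInitialize(path);                // timed separately

    s32 bytesRead = 0;
    file.TryRead(&bytesRead, buffer, size);  // the batch read being measured

    file.Finalize();                         // timed separately
}
```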
The tables below show the time taken by operations other than the TryRead function itself.

| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 0.045 ms | 0.021 ms | 0.164 ms |
| Finalize | 0.006 ms | 0.002 ms | 0.024 ms |
8.7.1.2. Stream Reads
The figures below show the time required by the TryRead function to read files of the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function.
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache. The blue line is the CTR flash card and the red line is CARD2.
8.7.2. Save Data
The benchmark data presented in this section is reference data that resulted from measuring access to the backup region on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.2.1. Read
The figures below show the results for batch-reading save data of the specified size, written on a development card with 512 KB of backup memory or on CARD2, 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The red line shows results with no automatic redundancy; the blue line shows results with automatic redundancy enabled.
The backup memory set for CARD2 was 256 MB. CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For reads of 4 KB or less, every file read was different, but for larger sizes caching may have had an effect. The effect of caching varies depending on factors such as whether a file is read immediately after being written or the same file is read repeatedly; note that even with automatic redundancy enabled, caching had little effect for files of 16 KB or more.
The tables below show the time taken by operations other than the TryRead function itself.

| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No automatic redundancy | 1.276 ms | 0.708 ms | 7.194 ms |
| TryInitialize | Automatic redundancy enabled | 1.104 ms | 0.715 ms | 7.031 ms |
| Finalize | No automatic redundancy | 0.480 ms | 0.345 ms | 1.910 ms |
| Finalize | Automatic redundancy enabled | 0.395 ms | 0.277 ms | 2.034 ms |
8.7.2.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the TryWrite and TryFlush functions when writing save data of the specified size 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. Benchmarks were recorded for two types of save data: saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
The backup memory set for CARD2 was 256 MB.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results when a new file was created ("No file" below), and the blue line shows results when the data overwrote a file of the same size ("Overwrite same size" below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
There is almost no difference in the time required by the TryWrite function whether a file is created or overwritten; the difference in processing speed (approximately 350 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize and Finalize functions under the different measurement conditions.

| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 620.133 ms | 478.054 ms | 926.344 ms |
| TryInitialize | Overwrite file of the same size | 1.269 ms | 0.709 ms | 3.889 ms |
| Finalize | No file | 0.489 ms | 0.337 ms | 2.084 ms |
| Finalize | Overwrite file of the same size | 0.483 ms | 0.337 ms | 2.203 ms |
The TryInitialize operation takes significantly more time when creating a new file than when overwriting an existing file.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results when a new file was created ("No file" below), and the blue line shows results when the data overwrote a file of the same size ("Overwrite same size" below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
The benchmark data shows no major difference in processing time between creating and overwriting a file, other than the CommitSaveData operations being around 100 ms faster when data of the same size is overwritten.
The tables below show the time taken by the TryInitialize and Finalize functions under the different measurement conditions.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 1.267 ms | 0.813 ms | 3.365 ms |
| TryInitialize | Overwrite file of the same size | 1.997 ms | 0.667 ms | 11.511 ms |
| Finalize | No file | 0.368 ms | 0.259 ms | 1.807 ms |
| Finalize | Overwrite file of the same size | 0.384 ms | 0.260 ms | 1.713 ms |
When a file was overwritten as opposed to created, some processes were dramatically slower. Otherwise, the results showed almost no differences under the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.2.3. Mounting
The tables below show the time required, as measured on a test unit, for the FormatSaveData, MountSaveData, and Unmount functions. The backup memory size for the development card was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| FormatSaveData | No automatic redundancy (factory state) | 2857.726 ms | 2843.064 ms | 2872.938 ms |
| FormatSaveData | No automatic redundancy | 8535.636 ms | 8339.167 ms | 8661.502 ms |
| FormatSaveData | Automatic redundancy enabled (factory state) | 2476.430 ms | 2462.935 ms | 2495.924 ms |
| FormatSaveData | Automatic redundancy enabled | 8186.308 ms | 8015.401 ms | 8278.390 ms |
| MountSaveData | No automatic redundancy | 26.955 ms | 25.682 ms | 29.150 ms |
| MountSaveData | Automatic redundancy enabled | 30.495 ms | 28.150 ms | 33.798 ms |
| Unmount | No automatic redundancy | 1.476 ms | 1.081 ms | 3.318 ms |
| Unmount | Automatic redundancy enabled | 1.124 ms | 0.873 ms | 2.324 ms |
It takes much more processing time to format a backup region when data has been written to it than when the backup region is in factory condition. The time it takes to format the region also depends on the amount of data written.
8.7.2.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size of the development card was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryDeleteFile | No automatic redundancy | 658.566 ms | 528.313 ms | 946.492 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.465 ms | 0.705 ms | 4.004 ms |
8.7.3. Extended Save Data
This section shows benchmark data for accessing the extended save data region on media having the same specifications and capacity as the SD Card packaged with the CTR system. Use this data as a reference. The benchmark data provided was collected using a sample device with its sole purpose being measurement testing, so there is no guarantee that processing will complete within the times presented.
Note:
The media used for measurements exhibited a 2.5% drop in performance when reading and a 5.2% drop when writing for the transfer rate of a 505 KB file versus that of a 504 KB file. However, the degree of degradation of performance and the file size at which performance degrades might differ depending on the media’s capacity, manufacturer, or other factors, even for media of the same class. Because there are variations in performance depending on the media being used, be sure to treat the data listed here as a source of reference.
8.7.3.1. Mounting
The tables below show the time required, as measured on a test unit, to execute the MountExtSaveData function for mounting extended save data, the DeleteExtSaveData function for deleting it, the CreateExtSaveData function for creating it, the MountExtSaveData function for remounting it, and the Unmount function for unmounting it.
| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| MountExtSaveData | 40.844 ms | 39.278 ms | 42.136 ms |
| DeleteExtSaveData | 105.296 ms | 77.204 ms | 165.070 ms |
| CreateExtSaveData | 2346.382 ms | 2055.237 ms | 2468.994 ms |
| MountExtSaveData (remount) | 38.256 ms | 36.832 ms | 40.576 ms |
| Unmount | 1.842 ms | 1.619 ms | 2.319 ms |
8.7.3.2. Creating Files
The tables below show the time required, as measured on a test unit, for the TryCreateFile function to execute.
| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| TryCreateFile | 766.200 ms | 363.559 ms | 2125.721 ms |
8.7.3.3. Loading Files
The figures below show the results for batch-reading extended save data files 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required by the TryRead function was measured on a test unit.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For capacities up to 8 MB, every read operation was to a different, unique file. For 16 MB capacity, the testing involved 30 files, and for 32 MB the testing involved 16 files.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| TryInitialize | 14.098 ms | 11.246 ms | 21.259 ms |
| Finalize | 1.433 ms | 1.050 ms | 3.150 ms |
8.7.3.4. Writing Files
The figures below show the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing files to extended save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times.
To prevent cached data from skewing the results, the number of files used for writing was dependent on the media capacity. For capacities up to 8 MB, every write operation was to a different, unique file. For 16 MB capacity, the testing used 30 files and for 32 MB, the testing used 16 files.
The tables below show the time taken by the TryInitialize function and the Finalize function.
| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| TryInitialize | 15.406 ms | 12.528 ms | 23.478 ms |
| Finalize | 1.451 ms | 1.040 ms | 3.259 ms |
8.7.3.5. Deleting Files
8.7.4. ROM Archives (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to downloadable applications’ ROM archives on media with the same specifications and capacity as the SD Card bundled with the CTR system. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.4.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to only read unique files whenever possible.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
| --- | --- | --- | --- |
| TryInitialize | 0.046 ms | 0.022 ms | 0.220 ms |
| Finalize | 0.006 ms | 0.002 ms | 0.023 ms |
8.7.4.2. Stream Reads
The figures below show the time required by the TryRead function to read files of the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache.
8.7.5. Save Data (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to a downloadable application's backup region on media with the same specifications and capacity as the SD Card bundled with the CTR system. Use this data as a reference. Because the benchmark data was collected using a sample device with its sole purpose being measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.5.1. Read
The figures below show the results for batch-reading the specified size of save data written to a downloadable application’s save data backup region set to 512 KB, with this operation performed a total of 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function to execute was measured on a test unit. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different, but for larger sizes, caching may have had an impact. Also note that the effect of caching changes when reading immediately after writing or when reading the same file more than once.
The tables below show the time taken by operations other than those of the TryRead function.
Table 8-17. Downloadable Applications Time Required for Operations Other Than the TryRead Function (Save Data)
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryInitialize | No automatic redundancy | 1.259 ms | 0.715 ms | 7.613 ms |
| TryInitialize | Automatic redundancy enabled | 1.242 ms | 0.714 ms | 5.633 ms |
| Finalize | No automatic redundancy | 0.492 ms | 0.349 ms | 2.099 ms |
| Finalize | Automatic redundancy enabled | 0.398 ms | 0.278 ms | 2.015 ms |
8.7.5.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. The downloadable application backup memory size was 512 KB. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 200 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-18. Downloadable Applications Time Taken by the TryInitialize and Finalize Functions (No Automatic Redundancy)
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryInitialize | No file | 198.837 ms | 55.597 ms | 304.459 ms |
| TryInitialize | Overwrite file of the same size | 1.232 ms | 0.720 ms | 4.068 ms |
| Finalize | No file | 0.467 ms | 0.333 ms | 2.152 ms |
| Finalize | Overwrite file of the same size | 0.466 ms | 0.337 ms | 2.134 ms |
The TryInitialize function takes more time when a file is created than when an existing file is overwritten.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-19. Downloadable Applications Time Taken by the TryInitialize and Finalize Functions (Automatic Redundancy Enabled)
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryInitialize | No file | 1.256 ms | 0.810 ms | 3.412 ms |
| TryInitialize | Overwrite file of the same size | 1.535 ms | 0.662 ms | 8.886 ms |
| Finalize | No file | 0.396 ms | 0.260 ms | 2.002 ms |
| Finalize | Overwrite file of the same size | 0.385 ms | 0.260 ms | 2.033 ms |
When a file was overwritten as opposed to created, some processes were dramatically slower. Otherwise, the results showed almost no differences under the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.5.3. Mounting
The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| FormatSaveData | No automatic redundancy (right after importing) | 430.314 ms | 360.571 ms | 490.037 ms |
| FormatSaveData | No automatic redundancy | 436.077 ms | 428.289 ms | 446.790 ms |
| FormatSaveData | Automatic redundancy enabled (right after importing) | 206.321 ms | 181.559 ms | 253.866 ms |
| FormatSaveData | Automatic redundancy enabled | 197.499 ms | 180.707 ms | 215.180 ms |
| MountSaveData | No automatic redundancy | 24.079 ms | 22.886 ms | 26.050 ms |
| MountSaveData | Automatic redundancy enabled | 24.674 ms | 22.770 ms | 25.738 ms |
| Unmount | No automatic redundancy | 1.694 ms | 1.298 ms | 3.496 ms |
| Unmount | Automatic redundancy enabled | 1.338 ms | 0.895 ms | 2.622 ms |
For these measurements, the state immediately after importing was reproduced by using SaveDataFiler to delete the backup region.
8.7.5.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
| --- | --- | --- | --- | --- |
| TryDeleteFile | No automatic redundancy | 171.093 ms | 55.857 ms | 299.145 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.546 ms | 0.712 ms | 4.581 ms |
8.7.6. Comparison of CTR Flash Cards and CARD2
The following summarizes how CARD2 performance compares to the CTR flash card.
8.7.6.1. ROM Archive
ROM archive performance is approximately equal for CTR flash cards and CARD2.
8.7.6.2. Save Data
No Automatic Redundancy
CARD2 file read performance increased by a maximum factor of approximately 5x.
File write performance increased by a maximum factor of approximately 30x.
Automatic Redundancy Enabled
CARD2 file read performance increased by a maximum factor of approximately 5x.
File write performance increased by a maximum factor of approximately 33x.
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.5.3. Mounting
The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData
, MountSaveData
, and Unmount
. The backup memory size for the downloadable application was 512 KB.
Process
Measurement Conditions
Average
Best Score
Worst Score
FormatSaveData
No automatic redundancy (right after importing)
430.314 ms
360.571 ms
490.037 ms
No automatic redundancy
436.077 ms
428.289 ms
446.790 ms
Automatic redundancy enabled (right after importing)
206.321 ms
181.559 ms
253.866 ms
Automatic redundancy enabled
197.499 ms
180.707 ms
215.180 ms
MountSaveData
No automatic redundancy
24.079 ms
22.886 ms
26.050 ms
Automatic redundancy enabled
24.674 ms
22.770 ms
25.738 ms
Unmount
No automatic redundancy
1.694 ms
1.298 ms
3.496 ms
Automatic redundancy enabled
1.338 ms
0.895 ms
2.622 ms
For the measurement, the state immediately after importing is reproduced by deleting the backup region acted on by the SaveDataFiler function.
8.7.5.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile
function. The backup memory size for the downloadable application was 512 KB.
Process
Measurement Conditions
Average
Best Score
Worst Score
TryDeleteFile
No automatic redundancy
171.093 ms
55.857 ms
299.145 ms
Automatic redundancy enabled
1.546 ms
0.712 ms
4.581 ms
The benchmark data presented in this section is reference data obtained by measuring access to ROM archives on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device dedicated to measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.1.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times.
To prevent cached data from skewing the results, we attempted to read only unique files whenever possible. The blue line is the CTR flash card and the red line is CARD2.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 0.045 ms | 0.021 ms | 0.164 ms |
| Finalize | 0.006 ms | 0.002 ms | 0.024 ms |
8.7.1.2. Stream Reads
The figures below show the time required by the TryRead function to read files having the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function.
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache. The blue line is the CTR flash card and the red line is CARD2.
The benchmark data presented in this section is reference data obtained by measuring access to the backup region on development cards (CTR flash cards) and CARD2. Because the benchmark data was collected using a sample device dedicated to measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.2.1. Read
The figures below show the results for batch-reading save data of the specified size, written on a development card or CARD2 with 512 KB of backup memory, 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
The backup memory set for CARD2 was 256 MB. CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different; for larger sizes, caching could have had an effect. Note that even with automatic redundancy enabled, caching had little effect for files of 16 KB or more, because the effect of caching depends on factors such as whether a file is read immediately after writing or whether the same file is read again.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No automatic redundancy | 1.276 ms | 0.708 ms | 7.194 ms |
| | Automatic redundancy enabled | 1.104 ms | 0.715 ms | 7.031 ms |
| Finalize | No automatic redundancy | 0.480 ms | 0.345 ms | 1.910 ms |
| | Automatic redundancy enabled | 0.395 ms | 0.277 ms | 2.034 ms |
8.7.2.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data of the specified size 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
The backup memory set for CARD2 was 256 MB.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 350 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 620.133 ms | 478.054 ms | 926.344 ms |
| | Overwrite file of the same size | 1.269 ms | 0.709 ms | 3.889 ms |
| Finalize | No file | 0.489 ms | 0.337 ms | 2.084 ms |
| | Overwrite file of the same size | 0.483 ms | 0.337 ms | 2.203 ms |
The TryInitialize function takes significantly more time when creating a new file than when overwriting an existing file.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
CARD2 data is shown in a separate figure, and the data for the CTR flash card are shown as dashed lines for comparison.
The benchmark data shows no major difference in processing time between creating and overwriting a file, other than the CommitSaveData operations being around 100 ms faster when data of the same size is overwritten.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 1.267 ms | 0.813 ms | 3.365 ms |
| | Overwrite file of the same size | 1.997 ms | 0.667 ms | 11.511 ms |
| Finalize | No file | 0.368 ms | 0.259 ms | 1.807 ms |
| | Overwrite file of the same size | 0.384 ms | 0.260 ms | 1.713 ms |
Some operations were dramatically slower when a file was overwritten than when one was created; otherwise, the results showed almost no difference between the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.2.3. Mounting
The tables below show the time required, as measured on a test unit, for the following functions: FormatSaveData, MountSaveData, and Unmount. The backup memory size for the development card was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| FormatSaveData | No automatic redundancy (factory state) | 2857.726 ms | 2843.064 ms | 2872.938 ms |
| | No automatic redundancy | 8535.636 ms | 8339.167 ms | 8661.502 ms |
| | Automatic redundancy enabled (factory state) | 2476.430 ms | 2462.935 ms | 2495.924 ms |
| | Automatic redundancy enabled | 8186.308 ms | 8015.401 ms | 8278.390 ms |
| MountSaveData | No automatic redundancy | 26.955 ms | 25.682 ms | 29.150 ms |
| | Automatic redundancy enabled | 30.495 ms | 28.150 ms | 33.798 ms |
| Unmount | No automatic redundancy | 1.476 ms | 1.081 ms | 3.318 ms |
| | Automatic redundancy enabled | 1.124 ms | 0.873 ms | 2.324 ms |
It takes much more processing time to format a backup region when data has been written to it than when the backup region is in factory condition. The time it takes to format the region also depends on the amount of data written.
8.7.2.4. Deleting Files
The tables below show the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size of the development card was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryDeleteFile | No automatic redundancy | 658.566 ms | 528.313 ms | 946.492 ms |
| | Automatic redundancy enabled | 1.465 ms | 0.705 ms | 4.004 ms |
8.7.3. Extended Save Data
This section shows benchmark data for accessing the extended save data region on media having the same specifications and capacity as the SD Card packaged with the CTR system. Use this data as a reference. The benchmark data provided was collected using a sample device dedicated to measurement testing, so there is no guarantee that processing will complete within the times presented.
Note:
For the media used in these measurements, the transfer rate for a 505 KB file was 2.5% lower when reading and 5.2% lower when writing than the rate for a 504 KB file. However, the degree of performance degradation and the file size at which it occurs might differ depending on the media's capacity, manufacturer, or other factors, even for media of the same class. Because performance varies depending on the media being used, treat the data listed here only as a reference.
8.7.3.1. Mounting
The tables below show the time required, as measured on a test unit, to execute the MountExtSaveData function for mounting extended save data, the DeleteExtSaveData function for deleting it, the CreateExtSaveData function for creating it, the MountExtSaveData function for remounting it, and the Unmount function for unmounting it.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| MountExtSaveData | 40.844 ms | 39.278 ms | 42.136 ms |
| DeleteExtSaveData | 105.296 ms | 77.204 ms | 165.070 ms |
| CreateExtSaveData | 2346.382 ms | 2055.237 ms | 2468.994 ms |
| MountExtSaveData (remount) | 38.256 ms | 36.832 ms | 40.576 ms |
| Unmount | 1.842 ms | 1.619 ms | 2.319 ms |
8.7.3.2. Creating Files
The tables below show the time required, as measured on a test unit, for the TryCreateFile function to execute.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryCreateFile | 766.200 ms | 363.559 ms | 2125.721 ms |
8.7.3.3. Loading Files
The figures below show the results for batch-reading extended save data files 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function’s processing was measured on a test unit.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. For capacities up to 8 MB, every read operation was to a different, unique file. For 16 MB capacity, the testing involved 30 files, and for 32 MB the testing involved 16 files.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 14.098 ms | 11.246 ms | 21.259 ms |
| Finalize | 1.433 ms | 1.050 ms | 3.150 ms |
8.7.3.4. Writing Files
The figures below show the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing files to extended save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times.
To prevent cached data from skewing the results, the number of files used for writing was dependent on the media capacity. For capacities up to 8 MB, every write operation was to a different, unique file. For 16 MB capacity, the testing used 30 files and for 32 MB, the testing used 16 files.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 15.406 ms | 12.528 ms | 23.478 ms |
| Finalize | 1.451 ms | 1.040 ms | 3.259 ms |
8.7.3.5. Deleting Files
8.7.4. ROM Archives (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to a downloadable application’s ROM archives on media with the same specifications and capacity as the SD Card bundled with the CTR system. Because the benchmark data was collected using a sample device dedicated to measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.4.1. Batch Read
The figures below show the time required by the TryRead function to batch-read data of the specified size 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to only read unique files whenever possible.
The tables below show the time taken by operations other than those of the TryRead function.
| Process | Average | Best Score | Worst Score |
|---|---|---|---|
| TryInitialize | 0.046 ms | 0.022 ms | 0.220 ms |
| Finalize | 0.006 ms | 0.002 ms | 0.023 ms |
8.7.4.2. Stream Reads
The figures below show the time required by the TryRead function to read files having the same size 50 times while varying the buffer size. A test unit measured the time required to execute the TryRead function. The green line shows the results obtained for downloadable applications. For reference, the blue dashed line shows the measurement results obtained for a development card (set to “Fast”).
To prevent cached data from skewing the results, we attempted to read only unique 8-MB files not stored in the cache.
8.7.5. Save Data (Downloadable Applications)
The benchmark data presented in this section resulted from measuring access to a downloadable application's backup region on media with the same specifications and capacity as the SD Card bundled with the CTR system. Use this data as a reference. Because the benchmark data was collected using a sample device dedicated to measurement testing, there is no guarantee that processing will complete within the times presented.
8.7.5.1. Read
The figures below show the results for batch-reading the specified size of save data written to a downloadable application’s save data backup region set to 512 KB, with this operation performed a total of 50 times. In short, TryInitialize → TryRead → Finalize was performed 50 times. The time required for the TryRead function to execute was measured on a test unit. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files of 4 KB or less, all the files read were different; for larger sizes, caching may have had an impact. Also note that the effect of caching changes when a file is read immediately after writing or when the same file is read more than once.
The tables below show the time taken by operations other than those of the TryRead function.
Table 8-17. Downloadable Applications: Time Required for Operations Other Than the TryRead Function (Save Data)

| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No automatic redundancy | 1.259 ms | 0.715 ms | 7.613 ms |
| | Automatic redundancy enabled | 1.242 ms | 0.714 ms | 5.633 ms |
| Finalize | No automatic redundancy | 0.492 ms | 0.349 ms | 2.099 ms |
| | Automatic redundancy enabled | 0.398 ms | 0.278 ms | 2.015 ms |
8.7.5.2. Writing
This section shows the results, measured on a test unit, for the time required to execute the operations of the TryWrite and TryFlush functions when writing save data 50 times. In short, TryInitialize → TryWrite → TryFlush → Finalize was performed 50 times. The downloadable application backup memory size was 512 KB. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData function.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
There is almost no difference in the time required by the TryWrite function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 200 ms) is due to the time taken by the TryFlush function.
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-18. Downloadable Applications: Time Taken by the TryInitialize and Finalize Functions (No Automatic Redundancy)

| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 198.837 ms | 55.597 ms | 304.459 ms |
| | Overwrite file of the same size | 1.232 ms | 0.720 ms | 4.068 ms |
| Finalize | No file | 0.467 ms | 0.333 ms | 2.152 ms |
| | Overwrite file of the same size | 0.466 ms | 0.337 ms | 2.134 ms |
The TryInitialize function takes more time when a file is created than when an existing file is overwritten.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData function in addition to those of the TryWrite and TryFlush functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
The tables below show the time taken by the TryInitialize function and the Finalize function under the different measurement conditions.
Table 8-19. Downloadable Applications: Time Taken by the TryInitialize and Finalize Functions (Automatic Redundancy Enabled)

| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryInitialize | No file | 1.256 ms | 0.810 ms | 3.412 ms |
| | Overwrite file of the same size | 1.535 ms | 0.662 ms | 8.886 ms |
| Finalize | No file | 0.396 ms | 0.260 ms | 2.002 ms |
| | Overwrite file of the same size | 0.385 ms | 0.260 ms | 2.033 ms |
Some operations were dramatically slower when a file was overwritten than when one was created; otherwise, the results showed almost no difference between the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.5.3. Mounting
The table below shows the time required, as measured on a test unit, for the FormatSaveData, MountSaveData, and Unmount functions. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| FormatSaveData | No automatic redundancy (right after importing) | 430.314 ms | 360.571 ms | 490.037 ms |
| FormatSaveData | No automatic redundancy | 436.077 ms | 428.289 ms | 446.790 ms |
| FormatSaveData | Automatic redundancy enabled (right after importing) | 206.321 ms | 181.559 ms | 253.866 ms |
| FormatSaveData | Automatic redundancy enabled | 197.499 ms | 180.707 ms | 215.180 ms |
| MountSaveData | No automatic redundancy | 24.079 ms | 22.886 ms | 26.050 ms |
| MountSaveData | Automatic redundancy enabled | 24.674 ms | 22.770 ms | 25.738 ms |
| Unmount | No automatic redundancy | 1.694 ms | 1.298 ms | 3.496 ms |
| Unmount | Automatic redundancy enabled | 1.338 ms | 0.895 ms | 2.622 ms |
For these measurements, the state immediately after importing was reproduced by deleting the backup region with the SaveDataFiler tool.
8.7.5.4. Deleting Files
The table below shows the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size for the downloadable application was 512 KB.
| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryDeleteFile | No automatic redundancy | 171.093 ms | 55.857 ms | 299.145 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.546 ms | 0.712 ms | 4.581 ms |
8.7.6. Comparison of CTR Flash Cards and CARD2
The following summarizes CARD2 performance relative to the CTR flash card.
8.7.6.1. ROM Archive
ROM archive performance is approximately equal between CTR flash cards and CARD2.
8.7.6.2. Save Data
No Automatic Redundancy
- File read performance increased by a maximum factor of approximately 5x.
- File write performance increased by a maximum factor of approximately 30x.
Automatic Redundancy Enabled
- File read performance increased by a maximum factor of approximately 5x.
- File write performance increased by a maximum factor of approximately 33x.
The figures below show the results for batch-reading the specified size of save data written to a downloadable application’s save data backup region set to 512 KB, with this operation performed a total of 50 times. In short, TryInitialize
→ TryRead
→ Finalize
was performed 50 times. The time required for the TryRead
function to execute was measured on a test unit. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled.
To minimize the impact of caching, the measurement program changed the number of files used based on the capacity. When reading files at the 4KB mark or less, all the files read were different, but for larger sizes, there was the possibility that caching would have an impact. Also note that the effect of caching changes when reading immediately after writing or when reading the same file more than once.
The tables below show the time taken by operations other than those of the TryRead
function.
Process |
Measurement Conditions |
Average |
Best Score |
Worst Score |
---|---|---|---|---|
|
No automatic redundancy |
1.259 ms |
0.715 ms |
7.613 ms |
Automatic redundancy enabled |
1.242 ms |
0.714 ms |
5.633 ms |
|
|
No automatic redundancy |
0.492 ms |
0.349 ms |
2.099 ms |
Automatic redundancy enabled |
0.398 ms |
0.278 ms |
2.015 ms |
This section shows the results, measured on a test unit, for the time required to execute the TryWrite
function and the TryFlush
function’s operations when writing save data 50 times. In short, TryInitialize
→ TryWrite
→ TryFlush
→ Finalize
was performed 50 times. The downloadable application backup memory size was 512 KB. Benchmarks were recorded with two types of save data: data saved with and without automatic redundancy enabled. When automatic redundancy was enabled, the measurement also included the operations of the CommitSaveData
function.
No Automatic Redundancy
The figures below show the benchmark results for a save data region with no automatic redundancy. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
There is almost no difference in the time required by the TryWrite
function, regardless of whether a file is created or overwritten. The difference in processing speed (of approximately 200 ms) is due to the time taken by the TryFlush
function.
The tables below show the time taken by the TryInitialize
function and the Finalize
function under the different measurement conditions.
Process |
Measurement Conditions |
Average |
Best Score |
Worst Score |
---|---|---|---|---|
|
No file |
198.837 ms |
55.597 ms |
304.459 ms |
Overwrite file of the same size |
1.232 ms |
0.720 ms |
4.068 ms |
|
|
No file |
0.467 ms |
0.333 ms |
2.152 ms |
Overwrite file of the same size |
0.466 ms |
0.337 ms |
2.134 ms |
The TryInitialize
function takes more time when a file is created than when an existing file is overwritten.
Automatic Redundancy Enabled
The figures below show the benchmark results for a save data region with automatic redundancy enabled. Because automatic redundancy is enabled, the total time includes the operations of the CommitSaveData
function in addition to those of the TryWrite
and TryFlush
functions. The green line shows results obtained when a new file was created (“No file” below), and the blue line shows results when the data overwrote a file of the same size (“Overwrite same size” below).
The tables below show the time taken by the TryInitialize
function and the Finalize
function under the different measurement conditions.
Process |
Measurement Conditions |
Average |
Best Score |
Worst Score |
---|---|---|---|---|
|
No file |
1.256 ms |
0.810 ms |
3.412 ms |
Overwrite file of the same size |
1.535 ms |
0.662 ms |
8.886 ms |
|
|
No file |
0.396 ms |
0.260 ms |
2.002 ms |
Overwrite file of the same size |
0.385 ms |
0.260 ms |
2.033 ms |
When a file was overwritten as opposed to created, some processes were dramatically slower. Otherwise, the results showed almost no differences under the two measurement conditions.
The figures below show the difference in transfer rate with and without automatic redundancy. The red line shows results obtained with no automatic redundancy, and the blue line shows results when automatic redundancy was enabled. In both cases, the measurement condition was “No file.”
If your application supports the auto-save feature, Nintendo recommends against using a large amount of save data, because there is only a limited amount of time for finishing this process before the system powers off.
8.7.5.3. Mounting
The table below shows the time required, as measured on a test unit, for the FormatSaveData, MountSaveData, and Unmount functions. The backup memory size for the downloadable application was 512 KB.

| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| FormatSaveData | No automatic redundancy (right after importing) | 430.314 ms | 360.571 ms | 490.037 ms |
| FormatSaveData | No automatic redundancy | 436.077 ms | 428.289 ms | 446.790 ms |
| FormatSaveData | Automatic redundancy enabled (right after importing) | 206.321 ms | 181.559 ms | 253.866 ms |
| FormatSaveData | Automatic redundancy enabled | 197.499 ms | 180.707 ms | 215.180 ms |
| MountSaveData | No automatic redundancy | 24.079 ms | 22.886 ms | 26.050 ms |
| MountSaveData | Automatic redundancy enabled | 24.674 ms | 22.770 ms | 25.738 ms |
| Unmount | No automatic redundancy | 1.694 ms | 1.298 ms | 3.496 ms |
| Unmount | Automatic redundancy enabled | 1.338 ms | 0.895 ms | 2.622 ms |
For these measurements, the state immediately after importing was reproduced by using SaveDataFiler to delete the backup region.
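As the table shows, FormatSaveData is roughly an order of magnitude more expensive than MountSaveData, so formatting should only happen when a mount attempt fails rather than on every startup. The following sketch shows that control flow; the functions here are stand-in stubs that mirror the names in the table, not the SDK's actual signatures.

```cpp
// Stand-in stubs mirroring the operations timed above; a real
// application would call the SDK equivalents instead.
bool MountSaveData()  { return true; }  // cheap (~25 ms in the table)
bool FormatSaveData() { return true; }  // expensive (~200-440 ms in the table)
void Unmount()        {}                // cheap (~1-3 ms in the table)

// Mount the save archive, formatting only if the initial mount fails
// (for example, on first boot or after corruption). This keeps the
// expensive format off the common startup path.
bool PrepareSaveData() {
    if (MountSaveData()) {
        return true;
    }
    if (!FormatSaveData()) {
        return false;
    }
    return MountSaveData();
}
```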
8.7.5.4. Deleting Files
The table below shows the time required, as measured on a test unit, for the TryDeleteFile function. The backup memory size for the downloadable application was 512 KB.

| Process | Measurement Conditions | Average | Best Score | Worst Score |
|---|---|---|---|---|
| TryDeleteFile | No automatic redundancy | 171.093 ms | 55.857 ms | 299.145 ms |
| TryDeleteFile | Automatic redundancy enabled | 1.546 ms | 0.712 ms | 4.581 ms |
CARD2 performance, compared with the CTR flash card, changes as shown below.
8.7.6.1. ROM Archive
ROM archive performance is approximately equal between CTR flash cards and CARD2.
8.7.6.2. Save Data
No Automatic Redundancy
CARD2 file read performance increased by a maximum factor of approximately 5x.
File write performance increased by a maximum factor of approximately 30x.
Automatic Redundancy Enabled
CARD2 file read performance increased by a maximum factor of approximately 5x.
File write performance increased by a maximum factor of approximately 33x.
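These maximum factors can be used for rough planning, bearing in mind that they are best-case figures and actual gains depend on access pattern and data size. A minimal sketch (the input time is a hypothetical illustration, not a figure from this document):

```cpp
// Best-case planning estimate: scale a time measured on a CTR flash
// card by the maximum speedup factor quoted above (e.g. ~30x for file
// writes without automatic redundancy). Actual gains will be smaller
// for unfavorable access patterns.
double EstimateCard2TimeMs(double ctrTimeMs, double maxSpeedup) {
    return ctrTimeMs / maxSpeedup;
}
```

For example, a write measured at 300 ms on a CTR flash card would take on the order of 10 ms on CARD2 in the best case.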