This chapter provides information about graphics processing.
Graphics processing is divided between the CPU and GPU. To optimize system performance, you must determine whether CPU or GPU processing presents a bottleneck and take appropriate measures to deal with it.
3.1. About the Graphics Library (Important)
When using 3DS graphics, 3D commands and instructions to the GPU called command requests are put on a command list and the GPU operates based on these commands. The graphics library can be divided into two parts: the part that controls command requests, and the part that actually generates commands for rendering. The nngx library is used for controlling the command list and generating general commands.
The following four libraries are used to generate 3D commands for actual rendering.
- GR Library
- NW4C
- GD Library
- GL Library (DMPGL) (not recommended)
In addition to using these libraries, you can use a method of directly issuing commands from the application. The following sections detail the various methods of generating commands. Please use the library best suited to your purpose.
3.1.1. GR Library
This graphics library is used to assist the direct creation of 3D commands. Because no status management is carried out inside this library, it executes faster than other graphics libraries, and also uses less memory. To use this library, you must have an in-depth understanding of how each GPU register is set (described in the CTR Programming Manual: Advanced Graphics). This library is also hard to use because the user must handle detailed processing such as error checking.
3.1.2. NW4C (NintendoWare for CTR)
Combining both sound and graphics capabilities, NW4C is middleware provided by Nintendo to aid in application development. The graphics capabilities of NW4C generate direct commands without using other graphics libraries. In addition, libraries have been optimized to provide full performance when implemented using NintendoWare. To optimize performance, we recommend using NintendoWare.
3.1.3. GD Library
This graphics library has a structure similar to that of general 3D library APIs. The API has its own independent format, but no understanding of individual GPU registers is required to use GD. Nintendo recommends GD if you do not intend to use NW4C and want to perform graphics processing without learning about GPU registers.
3.1.4. GL Library (DMPGL)
This library provides a rich range of functionality, including API functions based on OpenGL ES, error handling, and state-difference management. However, compared to the other libraries, the processes of the GL library put a heavy load on the CPU because the library issues many redundant commands.
Note:
Use of this library is not recommended because CPU performance is greatly reduced.
3.1.5. Direct Creation of 3D Commands
This method directly creates register write commands to be applied to the GPU. Fast operations can be achieved by creating only those commands that are necessary. However, this method is the most difficult way to implement graphics because it requires an in-depth understanding and detailed knowledge of how to set each GPU register, as described in the CTR Programming Manual: Advanced Graphics. We recommend using the GR library, which is implemented in a way that assists direct creation of commands to be issued to the GPU.
3.2. Command Lists
When the 3DS system performs a graphics operation, the CPU generates rendering instructions (3D commands, command requests) and the GPU processes them. The command list consists of command requests, which are the instructions that the CPU conveys to the GPU, in addition to 3D commands, which are instructions that the GPU references directly.
When graphics are executed, the CPU conveys instructions to the GPU based on command requests, and the GPU performs various operations in accordance with the content of those command requests. If the command request received from the CPU is a render command request, the GPU directly references the 3D command buffer for rendering.
For more information about the functions that issue command requests, see Section 8.5.10. Functions That Issue Command Requests. The CPU load for processing command requests and the GPU wait-time both increase with the number of command requests processed during rendering.
Note:
Because command requests are primarily processed by the system core, the execution of large volumes of command requests can affect the processing efficiency of the system core.
For more information about the processing performed by the system core, see Chapter 7. Processing Handled by the System Core.
3.2.1. About Issuing Commands (Important)
You can significantly reduce CPU usage by issuing only the necessary commands to the command list directly, without going through DMPGL.
Note:
For more information about command specifications, see the CTR Programming Manual: Advanced Graphics.
Note:
The implementation of NW4C directly issues only the commands that are needed to improve performance. Also, the GR library that is used to assist in directly issuing commands is included in the CTR-SDK.
Related Functions:
nngxAdd3DCommand
3.2.2. Double-Buffering Command Lists (Important)
When there is a single command list, you must wait until the GPU has finished processing, and then clear the command list and generate the render command for the next frame. You can make the CPU and GPU run in parallel by double-buffering the command list. With a double-buffered approach, the CPU issues rendering instructions to one of the command lists while the GPU processes the other command list.
Note:
If double-buffering a command list, the frame for which graphics rendering is actually being performed is delayed one frame from the frame for which commands accumulate.
For a sample implementation that demonstrates how to double-buffer a command list, see SampleDemos/gx/Api/CommandListDouble.
Related Functions:
nngxBindCmdList, nngxRunCmdList
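The per-frame swap described above can be sketched as a small model. The struct and function names below are illustrative only; the actual binding and execution are performed by the CTR-SDK functions nngxBindCmdList and nngxRunCmdList, which appear here only in comments.

```c
#include <assert.h>

/* Illustrative model of double-buffered command lists: each frame the GPU
   executes the list that the CPU filled during the previous frame, while
   the CPU fills the other list. */
typedef struct {
    int cpu_list;  /* index (0 or 1) of the list the CPU is filling */
    int gpu_list;  /* index (0 or 1) of the list the GPU is executing */
} CmdListPair;

static void begin_frame(CmdListPair *p)
{
    /* Start GPU execution of the list just filled (nngxRunCmdList), then
       rebind the other list for the CPU (nngxBindCmdList) so that the next
       frame's commands accumulate there. */
    p->gpu_list = p->cpu_list;
    p->cpu_list = 1 - p->cpu_list;
}
```

Because the GPU always executes the previous frame's list, rendering lags command accumulation by one frame, as the note above describes.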
3.2.3. Reusing 3D Commands (Important)
You can lighten the processing load required of the CPU for issuing 3D commands by reusing 3D commands that have already been generated. This is particularly effective for the frequently used commands for setting vertex shaders and lookup tables, and for the 3D commands for the overall material settings included in them.
3.2.4. Command Caches
The command cache feature is provided to create command lists in advance for repeated use. By using a command cache, you can reduce the number of duplicate commands that are issued and the processing load on the CPU.
There are two ways to use a command cache. One is to copy and add command buffers. The other is to add command requests that reference command buffers.
Note:
For more information about command caches, see the CTR Programming Manual: Advanced Graphics. For more information about how to implement them, see SampleDemos/gx/Api/CommandCacheSimple.
Related Functions:
nngxStartCmdlistSave, nngxStopCmdlistSave
3.2.5. Using Command Buffer Jumps (Important)
With a command buffer jump, the next command buffer execution address that the GPU references can be moved based on an instruction (3D command) in the command buffer. By using this feature you can make the GPU reference a series of command buffers without incurring any command request processing by the CPU.
The following functions have been prepared for use of this command buffer jump feature.
- nngxAddJumpCommand
Use this function to make a jump from the current command buffer to another command buffer.
- nngxAddSubRoutineCommand
Use this function to call another command buffer from the current command buffer as a subroutine.
Compared to copying and adding command caches, using command buffer jumps can significantly reduce the processing load placed on the CPU that normally results from the copying of command buffers. It can also significantly reduce the number of command requests compared to the number added when referencing command caches. The result is both a lighter processing load on the system core and less waiting by the GPU for the command request processing of the CPU.
Reloading 3D commands after a jump involves a slightly higher GPU cost than simply executing a series of command buffers. Because processing speed can slow due to excessive command buffer jumping, we recommend using the feature for only limited groups of 3D commands.
Note:
For more information about the command execution registers involved in command buffer jumping, see the CTR Programming Manual: Advanced Graphics.
Related Functions:
nngxAddJumpCommand, nngxAddSubRoutineCommand
3.2.5.1. Subroutine Placement
If you plan to call a command buffer as a subroutine, placing the subroutine in VRAM may improve processing speed when the GPU accesses the command buffer, which leads to an overall reduction in GPU processing time.
When commands that could potentially cause frequent access to the command buffer are placed in VRAM, the bottleneck shifts from the command buffer to actual command processing. Accordingly, if access to the command buffer is not a bottleneck (this depends on the implementation of the application), little improvement can be expected from using this technique.
In the GPU rendering pipeline, commands to modules that are upstream from rasterization in the pipeline are processed using one cycle per command. When many commands to those upstream modules are made, command buffer access is typically more of a bottleneck than command processing. (See the "3D Command Execution Cost" section in the CTR Programming Manual: Advanced Graphics.)
Accordingly, placing commands that load shaders in VRAM could significantly reduce GPU processing time.
Conversely, commands to the rasterization module or modules that are downstream from rasterization in the pipeline are processed using two cycles per command. When there are many commands to those downstream modules, command processing is typically more of a bottleneck than command buffer access.
The size of the command buffer also affects the bottlenecking tendencies described above. The larger the command buffer, the more likely command buffer access is the bottleneck because memory must be accessed more frequently. Command buffer size is affected by whether you use single commands (header + one piece of data) or burst commands (header + multiple pieces of data), even if the number of commands is the same in both cases.
Also, if a command buffer being called as a subroutine is placed in device memory, the cost of switching when executing that subroutine is added directly to the GPU processing time.
3.2.6. Reusing Commands for Stereoscopic Display (Important)
To render a scene stereoscopically, you must render it twice with only the camera position changed.
By saving the commands from the first rendering pass in a command cache (or by some other means) and then reusing all of them—except for camera-related settings—in the second rendering pass, you can dramatically reduce the number of commands that must be issued for rendering.
Note:
Fragment lights are set in camera coordinates. To represent lighting precisely, you must convert fragment lights into each camera’s coordinate system, which means you must also change commands related to fragment lights.
3.3. Vertex Shader Programs and Lookup Tables
3.3.1. Vertex Shader Programs
Because vertex shader processing is run on each vertex, we recommend that you reduce the number of vertex lights, branch instructions, and so on as much as possible. Care is required because the generation of 3D commands that switch (load) vertex shaders places a load on the CPU. At runtime, shader switching also places a load on the GPU.
To reduce the number of times a vertex shader is loaded, take the following steps.
- Manage the rendering order to minimize the number of times that vertex shaders are switched.
- Pack multiple shader programs into a single binary. This way, you can switch shaders by simply changing the shader entry point without loading a binary.
3.3.2. Reducing Lookup Table (LUT) Loads
When you configure lookup tables, the CPU issues the lookup tables themselves as commands to the GPU. As a result, there is a CPU cost associated with switching lookup tables. Because there are registers for each lookup table type (such as D0, D1, and SP), we recommend that you avoid switching the lookup table whenever possible.
When two materials are each stored in separate LUTs, for example, you could set one LUT in D0 and one in D1, and then turn them on and off for the corresponding materials.
3.3.3. Changes in Loading Speed Due to 3D Command Buffer Address and Size
The address and size of the 3D command buffer may affect loading speed during execution.
To execute the 3D command buffer, use one of the following two methods.
- Execution using 3D executable commands
Loading speed is affected by the size of data as measured between the addresses of consecutive delimiter commands added using nngxFlush3DCommand or nngxSplitDrawCmdlist.
- Execution using a register that executes a command buffer (command buffer jump)
Loading speed is affected by the address and size of the command buffer added using nngxAddJumpCommand, or of a command buffer that is executed as a subroutine, or that returns to the calling process from a subroutine, added using nngxAddSubRoutineCommand.
More efficient transfer rates may result when using the following buffer addresses and sizes.
- The address has 128-byte alignment and the size of the data is a multiple of 256 bytes (256, 512, 768 bytes, and so on).
- The address is not 128-byte aligned, but the size as measured from the immediately previous address with 128-byte alignment to the end of the 3D command buffer is a multiple of 256 bytes.
For example, if the address and size of the 3D command buffer are 0x20000010 and 0x1F0, respectively, the immediately preceding 128-byte-aligned address is 0x20000000, just 0x10 before. The data size from there to the end of the buffer is 0x1F0 + 0x10, or 0x200, which is a multiple of 256.
Although the address and size of the 3D command buffer can influence loading speed as described above due to the way the GPU is implemented, you may not see large benefits due to factors such as the location of the buffer, the content of 3D commands, and memory access conflicts with other modules.
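As a sketch, the two address/size conditions above can be combined into a single check. The function name is illustrative and not part of the SDK; the constants follow the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a 3D command buffer's address and size meet one of the
   two transfer-efficiency conditions described above. */
static bool is_transfer_efficient(uint32_t addr, uint32_t size)
{
    if ((addr % 128u) == 0u) {
        /* Condition 1: 128-byte-aligned address, size a multiple of 256. */
        return (size % 256u) == 0u;
    }
    /* Condition 2: the span from the immediately preceding 128-byte-aligned
       address to the end of the buffer is a multiple of 256 bytes. */
    uint32_t aligned = addr & ~(uint32_t)127u;
    return (((addr - aligned) + size) % 256u) == 0u;
}
```

With the example from the text, is_transfer_efficient(0x20000010, 0x1F0) holds because 0x10 + 0x1F0 = 0x200.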
3.4. Vertex Buffers and Texture Buffers
3.4.1. Placement of Vertex Buffers and Texture Buffers (Important)
Vertex buffers, vertex index buffers, and texture buffers can be placed in main memory or VRAM. Because the GPU can access VRAM more efficiently than main memory, you can place buffers in VRAM to improve performance.
With the default settings, the GPU is given priority when there is a main memory access conflict between the CPU and the GPU. If the CPU is executing a process that heavily depends on memory access, this can cause a precipitous drop in performance. To reduce the processing load on the CPU, place data that is directly accessed by the GPU into VRAM instead of main memory.
Note:
Use nn::gx::SetMemAccessPrioMode to configure main memory access priorities for the CPU and the GPU.
For information about the structure of the CPU, the GPU, and the various memory devices, see Section 8.3. Hardware Configuration.
If vertex attributes and textures (multi-textures) to be used simultaneously are placed in both VRAM and main memory, main memory becomes a bottleneck on access speed. All buffers that are used simultaneously must therefore be placed in VRAM.
Placing multi-textures in different VRAMs may lower performance. We recommend placing textures that are used simultaneously in the same VRAM.
3.4.2. Effective Use of the Vertex Cache (Important)
The 3DS system has a vertex cache that can temporarily cache processed vertex data using its vertex index as a key, and then use that cache data later for the same input vertex.
The vertex cache is only effective in rendering primitives when glDrawElements (or an equivalent 3D command) uses a vertex buffer for rendering. Up to 32 entries can be cached in the vertex cache. If more than 32 entries are input, the existing entries are overwritten, starting with the least-recently accessed entries. The cache is flushed each time glDrawElements is called.
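The behavior above can be modeled with a small simulation: a 32-entry cache keyed by vertex index with least-recently-used replacement, flushed per draw call. All names are illustrative; this is a model of the documented behavior, not SDK code.

```c
#include <assert.h>
#include <stdbool.h>

#define VC_ENTRIES 32  /* documented capacity of the vertex cache */

typedef struct {
    int index[VC_ENTRIES];          /* cached vertex indices (-1 = empty) */
    unsigned last_use[VC_ENTRIES];  /* access time for LRU replacement */
    unsigned tick;
} VertexCacheModel;

/* Flush, as happens at each glDrawElements call. */
static void vc_reset(VertexCacheModel *c)
{
    for (int i = 0; i < VC_ENTRIES; i++) {
        c->index[i] = -1;
        c->last_use[i] = 0;
    }
    c->tick = 0;
}

/* Returns true on a hit; on a miss, inserts the index, overwriting the
   least-recently accessed entry when the cache is full. */
static bool vc_access(VertexCacheModel *c, int vertex_index)
{
    int lru = 0;
    c->tick++;
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (c->index[i] == vertex_index) {
            c->last_use[i] = c->tick;
            return true;
        }
        if (c->index[i] == -1) { lru = i; break; }  /* first empty slot */
        if (c->last_use[i] < c->last_use[lru]) lru = i;
    }
    c->index[lru] = vertex_index;
    c->last_use[lru] = c->tick;
    return false;
}
```

Reusing a vertex index within the last 32 distinct accesses hits the cache, so the processed vertex data can be reused for that input vertex.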
3.4.3. Improving Efficiency of Vertex Indices
When specifying vertex indices, rendering efficiency can be improved by using the vertex cache to its fullest.
In the case of similar vertex indices, the efficiency of GL_TRIANGLES is no worse than that of GL_TRIANGLE_STRIP. In some cases, rendering with GL_TRIANGLES results in better efficiency because adjustments are sometimes necessary when using GL_TRIANGLE_STRIP with degenerate polygons. Using optimized vertex indices, GL_TRIANGLES allows for the most efficient rendering.
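As a rough arithmetic illustration of why GL_TRIANGLES can compete with GL_TRIANGLE_STRIP, compare index counts when several strips must be joined with degenerate triangles. Two extra indices per join is a common scheme, assumed here; actual adjustments depend on the mesh.

```c
#include <assert.h>

/* Index count when s strips of t triangles each are joined into a single
   GL_TRIANGLE_STRIP call using 2 degenerate indices per join. */
static int strip_index_count(int t, int s) { return s * (t + 2) + 2 * (s - 1); }

/* Index count for the same geometry rendered as a plain GL_TRIANGLES list. */
static int triangles_index_count(int t, int s) { return 3 * t * s; }
```

For many short strips (for example, one triangle per strip), the degenerate overhead makes GL_TRIANGLES cheaper on index count; for one long strip, the strip form wins, though vertex cache behavior still decides actual efficiency.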
3.4.4. Updating Buffers
When vertex and texture data are updated, the GPU accesses data in main memory and thus the CPU internally applies the changes to the GPU. Because this process puts a relatively large overhead on the system, the overall processor load will increase proportionally to the number of calls to such functions.
A single command request is issued with each transfer for buffers placed in VRAM; this also increases the CPU load.
We recommend that you call vertex and texture buffer update functions in advance, when the application is being initialized: load vertex buffer and texture data early, avoid loading data during per-frame operations whenever possible, and transfer data all at once.
3.4.5. Using Interleaved Arrays
When rendering a model, it is more efficient to access memory using vertex buffers as a single interleaved array that combines vertex attributes rather than as separate arrays for each vertex attribute. This is particularly relevant when vertex arrays are placed in main memory because then the CPU and GPU compete over memory access. You can minimize the adverse effect on CPU performance by making memory access more efficient.
Note:
Beginning with NW4C 1.1.0, an interleaved array is used for the vertex buffer.
3.4.6. Reducing Texture Sizes
The GPU has L1 and L2 texture caches, which hold recently fetched texture regions. By using smaller textures, you can reduce the number of texture fetches. We also recommend smaller textures because they reduce VRAM usage.
Note:
Each texture unit has its own L1 texture cache, which holds 256 bytes.
The L2 texture cache holds 8 KB and is shared by all texture units.
3.4.7. Using Mipmaps (Important)
When you shrink and display a texture, texture cache misses increase and GPU performance drops. If mipmaps are used, however, an appropriately sized mipmap level is accessed when the texture is shrunk. Use mipmaps for any textures that you expect to be shrunk onscreen.
Conversely, you can conserve VRAM by not using mipmaps for textures that are guaranteed to always be displayed at a fixed size.
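To quantify the VRAM tradeoff: a full mipmap chain adds roughly one third to the base texture's texel count. A quick sketch for square power-of-two textures:

```c
#include <assert.h>
#include <stdint.h>

/* Total texel count of a full mipmap chain, from the base level down to 1x1. */
static uint32_t mip_chain_texels(uint32_t size)
{
    uint32_t total = 0;
    for (;;) {
        total += size * size;
        if (size == 1) break;
        size /= 2;
    }
    return total;
}
```

A 256×256 texture holds 65,536 texels; its full chain holds 87,381, about a 33% increase, which is the VRAM you save by omitting mipmaps for fixed-size textures.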
3.4.8. Using the ETC1 Texture Format (Important)
The ETC1 texture format reduces the per-texel data size to 4 bits. In addition to decreasing VRAM usage, a reduction in texture size has the advantage of economizing the memory ranges that are accessed. Only the ETC1 format is kept compressed in the texture cache, which is advantageous in terms of cache efficiency.
We recommend using the ETC1 format whenever possible.
Note:
Texture formats other than ETC1—including ETC1A4—are expanded to the (32-bit) RGBA8 format in the cache.
3.4.9. Texture Image Layout Orientation (Important)
You can achieve better texture cache hit rates by aligning the orientation of the texture image to the order in which fragments are generated in the framebuffer (Layout Pattern 2 in Figure 3-2) instead of aligning it to the LCD display orientation (Layout Pattern 1 in Figure 3-2). This can improve display performance, particularly for 2D backgrounds and characters. The normal LCD layout orientation is shown in Figure 3-1, below.
The content of the framebuffer copied to the display buffer is what is actually shown on the LCD, but the important point to note here is the relationship between the fragment generation order and the texture image pixel order. The system rasterizes images in 8×8 pixel blocks (the blue box in Figure 3-1), but it fetches textures in 8×4 texel blocks. The green box in Figure 3-1 shows the orientation of the texture block fetched in Layout Pattern 1, and the red box shows the orientation for Layout Pattern 2.
Figure 3-2 shows the two texture image layout patterns. Because texture formats are handled in 8×8 pixel blocks, the data ordering in memory is either a1 → a2 → b1 → b2 → … for Layout Pattern 1, or C1 → C2 → A1 → A2 → … for Layout Pattern 2. In other words, when rasterizing the block shown in Figure 3-1, the blocks fetched in Layout Pattern 1 would be c2 → c1 → c2 → c1, whereas in Layout Pattern 2 the blocks fetched would be just C1 → C2. Pattern 2 requires fewer block switches, and the next block to rasterize is the very next texture block in memory.
The yellow lines indicate the switching order for block fetching when two blocks are rasterized.
3.4.10. Differences Between nngxUpdateBuffer and nngxUpdateBufferLight (Important)
When the CPU changes vertex or texture data that resides on device memory directly accessed by the GPU, the application must guarantee the consistency of this data. In other words, if the CPU manipulates (decompresses, copies, or changes) this data, it must apply those changes to the GPU side.
The nngxUpdateBuffer function takes longer to execute in extended applications than in standard applications. For this reason, Nintendo recommends calling the nngxUpdateBufferLight function instead of nngxUpdateBuffer when you are only applying data manipulated by the CPU to the GPU. However, you must call the nngxUpdateBuffer function to access GPU rendering results with the CPU.
Because calls to nngxUpdateBuffer entail some overhead, calling it multiple times might lead to a drop in performance. If you manipulate multiple pieces of data with the CPU, do not call the function for each piece of data. Instead, call it once for the entire device memory region when all your CPU data operations are finished. This reduces the processing load.
Note:
Using the FS library to decompress data on device memory counts as manipulation of this data by the CPU.
Calling the nngxUpdateBuffer function twice on a small region and calling it once on the entire device memory region have roughly equivalent costs.
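One way to follow the advice above is to track the CPU-modified regions and synchronize once over a single covering range when all operations are finished. The Region type and merge_regions helper below are illustrative, not SDK functions.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t addr, size; } Region;

/* Returns one region covering all n modified regions, so that a single
   nngxUpdateBuffer-style call can replace one call per region. */
static Region merge_regions(const Region *r, int n)
{
    Region out = { r[0].addr, r[0].size };
    uint32_t end = r[0].addr + r[0].size;
    for (int i = 1; i < n; i++) {
        if (r[i].addr < out.addr) out.addr = r[i].addr;
        uint32_t e = r[i].addr + r[i].size;
        if (e > end) end = e;
    }
    out.size = end - out.addr;
    return out;
}
```

Because two calls on small regions cost about as much as one call on the whole region (see the note above), merging before synchronizing is generally the cheaper pattern.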
3.4.10.1. The nngxAddVramDmaCommand and nngxAddVramDmaCommandNoCacheFlush Functions
The application must also take data consistency into account when transferring data from main memory to VRAM. In general, if the application uses DMPGL or the nngxAddVramDmaCommand function to transfer the data, the function flushes the cache for that data automatically.
In contrast, if the application uses the nngxAddVramDmaCommandNoCacheFlush function to transfer data, the function does not flush the cache. When you transfer data from multiple regions, you can reduce the processing load by first using the nngxUpdateBufferLight function so that the application ensures data consistency, and then using the nngxAddVramDmaCommandNoCacheFlush function to transfer the data.
3.4.11. Vertex Data Loading and Vertex Shader Processing
If no geometry shader is in use, and under ideal memory access conditions, up to 52 cycles of vertex shader processing can be hidden behind the loading of each individual vertex, regardless of whether the vertex data is placed in device memory or VRAM.
This number of cycles is a theoretical value. The actual value can be reduced by several factors: differences in data size, the number of load arrays, changes in memory access speed due to other processes, and whether different types are combined in the vertex data attributes.
3.4.12. Effect of Combinations of Vertex Attributes on the Vertex Data Transfer Rate
Depending on the combination of type and size of vertex attribute data making up the load array, look-ahead transfer can increase the vertex data transfer rate when using a vertex buffer.
Look-ahead transfer is used if the following conditions are satisfied.
("The number of attributes of other than GL_FLOAT
type" + "The number of attributes with a data length of 1")
<= ("The number of attributes of GL_FLOAT
type with a data length of 4" + "The number of attributes of GL_FLOAT
type with a data length of 3" / 2)
The data size of "the number of attributes of other than GL_FLOAT
type" and the data type of "the number of attributes with a data length of 1" do not matter. Vertex attributes that meet more than one condition at the same time are counted for each condition category they fulfill. For example, a vertex attribute of GL_BYTE
type with a data length of 1 is included in the count for "the number of attributes of other than GL_FLOAT
type" and "the number of attributes with a data length of 1."
If the two sides of the comparison for whether to perform look-ahead transfer are equal, the determination depends on the amount of data per load array. The less data there is, the faster the transfer rate. If the amount of vertex data is the same in both cases, the determination depends on the number of attributes in the load array. The fewer number of attributes in the load array, the faster the transfer rate.
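The counting rule above can be written as a small predicate. The Attr type is illustrative, and the halved weight of 3-component GL_FLOAT attributes is expressed without fractions by doubling both sides of the comparison.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool is_float;  /* true if the attribute is of GL_FLOAT type */
    int  length;    /* number of components: 1 to 4 */
} Attr;

/* Evaluates the look-ahead transfer condition; an attribute that satisfies
   more than one category is counted once per category. */
static bool lookahead_transfer(const Attr *attrs, int n)
{
    int lhs = 0, f4 = 0, f3 = 0;
    for (int i = 0; i < n; i++) {
        if (!attrs[i].is_float) lhs++;                       /* non-GL_FLOAT */
        if (attrs[i].length == 1) lhs++;                     /* data length 1 */
        if (attrs[i].is_float && attrs[i].length == 4) f4++;
        if (attrs[i].is_float && attrs[i].length == 3) f3++;
    }
    /* lhs <= f4 + f3 / 2, multiplied through by 2 to stay in integers */
    return 2 * lhs <= 2 * f4 + f3;
}
```

The GL_BYTE length-1 example from the text counts on the left side twice, so on its own it can never enable look-ahead transfer.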
3.4.13. Address Alignment of the Vertex Array
It may be possible when using a vertex buffer to increase the efficiency of vertex array transfer during rendering by using a vertex array with 32-byte address alignment. The vertex array address as used here means the address obtained by adding the offset specified by glVertexAttribPointer (the ptr argument) to the vertex buffer address.
It may not always be possible to improve transfer efficiency, because how much the transfer rate increases as compared to using a vertex array with an address having other than 32-byte alignment depends on the type and size of the vertex attribute, the location the vertex array is stored, and the content of the vertex index. Even if transfer performance is improved, it may not necessarily lead to better overall system performance unless vertex array transfer is the bottleneck.
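The address in question is simply the vertex buffer address plus the ptr offset; a minimal check (illustrative helper, not an SDK function):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the effective vertex array address (buffer address plus the
   offset passed as ptr to glVertexAttribPointer) has 32-byte alignment. */
static bool vertex_array_32byte_aligned(uint32_t buffer_addr, uint32_t ptr_offset)
{
    return ((buffer_addr + ptr_offset) % 32u) == 0u;
}
```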
3.5. Fragment Shaders
3.5.1. Number of Fragment Lights
The number of fragment lights affects GPU fill processing. The processing load increases proportionally with respect to both the fill area size and the number of lights. Avoid using fragment lighting when possible. For example, you can use static lighting and bake colors into vertices instead of sending normal data.
Note:
The GPU specifications always enable at least one fragment light. If you do not use the fragment lighting results, the per-fragment processing load does not change but you can reduce processing for the vertex shader’s quaternion output.
3.5.2. Layer Configuration (Important)
Choose layer configurations with smaller numbers (less processing) whenever possible.
If you are only using the D0 LUT, for example, layer configuration 7 is more expensive than layer configuration 1 by two clock cycles times the number of fragment lights for each fill area.
3.5.3. Unused Texture Unit Settings (Important)
When a texture is set in a texture unit, it is fetched even if (for example) it is not used by a texture combiner. To avoid unnecessary memory access, do not configure a texture unit that is not accessed by the combiners.
Note:
Accessing memory to fetch unnecessary textures is detrimental to power consumption.
3.5.4. Interpolation Between Mipmap Levels
When you configure linear interpolation to be used between mipmap levels, the number of texture fetches increases. Because this increases the GPU load and also leads to access conflicts, choose “nearest” interpolation between mipmap levels whenever possible.
3.5.5. Using or Not Using Blend Operations
Without blending, the framebuffer is only accessed for writes; with blending, read accesses also occur. Setting blend operations when they are not needed entails unnecessary memory accesses. Configure blend operations appropriately.
3.6. Framebuffers and Display Buffers
3.6.1. Avoiding the Use of glClear
Each time you use glClear to clear the framebuffer, two command requests are added, resulting in a large CPU load.
Instead of using glClear, you can render a background model first with depth tests disabled and depth writes enabled to eliminate the framebuffer clear operation and save CPU and GPU processing.
For example, if you blend a polygon that covers the entire screen at the end of rendering to change the rendered brightness, the settings described above allow you to skip the clear operation for the depth buffer during the next rendering pass.
Note:
Settings that combine no depth testing with writing depth data cannot be made when using DMPGL. A library such as GR must be used instead to issue these commands.
3.6.2. Render Buffer Location (Important)
You can place the color buffer and depth buffer in different locations in VRAM (VRAM-A or VRAM-B) to parallelize fill operations when the buffers are used simultaneously. This improves performance.
3.6.3. Sharing Framebuffers to Conserve VRAM (Important)
When rendering to each of the screens, you can conserve VRAM by reusing the same framebuffer first for the right eye, then for the left eye, and finally for the lower screen.
3.6.4. Display Buffer Placement
When the display buffer is placed in main memory, main memory is accessed regularly because the GPU transfers the content of the display buffer to the LCD every frame.
When the display buffer is placed in VRAM, the content of the display buffer is transferred to the LCD without accessing main memory. Although this puts pressure on VRAM, it is useful in avoiding access conflicts when the cameras or other devices access main memory often.
Note:
If competing memory access prevents data from being transferred to the LCD on time, artifacts may be displayed on the LCD. You can also use nn::gx::SetMemAccessPrioMode to configure access priorities for main memory.
When using 3DS graphics, 3D commands and instructions to the GPU called command requests are put on a command list and the GPU operates based on these commands. The graphics library can be divided into two parts: the part that controls command requests, and the part that actually generates commands for rendering. The nngx library is used for controlling the command list and generating general commands.
The following four libraries are used to generate 3D commands for actual rendering.
- GR Library
- NW4C
- GD Library
- GL Library (DMPGL) (not recommended)
In addition to using these libraries, you can use a method of directly issuing commands from the application. The following sections detail the various methods of generating commands. Please use the library best suited to your purpose.
3.1.1. GR Library
This graphics library is used to assist the direct creation of 3D commands. Because no status management is carried out inside this library, it executes faster than other graphics libraries, and also uses less memory. To use this library, you must have an in-depth understanding of how each GPU register is set (described in the CTR Programming Manual: Advanced Graphics). This library is also hard to use because the user must handle detailed processing such as error checking.
3.1.2. NW4C (NintendoWare for CTR)
Combining both sound and graphics capabilities, NW4C is middleware provided by Nintendo to aid in application development. The graphics capabilities of NW4C generate direct commands without using other graphics libraries. In addition, libraries have been optimized to provide full performance when implemented using NintendoWare. To optimize performance, we recommend using NintendoWare.
3.1.3. GD Library
This graphics library has a similar structure to general 3D library APIs. The API has an independent format, but no understanding of the individual GPU registers is required in order to use GD. Nintendo recommends using GD if you intend not to use NW4C and plan to perform graphics processing without learning information about GPU registers.
3.1.4. GL Library (DMPGL)
This library provides a rich range of functionality, including API functions based on OpenGL ES, error handling, and state-difference management. However, compared to the other libraries, the processes of the GL library put a heavy load on the CPU because the library issues many redundant commands.
Note:
Use of this library is not recommended because CPU performance is greatly reduced.
3.1.5. Direct Creation of 3D Commands
This method directly creates register write commands to be applied to the GPU. Fast operations can be achieved by creating only those commands that are necessary. However, this method is the most difficult way to implement graphics because it requires an in-depth understanding and detailed knowledge of how to set each GPU register, as described in the CTR Programming Manual: Advanced Graphics. We recommend using the GR library, which is implemented in a way that assists direct creation of commands to be issued to the GPU.
3.2. Command Lists
When the 3DS system performs a graphics operation, the CPU generates rendering instructions (3D commands and command requests) and the GPU processes them. The command list consists of command requests, which are the instructions that the CPU conveys to the GPU, in addition to 3D commands, which are instructions that the GPU references directly.
When graphics are executed, the CPU conveys instructions to the GPU based on command requests, and the GPU performs various operations in accordance with the content of those command requests. If the command request received from the CPU is a render command request, the GPU directly references the 3D command buffer for rendering.
For more information about the functions that issue command requests, see Section 8.5.10. Functions That Issue Command Requests. The CPU load for processing command requests and the GPU wait time both increase with the number of command requests processed during rendering.
Because command requests are primarily processed by the system core, the execution of large volumes of command requests can affect the processing efficiency of the system core.
For more information about the processing performed by the system core, see Chapter 7. Processing Handled by the System Core.
3.2.1. About Issuing Commands (Important)
You can significantly reduce CPU usage by issuing only those commands that are necessary to the command list without using the DMPGL.
Note:
For more information about command specifications, see the CTR Programming Manual: Advanced Graphics.
Note:
The implementation of NW4C directly issues only the commands that are needed to improve performance. Also, the GR library that is used to assist in directly issuing commands is included in the CTR-SDK.
Related Functions:
nngxAdd3DCommand
3.2.2. Double-Buffering Command Lists (Important)
When there is a single command list, you must wait until the GPU has finished processing, and then clear the command list and generate the render command for the next frame. You can make the CPU and GPU run in parallel by double-buffering the command list. With a double-buffered approach, the CPU issues rendering instructions to one of the command lists while the GPU processes the other command list.
Note:
When a command list is double-buffered, the frame for which rendering is actually performed lags one frame behind the frame in which its commands accumulate.
For a sample implementation that demonstrates how to double-buffer a command list, see SampleDemos/gx/Api/CommandListDouble.
Related Functions:
nngxBindCmdList, nngxRunCmdList
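The double-buffering described above boils down to a ping-pong swap of two command lists. The following minimal C sketch models only the index bookkeeping; the struct and function names are hypothetical, and a real frame loop would additionally bind the fill-side list with nngxBindCmdList and execute the run-side list with nngxRunCmdList.

```c
#include <assert.h>

/* Hypothetical sketch of the double-buffer (ping-pong) pattern.
   The real loop would bind lists[fill] for CPU command generation and
   run lists[run] on the GPU; here we model only the index swap. */
typedef struct {
    int fill;  /* command list the CPU is filling this frame */
    int run;   /* command list the GPU is executing this frame */
} CmdListPair;

static void cmdlist_pair_init(CmdListPair *p) {
    p->fill = 0;
    p->run  = 1;
}

/* Called once per frame after the GPU finishes the 'run' list:
   the freshly filled list becomes the one the GPU runs next. */
static void cmdlist_pair_swap(CmdListPair *p) {
    int t = p->fill;
    p->fill = p->run;
    p->run  = t;
}
```

The swap is what produces the one-frame delay noted above: commands accumulated in frame N are executed by the GPU in frame N+1.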
3.2.3. Reusing 3D Commands (Important)
You can lighten the processing load required of the CPU for issuing 3D commands by reusing 3D commands that have already been generated. This is particularly effective for the frequently used commands for setting vertex shaders and lookup tables, and for the 3D commands for the overall material settings included in them.
3.2.4. Command Caches
The command cache feature is provided to create command lists in advance for repeated use. By using a command cache, you can reduce the number of duplicate commands that are issued and the processing load on the CPU.
There are two ways to use a command cache. One is to copy and add command buffers. The other is to add command requests that reference command buffers.
Note:
For more information about command caches, see the CTR Programming Manual: Advanced Graphics. For more information about how to implement them, see SampleDemos/gx/Api/CommandCacheSimple.
Related Functions:
nngxStartCmdlistSave, nngxStopCmdlistSave
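The two reuse methods above can be modeled in plain C. This sketch is illustrative only: the cache is just an array of 32-bit command words, and the types and helpers are hypothetical stand-ins for what nngxStartCmdlistSave/nngxStopCmdlistSave would capture in real code.

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define LIST_CAPACITY 64

typedef struct {
    uint32_t words[LIST_CAPACITY];
    size_t   count;
} CmdList;

/* Method 1: copy the cached words into the active list.
   The CPU pays a memcpy each time the cache is reused. */
static void add_by_copy(CmdList *list, const uint32_t *cache, size_t n) {
    memcpy(&list->words[list->count], cache, n * sizeof(uint32_t));
    list->count += n;
}

/* Method 2: add a command request that references the cache in place.
   No copy occurs; the GPU later reads the cached buffer directly,
   at the cost of one extra command request. */
typedef struct {
    const uint32_t *buffer;
    size_t          count;
} CmdRequest;

static CmdRequest add_by_reference(const uint32_t *cache, size_t n) {
    CmdRequest req = { cache, n };
    return req;
}
```

The trade-off matches the text: copying costs CPU time per reuse, while referencing adds command requests that the system core must process.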
3.2.5. Using Command Buffer Jumps (Important)
With a command buffer jump, the next command buffer execution address that the GPU references can be moved based on an instruction (3D command) in the command buffer. By using this feature you can make the GPU reference a series of command buffers without incurring any command request processing by the CPU.
The following functions are provided for the command buffer jump feature.
- nngxAddJumpCommand: Jumps from the current command buffer to another command buffer.
- nngxAddSubRoutineCommand: Calls another command buffer from the current command buffer as a subroutine.
Compared to copying and adding command caches, using command buffer jumps can significantly reduce the processing load placed on the CPU that normally results from the copying of command buffers. It can also significantly reduce the number of command requests compared to the number added when referencing command caches. The result is both a lighter processing load on the system core and less waiting by the GPU for the command request processing of the CPU.
Reloading 3D commands at a jump has a slightly higher GPU cost than simply executing a continuous series of command buffers. Because excessive command buffer jumping can therefore slow processing, we recommend using this feature only for limited groups of 3D commands.
Note:
For more information about the command execution registers involved in command buffer jumping, see the CTR Programming Manual: Advanced Graphics.
Related Functions:
nngxAddJumpCommand, nngxAddSubRoutineCommand
3.2.5.1. Subroutine Placement
If you plan to call a command buffer as a subroutine, placing the subroutine in VRAM may improve processing speed when the GPU accesses the command buffer, which leads to an overall reduction in GPU processing time.
When commands that could cause frequent access to the command buffer are placed in VRAM, the bottleneck shifts from command buffer access to actual command processing. Accordingly, if access to the command buffer is not a bottleneck (this depends on the implementation of the application), little improvement can be expected from this technique.
In the GPU rendering pipeline, commands to modules that are upstream from rasterization in the pipeline are processed using one cycle per command. When many commands to those upstream modules are made, command buffer access is typically more of a bottleneck than command processing. (See the "3D Command Execution Cost" section in the CTR Programming Manual: Advanced Graphics.)
Accordingly, placing commands that load shaders in VRAM could significantly reduce GPU processing time.
Conversely, commands to the rasterization module or modules that are downstream from rasterization in the pipeline are processed using two cycles per command. When there are many commands to those downstream modules, command processing is typically more of a bottleneck than command buffer access.
The size of the command buffer also affects the bottlenecking tendencies described above. The larger the command buffer, the more likely command buffer access is the bottleneck because memory must be accessed more frequently. Command buffer size is affected by whether you use single commands (header + one piece of data) or burst commands (header + multiple pieces of data), even if the number of commands is the same in both cases.
Also, if a command buffer being called as a subroutine is placed in device memory, the cost of switching when executing that subroutine is added directly to the GPU processing time.
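The single-versus-burst size difference above can be sketched numerically. The word counts here are an assumption for illustration (one header word plus data words, padded to an even word count for 8-byte alignment); see the "3D Command Execution Cost" section in the CTR Programming Manual: Advanced Graphics for the actual encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative buffer-size comparison for N register writes.
   Assumed encoding (not the authoritative format): a single command is
   one header word plus one data word; a burst command is one header
   word plus N data words, padded to an even word count. */

static size_t single_command_words(size_t n_writes) {
    return n_writes * 2;               /* header + data per write */
}

static size_t burst_command_words(size_t n_writes) {
    size_t words = 1 + n_writes;       /* one header + n data words */
    return (words + 1) & ~(size_t)1;   /* pad to an even word count */
}
```

For 16 register writes, the single-command form needs 32 words while the burst form needs 18, so bursts shrink the buffer and reduce memory accesses even though the command count is the same.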
3.2.6. Reusing Commands for Stereoscopic Display (Important)
To render a scene stereoscopically, you must render it twice with only the camera position changed.
By saving the commands from the first rendering pass in a command cache (or by some other means) and then reusing all of them—except for camera-related settings—in the second rendering pass, you can dramatically reduce the number of commands that must be issued for rendering.
Note:
Fragment lights are set in camera coordinates. To represent lighting precisely, you must convert fragment lights into each camera’s coordinate system, which means you must also change commands related to fragment lights.
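The reuse strategy above can be modeled abstractly: every saved command is replayed verbatim for the second eye except those tagged as camera-dependent (including, per the note, fragment-light settings). This is a hypothetical model, not CTR-SDK code; the struct and tag are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of reusing a cached left-eye command stream for
   the right eye: commands are reused verbatim unless tagged as
   camera-dependent, in which case they are re-issued per eye. */

typedef struct {
    int is_camera;   /* 1 if the command depends on the camera position */
    int value;       /* stand-in for the command's payload */
} Cmd;

/* Copies 'left' into 'right', patching camera commands with eye_value. */
static void build_right_eye(const Cmd *left, Cmd *right, size_t n,
                            int eye_value) {
    for (size_t i = 0; i < n; ++i) {
        right[i] = left[i];
        if (right[i].is_camera) {
            right[i].value = eye_value;  /* only camera settings change */
        }
    }
}
```

Only the tagged commands are regenerated, which is why the technique dramatically reduces the commands issued for the second rendering pass.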
3.3. Vertex Shader Programs and Lookup Tables
3.3.1. Vertex Shader Programs
Because vertex shader processing is run on each vertex, we recommend that you reduce the number of vertex lights, branch instructions, and so on as much as possible. Care is required because the generation of 3D commands that switch (load) vertex shaders places a load on the CPU. At runtime, shader switching also places a load on the GPU.
To reduce the number of times a vertex shader is loaded, take the following steps.
- Manage the rendering order to minimize the number of times that vertex shaders are switched.
- Pack multiple shader programs into a single binary. This way, you can switch shaders by simply changing the shader entry point without loading a binary.
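The second technique above can be sketched as a lookup table over a packed binary: switching shaders resolves to an entry-point offset in the already-loaded binary instead of a binary load. The table layout and names here are illustrative assumptions, not CTR-SDK API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of packing several shader programs into one
   binary: each program is identified by an entry-point offset into the
   shared binary, so switching shaders avoids reloading a binary. */

typedef struct {
    const char *name;
    int         entry;   /* instruction offset inside the shared binary */
} ShaderEntry;

static int find_entry(const ShaderEntry *table, int count,
                      const char *name) {
    for (int i = 0; i < count; ++i) {
        if (strcmp(table[i].name, name) == 0) {
            return table[i].entry;   /* switch via entry point only */
        }
    }
    return -1;   /* not packed into this binary: a load is needed */
}
```

Combined with sorting draws by shader, this keeps both the 3D-command generation cost on the CPU and the switch cost on the GPU low.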
3.3.2. Reducing Lookup Table (LUT) Loads
When you configure lookup tables, the CPU issues the lookup tables themselves as commands to the GPU. As a result, there is a CPU cost associated with switching lookup tables. Because there are registers for each lookup table type (such as D0, D1, and SP), we recommend that you avoid switching the lookup table whenever possible.
For example, when two materials each use a different LUT, you could set one LUT in D0 and the other in D1, and then enable the appropriate one for each material.
3.3.3. Changes in Loading Speed Due to 3D Command Buffer Address and Size
The address and size of the 3D command buffer may affect loading speed during execution.
To execute the 3D command buffer, use one of the following two methods.
- Execution using 3D executable commands
Loading speed is affected by the size of data as measured between the addresses of consecutive delimiter commands added using nngxFlush3DCommand or nngxSplitDrawCmdlist.
- Execution using a register that executes a command buffer (command buffer jump)
Loading speed is affected by the address and size of the command buffer added using nngxAddJumpCommand, or by the address and size of a command buffer added using nngxAddSubroutineCommand that is executed as a subroutine or executed to return to the calling process from a subroutine.
More efficient transfer rates may result when using the following buffer addresses and sizes.
- The address has 128-byte alignment and the size of the data is a multiple of 256 bytes (256, 512, 768 bytes, and so on).
- The address is not 128-byte aligned, but the size as measured from the immediately previous address with 128-byte alignment to the end of the 3D command buffer is a multiple of 256 bytes.
For example, if the address and size of the 3D command buffer are 0x20000010 and 0x1F0, respectively, the immediately preceding 128-byte-aligned address is 0x20000000, just 0x10 before. The data size from there to the end of the buffer is 0x1F0 + 0x10 = 0x200, which is a multiple of 256.
Although the address and size of the 3D command buffer can influence loading speed as described above due to the way the GPU is implemented, you may not see large benefits due to factors such as the location of the buffer, the content of 3D commands, and memory access conflicts with other modules.
3.4. Vertex Buffers and Texture Buffers
3.4.1. Placement of Vertex Buffers and Texture Buffers (Important)
Vertex buffers, vertex index buffers, and texture buffers can be placed in main memory or VRAM. Because it is more convenient for the GPU to access VRAM than main memory, you can place buffers in VRAM to improve performance.
With the default settings, the GPU is given priority when there is a main memory access conflict between the CPU and the GPU. If the CPU is executing a process that heavily depends on memory access, this can cause a precipitous drop in performance. To reduce the processing load on the CPU, place data that is directly accessed by the GPU into VRAM instead of main memory.
Note:
Use nn::gx::SetMemAccessPrioMode to configure main memory access priorities for the CPU and the GPU.
For information about the structure of the CPU, the GPU, and the various memory devices, see Section 8.3. Hardware Configuration.
If vertex attributes and textures (multi-textures) to be used simultaneously are placed in both VRAM and main memory, main memory becomes a bottleneck on access speed. All buffers that are used simultaneously must therefore be placed in VRAM.
Placing multi-textures in different VRAMs may lower performance. We recommend placing textures that are used simultaneously in the same VRAM.
3.4.2. Effective Use of the Vertex Cache (Important)
The 3DS system has a vertex cache that can temporarily cache processed vertex data using its vertex index as a key, and then use that cache data later for the same input vertex.
The vertex cache is only effective in rendering primitives when glDrawElements (or an equivalent 3D command) uses a vertex buffer for rendering. Up to 32 entries can be cached in the vertex cache. If more than 32 entries are input, the existing entries are overwritten, starting with the least-recently accessed entries. The cache is flushed each time glDrawElements is called.
3.4.3. Improving Efficiency of Vertex Indices
When specifying vertex indices, rendering efficiency can be improved by using the vertex cache to its fullest.
In the case of similar vertex indices, the efficiency of GL_TRIANGLES is no worse than that of GL_TRIANGLE_STRIP. In some cases, rendering with GL_TRIANGLES results in better efficiency because adjustments are sometimes necessary when using GL_TRIANGLE_STRIP with degenerate polygons. Using optimized vertex indices, GL_TRIANGLES allows for the most efficient rendering.
3.4.4. Updating Buffers
When vertex and texture data are updated, the GPU accesses the data in main memory, so the CPU must internally apply the changes to the GPU. Because this process puts relatively large overhead on the system, the overall processor load increases in proportion to the number of calls to such update functions.
A single command request is issued with each transfer for buffers placed in VRAM; this also increases the CPU load.
We recommend that you call vertex and texture buffer update functions in advance, while the application is initializing: load vertex buffer and texture data early, avoid loading data during per-frame operations whenever possible, and transfer data all at once.
3.4.5. Using Interleaved Arrays
When rendering a model, it is more efficient to access memory using vertex buffers as a single interleaved array that combines vertex attributes rather than as separate arrays for each vertex attribute. This is particularly relevant when vertex arrays are placed in main memory because then the CPU and GPU compete over memory access. You can minimize the adverse effect on CPU performance by making memory access more efficient.
Note:
Beginning with NW4C 1.1.0, an interleaved array is used for the vertex buffer.
3.4.6. Reducing Texture Sizes
The GPU has L1 and L2 texture caches. These caches hold one region at a time; by using smaller textures, you can reduce the number of texture fetches. We also recommend smaller textures because they reduce VRAM usage.
Note:
Each texture unit has its own L1 texture cache, which holds 256 bytes.
The L2 texture cache holds 8 KB and is shared by all texture units.
3.4.7. Using Mipmaps (Important)
When you shrink and display a texture, there are many texture cache misses and GPU performance drops. If mipmaps are used, however, appropriately sized textures are accessed to shrink textures. Use mipmaps for any textures that you expect to shrink onscreen.
Conversely, you can conserve VRAM by not using mipmaps for textures that are guaranteed to always be displayed at a fixed size.
3.4.8. Using the ETC1 Texture Format (Important)
The ETC1 texture format reduces the data size to 4 bits per texel. In addition to decreasing VRAM usage, a reduction in texture size has the advantage of narrowing the range of memory that is accessed. Only textures in the ETC1 format are kept compressed in the texture cache, which is advantageous in terms of cache efficiency.
We recommend using the ETC1 format whenever possible.
Note:
Texture formats other than ETC1—including ETC1A4—are expanded to the (32-bit) RGBA8 format in the cache.
3.4.9. Texture Image Layout Orientation (Important)
You can achieve better texture cache hit rates by aligning the orientation of the texture image used to the order in which fragments are generated in the framebuffer (Layout Pattern 2 in Figure 3-2) instead of aligning to the LCD display orientation (Layout Pattern 1 in Figure 3-2). This can improve display performance, particularly for 2D backgrounds and characters. The normal LCD layout orientation is shown in Figure 3-1, below.
The content of the framebuffer copied to the display buffer is what is actually shown on the LCD, but the important point to note here is the relationship between the fragment generation order and the texture image pixel order. The system rasterizes images in 8×8 pixel blocks (the blue box in Figure 3-1), but it fetches textures in 8×4 texel blocks. The green box in Figure 3-1 shows the orientation of the texture block fetched in Layout Pattern 1, and the red box shows the orientation for Layout Pattern 2.
Figure 3-2 shows the two texture image layout patterns. Because texture formats are handled in 8×8 pixel blocks, data ordering in memory is either a1 → a2 → b1 → b2 → for Layout Pattern 1, or C1 → C2 → A1 → A2 → for Layout Pattern 2. In other words, when rasterizing the block shown in Figure 3-1, the blocks fetched in Layout Pattern 1 would be c2 → c1 → c2 → c1, whereas in Layout Pattern 2 the blocks fetched would just be C1 → C2. Pattern 2 requires fewer block switches, and the next block to rasterize is the very next texture block in memory.
The yellow lines indicate the switching order for block fetching when two blocks are rasterized.
3.4.10. Differences Between nngxUpdateBuffer and nngxUpdateBufferLight (Important)
When the CPU changes vertex or texture data that resides on device memory directly accessed by the GPU, the application must guarantee the consistency of this data. In other words, if the CPU manipulates (decompresses, copies, or changes) this data, it must apply those changes to the GPU side.
The nngxUpdateBuffer function takes longer to execute in extended applications than in standard applications. For this reason, Nintendo recommends calling the nngxUpdateBufferLight function instead of nngxUpdateBuffer when you are only applying data manipulated by the CPU to the GPU. However, you must call the nngxUpdateBuffer function to access GPU rendering results with the CPU.
Because calls to nngxUpdateBuffer entail some overhead, calling it multiple times might lead to a drop in performance. If you manipulate multiple pieces of data with the CPU, do not call the function for each piece of data. Instead, call it once for the entire device memory region when all your CPU data operations are finished. This reduces the processing load.
Note:
Using the FS library to decompress data on device memory counts as manipulation of this data by the CPU.
Calling the nngxUpdateBuffer function twice on a small region and calling it once on the entire device memory region have roughly equivalent costs.
3.4.10.1. The nngxAddVramDmaCommand and nngxAddVramDmaCommandNoCacheFlush Functions
The application must also take data consistency into account when transferring data from main memory to VRAM. In general, if the application uses DMPGL or the nngxAddVramDmaCommand function to transfer the data, the function flushes the data cache automatically.
In contrast, if the application uses the nngxAddVramDmaCommandNoCacheFlush function to transfer data, the function does not flush the data cache. When you transfer data from multiple regions, you can reduce the processing load by first using the nngxUpdateBufferLight function so that the application itself ensures data consistency, and then using the nngxAddVramDmaCommandNoCacheFlush function to transfer the data.
3.4.11. Vertex Data Loading and Vertex Shader Processing
If no geometry shader is in use, loading vertex data under ideal memory-access conditions can hide up to 52 cycles of vertex shader processing underneath the load of each individual vertex, regardless of whether the vertex data is placed in device memory or VRAM.
This number of cycles is a theoretical value. The actual value can be reduced by several factors: differences in data size, the number of load arrays, changes in memory access speed due to other processes, and whether different types are combined in the vertex data attributes.
3.4.12. Effect of Combinations of Vertex Attributes on the Vertex Data Transfer Rate
Depending on the combination of type and size of vertex attribute data making up the load array, look-ahead transfer can increase the vertex data transfer rate when using a vertex buffer.
Look-ahead transfer is used if the following conditions are satisfied.
("the number of attributes of a type other than GL_FLOAT" + "the number of attributes with a data length of 1") <= ("the number of attributes of GL_FLOAT type with a data length of 4" + "the number of attributes of GL_FLOAT type with a data length of 3" / 2)
The data size of "the number of attributes of a type other than GL_FLOAT" and the data type of "the number of attributes with a data length of 1" do not matter. Vertex attributes that meet more than one condition at the same time are counted for each condition category they fulfill. For example, a vertex attribute of GL_BYTE type with a data length of 1 is included in the count for both "the number of attributes of a type other than GL_FLOAT" and "the number of attributes with a data length of 1."
If the two sides of the comparison are equal, the determination depends on the amount of data per load array: the less data there is, the faster the transfer rate. If the amount of vertex data is also the same in both cases, the determination depends on the number of attributes in the load array: the fewer attributes, the faster the transfer rate.
3.4.13. Address Alignment of the Vertex Array
It may be possible when using a vertex buffer to increase the efficiency of vertex array transfer during rendering by using a vertex array with 32-byte address alignment. The vertex array address as used here means the address obtained by adding the offset specified by glVertexAttribPointer (the ptr argument) to the vertex buffer address.
Transfer efficiency is not always improved: how much the transfer rate increases compared to a vertex array whose address is not 32-byte aligned depends on the type and size of the vertex attributes, the location where the vertex array is stored, and the content of the vertex indices. Even if transfer performance improves, it does not necessarily lead to better overall system performance unless vertex array transfer is the bottleneck.
3.5. Fragment Shaders
3.5.1. Number of Fragment Lights
The number of fragment lights affects GPU fill processing. The processing load increases proportionally with respect to both the fill area size and the number of lights. Avoid using fragment lighting when possible. For example, you can use static lighting and bake colors into vertices instead of sending normal data.
Note:
The GPU specifications always enable at least one fragment light. If you do not use the fragment lighting results, the per-fragment processing load does not change but you can reduce processing for the vertex shader’s quaternion output.
3.5.2. Layer Configuration (Important)
Choose layer configurations with smaller numbers (less processing) whenever possible.
If you are only using the D0 LUT, for example, layer configuration 7 is more expensive than layer configuration 1 by two clock cycles times the number of fragment lights for each fill area.
3.5.3. Unused Texture Unit Settings (Important)
When a texture is set in a texture unit, it is fetched even if (for example) it is not used by a texture combiner. To avoid unnecessary memory access, do not configure a texture unit that is not accessed by the combiners.
Note:
Accessing memory to fetch unnecessary textures is detrimental to power consumption.
3.5.4. Interpolation Between Mipmap Levels
When you configure linear interpolation to be used between mipmap levels, the number of texture fetches increases. Because this increases the GPU load and also leads to access conflicts, choose “nearest” interpolation between mipmap levels whenever possible.
3.5.5. Using or Not Using Blend Operations
Without blending, the framebuffer is only accessed for writes; with blending, read accesses also occur. Setting blend operations when they are not needed therefore entails unnecessary memory accesses. Configure blend operations appropriately.
3.6. Framebuffers and Display Buffers
3.6.1. Avoiding the Use of glClear
Each time you use glClear to clear the framebuffer, two command requests are added, resulting in a large CPU load.
Instead of using glClear, you can render a background model first with depth testing disabled and depth writing enabled, which eliminates the framebuffer clear operation and saves CPU and GPU processing.
For example, if you blend a polygon that covers the entire screen at the end of rendering to change the rendered brightness, the settings described above allow you to skip the clear operation for the depth buffer during the next rendering pass.
Note:
Settings that combine no depth testing with writing depth data cannot be made when using DMPGL. A library such as GR must be used instead to issue these commands.
3.6.2. Render Buffer Location (Important)
You can place the color buffer and depth buffer in different locations in VRAM (VRAM-A or VRAM-B) to parallelize fill operations when the buffers are used simultaneously. This improves performance.
3.6.3. Sharing Framebuffers to Conserve VRAM (Important)
When rendering to each of the screens, you can conserve VRAM by reusing the same framebuffer first for the right eye, then for the left eye, and finally for the lower screen.
3.6.4. Display Buffer Placement
When the display buffer is placed in main memory, main memory is accessed regularly because the GPU transfers the content of the display buffer to the LCD every frame.
When the display buffer is placed in VRAM, the content of the display buffer is transferred to the LCD without accessing main memory. Although this puts pressure on VRAM, it is useful in avoiding access conflicts when the cameras or other devices access main memory often.
Note:
If competing memory access prevents data from being transferred to the LCD on time, artifacts may be displayed on the LCD. You can also use nn::gx::SetMemAccessPrioMode to configure access priorities for main memory.
function to transfer the data.
3.4.11. Vertex Data Loading and Vertex Shader Processing
If no geometry shader is in use, loading vertex data under ideal memory access conditions could hide up to 52 cycles of vertex shader processing underneath the loading of each individual vertex of vertex data, regardless of whether the vertex data is placed in device memory or VRAM
This number of cycles is a theoretical value. The actual value can be reduced by several factors: differences in data size, the number of load arrays, changes in memory access speed due to other processes, and whether different types are combined in the vertex data attributes.
3.4.12. Effect of Combinations of Vertex Attribute on the Vertex Data Transfer Rate
Depending on the combination of type and size of vertex attribute data making up the load array, look-ahead transfer can increase the vertex data transfer rate when using a vertex buffer.
Look-ahead transfer is used if the following conditions are satisfied.
("The number of attributes of other than GL_FLOAT
type" + "The number of attributes with a data length of 1")
<= ("The number of attributes of GL_FLOAT
type with a data length of 4" + "The number of attributes of GL_FLOAT
type with a data length of 3" / 2)
The data size of "the number of attributes of other than GL_FLOAT
type" and the data type of "the number of attributes with a data length of 1" do not matter. Vertex attributes that meet more than one condition at the same time are counted for each condition category they fulfill. For example, a vertex attribute of GL_BYTE
type with a data length of 1 is included in the count for "the number of attributes of other than GL_FLOAT
type" and "the number of attributes with a data length of 1."
If the two sides of the comparison for whether to perform look-ahead transfer are equal, the determination depends on the amount of data per load array. The less data there is, the faster the transfer rate. If the amount of vertex data is the same in both cases, the determination depends on the number of attributes in the load array. The fewer number of attributes in the load array, the faster the transfer rate.
3.4.13. Address Alignment of the Vertex Array
It may be possible when using a vertex buffer to increase the efficiency of vertex array transfer during rendering by using a vertex array with 32-byte address alignment. The concept of a vertex array address as used here means the address obtained by adding the offset specified by glVertexAttribPointer
(the value given by ptr
) to the vertex buffer address.
It may not always be possible to improve transfer efficiency, because how much the transfer rate increases as compared to using a vertex array with an address having other than 32-byte alignment depends on the type and size of the vertex attribute, the location the vertex array is stored, and the content of the vertex index. Even if transfer performance is improved, it may not necessarily lead to better overall system performance unless vertex array transfer is the bottleneck.
3.4.1. Data Placement Location
Vertex buffers, vertex index buffers, and texture buffers can be placed in main memory or VRAM. Because the GPU can access VRAM more efficiently than main memory, you can place buffers in VRAM to improve performance.
With the default settings, the GPU is given priority when there is a main memory access conflict between the CPU and the GPU. If the CPU is executing a process that heavily depends on memory access, this can cause a precipitous drop in performance. To reduce the processing load on the CPU, place data that is directly accessed by the GPU into VRAM instead of main memory.
Use the nn::gx::SetMemAccessPrioMode function to configure main memory access priorities for the CPU and the GPU.
For information about the structure of the CPU, the GPU, and the various memory devices, see Section 8.3. Hardware Configuration.
If vertex attributes and textures (multi-textures) to be used simultaneously are placed in both VRAM and main memory, main memory becomes a bottleneck on access speed. All buffers that are used simultaneously must therefore be placed in VRAM.
Placing multi-textures in different VRAMs may lower performance. We recommend placing textures that are used simultaneously in the same VRAM.
3.4.2. Vertex Cache
The 3DS system has a vertex cache that can temporarily cache processed vertex data, using the vertex index as a key, and then reuse that cached data later for the same input vertex.
The vertex cache is only effective in rendering primitives when glDrawElements (or an equivalent 3D command) uses a vertex buffer for rendering. Up to 32 entries can be cached in the vertex cache. If more than 32 entries are input, existing entries are overwritten, starting with the least-recently accessed entries. The cache is flushed each time glDrawElements is called.
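As a rough illustration of how index reuse interacts with this cache, the following self-contained sketch models a 32-entry cache keyed by vertex index with least-recently-used replacement. The function name and the replacement details are illustrative assumptions, not a description of the actual hardware:

```c
/* Toy model of a 32-entry post-transform vertex cache: keyed by vertex
 * index, least-recently-used replacement, flushed at each draw call.
 * Returns the number of cache misses (vertices that must be shaded). */
#define VCACHE_ENTRIES 32

int vcache_count_misses(const unsigned short *indices, int count)
{
    int cache[VCACHE_ENTRIES];
    int age[VCACHE_ENTRIES];
    int used = 0, misses = 0, tick = 0;

    for (int i = 0; i < count; i++) {
        int idx = indices[i];
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (cache[j] == idx) { hit = j; break; }
        if (hit >= 0) {
            age[hit] = tick++;          /* refresh LRU age on a hit */
        } else {
            misses++;
            if (used < VCACHE_ENTRIES) {
                cache[used] = idx; age[used] = tick++; used++;
            } else {
                int lru = 0;            /* evict least-recently-used entry */
                for (int j = 1; j < VCACHE_ENTRIES; j++)
                    if (age[j] < age[lru]) lru = j;
                cache[lru] = idx; age[lru] = tick++;
            }
        }
    }
    return misses;
}
```

In this model, an index list that shares vertices between adjacent triangles (for example 0,1,2, 2,1,3) shades only four vertices instead of six.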
3.4.3. Improving Efficiency of Vertex Indices
When specifying vertex indices, rendering efficiency can be improved by using the vertex cache to its fullest.
In the case of similar vertex indices, the efficiency of GL_TRIANGLES is no worse than that of GL_TRIANGLE_STRIP. In some cases, rendering with GL_TRIANGLES is even more efficient, because adjustments with degenerate polygons are sometimes necessary when using GL_TRIANGLE_STRIP. With optimized vertex indices, GL_TRIANGLES allows for the most efficient rendering.
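The trade-off above can be made concrete by counting indices. The sketch below assumes a common degenerate-join scheme in which two extra indices join each pair of strips; the function names are illustrative:

```c
/* Index counts for rendering `strips` separate strips of `tris_per_strip`
 * triangles each: either joined into one GL_TRIANGLE_STRIP submission
 * using degenerate triangles, or issued as a GL_TRIANGLES list. */
int strip_index_count(int strips, int tris_per_strip)
{
    /* each strip: tris + 2 indices; each join: 2 degenerate indices */
    return strips * (tris_per_strip + 2) + (strips - 1) * 2;
}

int triangles_index_count(int strips, int tris_per_strip)
{
    return strips * tris_per_strip * 3;
}
```

A strip submits fewer indices, but the degenerate joins add triangles that must still be processed and rejected, whereas GL_TRIANGLES with an optimized index order keeps the number of shaded vertices comparable through the vertex cache.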
3.4.4. Updating Buffers
When vertex and texture data are updated, the CPU must internally apply the changes to the data in main memory that the GPU accesses. Because this process incurs a relatively large overhead, the overall processor load increases in proportion to the number of calls to such functions.
A single command request is issued with each transfer for buffers placed in VRAM; this also increases the CPU load.
We recommend calling vertex and texture buffer update functions in advance during application initialization: load vertex buffer and texture data early, avoid loading data during per-frame operations whenever possible, and transfer data all at once.
3.4.5. Using Interleaved Arrays
When rendering a model, it is more efficient to access memory using vertex buffers as a single interleaved array that combines vertex attributes rather than as separate arrays for each vertex attribute. This is particularly relevant when vertex arrays are placed in main memory because then the CPU and GPU compete over memory access. You can minimize the adverse effect on CPU performance by making memory access more efficient.
Note:
Beginning with NW4C 1.1.0, an interleaved array is used for the vertex buffer.
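As a sketch of what an interleaved layout looks like in C (the struct and its attribute set are hypothetical examples, not a required format):

```c
#include <stddef.h>

/* One interleaved vertex record: position, normal, and texture coordinate
 * packed contiguously, so the GPU reads each vertex with one sequential
 * access pattern instead of walking three separate arrays. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} Vertex;

/* Stride for a glVertexAttribPointer-style setup: every attribute uses
 * the same record size, with per-attribute byte offsets into the record. */
enum { VERTEX_STRIDE = sizeof(Vertex) };
```

With this layout, each attribute would be configured with VERTEX_STRIDE as the stride and offsetof(Vertex, normal) or offsetof(Vertex, texcoord) as the offset, so all attributes of one vertex come from a single contiguous 32-byte record.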
3.4.6. Reducing Texture Sizes
The GPU has L1 and L2 texture caches. These caches hold texture data one region at a time; by using smaller textures, you can reduce the number of texture fetches. We also recommend using smaller textures because they reduce VRAM usage.
Note:
Each texture unit has its own L1 texture cache, which holds 256 bytes.
The L2 texture cache holds 8 KB and is shared by all texture units.
3.4.7. Using Mipmaps (Important)
When you shrink and display a texture, many texture cache misses occur and GPU performance drops. If mipmaps are used, however, a texture of the appropriate size is accessed when the texture is shrunk. Use mipmaps for any textures that you expect to be shrunk onscreen.
Conversely, you can conserve VRAM by not using mipmaps for textures that are guaranteed to always be displayed at a fixed size.
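The VRAM cost of a full mipmap chain is modest. The following sketch sums the texels in every level of a square power-of-two texture; the complete chain adds roughly one third to the base level's footprint:

```c
/* Total texel count for a full mipmap chain of a square power-of-two
 * texture: each level halves both dimensions down to 1x1. */
long mipchain_texels(int base_size)
{
    long total = 0;
    for (int s = base_size; s >= 1; s /= 2)
        total += (long)s * s;   /* add this level's texel count */
    return total;
}
```

For a 256×256 base, the chain totals 87,381 texels versus 65,536 for the base level alone, an overhead of about 33 percent.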
3.4.8. Using the ETC1 Texture Format (Important)
The ETC1 texture format reduces the data size to 4 bits per texel. In addition to decreasing VRAM usage, the smaller texture size has the advantage of narrowing the memory ranges that are accessed. ETC1 is also the only format that is kept compressed in the texture cache, which makes it advantageous in terms of cache efficiency.
We recommend using the ETC1 format whenever possible.
Note:
Texture formats other than ETC1—including ETC1A4—are expanded to the (32-bit) RGBA8 format in the cache.
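The savings are easy to quantify. Assuming 4 bits per texel for ETC1, 8 for ETC1A4, and 32 for RGBA8, the data sizes for a given texture are:

```c
/* Texture data sizes in bytes for a w x h texture, assuming
 * 4 bits/texel (ETC1), 8 bits/texel (ETC1A4), 32 bits/texel (RGBA8). */
unsigned tex_bytes_etc1(unsigned w, unsigned h)   { return w * h / 2; }
unsigned tex_bytes_etc1a4(unsigned w, unsigned h) { return w * h;     }
unsigned tex_bytes_rgba8(unsigned w, unsigned h)  { return w * h * 4; }
```

A 256×256 ETC1 texture occupies 32 KB versus 256 KB for RGBA8, one eighth the size.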
3.4.9. Texture Image Layout Orientation (Important)
You can achieve better texture cache hit rates by aligning the orientation of the texture image to the order in which fragments are generated in the framebuffer (Layout Pattern 2 in Figure 3-2) instead of aligning it to the LCD display orientation (Layout Pattern 1 in Figure 3-2). This can improve display performance, particularly for 2D backgrounds and characters. The normal LCD layout orientation is shown in Figure 3-1, below.
The content of the framebuffer copied to the display buffer is what is actually shown on the LCD, but the important point here is the relationship between the fragment generation order and the texture image pixel order. The system rasterizes images in 8×8 pixel blocks (the blue box in Figure 3-1), but it fetches textures in 8×4 texel blocks. The green box in Figure 3-1 shows the orientation of the texture block fetched in Layout Pattern 1, and the red box shows the orientation for Layout Pattern 2.
Figure 3-2 shows the two texture image layout patterns. Because texture formats are handled in 8×8 pixel blocks, data ordering in memory is either a1 → a2 → b1 → b2 → … for Layout Pattern 1, or C1 → C2 → A1 → A2 → … for Layout Pattern 2. In other words, when rasterizing the block shown in Figure 3-1, the blocks fetched in Layout Pattern 1 would be c2 → c1 → c2 → c1, whereas in Layout Pattern 2 the blocks fetched would be just C1 → C2. Pattern 2 requires fewer block switches, and the next block to rasterize is the very next texture block in memory.
The yellow lines indicate the switching order for block fetching when two blocks are rasterized.
3.4.10. Differences Between nngxUpdateBuffer and nngxUpdateBufferLight (Important)
When the CPU changes vertex or texture data that resides on device memory directly accessed by the GPU, the application must guarantee the consistency of this data. In other words, if the CPU manipulates (decompresses, copies, or changes) this data, it must apply those changes to the GPU side.
The nngxUpdateBuffer function takes longer to execute in extended applications than in standard applications. For this reason, Nintendo recommends calling the nngxUpdateBufferLight function instead of nngxUpdateBuffer when you are only applying data manipulated by the CPU to the GPU. However, you must call the nngxUpdateBuffer function to access GPU rendering results with the CPU.
Because calls to nngxUpdateBuffer entail some overhead, calling the function multiple times can lead to a drop in performance. If you manipulate multiple pieces of data with the CPU, do not call the function for each piece of data. Instead, call it once for the entire device memory region after all your CPU data operations are finished. This reduces the processing load.
Note:
Using the FS library to decompress data on device memory counts as manipulation of this data by the CPU.
Calling the nngxUpdateBuffer function twice on small regions and calling it once on the entire device memory region have roughly equivalent costs.
3.4.10.1. The nngxAddVramDmaCommand and nngxAddVramDmaCommandNoCacheFlush Functions
The application must also take data consistency into account when transferring data from main memory to VRAM. In general, if the application uses DMPGL or the nngxAddVramDmaCommand function to transfer the data, the cache is flushed automatically by the function.
In contrast, if the application uses the nngxAddVramDmaCommandNoCacheFlush function to transfer data, the cache is not flushed by the function. When you transfer data from multiple regions, you can reduce the processing load by first using the nngxUpdateBufferLight function so that the application ensures data consistency, and then using the nngxAddVramDmaCommandNoCacheFlush function to transfer the data.
3.4.11. Vertex Data Loading and Vertex Shader Processing
If no geometry shader is in use and memory access conditions are ideal, up to 52 cycles of vertex shader processing can be hidden underneath the loading of each individual vertex, regardless of whether the vertex data is placed in device memory or VRAM.
This number of cycles is a theoretical value. The actual value can be reduced by several factors: differences in data size, the number of load arrays, changes in memory access speed due to other processes, and whether different types are combined in the vertex data attributes.
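One way to read this figure is as a simple throughput model in which the per-vertex cost is the load time plus only the shader cycles that exceed the hidden budget. This is an illustrative model under that assumption, not a description of the actual pipeline:

```c
/* Illustrative per-vertex cost model: up to `hide` cycles of shader
 * work (about 52 in the ideal case described above) overlap the load
 * of each vertex; only the excess shader cycles add to the total. */
int per_vertex_cycles(int load_cycles, int shader_cycles, int hide)
{
    int hidden  = shader_cycles < hide ? shader_cycles : hide;
    int exposed = shader_cycles - hidden;
    return load_cycles + exposed;
}
```

In this model, a 40-cycle shader is fully hidden behind a 60-cycle load, while an 80-cycle shader exposes only its 28 cycles beyond the 52-cycle budget.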
3.4.12. Effect of Vertex Attribute Combinations on the Vertex Data Transfer Rate
Depending on the combination of type and size of vertex attribute data making up the load array, look-ahead transfer can increase the vertex data transfer rate when using a vertex buffer.
Look-ahead transfer is used if the following condition is satisfied:
(the number of attributes of a type other than GL_FLOAT + the number of attributes with a data length of 1) <= (the number of GL_FLOAT attributes with a data length of 4 + the number of GL_FLOAT attributes with a data length of 3 / 2)
The data size of the attributes of a type other than GL_FLOAT and the data type of the attributes with a data length of 1 do not matter. Vertex attributes that meet more than one condition at the same time are counted in each category they fulfill. For example, a vertex attribute of GL_BYTE type with a data length of 1 is counted both among the attributes of a type other than GL_FLOAT and among the attributes with a data length of 1.
If the two sides of the comparison are equal, the determination depends on the amount of data per load array: the less data there is, the faster the transfer rate. If the amount of vertex data is the same in both cases, the determination depends on the number of attributes in the load array: the fewer attributes there are, the faster the transfer rate.
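The counting rules above can be sketched as a small helper. The Attr struct and function name are hypothetical; equality is treated here as satisfying the condition, although, as noted above, the equal case actually depends on additional factors such as the amount of data per load array:

```c
/* Evaluate the look-ahead transfer condition for one load array.
 * An attribute can be counted in more than one category, as described
 * in the text (e.g. a GL_BYTE attribute with data length 1). */
typedef struct {
    int is_float;   /* nonzero for GL_FLOAT attributes */
    int length;     /* number of components, 1..4 */
} Attr;

int lookahead_transfer_used(const Attr *attrs, int n)
{
    int non_float = 0, len1 = 0, float4 = 0, float3 = 0;
    for (int i = 0; i < n; i++) {
        if (!attrs[i].is_float) non_float++;
        if (attrs[i].length == 1) len1++;
        if (attrs[i].is_float && attrs[i].length == 4) float4++;
        if (attrs[i].is_float && attrs[i].length == 3) float3++;
    }
    /* both sides doubled to keep the float3/2 term in integer math */
    return 2 * (non_float + len1) <= 2 * float4 + float3;
}
```

For example, two GL_FLOAT length-4 attributes plus one non-float length-2 attribute satisfy the condition, while one GL_FLOAT length-4 attribute plus one non-float length-1 attribute (which counts on the left side twice) does not.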
3.4.13. Address Alignment of the Vertex Array
When using a vertex buffer, it may be possible to increase the efficiency of vertex array transfer during rendering by using a vertex array with 32-byte address alignment. The vertex array address as used here means the address obtained by adding the offset specified by glVertexAttribPointer (the value given by ptr) to the vertex buffer address.
It may not always be possible to improve transfer efficiency: how much the transfer rate increases, as compared to using a vertex array whose address has other than 32-byte alignment, depends on the type and size of the vertex attributes, the location where the vertex array is stored, and the content of the vertex indices. Even if transfer performance improves, it does not necessarily lead to better overall system performance unless vertex array transfer is the bottleneck.
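A sketch of the address arithmetic, under the definition above (vertex array address = vertex buffer address + glVertexAttribPointer offset); the helper names are illustrative:

```c
#include <stdint.h>

/* The effective vertex array address is the vertex buffer base address
 * plus the offset passed to glVertexAttribPointer. */
uintptr_t vertex_array_address(uintptr_t buffer_base, uintptr_t offset)
{
    return buffer_base + offset;
}

/* Round an offset up to the next 32-byte boundary, so that (with a
 * 32-byte-aligned buffer base) the vertex array stays 32-byte aligned. */
uintptr_t align_up_32(uintptr_t offset)
{
    return (offset + 31u) & ~(uintptr_t)31u;
}

int is_32byte_aligned(uintptr_t addr)
{
    return (addr & 31u) == 0;
}
```

For example, padding an offset of 100 bytes up to 128 bytes keeps the array aligned when the buffer base itself is 32-byte aligned.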
3.4.13. Address Alignment of the Vertex Array
It may be possible when using a vertex buffer to increase the efficiency of vertex array transfer during rendering by using a vertex array with 32-byte address alignment. The concept of a vertex array address as used here means the address obtained by adding the offset specified by glVertexAttribPointer
(the value given by ptr
) to the vertex buffer address.
It may not always be possible to improve transfer efficiency, because how much the transfer rate increases as compared to using a vertex array with an address having other than 32-byte alignment depends on the type and size of the vertex attribute, the location the vertex array is stored, and the content of the vertex index. Even if transfer performance is improved, it may not necessarily lead to better overall system performance unless vertex array transfer is the bottleneck.
When you shrink and display a texture, many texture cache misses occur and GPU performance drops. If mipmaps are used, however, an appropriately sized texture image is accessed instead. Use mipmaps for any textures that you expect to be shrunk onscreen.
Conversely, you can conserve VRAM by not using mipmaps for textures that are guaranteed to always be displayed at a fixed size.
The ETC1 texture format reduces the per-texel data size to 4 bits. In addition to decreasing VRAM usage, a reduction in texture size has the advantage of economizing the memory ranges that are accessed. ETC1 is the only format that remains compressed in the texture cache, which makes it advantageous in terms of cache efficiency.
We recommend using the ETC1 format whenever possible.
Texture formats other than ETC1—including ETC1A4—are expanded to the (32-bit) RGBA8 format in the cache.
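As a rough illustration of the savings (this worked example, including the 256×256 texture size, is ours; the 4-bit ETC1 and 32-bit RGBA8 figures come from the text above):

```c
/* Bytes occupied by a width x height texture at the given bits per texel. */
unsigned texture_bytes(unsigned width, unsigned height, unsigned bits_per_texel)
{
    return width * height * bits_per_texel / 8;
}

/* Example: a 256x256 texture.
 *   ETC1  (4 bits/texel):  texture_bytes(256, 256, 4)  ->  32768 bytes ( 32 KB)
 *   RGBA8 (32 bits/texel): texture_bytes(256, 256, 32) -> 262144 bytes (256 KB)
 * ETC1 keeps one eighth of the RGBA8 footprint. */
```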
3.4.9. Texture Image Layout Orientation (Important)
You can achieve better texture cache hit rates by aligning the orientation of the texture image used to the order in which fragments are generated in the framebuffer (Layout Pattern 2 in Figure 3-2) instead of aligning to the LCD display orientation (Layout Pattern 1 in Figure 3-2). This can improve display performance, particularly for 2D backgrounds and characters. The normal LCD layout orientation is shown in Figure 3-1, below.
The content of the framebuffer copied to the display buffer is what is actually shown on the LCD, but the important point to note here is the relationship between the fragment generation order and the texture image pixel order. The system rasterizes images in 8×8 pixel blocks (the blue box in Figure 3-1), but it fetches textures in 8×4 texel blocks. The green box in Figure 3-1 shows the orientation of the texture block fetched in Layout Pattern 1, and the red box shows the orientation for Layout Pattern 2.
Figure 3-2 shows the two texture image layout patterns. Because texture formats are handled in 8×8 pixel blocks, data ordering in memory is either a1 → a2 → b1 → b2 → for Layout Pattern 1, or C1 → C2 → A1 → A2 → for Layout Pattern 2. In other words, when rasterizing the block shown in Figure 3-1, the blocks fetched in Layout Pattern 1 would be c2 → c1 → c2 → c1, whereas in Layout Pattern 2 the blocks fetched would just be C1 → C2. Pattern 2 requires fewer block switches, and the next block to rasterize is the very next texture block in memory.
The yellow lines indicate the switching order for block fetching when two blocks are rasterized.
3.4.10. Differences Between nngxUpdateBuffer and nngxUpdateBufferLight (Important)
When the CPU changes vertex or texture data that resides on device memory directly accessed by the GPU, the application must guarantee the consistency of this data. In other words, if the CPU manipulates (decompresses, copies, or changes) this data, it must apply those changes to the GPU side.
The nngxUpdateBuffer function takes longer to execute in extended applications than in standard applications. For this reason, Nintendo recommends calling the nngxUpdateBufferLight function instead of nngxUpdateBuffer when you are only applying data manipulated by the CPU to the GPU. However, you must call the nngxUpdateBuffer function to access GPU rendering results with the CPU.
Because calls to nngxUpdateBuffer entail some overhead, calling it multiple times might lead to a drop in performance. If you manipulate multiple pieces of data with the CPU, do not call the function for each piece of data. Instead, call it once for the entire device memory region when all your CPU data operations are finished. This reduces the processing load.
Note:
Using the FS library to decompress data on device memory counts as manipulation of this data by the CPU.
Calling the nngxUpdateBuffer function twice on a small region and calling it once on the entire device memory region have roughly equivalent costs.
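One way to apply this advice is to record every region the CPU writes and collapse them into a single enclosing range, then issue one update for that range. The range-merging logic below is a sketch of ours; the final nngxUpdateBuffer call appears only as a comment because its exact signature should be taken from the nngx headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Enclosing range [start, end) of every region the CPU has modified. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} DirtyRange;

/* Grow the dirty range to cover one more CPU-modified region. */
void dirty_range_add(DirtyRange *r, uintptr_t addr, size_t size)
{
    if (r->start == r->end) {              /* empty range: adopt the region */
        r->start = addr;
        r->end   = addr + size;
        return;
    }
    if (addr < r->start)      r->start = addr;
    if (addr + size > r->end) r->end   = addr + size;
}

/* When all CPU-side writes are finished, issue a single update covering
 * the whole range instead of one call per modified region, for example:
 *   nngxUpdateBuffer((void *)range.start, range.end - range.start);
 * (check the actual nngxUpdateBuffer signature in the nngx headers) */
```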
3.4.10.1. The nngxAddVramDmaCommand and nngxAddVramDmaCommandNoCacheFlush Functions
The application must also take data consistency into account when transferring data from main memory to VRAM. In general, if the application uses DMPGL or the nngxAddVramDmaCommand function to transfer the data, the function handles the cache automatically.
In contrast, if the application uses the nngxAddVramDmaCommandNoCacheFlush function to transfer data, the function does not flush the cache. When you transfer data from multiple regions, you can reduce the processing load by first using the nngxUpdateBufferLight function so that the application itself ensures data consistency, and then using the nngxAddVramDmaCommandNoCacheFlush function to transfer the data.
3.4.11. Vertex Data Loading and Vertex Shader Processing
If no geometry shader is in use, the load of each individual vertex can, under ideal memory access conditions, hide up to 52 cycles of vertex shader processing, regardless of whether the vertex data is placed in device memory or VRAM.
This number of cycles is a theoretical value. The actual value can be reduced by several factors: differences in data size, the number of load arrays, changes in memory access speed due to other processes, and whether different types are combined in the vertex data attributes.
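As a back-of-the-envelope illustration of this bound (the helper and its example inputs are ours; only the 52-cycle figure comes from the text):

```c
/* Theoretical upper bound on vertex shader cycles hidden behind vertex
 * loading: at most 52 cycles per vertex, and never more work than the
 * shader actually performs for that vertex. */
unsigned long hidden_shader_cycles(unsigned long vertex_count,
                                   unsigned cycles_per_vertex)
{
    unsigned per_vertex = cycles_per_vertex < 52 ? cycles_per_vertex : 52;
    return (unsigned long)per_vertex * vertex_count;
}

/* hidden_shader_cycles(1000, 80) -> 52000 (capped at 52 per vertex)
 * hidden_shader_cycles(1000, 30) -> 30000 (the whole shader is hidden) */
```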
3.4.12. Effect of Vertex Attribute Combinations on the Vertex Data Transfer Rate
Depending on the combination of type and size of vertex attribute data making up the load array, look-ahead transfer can increase the vertex data transfer rate when using a vertex buffer.
Look-ahead transfer is used if the following condition is satisfied.
(the number of attributes of a type other than GL_FLOAT + the number of attributes with a data length of 1) <= (the number of GL_FLOAT attributes with a data length of 4 + the number of GL_FLOAT attributes with a data length of 3 / 2)
When counting attributes of a type other than GL_FLOAT, the data size does not matter; when counting attributes with a data length of 1, the data type does not matter. Vertex attributes that fall into more than one category at the same time are counted once in each category. For example, a vertex attribute of GL_BYTE type with a data length of 1 is included both in the count of attributes of a type other than GL_FLOAT and in the count of attributes with a data length of 1.
If the two sides of the comparison for whether to perform look-ahead transfer are equal, the determination depends on the amount of data per load array: the less data there is, the faster the transfer rate. If the amount of vertex data is also the same, the determination depends on the number of attributes in the load array: the fewer attributes there are, the faster the transfer rate.
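The counting rules above can be transcribed directly into code. This sketch is ours (the attribute record and its GL_FLOAT flag are simplified stand-ins for the real load array description); both sides of the inequality are doubled so the division by two stays in integer arithmetic.

```c
#include <stdbool.h>

typedef struct {
    bool is_float;  /* attribute data type is GL_FLOAT */
    int  length;    /* data length (number of components), 1 to 4 */
} VertexAttr;

/* Evaluates: (non-GL_FLOAT attrs + length-1 attrs)
 *         <= (GL_FLOAT length-4 attrs + GL_FLOAT length-3 attrs / 2) */
bool uses_lookahead_transfer(const VertexAttr *attrs, int count)
{
    int non_float = 0, len1 = 0, float4 = 0, float3 = 0;
    for (int i = 0; i < count; i++) {
        if (!attrs[i].is_float)   non_float++;  /* data size does not matter */
        if (attrs[i].length == 1) len1++;       /* data type does not matter */
        if (attrs[i].is_float && attrs[i].length == 4) float4++;
        if (attrs[i].is_float && attrs[i].length == 3) float3++;
    }
    /* A GL_BYTE attribute of length 1 is counted in both left-hand terms. */
    return 2 * (non_float + len1) <= 2 * float4 + float3;
}
```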
3.4.13. Address Alignment of the Vertex Array
When using a vertex buffer, you may be able to increase the efficiency of vertex array transfer during rendering by giving the vertex array 32-byte address alignment. The vertex array address here means the address obtained by adding the offset specified by glVertexAttribPointer (the value given by ptr) to the vertex buffer address.
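A quick way to check the alignment described here (a sketch of ours; the addresses in the example are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the vertex array address (vertex buffer address plus the
 * offset passed to glVertexAttribPointer) falls on a 32-byte boundary. */
bool vertex_array_is_32byte_aligned(uintptr_t buffer_addr, uintptr_t ptr_offset)
{
    return ((buffer_addr + ptr_offset) & 31u) == 0;
}
```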
Transfer efficiency does not always improve: the gain over a vertex array without 32-byte alignment depends on the type and size of the vertex attributes, the location where the vertex array is stored, and the content of the vertex index. Even if transfer performance improves, overall system performance will not necessarily improve unless vertex array transfer is the bottleneck.
3.5. Fragment Shaders
3.5.1. Number of Fragment Lights
The number of fragment lights affects GPU fill processing. The processing load increases proportionally with respect to both the fill area size and the number of lights. Avoid using fragment lighting when possible. For example, you can use static lighting and bake colors into vertices instead of sending normal data.
Note:
The GPU specifications always enable at least one fragment light. If you do not use the fragment lighting results, the per-fragment processing load does not change but you can reduce processing for the vertex shader’s quaternion output.
3.5.2. Layer Configuration (Important)
Choose layer configurations with smaller numbers (less processing) whenever possible.
If you are only using the D0 LUT, for example, layer configuration 7 is more expensive than layer configuration 1 by two clock cycles times the number of fragment lights for each fill area.
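For a feel of the magnitude, the difference can be estimated as follows (the helper and the 400×240 fill area in the example are our assumptions; the two-cycles-per-light figure is from the text):

```c
/* Extra GPU cycles of layer configuration 7 over configuration 1 when
 * only the D0 LUT is used: 2 cycles x fragment lights, per filled fragment. */
unsigned long long extra_layer7_cycles(unsigned long long filled_fragments,
                                       unsigned fragment_lights)
{
    return 2ULL * fragment_lights * filled_fragments;
}

/* Example: filling one hypothetical 400x240 screen with 2 fragment lights:
 * extra_layer7_cycles(400 * 240, 2) -> 384000 extra cycles */
```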
3.5.3. Unused Texture Unit Settings (Important)
When a texture is set in a texture unit, it is fetched even if (for example) it is not used by a texture combiner. To avoid unnecessary memory access, do not configure a texture unit that is not accessed by the combiners.
Note:
Accessing memory to fetch unnecessary textures is detrimental to power consumption.
3.5.4. Interpolation Between Mipmap Levels
When you configure linear interpolation to be used between mipmap levels, the number of texture fetches increases. Because this increases the GPU load and also leads to access conflicts, choose “nearest” interpolation between mipmap levels whenever possible.
3.5.5. Using or Not Using Blend Operations
Without blending, the framebuffer is only accessed for writes; with blending, read accesses also occur. Setting blend operations when they are not needed entails unnecessary memory accesses. Configure blend operations appropriately.
3.6. Framebuffers and Display Buffers
3.6.1. Avoiding the Use of glClear
Each time you use glClear to clear the framebuffer, two command requests are added, resulting in a large CPU load.
Instead of using glClear, you can render a background model first with depth tests disabled and depth writes enabled, eliminating the framebuffer clear operation and saving CPU and GPU processing.
For example, if you blend a polygon that covers the entire screen at the end of rendering to change the rendered brightness, the settings described above allow you to skip the clear operation for the depth buffer during the next rendering pass.
Note:
Settings that combine no depth testing with writing depth data cannot be made when using DMPGL. A library such as GR must be used instead to issue these commands.
3.6.2. Render Buffer Location (Important)
You can place the color buffer and depth buffer in different locations in VRAM (VRAM-A or VRAM-B) to parallelize fill operations when the buffers are used simultaneously. This improves performance.
3.6.3. Sharing Framebuffers to Conserve VRAM (Important)
When rendering to each of the screens, you can conserve VRAM by reusing the same framebuffer first for the right eye, then for the left eye, and finally for the lower screen.
3.6.4. Display Buffer Placement
When the display buffer is placed in main memory, main memory is accessed regularly because the GPU transfers the content of the display buffer to the LCD every frame.
When the display buffer is placed in VRAM, the content of the display buffer is transferred to the LCD without accessing main memory. Although this puts pressure on VRAM, it is useful in avoiding access conflicts when the cameras or other devices access main memory often.
Note:
If competing memory access prevents data from being transferred to the LCD on time, artifacts may be displayed on the LCD. You can also use nn::gx::SetMemAccessPrioMode to configure access priorities for main memory.
3.5.1. Number of Fragment Lights
The number of fragment lights affects GPU fill processing. The processing load increases proportionally with respect to both the fill area size and the number of lights. Avoid using fragment lighting when possible. For example, you can use static lighting and bake colors into vertices instead of sending normal data.
Note:
The GPU specifications always enable at least one fragment light. If you do not use the fragment lighting results, the per-fragment processing load does not change but you can reduce processing for the vertex shader’s quaternion output.
3.5.2. Layer Configuration (Important)
Choose layer configurations with smaller numbers (less processing) whenever possible.
If you are only using the D0 LUT, for example, layer configuration 7 is more expensive than layer configuration 1 by two clock cycles times the number of fragment lights for each fill area.
3.5.3. Unused Texture Unit Settings (Important)
When a texture is set in a texture unit, it is fetched even if (for example) it is not used by a texture combiner. To avoid unnecessary memory access, do not configure a texture unit that is not accessed by the combiners.
Note:
Accessing memory to fetch unnecessary textures is detrimental to power consumption.
3.5.4. Interpolation Between Mipmap Levels
When you configure linear interpolation to be used between mipmap levels, the number of texture fetches increases. Because this increases the GPU load and also leads to access conflicts, choose “nearest” interpolation between mipmap levels whenever possible.
3.5.5. Using or Not Using Blend Operations
Without blending, the framebuffer is only accessed for lights; with blending, read accesses also occur. Setting blend operations when they are not needed entails unnecessary memory accesses. Configure blend operations appropriately.
The number of fragment lights affects GPU fill processing. The processing load increases proportionally with respect to both the fill area size and the number of lights. Avoid using fragment lighting when possible. For example, you can use static lighting and bake colors into vertices instead of sending normal data.
The GPU specifications always enable at least one fragment light. If you do not use the fragment lighting results, the per-fragment processing load does not change but you can reduce processing for the vertex shader’s quaternion output.
Choose layer configurations with smaller numbers (less processing) whenever possible.
If you are only using the D0 LUT, for example, layer configuration 7 is more expensive than layer configuration 1 by two clock cycles times the number of fragment lights for each fill area.
3.5.3. Unused Texture Unit Settings (Important)
When a texture is set in a texture unit, it is fetched even if (for example) it is not used by a texture combiner. To avoid unnecessary memory access, do not configure a texture unit that is not accessed by the combiners.
Note:
Accessing memory to fetch unnecessary textures is detrimental to power consumption.
3.5.4. Interpolation Between Mipmap Levels
When you configure linear interpolation to be used between mipmap levels, the number of texture fetches increases. Because this increases the GPU load and also leads to access conflicts, choose “nearest” interpolation between mipmap levels whenever possible.
3.5.5. Using or Not Using Blend Operations
Without blending, the framebuffer is only accessed for lights; with blending, read accesses also occur. Setting blend operations when they are not needed entails unnecessary memory accesses. Configure blend operations appropriately.
When a texture is set in a texture unit, it is fetched even if (for example) it is not used by a texture combiner. To avoid unnecessary memory access, do not configure a texture unit that is not accessed by the combiners.
Accessing memory to fetch unnecessary textures is detrimental to power consumption.
When you configure linear interpolation to be used between mipmap levels, the number of texture fetches increases. Because this increases the GPU load and also leads to access conflicts, choose “nearest” interpolation between mipmap levels whenever possible.
3.5.5. Using or Not Using Blend Operations
Without blending, the framebuffer is only accessed for lights; with blending, read accesses also occur. Setting blend operations when they are not needed entails unnecessary memory accesses. Configure blend operations appropriately.
Without blending, the framebuffer is only accessed for lights; with blending, read accesses also occur. Setting blend operations when they are not needed entails unnecessary memory accesses. Configure blend operations appropriately.
3.6.1. Avoiding the Use of glClear
Each time you use glClear
to clear the framebuffer, two command requests are added, resulting in a large CPU load.
Instead of using glClear
, you can render a background model first with depth tests disabled and depth writes enabled to eliminate the framebuffer clear operation and save CPU/GPU processing.
For example, if you blend a polygon that covers the entire screen at the end of rendering to change the rendered brightness, the settings described above allow you to skip the clear operation for the depth buffer during the next rendering pass.
Note:
Settings that combine no depth testing with writing depth data cannot be made when using DMPGL. A library such as GR must be used instead to issue these commands.
3.6.2. Render Buffer Location (Important)
You can place the color buffer and depth buffer in different locations in VRAM (VRAM-A or VRAM-B) to parallelize fill operations when the buffers are used simultaneously. This improves performance.
3.6.3. Sharing Framebuffers to Conserve VRAM (Important)
When rendering to each of the screens, you can conserve VRAM by reusing the same framebuffer first for the right eye, then for the left eye, and finally for the lower screen.
3.6.4. Display Buffer Placement
When the display buffer is placed in main memory, main memory is accessed regularly because the GPU transfers the content of the display buffer to the LCD every frame.
When the display buffer is placed in VRAM, the content of the display buffer is transferred to the LCD without accessing main memory. Although this puts pressure on VRAM, it is useful in avoiding access conflicts when the cameras or other devices access main memory often.
Note:
If competing memory access prevents data from being transferred to the LCD on time, artifacts may be displayed on the LCD. You can also use nn::gx::SetMemAccessPrioMode to configure access priorities for main memory.
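The steady main-memory traffic at stake can be estimated with simple arithmetic (assuming 24-bit display buffers and a refresh rate of roughly 60 Hz; both are assumptions for illustration):

```cpp
#include <cstddef>

// Assumed scan-out parameters: 24-bit display buffer, ~60 Hz refresh.
constexpr std::size_t kBpp = 3, kHz = 60;

// Bytes per second read from memory to drive one LCD of size w x h.
constexpr std::size_t scanoutBytesPerSec(std::size_t w, std::size_t h) {
    return w * h * kBpp * kHz;
}

constexpr std::size_t kUpper = scanoutBytesPerSec(400, 240);  // upper LCD
constexpr std::size_t kLower = scanoutBytesPerSec(320, 240);  // lower LCD
```

Under these assumptions the two screens together read about 31 MB of display data per second. Placing the display buffers in VRAM removes this recurring read traffic from main memory, at the cost of VRAM space.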