15.1. glFinish and glFlush Functions
The glFinish() function is the same as the glFlush() function on a 3DS system.
void glFinish(void);
void glFlush(void);
15.2. Differences From OpenGL ES
Most features and functions follow the OpenGL ES specifications, but there are some differences. For example, some OpenGL ES functions and features have not been implemented, and others have restrictions.
| Features | Differences |
|---|---|
| | Reserved uniforms are used to control the alpha test. |
| | Reserved uniforms are used to control clipping. |
| | Logical operations are not present in OpenGL ES 2.0, but the equivalent features from OpenGL ES 1.1 have been implemented for 3DS. |
| | Dithering is not supported. |
| | Reserved uniforms are used to control fog. Fog coefficients are set by a lookup table that takes depth values in window coordinates as input, rather than by the distance from the viewpoint. |
| | Index colors are not supported. |
| | Reserved uniforms are used to control lighting. 3DS lighting is a per-fragment operation. |
| | Only |
| | Multisampling is not supported. |
| | The scissor test is run as a subprocess of rasterization. |
| | Reserved uniforms are used to control texture units. |
Function |
Differences |
---|---|
Shader Functions |
|
|
Only |
|
This function is not implemented. |
|
This function uses a (13-bit) namespace that is independent of shader objects. |
|
This function uses a namespace that is independent of program objects. |
glDrawArrays(), glDrawElements() |
|
|
This function is not implemented. |
|
This function is not implemented. |
|
This function is not implemented. |
glGetShaderPrecisionFormat |
This function is not implemented. |
|
This function is not implemented. |
|
This function is not implemented. |
|
Only |
|
This function is not implemented. |
|
This function does nothing when it is called. |
|
Neither |
Viewport Functions |
|
|
You cannot specify a negative value for |
Texture Functions |
|
|
An error is generated when |
|
The width and height must be specified as powers of 2 (from 16 through 1024). |
|
This function is not implemented. |
|
This function is not implemented. |
|
These functions are used to load lookup tables. One-dimensional textures are not supported. |
|
The width and height must be specified as powers of 2 (from 8 through 1024). |
|
This function is not implemented. |
Render Buffer Functions |
|
|
Index colors are not supported. The stencil buffer must be cleared at the same time as the depth buffer. This function is unaffected by the scissor test and masking. |
|
This function is not implemented. |
|
This function is not implemented. |
|
This function is not implemented. |
|
Because the stencil buffer is combined with the depth buffer, |
|
The color buffer is the only supported attachment point. |
|
This function is not implemented. |
|
This function is not implemented. |
|
There are restrictions on which formats can be specified. The only format that can be used for the stencil buffer is |
|
This function is not implemented. |
|
This function is not implemented. |
Blending Functions |
|
|
|
|
|
Other |
|
|
This function does the same thing as |
|
There is no difference until |
|
This function is not implemented. |
|
The value set for |
15.3. Improving Buffer Access Performance
When you are not using the color, depth, or stencil buffer, you can explicitly disable that buffer's features to reduce unnecessary processing. Because the 3DS depth and stencil buffers share the same buffer, however, settings that access either one have the same performance as settings that access both.
The following sections describe the conditions for accessing each buffer. Avoid these conditions to prevent unnecessary accesses from occurring.
15.3.1. Write Access to the Color Buffer
These accesses occur when the glColorMask() function specifies a value of GL_TRUE for any component.
15.3.2. Read Access to the Color Buffer
These accesses occur when write accesses to the color buffer occur and any of the following conditions are met.
- GL_BLEND has been enabled by the glEnable() function.
- A call of the glColorMask() function has not specified the same values for all components.
- GL_COLOR_LOGIC_OP has been enabled by the glEnable() function.
15.3.3. Write Access to the Depth Buffer
These accesses occur when GL_DEPTH_TEST has been enabled by the glEnable() function and GL_TRUE has been specified by the glDepthMask() function.
15.3.4. Read Access to the Depth Buffer
These accesses occur when GL_DEPTH_TEST has been enabled by the glEnable() function.
15.3.5. Write Access to the Stencil Buffer
These accesses occur when GL_STENCIL_TEST has been enabled by the glEnable() function, and a nonzero masking setting is used in the glStencilMask() function.
15.3.6. Read Access to the Stencil Buffer
These accesses occur when GL_STENCIL_TEST has been enabled by the glEnable() function.
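The conditions in sections 15.3.1 through 15.3.6 can be collected into a small set of predicates. The following is only a sketch: the BufferState structure and the function names are hypothetical and are not part of the SDK. It simply models the settings named above so that an application can check which buffer accesses a given configuration would trigger.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the settings named in 15.3.1 through 15.3.6. */
typedef struct {
    bool color_mask[4];     /* R, G, B, A values passed to glColorMask() */
    bool blend;             /* GL_BLEND enabled */
    bool color_logic_op;    /* GL_COLOR_LOGIC_OP enabled */
    bool depth_test;        /* GL_DEPTH_TEST enabled */
    bool depth_mask;        /* value passed to glDepthMask() */
    bool stencil_test;      /* GL_STENCIL_TEST enabled */
    unsigned stencil_mask;  /* value passed to glStencilMask() */
} BufferState;

/* 15.3.1: any color component is writable. */
bool color_write_access(const BufferState *s) {
    assert(s);
    return s->color_mask[0] || s->color_mask[1] ||
           s->color_mask[2] || s->color_mask[3];
}

/* 15.3.2: writes occur AND (blending, mixed color mask, or logic op). */
bool color_read_access(const BufferState *s) {
    assert(s);
    bool mixed_mask = !(s->color_mask[0] == s->color_mask[1] &&
                        s->color_mask[1] == s->color_mask[2] &&
                        s->color_mask[2] == s->color_mask[3]);
    return color_write_access(s) &&
           (s->blend || mixed_mask || s->color_logic_op);
}

/* 15.3.3 and 15.3.4 */
bool depth_write_access(const BufferState *s) { assert(s); return s->depth_test && s->depth_mask; }
bool depth_read_access(const BufferState *s)  { assert(s); return s->depth_test; }

/* 15.3.5 and 15.3.6 */
bool stencil_write_access(const BufferState *s) { assert(s); return s->stencil_test && s->stencil_mask != 0; }
bool stencil_read_access(const BufferState *s)  { assert(s); return s->stencil_test; }
```

For example, a state with all four color mask components set to true and blending disabled yields write access but no read access to the color buffer. Enabling GL_BLEND, or clearing only the alpha component of the mask, adds the read access.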
15.4. Improving CPU Performance
You may be able to improve processing speed by paying attention to the following points when you implement your application.
- Link as many vertex shader objects together as possible. You can link multiple shader objects together to generate a single shader binary. It is less resource intensive to switch between shader objects when they are linked to the same shader binary than when they are linked to different shader binaries.
- Ensure that applications keep the uniform location values obtained by the glGetUniformLocation() function, and use them repeatedly. Location values are static after the glLinkProgram() function is called. They do not change until glLinkProgram() is called again.
- Do not call the nngxSplitDrawCmdlist() function unnecessarily because it generates a split command each time it is called. For example, when the nngxTransferRenderImage() function is called, it generates a split command internally. Calling nngxSplitDrawCmdlist() immediately afterwards would generate an unnecessary command.
- Use a vertex buffer, whenever possible, to send vertex attribute data to a shader. If you do not use a vertex buffer, the CPU accumulates vertex data in the 3D command buffer, and CPU processing increases significantly.
- Use a texture collection or a vertex state collection when the same texture or vertex buffer is used repeatedly in a particular rendering pass. By binding all textures and setting all vertex arrays at the same time, you can reduce the cost of function calls.
- When executing the same shader object with different uniform settings, it is less resource intensive to attach that shader object to multiple program objects (each with its own uniform settings) and then switch between the program objects than it is to switch between many uniform settings on a single program object. The reason is that uniform values are saved for each program object.
- Do not delete and regenerate lookup table objects frequently. After the lookup table data is loaded by the glTexImage1D() function, it is converted into an internal hardware format each time the lookup table object is used.
15.5. Using Vertex Buffers
When a vertex buffer is used, the GPU geometry pipeline loads vertex data. When a vertex buffer is not used, however, the CPU sorts the vertex arrays according to the vertex index arrays, converts all vertex data into 24-bit floating-point values, and then fills the command buffer. This process imposes a considerably high processing load on the CPU, and also decreases the efficiency with which the GPU geometry pipeline loads data. This also requires a larger command buffer. All vertex data is converted into 24-bit floating-point numbers for the x, y, z, and w components when it is loaded into the command buffer, which must be able to hold at least 12 bytes multiplied by both the number of vertices and the number of vertex attributes.
Vertex buffers can be processed more quickly when they are placed in VRAM than when they are placed in main (device) memory. Splitting them between VRAM and main memory results in the same speed as placing them in main memory.
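The 12-byte figure mentioned above follows from four components (x, y, z, and w) at 24 bits, or 3 bytes, each. The sketch below estimates the minimum command buffer space that the vertex data alone would occupy when no vertex buffer is used. The function name is hypothetical, and the result is a lower bound only: it does not account for any other commands that the buffer must also hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4 components (x, y, z, w) x 3 bytes per 24-bit floating-point value. */
enum { BYTES_PER_VERTEX_ATTRIBUTE = 12 };

/* Lower bound on the command buffer bytes consumed by vertex data
   when no vertex buffer is used. */
size_t min_vertex_data_bytes(size_t vertex_count, size_t attribute_count) {
    /* Guard against overflow of the multiplication below. */
    assert(attribute_count == 0 ||
           vertex_count <= SIZE_MAX / BYTES_PER_VERTEX_ATTRIBUTE / attribute_count);
    return (size_t)BYTES_PER_VERTEX_ATTRIBUTE * vertex_count * attribute_count;
}
```

For example, 1,000 vertices with three attributes each (such as position, normal, and texture coordinates) need at least 36,000 bytes of command buffer space.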
15.5.1. Data Structure for Vertex Arrays
A vertex array can either be structured as an interleaved array, which is an array of structures that contain multiple vertex attributes, or as an independent array, which is an array of single vertex attributes.
When you use a vertex buffer, interleaved arrays are more efficient at loading vertex data than independent arrays. The time spent loading vertex data is often hidden by the time spent for processing later on, such as in the vertex shader and during rasterization. When the vertex buffer is placed in main memory, however, by making the loading of data more efficient, you can sometimes reduce the cost of data access and speed up processing.
15.6. Getting the Starting Address for Each Buffer Type
You can get the starting addresses of the data regions allocated for each texture object, vertex buffer object, and render buffer object.
The GPU can directly access all obtained addresses. You cannot get the address of the copy region that is generated by the driver.
void glGetTexParameteriv(GLenum target, GLenum pname, GLint* params); void glGetBufferParameteriv(GLenum target, GLenum pname, GLint* params); void glGetRenderbufferParameteriv(GLenum target, GLenum pname, GLint* params);
To get the texture address, call the glGetTexParameteriv() function and specify GL_TEXTURE_DATA_ADDR_DMP for pname.
To get the vertex buffer address, call the glGetBufferParameteriv() function and specify GL_BUFFER_DATA_ADDR_DMP for pname.
To get the render buffer address, call the glGetRenderbufferParameteriv() function and specify GL_RENDERBUFFER_DATA_ADDR_DMP for pname.
These functions allow you to get various types of information, depending on the value passed in for pname. For more information about the information that can be obtained, see the CTR-SDK API Reference.
15.7. Number of Bytes Loaded for Various Data Types
This section describes the number of bytes that the GPU loads from vertex buffers, textures, and command buffers in a single operation.
15.7.1. Vertex Buffers
The number of bytes loaded concurrently from a vertex buffer depends on the order of the vertex indices.
The glDrawElements() function loads 16 vertex indices concurrently from an index array, sorts them, and then loads data from the vertex array in the same order as the sorted indices. Consecutive data is loaded from the vertex array for any consecutive vertex indices.
Multiple vertex attributes are loaded as an interleaved array from a vertex array if: (1) the array contains vertex attributes and has been enabled by the glEnableVertexAttribArray() function, and (2) the driver interprets the array as an interleaved array (a vertex array that combines multiple vertex attributes) based on the information specified by the glVertexAttribPointer() function.
Burst reads of up to 256 bytes are used to load consecutive data from a vertex array. If there are more than 256 bytes to be loaded, they are read 256 bytes at a time. Data is read at least 16 bytes at a time, even for non-consecutive data.
The glDrawArrays() function processes data just like an index array, using consecutive numbers starting at 0.
15.7.2. Textures
The number of bytes that are loaded at one time depends on the texture format. The following table shows the number of bytes that are transferred from VRAM, avoiding the texture cache.
format |
type |
Bytes |
---|---|---|
|
|
128 |
|
64 |
|
|
64 |
|
|
|
96 |
|
64 |
|
|
|
64 |
|
32 |
|
|
|
32 |
|
16 |
|
|
|
32 |
|
16 |
|
|
|
64 |
|
- |
128 |
|
- |
32 |
15.7.3. Command Buffer
A command buffer loads 128 bytes at a time.
15.8. Block-Shaped Noise Is Rendered on Some Pixels
Data in the 3DS framebuffer is processed 4×4 pixels at a time. These blocks of pixels are called block addresses and are also used to manage the framebuffer cache. Tag information in the cache is cleared at several points, including when the glFinish(), glFlush(), or glClear() function is called; when the framebuffer-related GPU state (NN_GX_STATE_FRAMEBUFFER, NN_GX_STATE_FBACCESS) is validated; and when the command list is split by the nngxSplitDrawCmdlist() function. Cache tags are initialized with their default value of 0x3FFF after tag information in the cache is cleared. Consequently, any pixels that you attempt to render immediately afterward at the same block address as the default cache tag value (0x3FFF) mistakenly hit the cache. As a result, an incorrect color is applied to the pixels.
Block addresses are assigned consecutively, beginning at 0, in 16-pixel blocks from the starting address of the framebuffer (the color buffer, depth buffer, and stencil buffer). Because addresses are assigned to data that has been laid out in the GPU render format, pixel locations in a rendered image correspond to different block addresses in block 8 mode and in block 32 mode.
The problem described in this section is triggered by pixels that are assigned a block address of 0x3FFF. The problem does not occur when the total number of framebuffer blocks is less than or equal to 0x3FFF (in other words, when the total number of framebuffer pixels is less than or equal to 0x3FFF×16, or 262,128 pixels, which is slightly smaller than a 512×512 rectangle). This problem also does not occur when there are no read accesses on the color buffer, depth buffer, or stencil buffer.
Cache tag information is also cleared when a value of 1 is written to GPU register 0x0110.
For more information about accessing GPU registers directly and controlling the GPU state, block mode, and framebuffer read access, see the 3DS Programming Manual: Advanced Graphics.
15.8.1. Relationship Between Pixels and Block Addresses
As mentioned previously, block addresses begin at 0 and are assigned in ascending order, 16 pixels at a time, from the starting addresses of the color buffer and depth/stencil buffer, which are laid out in the GPU render format. Unlike the origin for the glViewport() function, the buffer addresses start with the pixels at the upper-left corner of the image to render. Note, too, that the image width (the horizontal direction) corresponds to the shorter edge of the LCD.
Because there are different ways to assign addresses, the block mode changes which block address in the cache corresponds to the pixels on an image.
15.8.1.1. Block 8 Mode
Block address 0 corresponds to the 4×4 block of pixels at the upper-left corner of the rendered image; block address 1 corresponds to the 4×4 block of pixels immediately to the right of block address 0; block address 2 corresponds to the 4×4 block of pixels immediately below block address 0; and block address 3 corresponds to the 4×4 block of pixels immediately below block address 1. Block addresses increase to the right 8×8 pixels at a time. After they reach the edge of the image, they continue from the left edge of the image on the next row.
The value of N in the figure is calculated by taking one-quarter the width of the framebuffer (in pixels) and multiplying it by two.
15.8.1.2. Block 32 Mode
In block 32 mode, addresses are assigned in metablocks of 32×32 pixels. Metablock address 0 corresponds to the 32×32-pixel region at the upper-left corner of the rendered image, and metablock address 1 corresponds to the 32×32-pixel region immediately to its right. Metablock addresses increase to the right 32×32 pixels at a time. After they reach the edge of the image, they continue from the left edge of the image on the next row.
As the following figure shows, the block addresses of the pixels are arranged in a zigzag pattern within each metablock, starting with the 4×4-pixel block at the upper-left corner. To find the block address of a single pixel in the image, multiply its metablock address by 0x40, and then add its block address within the metablock.
The left side of the figure shows the block addresses (in hexadecimal) for pixels within a metablock. The right side of the figure shows the metablock addresses for the entire image.
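The metablock part of this calculation can be sketched as follows. The function names are hypothetical, and the sketch assumes that the framebuffer width is a multiple of 32 pixels and that (x, y) is measured from the upper-left corner of the image as laid out in the buffer. The zigzag offset within the metablock depends on the figure and is not computed here, so the second function returns only the first block address of the metablock (the metablock address multiplied by 0x40).

```c
#include <assert.h>

enum {
    METABLOCK_SIZE = 32,          /* metablocks are 32x32 pixels */
    BLOCKS_PER_METABLOCK = 0x40   /* 32x32 pixels / 16 pixels per block */
};

/* Metablock address of the pixel at (x, y); metablocks run left to
   right, then continue from the left edge on the next row. */
unsigned metablock_address(unsigned x, unsigned y, unsigned width) {
    assert(width > 0 && width % METABLOCK_SIZE == 0 && x < width);
    unsigned metablocks_per_row = width / METABLOCK_SIZE;
    return (y / METABLOCK_SIZE) * metablocks_per_row + (x / METABLOCK_SIZE);
}

/* First block address inside the metablock that contains (x, y).
   The in-metablock zigzag offset must still be added to this value. */
unsigned metablock_first_block_address(unsigned x, unsigned y, unsigned width) {
    return metablock_address(x, y, width) * BLOCKS_PER_METABLOCK;
}
```

For a 512-pixel-wide framebuffer, the pixel at (40, 0) lies in metablock 1, and the pixel at (0, 32) lies in metablock 16, whose block addresses start at 0x400.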
15.8.2. Workaround #1
This problem does not occur when the framebuffer has no more than 262,128 pixels (the product of its width and height). In other words, you can work around this problem by using a framebuffer that is no larger than necessary.
Note that the problem does not occur with a framebuffer that has the same size as one of the LCDs—240×400 (96,000 pixels) or 240×320 (76,800 pixels)—because the total number of pixels does not exceed 262,128.
You specify the framebuffer size with the glRenderbufferStorage() function. If you allocate a large framebuffer but use only part of it (for example, a 240×400 region), you can avoid this problem by allocating no more than the minimum size that you actually need.
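The size condition for this workaround amounts to a single comparison. The following sketch uses a hypothetical function name and only restates the 262,128-pixel limit (0x3FFF blocks of 16 pixels each) described above.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    PIXELS_PER_BLOCK = 16,           /* one block address covers 4x4 pixels */
    PROBLEM_BLOCK_ADDRESS = 0x3FFF   /* default cache tag value */
};

/* Returns true when the framebuffer is small enough that no pixel
   is ever assigned block address 0x3FFF. */
bool framebuffer_avoids_problem_block(unsigned width, unsigned height) {
    assert(width > 0 && height > 0);
    unsigned long pixels = (unsigned long)width * height;
    unsigned long limit = (unsigned long)PROBLEM_BLOCK_ADDRESS * PIXELS_PER_BLOCK;
    return pixels <= limit;   /* limit is 262,128 pixels */
}
```

Both LCD-sized framebuffers (240×400 and 240×320) pass this check. A 480×800 framebuffer does not, and neither does a 512×512 framebuffer, which is 16 pixels (one block) over the limit.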
15.8.3. Workaround #2
You can work around this problem by adjusting the size of the framebuffer so that the problematic pixels at block address 0x3FFF are located outside of the rendering region.
For example, when you allocate a 480×800 framebuffer (as you would to apply 2×2 antialiasing in block 8 mode), block address 0x3FFF is assigned to the 4×4-pixel region whose upper-left corner is located at pixel coordinates (124, 548). If you were to extend the size of the framebuffer by 32 pixels to 512×800, however, block address 0x3FFF would be assigned to the 4×4-pixel region whose upper-left corner is located at pixel coordinates (508, 508). By configuring the viewport to display only the 480×800 region on the left side of the framebuffer, you can avoid these problematic pixels.
One disadvantage of this method is that it requires you to allocate a framebuffer that is larger than necessary, which wastes VRAM. However, it is a simple workaround that only involves adjusting the framebuffer size.
For more information about how differences in the block mode and framebuffer size affect the pixel coordinates to which block address 0x3FFF is assigned, see 15.8.1. Relationship Between Pixels and Block Addresses.
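The block 8 mode coordinates quoted in this section can be reproduced with a short calculation. The sketch below uses hypothetical names and assumes that the framebuffer width is a multiple of 8 pixels. It also assumes that N (one-quarter of the width, multiplied by two) is the number of block addresses in each 8-pixel-high row; that reading is an interpretation of 15.8.1.1, but it does reproduce both coordinates given in the example.

```c
#include <assert.h>

typedef struct {
    unsigned x;
    unsigned y;
} PixelPos;

/* Upper-left pixel of the 4x4 block at block_address in block 8 mode,
   measured from the upper-left corner of the image in the buffer. */
PixelPos block8_origin(unsigned block_address, unsigned width) {
    assert(width > 0 && width % 8 == 0);
    unsigned blocks_per_row = (width / 4) * 2;   /* N in 15.8.1.1 (assumed meaning) */
    unsigned row = block_address / blocks_per_row;
    unsigned in_row = block_address % blocks_per_row;
    unsigned cell = in_row / 4;   /* index of the 8x8 cell along the row */
    unsigned sub = in_row % 4;    /* 0: upper-left, 1: upper-right,
                                     2: lower-left, 3: lower-right */
    PixelPos p;
    p.x = cell * 8 + (sub & 1u) * 4;
    p.y = row * 8 + (sub >> 1) * 4;
    return p;
}
```

Calling block8_origin(0x3FFF, 480) gives (124, 548), and block8_origin(0x3FFF, 512) gives (508, 508), which fall inside and outside a 480-pixel-wide rendering region respectively.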
15.8.4. Workaround #3
You can work around this problem by rendering several pixels that are not at block address 0x3FFF to change the content of cache tags immediately after they have been cleared.
To change the content of the cache tags, you must render four pixels at specific block addresses. When both the color buffer and depth/stencil buffer have been configured to be read, these four pixels must each have a different block address, for which the lower three bits are all 1 (0x7). When only one buffer (either the color buffer or the depth/stencil buffer) has been configured to be read, these four pixels must each have a different block address, for which the lower four bits are all 1 (0xF).
For example, assume that pixels are rendered at the following block addresses immediately after cache tags are cleared: 0x00, 0x01, 0x0F, 0x02, 0x1F, 0x03, 0x0F, 0x2F, and 0x3F.
Block addresses 0x00 and 0x01 do not count because their lower four bits are not 0xF.
Block address 0x0F is only counted once, even though pixels are rendered there twice. In this example, the workaround is only effective after pixels have been rendered at block addresses 0x0F, 0x1F, 0x2F, and 0x3F. If pixels at block address 0x3FFF are rendered before the pixels at block address 0x3F, the problem would occur.
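The counting rule in this example can be sketched as a small function. The name and signature are hypothetical. The function scans a sequence of rendered block addresses and returns the index at which the fourth distinct qualifying address is reached, or -1 if the sequence never reaches it. As an assumption of this sketch, block address 0x3FFF itself is never counted, because the workaround relies on pixels that are not at that address.

```c
#include <assert.h>
#include <stddef.h>

/* low_mask is 0xF when only one buffer is read, or 0x7 when both the
   color buffer and the depth/stencil buffer are read. */
int workaround_effective_index(const unsigned *addresses, size_t count,
                               unsigned low_mask) {
    unsigned seen[4];
    size_t seen_count = 0;
    assert(low_mask == 0x7u || low_mask == 0xFu);
    for (size_t i = 0; i < count; ++i) {
        unsigned a = addresses[i];
        if (a == 0x3FFFu) continue;                /* never a valid dummy address */
        if ((a & low_mask) != low_mask) continue;  /* lower bits are not all 1 */
        int duplicate = 0;
        for (size_t j = 0; j < seen_count; ++j) {
            if (seen[j] == a) duplicate = 1;       /* repeats count only once */
        }
        if (duplicate) continue;
        seen[seen_count++] = a;
        if (seen_count == 4) return (int)i;        /* fourth distinct address */
    }
    return -1;
}
```

With the sequence from the example and a mask of 0xF, the function returns 8, the index of 0x3F. If the sequence is cut off before 0x3F, it returns -1.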
You can work around this problem by rendering dummy polygons, with pixels that meet these conditions, immediately after cache tags are cleared, given the following caveats.
The following are valid dummy pixels.
- Pixels that fail the depth test, stencil test, or alpha test. If you use settings that cause dummy pixels to always fail these tests (for example, by specifying GL_NEVER for the depth test function), make sure that you restore the original depth test function when you resume normal rendering. Note that the cache flush command (a command that writes to register 0x111) would be required at this time.
- Pixels that do not affect the color buffer when they are rendered because of alpha blend settings.
The following are not valid dummy pixels.
- Pixels that are clipped by the view volume or user-defined clipping planes.
- Pixels that are dropped by the scissor test.
- Pixels that are dropped by the early depth test.
15.8.4.1. Block 8 Mode
When you render a dummy polygon to work around this problem, you must choose pixels at block addresses whose lower four bits (or lower three bits) are all 1. If you look at how block addresses are arranged in block 8 mode, the lower four bits of the block address follow the same 32×8-pixel pattern repeated horizontally, and the lower three bits of the block address follow the same 16×8-pixel pattern repeated horizontally. However, depending on the framebuffer width, these patterns may be shifted horizontally by eight pixels, for every eight pixels vertically.
The following table shows rectangle sizes that meet the conditions for a dummy polygon. Note that the rectangle with the smallest possible area must be placed so that the pixels at its four corners have block addresses that meet the necessary conditions.
| Rectangle Shape/Conditions | Lower 4 Bits Are All 1 | Lower 3 Bits Are All 1 |
|---|---|---|
| Rectangle with the smallest possible area (cannot be placed anywhere). | 94×1 | 46×1 |
| Rectangle that can be placed anywhere. | 125×5 | 61×5 |
15.8.4.2. Block 32 Mode
When you render a dummy polygon to work around this problem, you must choose pixels at block addresses whose lower four bits (or lower three bits) are all 1. If you look at how block addresses are arranged in block 32 mode, the lower four bits of the block address follow the same 32×32-pixel pattern repeated horizontally, and the lower three bits of the block address follow the same 32×16-pixel pattern repeated horizontally.
The following table shows rectangle sizes that meet the conditions for a dummy polygon. Note that the rectangle with the smallest possible area must be placed so that the pixels at its four corners have block addresses that meet the necessary conditions.
| Rectangle Shape/Conditions | Lower 4 Bits Are All 1 | Lower 3 Bits Are All 1 |
|---|---|---|
| Rectangle with the smallest possible area (cannot be placed anywhere). | 46×1 | 46×1 |
| Rectangle that can be placed anywhere. | 61×13 | 61×5 |
15.9. Lines Are Rendered in Error and the Region Following the Framebuffer Is Corrupted
When rendering an extremely small polygon that has a right edge close to the window’s x-coordinate of 0, the system sometimes renders lines in error. This phenomenon is caused by coordinate wraparound when the pixel's x-coordinate becomes negative, due to calculation errors on polygon pixel generation. Because of the extremely large x-coordinate value, the system renders a polygon with extremely elongated dimensions in the positive x direction.
This phenomenon can also cause corruption in the memory region outside of the rendering memory region. The wrapped x-coordinate becomes 1023, and the system generates pixels from (0, y) through (1023, y), regardless of the size of the rendering region. In other words, the system generates pixels with x-coordinates from 0 through 1023, even when the rendering region size is set to a width smaller than 1024, such as 256×256. The framebuffer is accessed at the addresses calculated from the pixel's (x, y) coordinates and the width of the rendering region, but the raw x-coordinate is used for address calculation, even when it is outside of the rendering region's width. Consequently, depending on the y-coordinate value, the system might write pixel color data to memory addresses following the last address of the rendering region.
Memory corruption does not happen when the rendering region is cropped by using a scissor test. We recommend configuring a scissor test to avoid possible memory corruption. Conducting a scissor test does not entail any penalty in terms of GPU performance.
The conditions under which this phenomenon occurs depend solely on the window coordinates. For any set of window coordinates, the problem either always occurs or never occurs. This problem occurs in relation to the view volume or polygons generated as a result of clipping, so it occurs even when the original polygon itself is large, provided that the polygon protrudes beyond the edge of the screen when the window’s x-coordinate is 0, producing an extremely small area contained within the view volume.
One workaround for this issue is to adjust the x-coordinate in the vertex shader. The clipping x-coordinate calculated by the vertex shader is clipped from -w to w, so x values close to -w indicate a vertex close to the screen edge at the window’s x-coordinate of 0. You can avoid the erroneous lines by moving any such vertices (vertices located close to the edge of the screen where the x-coordinate is 0) away from the edge of the screen by adjusting the x value in the -w direction. This workaround changes the vertex coordinate by no more than one pixel, so it has almost no effect on rendering results.
When processing this in the vertex shader, handle the x value after applying a projection transform (the value written to the output register as the vertex coordinate x value) as follows.
if ( -w < x && x < -w * (1-epsilon) ) x = -w;
These x and w values are the x and w values of the vertex coordinate after the projection transformation. The epsilon value is a variable for adjustment that is to be specified as appropriate for the scene you are rendering.
The following code section is a sample implementation of a vertex shader. Instructions starting from mul are included to avoid displaying lines in error.
// v0 : position attribute
// o0 : output for position
// c0-c3 : modelview matrix
// c4-c7 : projection matrix
// c8 : (1 - epsilon, 1, any, any)
m4x4 r0, v0, c0 // modelview transformation
m4x4 r1, r0, c4 // projection transformation
mul r2.xy, -r1.w, c8.xy // r2.x = -w * (1-epsilon), r2.y = -w
cmp 2, 4, r1.xx, r2.xy //
ifc 1, 1, 1 // if ((x < -w * (1-epsilon)) && (x > -w))
mov r1.x, -r1.w // x = -w;
endif
mov o0, r1
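To experiment with epsilon values on the CPU before committing them to a shader constant, the same condition can be modeled in C. This is only a sketch: the function name is hypothetical, and it mirrors the condition given in this section rather than any SDK function.

```c
#include <assert.h>

/* CPU-side model of the vertex shader adjustment: x and w are the clip
   coordinates after the projection transform. */
float adjust_clip_x(float x, float w, float epsilon) {
    assert(epsilon >= 0.0f && epsilon < 1.0f);
    /* Snap x onto -w when it lies strictly between -w and -w * (1 - epsilon). */
    if (-w < x && x < -w * (1.0f - epsilon)) {
        x = -w;
    }
    return x;
}
```

With w = 1 and epsilon = 0.01, an x value of -0.995 is moved to -1, while an x value of -0.5 is left unchanged.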
15.10. Changing the Priority of Operations Within a Driver
The CPU in the 3DS has two cores, one dedicated to running applications, and the other dedicated to controlling the processes of the system’s devices.
The core for the system controls multiple devices, and GPU processing has a high priority among these devices. As a result, heavy graphics processing can affect processing for other devices. In such cases, you can use the nngxSetInternalDriverPrioMode() function to set the GPU to a lower priority and minimize the effect on other devices.
void nngxSetInternalDriverPrioMode(nngxInternalDriverPrioMode mode);
Select from the following values for the mode parameter.
| Definition | Description |
|---|---|
| | High priority (default) |
| | Low priority |
Note that lowering the priority for the GPU reduces the performance effect on other devices, but also reduces graphics performance.
15.11. Functions That Allocate Internal Buffers Within the Library
The library implicitly allocates internal buffers for some of the gl and nngx functions.
15.11.1. nngxValidateState, glDraw*
If a reference table (LUT) is being used, the library allocates an internal buffer.
When glTexImage1D is called and the library is notified that a reference table will be used, the next time nngxValidateState or glDraw* is executed, an intermediate buffer is allocated for loading the reference table. However, when a 3D command is generated directly, and a buffer for the reference table is specified, no intermediate buffer is allocated.
Note that the allocated region is freed by glDeleteTextures or nngxFinalize.
15.11.2. Command Lists, Display Lists, Textures, and Similar Objects
Functions for allocating command lists, display lists, textures, and other objects allocate internal buffers for each object. These buffers are maintained until the corresponding object has been destroyed using nngxDelete* or glDelete*.
This applies to the following functions.
nngxGenCmdlists, nngxGenDisplaybuffers, glCreateProgram, glCreateShader, glGenBuffers, glGenRenderbuffers, glGenTextures, and others.
15.12. Analyzing Causes of GPU Hangs
The data returned from a call to the nngxGetCmdlistParameteri() function when passing NN_GX_CMDLIST_HW_STATE for pname includes a number of bits that indicate whether the hardware is busy. This data can be helpful in analyzing the cause of problems in the operation of the hardware, such as when the GPU hangs.
When the hardware is malfunctioning, the likely cause is modules that are stuck in the busy state. For modules that work in sequence, such as the triangle setup > rasterization > texture unit modules, the busy signal propagates from the last module through to the first module in the chain. When a sequence of modules is busy, the last module in the chain is the most likely cause. However, the per-fragment operation module indicated by bit 6 in the returned data can be stuck in the busy state due to invalid data from a previous module, so the most likely cause could also be an earlier module.
Roughly speaking, busy signals propagate from rasterizing and pixel processing marked by bits 0 through 7, or from geometry processing marked by bits 8 through 16.
Rasterizing and pixel processing covers the sequence of modules of triangle setup > rasterization > texture unit > fragment lighting > texture combiner > per-fragment operation, with busy signals from modules later in the chain propagating to modules earlier in the chain. In other words, busy signals propagate in order from bit 5 to bit 0.
Bit 6 is also a per-fragment operation module busy signal. However, although this does propagate to bits 0 and 1, it does not propagate to bits 2, 3, or 4.
Bit 7 indicates a busy signal from the early depth test module, which occurs when the system is waiting for the early depth buffer to clear (GPU built-in memory). This busy signal does not propagate to other modules.
A busy signal in the triangle setup module does not propagate to earlier modules (the vertex cache or geometry generator). In other words, no busy signals propagate between rasterizing and pixel processing and geometry processing.
Geometry processing takes place in the following order: vertex input process module (which loads command buffers and vertex arrays) ⇒ vertex processor ⇒ post-vertex cache. Busy signals from modules later in the chain propagate to modules earlier in the chain. In other words, busy signals propagate in this order: bit 16 ⇒ (bit 11, bit 12, bit 13, bit 14) ⇒ bit 8 ⇒ bit 9. Bits 11, 12, 13, and 14 correspond to the busy states of vertex processors 0, 1, 2, and 3, but because each vertex processor is allocated in parallel with the vertex loading module and post-vertex cache, a busy signal from the post-vertex cache propagates to one or more of the four vertex processors. The signal might not propagate to all four of the vertex processors.
This description applies to the situation when the geometry shader is disabled. When it is enabled, vertex processor 0, which is the geometry shader processor, comes after the post-vertex cache. In this case, a busy signal from the geometry shader processor propagates to the post-vertex cache, but it does not propagate to any earlier modules. In other words, a busy signal would propagate in this order: bit 11 ⇒ bit 16. A busy signal arising from the post-vertex cache does propagate to earlier modules (vertex processors 1, 2, and 3). In other words, a busy signal would propagate in this order: bit 11 ⇒ bit 16, and bit 16 ⇒ (bit 12, bit 13, bit 14) ⇒ bit 8 ⇒ bit 9.
The post-vertex cache, indicated by bit 16, outputs a busy signal when it is filled to capacity with vertex data. If the cache cannot output this data to the next module for some reason, such as when the next module is not responding, vertex data fills the post-vertex cache to capacity. If the geometry shader is disabled, the next module is the triangle setup module. If the geometry shader is enabled, the next module is the geometry processor (vertex processor 0).
If the GPU hangs while the texture unit (bit 2) is busy, the cause can be attributed to a hardware bug that occurs when textures in a 4-bit format and textures in other formats are used together as a multitexture.
If the GPU hangs due to an incorrect load array setting (the load array is the unit in which vertex attribute data is loaded to the GPU), bit 8 enters a busy state.
If the GPU hangs because the vertex shader outputs NaN (or because the geometry shader outputs NaN when the geometry shader is being used), the rasterization module and the triangle setup module (bit 0 and bit 1) enter a busy state.
For more information about the nngxGetCmdlistParameteri() function and the bits obtained with NN_GX_CMDLIST_HW_STATE, see 4.1.10. Getting Command List Parameters.
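The bit assignments described in this section can be sketched as a small decoder. This is a minimal illustration, not SDK code: the macro and function names are hypothetical, and the input is assumed to be the 32-bit value already obtained for NN_GX_CMDLIST_HW_STATE. Only the bits named in the text above are covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the description above.
   The names are illustrative and are not SDK identifiers. */
#define HW_BUSY_SETUP_AND_RASTER  0x3u        /* bit 0 and bit 1 */
#define HW_BUSY_TEXTURE_UNIT      (1u << 2)   /* texture unit */
#define HW_BUSY_VERTEX_LOAD       (1u << 8)   /* load array (vertex attribute load) */
#define HW_BUSY_POST_VERTEX_CACHE (1u << 16)  /* post-vertex cache */

/* True if every bit in mask is set in the hardware state. */
bool hw_state_all_busy(uint32_t hw_state, uint32_t mask)
{
    return (hw_state & mask) == mask;
}

/* True if vertex processor `index` (0 to 3, bits 11 to 14) is busy.
   When the geometry shader is enabled, processor 0 is the geometry
   shader processor. */
bool hw_state_vertex_processor_busy(uint32_t hw_state, unsigned index)
{
    assert(index < 4u);
    return (hw_state & (1u << (11u + index))) != 0;
}

/* Returns a 4-bit mask in which bit n is set when vertex processor n is busy. */
unsigned hw_state_busy_vertex_processors(uint32_t hw_state)
{
    return (unsigned)((hw_state >> 11) & 0xFu);
}
```

For example, a state value of 0x00010107 has bit 0, bit 1, bit 2, bit 8, and bit 16 set, so the texture unit, the load array unit, and the post-vertex cache all report busy while no vertex processor does.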
15.12.1. Hardware State When the GPU Hangs
The following table provides examples of the hardware state obtained for NN_GX_CMDLIST_HW_STATE when the GPU hangs, and the related cause.
| Hardware State | Cause of GPU Hang |
|---|---|
| 0x00011303 | CPU destroyed content of vertex buffer while GPU was operating. |
| 0x00010103 | GPU bug caused content of vertex buffer in VRAM to be discarded in the middle of rendering. For information, see 15.9. Lines Are Rendered in Error and the Region Following the Framebuffer Is Corrupted. |
| 0x00010107 | Hang due to multitexture hardware bug. Conflict with texture addressing of 128-byte alignment. For information, see 7.3.1. Formats With 4-Bit Components. |
| 0x00000100 | Bit 8 in both PICA registers These registers must be set again when rendering using a GL function after rendering using a non-GL function. Unused elements of the load array (bits 31 to 28 of register |
| 0x00007300 | PICA register |
| 0x00000000 | Related register for the command buffer address jump function was not set properly. While in standby for command request execution, |
| 0x00000001 | |
For more information about how to set a PICA register, see 3DS Programming Manual: Advanced Graphics.
15.13. Effect of Vertex Attribute Combinations on Vertex Data Transfer Speed
When a vertex buffer is used, the combination of vertex attribute data type and data size (the type and size parameters in the glVertexAttribPointer() function) affects the speed of vertex data transfer.
Vertex attribute data stored in the vertex buffer is grouped together as one or multiple vertex attributes and loaded to the GPU. The load array is the unit used in loading the vertex data.
For more information about load arrays, see 3DS Programming Manual: Advanced Graphics.
When the GPU transfers load arrays, it determines whether to perform a read-ahead transfer based on the combination of vertex attribute data types and data sizes comprising the load array. If a read-ahead transfer can be performed, the vertex data transfer speed is faster.
Read-ahead transfer is performed when the load array satisfies the following condition.
(Number of attributes of a type other than GL_FLOAT + Number of attributes whose data size is 1) <= (Number of GL_FLOAT attributes whose data size is 4 + Number of GL_FLOAT attributes whose data size is 3 / 2)
For attributes of a type other than GL_FLOAT, the data size is arbitrary. For attributes whose data size is 1, the data type is arbitrary. A vertex attribute that meets more than one condition is counted once for each condition it meets. For instance, a GL_BYTE attribute whose data size is 1 is counted both as an attribute of a type other than GL_FLOAT and as an attribute whose data size is 1.
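The read-ahead condition can be checked ahead of time with a small helper. The following is a minimal sketch, not SDK code: the LoadArrayAttrib struct and the function name are hypothetical, and it assumes that the division by 2 applies only to the count of size-3 GL_FLOAT attributes, as normal operator precedence suggests. Both sides are doubled before comparison, which gives the same result whether that division is read as truncating or exact, because the left side is always an integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Standard OpenGL ES enum values, defined here only so that the sketch
   builds without the GL headers. */
#ifndef GL_BYTE
#define GL_BYTE  0x1400
#endif
#ifndef GL_FLOAT
#define GL_FLOAT 0x1406
#endif

/* Hypothetical description of one vertex attribute in a load array. */
typedef struct {
    unsigned type; /* type argument of glVertexAttribPointer() */
    int      size; /* size argument of glVertexAttribPointer(), 1 to 4 */
} LoadArrayAttrib;

/* Returns true when the load array meets the read-ahead condition. */
bool load_array_allows_read_ahead(const LoadArrayAttrib *attribs, size_t count)
{
    int left = 0;    /* non-GL_FLOAT count + size-1 count */
    int float4 = 0;  /* GL_FLOAT attributes whose size is 4 */
    int float3 = 0;  /* GL_FLOAT attributes whose size is 3 */

    assert(attribs != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        /* An attribute that meets both left-side conditions counts twice. */
        if (attribs[i].type != GL_FLOAT) ++left;
        if (attribs[i].size == 1)        ++left;
        if (attribs[i].type == GL_FLOAT && attribs[i].size == 4) ++float4;
        if (attribs[i].type == GL_FLOAT && attribs[i].size == 3) ++float3;
    }

    /* left <= float4 + float3 / 2, with both sides doubled. */
    return 2 * left <= 2 * float4 + float3;
}
```

Under these assumptions, a load array of two size-3 GL_FLOAT attributes and one size-4 GL_BYTE attribute meets the condition (1 <= 0 + 2 / 2), while one size-3 GL_FLOAT attribute and one size-4 GL_BYTE attribute does not.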
If the conditions for read-ahead transfer are matched, transfer speed depends on the data volume of load arrays. The smaller the data volume, the faster the transfer speed. If the volume of vertex data is the same, transfer speed depends on the number of attributes included in the load array. The fewer the load array attributes, the faster the transfer speed.
15.14. Vertex Array Address Alignment
When rendering with a vertex buffer, you can improve the efficiency of vertex array transfer processing by keeping vertex array addresses aligned to 32 bytes. The vertex array address is the sum of the vertex buffer address and the offset specified by the glVertexAttribPointer() function (the value specified in ptr).
The degree of improvement over a vertex array address that is not 32-byte aligned depends on the vertex attribute type and size, the storage location of the vertex array, and the content of the vertex indices. There is no guarantee that this method will be effective. In addition, even if transfer processing performance improves, overall system performance will not necessarily improve unless vertex array transfer processing is a noticeable bottleneck.
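The alignment rule can be checked when vertex data is laid out. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the application knows the address of the vertex buffer and treats the vertex array address as the buffer address plus the ptr offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VERTEX_ARRAY_ALIGNMENT 32u /* bytes */

/* True if buffer address + offset falls on a 32-byte boundary. */
bool vertex_array_is_aligned(uintptr_t buffer_address, uintptr_t offset)
{
    return ((buffer_address + offset) % VERTEX_ARRAY_ALIGNMENT) == 0;
}

/* Returns the smallest offset that is not less than `offset` and that
   places the vertex array on a 32-byte boundary. */
uintptr_t next_aligned_offset(uintptr_t buffer_address, uintptr_t offset)
{
    const uintptr_t mask = (uintptr_t)VERTEX_ARRAY_ALIGNMENT - 1u;
    uintptr_t address = buffer_address + offset;
    uintptr_t aligned = (address + mask) & ~mask; /* round up */
    uintptr_t result = aligned - buffer_address;

    assert(vertex_array_is_aligned(buffer_address, result));
    return result;
}
```

If the buffer itself starts on a 32-byte boundary, it is enough to keep each ptr offset a multiple of 32.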
15.15. GPU Hangs When Multitexture Is Used
On SNAKE hardware, the GPU does not hang when using multitextures. However, note that hangs can still occur when the application is running on CTR.
The GPU may experience a hang when both of the following conditions are met.
- Multiple textures are used.
- There is a considerable difference in performance between texture units.
Procedural texture is not included in these conditions. This phenomenon does not occur when one normal texture is used simultaneously with a procedural texture.
This phenomenon can be avoided by taking the following steps.
- Use only one texture.
- For all textures used, apply the GL_XXX_MIPMAP_LINEAR setting (the setting for trilinear filter use) to the GL_TEXTURE_MIN_FILTER texture parameter.
This parameter must be set even for textures for which there actually is no mipmap.
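One way to apply the trilinear setting consistently is to map each texture's current minification filter to its GL_XXX_MIPMAP_LINEAR counterpart before setting the parameter. This is a minimal sketch with a hypothetical helper name. It assumes that XXX stands for NEAREST or LINEAR, as in standard OpenGL ES, and the enum values are the standard ones, defined only so that the sketch builds without the GL headers.

```c
#include <assert.h>

#ifndef GL_NEAREST
#define GL_NEAREST                0x2600
#define GL_LINEAR                 0x2601
#define GL_NEAREST_MIPMAP_NEAREST 0x2700
#define GL_LINEAR_MIPMAP_NEAREST  0x2701
#define GL_NEAREST_MIPMAP_LINEAR  0x2702
#define GL_LINEAR_MIPMAP_LINEAR   0x2703
#endif

/* Maps a minification filter to the trilinear (GL_XXX_MIPMAP_LINEAR)
   setting that keeps the same in-level filtering (nearest or linear). */
unsigned to_trilinear_min_filter(unsigned min_filter)
{
    switch (min_filter) {
    case GL_NEAREST:
    case GL_NEAREST_MIPMAP_NEAREST:
    case GL_NEAREST_MIPMAP_LINEAR:
        return GL_NEAREST_MIPMAP_LINEAR;
    case GL_LINEAR:
    case GL_LINEAR_MIPMAP_NEAREST:
    case GL_LINEAR_MIPMAP_LINEAR:
        return GL_LINEAR_MIPMAP_LINEAR;
    default:
        assert(!"unknown minification filter");
        return GL_LINEAR_MIPMAP_LINEAR;
    }
}

/* Intended use with the texture bound (not compiled in this sketch):
   glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                   (GLint)to_trilinear_min_filter(current_filter)); */
```

The mapping is applied to every texture in use, including textures that have no mipmap levels.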
In addition, the following methods are recommended as a means of mitigating this phenomenon.
- Set the GL_TEXTURE_MIN_FILTER texture parameter to the GL_XXX_MIPMAP_LINEAR setting (the setting for trilinear filter use) for a portion of the textures to be used.
The problem can be avoided completely by setting all textures to the trilinear filter setting, but the phenomenon’s occurrence can be mitigated by using the setting on just a portion of the textures.
- Place all textures to be used at the same time in the same VRAM.
- Reduce the number of textures used.
Because the occurrence of this phenomenon is dependent on timing, changing texture settings as listed below can also avoid the problem much of the time. However, in some cases, the frequency of occurrence could actually worsen.
- Change the size of textures.
- Change texture formats.
- Change the filter mode of the textures.
- Change the storage locations of textures, switching between VRAM-A, VRAM-B, and FCRAM (device memory).
For a description of determining whether this phenomenon is the cause of a hang, see 15.12. Analyzing Causes of GPU Hangs.
However, note that the same type of hangs can also be caused by being in conflict with restrictions governing the storage location of the 4-bit texture format, making it difficult to determine the cause for certain.
15.16. GL_INTERPOLATE Calculations of the Texture Combiner
The texture combiner’s GL_INTERPOLATE expression is src0 * src2 + src1 * (1 - src2). If src2 is 1 or 0, you would expect the result of the calculation to be the same value as either src0 or src1. However, because of how the specifications have been implemented, if src2 is 0 and src0 < src1, the output result is not the value of src1 as you would expect, but rather something that is one unit less bright than src1.
To avoid this problem, combine GL_MODULATE and GL_MULT_ADD_DMP and perform the calculation in a two-stage combiner. Alternatively, if you change the src2 operand to GL_ONE_MINUS_* and switch src0 and src1, you may be able to reduce the number of fragments for which this problem arises.
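The two-stage workaround can be pictured as splitting the interpolation expression into its two terms. The sketch below models only the ideal arithmetic in floating point and does not reproduce the hardware's rounding. The function names are hypothetical, the assignment of terms to stages is one possible split, and GL_MULT_ADD_DMP is assumed to compute (first source * second source) + third source.

```c
#include <assert.h>

/* Ideal GL_INTERPOLATE result: src0 * src2 + src1 * (1 - src2). */
double interpolate_ideal(double src0, double src1, double src2)
{
    assert(src2 >= 0.0 && src2 <= 1.0);
    return src0 * src2 + src1 * (1.0 - src2);
}

/* Stage 1, modeled on GL_MODULATE: src1 multiplied by src2 read through
   a one-minus operand. */
double stage1_modulate(double src1, double src2)
{
    return src1 * (1.0 - src2);
}

/* Stage 2, modeled on GL_MULT_ADD_DMP (assumed to be a * b + c):
   src0 * src2 plus the result of stage 1. */
double stage2_mult_add(double src0, double src2, double previous)
{
    return src0 * src2 + previous;
}
```

With src2 set to 0, stage 1 yields src1 unchanged and stage 2 adds zero to it, which is the result the single-stage expression is expected to give.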
15.17. When Render Results Are a Complete Mismatch for Polygons With the Same Vertex Coordinates
Even when you render polygons that have exactly the same vertex coordinates, the various attribute values for the fragments might not match at all.
This phenomenon results from errors in the fragment interpolation calculations, which arise when vertices are input to the rasterization module in a different order. It does not occur when the input order of the vertices is exactly the same. However, if you render using the glDrawElements() function with GL_TRIANGLES as a parameter, the input order can change internally depending on the vertex indices of the immediately preceding polygon. To ensure that you can render multiple polygons with completely matching fragments, you must use the same vertex indices when rendering, including those for the polygons that precede and follow your desired polygons.