This chapter provides information about compilers and the optimization of generated code.
4.1. Coding With Optimization in Mind
4.1.1. Using the __restrict Keyword
By using the __restrict keyword, you can tell the compiler that the memory regions referenced by pointer or array function parameters do not overlap (in floating-point matrix operations, for example). This allows the compiler to perform optimizations that are not possible when the parameters might alias one another.
Example:
MTX44* MTX44Multiple(MTX44* pOut, const MTX44* __restrict p1, const MTX44* __restrict p2)
{
    // ...
}
The __restrict keyword has no effect when it is used in a template class method, but it is effective when attached to methods that are specialized for float values. The __restrict keyword also takes effect when it is added only to the CPP file (the definition) and not to the H file (the header).
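As a minimal sketch of the effect described above, the following hypothetical helper (AddScaled and its parameters are illustrative and not part of the SDK) promises the compiler that the output and input arrays never overlap. The sketch assumes a compiler that accepts the __restrict spelling as an extension.

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only: dst[i] += scale * src[i].
// The __restrict qualifiers promise that dst and src never overlap, so the
// compiler does not have to assume a store to dst changes a later src value.
void AddScaled(float* __restrict dst, const float* __restrict src,
               float scale, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i] += scale * src[i];
    }
}
```

The promise is the caller's responsibility. Passing overlapping ranges to a function like this results in undefined behavior.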
4.1.2. Using the register Modifier
In a calculation involving floating-point matrices, for example, you can make the dependencies between variables explicit by storing the calculated values in local variables declared with the register keyword. This makes it easier for the compiler to generate code that better accounts for latency, for instance by combining independent calculations.
Example:
MTX34* MTX34Multiple(MTX34* __restrict pOut, const MTX34* __restrict p1, const MTX34* __restrict p2)
{
    const f32 (*const a)[4] = p1->m;
    const f32 (*const b)[4] = p2->m;
    register f32 m[3][4];
    // Store the calculated result in register variables
    m[0][0] = a[0][0] * b[0][0] + a[0][1] * b[1][0] + a[0][2] * b[2][0];
    //...
    pOut->m[0][0] = m[0][0];
    //...
}
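The pattern above can be written out in full roughly as follows. This is a sketch under stated assumptions, not the SDK implementation: Mtx34 and Mtx34Multiply are hypothetical stand-ins for the SDK types, the 3x4 matrix is assumed to be an affine transform with an implied bottom row of (0, 0, 0, 1), and the register keyword is left out because current C++ standards no longer accept it. With RVCT, the local array m would be declared with register as in the example above.

```cpp
// Hypothetical 3x4 matrix type standing in for the SDK's MTX34.
struct Mtx34
{
    float m[3][4];
};

// Computes *pOut = (*p1) * (*p2), treating each matrix as an affine
// transform with an implied bottom row of (0, 0, 0, 1) (an assumption).
// The __restrict qualifiers assume that pOut does not overlap p1 or p2.
Mtx34* Mtx34Multiply(Mtx34* __restrict pOut,
                     const Mtx34* __restrict p1,
                     const Mtx34* __restrict p2)
{
    const float (*const a)[4] = p1->m;
    const float (*const b)[4] = p2->m;

    // All results go into a local array first. Nothing is stored to pOut
    // until every sum is computed, so the twelve sums are visibly
    // independent of one another.
    float m[3][4];
    for (int i = 0; i < 3; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            m[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j] + a[i][2] * b[2][j];
        }
        // Translation column: the implied 1 in the bottom row adds a[i][3].
        m[i][3] = a[i][0] * b[0][3] + a[i][1] * b[1][3] + a[i][2] * b[2][3] + a[i][3];
    }

    // Copy the finished results to the output.
    for (int i = 0; i < 3; ++i)
    {
        for (int j = 0; j < 4; ++j)
        {
            pOut->m[i][j] = m[i][j];
        }
    }
    return pOut;
}
```

Keeping the stores to pOut separate from the arithmetic is the point of the pattern: the compiler can see that no calculation depends on a value that was just written.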
4.1.3. Using Inline Assembly Code
You must use embedded assembly code to specify VFP instructions directly, but functions written in embedded assembly code cannot be optimized or inlined.
If you want the compiler to inline or optimize your code, you must use inline assembly code instead.
Instructions written using inline assembly code may be expanded to several instructions. The LDM, STM, LDRD, and STRD instructions may be expanded to the LDR and STR instructions.
Note:
In ARMCC 4.1 and later compiler versions, VFP instructions can be written using inline assembly code.
4.1.4. Optimization Warnings
If RVCT determines that it cannot optimize a function, it outputs a message similar to the following and stops optimizing that particular function (after expanding inline statements). As a result, the slowest code is output for that function. When a message such as the following is displayed, you can improve performance by revising your C++ code so that it no longer prevents optimization.
#1596-D: Could not optimize: External function reference prevents optimization
Pay attention to output related to optimization that begins with “Could not optimize:” or “Might not be able to optimize:”.
Note:
The CTR-SDK build system suppresses optimization warnings. To display these warnings, you must remove the --diag_suppress=optimizations option from CCFLAGS_WARNING in $(CTRSDK_ROOT)/build/omake/commondefs.cctype.RVCT.om.
4.2. Tuning at the Assembly Level
4.2.1. Referencing Disassembly Listings
You can use the following steps to disassemble and analyze the code output by the compiler.
The CTR-SDK build system writes the disassembly results to a file with the .dasm extension in the same location as the executable image.
If --interleave=source_only is added to the DISAS flag in $(CTRSDK_ROOT)/build/omake/commondefs.cctype.RVCT.om, the output file will include C source code.
To disassemble a single source file, pass its path to armcc.exe as follows.
armcc.exe --debug -O3 -Otime --cpu MPCore --interleave -S test.cpp
When the source is compiled with the -S option, only assembly code is generated. The output file has a .s extension.
Adding the --interleave option causes the output file to also contain the C source code and to have a .txt extension.
4.2.2. Analyzing CPU Stalls
You can analyze interlock delays caused by the processor pipeline, such as CPU stalls caused by VFP latency, using the relevant assembler warning messages.
Compile the source to be analyzed with the -S option specified; the assembly source is output.
armcc.exe --debug -O3 -Otime --cpu MPCore -S test.cpp
Warnings related to interlock delays are output when the generated assembly source is assembled with the --diag_warning 1563 option specified.
armasm --diag_warning 1563 --cpu=MPCore test.s
4.3. Reducing General Processing
You can reduce CPU processing to some extent by paying attention to the following points.
- Avoid frequent use of virtual functions.
- Reduce function calls by using inline or some other means.
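One way to apply both points is sketched below. The types and functions (Shape, Square, TotalAreaVirtual, TotalAreaDirect) are hypothetical and only illustrate the idea: the first loop makes one virtual call per element, while the second works on a known concrete type through a small inline member function that the compiler can expand in place.

```cpp
#include <cstddef>

// Hypothetical interface: each Area() call made through a Shape pointer is
// an indirect call that the compiler usually cannot inline.
struct Shape
{
    virtual ~Shape() {}
    virtual float Area() const = 0;
};

struct Square : Shape
{
    float side;
    explicit Square(float s) : side(s) {}

    // Non-virtual helper defined in the class body, so it is implicitly inline.
    float AreaDirect() const { return side * side; }

    virtual float Area() const { return AreaDirect(); }
};

// One virtual call per element.
float TotalAreaVirtual(const Shape* const* shapes, std::size_t count)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
        total += shapes[i]->Area();
    }
    return total;
}

// The concrete type is known, so the call can be expanded inside the loop.
float TotalAreaDirect(const Square* squares, std::size_t count)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
        total += squares[i].AreaDirect();
    }
    return total;
}
```

Both functions return the same total. Whether the difference in call overhead matters depends on how often the loop runs, so it is worth checking the disassembly or measuring before restructuring code this way.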