# Notes on opcodes

_Notes mainly by Connor Abbott extracted from the disassembler_

LOG_FREXPM:
        // From the ARM patent US20160364209A1:
        // "Decompose v (the input) into numbers x1 and s such that v = x1 * 2^s,
        // and x1 is a floating point value in a predetermined range where the
        // value 1 is within the range and not at one extremity of the range (e.g.
        // choose a range where 1 is towards middle of range)."
        //
        // This computes x1.

FRCP_FREXPM:
        // Given a floating point number m * 2^e, returns m * 2^{-1}. This is
        // exactly the same as the mantissa part of frexp().

FSQRT_FREXPM:
        // Given a floating point number m * 2^e, returns m * 2^{-2} if e is even,
        // and m * 2^{-1} if e is odd. In other words, scales by powers of 4 until
        // within the range [0.25, 1). Used for square-root and reciprocal
        // square-root.

FRCP_FREXPE:
        // Given a floating point number m * 2^e, computes -e - 1 as an integer.
        // Zero and infinity/NaN return 0.

FSQRT_FREXPE:
        // Computes floor(e/2) + 1.

FRSQ_FREXPE:
        // Given a floating point number m * 2^e, computes -floor(e/2) - 1 as an
        // integer.

LSHIFT_ADD_LOW32:
        // These instructions in the FMA slot, together with LSHIFT_ADD_HIGH32.i32
        // in the ADD slot, allow one to do a 64-bit addition with an extra small
        // shift on one of the sources. There are three possible scenarios:
        //
        // 1) Full 64-bit addition. Do:
        //      out.x = LSHIFT_ADD_LOW32.i64 src1.x, src2.x, shift
        //      out.y = LSHIFT_ADD_HIGH32.i32 src1.y, src2.y
        //
        // The shift amount is applied to src2 before adding. The shift amount, and
        // any extra bits from src2 plus the overflow bit, are sent directly from
        // FMA to ADD instead of being passed explicitly. Hence, these two must be
        // bundled together into the same instruction.
        //
        // 2) Add a 64-bit value src1 to a zero-extended 32-bit value src2.
        // Do:
        //      out.x = LSHIFT_ADD_LOW32.u32 src1.x, src2, shift
        //      out.y = LSHIFT_ADD_HIGH32.i32 src1.x, 0
        //
        // Note that in this case, the second argument to LSHIFT_ADD_HIGH32 is
        // ignored, so it can actually be anything. As before, the shift is applied
        // to src2 before adding.
        //
        // 3) Add a 64-bit value to a sign-extended 32-bit value src2. Do:
        //      out.x = LSHIFT_ADD_LOW32.i32 src1.x, src2, shift
        //      out.y = LSHIFT_ADD_HIGH32.i32 src1.x, 0
        //
        // The only difference is the .i32 instead of .u32. Otherwise, this is
        // exactly the same as before.
        //
        // In all these instructions, the shift amount is stored where the third
        // source would be, so the shift has to be a small immediate from 0 to 7.
        // This is fine for the expected use-case of these instructions, which is
        // manipulating 64-bit pointers.
        //
        // These instructions can also be combined with various load/store
        // instructions which normally take a 64-bit pointer in order to add a
        // 32-bit or 64-bit offset to the pointer before doing the operation,
        // optionally shifting the offset. The load/store op implicitly does
        // LSHIFT_ADD_HIGH32.i32 internally.
        // Letting ptr be the pointer, and offset the desired offset, the cases
        // go as follows:
        //
        // 1) Add a 64-bit offset:
        //      LSHIFT_ADD_LOW32.i64 ptr.x, offset.x, shift
        //      ld_st_op ptr.y, offset.y, ...
        //
        // Note that the output of LSHIFT_ADD_LOW32.i64 is not used, instead being
        // implicitly sent to the load/store op to serve as the low 32 bits of the
        // pointer.
        //
        // 2) Add a 32-bit unsigned offset:
        //      temp = LSHIFT_ADD_LOW32.u32 ptr.x, offset, shift
        //      ld_st_op temp, ptr.y, ...
        //
        // Now, the low 32 bits of (offset << shift) + ptr are passed explicitly to
        // the ld_st_op, to match the case where there is no offset and ld_st_op is
        // called directly.
        //
        // 3) Add a 32-bit signed offset:
        //      temp = LSHIFT_ADD_LOW32.i32 ptr.x, offset, shift
        //      ld_st_op temp, ptr.y, ...
        //
        // Again, the same as the unsigned case except that the offset is
        // sign-extended.

---

ADD ops..

F16_TO_F32.X: // take the low 16 bits, and expand it to a 32-bit float
F16_TO_F32.Y: // take the high 16 bits, and expand it to a 32-bit float

MOV:
        // Logically, this should be SWZ.XY, but that's equivalent to a move, and
        // this seems to be the canonical way the blob generates a MOV.

FRCP_FREXPM:
        // Given a floating point number m * 2^e, returns m * 2^{-1}.

FLOG_FREXPE:
        // From the ARM patent US20160364209A1:
        // "Decompose v (the input) into numbers x1 and s such that v = x1 * 2^s,
        // and x1 is a floating point value in a predetermined range where the
        // value 1 is within the range and not at one extremity of the range (e.g.
        // choose a range where 1 is towards middle of range)."
        //
        // This computes s.

LD_UBO.v4i32:
        // src0 = offset, src1 = binding

FRCP_FAST.f32:
        // *_FAST does not exist on G71 (added to G51, G72, and everything after)

FRCP_TABLE:
        // Given a floating point number m * 2^e, produces a table-based
        // approximation of 2/m using the top 17 bits.
        // Includes special cases for infinity, NaN, and zero, and copies the
        // sign bit.

FRCP_FAST.f16.X:
        // Exists on G71

FRSQ_TABLE:
        // A similar table for inverse square root, using the high 17 bits of the
        // mantissa as well as the low bit of the exponent.

FRCP_APPROX:
        // Used in the argument reduction for log. Given a floating-point number
        // m * 2^e, uses the top 4 bits of m to produce an approximation to 1/m
        // with the exponent forced to 0 and only the top 5 bits nonzero. 0,
        // infinity, and NaN all return 1.0.
        // See the ARM patent for more information.

MUX:
        // For each bit i, return src2[i] ? src0[i] : src1[i]. In other words, this
        // is the same as (src2 & src0) | (~src2 & src1).

ST_VAR:
        // store a varying given the address and datatype from LD_VAR_ADDR

LD_VAR_ADDR:
        // Compute varying address and datatype (for storing in the vertex shader),
        // and store the vec3 result in the data register. The result is passed as
        // the 3 normal arguments to ST_VAR.

DISCARD:
        // Conditional discards (discard_if) in NIR. Compares the first two
        // sources and discards if the result is true.

ATEST.f32:
        // Implements alpha-to-coverage, as well as possibly the late depth and
        // stencil tests. The first source is the existing sample mask in R60
        // (possibly modified by gl_SampleMask), and the second source is the alpha
        // value. The sample mask is written right away based on the
        // alpha-to-coverage result using the normal register write mechanism,
        // since that doesn't need to read from any memory, and then written again
        // later based on the result of the stencil and depth tests using the
        // special register.

BLEND:
        // This takes the sample coverage mask (computed by ATEST above) as a
        // regular argument, in addition to the vec4 color in the special register.
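
The *_FREXPM / *_FREXPE pairs described in the FMA notes can be checked numerically. The following is a minimal Python sketch, not the hardware algorithm: the function names simply mirror the opcode names, `math.frexp` stands in for the bit-level decomposition, and exact division and `math.sqrt` stand in for whatever table lookups and refinement the hardware actually performs on the reduced mantissa. It assumes finite, positive, normal inputs; the special cases for zero, infinity and NaN are not modelled.

```python
import math


def frcp_frexpm(v):
    """Mantissa m * 2^-1 for v = m * 2^e with m in [1, 2)."""
    mant, _ = math.frexp(v)  # frexp already returns m / 2 in [0.5, 1)
    return mant


def frcp_frexpe(v):
    """Exponent -e - 1 to apply to the reciprocal of the reduced mantissa."""
    _, exp = math.frexp(v)  # frexp's exponent is e + 1
    return -exp


def fsqrt_frexpm(v):
    """Mantissa scaled by a power of 4 into [0.25, 1)."""
    mant, exp = math.frexp(v)  # mant = m / 2, exp = e + 1
    e = exp - 1
    # e odd: m * 2^-1 is mant; e even: m * 2^-2 is mant / 2
    return mant if e % 2 else mant / 2


def fsqrt_frexpe(v):
    """Exponent floor(e/2) + 1 to apply to the square root of the mantissa."""
    _, exp = math.frexp(v)
    return (exp - 1) // 2 + 1


def frsq_frexpe(v):
    """Exponent -floor(e/2) - 1 for the reciprocal square root."""
    return -fsqrt_frexpe(v)


def reciprocal(v):
    # 1/v = (1 / (m/2)) * 2^(-e-1)
    return math.ldexp(1.0 / frcp_frexpm(v), frcp_frexpe(v))


def square_root(v):
    return math.ldexp(math.sqrt(fsqrt_frexpm(v)), fsqrt_frexpe(v))


def reciprocal_square_root(v):
    return math.ldexp(1.0 / math.sqrt(fsqrt_frexpm(v)), frsq_frexpe(v))
```

For v = 6.0 (m = 1.5, e = 2) the reciprocal pair gives 0.75 and -3, and the square-root pair gives 0.375 and 2, so recombining with `ldexp` recovers 1/6 and sqrt(6) up to rounding.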
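
The LSHIFT_ADD_LOW32 / LSHIFT_ADD_HIGH32 pairing described in the FMA notes can be modelled in a few lines of Python. This is a sketch under stated assumptions: the function names, the `mode` strings and the single `forwarded` value are hypothetical stand-ins for the shift amount, extra src2 bits and overflow bit that the notes say are passed implicitly from FMA to ADD. The high half is given the upper word of src1 here, which is an assumption about the intended operand.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def lshift_add_low32(src1_lo, src2_lo, shift, mode):
    """FMA half: returns (low 32 bits of the sum, bits forwarded to ADD)."""
    if not 0 <= shift <= 7:
        raise ValueError("shift must be a small immediate from 0 to 7")
    value = src2_lo & MASK32
    if mode == "i32" and value & 0x80000000:
        value -= 1 << 32  # sign-extend the 32-bit source
    elif mode not in ("i64", "u32", "i32"):
        raise ValueError("unknown mode: " + mode)
    total = (src1_lo & MASK32) + (value << shift)
    # total == forwarded * 2^32 + low, with forwarded possibly negative
    return total & MASK32, total >> 32


def lshift_add_high32(src1_hi, src2_hi, shift, forwarded):
    """ADD half: adds the high words plus whatever the FMA half forwarded."""
    return ((src1_hi & MASK32) + ((src2_hi & MASK32) << shift) + forwarded) & MASK32


def lshift_add_64(src1, src2, shift, mode):
    """Full 64-bit result of src1 + (extend(src2) << shift)."""
    # For the 32-bit modes the second source of the high half is ignored.
    src2_hi = (src2 >> 32) & MASK32 if mode == "i64" else 0
    lo, forwarded = lshift_add_low32(src1 & MASK32, src2 & MASK32, shift, mode)
    hi = lshift_add_high32((src1 >> 32) & MASK32, src2_hi, shift, forwarded)
    return (hi << 32) | lo
```

With a pointer of 0x100000000 and a 32-bit source of 0xFFFFFFFF shifted by 2, the "i32" mode subtracts 4 while the "u32" mode adds 0x3FFFFFFFC, which is the signed versus unsigned offset distinction drawn in the notes.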
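
The two equivalent formulations given for MUX can be compared directly. This Python sketch assumes 32-bit operands (the width is an assumption; the notes do not state it) and uses hypothetical function names.

```python
MASK32 = 0xFFFFFFFF


def mux(src0, src1, src2):
    """Mask form: (src2 & src0) | (~src2 & src1), truncated to 32 bits."""
    return ((src2 & src0) | (~src2 & src1)) & MASK32


def mux_per_bit(src0, src1, src2):
    """Per-bit form: for each bit i, src2[i] ? src0[i] : src1[i]."""
    result = 0
    for i in range(32):
        chosen = src0 if (src2 >> i) & 1 else src1
        result |= ((chosen >> i) & 1) << i
    return result
```

For example, `mux(0xAAAAAAAA, 0x55555555, 0xFFFF0000)` takes the upper half from src0 and the lower half from src1, giving 0xAAAA5555.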