Path: blob/21.2-virgl/src/panfrost/lib/pan_tiler.c
/*
 * Copyright (C) 2019 Collabora, Ltd.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 * Authors:
 *   Alyssa Rosenzweig <[email protected]>
 */

#include "util/u_math.h"
#include "util/macros.h"
#include "pan_device.h"
#include "pan_encoder.h"
#include "panfrost-quirks.h"

/* Mali GPUs are tiled-mode renderers, rather than immediate-mode.
 * Conceptually, the screen is divided into 16x16 tiles. Vertex shaders run.
 * Then, a fixed-function hardware block (the tiler) consumes the gl_Position
 * results. For each triangle specified, it marks each containing tile as
 * containing that triangle. This set of "triangles per tile" forms the
 * "polygon list". Finally, the rasterization unit consumes the polygon list
 * to invoke the fragment shader.
 *
 * In practice, it's a bit more complicated than this.
 * On Midgard chips with an "advanced tiling unit" (all except
 * T720/T820/T830), 16x16 is the logical tile size, but Midgard features
 * "hierarchical tiling", where power-of-two multiples of the base tile size
 * can be used: hierarchy level 0 (16x16), level 1 (32x32), level 2 (64x64),
 * per public information about Midgard's tiling. In fact, tiling goes up to
 * 4096x4096 (!), although in practice 128x128 is the largest usually used
 * (though higher modes are enabled). The idea behind hierarchical tiling is
 * to use low tiling levels for small triangles and high levels for large
 * triangles, to minimize memory bandwidth and repeated fragment shader
 * invocations (the former issue inherent to immediate-mode rendering and the
 * latter common in traditional tilers).
 *
 * The tiler itself works by reading varyings in and writing a polygon list
 * out. Unfortunately (for us), both of these buffers are managed in main
 * memory; although they ideally will be cached, it is the driver's
 * responsibility to allocate these buffers. Varying buffer allocation is
 * handled elsewhere, as it is not tiler specific; the real issue is allocating
 * the polygon list.
 *
 * This is hard, because from the driver's perspective, we have no information
 * about what geometry will actually look like on screen; that information is
 * only gained from running the vertex shader. (Theoretically, we could run the
 * vertex shaders in software as a prepass, or in hardware with transform
 * feedback as a prepass, but either idea is ludicrous on so many levels).
 *
 * Instead, Mali uses a bit of a hybrid approach, splitting the polygon list
 * into three distinct pieces. First, the driver statically determines which
 * tile hierarchy levels to use (more on that later). At this point, we know the
 * framebuffer dimensions and all the possible tilings of the framebuffer, so
 * we know exactly how many tiles exist across all hierarchy levels.
 * The first piece of the polygon list is the header, which is exactly 8
 * bytes per tile, plus padding and a small 64-byte prologue. (If that doesn't
 * remind you of AFBC, it should. See pan_afbc.c for some fun parallels). The
 * next part is the polygon list body, which seems to contain 512 bytes per
 * tile, again across every level of the hierarchy. These two parts form the
 * polygon list buffer. This buffer has a statically determinable size,
 * approximately equal to the # of tiles across all hierarchy levels * (8
 * bytes + 512 bytes), plus alignment / minimum restrictions / etc.
 *
 * The third piece is the easy one (for us): the tiler heap. In essence, the
 * tiler heap is a gigantic slab that's as big as could possibly be necessary
 * in the worst case imaginable. Just... a gigantic allocation that we give a
 * start and end pointer to. What's the catch? The tiler heap is lazily
 * allocated; that is, a huge amount of memory is _reserved_, but only a tiny
 * bit is actually allocated upfront. The GPU just keeps using the
 * unallocated-but-reserved portions as it goes along, generating page faults
 * if it goes beyond the allocation, and then the kernel is instructed to
 * expand the allocation on page fault (known in the vendor kernel as growable
 * memory). This is quite a bit of bookkeeping of its own, but that task is
 * pushed to kernel space and we can mostly ignore it here, just remembering to
 * set the GROWABLE flag so the kernel actually uses this path rather than
 * allocating a gigantic amount up front and burning a hole in RAM.
 *
 * As far as determining which hierarchy levels to use, the simple answer is
 * that right now, we don't. In the tiler configuration fields (consistent from
 * the earliest Midgard's SFBD through the latest Bifrost traces we have),
 * there is a hierarchy_mask field, controlling which levels (tile sizes) are
 * enabled.
 * Ideally, the hierarchical tiling dream -- mapping big polygons to big
 * tiles and small polygons to small tiles -- would be realized here as well.
 * As long as there are polygons at all needing tiling, we always have to have
 * big tiles available, in case there are big polygons. But we don't
 * necessarily need small tiles available. Ideally, when there are small
 * polygons, small tiles are enabled (to avoid waste from putting small
 * triangles in the big tiles); when there are not, small tiles are disabled
 * to avoid enabling more levels than necessary, which potentially costs in
 * memory bandwidth / power / tiler performance.
 *
 * Of course, the driver has to figure this out statically. When tile
 * hierarchies are actually established, this is done by the tiler in
 * fixed-function hardware, after the vertex shaders have run and there is
 * sufficient information to figure out the size of triangles. The driver has
 * no such luxury, again barring insane hacks like additionally running the
 * vertex shaders in software or in hardware via transform feedback. Thus, for
 * the driver, we need a heuristic approach.
 *
 * There are lots of heuristics to guess triangle size statically you could
 * imagine, but one approach shines as particularly simple-stupid: assume all
 * on-screen triangles are equal size and spread equidistantly throughout the
 * screen. Let's be clear, this is NOT A VALID ASSUMPTION.
 * But if we roll with it, then we see:
 *
 *      Triangle Area   = (Screen Area / # of triangles)
 *                      = (Width * Height) / (# of triangles)
 *
 * Or if you prefer, we can also make a third CRAZY assumption that we only draw
 * right triangles with edges parallel/perpendicular to the sides of the screen
 * with no overdraw, forming a triangle grid across the screen:
 *
 * |--w--|
 *  _____   |
 * | /| /|  |
 * |/_|/_|  h
 * | /| /|  |
 * |/_|/_|  |
 *
 * Then you can use some middle school geometry and algebra to work out the
 * triangle dimensions. I started working on this, but realised I didn't need
 * to in order to make my point, but couldn't bear to erase that ASCII art.
 * Anyway.
 *
 * POINT IS, by considering the ratio of screen area and triangle count, we can
 * estimate the triangle size. For a small size, use small bins; for a large
 * size, use large bins. Intuitively, this metric makes sense: when there are
 * few triangles on a large screen, you're probably compositing a UI and
 * therefore the triangles are large; when there are a lot of triangles on a
 * small screen, you're probably rendering a 3D mesh and therefore the
 * triangles are tiny. (Or better said -- there will be tiny triangles, even if
 * there are also large triangles. There have to be unless you expect crazy
 * overdraw.
 * Generally, it's better to allow more small bin sizes than necessary than
 * not allow enough.)
 *
 * From this heuristic (or whatever), we determine the minimum allowable tile
 * size, and we use that to decide the hierarchy masking, selecting from the
 * minimum "ideal" tile size to the maximum tile size (2048x2048 in practice).
 *
 * Once we have that mask and the framebuffer dimensions, we can compute the
 * size of the statically-sized polygon list structures, allocate them, and go!
 *
 * -----
 *
 * On T720, T820, and T830, there is no support for hierarchical tiling.
 * Instead, the hardware allows the driver to select the tile size dynamically
 * on a per-framebuffer basis, including allowing rectangular/non-square tiles.
 * Rules for tile size selection are as follows:
 *
 *  - Dimensions must be powers-of-two.
 *  - The smallest tile is 16x16.
 *  - The tile width/height is at most the framebuffer w/h (clamp up to 16 pix)
 *  - There must be no more than 64 tiles in either dimension.
 *
 * Within these constraints, the driver is free to pick a tile size according
 * to some heuristic, similar to units with an advanced tiling unit.
 *
 * To pick a size without any heuristics, we may satisfy the constraints by
 * defaulting to 16x16 (a power-of-two). This fits the minimum. For the size
 * constraint, consider:
 *
 *      # of tiles < 64
 *      ceil (fb / tile) < 64
 *      (fb / tile) <= (64 - 1)
 *      tile >= fb / (64 - 1), e.g. tile = next_power_of_two(fb / (64 - 1))
 *
 * Hence we clamp up to align_pot(fb / (64 - 1)).
 *
 * Extending to use a selection heuristic left for future work.
 *
 * Once the tile size (w, h) is chosen, we compute the hierarchy "mask":
 *
 *      hierarchy_mask = (log2(h / 16) << 6) | log2(w / 16)
 *
 * Of course with no hierarchical tiling, this is not a mask; it's just a field
 * specifying the tile size.
 * But I digress.
 *
 * We also compute the polygon list sizes (with framebuffer size W, H) as:
 *
 *      full_size = 0x200 + 0x200 * ceil(W / w) * ceil(H / h)
 *      offset = 8 * ceil(W / w) * ceil(H / h)
 *
 * It further appears necessary to round down offset to the nearest 0x200.
 * Possibly we would also round down full_size to the nearest 0x200 but
 * full_size/0x200 = (1 + ceil(W / w) * ceil(H / h)) is an integer so there's
 * nothing to do.
 */

/* Hierarchical tiling spans from 16x16 to 4096x4096 tiles */

#define MIN_TILE_SIZE 16
#define MAX_TILE_SIZE 4096

/* Constants as shifts for easier power-of-two iteration */

#define MIN_TILE_SHIFT util_logbase2(MIN_TILE_SIZE)
#define MAX_TILE_SHIFT util_logbase2(MAX_TILE_SIZE)

/* The hierarchy has a 64-byte prologue */
#define PROLOGUE_SIZE 0x40

/* For each tile (across all hierarchy levels), there is 8 bytes of header */
#define HEADER_BYTES_PER_TILE 0x8

/* Likewise, each tile per level has 512 bytes of body */
#define FULL_BYTES_PER_TILE 0x200

/* If the width-x-height framebuffer is divided into tile_size-x-tile_size
 * tiles, how many tiles are there? Rounding up in each direction. For the
 * special case of tile_size=16, this aligns with the usual Midgard count.
 * tile_size must be a power-of-two.
 * Not really repeat code from AFBC/checksum, because those care about the
 * stride (not just the overall count) and only at a fixed-tile size (not any
 * of a number of power-of-twos) */

static unsigned
pan_tile_count(unsigned width, unsigned height, unsigned tile_width, unsigned tile_height)
{
        unsigned aligned_width = ALIGN_POT(width, tile_width);
        unsigned aligned_height = ALIGN_POT(height, tile_height);

        unsigned tile_count_x = aligned_width / tile_width;
        unsigned tile_count_y = aligned_height / tile_height;

        return tile_count_x * tile_count_y;
}

/* For `masked_count` of the smallest tile sizes masked out, computes the
 * size of the polygon list header. We iterate the tile sizes (16x16 through
 * 2048x2048). For each tile size, we figure out how many tiles there are at
 * this hierarchy level and therefore how many bytes this level is, leaving us
 * with a byte count for each level. We then just sum up the byte counts across
 * the levels to find a byte count for all levels. */

static unsigned
panfrost_hierarchy_size(
                unsigned width,
                unsigned height,
                unsigned mask,
                unsigned bytes_per_tile)
{
        unsigned size = PROLOGUE_SIZE;

        /* Iterate hierarchy levels */

        for (unsigned b = 0; b < (MAX_TILE_SHIFT - MIN_TILE_SHIFT); ++b) {
                /* Check if this level is enabled */
                if (!(mask & (1 << b)))
                        continue;

                /* Shift from a level to a tile size */
                unsigned tile_size = (1 << b) * MIN_TILE_SIZE;

                unsigned tile_count = pan_tile_count(width, height, tile_size, tile_size);
                unsigned level_count = bytes_per_tile * tile_count;

                size += level_count;
        }

        /* This size will be used as an offset, so ensure it's aligned */
        return ALIGN_POT(size, 0x200);
}

/* Implement the formula:
 *
 *      0x200 + bytes_per_tile * ceil(W / w) * ceil(H / h)
 *
 * rounding down the answer to the nearest 0x200. This is used to compute both
 * header and body sizes for GPUs without hierarchical tiling.
 * Essentially, computing a single hierarchy level, since there isn't any
 * hierarchy!
 */

static unsigned
panfrost_flat_size(unsigned width, unsigned height, unsigned dim, unsigned bytes_per_tile)
{
        /* First, extract the tile dimensions */

        unsigned tw = (1 << (dim & 0b111)) * 8;
        unsigned th = (1 << ((dim & (0b111 << 6)) >> 6)) * 8;

        /* tile_count is ceil(W/w) * ceil(H/h) */
        unsigned raw = pan_tile_count(width, height, tw, th) * bytes_per_tile;

        /* Round down and add offset */
        return 0x200 + ((raw / 0x200) * 0x200);
}

/* Given a hierarchy mask and a framebuffer size, compute the header size */

unsigned
panfrost_tiler_header_size(unsigned width, unsigned height, unsigned mask, bool hierarchy)
{
        if (hierarchy)
                return panfrost_hierarchy_size(width, height, mask, HEADER_BYTES_PER_TILE);
        else
                return panfrost_flat_size(width, height, mask, HEADER_BYTES_PER_TILE);
}

/* The combined header/body is sized similarly (but it is significantly
 * larger), except that it can be empty when the tiler is disabled, rather
 * than getting clamped to a minimum size.
 */

unsigned
panfrost_tiler_full_size(unsigned width, unsigned height, unsigned mask, bool hierarchy)
{
        if (hierarchy)
                return panfrost_hierarchy_size(width, height, mask, FULL_BYTES_PER_TILE);
        else
                return panfrost_flat_size(width, height, mask, FULL_BYTES_PER_TILE);
}

/* On GPUs without hierarchical tiling, we choose a tile size directly and
 * stuff it into the field otherwise known as hierarchy mask (not a mask). */

static unsigned
panfrost_choose_tile_size(
                unsigned width, unsigned height, unsigned vertex_count)
{
        /* Figure out the ideal tile size.
         * Eventually a heuristic should be used for this */

        unsigned best_w = 16;
        unsigned best_h = 16;

        /* Clamp so there are less than 64 tiles in each direction */

        best_w = MAX2(best_w, util_next_power_of_two(width / 63));
        best_h = MAX2(best_h, util_next_power_of_two(height / 63));

        /* We have our ideal tile size, so encode */

        unsigned exp_w = util_logbase2(best_w / 16);
        unsigned exp_h = util_logbase2(best_h / 16);

        return exp_w | (exp_h << 6);
}

/* In the future, a heuristic to choose a tiler hierarchy mask would go here.
 * At the moment, we just default to 0xFF, which enables all possible hierarchy
 * levels. Overall this yields good performance but presumably incurs a cost in
 * memory bandwidth / power consumption / etc, at least on smaller scenes that
 * don't really need all the smaller levels enabled */

unsigned
panfrost_choose_hierarchy_mask(
                unsigned width, unsigned height,
                unsigned vertex_count, bool hierarchy)
{
        /* If there is no geometry, we don't bother enabling anything */

        if (!vertex_count)
                return 0x00;

        if (!hierarchy)
                return panfrost_choose_tile_size(width, height, vertex_count);

        /* Otherwise, default everything on. TODO: Proper tests */

        return 0xFF;
}

unsigned
panfrost_tiler_get_polygon_list_size(const struct panfrost_device *dev,
                                     unsigned fb_width, unsigned fb_height,
                                     bool has_draws)
{
        if (pan_is_bifrost(dev))
                return 0;

        if (!has_draws)
                return MALI_MIDGARD_TILER_MINIMUM_HEADER_SIZE + 4;

        bool hierarchy = !(dev->quirks & MIDGARD_NO_HIER_TILING);
        unsigned hierarchy_mask =
                panfrost_choose_hierarchy_mask(fb_width, fb_height, 1, hierarchy);

        return panfrost_tiler_full_size(fb_width, fb_height, hierarchy_mask, hierarchy) +
                panfrost_tiler_header_size(fb_width, fb_height, hierarchy_mask, hierarchy);
}
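
The hierarchical size computation above can be tried outside the Mesa tree with a small standalone sketch. This is an illustration only, not part of the driver. The `sketch_*` names are hypothetical stand-ins for `ALIGN_POT`, `pan_tile_count` and `panfrost_hierarchy_size`. The constants (64-byte prologue, eight levels from 16x16 through 2048x2048, 0x200 alignment) are copied from the file above.

```c
#include <assert.h>

/* Hypothetical stand-in for Mesa's ALIGN_POT: round x up to a multiple of
 * pot, which must be a power of two. */
static unsigned
sketch_align_pot(unsigned x, unsigned pot)
{
        assert(pot != 0 && (pot & (pot - 1)) == 0);
        return (x + pot - 1) & ~(pot - 1);
}

/* Number of square tiles of the given size needed to cover a width x height
 * framebuffer, rounding up in each direction. */
static unsigned
sketch_tile_count(unsigned width, unsigned height, unsigned tile)
{
        return (sketch_align_pot(width, tile) / tile) *
               (sketch_align_pot(height, tile) / tile);
}

/* Sketch of the hierarchical sizing: a 64-byte prologue plus bytes_per_tile
 * for every tile at every enabled level, with the total aligned up to 0x200.
 * Level b uses tiles of size 16 << b, so levels 0..7 span 16x16..2048x2048. */
static unsigned
sketch_hierarchy_size(unsigned width, unsigned height,
                      unsigned mask, unsigned bytes_per_tile)
{
        unsigned size = 0x40;

        for (unsigned b = 0; b < 8; ++b) {
                if (!(mask & (1u << b)))
                        continue;

                size += bytes_per_tile * sketch_tile_count(width, height, 16u << b);
        }

        return sketch_align_pot(size, 0x200);
}
```

For a 1920x1080 framebuffer with every level enabled (mask 0xFF), the sketch counts 10902 tiles across the eight levels. `sketch_hierarchy_size(1920, 1080, 0xFF, 0x8)` then gives a header of 87552 bytes, and the same call with `0x200` bytes per tile gives 5582336 bytes.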