I spent last week at XDS 2016 with other graphics stack developers.
I gave a talk on X Server testing and our development model. Having played around in the Servo project (and a few other GitHub-centric communities) recently, I've become increasingly convinced that GitHub + CI + automatic merges is the right way to do software development now. I got a warmer reception from the other X Server developers than I expected, and I hope to keep working on building out this infrastructure.
I also spent a while with keithp and ajax (other core X developers) talking about how to fix our rendering performance. We've got two big things hurting vc4 right now: Excessive flushing, and lack of bounds on rendering operations.
The excessive flushing one is pretty easy. X currently does some of its drawing directly to the front buffer (I think this is bad and we should stop, but that *is* what we're doing today). With glamor, we do our rendering with GL, so the driver accumulates a command stream over time. In order for drawing to the front buffer to show up on time, both for running animations and for that last character you typed before rendering went idle, we have to periodically flush GL's command stream so that any rendering to the front buffer becomes visible.
Right now we just glFlush() every time we might block sending things back to a client, which is really frequent. For a tiling architecture like vc4, each glFlush() is quite possibly a load and store of the entire screen (8MB).
The solution I started on back in August is to rate-limit how often we flush. When the server wakes up (possibly to render), I arm a timer; when it fires, I flush and disarm it until the next wakeup. I set the timer to 5 ms as an arbitrary value that was clearly not pretending to be vsync. The state machine was a little tricky and seemed to have a bug (or at least a bad interaction with x11perf; it was unclear which). But ajax pointed out that I could easily do better: glamor has a function that gets called right before it issues any GL operation, so we could arm the timer there rather than arming it every time the server wakes up to process input.
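As a rough sketch of that state machine (the names and the explicit-clock structure here are invented for illustration; the real server drives this from its own timer API and glamor's pre-GL-operation hook):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rate-limited-flush state machine.  Arm a timer the first
 * time a GL operation happens; when it fires, flush once and stay
 * disarmed until the next GL operation. */
#define FLUSH_INTERVAL_MS 5

struct flush_state {
    bool timer_armed;
    uint64_t timer_deadline_ms;
    unsigned flushes;           /* counts the glFlush() calls we'd issue */
};

/* Called right before glamor issues any GL operation: arm the timer if it
 * isn't already running, instead of arming on every server wakeup. */
static void gl_op_about_to_happen(struct flush_state *fs, uint64_t now_ms)
{
    if (!fs->timer_armed) {
        fs->timer_armed = true;
        fs->timer_deadline_ms = now_ms + FLUSH_INTERVAL_MS;
    }
}

/* Called from the server's main loop: if the timer fired, flush once and
 * disarm until the next GL operation re-arms it. */
static void maybe_flush(struct flush_state *fs, uint64_t now_ms)
{
    if (fs->timer_armed && now_ms >= fs->timer_deadline_ms) {
        fs->flushes++;          /* stands in for glFlush() */
        fs->timer_armed = false;
    }
}
```

With continuous rendering this caps the flush rate at one per 5 ms, and once rendering goes idle the armed timer still fires one last flush so the final drawing reaches the screen.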
Bounding our rendering operations is a bit trickier. We started doing some of this with keithp's glamor rework: we glScissor() all of our rendering to the pCompositeClip (which is usually the bounds of the current window). For front buffer rendering in X, this is a big win on vc4 because it means that when you update an uncomposited window, we only load and store the area covered by the window, not the entire screen. That's a big bandwidth savings.
However, the pCompositeClip isn't enough. As I'm typing, we should be bounding the GL rendering around each character (or span of characters), so that we load and store only a tile or two rather than the entire window. It turns out, though, that the server has probably already computed the bounds of the operation for the Damage extension, because you have a compositor running! Wouldn't it be nice if we could just reuse that damage computation and reference it?
Keith has started making this possible: first with the idea of const-ifying all the op arguments so that we could reuse their pointers as a cheap cache key, and now with the idea of passing along a possibly-initialized rectangle containing the bounds of the operation. If you do compute the bounds, you pass them down to anything you end up calling.
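A minimal sketch of how such a bounds rectangle might be accumulated and then clipped before scissoring (the type and helper names are invented for illustration; the actual rework may look quite different):

```c
#include <stdbool.h>

/* Hypothetical operation-bounds rectangle, half-open:
 * x1 <= x < x2, y1 <= y < y2.  An empty rect means "not yet computed". */
typedef struct {
    int x1, y1, x2, y2;
} op_bounds;

static bool bounds_valid(const op_bounds *b)
{
    return b->x1 < b->x2 && b->y1 < b->y2;
}

/* Accumulate one drawn rectangle (say, a glyph box) into the bounds. */
static void bounds_union_rect(op_bounds *b, int x1, int y1, int x2, int y2)
{
    if (!bounds_valid(b)) {
        b->x1 = x1; b->y1 = y1; b->x2 = x2; b->y2 = y2;
        return;
    }
    if (x1 < b->x1) b->x1 = x1;
    if (y1 < b->y1) b->y1 = y1;
    if (x2 > b->x2) b->x2 = x2;
    if (y2 > b->y2) b->y2 = y2;
}

/* Intersect the accumulated bounds with the composite clip before
 * scissoring, so only the tiles actually touched get loaded and stored. */
static op_bounds bounds_intersect(op_bounds b, op_bounds clip)
{
    if (b.x1 < clip.x1) b.x1 = clip.x1;
    if (b.y1 < clip.y1) b.y1 = clip.y1;
    if (b.x2 > clip.x2) b.x2 = clip.x2;
    if (b.y2 > clip.y2) b.y2 = clip.y2;
    return b;
}
```

The point of passing the rectangle down rather than recomputing it is that whoever already walked the glyphs (or spans) computes the bounds once, and every layer below just intersects and scissors.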
Between these two, we should get much improved performance on general desktop apps on Raspberry Pi.
Other updates: landed an opt_peephole_sel improvement for glmark2 performance (and probably mupen64plus), added regression testing of glamor to the X Server, and fixed a couple of glamor rendering bugs.