I'm currently porting DirectQ's new surface refresh code to RMQ; in the end I decided that it was the right thing to do. While - as I've said - OpenGL performance has never been as good as D3D in my tests (and DirectQ pulls off a few extra tricks - like using hardware instancing - that aren't practical for RMQ), overall RMQ performance is now starting to pull close.
There are a few scenes in certain maps that I use as "test scenes" for this kind of thing. These involve huge polycounts and insane complexity, and they stress various parts of the engine quite heavily. In these I'm measuring RMQ at about 85% of what DirectQ is now getting (which makes RMQ actually faster than the current release of DirectQ).
By way of comparison, prior to this DirectQ could go over twice as fast as RMQ in some of these scenes - and that was with DirectQ at 1024x768 against the 800x600 I used to use for RMQ (I now use the same resolution for both).
In ID1 maps and timedemos they're now neck and neck; sometimes DirectQ goes faster, sometimes RMQ does. Previously RMQ performance was more like 75% or so of DirectQ.
I'm also experimenting somewhat with DirectQ's code as I go. The new setup makes it incredibly easy to plug in alternative surface refreshes (it's just one new function) and try out ideas to see what happens without disrupting much of anything else.
Currently I have 5 different refresh methods implemented. These give a good range of options that I can test to see which is the best for all cases (or which two are the best for software T&L and hardware T&L, even).
These are panning out something like:
Unbatched - one individual draw call per surface, using a static vertex buffer. This is nearly always the slowest, but on D3D10-class hardware it can almost match or sometimes even exceed the others. On some drivers big scenes may drop you to single-digit framerates. Because it's so variable I don't think I'm going to be including this in a release.
Batched - static vertex buffer plus dynamic index buffer. On hardware T&L cards this is currently the fastest; with software T&L cards - because the driver has to do quite a lot of jumping around in the vertex buffer - performance can drop off a good bit.
Batched - dynamic vertex buffer plus dynamic index buffer. This might be a good choice for software T&L - there's no vertex buffer jumping involved and performance is good overall.
Batched - triangle strip with degenerate triangles plus dynamic vertex buffer. This one is curious; performance seems a lot higher than I had suspected it might be (I'd thought the extra vertex overhead would pull it down) and it might be comparable to the above for some hardware.
Batched - indexed triangle strip with degenerate triangles plus dynamic vertex buffer. As above but with degenerate triangles added via indexes instead of via vertexes; this one currently seems the fastest for software T&L cards.
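To make the difference between the index-based batched modes a bit more concrete, here's a rough sketch of how the indexes for a batch might be generated. This is not DirectQ's actual code: the Surface struct and both function names are made up for illustration, and it assumes each surface is a convex polygon whose vertexes sit contiguously in the vertex buffer and that 16-bit indexes are enough.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical surface record (illustrative names, not DirectQ's).
struct Surface {
    uint16_t firstVertex; // offset of the polygon's first vertex in the buffer
    uint16_t numVerts;    // vertex count, must be at least 3
};

// Triangle-list indexes: fan each polygon out from its first vertex.
void AppendListIndexes(std::vector<uint16_t>& out, const Surface& s) {
    assert(s.numVerts >= 3);
    for (unsigned i = 2; i < s.numVerts; ++i) {
        out.push_back(s.firstVertex);
        out.push_back(static_cast<uint16_t>(s.firstVertex + i - 1));
        out.push_back(static_cast<uint16_t>(s.firstVertex + i));
    }
}

// Triangle-strip indexes, with degenerate triangles joining one polygon to
// the next so that the whole batch can go out in a single draw call.
void AppendStripIndexes(std::vector<uint16_t>& out, const Surface& s) {
    assert(s.numVerts >= 3);
    if (!out.empty()) {
        out.push_back(out.back());     // repeat the previous strip's last index
        out.push_back(s.firstVertex);  // repeat this strip's first index
        // A strip flips winding on every other triangle, so pad with one more
        // repeat if needed to start the new polygon on an even position.
        if (out.size() % 2 != 0)
            out.push_back(s.firstVertex);
    }
    // Zigzag across the polygon: 0, 1, n-1, 2, n-2, ...
    unsigned lo = 1, hi = s.numVerts - 1u;
    out.push_back(s.firstVertex);
    while (lo <= hi) {
        out.push_back(static_cast<uint16_t>(s.firstVertex + lo++));
        if (lo <= hi)
            out.push_back(static_cast<uint16_t>(s.firstVertex + hi--));
    }
}
```

With these, a quad at vertex 0 comes out of the list version as 0,1,2,0,2,3, and a triangle at vertex 0 followed by a quad at vertex 3 comes out of the strip version as 0,1,2,2,3,3,3,4,6,5. The repeated indexes in the middle form zero-area triangles that the hardware throws away. The non-indexed strip mode would need the same repeats, only as duplicated vertexes rather than duplicated indexes.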
I'm going to be able to allow a user-selectable mode here, so that you can tune it for your own hardware; there are slight complexities in that the vertex buffer needs to be set up slightly differently for software T&L and hardware T&L, so not all modes will be available for each (the first two are for hardware, the rest for software).
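A minimal sketch of how that selection might work, assuming a simple table of modes and a fallback when the requested mode doesn't suit the device. All of the names here are hypothetical, and the fallback choices are just the modes that currently test fastest for each kind of hardware; the real thing may well end up different.

```cpp
// Hypothetical mode list, in the same order as the five methods above.
enum RefreshMode {
    REFRESH_UNBATCHED,            // static VB, one draw call per surface
    REFRESH_STATIC_VB_DYNAMIC_IB, // static VB plus dynamic IB
    REFRESH_DYNAMIC_VB_IB,        // dynamic VB plus dynamic IB
    REFRESH_STRIP,                // strip, degenerates added as vertexes
    REFRESH_INDEXED_STRIP,        // strip, degenerates added as indexes
    REFRESH_NUM_MODES
};

struct RefreshModeInfo {
    const char* name;
    bool needsHardwareTnL; // true if the mode's vertex buffer setup is for hardware T&L
};

// The first two modes are for hardware T&L, the rest for software T&L.
static const RefreshModeInfo kRefreshModes[REFRESH_NUM_MODES] = {
    { "unbatched",            true  },
    { "static vb/dynamic ib", true  },
    { "dynamic vb/ib",        false },
    { "strip",                false },
    { "indexed strip",        false },
};

// A mode is only offered if it matches the device's T&L type.
bool IsModeAvailable(int mode, bool hasHardwareTnL) {
    if (mode < 0 || mode >= REFRESH_NUM_MODES)
        return false;
    return kRefreshModes[mode].needsHardwareTnL == hasHardwareTnL;
}

// Honour the user's request where possible; otherwise fall back to an
// assumed default for the device type.
RefreshMode ChooseMode(int requested, bool hasHardwareTnL) {
    if (IsModeAvailable(requested, hasHardwareTnL))
        return static_cast<RefreshMode>(requested);
    return hasHardwareTnL ? REFRESH_STATIC_VB_DYNAMIC_IB : REFRESH_INDEXED_STRIP;
}
```

So asking for the plain strip mode on a hardware T&L card would quietly give you the static vertex buffer mode instead, and an out-of-range value just picks the default for whatever the device is.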
Why all the fussing over software T&L? Surely this is 2011? Quite simple - a large part of DirectQ's original target hardware (integrated Intels) is software T&L only. I still want to run on these (I even test on one quite regularly) so it's important to have a good setup for them.