Wednesday, May 27, 2009

VirtualDub

What is VirtualDub?

VirtualDub is a video capture/processing utility for 32-bit Windows platforms (95/98/ME/NT4/2000/XP), licensed under the GNU General Public License (GPL). It lacks the editing power of a general-purpose editor such as Adobe Premiere, but is streamlined for fast linear operations over video. It has batch-processing capabilities for processing large numbers of files and can be extended with third-party video filters. VirtualDub is mainly geared toward processing AVI files, although it can read (not write) MPEG-1 and also handle sets of BMP images.

I basically started VirtualDub in college to do some quick capture-and-encoding that I wanted done; from there it's grown into a more general utility that can trim and clean up video before exporting to tape or processing with another program. I released it on the web, others found it useful, and I've been tinkering with its code ever since. If you have the time, please download and enjoy.

§ GPU acceleration of video processing

I've gotten to a stable enough point that I feel comfortable in revealing what I've been working on lately, which is GPU acceleration for video filters in VirtualDub. This is something I've been wanting to try for a while. I hacked a video filter to do it a while back, but it had the severe problems of (a) only supporting RGB32, and (b) being forced to upload and download immediately around each instance. The work I've been doing in the past year to support YCbCr processing and to decouple the video filters from each other cleaned up the filter system enough that I could actually put in GPU acceleration without significantly increasing the entropy of the code base.

There are two problems with the current implementation.

The first problem is the API that it uses, which is Direct3D 9. I chose Direct3D 9 as the baseline API for several reasons:

  • It's the API I'm most familiar with, by far.
  • The debug runtime is much more thorough than what I've had available with other APIs.
  • PIX and NVPerfHUD are free.
  • It runs on just about any modern video card.
  • Shaders have well-defined profiles, are portable between graphics card vendors, and use standardized byte code.

On top of this sit a 3D portability layer and, above that, the filter acceleration layer (VDXA). The API for the low-level layer is designed so that it could be retargeted to Direct3D 9Ex, D3D10, or OpenGL; the VDXA layer is much more restricted in feature set, but adds easier-to-use 2D abstractions on top. The filter system, in turn, has been extended so that it inserts filters as necessary to upload or download frames from the accelerator, and can initiate RGB<->YUV conversions on the graphics device. So far, so good...

...except for getting data back off the video card.

There are only two ways to download non-trivial quantities of data from the video card in Direct3D 9: (1) GetRenderTargetData() and (2) lock-and-copy. In terms of speed, they are slow and pathetically slow, respectively. GetRenderTargetData() is by far the preferred method nowadays, as it is decently optimized and can copy down 500MB/sec+ on any decent graphics card. The problem is that it makes it impossible to keep the CPU and GPU running smoothly in parallel, because it blocks the CPU until the GPU completes all outstanding commands. The result is that you spend far more time blocking on the GPU than actually doing the download, and your effective throughput drops. The popular suggestion is to double-buffer render-target/readback surface pairs, but as far as I can tell this doesn't help, because you'll still stall on any new commands that are issued, even if they go to a different render target. This means the only way to keep the GPU busy is to sit on it with the CPU until it goes idle, issue a single readback, and then immediately issue more commands. That sucks, and to circumvent it I'm going to have to implement another back end to see whether another platform API is faster at readbacks.

The other problem is that even after loading up enough filters to ensure that readback and scheduling are not the bottlenecks, I still can't get the GPU to actually beat the CPU.

I currently have six filters accelerated: invert, deinterlace (yadif), resize, blur, blur more, and warp sharp. At full load, five out of the six are faster on the CPU by about 20-30%, and I cheated on warp sharp by implementing bilinear sampling on the GPU instead of bicubic. Part of the reason is that the CPU has less of a disadvantage on these algorithms: when dealing with 8-bit data using SSE2, it has 2-4x the bandwidth it gets with 32-bit float data, since the narrower data types pack 2-4x more parallelism into 128-bit registers. The GPU's texture cache also isn't as advantageous when the algorithm simply walks regularly over the source buffers. Finally, the systems I have for testing are a bit lopsided in terms of GPU vs. CPU power. For instance, take the back-of-the-envelope calculations for the secondary system:

  • GPU (GeForce 6800): 2600Mpix/sec * 4 components/vector = 10.4 billion operations/sec
  • CPU (Pentium M): 1.86GHz * 8 components / vector / clock = 14.9 billion operations/sec

It's even worse for my primary system (which I've already frequently complained about):

  • GPU (Quadro NVS 140M): 3200Mpix/sec * 4 components / vector = 12.8 billion operations/sec
  • CPU (Core 2): 2.5GHz * 16 components / vector / clock = 40 billion operations/sec (single core)

There are, of course, a ton of caveats in these numbers, such as memory bandwidth and the relationship between theoretical peak ops and pixel throughput. The Quadro, for instance, is only about half as fast as the GeForce in real-world benchmarks. Still, it's plausible that the CPU isn't at a disadvantage here, particularly when you consider the extra overhead in uploading and downloading frames and that some fraction of the GPU power is already used for display. I need to try a faster video card, but I don't really need one for anything else, and more importantly, I no longer have a working desktop. But then again, I could also get a faster CPU... or more cores.

The lesson here appears to be that it isn't necessarily a given that the GPU will beat the CPU, even if you're doing something that seems GPU-friendly like image processing, and particularly if you're on a laptop where the GPUs tend to be a bit underpowered. That probably explains why we haven't seen a huge explosion of GPU-accelerated apps yet, although they do exist and are increasing in number.


§ Template madness

Templates are a feature in C++ where you can create functions and class types that are parameterized on other types and values. An example is the min() function. Without templates, your choices in C++ would be a macro, which has problems with side effects; a single function, which locks you down to a single type; or multiple overloads, which drive you nuts. Templates allow you to declare a min() that works with any type that has a less-than predicate without having to write all of the variants explicitly.

The problem with templates is that they're (a) awful to use and (b) very powerful. The C++ template syntax is horrible, with all sorts of notorious problems with angle brackets and typename and other issues, and anyone who used VC6 still shudders at the mention of STL errors. The thing is, I still like templates, because they're compile-time and extremely versatile. The last time I had to use generics in C#, I ran into so many limitations that I really wished I'd had templates instead. C# generics are a mixture of both compile-time and run-time instantiation, which means they're more constrained. In particular, the inability to use constructors with parameters or to cast to the generic type is crippling, especially if you're working with enums. In C++, you can pretty much do anything with T that you could with an explicit type.

Function template instantiation is one of the areas that I have the most problem with. The idea is simple: you specify the function you want, and the compiler finds the best template to fit. In reality, you're in a role-playing game where you wish for the function you want and the compiler GM does whatever it can to do precisely what you ask and still screw you, like implicitly convert a double through type bool. I got burned by this tonight when I tried to port VirtualDub to Visual Studio 2010 beta 1. I had expected this to be quick since everything just worked in the CTP, but with beta 1 it took hours due to several nasty bugs in the project system. The first time I was able to run the program it asserted before it even opened the main window. The problem was in this code:

int nItems = std::min<int>(mMaxCount, s.length());

mMaxCount was 4, s.length() was 2, and I ended up with min(4, 2) == 4. WTF?

First, I should note the reason for the explicit call. I often end up with situations where I need to do a min() or a max() against mixed signed and unsigned types, and usually I know that the value ranges are such that it's OK to force to one type, such as if I've already done some clamping. To do this, I force the template type. Well, it turns out that writing min<int>() doesn't do what I had expected. It doesn't force a call to the version of min() with one template parameter of type int -- it forces a call to any min() overload whose first template parameter is int. This used to be OK because std::min() only had one overload that took two parameters, so no other template could match. However, VS2010 beta 1 adds this evil overload:

template<class T, class Pred>
inline const T& min(const T&, Pred);

Why you would ever want a min() that takes a single value and a predicate is beyond me. However, since I was calling min<int>() with an int and an unsigned int, the compiler decided that min<int, unsigned>(int, unsigned) was a better match than min<int>(int, int). The odd result is that the 2 got turned into an ignored predicate and min(4, 2) == 4. Joy. I hacked the build into working by writing my own min() and max() and doing a massive Replace In Files.

I love templates, but I wish they didn't have so many gotchas.

