I would pull the transpose and store out of the macro, and s/movq/mova/g, so that the macro can be instantiated for sse2. The commit message should specify the speedup for this individual function, as well as for the codec as a whole. --Loren Merritt