increase performance in double SparseQR

Developer
Aug 21, 2014 at 9:25 AM
Hi,
I've encountered something cool.
I've changed the SparseQR.ApplyHouseholder code to this:
private bool ApplyHouseholder(CompressedColumnStorage<double> V, int i, double beta, double[] x)
{
    int p;
    double tau = 0;

    if (x == null) return false; // check inputs

    var vp = V.ColumnPointers;
    var vi = V.RowIndices;
    var vx = V.Values;
    var vpi1 = vp[i + 1];
    var vpi = vp[i];

    if (vpi1 - vpi < 5)
    {
        for (p = vpi; p < vpi1; p++) // tau = v'*x
            tau += vx[p]*x[vi[p]];
        tau *= beta; // tau = beta*(v'*x)
        for (p = vpi; p < vpi1; p++) // x = x - v*tau
            x[vi[p]] -= vx[p]*tau;

    }

    else
    {
        for (p = vpi; p < vpi1; p++) // tau = v'*x
            tau += vx[p]*x[vi[p]];
        tau *= beta; // tau = beta*(v'*x)
        for (p = vpi; p < vpi1; p++) // x = x - v*tau
            x[vi[p]] -= vx[p]*tau;
    }

    return true;
}
and got about 25% increase in performance for a sample random matrix (in my case 'bcsstk18.mtx').
In comparison with original code I've used two additional variables vpi1 and vpi, and added a if else witch codes on both blocks are identical!
strange thing is if you remove the if-else block, will get the original performance!!!
Can you check it on your computer too? I'm not sure its working this way just on my computer...!

cant do it right now, but should check the generated IL code and compare two versions.
Coordinator
Aug 21, 2014 at 11:04 AM
Yes, I can confirm this: 5406ms vs. 4510ms, ~20% speedup:

It's not about the if loop. The following code will work as well:
int p;
double tau = 0;

if (x == null) return false; // check inputs

var vp = V.ColumnPointers;
var vi = V.RowIndices;
var vx = V.Values;

int vpi1 = vp[i + 1];
int vpi = vp[i];

p = vpi1 - vpi;

for (p = vpi; p < vpi1; p++) // tau = v'*x
{
    tau += vx[p] * x[vi[p]];
}
tau *= beta; // tau = beta*(v'*x)
for (p = vpi; p < vpi1; p++) // x = x - v*tau
{
    x[vi[p]] -= vx[p] * tau;
}

return true;
Referencing the two variables vpi and vpi1 before the first loop might impact where they will be located in memory/processor cache. But that's just my guess: welcome to the weirdness of C# JIT compiler. Looking at the IL won't help here...
Marked as answer by epsi1on on 8/21/2014 at 6:47 AM
Developer
Aug 21, 2014 at 11:29 AM
Hi,
Its very strange, this is the result on my computer for bcsstk18.mtx release mode and running with Ctrl+F5 (no debugger attached)

original code (without vpi and vpi1 variables)~ 4000 ms
modified code (using vpi and vpi1 variables)~ 4000 ms !
modified code (using if statement and vpi and vpi1 variables)~ 3200 ms !!!!

more ridiculous thing is that the if condition do not matters!
I'm getting 3200 ms with this code too!:
         ...
         if (vpi1 - vpi < -100000)
         ...
can you please check it on your computer?
Coordinator
Aug 21, 2014 at 12:27 PM
Well, as I said: it's not about the if statement.

What about this version? Still works for me:
private bool ApplyHouseholder(CompressedColumnStorage<double> V, int i, double beta, double[] x)
{
    int p;
    double tau = 0;

    if (x == null) return false; // check inputs

    var vp = V.ColumnPointers;
    var vi = V.RowIndices;
    var vx = V.Values;

    int vpi1 = vp[i + 1];

    p = vpi1; // Don't remove (optimization).

    for (p = vp[i]; p < vpi1; p++) // tau = v'*x
    {
        tau += vx[p] * x[vi[p]];
    }
    tau *= beta; // tau = beta*(v'*x)
    for (p = vp[i]; p < vpi1; p++) // x = x - v*tau
    {
        x[vi[p]] -= vx[p] * tau;
    }

    return true;
}
Developer
Aug 21, 2014 at 12:34 PM
Yes, working well (3200 ms).
Nice information, thanks..
Developer
Aug 21, 2014 at 12:46 PM
initializing p with a value is working same (changing first line to 'int p = 0;')!
Developer
Aug 22, 2014 at 12:08 PM
Hi,
Do you know how much is difference of speed between CSparse.NET and CXSparse appropriated module (for example cholesky)
I'm not familiar with C++, so cant build the benchmark application for CXSparse myself and check its performance.

Regards
Coordinator
Aug 22, 2014 at 5:41 PM
Edited Aug 22, 2014 at 5:44 PM
Yes, I did some benchmarks before. The managed code performs pretty well:
Matrix crystm03: 24696 x 24696, 583770 non-zeros

 Native Cholesky: 3464ms, residual = 3,976e-26, result = 1
Managed Cholesky: 4584ms, residual = 3,976e-26

 Native LU:  8464ms, residual = 4,377e-26, result = 1
Managed LU: 10891ms, residual = 4,377e-26

 Native QR: 13310ms, residual = 1,542e-25, result = 1
Managed QR: 15907ms, residual = 1,542e-25
In case you are interested, here's the Visual Studio project: CXSparse.Benchmark.rar
Developer
Aug 22, 2014 at 6:10 PM
Thanks.
In your benchmark was native files compiler optimized? (I mean is it best performance of the native files)
Coordinator
Aug 22, 2014 at 6:56 PM
I used the defaults, which is Maximize Speed (/O2) for release mode.
Developer
Aug 23, 2014 at 6:02 AM
Edited Aug 23, 2014 at 11:01 AM
Wow, great result.
It was good if other ordering methods (like metis nested dissection) included in package. was noted is some references (like this) that it can have significant effects. also possible to make it parallel,

Thanks for information anyways...
Coordinator
Sep 29, 2014 at 4:41 PM
I played around with metis' reordering routines but haven't seen any significant improvement over AMD.

There's a new constructor in the Cholesky class which takes a given permutation as input, so you can now use custom permutations (metis nd, colamd, whatever...)
Developer
Sep 29, 2014 at 4:43 PM
Do you have any plan for making the library parallel? (specificly the cholesky decomposition)