ffc team mailing list archive

Re: Benchmark results for new BLAS mode

To: "Robert C. Kirby" <kirby@xxxxxxxxxxxx>
From: Johan Jansson <johanjan@xxxxxxxxxxxxxxxx>
Date: Tue, 11 Oct 2005 12:00:10 +0200
Cc: Discussion of FFC development <ffc-dev@xxxxxxxxxx>
In-reply-to: <50CB7A7A-169E-44C4-8534-A1F4AC4F64EB@uchicago.edu>
User-agent: Mutt/1.5.6+20040523i

On Mon, Oct 10, 2005 at 04:54:04PM -0500, Robert C. Kirby wrote:

...

> The best way to go is
> i.) Figure out block structure first.
> ii.) See whether FErari or level 3 wins after you do the coarse-level
> block structure.  This will depend on the form, the polynomial
> degree, how well FErari does, etc.
> 
> 
> >could generate Fortran code. It's probably also a significant benefit
> >to generate code for the mappings as well. In the long term, I don't
> >see how you're ever going to be able to beat code generation for
> >runtime speed.
> >
> And the build system gets even more complicated :)
> Actually, good C code shouldn't lose by a factor of 2-3.  More like
> 5-10% at worst.
> 
> 
> Rob Kirby
> 
> "Mathematical software should be mathematical."
> 

All I'm saying is that the theoretical top speed for the generated
code mode is faster than for the BLAS mode. Do you disagree with this
statement?

This is because the generated code mode can potentially use all the
optimizations that BLAS does (blocking or clever cache usage, Fortran),
but BLAS cannot use all the optimizations available to the generated
code mode (FErari-style optimizations such as skipping known zeros).
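
To make the "skipping known zeros" point concrete, here is a minimal
sketch in C. The 3x3 matrix A0, its values, and the function names
contract_dense and contract_generated are all made up for
illustration; this is not actual FFC or FErari output. The first
function is what a generic BLAS-style dense product has to do, the
second is what a code generator could emit when it knows the entries
in advance.

```c
/* Hypothetical 3x3 reference tensor with known zeros
   (illustrative values only, not taken from any real form). */
static const double A0[3][3] = {
  { 1.0, 0.0, -1.0},
  { 0.0, 2.0,  0.0},
  {-1.0, 0.0,  1.0}
};

/* BLAS-style: generic dense matrix-vector product.
   It multiplies by every entry, including the zeros. */
void contract_dense(double out[3], const double g[3])
{
  int i, j;

  for (i = 0; i < 3; i++)
    {
      out[i] = 0.0;
      for (j = 0; j < 3; j++)
        out[i] += A0[i][j] * g[j];
    }
}

/* Generated-code style: the entries of A0 are known when the code
   is generated, so zero terms are skipped and multiplications by
   +1/-1 become plain additions and subtractions. */
void contract_generated(double out[3], const double g[3])
{
  out[0] = g[0] - g[2];
  out[1] = 2.0 * g[1];
  out[2] = g[2] - g[0];
}
```

Both functions compute the same result; the generated version uses
one multiplication where the dense loop uses nine. How much that
matters in practice depends on the form and the element.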

In the short term though, I don't think anybody is going to want to
spend time on these types of optimizations for the generated code mode
(premature optimization is the root of all evil etc.). But in the long
term someone might consider it worthwhile, especially if this type of
assembly becomes an established tool.

So here's my suggested plan: let's not lock ourselves into BLAS, and
keep both methods for the foreseeable future. BLAS is a great tool, and
may be faster right now in runtime for certain forms and elements, but
the generated code mode should be at least as fast as BLAS for all
forms and all elements once the BLAS-style optimizations are included
(which can certainly be done).

On a side note, there can be a significant difference in top speed
performance between code produced by Fortran and C compilers. We had a
quite trivial example in a course to demonstrate why Fortran was so
popular for computing, and Fortran turned out to be more than 4 times
faster than C, given full optimization for both the Fortran and C
compilers (Sun compilers on an UltraSPARC). This was due to aliasing:
a Fortran compiler can assume that the array arguments do not overlap
in memory, while a C compiler has to assume that pointer arguments
might.

Here's the example in C:

void horner(double* px, double* x, double* coeff, int n)
{
  int           j;
  double        xj;

  for(j = 0; j < n; j++)
    {
      xj = x[j];
      px[j] = coeff[0] + xj * (coeff[1] + xj * (coeff[2] + xj * (coeff[3] + xj * coeff[4])));
    }
}

and Fortran 90:

subroutine horner( px, x, coeff, n )
  implicit none
  integer           n, j
  double precision  px(n), x(n), coeff(5), xj

  do j = 1, n
     xj = x(j)
     px(j) = coeff(1) + xj * (coeff(2) + xj * (coeff(3) + xj * (coeff(4) + xj * coeff(5))))
  end do

end subroutine horner
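
For comparison, C99 added the "restrict" qualifier, which lets the
programmer make the same no-overlap promise to the C compiler that
Fortran makes implicitly. The following is only a sketch: the name
horner_restrict is made up here, and no timings were taken with it,
so whether it closes the gap on a given compiler is an open question.

```c
/* Same kernel as horner() above. The "restrict" qualifiers promise
   the compiler that px, x and coeff do not overlap in memory, so
   coeff[0..4] need not be reloaded after each store to px[j].
   Requires C99 or later; horner_restrict is an illustrative name. */
void horner_restrict(double* restrict px, const double* restrict x,
                     const double* restrict coeff, int n)
{
  int           j;
  double        xj;

  for(j = 0; j < n; j++)
    {
      xj = x[j];
      px[j] = coeff[0] + xj * (coeff[1] + xj * (coeff[2] + xj * (coeff[3] + xj * coeff[4])));
    }
}
```

Calling this with overlapping arrays is undefined behaviour, which is
exactly the restriction Fortran places on its array arguments.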

I don't think FEniCS should focus on these types of issues, though. It
should be enough to compare abstractions of methods; we shouldn't all
have to become Fortran or assembler hackers to prove our points (I
certainly am not, and have no ambition of becoming one).

  Johan

Follow ups

Re: Benchmark results for new BLAS mode
From: Robert C Kirby, 2005-10-11

References

Benchmark results for new BLAS mode
From: Anders Logg, 2005-10-10
Re: Benchmark results for new BLAS mode
From: Johan Jansson, 2005-10-10
Re: Benchmark results for new BLAS mode
From: Robert C. Kirby, 2005-10-10