
Re: Benchmark results for new BLAS mode

 


All I'm saying is that the theoretical top speed for the generated
code mode is higher than for the BLAS mode. Do you disagree with
this statement?

The reason is that the generated code mode can potentially use all
the optimizations that BLAS does (blocking, clever cache usage,
Fortran), but BLAS cannot use all the optimizations that the
generated code mode uses (FErari-style optimizations such as
skipping known zeros).
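(To make "skipping known zeros" concrete, a minimal sketch of what
the generated code can do; the tensor entries and names below are
made up, not FFC's actual output. The element tensor is computed as
A[i] = sum_a A0[i][a] * G[a], and since the reference tensor A0 is
known at generation time, terms with A0[i][a] == 0 are simply never
emitted:)

/* Hypothetical generated code for A[i] = sum_a A0[i][a] * G[a].
 * Suppose A0[0][1] == 0 and A0[1][0] == 0: those multiply-adds
 * are never emitted, whereas BLAS would multiply by the stored
 * zeros. */
void tabulate_tensor(double* A, const double* G)
{
  A[0] =  0.5 * G[0];   /* term with A0[0][1] == 0 skipped */
  A[1] = -0.5 * G[1];   /* term with A0[1][0] == 0 skipped */
}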

I see what you mean, and agree somewhat, but I am skeptical. First, the irregularity introduced by FErari optimizations destroys spatial/temporal locality: the reduced flop count comes at a price, because the irregular code may not be able to use the machine as efficiently as the very regular computation patterns in BLAS -- you do fewer flops, but each one is slower. Second, as you indicate below, nobody wants to recreate something like ATLAS or FLAME. Besides, these tools can take a very long time just to generate a single matrix-matrix multiply, and that is for dense matrices with no particular structure (so once you know how to tile one part, you know how to tile all of it).


In the short term though, I don't think anybody is going to want to
spend time on these types of optimizations for the generated code mode
(premature optimization is the root of all evil etc.). But in the long
term someone might consider it worthwhile, especially if this type of
assembly becomes an established tool.

So here's my suggested plan: let's not lock ourselves into BLAS;
let's keep both methods for the foreseeable future. BLAS is a great
tool, and may currently be faster at runtime for certain forms and
elements, but the generated code mode will be faster than BLAS for
all forms and all elements once the BLAS-style optimizations are
included (which can certainly be done).


I completely agree with the following:
i.)   There exist forms for which fine-grained FErari optimizations
      are the way to go. These are the ones with a "short G" -- the
      geometric tensor doesn't have much data in it.

ii.)  There exist forms for which level 3 BLAS is the way to go.
      These sit at the other end of the spectrum (see the sketch
      after this list).

iii.) For vector-valued Lagrange elements, it will (almost) always
      pay off to utilize block structure; even if there are no zero
      blocks, the nonzero blocks can have lots of structure and
      repetition that we should exploit.
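(To make ii.) concrete, here is a minimal sketch, assuming CBLAS and
row-major storage, of how the contraction A[i] = sum_a A0[i][a] * G[a]
maps onto BLAS; the function names are made up, not FFC's actual
interface. One cell is a level 2 matrix-vector product, and stacking
the geometry tensors of many cells column-wise turns assembly into a
single level 3 matrix-matrix multiply:)

#include <cblas.h>

/* Level 2: one cell. A (m) = A0 (m x n, row-major) * G (n). */
void tabulate_tensor_blas(double* A, const double* A0,
                          const double* G, int m, int n)
{
  cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
              1.0, A0, n, G, 1, 0.0, A, 1);
}

/* Level 3: many cells at once. The geometry tensors of ncells
 * cells form the columns of Gbatch (n x ncells), so that
 * Abatch (m x ncells) = A0 * Gbatch is one dgemm call. */
void tabulate_batch_blas(double* Abatch, const double* A0,
                         const double* Gbatch,
                         int m, int n, int ncells)
{
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, ncells, n, 1.0, A0, n,
              Gbatch, ncells, 0.0, Abatch, ncells);
}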


On a side note, there can be a significant difference in top speed
between code produced by Fortran and C compilers. We had a quite
trivial example in a course, used to demonstrate why Fortran was so
popular for computing, and the Fortran version turned out to be
more than 4 times faster than the C version, with full optimization
enabled for both compilers (Sun compilers on an UltraSPARC). This
was due to aliasing: a Fortran compiler may assume that all
procedure arguments are distinct in memory, while a C compiler
cannot assume this about pointers.


I'm not sure how much this carries over to other compilers, but
aliasing can be a problem. On the other hand, since we're generating
code and know what n is up front, we could generate a very specific
struct with n fields rather than use an array. This may explain
Anders' observation that forming G slowed down when he started
putting it in an array. Generating structs is not incompatible with
BLAS -- you pass in a pointer to the head of the struct, cast to a
double pointer.
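(A sketch of that idea, picking n = 4 out of thin air; whether named
fields actually beat an array will depend on the compiler, so treat
this as an illustration only:)

/* Hypothetical generated code: G is known to have exactly four
 * entries, so emit named scalar fields instead of an array. */
typedef struct
{
  double g0, g1, g2, g3;
} geometry_tensor;

void example(void)
{
  geometry_tensor G = { 1.0, 2.0, 3.0, 4.0 };

  /* Pass to BLAS as a plain array: a pointer to the head of the
   * struct, cast to double*. This works in practice because the
   * fields are all doubles and laid out contiguously, although
   * the C standard does not strictly forbid padding here. */
  const double* Gp = (const double*) &G;
  (void) Gp;   /* e.g. cblas_dgemv(..., Gp, ...) */
}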

Also, most Fortran compilers don't prevent you from aliasing -- you
can pass the same pointer for two arguments, and the compiler won't
complain. It just optimizes under the assumption that no aliasing
occurs, which is unsafe if it does, so we have to be careful in the
generated code (not that you don't have to be careful in C).

Here's the example in C:

void horner(double* px, double* x, double* coeff, int n)
{
  int           j;
  double        xj;

  for(j = 0; j < n; j++)
    {
      xj = x[j];
      px[j] = coeff[0] + xj * (coeff[1] + xj * (coeff[2]
              + xj * (coeff[3] + xj * coeff[4])));
    }
}

and Fortran 90:

subroutine horner( px, x, coeff, n )
  implicit none
  integer           n, j
  double precision  px(n), x(n), coeff(5), xj

  do j = 1, n
     xj = x(j)
     px(j) = coeff(1) + xj * (coeff(2) + xj * (coeff(3) + xj * (coeff(4) + xj * coeff(5))))
  end do

end subroutine horner
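
(For what it's worth, C99's restrict qualifier lets a C compiler make
the same no-aliasing assumption, at the caller's risk: just as with
the Fortran version, a call like horner(x, x, coeff, n) then has
undefined behavior. A restrict-qualified variant of the C routine
above:)

void horner(double* restrict px, const double* restrict x,
            const double* restrict coeff, int n)
{
  int           j;
  double        xj;

  for(j = 0; j < n; j++)
    {
      xj = x[j];
      px[j] = coeff[0] + xj * (coeff[1] + xj * (coeff[2]
              + xj * (coeff[3] + xj * coeff[4])));
    }
}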

I don't think FEniCS should focus on these types of issues, though;
it should be enough to compare abstractions of the methods. We
shouldn't all have to become Fortran or assembler hackers to prove
our points (I certainly am not one, and have no ambition of becoming
one).

  Johan



