fenics team mailing list archive

Re: CMake 2.8.11: ExternalData

To: Florian Rathgeber <florian.rathgeber@xxxxxxxxx>
From: Anders Logg <logg@xxxxxxxxx>
Date: Mon, 15 Apr 2013 22:57:04 +0200
Cc: Jed Brown <jed@xxxxxxxx>, fenics@xxxxxxxxxxxxxxxxxxx, Nico Schlömer <nico.schloemer@xxxxxxxxx>
In-reply-to: <516C6856.7090600@gmail.com>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Apr 15, 2013 at 09:51:34PM +0100, Florian Rathgeber wrote:
> On 14/04/13 21:26, Anders Logg wrote:
> > On Fri, Apr 12, 2013 at 05:50:25PM +0100, Florian Rathgeber wrote:
> >> On 11/04/13 02:53, Florian Rathgeber wrote:
> >>> On 10/04/13 15:54, Anders Logg wrote:
> >>>> On Wed, Apr 10, 2013 at 12:42:13PM +0100, Florian Rathgeber wrote:
> >>>>> On 09/04/13 22:47, Florian Rathgeber wrote:
> >>>>>> On 09/04/13 20:14, Anders Logg wrote:
> >>>>>>> Another option would be git submodules. Florian suggested this to me
> >>>>>>> earlier.
> >>>>>>
> >>>>>> That's what I think would have been a good option for outsourcing the
> >>>>>> references. They are by far the biggest chunk of the FFC repository (in
> >>>>>> size) and only developers care about them, while everyone else has a
> >>>>>> much larger repository to clone which also takes up considerable disk
> >>>>>> space (51M at the moment).
> >>>>>>
> >>>>>> Having the references be a submodule means the
> >>>>>> test/regression/references directory would be a pointer to a particular
> >>>>>> revision (SHA1) of another repository. Each FFC revision would have a
> >>>>>> particular revision of the ffc-references repository associated with it,
> >>>>>> so there is no ambiguity. It would also have the advantage that if we
> >>>>>> would completely redesign the FFC testing infrastructure and wouldn't
> >>>>>> need the references any more we could simply get rid of the submodule
> >>>>>> and wouldn't have to carry around their burden in history forever.
> >>>>>>
> >>>>>> There are a few caveats though:
> >>>>>>
> >>>>>> 1) If we were doing this now we would need to rewrite the history again,
> >>>>>> completely strip the references folder and replace it by the submodule.
> >>>>>>
> >>>>>> 2) Syncing a git repository over to launchpad for automatic package
> >>>>>> building with the bzr builder is not possible if the repository has
> >>>>>> *ever* included a submodule in its history [1], but there are
> >>>>>> workarounds [2] (which can't be run as a BitBucket hook however).
> >>>>>>
> >>>>>> 3) Pull requests would be a bit more tricky since ffc-references and ffc
> >>>>>> would have to be always merged as a pair. For core developers with push
> >>>>>> access to the repositories this could probably be handled with a
> >>>>>> pre-commit hook.
> >>>>>>
> >>>>>> [1]: https://bugs.launchpad.net/bzr-git/+bug/402814
> >>>>>> [2]:
> >>>>>> https://bazaar.launchpad.net/~videolan/vlc/manual-bzr-import/view/head:/manual-bzr-import
> >>>>>
> >>>>> It appears we can't get anyone excited about a discussion of these issues.
> >>>>> Have we scared everyone away?
> >>>>>
> >>>>> What are your thoughts on the submodule for FFC references? If we decide
> >>>>> to rewrite again we should do it asap before people actually start
> >>>>> basing work off the new FFC repo.
> >>>>
> >>>> I think we should rewrite now and do the submodule thing. Then the
> >>>> references won't clutter the history and we are free to later move
> >>>> them somewhere else (like automatic CMake fetch if we decide to do
> >>>> that).
> >>>
> >>> I've done some research and there seem to be some options for splicing a
> >>> subdirectory into a submodule while keeping the correct association
> >>> throughout history, i.e. every revision of the main repo points to the
> >>> correct revision of the submodule:
> >>> http://thread.gmane.org/gmane.comp.version-control.git/109805/
> >>
> >> Couldn't get this working even after some fiddling.
> >>
> >>> http://thread.gmane.org/gmane.comp.version-control.git/164489/
> >>> http://thread.gmane.org/gmane.comp.version-control.git/164463/
> >>
> >> The full thread is at
> >> http://thread.gmane.org/gmane.comp.version-control.git/164386/
> >>
> >> I could get this to work, and it seems to do pretty much what we want:
> >> splits the subdirectory into a submodule (within the same repository!)
> >> and maintains the correct association by storing the submodule revision
> >> in the parent's index. It does however not create (and update) a
> >> .gitmodules file, so you have to know where the submodule is linked to
> >> the parent and it's slightly awkward to put it in place:
> >>
> >> $ git clone . test/regression/references
> >> $ rev=`git rev-parse :test/regression/references`
> >> $ ( cd test/regression/references && git reset --hard $rev )
> >>
> >> However it should be possible to add a .gitmodules file and then treat
> >> it in the normal way. To be able to push/pull the submodule tree it's
> >> also necessary to create a ref to it e.g.:
> >>
> >> $ git update-ref refs/test/references <sha>
> >>
> >> Note that this is deliberately *not* a branch ref (branch refs live in
> >> refs/heads/), which means it won't be fetched by default. That means
> >> even though the references tree is in the repository, users don't invest
> >> the bandwidth to fetch it unless they explicitly configure it to (which
> >> developers who want to run regression tests will need to do).
> >
> > ok. I honestly can't follow the technical details here...
>
> The important points are:
> 1) FFC and the references are stored in one and the same repository
> 2) the references are not fetched (transferred) by default
> 3) from each FFC revision throughout history its associated references
> are reachable
> 4) checking out the references to test/regression/references is slightly
> awkward, but can easily be scripted; same for updating

ok.
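
For what it's worth, point 2 (opting in to the references) could be
scripted roughly like this. It is only a sketch: the function names and the
default remote "origin" are assumptions, and refs/test/references is the
example ref name used earlier in the thread.

```shell
# Sketch only: function names and the default remote "origin" are assumptions.

# Build a refspec that maps a ref to the same name in the local repository.
references_refspec() {
    ref=${1:-refs/test/references}
    printf '+%s:%s\n' "$ref" "$ref"
}

# One-off fetch of the references ref, which a plain clone does not transfer
# because it lives outside refs/heads/.
fetch_references() {
    remote=${1:-origin}
    git fetch -q "$remote" "$(references_refspec "${2:-}")"
}

# Opt in permanently: every later "git fetch" from the remote also updates
# the references ref.
enable_references_fetch() {
    remote=${1:-origin}
    git config --add "remote.$remote.fetch" "$(references_refspec "${2:-}")"
}

# Usage (inside a clone of the repository):
#   fetch_references origin
#   enable_references_fetch origin
```

Everyone who never calls either function keeps the default refspec and so
never downloads the reference data.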

> >>> Regarding the caveats from above: we're willing to accept 1), 2) I think
> >>> is not a big deal (I'm not even sure Johannes is using bzr builder?), so
> >>> the main thing is 3). Given that the history of the references isn't
> >>> really important, only the association, it's maybe not so scary. It's just
> >>> a bit more work maintaining 2 repositories, though most of it could be
> >>> scripted, at least for the benefit of the core devs.
> >>>
> >>> I've had another chat with Jed and he suggested using git-fat. He's the
> >>> author and it was specifically written for that use case: keeping a
> >>> unified repository/history but storing large (optional) files outside of
> >>> .git/objects to keep the repository slim. The downside is that you then
> >>> need a separate central location where these files are kept. git-fat
> >>> manages them for you, so running an rsync daemon on the FEniCS web
> >>> server might already do the trick.
> >>
> >> After a closer look at git-fat I think it's not perfect for our use
> >> case: the actual files on disk are only stubs (which only contain the 40
> >> byte SHA1) and are replaced by the actual big blobs by a smudge/clean
> >> filter, but *only for certain operations*. Unfortunately diff is not one
> >> of them and I think it's the one we care about: being able to view the
> >> diff between the output and the old reference before updating. If we
> >> don't care about the diff we could just as well only store a hash of the
> >> reference.
> >
> > We need to be able to view the diff - if something changed, we must be
> > able to spot if it is a harmless formatting fix.
> >
> > And note that it's not just the text diffs that are important. We also
> > store data that comes out of running the generated code, and for that
> > data checksums aren't enough.
>
> OK. I think that rules out git-fat.

ok.

> >>> We then went on to discuss whether we could in fact leverage git in the
> >>> regression test suite itself: there is no inherent reason why the
> >>> references actually need to exist as files in the work tree. An
> >>> identifiable loose object in the repository would be sufficient. I'll
> >>> forward the log so you can get the idea.
> >>
> >> Are there any plans for changing the FFC testing infrastructure?
> >
> > Martin has been doing some work on using .json for storing the
> > reference data.
>
> I thought the reference data was an addition to testing the generated
> headers?

Yes, we do both. Both the code itself and the output from running the
code are tested.

> > Considering that there doesn't seem to be a perfect git solution for
> > storing the references at this point, my suggestion would be to store
> > the references on the web server with rsync and a small bash script
> > that will download (and upload) the appropriate references. The script
> > would look for data in a directory named with the git hash of the
> > youngest available ancestor.
>
> The submodule solution isn't perfect, but it has the main advantage that
> code and references are stored in the same place. The rsync solution you
> describe seems feasible, but introduces another place where data is kept.

It also has the advantage that it's something I can slap together in a
simple bash script. I don't know enough about git to handle it using
the submodule approach, and I don't know if I should ask you to spend
another 2 weeks developing the script(s) for it. :-)
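
To make the idea concrete, the lookup for the "youngest available
ancestor" might be sketched as below. This assumes a directory (here a
hypothetical local mirror of the web server) containing one subdirectory
per git hash, and it treats the order printed by "git rev-list HEAD" as a
stand-in for "youngest first"; the function name is made up.

```shell
# Sketch only: the function name and the mirror layout are assumptions.

# Read commit hashes from stdin, newest first, and print the first one that
# has a directory of the same name under BASE. Fails if none is found.
youngest_available() {
    base=$1
    while IFS= read -r hash; do
        if [ -d "$base/$hash" ]; then
            printf '%s\n' "$hash"
            return 0
        fi
    done
    return 1
}

# Usage (inside the FFC repository, with a hypothetical mirror path):
#   hash=$(git rev-list HEAD | youngest_available /path/to/mirror) &&
#       rsync -a "/path/to/mirror/$hash/" test/regression/references/
```

Reading the hashes from stdin keeps the lookup independent of git, so the
same function works whether the list comes from "git rev-list" or from a
file.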

--
Anders