fenics team mailing list archive

Re: CMake 2.8.11: ExternalData

To: Florian Rathgeber <florian.rathgeber@xxxxxxxxx>
From: Anders Logg <logg@xxxxxxxxx>
Date: Sun, 14 Apr 2013 22:26:27 +0200
Cc: fenics@xxxxxxxxxxxxxxxxxxx, Nico Schlömer <nico.schloemer@xxxxxxxxx>
In-reply-to: <51683B51.1040709@gmail.com>
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Apr 12, 2013 at 05:50:25PM +0100, Florian Rathgeber wrote:
> On 11/04/13 02:53, Florian Rathgeber wrote:
> > On 10/04/13 15:54, Anders Logg wrote:
> >> On Wed, Apr 10, 2013 at 12:42:13PM +0100, Florian Rathgeber wrote:
> >>> On 09/04/13 22:47, Florian Rathgeber wrote:
> >>>> On 09/04/13 20:14, Anders Logg wrote:
> >>>>> Another option would be git submodules. Florian suggested this to me
> >>>>> earlier.
> >>>>
> >>>> That's what I think would have been a good option for outsourcing the
> >>>> references. They are by far the biggest chunk of the FFC repository (in
> >>>> size) and only developers care about them, while everyone else has a
> >>>> much larger repository to clone which also takes up considerable disk
> >>>> space (51M at the moment).
> >>>>
> >>>> Having the references be a submodule means the
> >>>> test/regression/references directory would be a pointer to a particular
> >>>> revision (SHA1) of another repository. Each FFC revision would have a
> >>>> particular revision of the ffc-references repository associated with it,
> >>>> so there is no ambiguity. It would also have the advantage that if we
> >>>> would completely redesign the FFC testing infrastructure and wouldn't
> >>>> need the references any more we could simply get rid of the submodule
> >>>> and wouldn't have to carry around their burden in history forever.
> >>>>
> >>>> There's a few caveats though:
> >>>>
> >>>> 1) If we were doing this now we would need to rewrite the history again,
> >>>> completely strip the references folder and replace it by the submodule.
> >>>>
> >>>> 2) Syncing a git repository over to launchpad for automatic package
> >>>> building with the bzr builder is not possible if the repository has
> >>>> *ever* included a submodule in its history [1], but there are
> >>>> workarounds [2] (which can't be run as a BitBucket hook however).
> >>>>
> >>>> 3) Pull requests would be a bit more tricky since ffc-references and ffc
> >>>> would have to be always merged as a pair. For core developers with push
> >>>> access to the repositories this could probably be handled with a
> >>>> pre-commit hook.
> >>>>
> >>>> [1]: https://bugs.launchpad.net/bzr-git/+bug/402814
> >>>> [2]:
> >>>> https://bazaar.launchpad.net/~videolan/vlc/manual-bzr-import/view/head:/manual-bzr-import
> >>>
> >>> It appears we can't get anyone excited about a discussion of these issues.
> >>> Have we scared everyone away?
> >>>
> >>> What are your thoughts on the submodule for FFC references? If we decide
> >>> to rewrite again we should do it asap before people actually start
> >>> basing work off the new FFC repo.
> >>
> >> I think we should rewrite now and do the submodule thing. Then the
> >> references won't clutter the history and we are free to later move
> >> them somewhere else (like automatic CMake fetch if we decide to do
> >> that).
> >
> > I've done some research and there seem to be some options for splicing a
> > subdirectory into a submodule while keeping the correct association
> > throughout history i.e. every revision of the main repo points to the
> > correct revision of the submodule:
> > http://thread.gmane.org/gmane.comp.version-control.git/109805/
>
> Couldn't get this working even after some fiddling.
>
> > http://thread.gmane.org/gmane.comp.version-control.git/164489/
> > http://thread.gmane.org/gmane.comp.version-control.git/164463/
>
> The full thread is at
> http://thread.gmane.org/gmane.comp.version-control.git/164386/
>
> I could get this to work, and it seems to do pretty much what we want:
> splits the subdirectory into a submodule (within the same repository!)
> and maintains the correct association by storing the submodule revision
> in the parent's index. It does however not create (and update) a
> .gitmodules file, so you have to know where the submodule is linked to
> the parent and it's slightly awkward to put it in place:
>
> $ git clone . test/regression/references
> $ rev=`git rev-parse :test/regression/references`
> $ ( cd test/regression/references && git reset --hard $rev )
>
> However it should be possible to add a .gitmodules file and then treat
> it in the normal way. To be able to push/pull the submodule tree it's
> also necessary to create a ref to it e.g.:
>
> $ git update-ref refs/test/references <sha>
>
> Note that this is deliberately *not* a branch ref (those live in
> refs/heads/), which means it won't be fetched by default. That means
> even though the references tree is in the repository, users don't invest
> the bandwidth to fetch it unless they explicitly configure it to (which
> developers who want to run regression tests will need to do).
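
As a rough sketch of the explicit fetch described above: the ref name
refs/test/references comes from the quoted message, while the remote name
"origin" and both function names are assumptions made for illustration.

```shell
# Sketch only: "origin" and the function names are assumptions.

# Fetch the references ref once. It lives outside refs/heads/, so a plain
# "git fetch" or "git clone" does not transfer it.
fetch_references() {
    remote="${1:-origin}"
    git fetch -q "$remote" refs/test/references:refs/test/references
}

# Opt in permanently: add the refspec to the remote's configuration so
# that every later "git fetch" also updates the references ref.
enable_reference_fetch() {
    remote="${1:-origin}"
    git config --add "remote.$remote.fetch" \
        "+refs/test/references:refs/test/references"
}
```

Users who never call either function keep a clone without the references,
while developers running the regression tests would call
enable_reference_fetch once per clone.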

ok. I honestly can't follow the technical details here...

> > Regarding the caveats from above: we're willing to accept 1), 2) I think
> > is not a big deal (I'm not even sure Johannes is using bzr builder?), so
> > the main thing is 3). Given that the history of the references isn't
> > really important, only the association, it's maybe not so scary. It's just
> > a bit more work maintaining 2 repositories, though most of it could be
> > scripted, at least for the benefit of the core devs.
> >
> > I've had another chat with Jed and he suggested using git-fat. He's the
> > author and it was specifically written for that use case: keeping a
> > unified repository/history but storing large (optional) files outside of
> > .git/objects to keep the repository slim. The downside is that you then
> > need a separate central location where these files are kept. git-fat
> > manages them for you, so running an rsync daemon on the FEniCS web
> > server might already do the trick.
>
> After a closer look at git-fat I think it's not perfect for our use
> case: the actual files on disk are only stubs (which only contain the 40
> byte SHA1) and are replaced by the actual big blobs by a smudge/clean
> filter, but *only for certain operations*. Unfortunately diff is not one
> of them and I think it's the one we care about: being able to view the
> diff between the output and the old reference before updating. If we
> don't care about the diff we could just as well only store a hash of the
> reference.

We need to be able to view the diff - if something changed, we must be
able to spot if it is a harmless formatting fix.

And note that it's not just the text diffs that are important. We also
store the data that come out of running the generated code, and for
those checksums aren't enough.

> > We then went on to discuss whether we could in fact leverage git in the
> > regression test suite itself: there is no inherent reason why the
> > references actually need to exist as files in the work tree. An
> > identifiable loose object in the repository would be sufficient. I'll
> > forward the log so you can get the idea.
>
> Are there any plans for changing the FFC testing infrastructure?

Martin has been doing some work on using .json for storing the
reference data.

Considering that there doesn't seem to be a perfect git solution for
storing the references at this point, my suggestion would be to store
the references on the web server with rsync and a small bash script
that will download (and upload) the appropriate references. The script
would look for data in a directory named with the git hash of the
youngest available ancestor.
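
A minimal sketch of what such a script could look like. Everything named
here is an assumption: REFS_ROOT stands for a local copy (or mount) of the
server directory, REFS_REMOTE for a hypothetical rsync target, and the
function names are made up for illustration.

```shell
# Sketch only: both locations are hypothetical placeholders.
REFS_ROOT="${REFS_ROOT:-$HOME/.ffc-references}"
REFS_REMOTE="${REFS_REMOTE:-user@example.org:ffc-references}"

# Print the hash of the youngest ancestor of HEAD (HEAD included) that has
# a directory under REFS_ROOT. Returns 1 if no ancestor has one.
find_reference_rev() {
    for rev in $(git rev-list HEAD); do
        if [ -d "$REFS_ROOT/$rev" ]; then
            echo "$rev"
            return 0
        fi
    done
    return 1
}

# Copy the references of that ancestor into the work tree.
download_references() {
    found=$(find_reference_rev) || { echo "no references found" >&2; return 1; }
    rsync -a "$REFS_ROOT/$found/" test/regression/references/
}

# Upload the current references under the hash of HEAD.
upload_references() {
    head=$(git rev-parse HEAD)
    rsync -a test/regression/references/ "$REFS_REMOTE/$head/"
}
```

Since git rev-list HEAD lists commits newest first, the first hit is the
youngest ancestor with stored references. A commit that does not change
the references then needs no upload of its own.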

--
Anders