
fenics team mailing list archive

Re: Cleanup of repositories

 

On Mon, Mar 25, 2013 at 11:32:35AM +0000, Florian Rathgeber wrote:
> On 25/03/13 10:50, Garth N. Wells wrote:
> > On 25 March 2013 08:31, Florian Rathgeber
> > <florian.rathgeber@xxxxxxxxx> wrote:
> >> On 22/03/13 09:59, Johan Hake wrote:
> >>> On 03/22/2013 10:57 AM, Anders Logg wrote:
> >>>> On Fri, Mar 22, 2013 at 10:52:25AM +0100, Johan Hake wrote:
> >>>>> On 03/22/2013 10:36 AM, Anders Logg wrote:
> >>>>>> On Fri, Mar 22, 2013 at 10:32:50AM +0100, Johan Hake
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Not exactly:
> >>>>>>>>
> >>>>>>>> - Meshes in demos --> remove (already done)
> >>>>>>> I suggest we keep these. There aren't any big files
> >>>>>>> anyhow, are there?
> >>>>>>
> >>>>>> They have already been removed and there's a good system
> >>>>>> in place for handling them. Keeping the meshes elsewhere
> >>>>>> will encourage use of the mesh gallery and help keep better
> >>>>>> track of which meshes to use. There were lots of meshes
> >>>>>> named 'mesh.xml' or 'mesh2d.xml' which were really copies
> >>>>>> of meshes used in other demos; some of them were
> >>>>>> gzipped, some not, etc. That's all very clean now. Take a
> >>>>>> look at how it's done in trunk. I think it looks quite
> >>>>>> nice.
> >>>>>
> >>>>> Nice and clean, but it really is just 30 meshes.
> >>>>> Duplications are mostly of dolfin_fine.xml.gz, of which
> >>>>> there are 7 copies, and that file is 86K.
> >>
> >> If they're bit-by-bit identical git will only store a single copy
> >> in the repository anyway, regardless of how many copies you
> >> happen to have in the working tree.
> >
> > Clever.
> >
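
(Easy to verify, by the way: git addresses file contents by their
SHA1 hash, so bit-identical files map to the same blob object. For
example,

  $ git hash-object demo/poisson/dolfin_fine.xml.gz demo/stokes/dolfin_fine.xml.gz

prints the same hash twice, meaning only one copy is stored in the
object database. The demo paths are made up for illustration.)
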
> >> On the note of storing gzipped meshes: Do they change
> >> frequently?
> >
> > No.
> >
> >> Why are they stored gzipped?
> >
> > Habit. It's not good for version control.
>
> With a bit of trickery we might even be able to convert all those
> gzipped meshes, i.e. unzip them in each revision and only keep the
> xml in the repo (retroactively for the entire history).
>
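
If I understand the trickery correctly, that would be something like
a tree filter over the whole history, e.g. (untested sketch):

  $ git filter-branch --tree-filter \
      'find . -name "*.xml.gz" -exec gunzip -f {} +' \
      -- --all

i.e. every revision is checked out, the gzipped meshes are unzipped
in place, and the commit is rewritten with the plain xml files.
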
> >> Compressed files have a few issues: 1) they're treated as binary,
> >> i.e. any change requires a new copy of the entire file to be
> >> stored; 2) they can't be diffed; 3) git compresses its packfiles
> >> anyway, so there is little (if any) space gain through
> >> compression.
> >>
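
Point 2) is easy to see in practice: for any change to a gzipped
mesh, git can only report

  $ git diff HEAD~1 -- mesh.xml.gz
  Binary files a/mesh.xml.gz and b/mesh.xml.gz differ

(file name made up), whereas a plain xml mesh would give a normal
line-by-line diff.
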
> >>>>>> Most of the example meshes are not that big, but
> >>>>>> multiply that by 30 and then some when meshes are moved
> >>>>>> around or renamed.
> >>>>>
> >>>>> I just question if it is worth it. Seems convenient to
> >>>>> just have the meshes there.
> >>>>
> >>>> Keeping the meshes there will put a limit on which demos we
> >>>> can add. I think it would be good to allow for more complex
> >>>> demos requiring bigger meshes (not necessarily run on the
> >>>> buildbot every day).
> >>>
> >>> Ok.
> >>>
> >>>>> If we keep them out of the repo I think we should include
> >>>>> some automagic downloading when building the demos.
> >>>>
> >>>> Yes, or at least a message stating: "You have not downloaded
> >>>> demo data. Please run the script foo."
> >>>>
> >>>>> Also should we rename the script to download-demo-meshes,
> >>>>> or something more descriptive, as this is what that script
> >>>>> now basically does?
> >>>>
> >>>> It is not only meshes, but also markers and velocity fields.
> >>>> Perhaps it can be renamed download-demo-data?
> >>>
> >>> Sounds good.
> >>>
> >>> Johan
> >>
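
On the automagic downloading: a minimal sketch of the check in the
demo build/run scripts (the variable name is made up):

  if [ ! -d "$DOLFIN_DEMO_DATA" ]; then
      echo "You have not downloaded demo data."
      echo "Please run the script download-demo-data."
      exit 1
  fi

Calling download-demo-data from there instead of just printing the
message would be the fully automagic version.
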
> >> I did some more experimenting:
> >>
> >> 1) Repository size: there is quite some mileage in repacking the
> >> repos with the following steps:
> >>
> >> $ git reflog expire --expire=now --all
>
> git keeps track of how branch HEADs move and does not garbage
> collect these revisions. This information is kept for 90 days by
> default. Tell git to clear this history and "release" it for
> garbage collection.
>
> >> $ git gc --aggressive --prune=now
>
> Invoke git's garbage collection and tell it to aggressively remove all
> objects from packfiles which are no longer reachable in the DAG.
>
> >> $ git repack -ad
>
> Rewrite the packfiles and remove all redundant packs.

Wow. I didn't understand much of that but it sounds like a good
thing...
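
For future reference, the whole sequence in one go, with a size check
before and after (run from the repository root):

  $ du -sh .git
  $ git reflog expire --expire=now --all
  $ git gc --aggressive --prune=now
  $ git repack -ad
  $ du -sh .git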

> >> e.g. DOLFIN: 372MiB -> 94MiB
> >
> > Wow. What do these commands do?
> >
> >> 2) Stripping out the files suggested by Anders
> >> (https://gist.github.com/alogg/5213171#file-files_to_strip-txt)
> >> brings the repo size down to 172MiB, and to 24MiB after repacking.
> >
> > I like this. It will make cloning on a slow connection much better.

Trimming the repository down to 24 MB sounds very tempting.
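
I assume the stripping itself is the usual index-filter idiom, i.e.
something like (untested sketch; the list file would hold the paths
from the gist):

  $ git filter-branch --index-filter \
      'git rm -r --cached --ignore-unmatch $(cat /tmp/files_to_strip.txt)' \
      --prune-empty -- --all

followed by the repacking commands above to actually reclaim the
space.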

> >> 3) I haven't yet found a reliable way to migrate feature branches
> >> to the filtered repository. Filtering the repository rewrites its
> >> history and therefore changes/invalidates all commit ids (SHA1s),
> >> which in turn invalidates the marks files created when initially
> >> converting the repository. There are 2 possible options for
> >> filtering the repository during conversion:
> >>
> >> a) bzr fast-import-filter: seems to be a pain to use with many
> >> files (each path needs to be passed individually as an argument)
> >> and seems not to support writing marks files, so I haven't
> >> tried it.
> >>
> >> b) git_fast_filter: when using it to filter the converted git
> >> repo, the exported marks file in the last step contains 83932
> >> marks instead of the expected 14399 - I can't say why.
> >> Unfortunately I haven't been able to use it directly in the
> >> conversion pipeline, as it's not compatible with a bzr
> >> fast-export stream. That's probably fixable, but I can't estimate
> >> how much work it would be since I'm not familiar enough with the
> >> details of the fast-import format.
> >>
> >> TL;DR: Repacking repos already saves a lot of space without
> >> stripping large files. Stripping files is easy to do and saves
> >> considerably more space, but I haven't been able to reliably
> >> import feature branches into a filtered repository.
> >
> > How about we give everyone a period within which to merge code
> > on Launchpad, then we don't worry about feature branches and marks
> > in the conversion? Small changes can always come later in the form
> > of patches.
>
> Yes, that's an option. Git has very good support for importing patch
> series; maybe bzr can export patch series in the git am format. The
> other alternative is importing the feature branch into the
> non-filtered git repository and transplanting it to the filtered one
> via an interactive rebase. It's just a bit more work than I would
> have hoped for.
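
For the patch route, the git side at least is straightforward (branch
names made up for illustration):

  $ git format-patch master..my-feature -o /tmp/patches   # in the converted repo
  $ git am /tmp/patches/*.patch                           # in a clone of the filtered repo

Whether bzr can export that format directly, I don't know.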

I like Garth's suggestion. How about we set a deadline of Friday for
any pending merges, in combination with being a bit more accommodating
about getting the merges in place? Then we freeze the repositories over
the weekend and do the conversions/stripping. After that, anyone
wishing to merge code in will need to clone a fresh copy of the new
git repository and apply patches manually.

--
Anders

