
fenics team mailing list archive

Re: Cleanup of repositories


On 22/03/13 09:59, Johan Hake wrote:
> On 03/22/2013 10:57 AM, Anders Logg wrote:
>> On Fri, Mar 22, 2013 at 10:52:25AM +0100, Johan Hake wrote:
>>> On 03/22/2013 10:36 AM, Anders Logg wrote:
>>>> On Fri, Mar 22, 2013 at 10:32:50AM +0100, Johan Hake wrote:
>>>>>> 
>>>>>> 
>>>>>> Not exactly:
>>>>>> 
>>>>>> - Meshes in demos --> remove (already done)
>>>>> I suggest we keep these. There aren't any big files
>>>>> anyhow, are there?
>>>> 
>>>> They have already been removed and there's a good system in 
>>>> place for handling them. Keeping the meshes elsewhere will 
>>>> encourage use of the mesh gallery and keeping better track
>>>> of which meshes to use. There were lots of meshes named 
>>>> 'mesh.xml' or 'mesh2d.xml' which were really copies of other 
>>>> meshes used in other demos, some of them were gzipped, some 
>>>> not etc. That's all very clean now. Take a look at how it's 
>>>> done in trunk. I think it looks quite nice.
>>> 
>>> Nice and clean, but it really is just 30 meshes. The duplication
>>> is mostly dolfin_fine.xml.gz, of which there are 7 copies, and
>>> that file is 86K.

If they're bit-for-bit identical, git stores only a single copy in
the repository anyway, regardless of how many copies you happen to
have in the working tree.

On the note of storing gzipped meshes: do they change frequently? Why
are they stored gzipped? Compressed files have a few issues:
1) they're treated as binary, i.e. any change requires a new copy of
the entire file to be stored
2) they can't be diffed
3) git compresses its packfiles anyway, so there is little (if any)
space gained by pre-compressing

>>>> Most of the example meshes are not that big, but multiply 
>>>> that by 30 and then some when meshes are moved around or 
>>>> renamed.
>>> 
>>> I just question if it is worth it. Seems convenient to just 
>>> have the meshes there.
>> 
>> Keeping the meshes there will put a limit on which demos we can 
>> add. I think it would be good to allow for more complex demos 
>> requiring bigger meshes (not necessarily run on the buildbot 
>> every day).
> 
> Ok.
> 
>>> If we keep them out of the repo I think we should include some 
>>> automagic downloading when building the demos.
>> 
>> Yes, or at least a message stating: "You have not downloaded demo
>> data. Please run the script foo."
>> 
>>> Also should we rename the script to download-demo-meshes, or 
>>> something more descriptive, as this is what that script now 
>>> basically does?
>> 
>> It is not only meshes, but also markers and velocity fields. 
>> Perhaps it can be renamed download-demo-data?
> 
> Sounds good.
> 
> Johan
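A possible shape for the renamed download-demo-data script; everything here (the function name, the one-path-per-line manifest format, and the base URL argument) is a hypothetical sketch, not actual FEniCS tooling:

```shell
#!/bin/sh
# Hypothetical sketch of a download-demo-data helper: fetch any listed
# data file (meshes, markers, velocity fields) that is not already present.
fetch_demo_data () {
  base=$1      # download location, e.g. an http:// or file:// base URL
  manifest=$2  # text file listing the demo data files, one path per line
  while read -r path; do
    [ -f "$path" ] && continue            # already downloaded, skip
    mkdir -p "$(dirname "$path")"
    curl -fsS -o "$path" "$base/$path"
  done < "$manifest"
}
```

A build hook could then print the suggested message ("You have not downloaded demo data. Please run the script foo.") whenever manifest entries are missing.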

I did some more experimenting:

1) Repository size: there is quite some mileage in repacking the repos
with the following steps:
$ git reflog expire --expire=now --all
$ git gc --aggressive --prune=now
$ git repack -ad
e.g. DOLFIN: 372MiB -> 94MiB
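The steps above can be wrapped into a small helper that reports the size change (a sketch; `repack_repo` is a made-up name, and `du -sk .git` is just one way to measure):

```shell
#!/bin/sh
# Sketch: run the repacking steps from inside a clone and report the
# before/after size of .git (this is what gave 372MiB -> 94MiB for DOLFIN).
repack_repo () {
  before=$(du -sk .git | cut -f1)
  git reflog expire --expire=now --all &&
  git gc --aggressive --prune=now &&
  git repack -ad &&
  after=$(du -sk .git | cut -f1) &&
  echo "repacked: ${before}KiB -> ${after}KiB"
}
```

Usage: `cd dolfin && repack_repo` from the top-level directory of the clone.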

2) Stripping out the files suggested by Anders
(https://gist.github.com/alogg/5213171#file-files_to_strip-txt) brings
the repo size down to 172MiB, and to 24MiB after repacking.
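One way such stripping can be done is git filter-branch with an index filter; a sketch, where `strip_files` is a made-up helper and the real path list would come from the gist above (paths without spaces assumed):

```shell
#!/bin/sh
# Sketch: remove the given paths from every commit on every ref. This
# rewrites all of history, which is exactly why every SHA1 changes and
# any previously recorded marks files become invalid.
strip_files () {
  # $* : space-free paths to purge from all history
  git filter-branch -f --index-filter \
    "git rm -r --cached --ignore-unmatch $*" \
    --prune-empty --tag-name-filter cat -- --all
}
```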

3) I haven't yet found a reliable way to migrate feature branches to
the filtered repository. Filtering the repository rewrites its history,
which changes all commit ids (SHA1s) and thereby invalidates the marks
files created when the repository was initially converted. There are
two possible options for filtering the repository during conversion:

a) bzr fast-import-filter: it seems to be a pain to use with many files
(each path needs to be passed individually as an argument) and it does
not appear to support writing marks files, so I haven't tried it.

b) git_fast_filter: when using it to filter the converted git repo, the
exported marks file in the last step contains 83932 marks instead of
the expected 14399 - I can't say why. Unfortunately I haven't been
able to use it directly in the conversion pipeline, as it's not
compatible with a bzr fast-export stream. That's probably fixable, but
I can't estimate how much work it would be, since I'm not familiar
enough with the details of the fast-import format.

TL;DR: Repacking the repos already saves a lot of space without
stripping large files. Stripping files is easy to do and saves
considerably more space, but I haven't been able to reliably import
feature branches into a filtered repository.

Florian


