Sebastien Varrette, PhD.
Research Scientist, Head Research Computing & HPC Operations
Git Repo Cleanup With BFG
Before leaving, I wanted to publicly release some of our internal repositories (hosted on our private GitLab instance <gitlab>) so that their commit history is not lost and the projects can better survive my departure.
To get rid of any sensitive information disseminated within that commit history, the best tool I’m aware of is BFG.
Below are the notes I took when “exposing” the sources of the ULHPC Technical Documentation, initially hosted in the www/ulhpc-docs repository, on GitHub under ULHPC/ulhpc-docs.
Create a Mirrored clone
# AFTER commit for file removal (see below)
$ git clone --mirror ssh://git@<gitlab>:<port>/www/ulhpc-docs.git ulhpc-docs.bare # --mirror implies --bare
$ cd ulhpc-docs.bare
$ git remote set-url origin git@github.com:ULHPC/ulhpc-docs.git # change remote to target github
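As a rough illustration of why a mirrored clone is used here, the sketch below builds a throwaway source repository, mirrors it, and retargets the remote. Everything in it is hypothetical (the temporary paths, the fake merge-request ref, the example/example.git URL); it only shows that --mirror yields a bare repository carrying every ref, including GitLab-style refs/merge-requests/* entries like those listed in the BFG report further down.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway workspace (hypothetical paths, nothing here touches a real repo)
work=$(mktemp -d "${TMPDIR:-/tmp}/mirror-demo.XXXXXX")
src="$work/src"
git init -q "$src"
git -C "$src" -c user.name=demo -c user.email=demo@example.org -c commit.gpgsign=false \
    commit -q --allow-empty -m 'initial commit'

# Simulate a GitLab-style merge-request ref, which lives outside refs/heads and refs/tags
git -C "$src" update-ref refs/merge-requests/1/head HEAD

# A mirror clone is bare and copies all refs, not only branches and tags
git clone -q --mirror "$src" "$work/mirror.git"
mirrored_refs=$(git -C "$work/mirror.git" for-each-ref --format='%(refname)')
is_bare=$(git -C "$work/mirror.git" rev-parse --is-bare-repository)

# Retarget the mirror; the URL is a placeholder, not a real destination
git -C "$work/mirror.git" remote set-url origin git@github.com:example/example.git
new_url=$(git -C "$work/mirror.git" remote get-url origin)

echo "bare: $is_bare"
echo "origin: $new_url"
echo "$mirrored_refs"
```

Because those extra refs come along, they are rewritten by BFG too, which is why the report lists refs/merge-requests/* next to the branches.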
File removal
To remove the deploy instructions and targets (previously defined in .Makefile.local), simply delete the file in a commit. Then list the sensitive strings to scrub in a pattern file and run:
$ vim pattern_ulhpc-docs_to_filter.txt # 1 pattern / literal per line
$ bfg --replace-text pattern_ulhpc-docs_to_filter.txt ulhpc-docs.bare
Using repo : /Users/svarrette/git/<gitlab>/www/ulhpc-docs.bare
Found 604 objects to protect
Found 20 commit-pointing refs : HEAD, refs/heads/master, refs/heads/production, ...
Found 3 tag-pointing refs : refs/tags/v0.0.1-b14, refs/tags/v0.0.2-b106, refs/tags/v0.1.0-b407
Protected commits
-----------------
These are your protected commits, and so their contents will NOT be altered:
* commit abc367fd (protected by 'HEAD') - contains 1 dirty file :
- docs/accounts/index.md (6.5 KB)
WARNING: The dirty content above may be removed from other commits, but as
the *protected* commits still use it, it will STILL exist in your repository.
Details of protected dirty content have been recorded here :
/Users/svarrette/git/<gitlab>/www/ulhpc-docs.bare.bfg-report/2022-08-19/13-34-25/protected-dirt/
If you *really* want this content gone, make a manual commit that removes it,
and then run the BFG on a fresh copy of your repo.
Cleaning
--------
Found 505 commits
Cleaning commits: 100% (505/505)
Cleaning commits completed in 523 ms.
Updating 20 Refs
----------------
Ref Before After
--------------------------------------------------
refs/heads/master | abc367fd | d4120bb9
refs/heads/production | 8e0609a8 | a10e7e12
refs/merge-requests/11/head | caafc1fc | 5b0ef797
refs/merge-requests/11/merge | 4a7ae119 | d683ee13
refs/merge-requests/24/head | b2e5abf2 | af111fda
refs/merge-requests/24/merge | aa5f79f2 | 9fd2b961
refs/merge-requests/25/head | 542077d9 | fcdfb860
refs/merge-requests/25/merge | 4bc1dc89 | 1ad9a266
refs/merge-requests/26/head | 8cdf3234 | 5bd53370
refs/merge-requests/26/merge | d7ed877c | a1cae457
refs/merge-requests/27/head | 10a07b86 | 997823db
refs/merge-requests/27/merge | e5bb3ba9 | b16c50bb
refs/merge-requests/28/head | 6a2273f4 | 13e76163
refs/merge-requests/28/merge | 7941a7b4 | b6c3a502
refs/merge-requests/29/head | 7202d70f | da5f2f77
...
Updating references: 100% (20/20)
...Ref update completed in 48 ms.
Commit Tree-Dirt History
------------------------
Earliest                                              Latest
|                                                          |
..DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)
                        Before     After
-------------------------------------------
First modified commit | 5ddb21b8 | 3e44aa5c
Last dirty commit | 97a608f6 | 6d8ea242
Changed files
-------------
Filename Before & After
------------------------------------------------------------
index.md | a0804fef ⇒ e541a13a, 118976d4 ⇒ 6895f223, ...
ipa.md | 54a0db40 ⇒ b4584394
mkdocs.yml | 7f866a55 ⇒ 0b825df3, 78cf5492 ⇒ cb00d91e, ...
passwords.md | cbaaf304 ⇒ 5e93b26b, 2d31bcae ⇒ 3c234b31
In total, 1305 object ids were changed. Full details are logged here:
/Users/svarrette/git/<gitlab>/www/ulhpc-docs.bare.bfg-report/2022-08-19/13-34-25
BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
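For some reason, one occurrence of the sensitive information I wanted to remove still remained (you can check with git log -S<pattern>, which searches for the pattern across commits). The likely cause is visible in the report above: by default BFG does not alter the contents of the commit protected by HEAD, and docs/accounts/index.md was flagged there as protected dirty content.
So I repeated the run with --no-blob-protection: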
$ bfg --replace-text pattern_ulhpc-docs_to_filter.txt --no-blob-protection ulhpc-docs.bare
Using repo : /Users/svarrette/git/<gitlab>/www/ulhpc-docs.bare
Found 0 objects to protect
Found 20 commit-pointing refs : HEAD, refs/heads/master, refs/heads/production, ...
Found 3 tag-pointing refs : refs/tags/v0.0.1-b14, refs/tags/v0.0.2-b106, refs/tags/v0.1.0-b407
Protected commits
-----------------
You're not protecting any commits, which means the BFG will modify the contents of even *current* commits.
This isn't recommended - ideally, if your current commits are dirty, you should fix up your working copy and commit that, check that your build still works, and only then run the BFG to clean up your history.
Cleaning
--------
Found 505 commits
Cleaning commits: 100% (505/505)
Cleaning commits completed in 322 ms.
Updating 1 Ref
--------------
Ref Before After
---------------------------------------
refs/heads/master | d4120bb9 | dc17a918
Updating references: 100% (1/1)
...Ref update completed in 40 ms.
Commit Tree-Dirt History
------------------------
Earliest                                              Latest
|                                                          |
...........................................................D
D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)
                        Before     After
-------------------------------------------
First modified commit | adff298b | 55ddf9df
Last dirty commit | d4120bb9 | dc17a918
Changed files
-------------
Filename Before & After
------------------------------
index.md | 24b6cb9c ⇒ 6fde6054
In total, 5 object ids were changed. Full details are logged here:
/Users/svarrette/git/<gitlab>/www/ulhpc-docs.bare.bfg-report/2022-08-19/13-47-31
BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
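Both runs above read their replacement rules from pattern_ulhpc-docs_to_filter.txt. As a sketch of what such a file can look like, the snippet below writes three expressions in the syntax described in the BFG documentation: a plain literal (replaced by ***REMOVED*** by default), a literal with an explicit replacement after ==>, and a regex: rule. The file name and all three "secrets" are made up for illustration and are not the patterns used for ulhpc-docs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical pattern file; the real one is edited by hand with vim
pattern_file=$(mktemp "${TMPDIR:-/tmp}/patterns_to_filter.XXXXXX")

# One expression per line (all values below are placeholders):
#   line 1: literal, replaced by BFG's default marker ***REMOVED***
#   line 2: literal with an explicit replacement after '==>'
#   line 3: regular expression, introduced by the 'regex:' prefix
cat > "$pattern_file" <<'EOF'
s3cr3t-passw0rd
deploy.internal.example.org==>deploy.example.org
regex:api_key=\w+==>api_key=
EOF

# Quick sanity check before handing the file to: bfg --replace-text <file> <repo>
pattern_count=$(grep -c . "$pattern_file")
echo "${pattern_count} expressions written to ${pattern_file}"
```

Keeping this file outside the repository being cleaned avoids committing the very strings that are supposed to disappear.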
Checks
cd ulhpc-docs.bare
# Search for file
❯ git ls-tree --full-tree -r HEAD | grep Makefile
160000 commit 42333c7f1e58e8205c94a40ce4aa1074eb91e89f .submodules/Makefiles
100644 blob 9c13714c359aca59c48746beec4ddc5b64131f6c Makefile
❯ git ls-tree --full-tree -r HEAD~1 | grep Makefile
160000 commit 42333c7f1e58e8205c94a40ce4aa1074eb91e89f .submodules/Makefiles
100644 blob 9c13714c359aca59c48746beec4ddc5b64131f6c Makefile
# You can check the difference in the **original** repo
# UNDER www/ulhpc-docs:
# ❯ git ls-tree --full-tree -r HEAD | grep Makefile
# 160000 commit 42333c7f1e58e8205c94a40ce4aa1074eb91e89f .submodules/Makefiles
# 100644 blob 9c13714c359aca59c48746beec4ddc5b64131f6c Makefile
# ❯ git ls-tree --full-tree -r HEAD~1 | grep Makefile
# 100644 blob 57ceeb98e652caadf0c909fcfd87210434fb67ff .Makefile.local
# 160000 commit 42333c7f1e58e8205c94a40ce4aa1074eb91e89f .submodules/Makefiles
# 100644 blob 9c13714c359aca59c48746beec4ddc5b64131f6c Makefile
#
# check for pattern '<pattern>' across all commits, should return nothing - also works with: tig -S<pattern>
$ tig -S<pattern> # or git log -S<pattern>
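The pattern check can also be scripted. The sketch below wraps git log -S in a small helper and runs it on a throwaway repository where a secret was added in one commit and overwritten in the next. The helper name history_contains, the demo repository and the hunter2 string are all hypothetical. It illustrates why deleting something in the latest commit is not enough: the string is still found in the history until that history is rewritten.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds if some commit reachable from any ref
# added or removed an occurrence of <pattern> (git's "pickaxe" search).
history_contains() {
    local repo="$1" pattern="$2"
    [ -n "$(git -C "$repo" log --all --oneline -S"$pattern")" ]
}

# Throwaway repository standing in for the real one
demo=$(mktemp -d "${TMPDIR:-/tmp}/bfg-check.XXXXXX")
git init -q "$demo"
g() { git -C "$demo" -c user.name=demo -c user.email=demo@example.org -c commit.gpgsign=false "$@"; }

echo 'password = hunter2' > "$demo/index.md"
g add index.md
g commit -q -m 'add docs'

# "Fix" the file in a new commit only: the old content stays in the history
echo 'password = <redacted>' > "$demo/index.md"
g commit -q -a -m 'redact password'

if history_contains "$demo" 'hunter2'; then leaked=yes; else leaked=no; fi
echo "hunter2 still in history: $leaked"
```

On the cleaned mirror the same helper should fail for every entry of the pattern file, which is what the empty tig -S<pattern> view corresponds to.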
Final cleanup before push
Carefully recheck the commits
tig -S{<pattern1>,<pattern2>,...}
Then, as per the BFG documentation:
The BFG will update your commits and all branches and tags so they are clean, but it doesn’t physically delete the unwanted stuff. Examine the repo to make sure your history has been updated, and then use the standard git gc command to strip out the unwanted dirty data, which Git will now recognise as surplus to requirements.
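The cleanup step quoted above can be tried safely on a scratch repository first. In the sketch below an amended commit stands in for the BFG rewrite: the old blob is then only reachable through the reflog, and the two commands printed at the end of each BFG run physically delete it. The helper names (purge_unreachable, object_state) and the demo content are hypothetical, and the final push is deliberately left commented out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: the two commands BFG prints at the end of a run
purge_unreachable() {
    local repo="$1"
    git -C "$repo" reflog expire --expire=now --all
    git -C "$repo" gc --quiet --prune=now --aggressive
}

# Hypothetical helper: reports whether an object id is still stored in the repo
object_state() {
    if git -C "$1" cat-file -e "$2" 2>/dev/null; then echo present; else echo gone; fi
}

demo=$(mktemp -d "${TMPDIR:-/tmp}/gc-demo.XXXXXX")
git init -q "$demo"
g() { git -C "$demo" -c user.name=demo -c user.email=demo@example.org -c commit.gpgsign=false "$@"; }

echo 'password = hunter2' > "$demo/index.md"
g add index.md
g commit -q -m 'add docs'
dirty_blob=$(g rev-parse HEAD:index.md)

# Rewrite the only commit: the old blob is now referenced by the reflog alone
echo 'password = ***REMOVED***' > "$demo/index.md"
g commit -q -a --amend -m 'add docs'

before=$(object_state "$demo" "$dirty_blob")
purge_unreachable "$demo"
after=$(object_state "$demo" "$dirty_blob")
echo "dirty blob before gc: $before, after gc: $after"

# Once all checks pass, publish from the cleaned mirror (assumed final step):
# git -C ulhpc-docs.bare push
```

Running the garbage collection before the push keeps the local mirror consistent with what the checks inspected, so the unwanted objects no longer sit in the local object store.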