Chapter 13: Exploring and Rewriting History

Chapter Objectives

By the end of this chapter, you will be able to:

  • Use git reflog as a safety net to recover lost commits or branches.
  • Employ advanced git log techniques for filtering and custom formatting of commit history.
  • Search project history for specific changes using git log -S (pickaxe) and git log -G.
  • Use git blame to identify who last modified specific lines in a file and in which commit.
  • Summarize project contributions using git shortlog.
  • Understand the purpose and severe implications of advanced history rewriting tools like git filter-repo.
  • Appreciate the risks involved in rewriting shared history and when such actions might (cautiously) be considered.

Introduction

So far in your Git journey, you’ve learned to create commits, branch, merge, and even perform some history modifications like amending commits and rebasing branches. These skills are crucial for day-to-day development. However, Git’s power extends further, offering sophisticated tools to explore your project’s evolution in detail and, when absolutely necessary, to rewrite that history on a larger scale.

In this chapter, we’ll delve into techniques for deeply inspecting your repository’s past. We’ll start with git reflog, an essential safety net that records updates to the tips of branches and other references in your local repository, allowing you to recover from seemingly disastrous mistakes. We’ll then explore advanced git log options for pinpointing specific commits through powerful filtering and custom formatting. You’ll learn how to search for when a particular piece of code was introduced or changed, and how to trace the authorship of every line in your files using git blame. We’ll also look at git shortlog for summarizing contributions.

Finally, we will cautiously approach the topic of advanced history rewriting. While tools like git rebase -i (covered in Chapter 12) allow for localized history editing, sometimes more drastic changes are needed, such as removing a sensitive file from all commits or correcting author information throughout the project’s history. We’ll briefly discuss the older git filter-branch and its modern, recommended replacement, git filter-repo. These are powerful but dangerous tools, and their use, especially on shared history, comes with significant caveats that we will emphasize strongly.

Understanding these advanced capabilities will not only make you a more proficient Git user but also provide you with the tools to maintain a clean, understandable, and secure project history.

Theory

Your Safety Net: git reflog

Mistakes happen. You might accidentally delete a branch, perform a git reset --hard that discards commits you later realize you needed, or a rebase might go awry. Before you panic, remember git reflog.

The reflog (reference log) is a mechanism in Git that records when the tips of branches and other references (like HEAD) were updated in your local repository. Think of it as Git’s journal, noting every significant move you make. Each entry in the reflog has an index, like HEAD@{0}, HEAD@{1}, etc., where HEAD@{0} is the most recent state of HEAD.

How git reflog Works:

When you switch branches, commit, reset, or amend, Git updates the reflog for HEAD and the reflog for the affected branch(es). This log is stored locally in your .git directory (specifically in .git/logs/) and is not part of the pushed repository; it won’t be shared with collaborators when you push.

Why it’s a Safety Net:

Because the reflog tracks these movements, even if a commit is no longer reachable by any branch or tag (it’s “orphaned”), it might still be in the reflog. This allows you to find its SHA-1 hash and potentially recover it by creating a new branch from it, checking it out, or resetting to it.

Reflog Entries:

A typical reflog entry shows:

  • The reflog pointer (e.g., HEAD@{index}).
  • The SHA-1 hash of the commit the pointer referred to.
  • The action that caused the update (e.g., commit, rebase, reset, checkout).
  • A short description of the action or the commit message.
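The fields listed above are easiest to see in a throwaway repository. The following is a minimal sketch, assuming a POSIX shell with Git installed; the temporary directory, the identity, the file name, and the commit messages are all placeholders invented for this demonstration.

```shell
# Build a scratch repository (every name here is illustrative).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

for msg in first second third; do
  echo "$msg" >> "$repo/notes.txt"
  git -C "$repo" add notes.txt
  git -C "$repo" commit -q -m "$msg"
done

# Simulate the accident: drop the last two commits from the branch.
git -C "$repo" reset -q --hard HEAD~2

# Each line shows: abbreviated hash, selector, action, description.
git -C "$repo" reflog

# HEAD@{1} is where HEAD pointed just before the reset, so a branch
# created there makes the "lost" commits reachable again.
git -C "$repo" branch rescued 'HEAD@{1}'
git -C "$repo" log --oneline rescued

# The raw log lives under .git/logs/ and is never pushed.
ls "$repo/.git/logs/HEAD"
```

Creating a new branch at the reflog entry, rather than resetting to it, is the gentler choice here: nothing else moves, and you can inspect the rescued commits before deciding what to do with them.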

Expiration:

Reflog entries do expire. By default, reachable entries are kept for 90 days, and unreachable entries for 30 days. This is configurable but rarely needs changing for typical use.
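The two retention periods are controlled by the gc.reflogExpire and gc.reflogExpireUnreachable configuration keys. As a small sketch (the scratch repository and the "60 days" value are only examples, not a recommendation), you can inspect and override them per repository:

```shell
repo=$(mktemp -d)
git -C "$repo" init -q

# An unset key means Git falls back to its built-in default.
git -C "$repo" config --get gc.reflogExpire \
  || echo "gc.reflogExpire is not set; the default applies"

# Keep unreachable entries longer, in this repository only.
git -C "$repo" config gc.reflogExpireUnreachable "60 days"
git -C "$repo" config --get gc.reflogExpireUnreachable
```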

graph RL
    subgraph "Scenario: Accidental git reset --hard"
        direction BT
        subgraph "1- Initial State (main branch)"
            C1["C1: Initial Commit"] --> C2["C2: Add feature A"]
            C2 --> C3["C3: Add feature B<br>(main, HEAD)"]
            class C1,C2,C3 commit;
            class C3 currentHEAD;
        end
        subgraph "2- Accidental git reset --hard HEAD~2"
            ResetOp["git reset --hard HEAD~2"] --> C1_Reset["C1: Initial Commit<br>(main, HEAD)"]
            C2_Lost["C2: Add feature A<br>(Orphaned)"]
            C3_Lost["C3: Add feature B<br>(Orphaned)"]
            C1_Reset -.-> C2_Lost;
            C2_Lost -.-> C3_Lost;
            class ResetOp operation;
            class C1_Reset currentHEAD;
            class C2_Lost,C3_Lost orphanedCommit;
        end
        subgraph "3- Using git reflog to Recover"
            Reflog["git reflog shows:"]
            Entry0["HEAD@{0}: reset: moving to HEAD~2 (to C1 hash)"]
            Entry1["HEAD@{1}: commit: C3: Add feature B (C3 hash)"]
            Entry2["HEAD@{2}: commit: C2: Add feature A (C2 hash)"]
            Entry3["HEAD@{3}: commit: C1: Initial Commit (C1 hash)"]
            Reflog --> Entry0 --> Entry1 --> Entry2 --> Entry3;
            RecoverOp["git reset --hard HEAD@{1}<br>(or git reset --hard C3_hash)"]
            Entry1 -- "Use this entry" --> RecoverOp;
            class Reflog operation;
            class Entry0,Entry1,Entry2,Entry3 reflogEntry;
            class RecoverOp operation;
        end
        subgraph "4- Recovered State (main branch)"
            C1_Rec["C1: Initial Commit"] --> C2_Rec["C2: Add feature A"]
            C2_Rec --> C3_Rec["C3: Add feature B<br>(main, HEAD)"]
            class C1_Rec,C2_Rec,C3_Rec commit;
            class C3_Rec currentHEAD;
        end
    end
    
    classDef commit fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF;
    classDef currentHEAD fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46,font-weight:bold;
    classDef orphanedCommit fill:#FEE2E2,stroke:#DC2626,stroke-width:1.5px,color:#991B1B,font-style:italic;
    classDef operation fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E;
    classDef reflogEntry fill:#FFFBEB,stroke:#FBBF24,stroke-width:1px,color:#78350F,font-size:11px;

Advanced git log Techniques

Chapter 4 introduced git log for viewing commit history. Now, let’s explore its more advanced capabilities for filtering and formatting.

Filtering Commits:

git log offers numerous options to narrow down the commits displayed:

  • By Author/Committer:
    • git log --author="John Doe": Shows commits where the author field matches “John Doe”.
    • git log --committer="Jane Doe": Shows commits where the committer field matches “Jane Doe”. (Author is who wrote the patch; committer is who applied it. They are often the same.)
  • By Date/Time:
    • git log --since="2 weeks ago" or git log --since="2023-01-01"
    • git log --until="1 day ago" or git log --until="2023-12-31"
    • git log --after="2023-01-01" --before="2023-01-31"
  • By Message Content:
    • git log --grep="Fixes bug #123": Shows commits whose messages contain the specified string (case-sensitive by default; use -i for case-insensitive).
  • By File/Path:
    • git log -- <path/to/file_or_directory>: Shows commits that affected the specified file or directory. The -- is important to separate paths from branch names or other options.
  • By Commit Range:
    • git log main..feature: Shows commits on the feature branch that are not on main.
    • git log <commit_hash1>..<commit_hash2>: Shows commits reachable from <commit_hash2> but not from <commit_hash1>.
    • git log <commit_hash>^..<commit_hash>: Shows commits reachable from <commit_hash> but not from its first parent. For an ordinary (non-merge) commit, that is just <commit_hash> itself.
  • By Number of Commits:
    • git log -n 5 or git log -5: Shows the last 5 commits.
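The range and count filters above can be tried safely in a scratch repository. This is a minimal sketch under assumed names: the topic branch, the log.txt file, and the commit helper function are invented for the demonstration, and the initial branch name is read from Git rather than assumed.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

commit() {  # hypothetical helper: append a line and commit it
  echo "$1" >> "$repo/log.txt"
  git -C "$repo" add log.txt
  git -C "$repo" commit -q -m "$1"
}

commit "base 1"
commit "base 2"
base=$(git -C "$repo" symbolic-ref --short HEAD)  # main or master

git -C "$repo" checkout -q -b topic
commit "topic 1"
commit "topic 2"

# Commits on topic that are not on the base branch (two entries).
git -C "$repo" log --oneline "${base}..topic"

# Exactly one commit: everything reachable from HEAD that is not
# reachable from its first parent.
git -C "$repo" log --oneline HEAD^..HEAD

# Filters combine: the newest commit whose message mentions "base".
git -C "$repo" log -n 1 --grep="base" --format=%s
```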

Formatting Log Output:

The default git log format can be verbose. You can customize it extensively:

  • Predefined Formats:
    • --oneline: Shows each commit as a single line (SHA-1 abbreviation and commit title).
    • --pretty=short, --pretty=medium (default), --pretty=full, --pretty=fuller: Progressively more detailed built-in layouts.
  • Graph and Decoration:
    • --graph: Displays an ASCII art graph of the branch and merge history.
    • --decorate: Shows branch and tag names pointing to commits. Often used with --oneline --graph.
  • Custom Formatting with --pretty=format:<string>: This is extremely powerful. The <string> can contain placeholders that Git replaces with information from the commit object. Common placeholders include:

| Placeholder | Description |
| --- | --- |
| %H | Full commit hash (SHA-1) |
| %h | Abbreviated commit hash |
| %T | Full tree hash |
| %t | Abbreviated tree hash |
| %P | Full parent hashes (space separated) |
| %p | Abbreviated parent hashes (space separated) |
| %an | Author name |
| %ae | Author email |
| %ad | Author date (format respects --date= option) |
| %ar | Author date, relative (e.g., "2 weeks ago") |
| %cn | Committer name |
| %ce | Committer email |
| %cd | Committer date (format respects --date= option) |
| %cr | Committer date, relative |
| %s | Subject (commit message title, i.e., the first line) |
| %b | Body (rest of the commit message) |
| %N | Commit notes |
| %d | Ref names (branches, tags) pointing to this commit, like --decorate, e.g., " (HEAD -> main)" |
| %D | Ref names without the surrounding " (" and ")" wrapping, e.g., "HEAD -> main" |
| %gD | Reflog selector, e.g., refs/stash@{1} |
| %gs | Reflog subject |
| %C(<color_name>) | Switch to the specified color or attribute (e.g., red, green, blue, yellow, magenta, cyan, bold, ul, dim, reset) |
| %Creset | Reset color to default |
| %n | Newline character |
| %% | A raw '%' character |

  • Example: git log --pretty="format:%C(yellow)%h %C(reset)%ad %C(cyan)%s %C(bold red)%d %C(reset)[%an]" This would output: abbreviated hash (yellow), author date, subject (cyan), ref names (bold red), and author name.
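To see placeholders expand without touching a real project, here is a short sketch in a scratch repository. The file name and commit text are made up for the demonstration; Jane Doe is the same sample author used in the filter examples.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

echo "hello" > "$repo/readme.txt"
git -C "$repo" add readme.txt
git -C "$repo" commit -q --author="Jane Doe <jane@example.com>" \
  -m "Add readme" -m "Longer explanation in the body."

# Single quotes keep the shell from interpreting anything in the string.
git -C "$repo" log -n 1 --pretty=format:'%an <%ae> | %s | %b'

# tformat: ends every entry with a newline, which is handy in scripts.
git -C "$repo" log -n 1 --date=short --pretty=tformat:'%h %ad %s'
```

A format you use often can be stored as an alias with git config alias.<name>, so the long string only has to be quoted correctly once.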
| Filter Category | Option / Example | Description |
| --- | --- | --- |
| By Author/Committer | --author="John Doe" | Shows commits where the author field matches "John Doe". |
| | --committer="Jane Doe" | Shows commits where the committer field matches "Jane Doe". |
| By Date/Time | --since="2 weeks ago" | Shows commits made in the last two weeks. Also accepts specific dates like --since="2023-01-01". |
| | --until="1 day ago" | Shows commits made up until one day ago. Also accepts specific dates like --until="2023-12-31". |
| | --after="YYYY-MM-DD" --before="YYYY-MM-DD" | Shows commits within a specific date range. |
| By Message Content | --grep="Fixes bug #123" | Shows commits whose messages contain the specified string. Use -i for case-insensitivity. |
| By File/Path | -- <path/to/file_or_dir> | Shows commits that affected the specified file or directory. The -- separates paths from other options/revisions. |
| By Commit Range | main..feature | Shows commits on the feature branch that are not on the main branch. |
| | <hash1>..<hash2> | Shows commits reachable from <hash2> but not from <hash1>. |
| | <branch>~N..<branch> | Shows the last N commits on <branch>. E.g., main~3..main for the last 3. |
| By Number of Commits | -n <number> or -<number> | Shows the last <number> commits (e.g., -5 for the last 5). |
| By Code Changes (Pickaxe & Regex) | -S"string" | Shows commits that changed the number of occurrences of "string" (i.e., added or removed the string). |
| | -G"regex" | Shows commits where the added/removed lines in the patch text match the given POSIX regular expression. |
| By Merge/No-Merge | --merges / --no-merges | Shows only merge commits, or excludes merge commits, respectively. |

Searching History for Code Changes

Sometimes you need to find when a specific piece of code was introduced or modified, or which commits affected lines matching a certain pattern.

  • git log -S"string" (Pickaxe Search): The -S option (often called the "pickaxe" because it helps you pick out commits) looks for commits that changed the number of occurrences of the specified string. This means it finds commits that either introduced or removed that string. It's different from --grep, which searches commit messages; -S searches the actual changes. For example, git log -S"myFunctionName" would show commits where myFunctionName was added or deleted.
  • git log -G"regex" (Regex Search in Diffs): The -G option searches for differences whose patch text contains added/removed lines that match the given POSIX regular expression. This is more general than -S, as it doesn't count occurrences but looks for the pattern in the diff lines themselves. For example, git log -G"user_id[[:space:]]*=[[:space:]]*[0-9]+" would find commits where lines matching user_id = <number> were added or removed. (Shorthands such as \s and \d are not part of POSIX syntax and may not be recognized, so prefer bracket expressions.)
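The difference between the two searches shows up when a commit edits a line that contains the string without changing how often the string occurs. The sketch below builds such a history in a scratch repository; app.cfg and the commit messages are invented for the demonstration.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

echo "user_id = 1" > "$repo/app.cfg"
git -C "$repo" add app.cfg
git -C "$repo" commit -q -m "Introduce user_id"

echo "user_id = 2" > "$repo/app.cfg"
git -C "$repo" commit -q -a -m "Change user_id value"

# -S: only the first commit changes how many times "user_id" occurs.
git -C "$repo" log --oneline -S"user_id"

# -G: both commits add or remove a line that matches the pattern.
git -C "$repo" log --oneline -G"user_id[[:space:]]*=[[:space:]]*[0-9]+"
```

In practice, -S is the quicker way to find where something first appeared or finally disappeared, while -G is the one to reach for when you care about every edit to matching lines.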

Who Changed What and When: git blame

git blame is a powerful tool for line-by-line annotation of a file. For each line, it shows:

  • The SHA-1 of the commit that last modified the line.
  • The author of that commit.
  • The timestamp of that commit.
  • The line number.
  • The content of the line.

Usage:

Bash
git blame <filename>

This is invaluable for understanding the history of a specific piece of code, finding out who wrote it, when, and in what context (by looking up the commit).

Common git blame Options:

  • -L <start>,<end> or -L :<funcname>: Restrict blame output to the specified line range or function.
  • -e: Show author email instead of name.
  • -w: Ignore whitespace changes when blaming.
  • -M: Detect lines that were moved or copied within the same file, and attribute them to the commit that originally wrote them rather than the commit that moved them.
  • -C: In addition to -M, detect lines moved or copied from other files that were modified in the same commit. Give -C twice to also search the commit that created the file, and three times to search all commits. (Each level is more computationally expensive.)
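The effect of -w is easiest to see with a commit that only re-indents a line. This is a small sketch in a scratch repository; Alice, Bob, and calc.py are invented names, and --line-porcelain is used only to print the author field on its own.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

echo "total = price * qty" > "$repo/calc.py"
git -C "$repo" add calc.py
git -C "$repo" commit -q --author="Alice <alice@example.com>" -m "Add calculation"

# A later commit changes nothing but the indentation.
echo "    total = price * qty" > "$repo/calc.py"
git -C "$repo" commit -q -a --author="Bob <bob@example.com>" -m "Reindent"

# Plain blame credits the re-indenting commit.
git -C "$repo" blame -L 1,1 --line-porcelain calc.py | sed -n 's/^author //p'

# With -w, the whitespace-only edit is skipped.
git -C "$repo" blame -w -L 1,1 --line-porcelain calc.py | sed -n 's/^author //p'
```

The first command prints Bob and the second prints Alice, which is usually the answer you want when asking who wrote the logic.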
| Command | Option | Description |
| --- | --- | --- |
| git blame <file> | -L <start>,<end> | Restrict blame output to the specified line range (e.g., -L 10,20). |
| | -L :<funcname> | Restrict blame output to lines within the function matching <funcname> (if supported by language). |
| | -e or --show-email | Show author email addresses instead of names. |
| | -w | Ignore whitespace changes when determining the commit that last modified a line. |
| | -M / -C | -M tracks lines moved or copied within a file; -C also tracks lines moved or copied from other files modified in the same commit (repeat -C to widen the search). Can be computationally expensive. |
| git shortlog | -s or --summary | Suppress commit descriptions, showing only a count of commits per author. |
| | -n or --numbered | Sort output according to the number of commits per author in descending order (instead of alphabetically by author name). |
| | -e or --email | Show author email addresses in the output. |
| | (log options) | Can be combined with most git log filtering options (e.g., main..feature, --since). |

Summarizing Contributions: git shortlog

git shortlog summarizes the output of git log in a way that’s often useful for release notes or getting an overview of who contributed what. It groups commits by author and displays the first line of each commit message.

Common git shortlog Options:

  • -s or --summary: Suppress commit descriptions, showing only a count of commits per author.
  • -n or --numbered: Sort output according to the number of commits per author in descending order (instead of alphabetically).
  • -e or --email: Show author email addresses.
  • Can be combined with regular git log filtering options (e.g., git shortlog main..feature -sn).
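As a quick sketch of these options together (the authors, file name, and messages are placeholders), the following builds a four-commit history and summarizes it. One detail matters in scripts: when no revision is given and standard input is not a terminal, git shortlog reads log text from standard input instead of walking history, so the example names HEAD explicitly.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

for n in 1 2 3; do
  echo "change $n" >> "$repo/work.txt"
  git -C "$repo" add work.txt
  git -C "$repo" commit -q --author="Jane Doe <jane@example.com>" -m "Jane change $n"
done
echo "fix" >> "$repo/work.txt"
git -C "$repo" commit -q -a --author="John Doe <john@example.com>" -m "John fix"

# Count per author, highest first, with email addresses.
git -C "$repo" shortlog -sne HEAD
```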

Advanced History Rewriting (Use With Extreme Caution!)

Sometimes, you might need to make sweeping changes to your project’s history. This is not something to be done lightly, as rewriting published history is dangerous and can cause significant problems for collaborators. Always back up your repository before attempting these operations.

Common Reasons for Large-Scale History Rewriting:

  • Removing sensitive data (passwords, private keys) accidentally committed.
  • Changing author email addresses or names across all commits.
  • Removing a large file or directory that was mistakenly added to the repository’s entire history.
  • Splitting a subdirectory into its own separate Git repository.
  • Standardizing commit messages.

1. git filter-branch (The Old Way – Largely Superseded)

git filter-branch was Git’s original tool for complex history rewriting. It’s incredibly powerful but also notoriously slow, cumbersome to use correctly, and can be error-prone. While you might encounter it in older documentation, git filter-repo is now the recommended tool for most such tasks.

filter-branch works by checking out each commit and running a specified filter (e.g., a shell command) on it. Common filters include:

  • --tree-filter: Modifies files in the working directory. Slowest.
  • --index-filter: Modifies files in the staging area (index). Faster.
  • --commit-filter: Modifies commit metadata (e.g., author, message).
  • --env-filter: Modifies environment variables affecting commit info (author/committer name, email, date).
  • --msg-filter: Modifies commit messages.
  • --subdirectory-filter: Turns a subdirectory into the root of the repository.

Warning: git filter-branch rewrites commit SHAs. If you’ve pushed the history you’re filtering, you’ll need to force push, which is highly disruptive. It also leaves your original refs backed up in refs/original/, which you’d typically clean up after verifying the rewrite. It is strongly recommended to use git filter-repo instead.

2. git filter-repo (The Modern, Recommended Tool)

git filter-repo is a third-party tool (not part of core Git, needs installation) designed as a faster, safer, and more user-friendly replacement for git filter-branch. It provides a cleaner interface for common history rewriting tasks.

Key Advantages of git filter-repo:

  • Speed: Significantly faster than filter-branch.
  • Safety: Includes more safeguards and generally has saner defaults. It typically requires you to work on a fresh clone.
  • Simplicity: Command-line options are often more intuitive for common tasks.

Installation:

git filter-repo is typically installed as a Python script, often via pip:

pip install git-filter-repo

Common git filter-repo Use Cases:

  • Removing files or paths:
    • git filter-repo --path secret.txt --invert-paths (removes secret.txt)
    • git filter-repo --path-glob '*.tmp' --invert-paths (removes all .tmp files)
  • Changing author/committer info: You can use a mailmap file or callbacks. For example, to change an old email: git filter-repo --mailmap .mailmap, where .mailmap contains lines like: New Name <new@example.com> Old Name <old@example.com>
  • Stripping blobs larger than a certain size: git filter-repo --strip-blobs-bigger-than 10M
  • Extracting a subdirectory to become the new root: git filter-repo --subdirectory-filter old/sub/dir (equivalent to --path old/sub/dir/ --path-rename old/sub/dir/:)
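A text replacement across all of history can be rehearsed on a scratch repository before you go anywhere near real data. The sketch below is illustrative only: config.txt, the REDACTED marker, and the rules file are invented, and the rewrite step is skipped when git filter-repo is not installed. It assumes the literal==>replacement rule syntax used by --replace-text.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

printf 'token=SensitiveData_XYZ\n' > "$repo/config.txt"
git -C "$repo" add config.txt
git -C "$repo" commit -q -m "Add config (oops)"

# One rule per line: literal text, then ==>, then the replacement.
rules=$(mktemp)
printf 'SensitiveData_XYZ==>REDACTED\n' > "$rules"

if git filter-repo --version >/dev/null 2>&1; then
  # --force is needed because this scratch repository is not a fresh clone.
  (cd "$repo" && git filter-repo --replace-text "$rules" --force)
  git -C "$repo" log -p --all | grep -c "SensitiveData_XYZ" || true
else
  echo "git filter-repo is not installed; skipping the rewrite"
fi
```

Keeping the rules file outside the repository means the sensitive string is not reintroduced into the working tree while you clean it out of history.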

Critical Warnings for git filter-repo (and any history rewriting):

  1. Backup Your Repository: Always perform these operations on a fresh clone or ensure you have a reliable backup.
  2. Rewrites History: Like filter-branch, it changes commit SHAs from the point of the first modified commit onwards.
  3. Shared History: DO NOT use on branches that have been pushed and are being used by collaborators unless you have coordinated with your entire team. Everyone will need to re-clone or perform complex recovery steps on their local repositories. The “Golden Rule of Rebasing” applies even more strongly here.
  4. Local Cleanup: After using filter-repo, Git might have old objects. Running git gc --prune=now --aggressive can help clean these up, but filter-repo often handles much of this.
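The cleanup step is also the point of no return: once reflog entries are expired and unreachable objects pruned, the reflog safety net no longer applies. The following sketch shows this in a scratch repository (the branch and file names are invented); do not run the last two Git commands in a repository whose old commits you might still want.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

echo "keep" > "$repo/a.txt"
git -C "$repo" add a.txt
git -C "$repo" commit -q -m "Keep me"
base=$(git -C "$repo" symbolic-ref --short HEAD)  # main or master

# Make a commit on a scratch branch, then delete the branch.
git -C "$repo" checkout -q -b scratch
echo "old secret" > "$repo/b.txt"
git -C "$repo" add b.txt
git -C "$repo" commit -q -m "Soon unreachable"
lost=$(git -C "$repo" rev-parse HEAD)
git -C "$repo" checkout -q "$base"
git -C "$repo" branch -q -D scratch

# Still recoverable: the reflog keeps the object alive.
git -C "$repo" cat-file -e "$lost" && echo "before cleanup: still present"

# Expire every reflog entry, then prune unreachable objects.
git -C "$repo" reflog expire --expire=now --all
git -C "$repo" gc -q --prune=now

git -C "$repo" cat-file -e "$lost" 2>/dev/null || echo "after cleanup: gone for good"
```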

filter-repo is a powerful tool for repository surgery. Treat it with respect and understand its consequences before use.

| Feature / Aspect | git filter-branch (Old) | git filter-repo (Modern) |
| --- | --- | --- |
| Status | Largely superseded, complex, error-prone. Part of core Git. | Recommended replacement. Faster, safer, more user-friendly. Third-party tool (requires installation). |
| Performance | Notoriously slow, especially on large repositories. | Significantly faster. |
| Ease of Use | Cumbersome syntax, requires careful handling of shell quoting and filters. Easy to make mistakes. | More intuitive command-line options for common tasks. Clearer error messages. |
| Safety | Can be dangerous; easy to corrupt repository if not used correctly. Backs up original refs to refs/original/. | More safeguards. Often requires working on a fresh clone. Better defaults. Handles cleanup more gracefully. |
| Common Tasks | Removing files, changing author info, splitting subdirectories via various filters (--tree-filter, --index-filter, --env-filter, etc.). | Similar tasks, but with dedicated options like --path, --path-rename, --strip-blobs-bigger-than, --mailmap, --replace-text. |
| History Rewriting | Yes, rewrites commit SHAs. | Yes, rewrites commit SHAs. |
| Impact on Shared History | BOTH TOOLS ARE EXTREMELY DISRUPTIVE TO SHARED HISTORY. The "Golden Rule of Rebasing" applies even more strongly. Requires full team coordination if used on pushed branches. All collaborators will need to re-clone or reset their local copies. | Same as for git filter-branch. |
| Recommendation | Avoid if possible. Use git filter-repo instead for new history rewriting tasks. | The preferred tool for most large-scale history rewriting needs. |

Practical Examples

Setup:

Let’s create a repository with a bit of history for our examples.

Bash
# Create and navigate to a new directory
mkdir git-history-lab
cd git-history-lab

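# Initialize a Git repository with main as the initial branch
# (the -b option needs Git 2.28 or later; the examples below assume "main")
git init -b main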

# Configure user (if not set globally for these examples)
# git config user.name "Your Name"
# git config user.email "youremail@example.com"

# Commit 1
echo "Initial content for file1.txt" > file1.txt
git add file1.txt
git commit -m "C1: Add file1.txt"

# Commit 2 (as a different author for variety)
echo "Feature A in file2.txt" > file2.txt
git add file2.txt
git commit --author="Jane Doe <jane@example.com>" -m "C2: Add file2.txt by Jane"

# Commit 3
echo "Update file1.txt with more data" >> file1.txt
git add file1.txt
git commit -m "C3: Update file1.txt"
# Commit 4: add a specific string we can search for later
echo "SensitiveData_XYZ" >> file1.txt
git add file1.txt
git commit -m "C4: Add sensitive data to file1 (oops)"

# Commit 5
echo "Another update to file2.txt" >> file2.txt
git add file2.txt
git commit --author="Jane Doe <jane@example.com>" -m "C5: Jane updates file2.txt again"

# Commit 6
echo "Final touch on file1.txt" >> file1.txt
git add file1.txt
git commit -m "C6: Finalize file1"

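# Create a branch starting at C3 (three commits before the tip of main)
# and make a commit there. "C3" is only a label in the commit message,
# not a ref, so the commit is addressed as main~3.
git checkout -b feature/new-stuff main~3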
echo "Content for feature branch" > feature_file.txt
git add feature_file.txt
git commit -m "F1: Add feature_file.txt on feature branch"
git checkout main # Back to main branch

1. Using git reflog as a Safety Net

Let’s simulate losing a commit. Suppose we accidentally reset main too far back.

Currently, main is at C6.

Bash
# Check current HEAD
git log -n 1 --oneline main
# Expected: <hash_C6> C6: Finalize file1

# Oops! Accidentally reset main back by 3 commits
git reset --hard HEAD~3 # This will discard C6, C5, C4 from main
git log -n 1 --oneline main
# Expected: <hash_C3> C3: Update file1.txt
# Oh no! C4, C5, and C6 are "gone" from main!

Now, use git reflog to find the lost commits:

Bash
git reflog

Expected Output (hashes will vary; look at the most recent entries):

Bash
<hash_C3> HEAD@{0}: reset: moving to HEAD~3
<hash_C6> HEAD@{1}: checkout: moving from feature/new-stuff to main
<hash_F1> HEAD@{2}: commit: F1: Add feature_file.txt on feature branch
<hash_C3> HEAD@{3}: checkout: moving from main to feature/new-stuff
<hash_C6> HEAD@{4}: commit: C6: Finalize file1
<hash_C5> HEAD@{5}: commit: C5: Jane updates file2.txt again
<hash_C4> HEAD@{6}: commit: C4: Add sensitive data to file1 (oops)
... (older entries)

HEAD@{1} points at commit C6 (<hash_C6>): it records where HEAD was just before the reset. The entry that created C6 appears further down as well. We want to restore main to this state.

Bash
# Restore main to the state of C6
git reset --hard <hash_C6> # Replace <hash_C6> with the actual hash from your reflog output
# Or use a reflog selector instead of the hash. Both HEAD@{1} and main@{1}
# name the state just before the reset:
# git reset --hard main@{1}

git log -n 1 --oneline main
# Expected: <hash_C6> C6: Finalize file1

The commits C4, C5, and C6 are now back on the main branch.

Recovering a deleted branch:

Bash
# Let's say feature/new-stuff was important
git branch -D feature/new-stuff # Delete the branch
# Oh no, it's gone!

git reflog
# Look for the last commit on feature/new-stuff, e.g.,
# <hash_F1> HEAD@{...}: commit: F1: Add feature_file.txt on feature branch
# Or an entry like: checkout: moving from feature/new-stuff to main

# Recover the branch
git checkout -b feature/new-stuff-recovered <hash_F1> # Replace <hash_F1> with actual hash

2. Advanced git log Examples

Bash
# Show commits by Jane Doe
git log --author="Jane Doe" --oneline
# Expected output:
# <hash_C5> C5: Jane updates file2.txt again
# <hash_C2> C2: Add file2.txt by Jane

# Show commits on main since C3 (main~3), affecting file1.txt
git log main~3..main --oneline -- file1.txt
# Expected output:
# <hash_C6> C6: Finalize file1
# <hash_C4> C4: Add sensitive data to file1 (oops)

# Custom pretty format
git log -n 3 --pretty="format:%h %ar: %s [%an]" --graph
# Expected output (structure):
# * <hash_C6> X days ago: C6: Finalize file1 [Your Name]
# * <hash_C5> X days ago: C5: Jane updates file2.txt again [Jane Doe]
# * <hash_C4> X days ago: C4: Add sensitive data to file1 (oops) [Your Name]

# Find commits that introduced or removed "SensitiveData_XYZ"
git log -S"SensitiveData_XYZ" --oneline -p
# Expected: Should show commit C4 and its diff highlighting the addition.

# Find commits where diffs contain "file2" (case insensitive)
git log -G"file2" -i --oneline
# Expected: C2 and C5

3. Using git blame

Bash
# Blame file1.txt
git blame file1.txt

Expected Output (structure, hashes and dates will vary):

Bash
^<hash_C1> (Your Name  2024-05-15 10:00:00 +0300 1) Initial content for file1.txt
<hash_C3> (Your Name  2024-05-15 10:02:00 +0300 2) Update file1.txt with more data
<hash_C4> (Your Name  2024-05-15 10:03:00 +0300 3) SensitiveData_XYZ
<hash_C6> (Your Name  2024-05-15 10:05:00 +0300 4) Final touch on file1.txt

This shows who last changed each line and in which commit.

Bash
# Blame only lines 2-3 of file1.txt
git blame -L 2,3 file1.txt
# Expected output:
# <hash_C3> (Your Name  2024-05-15 10:02:00 +0300 2) Update file1.txt with more data
# <hash_C4> (Your Name  2024-05-15 10:03:00 +0300 3) SensitiveData_XYZ

4. Using git shortlog

Bash
# Summarize commit counts by author, sorted by count
git shortlog -sn

Expected Output:

Bash
     4  Your Name
     2  Jane Doe

Bash
# Show commit messages grouped by author (Jane Doe only)
git shortlog --author="Jane Doe"

Expected Output:

Bash
Jane Doe (2):
      C2: Add file2.txt by Jane
      C5: Jane updates file2.txt again

5. Advanced History Rewriting with git filter-repo (Illustrative)

CRITICAL WARNING: The following commands rewrite history. ALWAYS run them on a fresh clone of your repository that you can afford to mess up. Never run them on your primary working copy without a backup, and especially not on a repository that has been shared if you haven’t coordinated with all collaborators.

Scenario: Remove the accidentally committed SensitiveData_XYZ string from file1.txt throughout the entire history. (This is a simplified example; real sensitive data removal might require more complex patterns or blob filtering).

First, you’d need to install git-filter-repo if you haven’t already:

pip install git-filter-repo (or other method depending on your OS/Python setup).

On a fresh clone of git-history-lab:

Bash
# cd ../
# git clone --no-local git-history-lab git-history-lab-filtered
# cd git-history-lab-filtered
# (--no-local makes a real copy instead of hard-linking the objects)

# Example: Remove a file named 'secret.txt' if it existed
# git filter-repo --path secret.txt --invert-paths --force

# Example: Replace the string "SensitiveData_XYZ" wherever it appears, in all commits.
# filter-repo handles content changes with --replace-text or, for custom logic, --blob-callback.

# Create a replacements file, e.g., `replacements.txt`, with one rule per line.
# The separator is ==> (a rule without it replaces the match with ***REMOVED***):
# SensitiveData_XYZ==>REDACTED_DATA

# Then run:
# git filter-repo --replace-text replacements.txt --force
# --replace-text applies to every file in every commit. Do not add --path file1.txt
# to narrow it: --path selects which files to KEEP and would drop all others from history.

# For this specific string, we could also target the commit that added it (C4)
# and use interactive rebase to edit that commit if it's simple enough.
# However, filter-repo is for repository-wide changes.

# The same replacement expressed as a Python callback:
# git filter-repo --blob-callback \
#    'blob.data = blob.data.replace(b"SensitiveData_XYZ", b"REDACTED_DATA")' \
#    --force

After running such a command, git filter-repo will process all commits. Commit C4 (and any subsequent commit) would have new SHA-1 hashes. file1.txt in the history would no longer contain “SensitiveData_XYZ”.

Verify:

Bash
git log --oneline
# Observe changed SHAs for C4, C5, C6.
git show <new_hash_C4> # Check content of file1.txt in the new C4

The output of git show for the new C4 should not contain “SensitiveData_XYZ”.

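After the rewrite:

After verifying, you might push the result to a new remote repository if this was a cleanup of a private repo. Note that git filter-repo removes the origin remote from the clone it rewrites, so you need to add a remote again (git remote add) before pushing. If you are replacing a shared repo, all collaborators would need to re-clone or reset their local copies to the new history. This is a major disruptive operation.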

OS-Specific Notes

  • git filter-repo Installation:
    • git filter-repo is a Python script. The most common way to install it is using Python’s package installer, pip: pip install git-filter-repo.
    • Windows: You’ll need Python and pip installed and in your PATH.
    • macOS/Linux: Python is usually pre-installed. You might need to install pip if it’s not available (python3 -m ensurepip --upgrade or via your system’s package manager like apt install python3-pip). You might also use pip3 if python3 is your default.
  • Shell Quoting for git log --pretty=format::
    • When using complex format strings with spaces or special characters in git log --pretty=format:"...", quoting rules can differ slightly between shells (Bash, Zsh, PowerShell, Windows CMD).
    • Bash/Zsh (Linux/macOS): Single quotes ('...') are generally safer for literal strings, while double quotes ("...") allow variable expansion (which you usually don’t want in the format string itself).
    • PowerShell (Windows): PowerShell has its own quoting rules. Often, enclosing the format string in double quotes works, but complex characters might need escaping with a backtick (`).
    • Windows CMD: CMD’s quoting is more limited. Using Git Bash (which comes with Git for Windows) often provides a more consistent experience for complex Git commands.
  • Performance of History Rewriting:
    • git filter-branch (the old tool) is notoriously slow, especially on large repositories and on Windows due to its reliance on shell scripts and frequent process creation.
    • git filter-repo is significantly faster across all platforms because it streams history through git fast-export and git fast-import instead of checking out each commit and spawning shell processes.
  • Case Sensitivity in Searches:
    • When using git log -S or -G, remember that the default search is case-sensitive. Use the -i option (e.g., git log -S"mystring" -i) for case-insensitive searches. This behavior is consistent across OSs, but the underlying filesystem’s case sensitivity (e.g., typically sensitive on Linux, insensitive on Windows/default macOS) is a separate factor that can affect how files are named and found.

Common Mistakes & Troubleshooting Tips

Git Issue / Error Symptom(s) Troubleshooting / Solution
git reflog is Local Only Expecting git reflog to show history from a remote or recover commits lost on another collaborator’s machine. Not finding expected commits after fetching. Understand that reflog tracks local HEAD and branch tip movements. It’s your personal safety net.
  • Remote issues require checking remote branches or server-side backups (if available).
  • Collaborators must use their own local reflogs for their recovery.

Issue: Complex git log Filters Not Working
Symptom(s): git log returns no commits, incorrect commits, or errors with complex filter combinations (dates, authors, regex).
Troubleshooting / Solution:
  • Double-check syntax for dates, author names, and regular expressions (POSIX ERE for -G). Refer to git help log.
  • Remember -S (pickaxe) counts changes in occurrences, while -G matches patterns in the diff.
  • Test simpler filters first and build up complexity. Pay attention to shell quoting for format strings.
  • Ensure -- is used to separate paths from revisions/options if filtering by path.
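The "start simple, then add filters" advice might look like the following sketch, assuming a POSIX shell and Git on the PATH. The author names, file names, and commit messages are hypothetical and only loosely modelled on the lab repository.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git config user.email "demo@example.com"  # hypothetical identity

# Two commits by different authors, each touching a different file.
echo a > file1.txt
git add file1.txt
git -c user.name="John Roe" commit -q -m "Add file1"
echo b > file2.txt
git add file2.txt
git -c user.name="Jane Doe" commit -q -m "update file2"

# Step 1: a single filter.
by_author=$(git log --format=%s --author="Jane Doe")
# Step 2: add a case-insensitive commit-message filter.
by_message=$(git log --format=%s --author="Jane Doe" -i --grep="Update")
# Step 3: add a path, separated from the options by --.
by_path=$(git log --format=%s --author="Jane Doe" -i --grep="Update" -- file2.txt)
# The same filters against the other file match nothing.
no_match=$(git log --format=%s --author="Jane Doe" -i --grep="Update" -- file1.txt)

printf 'author only:      %s\n' "$by_author"
printf 'author + message: %s\n' "$by_message"
printf 'author + path:    %s\n' "$by_path"
printf 'wrong path:       [%s]\n' "$no_match"
```

Checking the result after each added filter shows which one removes the commits you expected to see.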

Issue: Running filter-branch / filter-repo on a Live Shared Repository
Symptom(s): CRITICAL: Collaborators encounter major issues pulling/pushing; histories diverge; duplicated commits appear; general chaos.
Troubleshooting / Solution: The Golden Rule: DO NOT REWRITE SHARED/PUBLISHED HISTORY without extreme caution and full team coordination.
  • Always back up the repository first. Work on a fresh clone for filtering.
  • If unavoidable (e.g., critical security data removal):
    1. Communicate extensively with the entire team *before* the operation.
    2. Ensure everyone has pushed their current work or knows how to re-integrate it.
    3. After force-pushing the rewritten history, all collaborators MUST re-clone the repository or perform a careful git reset --hard origin/<branch> on their local copies after fetching.
  • Consider whether the problem can be solved with a new commit that reverts/fixes the issue instead of rewriting history.
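The backup, force-push, and resync sequence can be rehearsed safely with local repositories. This is a minimal sketch, assuming a POSIX shell and Git on the PATH. The names origin.git, alice, bob, and backup.git are hypothetical stand-ins for the shared remote, the person rewriting, a collaborator, and the backup. A simple amend stands in for a real filtering run.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# A bare repository plays the role of the shared remote.
git init -q --bare origin.git
git -C origin.git symbolic-ref HEAD refs/heads/main

# Alice publishes a commit that contains something it should not.
git clone -q origin.git alice 2>/dev/null   # hide the "empty repository" warning
git -C alice symbolic-ref HEAD refs/heads/main
git -C alice config user.name "Alice"
git -C alice config user.email "alice@example.com"
echo "api_key=12345" > alice/config.txt
git -C alice add config.txt
git -C alice commit -q -m "Add config"
git -C alice push -q origin main

# Bob clones before the rewrite happens.
git clone -q origin.git bob

# Back up first: a mirror clone keeps every ref in its pre-rewrite state.
git clone -q --mirror origin.git backup.git

# Alice rewrites the published commit and force-pushes it.
echo "api_key=REDACTED" > alice/config.txt
git -C alice commit -q -a --amend -m "Add config"
git -C alice push -q --force-with-lease origin main

# Bob keeps a pointer to the old history, then resyncs to the rewritten branch.
git -C bob branch pre-rewrite-backup
git -C bob fetch -q origin
git -C bob reset -q --hard origin/main
```

The pre-rewrite-backup branch is an extra precaution beyond the steps listed above: it lets Bob cherry-pick any unpushed work onto the new history before discarding the old one.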

Issue: Losing Original Refs After filter-branch
Symptom(s): Deleting the refs/original/ backup refs created by filter-branch before being absolutely certain the rewrite was successful. Realizing the filter had an error later.
Troubleshooting / Solution: filter-branch saves the original refs in the refs/original/ namespace. Do not delete this namespace until you are 100% sure the filtered history is correct and desired.
(git filter-repo has different, often safer, backup/recovery mechanisms or expects operation on a clone).

Issue: git blame Points to a Reformatting/Refactoring Commit
Symptom(s): git blame output attributes a line to a commit that only changed formatting (e.g., indentation, auto-formatting) or moved code, not the commit that introduced the logic.
Troubleshooting / Solution:
  • Be aware blame shows who *last modified* the line.
  • Use git blame -w to ignore whitespace-only changes.
  • For significant refactoring or code movement, you might need to look at the commit history around the blamed commit (e.g., git log -p <blamed_commit_hash>^!).
  • Options like -M (detect lines moved or copied within the same file) or -C (detect lines moved or copied from other files modified in the same commit; repeat -C to widen the search to more commits) can help but have limitations and performance costs. Sometimes manual history tracing is needed.
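The effect of -w is easiest to see on a line that was only reindented. This is a minimal sketch, assuming a POSIX shell and Git on the PATH; the file name, contents, and identity are hypothetical.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Commit 1 introduces the logic.
printf 'total = a + b\n' > calc.txt
git add calc.txt
git commit -q -m "Add calculation"
logic_commit=$(git rev-parse HEAD)

# Commit 2 only changes the indentation of that line.
printf '    total = a + b\n' > calc.txt
git commit -q -am "Reindent"
format_commit=$(git rev-parse HEAD)

# --porcelain prints the full commit hash as the first field of the first line.
plain=$(git blame --porcelain -L 1,1 calc.txt | head -n 1 | cut -d' ' -f1)
ignoring_ws=$(git blame -w --porcelain -L 1,1 calc.txt | head -n 1 | cut -d' ' -f1)

echo "git blame    -> $plain"
echo "git blame -w -> $ignoring_ws"
```

Plain blame reports the reindenting commit, while blame -w looks past it to the commit that introduced the line.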

Issue: git filter-repo not found or not working
Symptom(s): Command git filter-repo results in “command not found” or Python errors.
Troubleshooting / Solution:
  • git filter-repo is a third-party tool, not part of core Git. It needs to be installed separately (usually via pip install git-filter-repo).
  • Ensure Python (typically Python 3) and pip are correctly installed and in your system’s PATH.
  • Check the git-filter-repo documentation for specific installation instructions and dependencies for your OS.
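A quick prerequisite check along these lines can narrow down which piece is missing. It is only a sketch, assuming a POSIX shell. The variable name filter_repo_state is hypothetical, and the install hint repeats the pip command mentioned above.

```shell
# Is a Python 3 interpreter visible on PATH?
if command -v python3 >/dev/null 2>&1; then
    echo "python3: $(python3 --version 2>&1)"
else
    echo "python3: not found on PATH"
fi

# Git runs "git filter-repo" by looking for an executable named git-filter-repo
# in its own exec path and on PATH, so check both places.
if command -v git-filter-repo >/dev/null 2>&1 ||
   [ -x "$(git --exec-path)/git-filter-repo" ]; then
    filter_repo_state=installed
else
    filter_repo_state=missing
fi
echo "git-filter-repo: $filter_repo_state"

if [ "$filter_repo_state" = missing ]; then
    echo "Try: python3 -m pip install git-filter-repo"
fi
```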

Exercises

Use the git-history-lab repository you set up. You may need to reset its state or add new commits for some exercises.

  1. Reflog Branch Recovery Drill:
    1. On main, create a new branch temp-feature.
    2. Add two commits to temp-feature.
    3. Switch back to main.
    4. Accidentally delete the temp-feature branch using git branch -D temp-feature.
    5. Use git reflog to find the SHA-1 of the last commit made on temp-feature.
    6. Recover the temp-feature branch by checking out that commit to a new branch named temp-feature-recovered.
  2. Log Detective Work:
    1. Using git log on your git-history-lab repository:
      • Find all commits made by “Jane Doe” that affected file2.txt.
      • Find all commits on main made in the last hour (if you made recent commits; otherwise, adjust the timeframe or use a specific date range from your setup) that contain the word “Update” (case-insensitive) in their commit message.
      • Display the last 3 commits on main using a custom format that shows: abbreviated hash, relative committer date, author name, and the full commit message body, each on a new line.
  3. Blame and Trace:
    1. Run git blame file1.txt.
    2. Identify the commit that introduced the line “SensitiveData_XYZ”.
    3. Use git show <commit_hash> for that commit to see the full context of the change.
    4. If you have a commit that only changed whitespace in file1.txt (you might need to add one), run git blame -w file1.txt and compare its output to git blame file1.txt for those lines.
  4. (Optional/Advanced) filter-repo Simulation – On a Clone!
    1. Clone your git-history-lab repository to a new directory (e.g., git-history-lab-clone). Work only in this clone for this exercise.
    2. In the clone, imagine file2.txt should never have been committed. Use git filter-repo to remove file2.txt from the entire history of the cloned repository. Command hint: git filter-repo --path file2.txt --invert-paths --force
    3. Verify by checking the log and trying to find file2.txt in older commits (it should be gone).
    4. Delete the git-history-lab-clone directory afterwards to ensure you don’t mix it up with your original. This exercise is purely to understand the command’s effect.

Summary

  • git reflog: Your local safety net, recording movements of HEAD and branch tips. Essential for recovering “lost” commits or branches through entries such as HEAD@{n}.
  • Advanced git log:
    • Filtering: --author, --committer, --since/--until, --grep, -- <path>, commit ranges.
    • Formatting: --oneline, --graph, --decorate, --pretty="format:..." with placeholders like %h, %an, %ad, %s, %d.
  • Searching Code Changes:
    • git log -S"string" (pickaxe): Finds commits that changed the count of “string”.
    • git log -G"regex": Finds commits where diffs have lines matching “regex”.
  • git blame <file>: Shows who last modified each line of a file, and in which commit. Options: -L, -w, -C, -M.
  • git shortlog: Summarizes git log output, grouped by author. Options: -s, -n, -e.
  • Advanced History Rewriting (Use with Extreme Caution):
    • Reserved for major repository surgery (e.g., removing sensitive data, large files from all history).
    • git filter-branch: Older, slower, complex. Largely superseded.
    • git filter-repo: Modern, faster, safer alternative (requires separate installation).
    • Crucial Warning: These tools rewrite history (change SHAs). Never use them on shared/pushed history without full team coordination and an understanding of the disruptive impact. Always back up first.
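The difference between -S and -G in the recap above shows up when a commit edits a line without changing how often the search string occurs. This is a minimal sketch, assuming a POSIX shell and Git on the PATH; the file name, the connect_db call, and the identity are hypothetical.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Commit 1 introduces the string: its occurrence count goes from 0 to 1.
echo 'connect_db(1)' > app.txt
git add app.txt
git commit -q -m "Introduce connect_db call"

# Commit 2 edits the same line: the count stays at 1, but the diff mentions it.
echo 'connect_db(2)' > app.txt
git commit -q -am "Change connect_db argument"

pickaxe_hits=$(git log --format=%s -S'connect_db' | wc -l | tr -d ' ')
regex_hits=$(git log --format=%s -G'connect_db' | wc -l | tr -d ' ')

echo "-S finds $pickaxe_hits commit(s); -G finds $regex_hits commit(s)"
```

Use -S to find where something was introduced or removed, and -G to find every commit whose diff touches a matching line.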

Mastering these tools allows for deep insights into your project’s history and provides mechanisms for recovery and, when absolutely necessary, for carefully considered history alterations.
