When Git starts to report unmodified files as modified...

Sometimes you get stuck in Git: you cannot switch to another branch or pull new changes, because git status tells you that one or more files are modified. When you have actually modified those files, this is perfectly normal, but when your working copy should be clean, it is puzzling. Your first reflex is to check out or hard-reset the modified files to revert those spurious changes. Unfortunately, that doesn't fix the problem: Git persists in telling you that some files are modified, and you are left struggling to understand what those stubborn changes could be and why you cannot revert them.

Smudge and Clean filters are playing with your nerves

You are most probably fighting with a rather obscure but powerful feature of Git: Git Attributes, and especially its smudge and clean filters, used in particular for keyword expansion.

If git status reports changes but git diff -w doesn't show any, you almost certainly have a whitespace or line-ending issue. If that is your case, you probably have not read this far, and you have most likely already found a more insightful article on the subject, such as GitHub's Dealing with line endings.
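The check described above can be tried in a throwaway repository. This is only a sketch under assumptions: the file name notes.txt, the temporary directory and the demo identity are made up for illustration, and core.autocrlf is switched off locally so that Git does not convert line endings by itself.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git config core.autocrlf false   # keep Git from converting line endings itself

# Commit a file with LF line endings.
printf 'line one\nline two\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Rewrite the working copy with CRLF line endings; the text itself is unchanged.
printf 'line one\r\nline two\r\n' > notes.txt

status=$(git status --porcelain)
# Count changed content lines in each diff, skipping the ---/+++ header lines.
plain=$(git diff | grep -c '^[-+][^-+]' || true)
ignoring_ws=$(git diff -w | grep -c '^[-+][^-+]' || true)

echo "status:      $status"
echo "git diff:    $plain changed lines"
echo "git diff -w: $ignoring_ws changed lines"
```

If git status lists the file while git diff -w counts no changed lines, whitespace or line endings are the likely culprit, and the rest of this article probably does not apply.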

But if you still see differences, you may have another smudge and clean filter issue. These filters are well summarized in the pictures below, taken from the open-source Pro Git book (CC BY-NC-SA 3.0 licence).

smudge.png
The smudge filter is run on checkout

clean.png
The clean filter is run when files are staged

So, the smudge filter transforms files when they enter your working copy, while the clean filter transforms them when they leave it.
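To make the two directions concrete, here is a minimal sketch with a custom filter. The filter name "shout", the file greeting.txt and the tr commands are hypothetical and chosen only for illustration; what Git itself provides is the filter.<name>.smudge and filter.<name>.clean configuration and the filter attribute.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Hypothetical filter "shout": upper-case on checkout, lower-case on staging.
git config filter.shout.smudge "tr 'a-z' 'A-Z'"
git config filter.shout.clean "tr 'A-Z' 'a-z'"
echo '*.txt filter=shout' > .gitattributes

printf 'hello\n' > greeting.txt
git add .gitattributes greeting.txt
git commit -q -m "Add greeting"

# Delete the file and check it out again so that the smudge filter runs.
rm greeting.txt
git checkout -- greeting.txt

working=$(cat greeting.txt)                   # content after smudge
stored=$(git cat-file -p HEAD:greeting.txt)   # content after clean
status=$(git status --porcelain)

echo "working copy: $working"
echo "repository:   $stored"
echo "status:       ${status:-clean}"
```

The working copy and the stored blob differ, yet git status stays silent, because Git compares the cleaned version of the working file with what is stored.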

Let's take the ident filter as an example. It mimics the old behavior of CVS: its purpose is to inject, into the $Id$ sequence found in your source file, the SHA-1 checksum of the blob storing that file. Let's look at the process more closely:

  • when the file is checked out, the smudge filter searches for the $Id$ sequence and replaces it with $Id: <SHA-1 of the blob> $
  • when you stage the file, the clean filter searches for $Id: <any SHA-1> $ and replaces it with $Id$, so the file is later stored in the repository without any SHA-1.
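The two steps above can be observed in a scratch repository. This is a sketch under assumptions: the file name demo.txt, the temporary directory and the demo identity are invented for the example; only the ident attribute itself comes from Git.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Enable the ident attribute for text files and commit a file with a bare $Id$.
echo '*.txt ident' > .gitattributes
printf 'version: $Id$\n' > demo.txt
git add .gitattributes demo.txt
git commit -q -m "Add demo file"

# Delete the file and check it out again so that the smudge step runs.
rm demo.txt
git checkout -- demo.txt

blob=$(git rev-parse HEAD:demo.txt)       # object name of the stored blob
working=$(cat demo.txt)                   # expanded form in the working copy
stored=$(git cat-file -p HEAD:demo.txt)   # bare form in the repository
status=$(git status --porcelain)

echo "working copy: $working"
echo "repository:   $stored"
echo "status:       ${status:-clean}"
```

The working copy carries the object name of the blob, the repository keeps the bare $Id$, and git status reports nothing as long as both steps are applied.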

This process looks great, is transparent, and works perfectly as long as nothing breaks the cycle. But like any condition that nothing enforces, it eventually gets broken. At least it happened to my colleague, who found herself struggling with a spurious diff looking like this:

@@ -50,5 +50,5 @@
  * Some Java class javadoc.
  * 
- * @version $Id: 3bd2c281198e59f753e49858d33d216bba98bec0 $
+ * @version $Id$
  */
 public class MyClass

She looked at the file in her working copy, and it contained the line shown as removed. So what is happening? Why do I get such a diff when my file is unchanged? What is the problem with Git?

If you have carefully read the process above, you may have understood that in this example the ident filter has simply done its cleaning job and put $Id$ back in place. So why is there a diff at all? Because another committer has somehow broken the cycle and committed the expanded SHA-1 to the repository. The clean filter on our side now produces the diff needed to undo that.

Our way out is simply to commit that diff to the repository, since this is what should have been there in the first place.
The situation is puzzling because what we see is so unexpected that it is hard to work out what Git is doing and why we ended up there.

Now, how did this other committer reach that point? How could the cycle be broken, causing a smudge but no clean?

Don't use multiple implementations of Git!

The committer mixed different implementations of Git during their work.

While the latest Git CLI supports .gitattributes, and therefore fully supports the smudge/clean filters, this is not the case for JGit, a Java-based implementation of Git. JGit is the implementation used by EGit, the Eclipse team provider for Git. At the time of writing, JGit suffers from bug 342372 and, regarding the ident filter in particular, from bug 452968.

Now you can imagine multiple dramatic scenarios that break the smudge/clean cycle. In the case above, the initial clone/checkout of the working copy was done with the native Git CLI. Later, a commit was made from Eclipse using EGit. That later commit did not apply the clean filter, so the SHA-1 produced by the ident smudge filter of the native CLI was committed.
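This scenario can be approximated with the native CLI alone. The sketch below is not what EGit does internally: it imitates a client that skips the clean filter by storing the working-copy bytes as-is with git hash-object --no-filters. The file name, the temporary directory and the commit messages are assumptions made for the example.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# 1. Normal start: ident is enabled and the file is stored with a bare $Id$.
echo '*.java ident' > .gitattributes
printf ' * @version $Id$\n' > MyClass.java
git add .gitattributes MyClass.java
git commit -q -m "Add MyClass"

# 2. Fresh checkout with native Git: the smudge step expands the keyword.
rm MyClass.java
git checkout -- MyClass.java

# 3. Imitate a client that ignores the clean filter: store the expanded
#    working-copy content unfiltered and commit it.
raw=$(git hash-object -w --no-filters MyClass.java)
git update-index --cacheinfo 100644 "$raw" MyClass.java
git commit -q -m "Commit made without the clean filter"

# 4. Symptom: the file shows up as modified although nobody touched it.
status=$(git status --porcelain)
diff_out=$(git diff)
echo "$status"
echo "$diff_out"

# 5. Way out: stage and commit the cleaned version.
git add MyClass.java
git commit -q -m "Restore the bare Id keyword"
after=$(git status --porcelain)
stored=$(git cat-file -p HEAD:MyClass.java)
echo "status after fix: ${after:-clean}"
```

Step 4 produces the same kind of diff as the one shown earlier, with the expanded line removed and the bare $Id$ added, and step 5 is the commit that puts the repository back in the state the filters expect.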

This is why you should refrain from using multiple implementations of Git on the same working copy or repository, at least for write operations.

I hope this helps others avoid getting stuck, or helps them get unstuck...