For Good Measure

Git and its hub

by Colby Russell. 2015 May 31.

Historically, I've avoided GitHub. I'm one of those people who agree with the position that you should be conscious of the risks you run with monocultures, plus I just don't think GitHub is actually all that great. I do make concessions, of course. Skip to the bottom if you just want details about my current revision control habits.

Forewarning: I don't think I'm about to say anything that hasn't been said before. I'm writing only because it occurs to me that if someone were to say, "I've tried to avoid using GitHub", then it's entirely possible that there exist people who haven't thought much about it and would have no idea why someone would take a stance like that.

One problem with GitHub is Git itself. See, this isn't limited to GitHub; I've also avoided Git where possible. When comparing the problems of monoculture around Git and a monoculture around GitHub, lots of the problems go away—GitHub is a centralized service, and Git is not—but some of them remain arguably relevant. One is the competition argument. That is, you don't want to encourage a scenario where something, whether it be a product or a service, has no competition, because competition leads to good things, and lack of competition is thought not to drive improvements, at least not as effectively. This may not be a terribly convincing argument in the world of revision control systems, and I'm not sure that I totally agree with it myself. The fact that a near-monoculture oriented around GitHub is capable of advancing something that's approaching a monoculture around Git itself may be proof of impotence in the competition argument: in adoption and usage, Git is pretty much trouncing Mercurial. Indeed, the benefits of an industry mostly unified around one system, particularly when the system is an open one like Git, very well may outweigh any advantages that competition brings.

Git users always point out how Git is so much more powerful than Mercurial. Recent versions of Mercurial are supposed to have made many of these comparisons obsolete, but ignoring this, I would still accept that the Git advocates are right, but here's the kicker: Even so, Mercurial is still a better system. It all comes down to usability.

Here's a thing that happens frequently: someone mentions that they find Git confusing, and someone comes along to share a link that's supposed to explain the concepts behind Git. "I found Git confusing, too, until I understood it conceptually", they say. The resource they link to is almost always trying to nudge the reader away from a CVS/SVN mindset. Here's the thing: I already understand Git on a conceptual level. I understand the underpinnings of DVCSs. And as a matter of fact, it's not that I've got an SVN background clouding my thinking, because I don't. (Funnily enough, I always avoided SVN for exactly the reasons Linus gave for avoiding it.) So you can stop trying to sell us on the idea of a DVCS workflow. I understand the concepts. If I ever say anything that sounds like I'm saying Git is confusing in some way, it means exactly one thing: I'm coming at this with an exasperation for the way fundamental Git concepts map onto its infuriatingly obtuse UI.

Here's another thing that happens: someone gives an example of confusing Git output and/or documentation, then someone else comes along to say, "It's like this. Simple." I suspect there's something else at play--and this touches on a broader social theory that I've been working at the back of my mind for a while. The idea is that familiarity smooths over any rough spots. It goes like this: there's this terrain with all these cracks on the ground liable to trip a person up, then there's this thing called "familiarity" which some people are able to use, and it oozes forth in the path ahead of them, filling in the cracks and smoothing the rough patches, like those appetizing visuals that you always see in ads for facial creams. The result is that the rough spots for them become a total non-issue. But it's a little more subtle than that, because if you were to ask them about all the rough spots, they'd tell you that they don't know what you're talking about and can't even see any. And they'd be right.

There's a reason, though, why these two things exist:

My claim is that Git's UI and its documentation suffer from a particular problem, which is that of being an artifact created by those who already understand what's going on. That's not the entirety of it, because all documentation is written like that. It has to be. But if you ask someone to document something there are two things that can result: docs that are understandable to both the experts and those unfamiliar, and then docs that explain in perfectly clear language only to those already familiar while being otherwise completely baffling to anyone else.

(I guess there's a third possible result, too, which would be categorized as "just unadulterated crap", but I was trying to focus on the subtlety of the other two here.)

So my claim is really that Git's documentation and UI tend to be of the second type.

Mercurial is plagued in some ways by this, too, I'm sure. In fact, if I think back, I very definitely remember instances where I encountered pitfalls due to Mercurial's UI, but I'd be unable to tell you now exactly what they were. So Mercurial suffers from it, too, absolutely. It's just that it suffers from it a lot less.

There may be a good reason for this. Mercurial is just a lot simpler, by which I mean it has a less featureful core. In contrast to Git, with Mercurial you only pay for the features you use.

A few years back, when GitHub really began taking off, I remember pushing for Git within my team for our capstone project and for my team in another course that I was taking concurrently, when the other option on the table was to use no revision control at all. Mozilla had just settled on Mercurial a couple years before, back when it wasn't clear it was going to lose. My rationale at the time was, "Hey, I've got Mercurial covered, and I'm seeing more projects using GitHub every day. Let's get on that." Bad idea. The index was baffling. Not just for me, but for everyone. I think by pushing for Git, I may have inflicted on my teammates a wholesale fear of revision control outright, and I know it wouldn't have been a problem if my suggestion had been to use Mercurial instead.

Some people love Git's staging area. It comes up all the time. They think it's great. They couldn't work without it. Here's where we see the difference in approach for Mercurial and Git. With Git, you have to pay the cost of interacting with the staging area whether you want it or not. In Mercurial, this would exist as an extension that provides that extra layer of indirection only if you enable it. And it does exist, in the record extension. I think. I wouldn't know. I see the staging area as a completely pointless level of indirection and have no use for trying to emulate it in Mercurial.

Now, on to GitHub itself.

For starters, it's Git-only, so everything above concerning Git simultaneously affects GitHub. Then there's the issue of GitHub, as a product itself, leaving something to be desired, and that something can usually be found elsewhere. GitHub's issue tracking is a good example.

GitHub's issue tracking is more or less a capable bugtracker as far as toy bugtrackers go. Bugzilla is a good example of a tracker fit for heavy-duty workloads. Let's look at an example. If you file a bug against https://github.com/example/repo, it creates an issue that's bound to that repo for eternity. If that organization has a related repo, say https://github.com/example/otherrepo, then you're out of luck if the bug triage process reveals that it should have actually been filed against "otherrepo" instead. (Assume both "repo" and "otherrepo" are distinct components used within one product; it's conceivable the reporter would make a mistake identifying which of the two the problem actually lies in.) The best course of action for you if this happens, the very best, is to close the original issue filed against "repo" and then open up a new one for "otherrepo". Any discussion, et cetera, is completely wiped clean in the new bug, and readers have to manually cross-reference the original issue. Or you can leave it open at its original site and ignore the problems that thrusts upon you, namely one of poor organization in the places where you're trying to do work.

Bugzilla, on the other hand, is meant to run as a single instance to manage all of a project's bugs, no matter where the bug lies. It has the notion of "products" and "components". You can approximate the latter with labels in GitHub, but the leaks start to become visible when you try to approximate both at the same time. Bugzilla also has the concept of bug status down pat. This isn't just about lifecycle, which you can ignore if you like, but also about bug resolutions. In GitHub, your bug is either open or closed. Again, you can approximate both Bugzilla's bug life cycle and its resolution type with labels, but by now you've fallen back to labels for all these things, and they're all just floating around in one big soup. Want to mark a bug as the equivalent of both FIXED and WONTFIX? Go ahead, they're just labels. What does it mean? Who cares, I guess.
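
To make the label-soup complaint concrete, here is a minimal sketch of the two shapes of data being contrasted. It is not the actual data model of GitHub or Bugzilla; the class names, fields, and methods (LabeledIssue, TrackedBug, move_to, resolve) are all made up for illustration. The point is only that a free-form set of labels accepts a contradictory pair, while a single resolution slot cannot, and that a bug with a component field can be retriaged without losing its discussion.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class Resolution(Enum):
    # Hypothetical subset of resolutions, named after the two in the text.
    FIXED = "FIXED"
    WONTFIX = "WONTFIX"


@dataclass
class LabeledIssue:
    """Label-only model: open/closed plus a free-form set of labels."""
    repo: str
    is_open: bool = True
    labels: Set[str] = field(default_factory=set)


@dataclass
class TrackedBug:
    """Structured model: product/component fields and one resolution slot."""
    product: str
    component: str
    resolution: Optional[Resolution] = None
    comments: List[str] = field(default_factory=list)

    def move_to(self, component: str) -> None:
        # Retriage in place: the discussion stays attached to the same bug.
        self.component = component

    def resolve(self, resolution: Resolution) -> None:
        # A single slot means contradictory resolutions cannot coexist.
        if self.resolution is not None:
            raise ValueError(f"already resolved as {self.resolution.value}")
        self.resolution = resolution


# Labels happily accept a contradictory pair.
issue = LabeledIssue(repo="example/repo")
issue.labels.update({"FIXED", "WONTFIX"})

# The structured bug can be moved without losing its comments.
bug = TrackedBug(product="example", component="repo")
bug.comments.append("Looks like this belongs to otherrepo.")
bug.move_to("otherrepo")
bug.resolve(Resolution.FIXED)
```

Calling bug.resolve(Resolution.WONTFIX) at this point raises a ValueError, whereas nothing stops the labeled issue from carrying both labels at once.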

And then there are all sorts of problems with the way GitHub handles code reviews. The fact that GitHub has comments that are specifically meant to be in response to a pull request is a good thing. That the pull request and the issue it's meant to fix are presented as these totally isolated things is a very bad thing. Gijs specifically calls this out in the comments to Gregory Szorc's post "Please Stop Using MQ":

github is terrible about filing a bug first and then creating a patch, because you are forced to have two issues in its tracker (you file an issue first, and your pull request will create another one), which means discussion about approach etc. gets split between the "issue" and the "pull request".

Again, the fact that comments concerning a particular pull request are organized in a way that it reflects that relationship? That's a really good thing. But what GitHub should do is aggregate all discussion into the page for the issue itself. Yes, even when there are multiple pull requests for the issue. In fact, especially when there are multiple pull requests for the issue. E.g., someone creates a pull request, the maintainer indicates they'd like more work done in some area before integrating the changes, and the requestor creates another pull request after making the changes to address those concerns. Now we have three threads of discussion, or rather, one discussion spread out amongst three pages. Bugzilla handles this by simply allowing you to mark older patches as obsolete. The patch/fork distinction deserves some comment, too.

Forks are dumb. The ability to fork is an incredibly valuable one, but forks themselves are total overkill for anybody just looking to submit a patch, which is the use case for the vast majority of contributors by an it's-not-even-close margin. Gijs nails it again. As he writes, "jquery has been forked over 7000 times at the time I'm writing this comment. The only version of jquery that's actually used [...] is under the jquery project's authority in github".

The thing about forks is not just that they're these conceptually heavyweight things that feel wrong. There's actually measurable friction involved with using them; the fork-and-PR workflow is heavyweight. "Doing things the github way takes forever", Gijs writes. When comparing it to patch submission: "[doing a patch] is a 3 step process: write code, do a diff, upload the result."
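
For a sense of how small the "do a diff" step is, here is a rough sketch using Python's standard difflib module rather than any particular revision control tool. The file name greet.py and its contents are invented for the example, and the upload step is left out because it depends entirely on the tracker you're submitting to.

```python
import difflib

# Hypothetical file contents before and after a one-line fix.
original = ["def greet():\n", "    print('helo')\n"]
updated = ["def greet():\n", "    print('hello')\n"]

# unified_diff yields the patch line by line; join it into one string.
patch = "".join(
    difflib.unified_diff(
        original, updated, fromfile="a/greet.py", tofile="b/greet.py"
    )
)
print(patch)
```

The resulting string is an ordinary unified diff, which is the thing you would attach to a bug as a patch.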

With forks, there's also a weird thing that happens. Go fork a project and then browse the repo on the Web as if you're someone else. I.e., you're unfamiliar with both whoever you are and with the project itself. Take a look at its README. If the original author wasn't careful, it now reads as if it's your project and a casual observer might mistake it for the canonical repo. This is a minor detail, but it weirds me out. I go through some effort to make sure I change the project's description to make it clear that it's just a fork of the proper project. "But wait", you might say. "If you fork a project, GitHub says that it's a fork and even links to the original project." Yeah, that's right. If you fork from within the GitHub UI. If you just create a new project on GitHub and add it as a new remote and push to it, you don't get such a warning. "So just use the GitHub UI to fork it, then." Nope. That's not possible if the original project isn't hosted on GitHub. If the original project's Git instance is self-hosted, creating a new project on GitHub and pushing to it is the only way to do it if you want your fork hosted there, and GitHub doesn't show anything in the UI to indicate that. In fact, it doesn't even show the forked-from UI if you use this workflow and the original project is hosted on GitHub. It just doesn't do that sort of detection.

In addition to manually changing the description to reflect that it is, in fact, a temporary "fork", I also make sure to only keep my fork around as long as it takes to integrate my changes. I'm aggressive with pruning forks, which is something that seems to be rare elsewhere on GitHub. The result is similar to before: you click on someone's profile and listed in their repositories are all of these non-forks that were only ever created because they wanted to contribute a patch once, or maybe every now and then. "Every now and then" may have something to do with their keeping the fork around. See, if you go fix something and the pull request gets accepted, and then you prune the fork like I do, if two weeks or two months later you want to fix something else, then you've got to go recreate that project again before submitting another pull request. So it's not even as if the "leave the fork around" mentality can be attributed to unforgivable laziness. It's that the whole forking workflow is working against you to do otherwise.

I've pretty much blown way more of my time on blogging than I originally allotted for this, and I didn't even get to the part about how Git logs are totally unreliable. (Example: I once made a trivial change to this file. See if it can be found in the file's change history. Spoiler alert: it can't.) I'm also getting a little bummed about how negative I'm coming off here, although I suppose that's just the nature of the topic I set out to tackle upfront.

So I'll stop now and leave you with a rundown of how I operate these days: I first reach for Mercurial, especially for clean-slate repos that are never going to be seen by other eyes, since I don't have to worry about how potential contributors may be uncomfortable with something that isn't Git. When I do use Git, it's always as a result of an existing project that has chosen Git for its revision control, but I still opt to refrain from hosting on GitHub, and the only time I use its features is when the original project is hosted there. My Git remotes point to GitLab, because yay for heterogeneity. The free private repos and the fact that GitLab has a (FOSS) "Community Edition" both go a long way towards helping inform that decision.