Implementing Git in TypeScript: the hash-object command

In the first article of this series, we reimplemented git init. The structure it created was a skeleton: a few directories, a HEAD file, and nothing inside .git/objects.

Now we start filling it.

git hash-object is the command that turns content into a Git object. It computes the object's identifier and can optionally write the object to the repository's object database. The higher-level commands (add, commit, write-tree, etc.) all rely on this same operation to store their objects.

The code for this series is on GitHub: github.com/alexisbchz/git.ts.

Observing the behavior of `git hash-object`

As with init, we start by looking at what the real command does.

mkdir /tmp/git-hash-object-test
cd /tmp/git-hash-object-test
git init
printf 'hello world\n' > hello.txt

Without options, hash-object prints the identifier without writing anything:

git hash-object hello.txt

3b18e512dba79e4c8300dd08aeb37f8e728b8dad

With -w, the command also writes the object to .git/objects:

git hash-object -w hello.txt
ls .git/objects/3b/

18e512dba79e4c8300dd08aeb37f8e728b8dad

Two things stand out.

The identifier is a SHA-1 hash (40 hexadecimal characters).
The object is stored at .git/objects/<first-two>/<rest>. Git splits the hash in two to limit the number of entries per directory.
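
As a small illustration of that layout, the sketch below derives the on-disk path from an identifier. The helper name objectPath is hypothetical and not part of the series' code; it uses POSIX-style joins so the result looks the same on every platform.

```typescript
import { posix } from "node:path";

// Hypothetical helper: maps an object ID to its path inside the git directory.
function objectPath(gitDir: string, hash: string): string {
  // The first two hex characters name the directory, the remaining 38 the file.
  return posix.join(gitDir, "objects", hash.slice(0, 2), hash.slice(2));
}

console.log(objectPath(".git", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"));
// .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```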

The format of a Git object

To compute the SHA-1, Git does not hash the raw content. It prefixes the content with a header:

<type> <size>\0<content>

For hello world\n (12 bytes), the hashed buffer becomes:

blob 12\0hello world\n

The \0 is an actual null byte, not the literal two-character sequence \0. It separates the header (text) from the content (arbitrary bytes).
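
To make the role of the header concrete, here is a minimal sketch that hashes the same 12 bytes twice: once with the header and once without. The function names blobId and rawSha1 are chosen for this illustration only. Only the first variant should reproduce the identifier that git hash-object printed above.

```typescript
import { createHash } from "node:crypto";

// Hashes a blob as described above: header, null byte, then the content.
function blobId(content: Buffer): string {
  // "\0" inside a JavaScript string is a single NUL character.
  const header = Buffer.from(`blob ${content.length}\0`);
  return createHash("sha1").update(Buffer.concat([header, content])).digest("hex");
}

// Hashes the raw bytes alone, for comparison.
function rawSha1(content: Buffer): string {
  return createHash("sha1").update(content).digest("hex");
}

const hello = Buffer.from("hello world\n");
console.log(blobId(hello)); // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
console.log(rawSha1(hello) === blobId(hello)); // false
```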

The object stored on disk is this same buffer, compressed with zlib. To read it back, you decompress the file and split at the first null byte to recover the type, size, and content.
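
The read path can be sketched in a few lines. This is only an illustration under assumptions: parseObject and ParsedObject are hypothetical names, the input is built in memory instead of being read from .git/objects, and the series' own reader may look different.

```typescript
import { deflateSync, inflateSync } from "node:zlib";

interface ParsedObject {
  type: string;
  size: number;
  content: Buffer;
}

// Hypothetical helper: decodes the bytes of a zlib-compressed object.
function parseObject(compressed: Buffer): ParsedObject {
  const store = inflateSync(compressed);
  // Split at the FIRST null byte: the content itself may contain more of them.
  const nul = store.indexOf(0);
  if (nul === -1) throw new Error("malformed object: no header terminator");
  const header = store.subarray(0, nul).toString("ascii");
  const space = header.indexOf(" ");
  const type = header.slice(0, space);
  const size = Number(header.slice(space + 1));
  const content = store.subarray(nul + 1);
  if (size !== content.length) throw new Error("malformed object: size mismatch");
  return { type, size, content };
}

// Round trip on an in-memory object rather than a real file.
const parsed = parseObject(deflateSync(Buffer.from("blob 12\0hello world\n")));
console.log(parsed.type, parsed.size, JSON.stringify(parsed.content.toString()));
// blob 12 "hello world\n"
```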

The possible types are blob, tree, commit, and tag. For now, we only deal with blobs.

Object plumbing

We extract all the object logic into a dedicated file, separate from the command.

src/objects.ts

import { mkdir, writeFile } from "node:fs/promises";
import { join } from "node:path";
import { createHash } from "node:crypto";
import { deflateSync } from "node:zlib";

export type GitObjectType = "blob" | "tree" | "commit" | "tag";

// Computes the object ID without touching the disk.
export function hashObject(type: GitObjectType, content: Buffer): string {
  const store = serialize(type, content);
  return createHash("sha1").update(store).digest("hex");
}

// Writes the compressed object under gitDir/objects and returns its ID.
export async function writeObject(
  gitDir: string,
  type: GitObjectType,
  content: Buffer,
): Promise<string> {
  const hash = hashObject(type, content);
  const compressed = deflateSync(serialize(type, content));

  // The first two hex characters name the directory, the rest names the file.
  const dir = join(gitDir, "objects", hash.slice(0, 2));
  const path = join(dir, hash.slice(2));

  await mkdir(dir, { recursive: true });
  await writeFile(path, compressed);

  return hash;
}

// Builds the header followed by the content; content.length is a byte count.
function serialize(type: GitObjectType, content: Buffer): Buffer {
  const header = Buffer.from(`${type} ${content.length}\0`);
  return Buffer.concat([header, content]);
}

hashObject produces the identifier without touching the disk. writeObject writes the compressed object to .git/objects/xx/yyy..., then returns the identifier.

The internal serialize function builds the buffer that gets hashed and compressed. It is exactly the same buffer in both cases: hashing the raw content without the header would not produce the SHA Git expects.
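
One detail of serialize is worth illustrating: the size in the header is a byte count, which is why it works on a Buffer and not on a string. The short sketch below (headerFor is a name made up for this example) shows the difference on a string containing a non-ASCII character.

```typescript
// Builds the header for a blob; Buffer.length reports bytes, not characters.
function headerFor(content: Buffer): string {
  return `blob ${content.length}\0`;
}

// "\u00e9" is the letter e with an acute accent: one character, two UTF-8 bytes.
const text = "h\u00e9llo\n";
const bytes = Buffer.from(text, "utf8");

console.log(text.length); // 6
console.log(bytes.length); // 7
console.log(JSON.stringify(headerFor(bytes))); // "blob 7\u0000"
```

Using the string length here would write a header that disagrees with the stored content, and the resulting identifier would no longer match Git's.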

Reading (readObject) will be added in the next article, when we implement cat-file.

Finding the repository

To write into .git/objects, we first need to find the .git of the current repository. Git can be invoked from any subdirectory; it walks up through the parent directories until it finds a .git.

We reproduce this behavior:

src/repository.ts

import { stat } from "node:fs/promises";
import { dirname, join, resolve } from "node:path";
import { UsageError } from "./errors";

export async function findGitDir(start: string = process.cwd()): Promise<string> {
  let current = resolve(start);

  while (true) {
    const candidate = join(current, ".git");

    try {
      const info = await stat(candidate);
      if (info.isDirectory()) {
        return candidate;
      }
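    } catch (error: unknown) {
      // A missing .git at this level is expected; anything else is a real error.
      if (!(error instanceof Error) || !("code" in error) || error.code !== "ENOENT") {
        throw error;
      }
    }

    // At the filesystem root, dirname returns its input: stop searching.
    const parent = dirname(current);
    if (parent === current) {
      throw new UsageError("not a git repository");
    }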

    current = parent;
  }
}

dirname returns current itself once we reach the root of the filesystem. That is our stop condition: if going up one level no longer changes anything, we have hit the ceiling without finding a repository.
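
The stop condition is easy to check in isolation. The sketch below lists the directories such a walk would visit; ancestors is a hypothetical helper, and it uses the POSIX flavor of node:path so the result does not depend on the host platform.

```typescript
import { posix } from "node:path";

// Lists every directory an upward walk visits, from start to the root.
function ancestors(start: string): string[] {
  const visited: string[] = [];
  let current = posix.resolve(start);
  while (true) {
    visited.push(current);
    const parent = posix.dirname(current);
    if (parent === current) return visited; // dirname no longer moves: root reached
    current = parent;
  }
}

console.log(posix.dirname("/")); // /
console.log(JSON.stringify(ancestors("/a/b/c"))); // ["/a/b/c","/a/b","/a","/"]
```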

The CLI command

With the plumbing in place, the command is short. It gathers the inputs (files and/or stdin), chooses between hashing only and writing, then prints one identifier per input.

src/commands/hash-object.ts

import { readFile } from "node:fs/promises";
import { parseArgs } from "node:util";
import type { Command, CommandContext } from "../main";
import { UsageError } from "../errors";
import { parseCommandArgs } from "../utils/parse";
import { findGitDir } from "../repository";
import { hashObject, writeObject, type GitObjectType } from "../objects";

const VALID_TYPES = new Set<GitObjectType>(["blob", "tree", "commit", "tag"]);

export class HashObjectCommand implements Command {
  readonly description =
    "Compute object ID and optionally write to the object database";

  async run({ args }: CommandContext): Promise<void> {
    const { values, positionals } = parseCommandArgs(() =>
      parseArgs({
        args,
        options: {
          write: { type: "boolean", short: "w", default: false },
          stdin: { type: "boolean", default: false },
          type: { type: "string", short: "t", default: "blob" },
        },
        strict: true,
        allowPositionals: true,
      }),
    );

    if (!VALID_TYPES.has(values.type as GitObjectType)) {
      throw new UsageError(`invalid object type: ${values.type}`);
    }
    const type = values.type as GitObjectType;

    // stdin comes first, then the files, in the order they were given.
    const inputs: Buffer[] = [];
    if (values.stdin) {
      inputs.push(await readStdin());
    }
    for (const path of positionals) {
      inputs.push(await readFile(path));
    }

    if (inputs.length === 0) {
      throw new UsageError("hash-object expects a file or --stdin");
    }

    // Only look for a repository when we actually need to write.
    const gitDir = values.write ? await findGitDir() : null;

    for (const content of inputs) {
      const hash = gitDir
        ? await writeObject(gitDir, type, content)
        : hashObject(type, content);
      console.log(hash);
    }
  }
}

async function readStdin(): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of process.stdin) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

A few notes:

The inputs array accepts several files in a single invocation, like the real command (git hash-object a.txt b.txt).
--stdin adds the content of standard input to the list, ahead of the files.
findGitDir is only called with -w. Without that option, hashing a file outside a repository has to work.
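
To see what the option table yields for a typical invocation, here is a sketch that runs parseArgs from node:util on a sample argument list, without the project's parseCommandArgs wrapper. The function name parseHashObjectArgs is an assumption made for this example.

```typescript
import { parseArgs } from "node:util";

// Same option table as the command, applied to an arbitrary argument list.
function parseHashObjectArgs(args: string[]) {
  return parseArgs({
    args,
    options: {
      write: { type: "boolean", short: "w", default: false },
      stdin: { type: "boolean", default: false },
      type: { type: "string", short: "t", default: "blob" },
    },
    strict: true,
    allowPositionals: true,
  });
}

const sample = parseHashObjectArgs(["-w", "--stdin", "a.txt", "b.txt"]);
console.log(sample.values.write, sample.values.stdin, sample.values.type); // true true blob
console.log(JSON.stringify(sample.positionals)); // ["a.txt","b.txt"]
```

Because strict is true, an unknown flag makes parseArgs throw instead of being treated as a file name.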

Wiring up the command

Our current commandName turns InitCommand into init with a plain toLowerCase(). For HashObjectCommand, we want hash-object, not hashobject. So we insert a hyphen between a lowercase letter and the uppercase letter that follows it:

src/main.ts

function commandName(commandType: CommandConstructor): string {
  return commandType.name
    .replace(/Command$/, "")
    .replace(/([a-z])([A-Z])/g, "$1-$2")
    .toLowerCase();
}

With this rule, InitCommand still maps to init, HelpCommand still maps to help, and HashObjectCommand becomes hash-object. Any multi-word command will automatically take the form Git expects.
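
The rule can be tried on plain strings. In the sketch below, toCommandName applies the same three steps to a class name passed as text; CatFileCommand and WriteTreeCommand are hypothetical class names used only to show the multi-word case.

```typescript
// Same transformation as commandName, taking the class name as a string.
function toCommandName(className: string): string {
  return className
    .replace(/Command$/, "") // drop the suffix
    .replace(/([a-z])([A-Z])/g, "$1-$2") // hyphen at each lower-to-upper boundary
    .toLowerCase();
}

for (const name of ["InitCommand", "HashObjectCommand", "CatFileCommand", "WriteTreeCommand"]) {
  console.log(`${name} -> ${toCommandName(name)}`);
}
// InitCommand -> init
// HashObjectCommand -> hash-object
// CatFileCommand -> cat-file
// WriteTreeCommand -> write-tree
```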

We then register the new command:

src/main.ts

import { HashObjectCommand } from "./commands/hash-object";

registry.register(InitCommand, HashObjectCommand, HelpCommand);

Checking against Git

The test that really matters: do our identifiers match the ones Git produces?

mkdir /tmp/git-hash-test
cd /tmp/git-hash-test
git.ts init
printf 'hello world\n' > hello.txt

git.ts hash-object hello.txt
git hash-object hello.txt

Both print:

3b18e512dba79e4c8300dd08aeb37f8e728b8dad

With -w, we should be able to read the object back with the real command:

git.ts hash-object -w hello.txt
git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

hello world

If Git can read what we wrote, the format is correct: header, null byte, content, zlib compression. Everything lines up.

Conclusion

We can now compute an object identifier and store a blob.

In the next article, we will implement cat-file. That command closes the loop: it reads an object written by hash-object and, as a bonus, the objects written by the real Git command.

Stuck, or want to share your notes? Join the Discord server.