mirror of
https://github.com/git/git
synced 2024-11-05 18:59:29 +00:00
011b648646
The current document mentions OBJ_* constants without their actual values. A git developer would know these are from cache.h but that's not very friendly to a person who wants to read this file to implement a pack file parser.

Similarly, the deltified representation is not documented at all (the "document" is basically patch-delta.c). Translate that C code to English with a bit more about what ofs-delta and ref-delta mean.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
254 lines
9 KiB
Text
Git pack format
===============

== pack-*.pack files have the following format:

   - A header appears at the beginning and consists of the following:

     4-byte signature:
         The signature is: {'P', 'A', 'C', 'K'}

     4-byte version number (network byte order):
         Git currently accepts version number 2 or 3 but
         generates version 2 only.

     4-byte number of objects contained in the pack (network byte order)

     Observation: we cannot have more than 4G versions ;-) and
     more than 4G objects in a pack.

   - The header is followed by that number of object entries, each of
     which looks like this:

     (undeltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     compressed data

     (deltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     20-byte base object name if OBJ_REF_DELTA or a negative relative
         offset from the delta object's position in the pack if this
         is an OBJ_OFS_DELTA object
     compressed delta data

     Observation: the length of each object is encoded in a variable
     length format and is not constrained to 32-bit or anything.

   - The trailer records a 20-byte SHA-1 checksum of all of the above.
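
The header and trailer layout above can be sketched as follows. This is
a minimal illustration rather than Git's own code: the function names
`read_pack_header` and `trailer_matches` are hypothetical, and the whole
pack is assumed to be held in memory as a `bytes` value.

```python
import hashlib
import struct


def read_pack_header(pack):
    """Return (version, object_count) from the 12-byte pack header."""
    # ">4sII": 4-byte signature, then two 4-byte network-order integers.
    signature, version, count = struct.unpack(">4sII", pack[:12])
    if signature != b"PACK":
        raise ValueError("missing PACK signature")
    if version not in (2, 3):
        raise ValueError("unsupported pack version %d" % version)
    return version, count


def trailer_matches(pack):
    """Compare the trailing 20-byte SHA-1 with a hash of all earlier bytes."""
    return hashlib.sha1(pack[:-20]).digest() == pack[-20:]


# A pack with zero objects: just the header followed by the trailer.
header = b"PACK" + struct.pack(">II", 2, 0)
empty_pack = header + hashlib.sha1(header).digest()
```

Here `read_pack_header(empty_pack)` yields version 2 with zero objects,
and `trailer_matches(empty_pack)` holds because the trailer was computed
over the header alone.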

=== Object types

Valid object types are:

- OBJ_COMMIT (1)
- OBJ_TREE (2)
- OBJ_BLOB (3)
- OBJ_TAG (4)
- OBJ_OFS_DELTA (6)
- OBJ_REF_DELTA (7)

Type 5 is reserved for future expansion. Type 0 is invalid.
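
As an illustration of how these type values appear in an entry, the
sketch below decodes the "n-byte type and length" field: the first byte
carries a continuation bit, the 3-bit type and the low 4 bits of the
length, and each following byte adds 7 more length bits, least
significant part first. The names `OBJ_TYPES` and `parse_type_and_size`
are made up for this example.

```python
# Type numbers as listed above; 0 and 5 are deliberately absent.
OBJ_TYPES = {
    1: "commit",
    2: "tree",
    3: "blob",
    4: "tag",
    6: "ofs-delta",
    7: "ref-delta",
}


def parse_type_and_size(buf, pos=0):
    """Decode one entry header; return (type, size, next_position)."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # 3-bit type below the MSB
    size = byte & 0x0F              # low 4 bits of the length
    shift = 4
    while byte & 0x80:              # MSB set: another length byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    if obj_type not in OBJ_TYPES:
        raise ValueError("invalid object type %d" % obj_type)
    return obj_type, size, pos
```

For example, the two bytes 0x95 0x0a decode to type 1 (commit) with
length 5 + (10 << 4) = 165.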

=== Deltified representation

Conceptually there are only four object types: commit, tree, tag and
blob. However, to save space, an object can be stored as a "delta" of
another "base" object. These representations are assigned the new types
ofs-delta and ref-delta, which are only valid in a pack file.

Both ofs-delta and ref-delta store the "delta" to be applied to
another object (called the 'base object') to reconstruct the object. The
difference between them is that ref-delta directly encodes the 20-byte
base object name, while ofs-delta, which requires the base object to be
in the same pack, encodes the offset of the base object in the pack
instead.

The base object could also be deltified if it's in the same pack.
Ref-delta can also refer to an object outside the pack (i.e. the
so-called "thin pack"). When stored on disk however, the pack should
be self contained to avoid cyclic dependency.

The delta data is a sequence of instructions to reconstruct an object
from the base object. If the base object is deltified, it must be
converted to canonical form first. Each instruction appends more and
more data to the target object until it's complete. There are two
supported instructions so far: one for copying a byte range from the
source object and one for inserting new data embedded in the
instruction itself.

Each instruction has variable length. Instruction type is determined
by the seventh bit (the most significant bit, counting from bit zero)
of the first octet. The following diagrams follow the convention in
RFC 1951 (Deflate compressed data format).

==== Instruction to copy from base object

  +----------+---------+---------+---------+---------+-------+-------+-------+
  | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
  +----------+---------+---------+---------+---------+-------+-------+-------+

This is the instruction format to copy a byte range from the source
object. It encodes the offset to copy from and the number of bytes to
copy. Offset and size are in little-endian order.

All offset and size bytes are optional. This is to reduce the
instruction size when encoding small offsets or sizes. The low seven
bits of the first octet determine which of the next seven octets are
present. If bit zero is set, offset1 is present. If bit one is set,
offset2 is present, and so on.

Note that a more compact instruction does not change offset and size
encoding. For example, if only offset2 is omitted like below, offset3
still contains bits 16-23. It does not become offset2 and contain
bits 8-15 even if it's right next to offset1.

  +----------+---------+---------+
  | 10000101 | offset1 | offset3 |
  +----------+---------+---------+

In its most compact form, this instruction only takes up one byte
(0x80) with both offset and size omitted, which will have default
values zero. There is another exception: size zero is automatically
converted to 0x10000.
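
The decoding rules for the copy instruction can be sketched like this.
The helper name `parse_copy` is hypothetical; it reads one copy
instruction starting at `pos` and reports where the next instruction
begins.

```python
def parse_copy(delta, pos):
    """Decode one copy instruction; return (offset, size, next_position)."""
    cmd = delta[pos]
    pos += 1
    if not cmd & 0x80:
        raise ValueError("not a copy instruction")
    offset = 0
    size = 0
    # Bits 0-3 select offset1..offset4; each byte keeps its fixed position.
    for i in range(4):
        if cmd & (1 << i):
            offset |= delta[pos] << (8 * i)
            pos += 1
    # Bits 4-6 select size1..size3.
    for i in range(3):
        if cmd & (0x10 << i):
            size |= delta[pos] << (8 * i)
            pos += 1
    if size == 0:
        size = 0x10000  # an encoded size of zero means 0x10000 bytes
    return offset, size, pos
```

With the compact example from the diagram, `10000101` followed by the
bytes 0x34 and 0x12 gives offset 0x120034: the second byte lands in bits
16-23 because it is offset3, not offset2. The size is omitted there, so
it becomes 0x10000.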

==== Instruction to add new data

  +----------+============+
  | 0xxxxxxx |    data    |
  +----------+============+

This is the instruction to construct the target object without the base
object. The following data is appended to the target object. The low
seven bits of the first octet determine the size of data in
bytes. The size must be non-zero.

==== Reserved instruction

  +----------+============
  | 00000000 |
  +----------+============

This is the instruction reserved for future expansion.
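
Putting the three instruction forms together, a delta could be applied
roughly as below. This is a sketch under stated assumptions, not a
translation of patch-delta.c: `apply_instructions` is a hypothetical
name, and it takes only the instruction stream. Real delta data also
begins with two variable-length sizes (of the base and of the result)
before the first instruction; this section does not describe them, so
the sketch assumes they have already been skipped.

```python
def apply_instructions(base, instructions):
    """Rebuild the target object from `base` and a delta instruction stream."""
    out = bytearray()
    pos = 0
    while pos < len(instructions):
        cmd = instructions[pos]
        pos += 1
        if cmd & 0x80:
            # Copy instruction: optional little-endian offset and size bytes.
            offset = 0
            size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= instructions[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= instructions[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000
            if offset + size > len(base):
                raise ValueError("copy range exceeds base object")
            out += base[offset:offset + size]
        elif cmd:
            # Add instruction: the next `cmd` bytes are literal data.
            if pos + cmd > len(instructions):
                raise ValueError("truncated literal data")
            out += instructions[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("reserved instruction 0x00")
    return bytes(out)
```

For instance, with base `b"hello world"`, the stream "copy 5 bytes from
offset 0, add the two bytes `, `, copy 5 bytes from offset 6" rebuilds
`b"hello, world"`.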

== Original (version 1) pack-*.idx files have the following format:

  - The header consists of 256 4-byte network byte order
    integers.  The N-th entry of this table records the number of
    objects in the corresponding pack, the first byte of whose
    object name is less than or equal to N.  This is called the
    'first-level fan-out' table.

  - The header is followed by sorted 24-byte entries, one entry
    per object in the pack.  Each entry is:

    4-byte network byte order integer, recording where the
    object is stored in the packfile as the offset from the
    beginning.

    20-byte object name.

  - The file is concluded with a trailer:

    A copy of the 20-byte SHA-1 checksum at the end of
    corresponding packfile.

    20-byte SHA-1-checksum of all of the above.
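
One way to use the fan-out table is to narrow a binary search to the
entries that share the first byte of the wanted name. The following
sketch assumes the whole version 1 index is in memory and that entries
are sorted by object name; `find_offset_v1` is a hypothetical helper,
not a Git function.

```python
import struct


def find_offset_v1(idx, name):
    """Return the pack offset for a 20-byte object name, or None."""
    fanout = struct.unpack(">256I", idx[:1024])
    first = name[0]
    # Entries whose first name byte equals `first` occupy [lo, hi).
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    while lo < hi:
        mid = (lo + hi) // 2
        entry = 1024 + 24 * mid            # 4-byte offset + 20-byte name
        entry_name = idx[entry + 4:entry + 24]
        if entry_name < name:
            lo = mid + 1
        elif entry_name > name:
            hi = mid
        else:
            return struct.unpack(">I", idx[entry:entry + 4])[0]
    return None
```

The fan-out lookup costs two table reads and typically leaves only a
small slice of the main index table to search.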

Pack Idx file:

        --  +--------------------------------+
fanout      | fanout[0] = 2 (for example)    |-.
table       +--------------------------------+ |
            | fanout[1]                      | |
            +--------------------------------+ |
            | fanout[2]                      | |
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
            | fanout[255] = total objects    |---.
        --  +--------------------------------+ | |
main        | offset                         | | |
index       | object name 00XXXXXXXXXXXXXXXX | | |
table       +--------------------------------+ | |
            | offset                         | | |
            | object name 00XXXXXXXXXXXXXXXX | | |
            +--------------------------------+<+ |
          .-| offset                         |   |
          | | object name 01XXXXXXXXXXXXXXXX |   |
          | +--------------------------------+   |
          | | offset                         |   |
          | | object name 01XXXXXXXXXXXXXXXX |   |
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~   |
          | | offset                         |   |
          | | object name FFXXXXXXXXXXXXXXXX |   |
        --| +--------------------------------+<--+
trailer   | | packfile checksum              |
          | +--------------------------------+
          | | idxfile checksum               |
          | +--------------------------------+
          .-------.
                  |
Pack file entry: <+

     packed object header:
        1-byte size extension bit (MSB)
               type (next 3 bit)
               size0 (lower 4-bit)
        n-byte sizeN (as long as MSB is set, each 7-bit)
                size0..sizeN form 4+7+7+..+7 bit integer, size0
                is the least significant part, and sizeN is the
                most significant part.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
                is the size before compression).
        If it is REF_DELTA, then
          20-byte base object name SHA-1 (the size above is the
                size of the delta data that follows).
          delta data, deflated.
        If it is OFS_DELTA, then
          n-byte offset (see below) interpreted as a negative
                offset from the type-byte of the header of the
                ofs-delta entry (the size above is the size of
                the delta data that follows).
          delta data, deflated.

     offset encoding:
          n bytes with MSB set in all but the last one.
          The offset is then the number constructed by
          concatenating the lower 7 bit of each byte, and
          for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1))
          to the result.
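
The offset encoding can be decoded incrementally: adding one before each
shift contributes exactly the 2^7 + 2^14 + ... terms described above.
The sketch below uses a hypothetical helper name,
`parse_ofs_delta_offset`, and expects `pos` to point just past the
type-and-size header of an OFS_DELTA entry.

```python
def parse_ofs_delta_offset(buf, pos=0):
    """Decode an ofs-delta offset; return (offset, next_position)."""
    byte = buf[pos]
    pos += 1
    offset = byte & 0x7F
    while byte & 0x80:          # MSB set in all but the last byte
        byte = buf[pos]
        pos += 1
        # "+ 1" before the shift adds the 2^(7*k) term for this byte.
        offset = ((offset + 1) << 7) | (byte & 0x7F)
    return offset, pos
```

For example, the bytes 0x81 0x05 concatenate to 133, and adding 2^7
gives 261. The base object then starts at the position of the ofs-delta
entry's type byte minus this offset.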


== Version 2 pack-*.idx files support packs larger than 4 GiB, and
   have some other reorganizations.  They have the format:

  - A 4-byte magic number '\377tOc' which is an unreasonable
    fanout[0] value.

  - A 4-byte version number (= 2)

  - A 256-entry fan-out table just like v1.

  - A table of sorted 20-byte SHA-1 object names.  These are
    packed together without offset values to reduce the cache
    footprint of the binary search for a specific object name.

  - A table of 4-byte CRC32 values of the packed object data.
    This is new in v2 so compressed data can be copied directly
    from pack to pack during repacking without undetected
    data corruption.

  - A table of 4-byte offset values (in network byte order).
    These are usually 31-bit pack file offsets, but large
    offsets are encoded as an index into the next table with
    the msbit set.

  - A table of 8-byte offset entries (empty for pack files less
    than 2 GiB).  Pack files are organized with heavily used
    objects toward the front, so most object references should
    not need to refer to this table.

  - The same trailer as a v1 idx file:

    A copy of the 20-byte SHA-1 checksum at the end of
    corresponding packfile.

    20-byte SHA-1-checksum of all of the above.
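
To illustrate how the two offset tables fit together, the sketch below
resolves the pack offset of the i-th object in a version 2 index. It
assumes the whole index is in memory and that the tables appear in the
order listed above; `pack_offset_v2` is a hypothetical name.

```python
import struct


def pack_offset_v2(idx, i):
    """Return the pack file offset of the i-th object in a v2 index."""
    magic, version = struct.unpack(">4sI", idx[:8])
    if magic != b"\xfftOc" or version != 2:
        raise ValueError("not a version 2 pack index")
    # fanout[255] holds the total number of objects.
    total = struct.unpack(">I", idx[8 + 255 * 4:8 + 256 * 4])[0]
    if not 0 <= i < total:
        raise IndexError("object index out of range")
    names = 8 + 256 * 4              # table of 20-byte object names
    crcs = names + 20 * total        # table of 4-byte CRC32 values
    small = crcs + 4 * total         # table of 4-byte offsets
    large = small + 4 * total        # table of 8-byte offsets
    value = struct.unpack(">I", idx[small + 4 * i:small + 4 * i + 4])[0]
    if value & 0x80000000:
        # msbit set: the low 31 bits index the 8-byte offset table.
        j = value & 0x7FFFFFFF
        value = struct.unpack(">Q", idx[large + 8 * j:large + 8 * j + 8])[0]
    return value
```

Only objects stored beyond the 31-bit range take the second lookup, so
for most packs the 8-byte table is never consulted.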