Git clone: a data-driven study on cloning behaviors
By Solmaz Abbaspoursani, GitHub
December 28, 2020
Here at GitHub, we use a data-driven approach to answer questions like these. We ran an experiment to compare these different clone options and measured the client and server behavior. It is not enough to just compare clone times; in this experiment, we aimed to understand how each option affects both the initial clone and the fetches that follow.

It is worth emphasizing that these results come from simulations that we performed in our controlled environments and do not simulate the complex workflows used by many Git users. Depending on your workflows and repository characteristics, these results may change. Perhaps this experiment provides a framework that you could follow to measure how your workflows are affected by these options. If you would like help analyzing your workflows, feel free to engage with GitHub's Professional Services team. For a summary of our findings, feel free to jump to our conclusions and recommendations.

To maximize the
repeatability of our experiment, we use open source repositories for
our sample data. This way, you can compare your repository shape to
the tested repositories to see which is most applicable to your
scenario.

We chose three open source repositories of varying sizes, including the Linux kernel repository. These repositories were mirrored to a GitHub Enterprise Server running version 2.22 on an 8-core cloud machine. We use an internal load testing tool based on Gatling to generate clone and fetch traffic. Once a test is complete, we use a combination of Gatling results and server-side measurements to analyze the scenarios.

The git-sizer tool
measures the size of Git repositories along many dimensions. In
particular, we care about the total size on disk along with the
count of each object type. The table below contains this information for our three test repositories.

We care about the following clone options:

- full clone (the default behavior)
- blobless partial clone (git clone --filter=blob:none)
- treeless partial clone (git clone --filter=tree:0)
- shallow clone (git clone --depth=1)

In addition to these options at clone time, we can also choose to fetch in a shallow way using git fetch --depth=1.

We organized our test scenarios into the following ten categories, labeled T1 through T10.
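For reference, these clone and fetch options correspond to commands like the following (the URL is a placeholder):

```shell
# Full clone: all reachable commits, trees, and blobs.
git clone https://example.com/repo.git full

# Shallow clone: only the objects at the tip commit.
git clone --depth=1 https://example.com/repo.git shallow

# Treeless partial clone: commits up front; trees and blobs on demand.
git clone --filter=tree:0 https://example.com/repo.git treeless

# Blobless partial clone: commits and trees up front; blobs on demand.
git clone --filter=blob:none https://example.com/repo.git blobless

# A shallow fetch can be run from any of these clones:
git -C full fetch --depth=1
```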
T1 to T4 simulate one of each of the four clone types. In partial clones,
the new blobs at the new ref tip are not downloaded until we
navigate to that position and populate our working directory with
those blob contents. To be a fair comparison with the full and
shallow clone cases, we also have our simulation populate the working directory at the fetched ref tip after each fetch.

In all the scenarios above, a single user was also set to repeatedly change three random files in the repository and push them to the same branch that the other users were cloning and fetching. This simulates repository growth, so each fetch has something new to download.

Let's dig into the
numbers to see what our experiment says. The full numbers are
provided in the tables below.

Unsurprisingly, shallow clone is the fastest clone for the client, followed by treeless and then blobless partial clones, and finally full clones. This performance is directly proportional to the amount of data required to satisfy the clone request. Recall that full clones need all reachable objects, blobless clones need all reachable commits and trees, and treeless clones need only the reachable commits. A shallow clone
is the only clone type that does not grow at all along with the
history of your repository. The performance
impact of these clone types grows in proportion to the repository
size, especially the number of commits.

As for server performance, we see that the Git CPU time per clone is higher for the blobless partial clone (T4) than for the full clone. If the server is sending more data in a full clone, then why is it spending more CPU
on a partial clone? When Git sends all reachable objects to the
client, it mostly transfers the data it has on disk without
decompressing or converting the data. However, partial and shallow
clones need to extract that subset of data and repackage it to send
to the client. We are investigating ways to reduce this CPU
cost in partial clones. The real fun
starts after cloning the repository and users start developing and
pushing code back up to the server. In the next section we analyze
scenarios T5 through T10, which focus on fetch performance.

The full fetch performance numbers are provided in the tables below, but let's
first summarize our findings. The biggest finding
is that shallow fetches are the worst possible option, in particular when fetching into a full clone. The technical reason is that the
existence of a “shallow boundary” disables an
important performance optimization on the server.
This causes the server to walk commits and trees to find what’s
reachable from the client’s perspective. This is more expensive in
the full clone case because there are more commits and trees on the client that the server must check to avoid sending duplicates. Also, as more shallow commits accumulate, the client needs to send more data
to the server to describe those shallow boundaries.

The two partial clone options have drastically different behavior in the fetch scenarios: a treeless clone must download missing trees on demand to complete each fetch and checkout, which adds significant cost as the repository grows. Due to these extra costs, we strongly recommend against shallow fetches and fetching from treeless partial clones. The only
recommended scenario for a treeless partial clone is for quickly
cloning on a build machine that needs access to the commit history,
but will delete the repository at the end of the build. Blobless partial
clones do increase the Git CPU costs on the server somewhat, but the
network data transfer is much less than a full clone or a full fetch
from a shallow clone. The extra CPU cost is likely to become less
important if your repository has larger blobs than our test
repositories. In addition, you have access to the full commit
history, which might be valuable to real users interacting with
these repositories. It is also worth
noting that we noticed a surprising result during our testing.
During the T9 and T10 tests for the Linux repository, our load generators encountered memory issues: under the heavy load we were running, these scenarios triggered more automatic garbage collection (GC) runs. GC in the Linux repository is expensive and involves a full repack of all Git data. Since we were testing on a Linux client, the GC processes were launched in the background to avoid blocking our foreground commands. However, as we kept fetching, we ended up with several concurrent background processes; this is not a realistic scenario but an artifact of our synthetic load testing.

It is worth noting
that blobless partial clones might trigger automatic garbage
collection more often than a full clone. This is a natural byproduct
of splitting the data into a larger number of small requests. We
have work in progress to make Git's repository maintenance more
flexible, especially for large repositories where a full repack is
too time-consuming. Look forward to more updates about that feature
here on the GitHub blog.

Our experiment demonstrated some performance differences between these different clone
and fetch options. Your mileage may vary! Our experimental load was
synthetic, and your repository shape can differ greatly from these
repositories. Here are some common
themes we identified that could help you choose the right scenario
for your own usage.

If you are a
developer focused on a single repository, the best approach is to do
a full clone and then always perform a full fetch into that clone.
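As a sketch, that day-to-day loop might look like the following (the URL and branch name are placeholders):

```shell
# One-time setup: a full clone.
git clone https://example.com/repo.git repo
cd repo

# Day-to-day: always full fetches, never --depth.
git fetch origin
git merge origin/main        # or rebase, per your workflow
```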
You might deviate to a blobless partial clone if that repository is
very large due to many large blobs, as that clone will help you get
started more quickly. The trade-off is that some commands, such as git blame or a file-history walk, will be slower the first time they run because they must download old blob contents on demand.

In general,
calculating a shallow fetch is computationally more expensive
compared to a full fetch. Always use a full fetch instead of a shallow fetch, in both full and shallow clones.
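If you are already working in a shallow clone, one way out of these growing costs is to convert it into a full clone once; a sketch (the URL is a placeholder):

```shell
# A shallow clone records its cutoff commits in .git/shallow; every shallow
# fetch makes the client re-send those boundaries and the server re-walk
# commits and trees.
git clone --depth=1 https://example.com/repo.git repo
cd repo

# Prefer full fetches; --unshallow backfills the missing history once
# and removes the .git/shallow boundary file.
git fetch --unshallow
```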
Blobless partial
clones are particularly effective if you are using Git’s
sparse-checkout feature to
reduce the size of your working directory. The combination greatly
reduces the number of blobs you need to do your work.
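As an illustration, here is one way to combine the two (the URL, branch, and directory name are placeholders):

```shell
# Blobless clone without an initial checkout, then restrict the working
# directory to one directory before downloading any file contents.
git clone --filter=blob:none --no-checkout https://example.com/repo.git repo
cd repo
git sparse-checkout init --cone
git sparse-checkout set my/component   # hypothetical directory
git checkout main                      # downloads only the blobs you need
```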
Sparse-checkout does not reduce the required data transfer for
shallow clones.

Notably, we did not test repositories significantly larger than the Linux repository. Our test process also does not simulate a real-life situation where users have different workflows and work on different branches, and the set of Git commands analyzed in this study is small and not representative of a user's daily Git usage. We are
continuing to study these options to get a holistic view of how they
change the user experience.

A special thanks to Derrick Stolee and our Professional Services team for their efforts and sponsorship of this study!