split: add BENCHMARKING.md documentation file

Jeffrey Finkelstein 2021-12-31 13:16:38 -05:00
parent 70ca1f45ea
commit b37718de10

@@ -0,0 +1,47 @@
<!-- spell-checker:ignore testfile -->
# Benchmarking to measure performance
To compare the performance of the `uutils` version of `split` with the
GNU version of `split`, you can use a benchmarking tool like
[hyperfine][0]. On Ubuntu 18.04 or later, you can install `hyperfine` by
running:

    sudo apt-get install hyperfine

Next, build the `split` binary under the release profile:

    cargo build --release -p uu_split

Now, get a text file to test `split` on. The `split` program has three
main modes of operation: chunk by lines, chunk by bytes, and chunk by
lines with a byte limit. You may want to test the performance of `split`
with various shapes and sizes of input files and under various modes of
operation. For example, to test chunking by bytes on a large input file,
you can create a file named `testfile.txt` containing one million null
bytes like this (the `{1..1000000}` brace expansion requires a shell such
as Bash):

    printf "%0.s\0" {1..1000000} > testfile.txt

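
The three modes mentioned above can be tried on a small line-oriented input before moving to large files. The following is a minimal sketch, not part of the original instructions: the file name `lines.txt`, the output prefixes, and the chunk size of 1000 are all illustrative assumptions, and it uses whichever `split` is first on your `PATH`.

```shell
# Work in a scratch directory so the output chunks do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: 10,000 numbered lines (48,894 bytes in total).
seq 1 10000 > lines.txt

# Chunk by lines: at most 1000 lines per output file.
split -l 1000 lines.txt by_lines_

# Chunk by bytes: 1000 bytes per output file (the last one may be shorter).
split -b 1000 lines.txt by_bytes_

# Chunk by lines with a byte limit: whole lines, at most 1000 bytes per file.
split -C 1000 lines.txt by_line_bytes_
```

Concatenating the chunks of any one mode in name order (for example `cat by_bytes_*`) should reproduce `lines.txt`, which is a quick sanity check before timing anything.
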
For another example, to test chunking by bytes on a large real-world
input file, you could download a [database dump of Wikidata][1] or some
related files that the Wikimedia project provides. For instance, [this
file][2] contains about 130 million lines; it is gzip-compressed, so
decompress it before benchmarking.

Finally, you can compare the performance of the two versions of `split`
by running, for example,

    cd /tmp && hyperfine \
        --prepare 'rm x* || true' \
        "split -b 1000 testfile.txt" \
        "target/release/split -b 1000 testfile.txt"

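Since `split` creates a lot of files on the filesystem, I recommend
changing to the `/tmp` directory before running the benchmark. Because
the command runs from `/tmp`, `testfile.txt` must be located there and
the path to the `uutils` binary must resolve from there; an absolute
path to `target/release/split` in your checkout is the simplest choice.
The `--prepare` argument to `hyperfine` runs a specified command before
each timing run. We specify `rm x* || true` so that the output files from
the previous run of `split` (named with the default `x` prefix) are
removed before each run begins.
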
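
One way to avoid path problems when changing directories is to resolve absolute paths first and run the benchmark in a fresh scratch directory. This is only a sketch under stated assumptions: it is meant to be sourced from the root of the repository checkout, the helper name `bench_split` is hypothetical, the paths must not contain spaces, and it swaps `rm x* || true` for the equivalent `rm -f x*`. The `--warmup` option is a standard `hyperfine` flag that performs untimed runs first.

```shell
# Assumption: the current directory is the repository root, where
# `cargo build --release -p uu_split` was run and testfile.txt was created.
repo_root=$(pwd)
uu_split="$repo_root/target/release/split"
input="$repo_root/testfile.txt"

# Hypothetical helper: benchmark both implementations with the given
# split options, e.g. '-b 1000'. Paths are assumed to contain no spaces.
bench_split() {
    # A fresh scratch directory means `rm -f x*` only ever removes
    # chunks produced by this benchmark.
    scratch=$(mktemp -d)
    (
        cd "$scratch" &&
        hyperfine \
            --warmup 3 \
            --prepare 'rm -f x*' \
            "split $1 $input" \
            "$uu_split $1 $input"
    )
    rm -rf "$scratch"
}
```

With this in place, `bench_split '-b 1000'` compares the two versions in byte mode and `bench_split '-l 1000'` does the same in line mode.
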
[0]: https://github.com/sharkdp/hyperfine
[1]: https://www.wikidata.org/wiki/Wikidata:Database_download
[2]: https://dumps.wikimedia.org/wikidatawiki/20211001/wikidatawiki-20211001-pages-logging.xml.gz