PigMix is a set of 17 Pig programs used as a benchmark to compare the performance of the Pig programming language with that of hand-coded Java running in a Hadoop environment. The algorithms were chosen and coded by the Pig community; they are intended to be representative of what Pig is used for and to embody best practices for how to do it.
Here a->b->c is the pipeline before order-by. Previously, Pig would write c to disk first, and the sampler would then draw its samples from c. Now, to avoid writing c to disk, the sampler loads a, draws samples from it, and passes them through b and c to generate the partition file. Here b and c can be projections, filters, or any other non-blocking operators.
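The idea of sampling a and pushing only the samples through b and c can be sketched in plain Python. This is a minimal illustration, not Pig's actual sampler: the record layout (a `score` field), the choice of a filter for b and a projection for c, and the function names `pipeline`, `partition_boundaries`, and `assign_partition` are all assumptions made for the example.

```python
import bisect
import random


def pipeline(records):
    """Hypothetical map-only pipeline a -> b -> c (both steps non-blocking)."""
    b = (r for r in records if r["score"] >= 0)  # b: filter
    c = (r["score"] for r in b)                  # c: projection to the sort key
    return c


def partition_boundaries(source, num_partitions, sample_size, seed=0):
    """Sample the raw input a, run only the samples through b and c,
    and derive cut points (a stand-in for the partition file)."""
    rng = random.Random(seed)
    sample = rng.sample(source, min(sample_size, len(source)))
    keys = sorted(pipeline(sample))
    if not keys:
        return []
    # Pick num_partitions - 1 evenly spaced keys as range boundaries.
    return [keys[(i * len(keys)) // num_partitions]
            for i in range(1, num_partitions)]


def assign_partition(key, boundaries):
    """Return the index of the range partition that a sort key falls into."""
    return bisect.bisect_right(boundaries, key)
```

Note that c is never materialized for the full input: only the sampled records pass through the pipeline when the boundaries are computed. The full input would pass through the same pipeline again inside the order-by job itself, with `assign_partition` deciding which reducer receives each key.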
HPCC Systems provides a utility program called Bacon, which can automatically translate Pig programs into the equivalent ECL. The Bacon-translated versions of the PigMix tests are presented below.
Previously, for order-by, Pig would force any preceding pipeline to finish and write its output to disk, and only then sample and sort the data, so the sampler saw exactly the data that would be sorted. Now we want to merge the preceding map-only pipeline into both the sampler and the order-by job. The sampler samples the data before that pipeline and passes the samples through the pipeline to generate the partition file. See the query: