Parallelization across regions
==============================

.. toctree::
    :hidden:

Given that the most common data structure is a counts dict (whose keys are the
region names in our dataset), we often want to call a function for each region
in this dictionary::

    >>> result = {region: fn(counts[region]) for region in counts}

This pattern may become even more complicated if ``fn()`` returns a tuple, for
example. Furthermore, it is clear that the overall operation is "embarrassingly
parallel" with respect to the regions being processed.

In order to simplify our code, reduce redundancy, and gain the benefits of
parallel execution, we introduce a new decorator: ``@parallelize_regions``,
which can be found in the subpackage :mod:`lib5c.util.parallelization`.

This decorator allows you to write ``fn()`` just once, as if it processes only
one matrix, and then call it with either one matrix or an entire counts dict,
whichever is convenient.

For example, we can write ::

    from lib5c.util.parallelization import parallelize_regions

    @parallelize_regions
    def fn(matrix):
        return matrix + 1

and then call this function via ::

    result_counts = fn(counts)

or alternatively, ::

    result_matrix = fn(counts['Sox2'])

as is convenient for us.

Mechanism and caveats
---------------------

The following sections dig into the mechanics behind the
``@parallelize_regions`` decorator and highlight some important features and
caveats.

First positional argument dependence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``@parallelize_regions`` decorator works by first checking whether the
first argument passed to the decorated function is a dict. If it is not, the
decorator does nothing, and the function is executed as normal. If it is a
dict, the execution of the function is parallelized across the keys of that
dict.
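
The dispatch rule described above can be illustrated with a minimal sketch.
This is not the code in :mod:`lib5c.util.parallelization`:
``parallelize_regions_sketch`` and ``add_one`` are hypothetical names, the
region name ``'Klf4'`` is made up, plain integers stand in for matrices, and an
ordinary loop stands in for the parallel execution.

```python
import functools


def parallelize_regions_sketch(fn):
    """Hypothetical, serial stand-in for @parallelize_regions."""
    @functools.wraps(fn)
    def wrapper(first, *args, **kwargs):
        # First positional argument is not a dict: behave exactly like
        # the undecorated function.
        if not isinstance(first, dict):
            return fn(first, *args, **kwargs)
        # First positional argument is a dict: call fn once per key.
        # The real decorator runs these calls in parallel; this sketch
        # (an assumption-laden simplification) just loops over the keys.
        return {region: fn(first[region], *args, **kwargs)
                for region in first}
    return wrapper


@parallelize_regions_sketch
def add_one(value):
    return value + 1


single = add_one(1)                            # one "matrix"
per_region = add_one({'Sox2': 1, 'Klf4': 2})   # a counts-like dict
```

The same decorated function accepts either input: ``single`` is a plain value,
while ``per_region`` is a dict with the same keys as the input dict.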

Because the decorator dispatches on whether the first positional argument is a
dict, if the non-parallelized version of ``fn()`` itself expects a dict as its
first positional argument, you will not be able to use the same name for both
the parallel and non-parallel versions of the function. To work around this,
you can define ::

    from lib5c.util.parallelization import parallelize_regions

    def fn(somedict):
        return somedict

    fn_parallel = parallelize_regions(fn)

and then you can call ``fn(somedict)`` when you want the non-parallelized
version and ``fn_parallel(doubledict)`` when you want the parallelization
(here ``doubledict`` is a dict whose values are the dicts that ``fn()``
expects).

Per-region args and kwargs
~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, ``@parallelize_regions`` will simply copy all the other args and
kwargs to each region's invocation of ``fn()``. In other words, when you call
``fn(counts, arg_1, arg_2)``, the following will be executed::

    fn(counts['region_1'], arg_1, arg_2)
    fn(counts['region_2'], arg_1, arg_2)
    ...

However, if any arg or kwarg is a dict which has the same keys as the first
positional argument (or, if the arg is a nested dict, if its second level has
these same keys), the arg will be replaced with each region's entry in that
dict. In other words, if we call ``fn(counts, primermap)``, where ``primermap``
is a dict whose keys match ``counts``, the following will be executed::

    fn(counts['region_1'], primermap['region_1'])
    fn(counts['region_2'], primermap['region_2'])
    ...

This substitution is performed on an arg-by-arg basis, so you can use any
mixture of normal and "regional dictionary" arguments when calling the
function.
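
The arg-by-arg substitution can be sketched as follows. This is a simplified
illustration rather than the library's implementation: ``resolve_arg``,
``call_per_region``, and ``describe`` are hypothetical helpers, strings stand in
for matrices and primer lists, the calls run serially, and the nested-dict case
is left out because the text above does not spell out the shape of the
substituted value.

```python
def resolve_arg(arg, region, region_keys):
    """Return arg[region] if arg is a "regional dictionary", else arg."""
    # Assumption: "same keys" means the key sets are exactly equal.
    if isinstance(arg, dict) and set(arg) == set(region_keys):
        return arg[region]
    return arg


def call_per_region(fn, counts, *args, **kwargs):
    """Serial stand-in for what the decorator does with extra arguments."""
    results = {}
    for region in counts:
        # Each arg and kwarg is resolved independently, so regional dicts
        # and ordinary values can be mixed freely.
        region_args = [resolve_arg(a, region, counts) for a in args]
        region_kwargs = {name: resolve_arg(value, region, counts)
                         for name, value in kwargs.items()}
        results[region] = fn(counts[region], *region_args, **region_kwargs)
    return results


def describe(matrix, primers, scale=1):
    # Simply report which arguments this region's call received.
    return (matrix, primers, scale)


counts = {'region_1': 'm1', 'region_2': 'm2'}
primermap = {'region_1': 'p1', 'region_2': 'p2'}
calls = call_per_region(describe, counts, primermap, scale=10)
```

Here ``primermap`` is split per region because its keys match ``counts``, while
``scale=10`` is passed unchanged to every call.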

Automatic result unpacking
~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's say ``fn()`` returns a tuple, for example::

    from lib5c.util.parallelization import parallelize_regions

    @parallelize_regions
    def fn(matrix):
        return matrix + 1, matrix - 1

When we call ``fn()`` on a single matrix, we expect to see ::

    bigger_matrix, smaller_matrix = fn(matrix)

The same thing will work when calling ``fn()`` on a counts dict::

    bigger_counts_dict, smaller_counts_dict = fn(counts)

In this case ``bigger_counts_dict`` and ``smaller_counts_dict`` will each be
dicts whose keys match the keys of ``counts``.

Fallback to series execution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error is encountered during the parallel processing, the decorator will
attempt to re-run the same job in series, in hopes that this will result in a
more readable stack trace.

Signature preservation
~~~~~~~~~~~~~~~~~~~~~~

``@parallelize_regions`` is itself decorated by the ``@pretty_decorator``
meta-decorator, which can be found in :mod:`lib5c.util.pretty_decorator`. This
allows the signature of the decorated function to be preserved through the
decoration process.
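
To illustrate what signature preservation means in practice, the sketch below
uses only the standard library (``functools.wraps`` and
``inspect.signature``). It is an analogy, not a description of how
``@pretty_decorator`` is implemented; the ``passthrough`` decorator and the
``offset`` parameter are hypothetical.

```python
import functools
import inspect


def passthrough(fn):
    """Hypothetical decorator that forwards every call unchanged."""
    @functools.wraps(fn)  # copies __name__, __doc__ and sets __wrapped__
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


@passthrough
def fn(matrix, offset=1):
    """Add offset to matrix."""
    return matrix + offset


# inspect.signature follows __wrapped__, so the decorated function reports
# the original parameters rather than (*args, **kwargs).
signature_text = str(inspect.signature(fn))
```

With the signature preserved, tools such as ``help()`` and documentation
generators show the parameters of the original function instead of a generic
``(*args, **kwargs)`` wrapper.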