======
Idioms
======

There are certain types of computation that show up again and again in
parallel computing. This section shows how to perform them with jug.

Map/Reduce
----------

Currently, the dominant paradigm for large scale distributed computing is the
map/reduce paradigm. Originally made prominent by Google's proprietary
implementation, it is now available in many implementations, including
open-source ones such as Hadoop.

Jug is not a direct competitor to Hadoop, as it focuses on medium-sized
problems. Jug does not implement a distributed file system of any sort, but
assumes that all compute nodes have access to a central repository of
information (such as a shared filesystem or a redis server). On the other
hand, jug supports much more complex computations than map/reduce.

If, however, your problem is naturally described as a map/reduce computation,
then *jug* has some helper functions.

``jug.mapreduce.mapreduce``
---------------------------

The ``jug.mapreduce.mapreduce`` function implements map/reduce::

    jug.mapreduce.mapreduce(reducer, mapper, inputs)

is roughly equivalent to::

    Task{ reduce(reducer, map(mapper, inputs)) }

if the syntax of Python supported such a construct.
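For reference, here is the sequential computation that the call above
describes, as a sketch in plain Python (note that in Python 3, ``reduce``
lives in ``functools``; ``plain_mapreduce`` is an illustrative name, not part
of jug)::

    from functools import reduce

    def plain_mapreduce(reducer, mapper, inputs):
        # Single-process equivalent of what jug.mapreduce.mapreduce
        # computes, without the Task wrapper or any parallelism
        return reduce(reducer, map(mapper, inputs))

    # Example: sum of squares of 0..9
    print(plain_mapreduce(lambda a, b: a + b, lambda x: x * x, range(10)))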
An issue that might come up is that your *map* function can be **too fast**.
A good task should take at least a few seconds (otherwise, the overhead of
scheduling and loading the data overwhelms the performance advantages of
parallelism). The same holds for the *reduce* function.

Therefore, ``jug`` groups your inputs so that a single mapping task actually
consists of mapping and partially reducing more than one input. How many
inputs are grouped into one task is controlled by the ``map_step`` parameter.
By default, it is set to 4. Similarly, the ``reduce_step`` parameter controls
how many reduction steps to perform in a single task (by default, 8;
reflecting the fact that reduce operations tend to be lighter than map
operations).
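Putting it together, a minimal sketch of a jugfile using
``jug.mapreduce.mapreduce`` might look like the following (``double`` and
``add`` are illustrative names, not part of jug; ``map_step`` and
``reduce_step`` are the parameters discussed above)::

    from jug.mapreduce import mapreduce

    def double(x):
        # mapper: called on each input element
        return 2 * x

    def add(a, b):
        # reducer: combines two mapped values
        return a + b

    # Group 16 inputs per map task and 32 values per reduce task,
    # so that each task does a meaningful amount of work
    total = mapreduce(add, double, list(range(1000)),
                      map_step=16, reduce_step=32)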
The *compound task* section has a worked-out example of using map/reduce.

Parameter Sweep
---------------

This is a standard problem in many fields, for example, in machine learning.
You have an algorithm that takes a few parameters (let's call them ``p0`` and
``p1``) and a function which takes your input data (``data``) and the
parameters and outputs the score of this parameter combination.

In pure Python, we'd write something like::

    best = None
    best_val = float("-Inf")
    for p0 in range(100):
        for p1 in range(-20, 20):
            cur = score(data, p0, p1)
            if cur > best_val:
                best = p0, p1
                best_val = cur
    print('Best parameter pair', best)

This is, obviously, an **embarrassingly parallel** problem and we want *jug*
to handle it.

First note: we can, of course, perform this with a map/reduce. The mapper
receives a single ``(p0, p1)`` pair from the input list::

    def mapper(params):
        p0, p1 = params
        return (p0, p1, score(data, p0, p1))

    def reducer(a, b):
        _, _, av = a
        _, _, bv = b
        if av > bv:
            return a
        return b

    best = jug.mapreduce.mapreduce(
                reducer,
                mapper,
                [(p0, p1) for p0 in range(100) for p1 in range(-20, 20)])

However, if you want to look at the whole parameter space instead of just the
best score, this will not work. Instead, you can do::

    from jug import TaskGenerator
    score = TaskGenerator(score)

    results = {}
    for p0 in range(100):
        for p1 in range(-20, 20):
            results[p0, p1] = score(data, p0, p1)

Now, *after* you've run ``jug execute``, you can use ``jug shell`` and load
the ``results`` dictionary to look at all the results::

    results = value(results)
    print(results[0, -2])

    # Look for the maximum score
    print(max(results.values()))

    # Look at maximum score *and* the parameters that generated it:
    print(max((v, k) for k, v in results.items()))
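If you prefer to compute the best pair inside the pipeline itself, rather
than interactively in ``jug shell``, the following sketch is one way to do
it. It relies on jug resolving tasks nested inside list and dict arguments
before the task runs; ``find_best`` is an illustrative name, not a jug
function::

    from jug import TaskGenerator

    @TaskGenerator
    def find_best(results):
        # results maps (p0, p1) -> score; by the time this task runs,
        # each inner task has been replaced by its computed value
        (p0, p1), best_score = max(results.items(), key=lambda kv: kv[1])
        return p0, p1, best_score

    best = find_best(results)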