The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
As a rule of thumb, aim for around 10-100 maps per node. Task setup takes a while, so it is best if each map takes at least a minute to execute.
For example, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with
about 82,000 maps (10TB / 128MB = 81,920). To influence the number of maps, you can set the mapreduce.job.maps parameter (which only provides a hint to the framework).
Ultimately, the number of tasks is controlled by the number of splits returned by the InputFormat.getSplits() method (which you can override).
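
The arithmetic above can be checked with a small, standalone sketch. It does not use the Hadoop API; the class and method names (MapCountEstimate, splitSize, estimateMaps) are made up for illustration. The split-size rule shown, max(minSize, min(maxSize, blockSize)), is the one FileInputFormat applies, but the estimate is simplified: it treats the input as one large file and ignores per-file boundaries and the small slack FileInputFormat allows on the last split, so real jobs may report a slightly different count.

```java
// Rough estimate of the number of map tasks for a given input size.
// Hypothetical helper, not part of Hadoop.
final class MapCountEstimate {
    static final long MB = 1024L * 1024L;
    static final long TB = MB * 1024L * 1024L;

    // Split size rule: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Ceiling division: a trailing partial split still needs its own map.
    static long estimateMaps(long inputBytes, long splitBytes) {
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        // With no min/max override, the split size equals the block size.
        long split = splitSize(128 * MB, 1L, Long.MAX_VALUE);
        System.out.println(estimateMaps(10 * TB, split)); // prints 81920
    }
}
```

Raising the minimum split size above the block size (for example to 256MB) would halve the estimate, which is one way to get fewer, longer-running maps.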