Yarn

YARN allows multiple processing engines to use Hadoop as a common platform, so that batch, interactive, and real-time engines can access the same data set simultaneously. Many applications can run on YARN, such as MapReduce for batch processing, Storm for real-time stream processing, or Spark for in-memory iterative processing.

YARN comes bundled with MapReduce 2.0 (MRv2). MapReduce has undergone a complete overhaul in MRv2. The fundamental idea of MRv2's YARN architecture is to split the two primary responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager (RM) and per-application ApplicationMasters (AMs). With MRv2, the ResourceManager (RM) and per-node NodeManagers (NMs) form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on slave nodes instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

Configure YARN

Install YARN

Yarn requires the following services:

  • ResourceManager is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs). This service should run on a node that is not running YARN NodeManagers, HDFS NameNodes, or HBase Masters. As an exception, if you are short on nodes, running a YARN ResourceManager on a node that runs an HDFS NameNode is fine, but keep it off the HBase Master if possible.
  • JobHistory Server stores the history of jobs. It can be recreated if destroyed; no history is required to run YARN. The JobHistory Server should run on the same node as the YARN ResourceManager.
  • NodeManager(s) take instructions from the ResourceManager and manage the resources available on a single node. Try to keep YARN NodeManagers separate from HBase RegionServers, because both require a lot of Java heap memory; for the same reason, try not to install a YARN NodeManager on a node with Oozie. Also try not to run NodeManagers on the same node as the ResourceManager.
  • Gateway – the Gateway (CDH distribution of YARN) stores YARN network configurations. Install a Gateway onto all APP servers.
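Once roles are assigned, a quick way to verify which of these daemons a given node is actually running (assuming the JDK's jps tool is on the PATH) is:

sudo jps | egrep 'ResourceManager|NodeManager|JobHistoryServer'

The ResourceManager host should list ResourceManager (and JobHistoryServer if they are co-located), while worker nodes should list only NodeManager.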

YARN Resource Configuration

Configure YARN with the following settings:

Recommended values are given for five node sizes: Small (8 GB memory, 2 CPUs), Medium (16 GB memory, 4 CPUs), Large (28 GB memory, 4 CPUs), Very Large (56 GB memory, 8 CPUs), and Extreme (128 GB memory, 8 CPUs). Where a calculation is shown, it explains how the recommended value is derived.

yarn.nodemanager.local-dirs (NodeManager Local Directories)
Description: List of directories on the local filesystem where a NodeManager stores intermediate data files.
Small: /space1/yarn/nm. Medium through Extreme: /space1/yarn/nm, /space2/yarn/nm.

yarn.nodemanager.log-dirs (NodeManager Container Log Directories)
Description: List of directories on the local filesystem where a NodeManager stores container log files.
Small: /space1/yarn/container-logs. Medium through Extreme: /space1/yarn/container-logs, /space2/yarn/container-logs.

yarn.app.mapreduce.am.resource.mb
Description: Physical memory for the ApplicationMaster.
Small: 1 GB | Medium: 2 GB | Large: 2 GB | Very Large: 3 GB | Extreme: 3 GB
Calculation: = 2 * RAM-per-Container

ApplicationMaster Java Maximum Heap Size
Description: The maximum heap size, in bytes, of the Java MapReduce ApplicationMaster. This number will be formatted and concatenated with 'ApplicationMaster Java Opts Base' to pass to Hadoop.
Small: 825955249 B | Medium: 1.5 GB | Large: 1.5 GB | Very Large: 2 GB | Extreme: 2 GB
Calculation: 75% of container space. In YARN, a container is the slice of memory and CPU in which your job runs.

Java Heap Size of NodeManager in Bytes
Description: Maximum size for the Java process heap memory. Passed to Java -Xmx.
Small: 1 GB | Medium: 1 GB | Large: 1 GB | Very Large: 2 GB | Extreme: 2 GB
Calculation: Run NodeManagers with 1 GB of heap while yarn.scheduler.maximum-allocation-mb < 16 GB.

Java Heap Size of ResourceManager in Bytes
Description: Maximum size for the Java process heap memory, in bytes. Passed to Java -Xmx.
Small: 1 GB | Medium: 1.5 GB | Large: 1.5 GB | Very Large: 3 GB | Extreme: 3 GB
Calculation: 75% of container space or larger. Not constrained to the size of the container.

Dump Heap When Out of Memory
Description: When set, generates a heap dump file (in /tmp) when java.lang.OutOfMemoryError is thrown.
All sizes: False
Note: Not required.

yarn.nodemanager.resource.memory-mb
Description: Amount of physical memory that can be allocated for all containers on a node.
Small: 6 GB | Medium: 6 GB | Large: 14 GB | Very Large: 30 GB | Extreme: 64 GB
Calculation: = Containers * RAM-per-Container

To determine how much memory is available across the entire YARN cluster: this value * number-of-nodes.

Details: The amount of memory allotted to a NodeManager for spawning containers should be the node's physical RAM minus all non-YARN memory demand, such as what is needed for the OS. So yarn.nodemanager.resource.memory-mb = total memory on the node - (sum of all memory allocations to other processes such as the OS, DataNode, NodeManager, RegionServer, etc.).

yarn.scheduler.minimum-allocation-mb
Description: The smallest amount of physical memory, in MiB, that can be requested for a container.
Small: 1 GB | Medium: 2 GB | Large: 2 GB | Very Large: 1 GB | Extreme: 1 GB
Calculation: = RAM-per-Container

To determine how many containers will be used per node: yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb.

yarn.scheduler.maximum-allocation-mb (Container Memory Maximum)
Description: The largest amount of physical memory, in MiB, that can be requested for a container.
Small: 4 GB | Medium: 4 GB | Large: 8 GB | Very Large: 8 GB | Extreme: 8 GB
Calculation: = containers * RAM-per-Container

This is the maximum allocation for a single container on a node, enforced by the ResourceManager for every container request. To determine maximum memory for an entire cluster, add the memory for all containers together.

yarn.scheduler.increment-allocation-mb
Description: If using the Fair Scheduler, memory requests will be rounded up to the nearest multiple of this number.
All sizes: 512 MB

yarn.app.mapreduce.am.command-opts
Description: Java command line arguments passed to the MapReduce ApplicationMaster.
All sizes: "-Djava.net.preferIPv4Stack=true"
Calculation: = 0.8 * 2 * RAM-per-Container
Note: Not required.

yarn.nodemanager.container-manager.thread-count
Description: Number of threads the container manager uses.
Small: 4 | Medium: 20 | Large: 20 | Very Large: 20 | Extreme: 20

yarn.resourcemanager.resource-tracker.client.thread-count
Description: Number of threads to handle resource tracker calls.
Small: 4 | Medium: 20 | Large: 50 | Very Large: 50 | Extreme: 50

yarn.nodemanager.resource.cpu-vcores
Description: Number of virtual CPU cores that can be allocated for containers.
Small: 4 | Medium: 4 | Large: 4 | Very Large: 6 | Extreme: 6
Calculation: = number of virtual cores used by all containers on the node. This value * number_of_nodes = total number of cores for the cluster.

yarn.scheduler.maximum-allocation-vcores
Description: The largest number of virtual CPU cores that can be requested for a container.
All sizes: 2
Calculation: yarn.nodemanager.resource.cpu-vcores / vcores-requested-per-container = the number of containers that can run concurrently on the node.
Note: This setting must be greater than or equal to the number of vcores asked for by the client.

yarn.scheduler.minimum-allocation-vcores
Description: The smallest number of virtual CPU cores that can be requested for a container.
All sizes: 1
Calculation: If using the Capacity or FIFO scheduler (or any scheduler prior to CDH 5), virtual core requests will be rounded up to the nearest multiple of this number.

mapreduce.map.memory.mb
Description: The amount of physical memory, in MiB, allocated for each map task of a job.
Small: 1 GB | Medium: 2 GB | Large: 3 GB | Very Large: 3 GB | Extreme: 4 GB
Calculation: Maximum map memory; should be more than yarn.scheduler.minimum-allocation-mb.

mapreduce.reduce.memory.mb
Description: The amount of physical memory, in MiB, allocated for each reduce task of a job.
Small: 1 GB | Medium: 2 GB | Large: 3 GB | Very Large: 3 GB | Extreme: 4 GB
Calculation: Maximum reduce memory; should be more than yarn.scheduler.minimum-allocation-mb.

mapreduce.map.cpu.vcores
Description: The number of virtual CPU cores allocated for each map task of a job.
All sizes: 1
Calculation: Number of virtual cores for a map task; should be < yarn.nodemanager.resource.cpu-vcores.

mapreduce.reduce.cpu.vcores
Description: The number of virtual CPU cores allocated for each reduce task of a job.
All sizes: 1
Calculation: Number of virtual cores for a reduce task; should be < yarn.nodemanager.resource.cpu-vcores.

mapreduce.map.java.opts.max.heap (Map Task Maximum Heap Size)
Description: The maximum Java heap size, in bytes, of the map processes. This number will be formatted and concatenated with 'Map Task Java Opts Base' to pass to Hadoop.
Small: 825955249 B | Medium: 1.5 GB | Large: 2.5 GB | Very Large: 2.5 GB | Extreme: 3 GB
Calculation: = 0.75 * RAM-per-Container. Should be less than mapreduce.map.memory.mb.

mapreduce.reduce.java.opts.max.heap (Reduce Task Maximum Heap Size)
Description: The maximum Java heap size, in bytes, of the reduce processes. This number will be formatted and concatenated with 'Reduce Task Java Opts Base' to pass to Hadoop.
Small: 825955249 B | Medium: 1.5 GB | Large: 2.5 GB | Very Large: 2.5 GB | Extreme: 3 GB
Calculation: = 0.75 * RAM-per-Container. Should be less than mapreduce.reduce.memory.mb.

mapreduce.task.io.sort.factor
Description: The number of streams to merge at the same time while sorting files on the reducer side (more streams merged at once while sorting files). This determines the number of open file handles. Merging more files in parallel reduces merge sort iterations and improves run time by eliminating disk I/O.
Small: 4 | Medium: 5 | Large: 10 | Very Large: 20 | Extreme: 20
Note: Merging more files in parallel uses more memory. If io.sort.factor is set too high or the maximum JVM heap is set too low, excessive garbage collection will occur. The Hadoop default is 10, but Cloudera recommends a higher value.

mapreduce.task.io.sort.mb
Description: Sort memory buffer; comes out of the JVM heap.
Small: 256 MB | Medium: 512 MB | Large: 512 MB | Very Large: 512 MB | Extreme: 768 MB
Calculation: JVM heap - this value = total usable heap space.

mapreduce.job.reduces
Description: The default number of reduce tasks per job.
Small: 1 | Medium: 6 | Large: 6 | Very Large: 9 | Extreme: 9
Note: Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapreduce.jobtracker.address is "local".

mapreduce.reduce.shuffle.parallelcopies
Description: The default number of parallel transfers run by reduce during the copy (shuffle) phase.
Small: 4 | Medium: 10 | Large: 10 | Very Large: 10 | Extreme: 10
Calculation: Should be between sqrt(nodes * number_of_map_slots_per_node) and nodes * number_of_map_slots_per_node / 2.
Note: Setting this value too high increases CPU, memory, and network usage, and can lead to more disk spills and slow down your job.

mapreduce.client.submit.file.replication
Description: The replication level for submitted job files.
All sizes: 10
Note: Should be less than or equal to the number of DataNodes. When a jar is passed using the -libjars option, it is physically copied to the libs/ directory of the task working directory. This file is replicated mapreduce.client.submit.file.replication times because it has to be distributed to all the required nodes.

Node Capacities

YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores to control the amount of memory and CPU on each node that is available to jobs, for example maps and reduces. Set these configurations to the amount of memory and number of cores on the machine after subtracting the resources needed for other services. YARN's resource configuration is a careful balance between the size of the jobs and the throughput of the node. Jobs are encased in containers built from Java heap + application memory + CPU; in other words, a container is a bundle of memory and CPU.
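As an illustrative sketch (the node size here is hypothetical, not one of the tiers in the table above): on a 32 GB, 8-core node that reserves roughly 8 GB and 2 cores for the OS, DataNode, and other services, the node capacities would be:

yarn.nodemanager.resource.memory-mb=24576

yarn.nodemanager.resource.cpu-vcores=6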

Virtual Cores

To better handle varying CPU requests, YARN supports virtual cores (vcores), a resource meant to express parallelism. The “virtual” in the name is somewhat misleading – on the NodeManager, vcores should be configured equal to the number of physical cores on the machine. Tasks should be requested with vcores equal to the number of cores they can saturate at once. Currently vcores are very coarse – tasks will rarely want to ask for more than one of them, but a complementary axis that represents processing power may be added in the future to enable finer-grained resource configuration. Tasks that will use multiple threads can request more than 1 core with the mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores properties.

Rounding Request Sizes

Also noteworthy are the yarn.scheduler.minimum-allocation-mb, yarn.scheduler.minimum-allocation-vcores, yarn.scheduler.increment-allocation-mb, and yarn.scheduler.increment-allocation-vcores properties, which default to 1024, 1, 512, and 1 respectively. If tasks are submitted with resource requests lower than the minimum-allocation values, their requests will be set to these values. If tasks are submitted with resource requests that are not multiples of the increment-allocation values, their requests will be rounded up to the nearest increments.
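For example, with those default values a container request is adjusted as follows:

request of 700 MB and 1 vcore -> raised to the 1024 MB minimum -> 1024 MB, 1 vcore allocated

request of 1200 MB and 1 vcore -> rounded up to the next 512 MB increment -> 1536 MB, 1 vcore allocated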

Configure YARN for MapReduce

With YARN (as opposed to MRv1), memory is usually the most difficult resource to tune. However, MapReduce on a node with limited resources can be adjusted so that the map and reduce tasks complete successfully while using as little memory as possible. The relevant MapReduce configurations are contained in the following parameters:

mapreduce.reduce.shuffle.parallelcopies

mapreduce.task.io.sort.factor

mapreduce.map.memory.mb

mapreduce.reduce.memory.mb

Java Heap Size of NodeManager in Bytes

Java Heap Size of ResourceManager in Bytes

When the temporary files on the DataNode do not exist, all of the merging and shuffling happens in memory. To limit the load on the DataNode, you can increase the delay before the reducers start:

mapreduce.job.reduce.slowstart.completedmaps (0.7 would be sufficient)

mapreduce.task.io.sort.mb

Also, Hadoop discussion groups mention that the default value of the dfs.datanode.max.xcievers parameter, the upper bound on the number of files an HDFS DataNode can serve, is too low and causes ShuffleError. In HDFS, you can try setting this value to 2048.
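Putting those two suggestions together, a low-resource cluster could start from the values below (the numbers are the ones suggested above, not universal defaults); the first property belongs in mapred-site.xml and the second in hdfs-site.xml:

mapreduce.job.reduce.slowstart.completedmaps=0.7

dfs.datanode.max.xcievers=2048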

Running Small MapReduce Jobs with Uber

Running small MapReduce jobs (on small datasets) is more efficient when you use Uber because you remove the additional time that MapReduce normally spends spinning up and bringing down map and reduce processes. Uber jobs are jobs that are executed within the MapReduce ApplicationMaster. Instead of using the ResourceManager to create the map and reduce containers, the ApplicationMaster runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.

To enable uber jobs, you need to set the following property:

mapreduce.job.ubertask.enable=true
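Hadoop also exposes thresholds that decide which jobs are small enough to run as uber jobs. The values below are the usual Hadoop 2 defaults; verify them against your CDH version before relying on them:

mapreduce.job.ubertask.maxmaps=9

mapreduce.job.ubertask.maxreduces=1

mapreduce.job.ubertask.maxbytes defaults to the HDFS block size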

Example Configuration

To make all of this more concrete, let's use an example. Each node in the cluster has 24 GB of memory and 6 cores. Other services running on the nodes require 4 GB and 1 core, so we set yarn.nodemanager.resource.memory-mb to 20480 and yarn.nodemanager.resource.cpu-vcores to 5. If you leave the map and reduce task defaults of 1024 MB and 1 virtual core intact, you will have at most 5 tasks running at the same time. If you want each of your tasks to use 5 GB, set their mapreduce.(map|reduce).memory.mb to 5120, which would limit you to 4 tasks running at the same time. To calculate the total number of containers for the cluster, multiply yarn.nodemanager.resource.memory-mb by the number of nodes and divide by the memory used per container; repeat the calculation for CPU.
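To make the container math explicit, here is the same example worked through for a hypothetical 10-node cluster (the node count is an assumption for illustration):

Memory per node for YARN: yarn.nodemanager.resource.memory-mb = 20480 MB

Containers per node by memory: 20480 MB / 5120 MB per task = 4

Containers per node by CPU: 5 vcores / 1 vcore per task = 5

Containers per node: min(4, 5) = 4

Containers across the cluster: 4 * 10 nodes = 40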

Note: Running HortonWorks’ Yarn Utility script against our R200 on 8/25/14 revealed the following resource requirement for Yarn:

/tmp/hdp_manual_install_rpm_helper_files-2.0.6.101/scripts$ python yarn-utils.py -c 4 -m 8 -d 2 -k False

Using cores=4 memory=8GB disks=2 hbase=False

Profile: cores=4 memory=6144MB reserved=2GB usableMem=6GB disks=2

Num Container=4

Container Ram=1536MB

Used Ram=6GB

Unused Ram=2GB

yarn.scheduler.minimum-allocation-mb=1536

yarn.scheduler.maximum-allocation-mb=6144

yarn.nodemanager.resource.memory-mb=6144

mapreduce.map.memory.mb=1536

mapreduce.map.java.opts=-Xmx1228m

mapreduce.reduce.memory.mb=3072

mapreduce.reduce.java.opts=-Xmx2457m

yarn.app.mapreduce.am.resource.mb=3072

yarn.app.mapreduce.am.command-opts=-Xmx2457m

mapreduce.task.io.sort.mb=614

Reference: https://www.linkedin.com/pulse/article/20140706112523-176301000-yarn-resource-allocation

Some interesting performance tweaks for YARN: http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

Yarn Memory Calculations: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1-11.html

Test Yarn

1. Log onto a host in the cluster.

2. Run the Linux find command in a single container:

sudo -u hdfs hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -debug -shell_command find -shell_args '`pwd`' -jar `sudo find /opt/cloudera/parcels/ -name '*-distributedshell-*.jar' | head -n1` -container_memory 150 -master_memory 150

or

sudo -u hdfs hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -debug -shell_command find -shell_args '`pwd`' -jar `sudo find /opt/cloudera/parcels/ -name '*-distributedshell-*.jar' | head -n1` -container_memory 350 -master_memory 350

3. Run the Hadoop MapReduce PiEstimator example using one of the following commands (for Parcel installation):

sudo -u hdfs yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100

or

sudo -u hdfs yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 1 1

Note: If the job gets stuck at:

INFO mapreduce.Job:  map 0% reduce 0%

This is caused by too little memory for the MapReduce ApplicationMaster to run tasks.

Try bumping up the yarn.nodemanager.resource.memory-mb setting under Yarn and restarting the service.

4. View the results of running the job:

  • Cloudera Manager > Clusters > ClusterName > yarn > Applications

You will see an entry for the job in the yarn Applications list.

Reference: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_using-apache-hadoop/content/running_mapreduce_examples_on_yarn.html

Yarn Commands

List running Yarn applications:

yarn application -list

Kill a running Yarn application. After you retrieve the applicationId with the list command, run kill:

yarn application -kill application_1415884152130_0061
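Two other subcommands of the standard yarn CLI are often useful once you have the applicationId:

yarn application -status application_1415884152130_0061

yarn logs -applicationId application_1415884152130_0061

Note that yarn logs returns container logs only after the application has finished and only if log aggregation is enabled.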

Running an example application with YARN

Create a home directory on HDFS for the user who will be running the job (for example, joe):

sudo -u hdfs hadoop fs -mkdir /user/joe

sudo -u hdfs hadoop fs -chown joe /user/joe

Do the following steps as the user joe.

  1. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:

hadoop fs -mkdir input

hadoop fs -put /etc/hadoop/conf/*.xml input

hadoop fs -ls input

Found 3 items:

-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml

-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml

-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

  2. Set HADOOP_MAPRED_HOME for user joe:

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

  3. Run an example Hadoop job to grep with a regular expression in your input data:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input/*.xml output23 'dfs'

  4. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop:

hadoop fs -ls

Found 2 items

drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input

drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23

You can see that there is a new directory called output23.

  1. List the output files:

hadoop fs -ls output23

Found 2 items

drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS

-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000

  2. Read the results in the output file:

hadoop fs -cat output23/part-r-00000 | head

1 dfs.safemode.min.datanodes

1 dfs.safemode.extension

1 dfs.replication

1 dfs.permissions.enabled

1 dfs.namenode.name.dir

1 dfs.namenode.checkpoint.dir

1 dfs.datanode.data.dir

Count Number of Lines in a File

Count the number of rows in an HDFS file using Pig and a MapReduce job:

A = load '/asset/wm/subcategory/index/2015/05/24/index.txt';
B = group A all;
C = foreach B generate COUNT(A);
dump C;

Count Number of Words in a File

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;

grouped = GROUP words BY word;

wordcount = FOREACH grouped GENERATE group, COUNT(words);

DUMP wordcount;
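Either of the Pig scripts above can be run by saving it to a file and submitting it with the pig client in MapReduce mode (the file name is just an example):

pig -x mapreduce wordcount.pig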

Troubleshooting

Host Clock Offset Has Become Bad

Once in a while, VMs lose their connection to the time server or the ntp service stops. Time is critical to Hadoop; all services will stop if they detect that the clock is off.

You see this error thrown when the clock offset problem is encountered:

The health test result for HOST_CLOCK_OFFSET has become bad: The host’s NTP service is not synchronized to any remote server.

When this happens, Yarn's ResourceManager throws errors, Yarn's NodeManagers fail, and the HBase Master and RegionServers also show red.

To fix this restart ntp on the VM:

sudo service ntp restart;

Or even easier, run the Orchestrator Runbook:

Network\Update NTP on Ubuntu VM
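To confirm that the host is synchronized again (assuming the standard ntp tools are installed), check the peer list; the active peer is marked with an asterisk:

ntpq -p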

Yarn Example Stuck: map 0% reduce 0%

After starting a Yarn example:

sudo -u hdfs yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100

The job gets stuck at:

INFO mapreduce.Job:  map 0% reduce 0%

This is caused by too little memory for the MapReduce ApplicationMaster to run tasks.

Try bumping up the memory on the VM, or the yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores in the yarn-site.xml used by the NodeManager. Go back and review the YARN Configurations and make sure you didn’t miss something.

Yarn MapReduce Job Fails: error in shuffle in fetcher

The error in shuffle in fetcher at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run is caused by a lack of resources.

See the Configure YARN section for more information about how to configure YARN to handle nodes with low resources.
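As a starting point (these are suggestions drawn from the configuration table in Configure YARN, not a guaranteed fix), give the reducers more room and fetch less map output at once, for example:

mapreduce.reduce.memory.mb=3072

mapreduce.reduce.shuffle.parallelcopies=4

mapreduce.task.io.sort.factor=4

Also raise the Reduce Task Maximum Heap Size to roughly 0.75 of the new reduce container size.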

From the Yarn ResourceManager log:

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#6
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:297)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:287)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:411)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:341)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)

From Hue’s log:

2014-08-21 10:50:08,835 [main] ERROR org.apache.pig.tools.grunt.GruntParser  – ERROR 2997: Unable to recreate exception from backed error: Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#6

at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Caused by: java.lang.OutOfMemoryError: Java heap space

at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)

at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)

at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)

at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:297)

at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:287)

at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:411)

at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:341)

at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)