Memory Management

One of the biggest confusing factors on Spark is how to manage the memory. Proper tuning requires a deep understanding of both Spark and JVM memory allocation.

Reference: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

The Basic Structure

SparkMemoryDiagram

There are 4 levels of memory to tune in Spark:

  • JVM
  • Yarn Container
  • Spark Task
  • Compute vs. storage

JVM options:

reference: https://docs.oracle.com/cd/E19900-01/819-4742/abeik/index.html

The JVM level is controlled on Spark session launch with the -X... commands the most commonly set

-Xmx = maximum JVM memory

-Xms = minimum JVM memory; this is useful in that it avoids the cost of reallocating JVM memory if you know you'll need a lot.

-XX:+UseGIGC = set the garbage collector to the newer (and faster)

In addition, the JVM itself is a complicated tangle of memory allocations.

  • Split into heap and off-heap
  • In heap, there are a hierarchy of different memory levels

    • Generations: young and old

      • garbage collection
        • within young generation is fast
        • between generations is slow
      • young: temporary

        • Buckets:
          • Eden
            • first point of entry
          • Survivor
            • Referenced variables moved here when Eden fills up and GC happens
            • 2 survivor buckets
              • one empty and one full
              • when GC is done, the variables to move from Eden are combined with the variables in full survivor and stored in empty survivor.
      • old: tenured / permanent

        • Must be at least as big as young
        • Garbage collection from young to old is slow

JVMmemory All variables start in the Eden bucket in the Young generation

When it is time for garbage collection, the variables that are referenced in Eden get sent to one of the survivor buckets (empty to start). All the variables in the other survivor bucket are also moved to the once empty bucket.

Garbage collection that moves data within the young generation is fast; garbage collection that moves data between generations is slow

Garbage collection log analysis - look for tools to use to do this

Spark options :

This includes both spark task and compute vs. storage options.

spark.memory.fraction = set the maximum memory that Spark can actively use.

spark.memory.storage fraction = the minimum amount of memory that must be reserved for storage

spark.memory.executor.memoryOverhead = the executor heap size. This is added to the executor memory to determine full memory request to Yarn for each executor.

Yarn Options:

yarn.scheduler.minimum-allocation-mb: minimum memory allocated to the yarn task

yarn.scheduler.increment-allocation-mb: memory increment for yarn tasks.

yarn.nodemanager.resource.memory-mb controls the maximum sum of memory used by the containers on each node.

yarn.nodemanager.resource.cpu-vcores controls the maximum sum of cores used by the containers on each node.

results matching ""

    No results matching ""