Wide vs. Narrow Dependencies in Apache Spark

saurav omar
3 min read · Jun 27, 2020

What is Spark?

Wikipedia says: “Apache Spark is an open-source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.”

  • Apache Spark is a lightning-fast unified analytics engine for big data and machine learning.
  • It was originally developed at UC Berkeley in 2009.
  • Companies like Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of more than 8,000 nodes.
  • It has the largest open source community in big data, with over 1,000 contributors from more than 250 organizations.
  • If you have large amounts of data that require lower-latency processing than a typical MapReduce program can provide, Spark is the way to go.
  • It can run up to 100 times faster than MapReduce for iterative algorithms or interactive data mining.
  • Computations in Spark are represented as a DAG (Directed Acyclic Graph), officially described as a lineage graph over RDDs.
  • The lineage graph records how each RDD is derived from its parents, with the data itself distributed across different nodes.
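The lineage idea can be sketched in plain Python (a toy model, not the Spark API): each derived dataset records its parent and the function that produced it, so a lost result can always be recomputed from the chain.

```python
# Toy model of RDD lineage: each "RDD" remembers its parent and the
# function used to derive it. Nothing is computed until compute() is
# called, and a lost result can be rebuilt by walking the chain —
# this is the mechanism behind Spark's fault tolerance.
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, f):
        # record lineage only; no work happens here
        return ToyRDD(parent=self, fn=f)

    def compute(self):
        if self.data is not None:      # a base dataset
            return self.data
        # recompute from the parent using the recorded function
        return [self.fn(x) for x in self.parent.compute()]

base = ToyRDD(data=[1, 2, 3])
derived = base.map(lambda x: x + 1).map(lambda x: x * 2)
print(derived.compute())  # [4, 6, 8]
```

In real Spark you can inspect this chain with `rdd.toDebugString`, which prints the lineage of an RDD.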

Transformation:

A transformation is an operation that derives a new RDD from an existing one. Because RDDs in Spark are immutable, applying a transformation never modifies the original RDD; it returns a new RDD containing the transformed data.

  • Transformations are the core of how you express your business logic in Spark: each one is an intermediate operation converting one RDD into another, and together they build up the DAG.
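A minimal sketch of these semantics in plain Python (standing in for `rdd.map` and `rdd.filter`): the transformation returns a new collection and the source is left untouched, mirroring RDD immutability.

```python
# Plain-Python analogue of Spark transformation semantics.
data = [1, 2, 3, 4]

# analogous to rdd.map(lambda x: x * 2) — returns a NEW collection
doubled = list(map(lambda x: x * 2, data))

# analogous to rdd.filter(lambda x: x > 4) — again a new collection
filtered = list(filter(lambda x: x > 4, doubled))

print(data)      # [1, 2, 3, 4] — the original is unchanged (immutability)
print(filtered)  # [6, 8]
```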

There are two types of transformations in Spark:

  • Wide Transformations
  • Narrow Transformations

Let’s look at them one by one.

Narrow Transformations:

These transformations map each input partition to exactly one output partition: each partition of the parent RDD is used by at most one partition of the child RDD, so every child partition depends on a single parent partition.

  • These transformations are typically fast.
  • They require no data shuffling over the cluster network and no data movement between partitions.
  • map() and filter() are examples of narrow transformations.
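Modeling an RDD as a list of partitions in plain Python (a toy sketch, not the Spark API) makes the narrow dependency visible: each output partition is computed from exactly one input partition, with no movement of records across partition boundaries.

```python
# Model an RDD as a list of partitions (plain lists). A narrow
# transformation such as map() processes each partition independently:
# one input partition -> one output partition, no cross-partition traffic.
partitions = [[1, 2], [3, 4], [5, 6]]

def narrow_map(parts, f):
    # each output partition depends on exactly one input partition
    return [[f(x) for x in p] for p in parts]

out = narrow_map(partitions, lambda x: x * 10)
print(out)  # [[10, 20], [30, 40], [50, 60]] — boundaries preserved
```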

Wide Transformations:

In these transformations, a single input partition can contribute to many output partitions: each partition of the parent RDD may be used by multiple partitions of the child RDD, so a child partition can depend on many parent partitions.

  • Slower than narrow transformations: performance can suffer significantly because data may need to be shuffled between nodes to create the new partitions.
  • They generally require data shuffling over the cluster network.
  • Functions such as groupByKey(), aggregateByKey(), aggregate(), join(), and repartition() are examples of wide transformations.
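Using the same list-of-partitions toy model, a plain-Python sketch of what groupByKey() must do (a simplification of Spark's hash shuffle, not its actual implementation) shows why wide transformations shuffle: records sharing a key can live in different input partitions, so every output partition may depend on every input partition.

```python
from collections import defaultdict

# Input RDD modeled as partitions of (key, value) pairs. Note that
# key "a" appears in BOTH input partitions.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

def shuffle_group_by_key(parts, num_out=2):
    # hash-partition every record by key into an output partition:
    # this is the shuffle — each output partition can receive records
    # from ALL input partitions.
    out = [defaultdict(list) for _ in range(num_out)]
    for p in parts:
        for k, v in p:
            out[hash(k) % num_out][k].append(v)
    return [dict(d) for d in out]

grouped = shuffle_group_by_key(partitions)
# all values for "a" now sit together in one output partition,
# regardless of which input partition they came from
```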

When working with Spark, it is always good to keep in mind which operations and transformations require data shuffling, since shuffling slows down processing. Try to optimize your jobs and minimize the use of wide dependencies as much as you can.

Happy Learning!!
