
Spark Interview Questions

What is Apache Spark?

  • Apache Spark is an open-source cluster computing framework for large-scale data processing, spanning batch, interactive, and near real-time workloads.
  • Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
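
A minimal sketch of the classic word-count job, assuming a local run and a hypothetical input file input.txt:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point to Spark (since 2.0)
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")          // run locally on all cores
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("input.txt")       // hypothetical input path
      .flatMap(_.split("\\s+"))    // lines -> words
      .map(word => (word, 1))
      .reduceByKey(_ + _)          // sum counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```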

Explain the key features of Apache Spark.

  • Polyglot
  • Speed
  • Multiple Format Support
  • Lazy Evaluation
  • Real Time Computation
  • Hadoop Integration

Define partitions.

  • A partition is a smaller, logical division of data, similar to a split in MapReduce.
  • Partitioning is the process of dividing data into logical units so they can be processed in parallel, which speeds up data processing.
  • Every RDD in Spark is divided into partitions, as the sketch below shows.
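
A quick sketch of inspecting and changing an RDD's partitioning, assuming an existing SparkSession named spark:

```scala
val sc  = spark.sparkContext                        // assumes an existing SparkSession `spark`
val rdd = sc.parallelize(1 to 1000, numSlices = 8)  // request 8 partitions

println(rdd.getNumPartitions)    // 8
val wider = rdd.repartition(16)  // full shuffle into 16 partitions
val fewer = rdd.coalesce(2)      // merge down to 2 partitions without a full shuffle
```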

What do you understand by Transformations in Spark?

  • Transformations are functions applied to RDDs that produce new RDDs. They do not execute until an action occurs.
  • Functions such as map() and filter() are examples of transformations: map() applies the given function to every element of the RDD, producing a new RDD.
  • filter() creates a new RDD by selecting the elements of the current RDD that pass the function argument, as in the sketch below.
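
A small sketch, assuming an existing SparkContext sc and a hypothetical logs.txt:

```scala
val lines  = sc.textFile("logs.txt")            // hypothetical input; nothing runs yet
val words  = lines.flatMap(_.split(" "))        // transformation: one line -> many words
val errors = lines.filter(_.contains("ERROR"))  // transformation: keep matching lines
println(errors.count())                         // count() is an action; the job runs now
```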

What is RDD Lineage?

  • Spark does not replicate data in memory, so if any data is lost, it is rebuilt using RDD lineage.
  • RDD lineage is the graph of transformations that Spark replays to reconstruct lost data partitions.
  • The best thing about this is that every RDD remembers how it was built from other datasets, so lost partitions can always be recomputed.
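
You can print an RDD's lineage with toDebugString. A sketch, assuming an existing SparkContext sc:

```scala
val doubled = sc.parallelize(1 to 100).map(_ * 2).filter(_ > 50)
// toDebugString prints the lineage Spark would replay to rebuild lost partitions
println(doubled.toDebugString)
```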

What is Spark Driver?

  • The Spark driver is the program that runs on the master node of the cluster and declares transformations and actions on data RDDs.
  • In simple terms, the driver creates the SparkContext, which connects to a given Spark master.
  • It also delivers the RDD graphs to the master, where the standalone cluster manager runs.
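
A minimal sketch of the driver side, assuming a hypothetical standalone master at master-host:7077:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver creates the SparkContext, which connects to the Spark master
val conf = new SparkConf()
  .setAppName("DriverExample")
  .setMaster("spark://master-host:7077") // hypothetical standalone master URL
val sc = new SparkContext(conf)
```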

Define Spark Streaming.

  • Spark Streaming is an extension of the core Spark API that enables stream processing of live data streams.
  • Data from sources such as Kafka, Flume, and Kinesis is processed and then pushed to file systems, live dashboards, and databases. The input data is divided into micro-batches, so processing a stream resembles batch processing on a sequence of small batches.
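
A sketch using the classic DStream API, assuming an existing SparkContext sc and a hypothetical socket source on localhost:9999:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(10))  // 10-second micro-batches
val lines  = ssc.socketTextStream("localhost", 9999) // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()            // start receiving and processing
ssc.awaitTermination()
```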

What is GraphX?

  • Spark uses GraphX for graph processing: it lets you build and transform graphs and run graph-parallel computations on them.
  • The GraphX component enables programmers to reason about graph-structured data at scale.
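
A minimal sketch of building a property graph from vertex and edge RDDs, assuming an existing SparkContext sc:

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))
val graph    = Graph(vertices, edges) // build the property graph
println(graph.numEdges)               // 2
```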

What is a Parquet file?

  • Parquet is a columnar file format supported by many data processing systems.
  • Spark SQL can both read and write Parquet files, and Parquet is considered one of the best file formats for big data analytics.
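
A short sketch of the round trip, assuming an existing SparkSession spark and a hypothetical people.json input:

```scala
val df = spark.read.json("people.json") // hypothetical input
df.write.parquet("people.parquet")      // write in columnar Parquet format

val parquetDF = spark.read.parquet("people.parquet")
parquetDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()
```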

What is YARN?

  • YARN is Hadoop's central resource management platform; running Spark on YARN provides scalable operation across a Hadoop cluster.
  • Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
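
In code, only the master changes; the Hadoop configuration is usually picked up from HADOOP_CONF_DIR when the application is launched with spark-submit. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// In practice the master is usually passed as `spark-submit --master yarn`
val spark = SparkSession.builder()
  .appName("OnYarn")
  .master("yarn") // resource management is delegated to YARN
  .getOrCreate()
```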

What is PageRank?

  • PageRank is an algorithm in GraphX that measures the importance of each vertex in a graph.
  • For instance, an edge from u to v represents an endorsement of v's importance by u.
  • In simple terms, a user who is followed by many others on Instagram will be ranked highly on that platform.
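
A sketch running PageRank on an edge-list file, assuming an existing SparkContext sc and a hypothetical followers.txt:

```scala
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "followers.txt") // hypothetical edge list
val ranks = graph.pageRank(tol = 0.0001).vertices         // (vertexId, rank) pairs
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)
```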

What are Spark Datasets?

  • Datasets are data structures in Spark (added in Spark 1.6) that provide the JVM-object benefits of RDDs (such as the ability to manipulate data with lambda functions) together with the optimized execution engine of Spark SQL.
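
A typed sketch, assuming an existing SparkSession spark and a hypothetical Person case class (note that in spark-shell the implicits and case class work as shown):

```scala
import spark.implicits._

case class Person(name: String, age: Int) // hypothetical schema

val ds     = Seq(Person("Ada", 36), Person("Linus", 29)).toDS()
val adults = ds.filter(_.age >= 30)       // typed lambda, checked at compile time
adults.show()
```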

What are Spark DataFrames?

  • When a Dataset is organized into named, SQL-like columns, it is known as a DataFrame.
  • Conceptually, it is equivalent to a table in a relational database or a DataFrame in R or Python. The difference is that Spark DataFrames are optimized for big data.
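
A short sketch, assuming an existing SparkSession spark and a hypothetical people.json:

```scala
val df = spark.read.json("people.json") // hypothetical input
df.printSchema()                        // inferred column names and types
df.select("name").show()
df.filter(df("age") > 21).show()
df.groupBy("age").count().show()
```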

What is lazy evaluation?

  • Spark evaluates lazily: when you create an RDD from an existing RDD or a data source, the RDD is not materialized until an action requires it.
  • This avoids wasting memory and CPU on intermediate results that are never needed, which matters especially in big data analytics.
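
A sketch of laziness, assuming an existing SparkContext sc and a hypothetical big.txt:

```scala
val data     = sc.textFile("big.txt")            // nothing is read yet
val upper    = data.map(_.toUpperCase)           // still nothing executed
val filtered = upper.filter(_.contains("SPARK")) // still lazy
println(filtered.count())                        // count() is an action: the job runs now
```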