Apache Spark :: aggregateByKey explained :: CCA175 exam topic

Apache Spark :: aggregateByKey explained :: Sample question for the Spark Developer's exam (Cloudera/Databricks)

Scenario:

The sample input tuple RDD 'ordersVJ' is in the form (ItemId, RevenuePerItemId), as follows:

...
(10,299.98)
(20,199.99)
(20,250.0)
(20,129.99)
(40,49.98)   -- Key = 40, Value = 49.98 (input value type is Float)
(40,299.95)
(40,150.0)
(40,199.92)
(50,299.98)
(50,299.95)
...

Using the aggregateByKey Spark RDD API, find the total revenue and the maximum revenue per ItemId. The desired output for ItemId 40 will be (40,(699.85, 299.95)).

Solution:

The RDD way of achieving this result with aggregateByKey is trickier and more complex than the Spark DataFrame or Spark SQL way. You need a very good understanding of what is being passed to the function. aggregateByKey takes a zero (initial) value plus two functions: a sequence operation that merges a value into an accumulator within a partition, and a combine operation that merges accumulators across partitions. The desired output for ItemId 40 is in a (K...
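To make the zero value / sequence-op / combine-op mechanics concrete, here is a minimal pure-Python sketch of the same logic. This is not Spark itself: `aggregate_by_key` is a hypothetical helper that imitates how Spark applies the sequence operation within each partition and the combine operation across partitions, using an accumulator of (runningTotal, runningMax).

```python
def aggregate_by_key(pairs, zero, seq_op, comb_op, num_partitions=2):
    """Imitate Spark's aggregateByKey on a plain list of (key, value) pairs."""
    # Split the pairs into "partitions" to mimic distributed execution.
    partitions = [pairs[i::num_partitions] for i in range(num_partitions)]

    # seq_op merges each value into that partition's per-key accumulator.
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        per_partition.append(acc)

    # comb_op merges the per-partition accumulators into the final result.
    merged = {}
    for acc in per_partition:
        for k, a in acc.items():
            merged[k] = comb_op(merged[k], a) if k in merged else a
    return merged


orders_vj = [(10, 299.98), (20, 199.99), (20, 250.0), (20, 129.99),
             (40, 49.98), (40, 299.95), (40, 150.0), (40, 199.92),
             (50, 299.98), (50, 299.95)]

# Accumulator is (total, max); the zero value (0.0, 0.0) is safe here
# because all revenues are positive.
result = aggregate_by_key(
    orders_vj,
    zero=(0.0, 0.0),
    seq_op=lambda acc, v: (acc[0] + v, max(acc[1], v)),
    comb_op=lambda a, b: (a[0] + b[0], max(a[1], b[1])),
)

total, mx = result[40]
print((40, (round(total, 2), mx)))  # → (40, (699.85, 299.95))
```

In real Spark the same accumulator and the same two lambdas would be passed to `rdd.aggregateByKey` (curried in Scala as `aggregateByKey(zeroValue)(seqOp, combOp)`); the sketch above only makes visible where each function runs.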