Apache Spark :: aggregateByKey explained :: CCA175 exam topic

Apache Spark :: aggregateByKey explained :: Sample question for the Spark Developer's exam (Cloudera/Databricks)

Scenario:

The sample input tuple RDD 'ordersVJ' is in the form (ItemId, RevenuePerItemId), as follows:

...
(10,299.98)
(20,199.99)
(20,250.0)
(20,129.99)
(40,49.98)   -- Key = 40, Value = 49.98 (input value type is Float)
(40,299.95)
(40,150.0)
(40,199.92)
(50,299.98)
(50,299.95)
...

Using the aggregateByKey Spark RDD API, find the total revenue and the maximum revenue per ItemId. The desired output for ItemId 40 will be (40,(699.85, 299.95)).

Solution:

The RDD way of achieving this result with aggregateByKey is trickier and more complex than the Spark DataFrame or Spark SQL way. You need a very good understanding of what is being passed to the function. aggregateByKey takes a zero (initial) value plus two functions: a sequence operation that merges a value into an accumulator within a partition, and a combine operation that merges accumulators across partitions. The desired output for ItemId 40 is in a (K...
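To make the zero value / sequence-op / combine-op mechanics concrete, here is a minimal pure-Python sketch of the same logic. This is not Spark itself: `aggregate_by_key` is a hypothetical helper that imitates how Spark applies the sequence operation within each partition and the combine operation across partitions, using an accumulator of (runningTotal, runningMax).

```python
def aggregate_by_key(pairs, zero, seq_op, comb_op, num_partitions=2):
    """Imitate Spark's aggregateByKey on a plain list of (key, value) pairs."""
    # Split the pairs into "partitions" to mimic distributed execution.
    partitions = [pairs[i::num_partitions] for i in range(num_partitions)]

    # seq_op merges each value into that partition's per-key accumulator.
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        per_partition.append(acc)

    # comb_op merges the per-partition accumulators into the final result.
    merged = {}
    for acc in per_partition:
        for k, a in acc.items():
            merged[k] = comb_op(merged[k], a) if k in merged else a
    return merged


orders_vj = [(10, 299.98), (20, 199.99), (20, 250.0), (20, 129.99),
             (40, 49.98), (40, 299.95), (40, 150.0), (40, 199.92),
             (50, 299.98), (50, 299.95)]

# Accumulator is (total, max); the zero value (0.0, 0.0) is safe here
# because all revenues are positive.
result = aggregate_by_key(
    orders_vj,
    zero=(0.0, 0.0),
    seq_op=lambda acc, v: (acc[0] + v, max(acc[1], v)),
    comb_op=lambda a, b: (a[0] + b[0], max(a[1], b[1])),
)

total, mx = result[40]
print((40, (round(total, 2), mx)))  # → (40, (699.85, 299.95))
```

In real Spark the same accumulator and the same two lambdas would be passed to `rdd.aggregateByKey` (curried in Scala as `aggregateByKey(zeroValue)(seqOp, combOp)`); the sketch above only makes visible where each function runs.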