Apache Spark :: Error Resolution :: 'value join is not a member of org.apache.spark.rdd.RDD'
ERROR DESCRIPTION
Consider two Spark RDDs that need to be joined.
Say rdd1.first has the form (Int, Int, Float) = (1,957,299.98),
while rdd2.first has the form (Int, Int) = (25876,1), and the join is supposed to take place on the 1st field of both RDDs.
scala> rdd1.join(rdd2) --- results in an error
<console>:**: error: value join is not a member of org.apache.spark.rdd.RDD[(Int, Int, Float)]
REASON
join is only defined for RDDs of Key-Value pairs, i.e. RDDs whose elements are 2-tuples of the form (key, value).
Here, rdd1 -- whose elements look like (1,957,299.98) -- is an RDD of 3-tuples and does not obey this rule, while rdd2 -- whose elements look like (25876,1) -- does.
RESOLUTION
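Convert the elements of rdd1 from (1,957,299.98) to a Key-Value pair in the form of (1,(957,299.98)) before joining it with rdd2, as shown below. Since rdd1 already holds tuples, a pattern match is enough. (If rdd1 still held raw comma-separated lines as Strings, each field would first have to be extracted with split(",") and converted with toInt / toFloat.)
scala> val rdd1KV = rdd1.map { case (k, a, b) => (k, (a, b)) } -- modified RDD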
scala> rdd1KV.first
res**: (Int, (Int, Float)) = (1,(957,299.98))
scala> val joinedRDD = rdd1KV.join(rdd2) -- join successful
joinedRDD: org.apache.spark.rdd.RDD[(Int, ((Int, Float), Int))] = MapPartitionsRDD[67] at join ..
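By the way, join is a member of org.apache.spark.rdd.PairRDDFunctions, which Spark makes available on any RDD of 2-tuples through an implicit conversion. On Spark versions before 1.3 this conversion has to be brought into scope with import org.apache.spark.SparkContext._ in Eclipse or whichever IDE you run your code from. From Spark 1.3 onwards it is picked up automatically.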
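For code that runs outside the spark-shell, the same fix can be sketched as a small standalone application. This is only an illustration, assuming Spark is on the classpath and can run in local mode. The object name JoinSketch, the helper toPairs, and the sample rows are made up for this example and are not part of the original data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, also add: import org.apache.spark.SparkContext._

object JoinSketch {
  // Hypothetical helper: reshape (key, a, b) triples into (key, (a, b)) pairs
  // so that the result is an RDD of 2-tuples and join becomes available.
  def toPairs(triples: RDD[(Int, Int, Float)]): RDD[(Int, (Int, Float))] =
    triples.map { case (key, a, b) => (key, (a, b)) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows, chosen so that some keys match and one does not.
      val rdd1 = sc.parallelize(Seq((1, 957, 299.98f), (2, 1073, 199.99f)))
      val rdd2 = sc.parallelize(Seq((1, 11599), (2, 256), (3, 12111)))

      // rdd1.join(rdd2) would not compile here; the reshaped RDD joins fine.
      val joined: RDD[(Int, ((Int, Float), Int))] = toPairs(rdd1).join(rdd2)

      // Small result, so collecting to the driver is acceptable in this sketch.
      joined.collect().sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because join is an inner join, only keys present in both RDDs (1 and 2 in the sample rows) appear in the result, and key 3 is dropped.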
Thanks,
Vishal.