Apache Spark :: Error Resolution :: 'value join is not a member of org.apache.spark.rdd.RDD'



ERROR DESCRIPTION

Consider two Spark RDDs that need to be joined together.

Say, rdd1.first is in the form of (Int, Int, Float) = (1,957,299.98),
while rdd2.first is something like (Int, Int) = (25876,1), where the join is supposed to take place on the first field of both RDDs.

scala> rdd1.join(rdd2)  --- results in an error
<console>:**: error: value join is not a member of org.apache.spark.rdd.RDD[(Int, Int, Float)]

REASON

Both the RDDs should be in the form of a Key-Value pair.

Here, rdd1 -- being in the form of (1,957,299.98) -- does not obey this rule, while rdd2 -- which is in the form of (25876,1) -- does.
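To see the distinction outside of Spark, here is a plain-Scala sketch on ordinary collections (the values are illustrative, not taken from the actual dataset): a 3-tuple has no notion of key and value, whereas a pair (K, V) does, and that pair shape is exactly what join needs.

```scala
// Plain-Scala sketch: reshaping 3-tuples into (key, value) pairs.
// This mirrors what the map in the resolution below does on the RDD.
val triples = List((1, 957, 299.98f), (2, 400, 19.99f)) // like rdd1's elements
val keyed   = triples.map { case (k, a, b) => (k, (a, b)) } // like rdd1KV
// keyed: List[(Int, (Int, Float))] = List((1,(957,299.98)), (2,(400,19.99)))
```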

RESOLUTION

Convert each element of rdd1 from (1,957,299.98) to a key-value pair of the form (1,(957,299.98)) before joining it with rdd2, as shown below:

scala> val rdd1KV = rdd1.map { case (k, a, b) => (k, (a, b)) }  -- modified RDD

scala> rdd1KV.first
res**: (Int, (Int, Float)) = (1,(957,299.98))

scala> val joinedRDD = rdd1KV.join(rdd2)  -- join successful
joinedRDD: org.apache.spark.rdd.RDD[(Int, ((Int, Float), Int))] = MapPartitionsRDD[67] at join ..
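For intuition, what join computes here can be sketched in plain Scala as an inner join on matching keys (again, the sample values are illustrative and the helper is not Spark code):

```scala
// Plain-Scala sketch of join semantics: keep only key matches,
// pairing the left value with the right value, as in RDD[(K, (V, W))].
val left  = List((1, (957, 299.98f)))   // like rdd1KV: (key, (value1, value2))
val right = List((1, 25876), (2, 99))   // like rdd2: (key, value)
val joined = for {
  (k, lv)  <- left
  (k2, rv) <- right
  if k == k2
} yield (k, (lv, rv))
// joined: List[(Int, ((Int, Float), Int))] = List((1,((957,299.98),25876)))
```

Note that, like Spark's join, key 2 from the right side is dropped because it has no match on the left.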

By the way, join is a member of org.apache.spark.rdd.PairRDDFunctions, which Spark attaches to RDDs of key-value pairs through an implicit conversion. In Spark versions before 1.3, make sure you add import org.apache.spark.SparkContext._ to your code in Eclipse or whichever IDE you use; from 1.3 onwards the conversion is in scope automatically.


Thanks,
Vishal.

