Filtering out Nulls and Headers in Scala/Spark


Consider a contact file at the HDFS location /User/VJ/testfile that has an empty value in a field expected to be non-null. Here, the last record has no value in the 'age' field, so the requirement is to filter out all such records (along with the header line):

id,fname,lname,age,designation
1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer
4, ravi, jaiswal,, Builder

Solution: Using mapPartitionsWithIndex to drop the first record of partition 0 filters the header out of the input file, while a filter of != "" on the 4th field (age, index 3 after the split) drops the last record, whose age is empty:

scala> sc.textFile("/User/VJ/testfile").mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter).filter(line => line.split(",")(3) != "").take(5).foreach(println)

Output:
1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer
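One caveat worth noting: String.split(",") discards trailing empty strings, so if the empty value were in the last column, the index lookup could throw an ArrayIndexOutOfBoundsException. A slightly more defensive sketch of the same approach (not from the original post; it assumes the same file layout) passes -1 as the split limit to keep trailing empty fields, and trims whitespace before testing:

scala> sc.textFile("/User/VJ/testfile")
     |   .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter)
     |   .filter { line =>
     |     val fields = line.split(",", -1)             // -1 keeps trailing empty fields
     |     fields.length > 3 && fields(3).trim.nonEmpty // drop records with a missing age
     |   }
     |   .collect()
     |   .foreach(println)

The length check also guards against malformed lines with fewer than four fields, which would otherwise fail at fields(3).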


Thanks,
Vishal.
