Filtering out Nulls and Headers in Scala/Spark


Consider a contact file at the HDFS location /User/VJ/testfile that has an empty value in a field expected to be non-null. Here, the last record has no value in the 'age' field, so the requirement is to filter out all such records (along with the header line):

id,fname,lname,age,designation
1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer
4, ravi, jaiswal,, Builder

Solution: Using mapPartitionsWithIndex to drop the first record of partition 0 filters the header out of the input file, while a filter of != "" on the 4th field (age, index 3 after the split) drops the last record, whose age is empty:

scala> sc.textFile("/User/VJ/testfile").mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter).filter(line => line.split(",")(3) != "").take(5).foreach(println)

Output:
1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer
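One caveat worth noting: String.split(",") discards trailing empty strings, so if the empty value were in the last column, the index lookup could throw an ArrayIndexOutOfBoundsException. A slightly more defensive sketch of the same approach (not from the original post; it assumes the same file layout) passes -1 as the split limit to keep trailing empty fields, and trims whitespace before testing:

scala> sc.textFile("/User/VJ/testfile")
     |   .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter)
     |   .filter { line =>
     |     val fields = line.split(",", -1)             // -1 keeps trailing empty fields
     |     fields.length > 3 && fields(3).trim.nonEmpty // drop records with a missing age
     |   }
     |   .collect()
     |   .foreach(println)

The length check also guards against malformed lines with fewer than four fields, which would otherwise fail at fields(3).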


Thanks,
Vishal.
