Filtering out Nulls and Headers in Scala/Spark
Consider a contact_file at the HDFS location /User/VJ/testfile that contains a record with an empty (null) value in a field that is required to be non-null. Here, the last record has no value in the 'age' field, so the requirement is to filter out all such lines.

id,fname,lname,age,designation
1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer
4, ravi, jaiswal,, Builder

Solution: Using mapPartitionsWithIndex to drop the first element of the iterator for partition index 0 filters the header out of the input file, while applying != "" to the 4th field (index 3, the 'age' column) filters out the last record (id 4).

scala> sc.textFile("/User/VJ/testfile").
         mapPartitionsWithIndex((x, y) => if (x == 0) y.drop(1) else y).
         filter(x => x.split(",")(3) != "").
         take(5).
         foreach(println)

Output:

1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer
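Two caveats about the one-liner are worth noting: Scala's split(",") silently drops trailing empty fields, so a row whose missing value falls at the end of the line (or that is truncated) makes the indexed access throw an ArrayIndexOutOfBoundsException, and the sample fields carry leading spaces that a plain != "" check ignores. Below is a minimal self-contained sketch guarding against both; the object name and the local master setting are illustrative assumptions, not part of the original post.

import org.apache.spark.{SparkConf, SparkContext}

object FilterNullsAndHeader {
  def main(args: Array[String]): Unit = {
    // Assumption: running locally; on a cluster, spark-submit sets the master.
    val conf = new SparkConf().setAppName("FilterNullsAndHeader").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val cleaned = sc.textFile("/User/VJ/testfile")
      // Only partition 0 starts with the file's first line, so dropping one
      // element there removes exactly the header.
      .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter)
      .filter { line =>
        // limit -1 keeps trailing empty fields; the length check guards
        // against short or blank lines; trim handles the leading spaces.
        val fields = line.split(",", -1)
        fields.length > 3 && fields(3).trim.nonEmpty
      }

    cleaned.collect().foreach(println)
    sc.stop()
  }
}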
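For completeness, the same cleanup can also be sketched with the DataFrame API, which consumes the header declaratively; this assumes the spark session that spark-shell provides and Spark's default CSV behaviour of reading empty fields as null:

scala> val df = spark.read.option("header", "true").csv("/User/VJ/testfile")
scala> df.filter(df("age").isNotNull).show()

This trades the manual partition-index bookkeeping for schema-aware handling, at the cost of parsing the file as CSV rather than raw text.

Thanks,
Vishal.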