Posts


Scala vs Python vs Java :: Big Data processing with Apache Spark

I have compared Python and Scala on several parameters: performance, ease of use of the language, integration with existing libraries, support for streaming use cases, and coverage of Apache Spark's core capabilities. I did not evaluate Java, for the following reasons:

- Java has no REPL (Read, Evaluate, Print, Loop) command line, which is used very extensively to check whether small code snippets work as expected.
- Java is too verbose: it needs more lines of code for the same task. Scala on the JVM is considerably more powerful and cleaner than Java.

SCALA vs PYTHON: Scala is the first-choice language for Spark, as Spark itself is written in Scala, so developers can dig deep into the Spark source code whenever required. New features of Spark are firs...
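To illustrate the REPL workflow the first point refers to: in the Scala shell you can try a transformation chain on a small local collection before running the identical chain on a Spark RDD. A minimal sketch (the object name and sample data are made up for illustration):

```scala
object ReplSketch {
  // Word count on a local collection. The same groupBy/map chain can be
  // pasted into the scala> prompt line by line to verify each step; on a
  // Spark RDD the equivalent is map + reduceByKey.
  def wordCount(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(List("spark", "scala", "python", "java", "spark"))
    counts.toList.sortBy(-_._2).foreach(println)
  }
}
```

This interactive check-then-scale loop is what Java, lacking a REPL, does not offer.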

Filtering out Nulls and Headers in Scala/Spark

Consider a contact_file at HDFS location /User/VJ/testfile which contains a record with an empty value in a mandatory field. Here the last line has no value in the 'age' field, so the requirement is to filter out all such lines:

id,fname,lname,age,designation
1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer
4, ravi, jaiswal,, Builder

Solution: using mapPartitionsWithIndex to drop the first element of the iterator for partition index 0 removes the header from your input file, while filtering with != "" on the 4th field (age) removes the last record:

scala> sc.textFile("/User/VJ/testfile").
     |   mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter).
     |   filter(line => line.split(",")(3) != "").
     |   take(5).foreach(println)

Output:

1, amarnath, jaiswal, 61, Businessman
2, prakash, yadav, 30, Developer
3, vishal, jaiswal, 32, Engineer

Thanks,
Vishal.
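The header-drop and null-filter logic above can be exercised without a cluster by simulating partitions with plain Scala iterators, since mapPartitionsWithIndex hands each partition to your function as an (index, Iterator) pair. A minimal sketch (the object name is made up; the data is the file from the post, split across two hypothetical partitions):

```scala
object FilterSketch {
  // Mirrors the Spark pipeline: drop the first line of partition 0 (the
  // header), then keep only lines whose 4th comma-separated field (age)
  // is non-empty.
  def clean(partitions: Seq[Iterator[String]]): Seq[String] =
    partitions.zipWithIndex.flatMap { case (iter, index) =>
      val withoutHeader = if (index == 0) iter.drop(1) else iter
      withoutHeader.filter(line => line.split(",")(3) != "")
    }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Iterator("id,fname,lname,age,designation",
               "1, amarnath, jaiswal, 61, Businessman"),
      Iterator("2, prakash, yadav, 30, Developer",
               "3, vishal, jaiswal, 32, Engineer",
               "4, ravi, jaiswal,, Builder"))
    clean(partitions).foreach(println)
  }
}
```

Note that this only drops the header if it lands in partition 0, which holds for a header-first text file read by sc.textFile; it is a sketch of the technique, not a general CSV parser.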