Scala vs Python vs Java :: Big Data processing with Apache Spark



I have tried to compare Python and Scala on several parameters: performance, ease of use, integration with existing libraries, streaming support, and coverage of Apache Spark’s core capabilities.

I did not try to evaluate Java for the following reasons:

  • Java did not ship a REPL (Read-Eval-Print Loop) until JShell arrived in Java 9; the REPL is used very extensively to check whether small code snippets work as expected. 
  • Java is too verbose – it takes many more lines of code to express the same logic.
  • Scala runs on the same JVM and is considerably more expressive and concise than Java.
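The REPL workflow mentioned above can be illustrated with a quick word-count check – a pure-Python stand-in for the kind of snippet one would paste into a spark-shell or pyspark session (the input sentence below is just a made-up sample):

```python
# A REPL-style sanity check: paste a small snippet and inspect the
# result immediately, without building or deploying anything.
words = "spark scala spark python spark".split()
counts = {w: words.count(w) for w in set(words)}
print(sorted(counts.items()))
# → [('python', 1), ('scala', 1), ('spark', 3)]
```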

De-facto language for Spark
  • Scala: Scala is the first-preference language for Spark because Spark itself is written in Scala, so developers can dig deep into the Spark source code whenever required. New features of Spark are available first in Scala and are only later ported to Python or Java.
  • Python: Data scientists traditionally have more of a background in Python and may not want to learn a new language.
Performance
  • Scala: Often claimed to be up to 10 times faster than Python for complex data processing; a statically typed, compiled language generally has better run-time performance than a dynamically typed one.
  • Python: As a dynamically typed, interpreted language, Python carries extra run-time overhead for the same workloads.

Static vs. dynamic typing
  • Scala: Statically typed – the type of a variable (Int, String, Float, etc.) is known at compile time. The compiler is proactive in catching errors at a very early stage and performs additional checks, so you have to do less yourself.
  • Python: Dynamically typed – a declared variable can take a value of any type at run time, and type checking happens at run time rather than compile time.
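The difference shows up in plain Python: nothing checks the argument types of the hypothetical `add_tax` function below until the offending line actually runs.

```python
def add_tax(price, rate):
    # No compile-time check on 'price': any type may arrive at run time.
    return price + price * rate

print(add_tax(100.0, 0.25))   # → 125.0

try:
    add_tax("100", 0.25)      # the type error surfaces only at run time
except TypeError as exc:
    print("Failed at run time:", exc)
```

In Scala, the equivalent call with a String argument would be rejected at compile time before the program ever ran.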
More error-prone?
  • Scala: If a refactoring misses a spot or leaves the code in an inconsistent state, the code simply will not compile, preventing surprising run-time errors.
  • Python: Every code restructuring carries the risk of breaking the logic and leaving behind bugs that only surface at run time. Being a compiled language gives Scala the advantage here.
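A minimal sketch of that refactoring risk in Python, using a made-up pair of functions: a rename that misses one call site goes unnoticed until the stale code path actually executes.

```python
def compute_total(prices):        # renamed from calculate_total
    return sum(prices)

def report():
    # Stale call site left behind by the refactor; Python happily
    # accepts this definition without complaint.
    return calculate_total([1, 2, 3])

try:
    report()
except NameError as exc:
    print("Caught only at run time:", exc)
```

The Scala compiler would flag the unresolved name the moment the project was rebuilt.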
Streaming support
  • Python: Spark streaming support in Python is not as advanced and mature as it is in Scala.

Multi-threading
Big Data systems need development activities to be linked across multiple databases and services.
  • Scala: Preferred here for the Play framework, which offers asynchronous libraries that are easy to integrate.
  • Python: Does not support true multi-threading for CPU-bound work – CPython’s Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time.
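A sketch of that limitation under CPython's GIL (the function names here are made up): splitting a CPU-bound sum across two threads still produces the correct result, but the threads take turns on the interpreter rather than running in parallel, so there is no speedup.

```python
import threading

def partial_sum(lo, hi, out, idx):
    # CPU-bound work: under the GIL, threads running this interleave
    # on one core instead of executing in parallel.
    out[idx] = sum(range(lo, hi))

results = [0, 0]
t1 = threading.Thread(target=partial_sum, args=(0, 500_000, results, 0))
t2 = threading.Thread(target=partial_sum, args=(500_000, 1_000_000, results, 1))
t1.start(); t2.start()
t1.join(); t2.join()
print(sum(results))   # same answer as a single-threaded sum(range(1_000_000))
```

For genuine parallelism on CPU-bound work, Python code typically has to fall back on multiple processes rather than threads.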
External libraries
  • Scala: Excellent built-in concurrency support and libraries such as Akka make it easy for developers to build a truly scalable application. Spark MLlib – the machine learning library – has enough algorithms for most Big Data use cases.
  • Python: A rich ecosystem of libraries for machine learning and natural language processing (scikit-learn and NLTK, for example).



Hope you enjoyed reading this post. Please feel free to provide your feedback. 

Thanks!
