Scala vs Python vs Java :: Big Data processing with Apache Spark
I have tried to compare Python and Scala on various parameters: performance, ease of use, integration with existing libraries, support for streaming use cases, and coverage of Apache Spark's core capabilities.
I did not try to evaluate Java, for the following reasons:
- Java does not support a REPL (Read-Evaluate-Print Loop), which is used very extensively to check whether small code snippets work as expected.
- Java is too verbose: it needs more lines of code than necessary to express the same logic.
- Scala on the JVM is far more powerful and cleaner than Java.
De-facto language for Spark
- Scala: Scala is the first-preference language for Spark, since Spark itself is written in Scala: developers can dig deep into the Spark source code whenever required, and new Spark features are available in Scala first before being ported to Python or Java.
- Python: Data scientists traditionally have more of a background in Python and will often not want to learn a new language.
Performance
- Scala: Scala is claimed to be up to 10 times faster than Python for complex data processing; a compiled, statically typed language will generally have better run-time performance than a dynamically typed one.
Statically typed vs. dynamically typed
- Scala: Scala is a statically typed language, so the type of a variable (Int, String, Float, etc.) is known at compile time. The compiler proactively catches errors at a very early stage; Scala makes the compiler do the additional checks so that you have to do less.
- Python: Python is a dynamically typed language, where a variable can take a value of any type at run-time. Dynamically typed languages do type checking at run-time as opposed to compile time.
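A quick illustrative sketch (plain Python, no Spark involved) of what dynamic typing means in practice: the same name can be rebound to values of different types, and a type mismatch only surfaces when the offending line actually runs:

```python
# Dynamic typing: a name can be rebound to values of different types.
record = 42          # int
print(type(record))  # <class 'int'>
record = "42"        # now a str; nothing objects at "compile time"
print(type(record))  # <class 'str'>

# A type mismatch is only detected at run-time, when the line executes.
try:
    total = record + 1  # str + int raises TypeError here
except TypeError as err:
    print("Caught at run-time:", err)
```

In Scala, the equivalent reassignment and the `String + Int` mismatch would both be rejected by the compiler before the program ever runs.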
More error-prone?
- Scala: If we happen to miss a spot or leave the code in an inconsistent state, the code will not compile, preventing surprising run-time errors.
- Python: Every time code is restructured there is a risk of breaking the logic and leaving behind bugs, which only surface at run-time. Since Scala is a compiled language, it has the advantage over Python here.
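To make the refactoring risk concrete, here is a minimal hypothetical Python sketch (the `Job` class and attribute names are invented for illustration): an attribute is renamed during a refactor, but a stale call site is accepted without complaint and only fails when it actually executes — exactly the kind of mistake a static compiler would flag up front:

```python
class Job:
    def __init__(self, name):
        self.job_name = name  # renamed from .name during a refactor

def describe(job):
    # Stale call site: still uses the old attribute name. Python accepts
    # this definition silently; the bug only appears when it is called.
    return "Running " + job.name

job = Job("etl")
try:
    describe(job)
except AttributeError as err:
    print("Only caught at run-time:", err)
```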
Streaming support
- Python: Spark streaming support in Python is not as advanced and mature as it is in Scala.
|
Multi-threading
|
Big
Data systems needs the development activities be
linked across multiple databases and services.
Scala is preferred here for the Play framework that offers asynchronous libraries which are easy to integrate. |
Python
does not support true multi-threading.
|
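A small sketch of the limitation: CPU-bound work split across threads still produces the right answer, but under CPython's GIL the threads take turns on one core rather than running in parallel (multiprocessing or native extensions are the usual workarounds):

```python
import threading

def count_up(n, results, idx):
    # Pure-Python CPU-bound loop; under the GIL, threads running this
    # code interleave on a single core instead of running in parallel.
    total = 0
    for i in range(n):
        total += i
    results[idx] = total

results = [0] * 4
threads = [threading.Thread(target=count_up, args=(100_000, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # correct result, but no CPU-level parallelism
```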
External libraries
- Scala: Scala has excellent built-in concurrency support and libraries such as Akka, which make it easy to build a truly scalable application. Spark MLlib, the machine-learning library, has enough algorithms for most Big Data use cases.
- Python: Python has several libraries for machine learning and natural language processing.
Hope you enjoyed reading this post. Please feel free to provide your feedback.
Thanks!