Big Data Play Ground For Engineers: Intro

Mageswaran D
9 min read · Apr 15, 2020

Before you begin!

This series supports a learn-by-doing approach, where each scenario is fabricated from my past experience in the industry. I am not going to explain the nitty-gritty of the frameworks or tools used, like other tutorials do, but I will highlight certain tips and tricks that only pop up when you mix multiple components together and have to build an end-to-end solution.

The goal is to show beginners how to put different components together and build a complete solution, as opposed to the simple tutorials out there.

For example, most of us have come across building Machine Learning and Deep Learning models, but how many times have you seen them in live action?

While I am trying to be a decent writer, this is my first attempt at creating a series out of my pet project, so expect some ad-hoc publishing of the articles in this series, before my brain loses the information into thin air.

Looking forward to making this series worth your time over the next couple of weeks :)

Git:

A fully functional code base and use case examples are up and running.

Repo: https://github.com/gyan42/spark-streaming-playground

Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html

I am an Instrumentation Engineer turned Embedded Engineer turned Big Data Engineer, aspiring to be a Machine Learning Engineer (definitely not a Data Scientist, my brain is not built for that) in the near future!

My initial exposure was to embedded platform software like RTOS variants and Android, frameworks like OpenCL, OpenMP and CUDA, and programming languages like C and C++.

In 2016 (just 4 years back) I started to dive into Big Data. The questions then were: what programming language(s) should I learn? What software frameworks should I learn, or must I know, to land a job as a Big Data/Data/Data Science Engineer?

After a few Google searches, I learned that, as the name states, Big Data is all about handling data that can’t be handled on a single machine, and you have no idea what an enlightenment that was! (pun intended). With that figured out, the next quest was to figure out what distributed computing frameworks are. With this small brain of mine, I simplified them to a kind of software stack that sits on top of the existing OS, uses networking heavily, and coordinates all the machines in a network to work as one, the so-called cluster.

For Dragon Ball Z lovers like me… others may skip this :)

If you are a Dragon Ball Z fan, then distributed systems are something like Cell: they sit on the host machines and take control of them, and the master node sends its instructions to the other nodes and uses their resources to do something good, unlike Cell, whose sole purpose was to become strong by killing everyone else!

What is Big Data?

Let me not do injustice to fellow bloggers who have done their best explaining this vast topic. A Google search will throw up tons of material on it, so let’s move forward!

Big Data Projects

Okay, Big Data obviously needs Big Tools, doesn’t it?

We hear about new Big Data tools/frameworks every other day. In the software industry this is a nightmare for any salaried person, irrespective of their brain power and strength.

It’s all about how fast you can unlearn what you have been surviving on so far, and learn new stuff.

That sounds so simple, buddy: by unlearn you mean forget it, right? Tada, done!

If that was your thought for a moment, hold on: unlearning doesn’t mean forgetting your old-school stuff, it’s about adapting yourself to new-age requirements while carefully rewiring your brain’s neural network.

Didn’t get what I mean by that? Those who have some exposure to functional programming languages like Scala might understand it. Among all its good stuff, Scala enforces two simple rules: 1. immutable variables, and 2. a recursive programming style. Those two simple rules change the whole programming style. With those two rules, powered by case classes, a hybrid switch (the match expression) and a powerful type system, I felt like I had a Swiss Army knife. The problem was that I had hardly managed to use an ordinary knife to its fullest, so I don’t have to tell you what I could have ended up doing with a Swiss Army knife!

The unlearning and learning process is toughest for the older generation, because it challenges the whole knowledge foundation with which they have survived so far.

That said, can I learn one framework that has the power and capability to handle this so-called Big Data? Well, there is no universal framework that can meet every business need, be it at Google, Facebook, Netflix, Amazon or the startups supported by Indian IT companies across the globe.

Big organizations were behind the development of major tools like Hadoop, Kafka, Hive, Kubernetes etc.

Sometimes a university Ph.D. paper turns out to become the world’s leading distributed compute framework, as in the case of Apache Spark.

Last but not least, the small organizations where most of us are placed play a major role in exploring and adopting these tools/frameworks, taking the technologies to the end user with innovative ideas and pushing them beyond their inception.

Since most of us are part of this last sector, we end up frustrated dealing with this crazy fever of new frameworks/tools being invented every now and then. In this technology race, we as software engineers are always on the edge, ready to participate at any moment!

Ok fine! Does this mean traditional tools like MySQL, PostgreSQL and the Python batteries (libraries) are of no use? Nope! We are going to see how these tools play their vital roles in this series.

In a big data solution, the real challenge is identifying the right frameworks/tools and figuring out how to mix them up and make them all work as one entity. Often data engineers are forced to know or learn a dozen libraries/languages to meet the business requirements thrown at them.

I have summarized my last 3 years’ journey as a pie chart; the slices are broken down below ;)

Data Engineer : 100%

Traditionally, Data Engineers were SQL developers, but recent advancements have pushed the definition further, forcing Data Engineers to have a skill set similar to a full-stack developer’s, ranging from data collection up to integration with the front end.

System Design: 20%

A well-designed modular framework can keep your data pipeline running for a few months without going back to the whiteboard.

This includes writing test cases and utilities, designing class interfaces, adopting certain design patterns etc.
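To make the “class interfaces” part concrete, here is a minimal, hypothetical sketch of a pipeline-stage contract; the class and method names are mine, not from the playground repo:

```python
# A hypothetical pipeline-stage contract: every stage implements run(),
# so stages can be unit-tested alone and chained in any order.
from abc import ABC, abstractmethod


class PipelineStage(ABC):
    """One step in a data pipeline: records in, records out."""

    @abstractmethod
    def run(self, records):
        """Transform an iterable of records and return a list."""


class DeduplicateStage(PipelineStage):
    """Drops duplicate records while preserving their first-seen order."""

    def run(self, records):
        seen, out = set(), []
        for record in records:
            if record not in seen:
                seen.add(record)
                out.append(record)
        return out


# Chaining stages is then trivial, and each one is testable on its own:
stages = [DeduplicateStage()]
data = ["a", "b", "a", "c"]
for stage in stages:
    data = stage.run(data)
print(data)  # ['a', 'b', 'c']
```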

No matter how small or big your team or project is, you will end up redesigning your software; the question is: how far can I postpone the refactoring with the information I have right now?

Don’t believe me? Google the difference between TensorFlow 1.x and TensorFlow 2.x. Yup, even the great engineers at Google have to refactor the code base, or sometimes completely rewrite the stack, for new requirements based on past learnings!

There is no one good design that fits all needs! Be a child here: build it, see where it fails, fix it wherever possible, learn from it and rebuild it.

Dev Ops : 20%

Setting up the cloud, managing user rights, installing software stacks and a lot of other tasks that enable the infrastructure developers need to run their code.

The funny thing is, no matter how good the DevOps team may be, the Data Engineering team has to have a certain level of understanding of the host machines and software stacks being used, or else you are doomed.

In this series, let us scratch the surface of this process by setting up a single-node cluster (wait, one machine, a cluster? Yup).
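As a preview, here is a minimal sketch of that single-node idea using PySpark’s local mode; the app name is just a placeholder:

```python
# Spark's `local[*]` master runs the driver and executors on this one
# machine, using all its cores: a one-machine "cluster" that is good
# enough for development and for this series' examples.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")           # single node, all local cores
    .appName("playground-intro")  # placeholder app name
    .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)  # cores available to this "cluster"
spark.stop()
```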

Configuration Management : 25%

Have you ever wondered why services like AWS have so many XML/JSON/Python dict configurations floating around?

How many times have you written a function without arguments? OK, I can hear your annoyed inner voice… yes, a function without arguments is of little use. So it is with a big data solution: without configuration management it is of no use.

On the other hand, finding the right configuration parameter and the right value for it is like getting your wish from an angel. (Just hope that angel is pure and good, or you are screwed!)

We need to take care of the installed software packages’ configurations while developing our own.

When the system matures, there are going to be a lot of knobs, I mean a lot!!!

In this series, let me introduce you to gin-config and Python INI-file-based configuration management.
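Here is a minimal sketch of both styles; the file names, sections, keys and defaults below are made up for illustration and are not taken from the playground repo:

```python
# INI-style configuration with Python's standard-library configparser.
import configparser

config = configparser.ConfigParser()
config.read("pipeline.ini")  # hypothetical config file

# Fallbacks keep the pipeline alive when a knob is missing.
kafka_topic = config.get("kafka", "topic", fallback="tweets")
batch_secs = config.getint("spark", "batch_interval_seconds", fallback=5)

# gin-config style: the decorator exposes the function's arguments as
# knobs that a .gin file can override, e.g. a line like
#   build_pipeline.batch_interval_seconds = 10
import gin


@gin.configurable
def build_pipeline(batch_interval_seconds=5, checkpoint_dir="/tmp/ckpt"):
    print(batch_interval_seconds, checkpoint_dir)


gin.parse_config_file("pipeline.gin")  # hypothetical config file
build_pipeline()
```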

Data Handling with SQL : 25%

Didn’t I say the Data Engineer’s core task was SQL? It is still the same. The only difference is that now anyone can write SQL to get a decent amount of work done, but an experienced SQL developer knows how to adapt the queries for a given domain.

No matter what kind of Big Data project you end up in, you can’t escape SQL (a nightmare for people like me).

Most of the time, we build pipelines just to run business-specific SQL commands.

In this series, let me touch upon a few cooked-up scenarios to give you a feel for it.
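As a taste of that pattern, here is a minimal cooked-up example in PySpark; the table and column names are invented for illustration and are not from the playground repo:

```python
# The pipeline is mostly plumbing; the business logic lives in the query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

purchases = spark.createDataFrame(
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
    ["user", "amount"],
)
purchases.createOrReplaceTempView("purchases")

# A "business" query the pipeline exists to run:
spark.sql("""
    SELECT user, SUM(amount) AS total_spent
    FROM purchases
    GROUP BY user
    HAVING SUM(amount) > 25
""").show()

spark.stop()
```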

Data Handling without SQL : 5%

My favorite spot in the whole project.

Reading data from external sources like REST APIs, reading and decoding data from disk on the fly, sending data to 3rd-party endpoints etc.

Yup there is an example for this category too!
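For a taste, here is a minimal sketch of such non-SQL handling using the requests library; the URLs and payload shape are hypothetical:

```python
# Pull JSON records from a REST source, reshape them on the fly, and
# push them to a third-party endpoint; no SQL involved anywhere.
import json

import requests

resp = requests.get("https://api.example.com/events", timeout=10)  # hypothetical URL
resp.raise_for_status()

for record in resp.json():
    payload = json.dumps({"id": record.get("id"), "source": "rest"})
    requests.post(
        "https://thirdparty.example.com/ingest",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
```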

Refactoring : 5%

More often than not, we forget this part even exists.

It can be a simple SQL query at the data ingestion stage that screws up the whole downstream system; going back and fixing it would fix the data issue, but trigger the whole data processing from scratch.

Refactoring code, configurations and SQL queries will become a daily routine as the system matures and new requirements flow in.

Please bear with me for skipping the explanation of why a certain framework was selected, and of the selected frameworks’ internals. Let’s focus more on integrating different frameworks and constructing a data pipeline.

Wherever possible, I will try to copy and paste information from other blogs to make the reading pleasant, covering what needs to be known to play around with the code base.

Note: All pics were found via Google Image Search!
