Most Software Engineers like me must have tried quite a few times to read up on Statistics and some of its methods and theories, just to get a hang of it and its usefulness in Machine Learning.

Do I have to tell you how boring it is? And how many times we have to read the same concept with different notations across different sources?

Got tired of it…

Recently, for a change, I adapted the Software Engineer's workflow for Mathematics...

  • Read about the topic of interest and understand it theoretically (this includes both the problem and the possible solutions).
  • Explore the tools and programming languages that have implemented the solution, or that help in implementing it…

Union is a simple operation that appends the rows of two or more tables. What could go wrong? Hmm…

  • Schema mismatch: the number of columns or their datatypes differ
  • Rows getting appended in the wrong column order

In this post I am going to show some pitfalls we faced in our project some time back, when we used union to add tables from ingestion drops, and how we managed them.

Join vs Union

It would be nice to recap the differences…just the basic ones!

  • A join combines two tables with different schemas based on a common key column. The new table can have columns from both tables, and the user chooses which columns appear in the joined table and how they appear, with the same names or different ones. …


This is part of the series called Big Data Playground for Engineers and the content page is here!

Git:

A fully functional code base and use case examples are up and running.

Repo: https://github.com/gyan42/spark-streaming-playground

Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html

It would be good to read the theory on building a production-ready Machine Learning system through the following links, since we are going to see how to materialize that theory. As part of this series I show how to connect the common tutorials we come across on the web into an end-to-end real-time pipeline.

The example use case and the code used to materialize the theory are just an attempt at education on a single local machine; surely there are a lot of improvements to be considered for a production scenario. Nevertheless, I think this will be a good eye opener for many, exposing you to an end-to-end data science pipeline, as it did for me. …


Figure: Data Flow


Data without labels is of little interest for Machine Learning, and for Deep Learning it's a big no!

Preparing labelled data is not child's play…

  • Domain understanding is required
  • A lot of human hours need to be put in to prepare the dataset, accounting for human error in the process

More professionally managed services are available, like…

In this post let's see how to create a simple Flask-based annotation tool that reads data from a PostgreSQL DB table and stores the label data back into it. …
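The core of such a tool is tiny: one route serves the next unlabelled row, another writes the chosen label back. The sketch below uses SQLite so the snippet is self-contained; pointing it at PostgreSQL instead is a driver/URI change. The table name `tweets` and the routes are illustrative, not the repo's exact code.

```python
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory SQLite stands in for the PostgreSQL table here.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT, label TEXT)")
db.execute("INSERT INTO tweets (text) VALUES ('spark is great')")

@app.route("/next")
def next_unlabelled():
    # Serve one row that has no label yet.
    row = db.execute("SELECT id, text FROM tweets WHERE label IS NULL LIMIT 1").fetchone()
    return jsonify({"id": row[0], "text": row[1]} if row else {})

@app.route("/label", methods=["POST"])
def save_label():
    # Persist the annotator's choice back to the same table.
    payload = request.get_json()
    db.execute("UPDATE tweets SET label = ? WHERE id = ?", (payload["label"], payload["id"]))
    db.commit()
    return jsonify({"status": "ok"})
```

A real annotation UI would render the text in an HTML form, but the read-row/write-label loop above is the whole idea.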



Snorkel

In 2016, AI researchers from Stanford University introduced a new paradigm known as data programming that allows data engineers to express weak supervision strategies and generate probabilistic training labels representing the lineage of the individual labels. The ideas behind data programming were incredibly compelling but lacked a practical implementation.

Snorkel: rapid training data creation with weak supervision Ratner et al., VLDB’18

A weakly supervised training data set creation framework. It tackles one of the central questions in supervised machine learning: how do you get a large enough set of training data to power modern deep models? …
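The data-programming intuition can be conveyed in plain Python: several noisy, heuristic labelling functions vote on each example. Snorkel's actual API wraps such functions in `@labeling_function` decorators and replaces the naive majority vote below with a learned generative label model; this sketch only shows the core idea, with made-up heuristics.

```python
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1

def lf_contains_love(text):
    # Weak heuristic 1: "love" suggests a positive example.
    return POS if "love" in text.lower() else ABSTAIN

def lf_contains_hate(text):
    # Weak heuristic 2: "hate" suggests a negative example.
    return NEG if "hate" in text.lower() else ABSTAIN

def lf_exclamation(text):
    # Weak heuristic 3: exclamation marks lean positive (noisy on purpose).
    return POS if text.endswith("!") else ABSTAIN

LFS = [lf_contains_love, lf_contains_hate, lf_exclamation]

def weak_label(text):
    """Majority vote over non-abstaining labelling functions."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN
```

Each function alone is unreliable; the point of data programming is that many such cheap heuristics, combined and denoised, can label far more data than human annotators ever could.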



Any starter in Natural Language Processing has to come across spaCy, as it is becoming the competitor to NLTK!


The Jupyter notebook below quickly introduces the basic concepts in NLP with spaCy.
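For a first taste without the notebook, the snippet below runs spaCy's tokenizer. `spacy.blank("en")` needs no model download; loading `en_core_web_sm` instead would additionally give part-of-speech tags, dependencies, and named entities, as the comments hint.

```python
import spacy

# Tokenizer-only English pipeline; no pretrained model required.
nlp = spacy.blank("en")
doc = nlp("Apache Spark reads tweets from Kafka.")

tokens = [t.text for t in doc]  # tokenization: punctuation is split off

# With spacy.load("en_core_web_sm") the same doc would also expose, e.g.:
#   [(t.text, t.pos_) for t in doc]          # part-of-speech tags
#   [(e.text, e.label_) for e in doc.ents]   # named entities
```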



How can anyone ignore the Twitter tweet stream when they start learning streaming?

Figure: Example Illustration

Why do we need Kafka? In brief, it's a distributed streaming platform. Imagine we wanted to do some sort of NLP task on the tweet text; in a matter of time a single machine would be overwhelmed with the text data, hitting its computing limits. Here Kafka mostly helps in handling the stream data before it gets processed, either by Kafka itself or by an external computing platform like Spark Streaming. …
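On the ingest side this amounts to a producer pushing tweet payloads onto a topic. A sketch with the `kafka-python` client is below; the topic name and broker address are placeholders, and in the playground repo the tweets come from the Twitter streaming API rather than an in-memory list.

```python
import json

def serialize_tweet(tweet: dict) -> bytes:
    """Encode a tweet payload the way the producer sends it over the wire."""
    return json.dumps(tweet, sort_keys=True).encode("utf-8")

def publish_tweets(tweets, topic="ai_tweets_topic", servers="localhost:9092"):
    # pip install kafka-python; requires a reachable Kafka broker.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=servers,
                             value_serializer=serialize_tweet)
    for tweet in tweets:
        producer.send(topic, tweet)   # async send, batched by the client
    producer.flush()                  # block until everything is delivered
```

Downstream, Spark can then subscribe to the same topic and do the heavy NLP work at its own pace, which is exactly the decoupling Kafka buys you.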



After this you can set up a common metastore between Spark Streaming, Spark SQL and Hive, thus enabling cross-tooling query capability.

From Apache Spark you can read Hive tables and vice versa!

What is Hive?

Apache Hive is an open source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files. It processes structured and semi-structured data in Hadoop.

In a traditional database store, both the compute and the data live on the same machine, forcing people to go for a bigger machine every time they hit a bottleneck in the business requirements for more power and data storage. …



Before you begin!

This series supports the learn-by-doing approach, where each scenario is fabricated based on my past experience in the industry. I am not going to explain the nitty-gritty of the frameworks or tools used, like other tutorials do, but certain tips and tricks will be highlighted, ones that may only pop up when mixing multiple components together to make an end-to-end solution.

This is to show starters how to put different components together and build a complete solution, as opposed to the simple tutorials out there.

For example, most of us have come across building Machine Learning and Deep Learning models, but how many times have you seen them in live action? …



RDD Basics:

Dataset Basics…

Contents

  1. RDD
  2. Partitions
  3. Joins
  4. Serialization
  5. UDF
  6. Analyse Execution plan
  7. Data Skew
  8. Cache
  9. Storage
  10. JVM
  11. Monitoring
  12. Executor
  13. Memory
  14. Streaming

When I started learning Apache Spark almost three years back, it was mostly about setting it up on a local machine and running code on that single machine. Later came writing Scala code, building it, and running it with spark-submit. Then came PySpark and Jupyter notebooks.

Most newcomers find it easy to get started with PySpark but struggle when dealing with production use cases and/or projects. The same configurations, for the same data set, may not work alike across pipeline modules. …

About

Mageswaran D

A simple guy in pursuit of AI and Deep Learning with Big Data tools :) @ https://www.linkedin.com/in/mageswaran1989/
