If you are new to ML, you must have already come across Linear Regression, the "hello world" algorithm of ML.

Let us revisit the ways to implement the algorithm. Can you guess which one is used in scikit-learn?

Generally, Ordinary Least Squares (OLS) is used to find the line that minimizes the vertical offsets between the points and the line. The best-fit line can be found by minimizing the sum of squared errors (SSE) or the mean squared error (MSE).
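To make the idea concrete, here is a minimal sketch of OLS for a single feature using the closed-form solution, in plain Python. The function names and the toy data are illustrative assumptions, not code from the post, and this is not a claim about how any particular library implements it.

```python
def fit_line(xs, ys):
    """Closed-form OLS for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sse(xs, ys, slope, intercept):
    """Sum of squared vertical offsets between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def mse(xs, ys, slope, intercept):
    """Mean squared error: SSE divided by the number of points."""
    return sse(xs, ys, slope, intercept) / len(xs)


# Made-up toy data, roughly following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, mse(xs, ys, slope, intercept))
```

Since MSE is just SSE divided by a constant, minimizing either one gives the same line.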

An orthogonal matrix is a square matrix whose columns (and rows) are unit vectors that are orthogonal (perpendicular) to each other, and it has a special place in ML for the following reasons…

An orthogonal matrix is an n×n matrix A such that

A^T A = A A^T = I

Which other matrix, when multiplied by A, gives the identity matrix?

The matrix inverse

A^-1, i.e. A^T = A^-1

This is useful when we need to compute A^-1 for a huge matrix. Just transpose it and we get its inverse!

It preserves angles and distances when applied to vectors. …
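Both properties can be checked numerically. The sketch below uses a 2D rotation matrix as an example of an orthogonal matrix; the helper functions and the chosen vectors are assumptions made for illustration only.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


# A rotation by 30 degrees is an orthogonal matrix.
theta = math.pi / 6
A = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

# A^T A should be the identity, so A^T acts as the inverse of A.
product = matmul(transpose(A), A)

# Lengths and dot products (hence angles) are unchanged by A.
u, v = [3.0, 4.0], [1.0, -2.0]
Au, Av = matvec(A, u), matvec(A, v)
print(product)
print(norm(u), norm(Au))
print(dot(u, v), dot(Au, Av))
```

The printed values agree up to floating-point rounding.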

Visual Reference for SQL Joins

In the world of structured data, most of us have come across joins. There is no escaping them, as joins give a unified view of information that is split between the dimension and fact tables in a database.

Though modern SQL engines are capable of joining millions of rows, the developer needs to know certain internals of each SQL engine to get the most out of it.

This post gives a quick recap of PySpark SQL joins, along with a collection of tips and tricks!

Experiment Code:

Two datasets to try Spark SQL on a big dataset.

Most software engineers like me must have tried, quite a few times, to read up on statistics and some of its methods and theories, just to get the hang of it and its usefulness in Machine Learning.

Do I have to tell you how boring it is? And how many times we have to read the same concept with different notations across different sources?

Got tired of it…

Recently, for a change, I adapted the software engineering workflow to mathematics...

  • Read about the topic of interest and understand it theoretically. (This includes both the problem and possible solutions)

A union is a simple operation that appends the rows of two or more tables. What could go wrong? Hmm…

  • Schema mismatch: the number of columns or the data types differ
  • Rows getting appended in the wrong column order
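The second pitfall is easy to reproduce. The sketch below is plain Python, not the project's code: the table representation, function names and sample rows are all hypothetical. It mirrors the difference between a union that matches columns by position (as PySpark's `DataFrame.union` does) and one that matches them by name (as `unionByName` does).

```python
def union_by_position(left, right):
    """Append rows positionally; column names are ignored."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if len(left_cols) != len(right_cols):
        raise ValueError("schema mismatch: different number of columns")
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Reorder the right table's columns to match the left before appending."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if set(left_cols) != set(right_cols):
        raise ValueError("schema mismatch: column names differ")
    order = [right_cols.index(col) for col in left_cols]
    realigned = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + realigned


# Two hypothetical ingestion drops with the same columns in a different order.
drop_1 = (["id", "name", "city"], [(1, "asha", "chennai")])
drop_2 = (["id", "city", "name"], [(2, "pune", "ravi")])

_, bad_rows = union_by_position(drop_1, drop_2)
_, good_rows = union_by_name(drop_1, drop_2)

print(bad_rows)   # the city value ends up under the "name" column
print(good_rows)  # columns are realigned before the rows are appended
```

The positional version raises no error here because the column counts match, which is exactly why this kind of bug can go unnoticed.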

In this post I am going to show some pitfalls we faced in our project some time back, when we used union to add tables from ingestion drops, and how we managed them.

Join vs Union

It would be nice to recap the differences…just the basic ones!

  • In a join, we combine two tables with different schemas based on a common key column. The new table can now have columns from…

This is part of the series called Big Data Playground for Engineers and the content page is here!

Git:

A fully functional code base and use case examples are up and running.

Repo: https://github.com/gyan42/spark-streaming-playground

Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html

It would be good to read the theory on building production-ready Machine Learning systems through the following links, since we are going to see how to materialize that theory. As part of this series I have shown how to correlate the common tutorials that we come across on the web to build an end-to-end real-time pipeline.

The example use case and the code…

Data Flow

Data without labels is of less interest for Machine Learning, and for Deep Learning it's a big no!

Preparing labelled data is not child's play…

  • Domain understanding is required
  • Lots of human hours need to be put in to prepare the dataset, accounting for human error in the process

More professionally managed services are available, like…

In this post let's see how to…

Snorkel

In 2016, AI researchers from Stanford University introduced a new paradigm known as data programming, which allows data engineers to express weak supervision strategies and generate probabilistic training labels representing the lineage of the individual labels. The ideas behind data programming were incredibly compelling but lacked a practical implementation.

Snorkel: rapid training data creation with weak supervision Ratner et al., VLDB’18
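As a rough illustration of the data programming idea, the sketch below combines a few noisy labeling functions into a probabilistic label. It is plain Python and does not use Snorkel's API. The labeling functions and the vote-share combination are simplifying assumptions; Snorkel's own label model is more sophisticated than a simple vote.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


# Hypothetical labeling functions: cheap heuristics that may abstain.
def lf_contains_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text):
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


def lf_ends_with_exclamation(text):
    return POSITIVE if text.endswith("!") else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_ends_with_exclamation]


def weak_label(text):
    """Return the share of non-abstaining votes that say POSITIVE, or None."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return None  # no heuristic fired, so the example stays unlabeled
    return votes.count(POSITIVE) / len(votes)


for sentence in ["Great phone!", "Terrible battery!", "It is a phone."]:
    print(sentence, weak_label(sentence))
```

Each label can be traced back to the heuristics that voted for it, which is the "lineage" aspect mentioned above.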

A weak supervised…

Anyone starting out in Natural Language Processing is bound to come across spaCy, as it is becoming a competitor to NLTK!

The Jupyter notebook below quickly introduces the basic concepts of NLP with spaCy.

How can anyone ignore the Twitter tweet stream when they start learning streaming?

Example Illustration

Why do we need Kafka? In brief, it is a distributed streaming platform. Imagine we wanted to do some sort of NLP task on the tweet text; within a matter of time, a single machine would be overwhelmed by the text data, hitting its computing limits. Here Kafka helps mostly in handling the…

Mageswaran D

A simple guy in pursuit of AI and Deep Learning with Big Data tools :) @ https://www.linkedin.com/in/mageswaran1989/
