Most software engineers like me must have tried quite a few times to read up on Statistics and some of its methods and theories, just to get the hang of it and its usefulness in Machine Learning.
Do I have to tell you how boring it is, and how many times we have to read the same concept with different notations across different sources?
Got tired of it…
Recently, for a change, I adapted the Software Engineer workflow for Mathematics…
The union operation is a simple operation that joins the rows of two or more tables. What could go wrong? Hmm…
In this post I am gonna show some pitfalls we faced in our project some time back, when we used union to add tables from ingestion drops, and how we managed them.
It would be nice to recap the differences… just the basic ones!
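To make the pitfall concrete, here is a minimal PySpark sketch (toy data, not the project's actual code): DataFrame.union() matches columns by position, not by name, so two ingestion drops with the same columns in a different order get silently mixed up, while unionByName() lines them up correctly.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("union-pitfall").getOrCreate()

# Two ingestion drops with the same columns, but in a different order
drop_1 = spark.createDataFrame([("1", "alice")], ["id", "name"])
drop_2 = spark.createDataFrame([("bob", "2")], ["name", "id"])

drop_1.union(drop_2).show()        # union() matches by position: "bob" lands in the id column
drop_1.unionByName(drop_2).show()  # unionByName() matches by name: rows line up correctly
```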
This is part of the series called Big Data Playground for Engineers and the content page is here!
A fully functional code base and use case examples are up and running.
Repo: https://github.com/gyan42/spark-streaming-playground
Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html
It would be good to read the theory on building a production-ready Machine Learning system through the following links, since we are going to see how to materialize that theory. As part of this series I show how to tie together the common tutorials we come across on the web into an end-to-end real-time pipeline.
The example use case and the code used to materialize the theory are just an attempt at education on a single local machine; sure, there are a lot of improvements that have to be considered for a production scenario. Nevertheless, I think this would be a good eye opener for many, exposing you to an end-to-end data science pipeline, as it did for me. …
Data without labels is of little interest for Machine Learning, and for Deep Learning it's a big nooo!
Preparing labelled data is no child's play…
More professional managed services are available, like…
In this post let's see how to create a simple Flask-based annotation tool that reads data from a PostgreSQL table and stores the label data back into it. …
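As a rough sketch of the idea (the table and column names below are only illustrative, not the project's actual schema), the whole tool boils down to two endpoints: one that fetches an unlabelled row from PostgreSQL and one that writes the submitted label back.

```python
import psycopg2
from flask import Flask, request, jsonify

app = Flask(__name__)

def get_conn():
    # Connection details are placeholders for a local PostgreSQL instance
    return psycopg2.connect(host="localhost", dbname="playground",
                            user="postgres", password="postgres")

@app.route("/next")
def next_text():
    # Hand the annotator the next row that has no label yet
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute("SELECT id, text FROM tweets WHERE label IS NULL LIMIT 1")
        row = cur.fetchone()
    return jsonify({"id": row[0], "text": row[1]}) if row else jsonify({})

@app.route("/label", methods=["POST"])
def label():
    # Store the submitted label back into the same table
    body = request.get_json()
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute("UPDATE tweets SET label = %s WHERE id = %s",
                    (body["label"], body["id"]))
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=5000)
```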
In 2016, AI researchers from Stanford University introduced a new paradigm known as data programming that allows data engineers to express weak supervision strategies and generate probabilistic training labels representing the lineage of the individual labels. The ideas behind data programming were incredibly compelling but lacked a practical implementation.
Snorkel: Rapid Training Data Creation with Weak Supervision, Ratner et al., VLDB'18
A weakly supervised training data set creation framework. It tackles one of the central questions in supervised machine learning: how do you get a large enough set of training data to power modern deep models? …
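To give a flavour of what that looks like in code, here is a toy sketch with Snorkel's labeling-function API (the rules and data are made up, not the ones used later in this series): you write small noisy heuristics, apply them over a DataFrame, and let the label model combine their votes into probabilistic labels.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_contains_love(x):
    # Noisy heuristic: "love" probably means a positive example
    return POSITIVE if "love" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_hate(x):
    # Noisy heuristic: "hate" probably means a negative example
    return NEGATIVE if "hate" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": ["I love spark", "I hate bugs", "just a tweet"]})

# Each labeling function votes (or abstains) on every example -> label matrix L
L = PandasLFApplier([lf_contains_love, lf_contains_hate]).apply(df)

# The label model denoises the votes into probabilistic training labels
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=100)
print(label_model.predict_proba(L))
```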
Anyone starting out in Natural Language Processing has to come across spaCy, as it is becoming the competitor to NLTK!
The Jupyter notebook below quickly introduces the basic concepts in NLP with spaCy.
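For a taste of what the notebook covers, a few lines of spaCy go a long way (this assumes the small English model, installed with python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

# Tokenization, lemmas, part-of-speech tags and dependency labels
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# Named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
```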
How can anyone ignore the Twitter tweet stream when they start learning streaming?
Why do we need Kafka? In brief, it's a distributed streaming platform. Imagine we wanted to do some sort of NLP task on the tweet text; in a matter of time a single machine would be overwhelmed with the text data, hitting its computing limits. Here Kafka helps mostly in handling the stream data before it gets processed, either by Kafka itself or by an external computing platform like Spark Streaming. …
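A bare-bones kafka-python sketch of that idea (the topic name and broker address are just placeholders): the tweet listener only has to push raw text onto a topic, and any number of downstream consumers or Spark jobs can read from it at their own pace.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: the tweet listener pushes raw text onto a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("tweets", value="spark streaming is fun".encode("utf-8"))
producer.flush()

# Consumer side: any downstream job reads the stream at its own pace
consumer = KafkaConsumer("tweets",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value.decode("utf-8"))
```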
After this you can set up a common metastore between Spark Streaming, Spark SQL, and Hive, enabling cross-tool query capability.
From Apache Spark you can read Hive tables and vice versa!
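A minimal sketch of what the shared metastore buys you (the table names are only examples): a SparkSession built with Hive support sees the tables Hive knows about, and anything Spark saves as a table shows up in Hive too.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() points Spark at the shared Hive metastore
spark = (SparkSession.builder
         .appName("hive-metastore-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM default.tweets LIMIT 10").show()  # a table registered via Hive

# And the other way around: a table saved from Spark is visible to Hive
spark.range(5).write.mode("overwrite").saveAsTable("default.spark_made_table")
```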
Apache Hive is an open source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files. It processes structured and semi-structured data in Hadoop.
In a traditional database store, both the compute and the data live on the same machine, forcing people to go for a bigger machine every time they hit the bottleneck of business requirements for more power and data storage. …
Before you begin!
This series supports the learn-by-doing approach, where each scenario is fabricated based on my past experience in the industry. I am not going to explain the nitty-gritty of the frameworks or tools used, like other tutorials do, but certain tips and tricks will be highlighted, which may only pop up when mixing multiple components together and an end-to-end solution needs to be built.
This is to show starters how to put different components together and build a complete solution, as opposed to the simple tutorials out there.
For example, most of us have come across building Machine Learning and Deep Learning models, but how many times have you seen them in live action? …
RDD Basics:
Dataset Basics…
When I started learning Apache Spark almost 3 years back, it was more about setting it up on a local machine and running the code on a single machine. Later came writing Scala code, building it, and running it with spark-submit. Then came PySpark and Jupyter notebooks.
Most newcomers find it easy to get started with PySpark but find it difficult when dealing with production use cases and/or projects. The same configurations, for the same data set, may not work alike across pipeline modules. …