Mageswaran D

Apr 16

How to keep Cloud cost under control?

Collect the hardware metrics (CPU/memory utilization, disk I/O)

Cloud

3 min read


Mar 24

2022: No FileSystem for scheme: s3? How to read/write data from AWS S3 with a standalone Apache Spark 3.2.1 setup and use Delta as the format?

How many times have you faced this error: an S3 or S3a scheme-not-found error while working with AWS S3 + Spark outside EMR?

An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)…
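A common way to get past this error on a standalone setup is to pull in the `hadoop-aws` module and point the `s3`/`s3a` schemes at the S3A file system. The sketch below is not taken from the article: the Maven coordinates and versions are assumptions that must match your own Spark/Hadoop build, and `s3_delta_conf`, `to_s3a`, and `build_session` are hypothetical helper names.

```python
# Assumed versions: Spark 3.2.1 prebuilt for Hadoop 3.3.x and a Delta release
# built for Spark 3.2. Adjust both coordinates to match your installation.
HADOOP_AWS = "org.apache.hadoop:hadoop-aws:3.3.1"
DELTA_CORE = "io.delta:delta-core_2.12:1.2.0"

S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3_delta_conf() -> dict:
    """Spark settings that register S3A for both schemes and enable Delta."""
    return {
        "spark.jars.packages": f"{HADOOP_AWS},{DELTA_CORE}",
        "spark.hadoop.fs.s3a.impl": S3A_IMPL,
        # Map the plain s3:// scheme to S3A as well, so old paths keep working.
        "spark.hadoop.fs.s3.impl": S3A_IMPL,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def to_s3a(path: str) -> str:
    """Rewrite an s3:// URI to s3a://; leave every other path untouched."""
    prefix = "s3://"
    return "s3a://" + path[len(prefix):] if path.startswith(prefix) else path


def build_session(app_name: str = "s3-delta-demo"):
    """Create a SparkSession with the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3_delta_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session from `build_session()`, a read would look like `spark.read.format("delta").load(to_s3a("s3://my-bucket/table"))`, where the bucket name is a placeholder. Credentials are not configured here; supplying them is left to your environment.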

Apache Spark

3 min read


Jan 2

PySpark Practice Problems

How to transform array of arrays into columns?

    import pyspark.sql.functions as F

    df = spark.createDataFrame(
        [([["a","b","c"], ["d","e","f"], ["g","h","i", "j"]],)],
        ["data"]
    )
    df.show(20, False)

    df = df.withColumn("data1", F.explode("data"))
    df.select('data1').show()

    # Row(max(size(data1))=4) ---> 4
    max_size = df.select(F.max(F.size('data1'))).collect()[0][0]

    df.select(
        *[F.col("data1")[i].alias(f"col_{i}") for i in range(max_size)]
    ).show()
    …
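To check what the PySpark snippet above should produce, the same transformation can be mirrored in plain Python. This is only an illustrative sketch: `arrays_to_columns` is a hypothetical helper, and it assumes Spark's default (non-ANSI) behaviour where indexing past the end of an array yields null.

```python
def arrays_to_columns(rows):
    """Mimic explode + positional select on a list of arrays of arrays."""
    # "explode": one output row per inner array
    exploded = [inner for row in rows for inner in row]
    # widest inner array decides how many columns are produced
    max_size = max((len(inner) for inner in exploded), default=0)
    # positional select; missing positions become None (null in Spark)
    return [
        {f"col_{i}": (inner[i] if i < len(inner) else None)
         for i in range(max_size)}
        for inner in exploded
    ]


data = [[["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i", "j"]]]
for row in arrays_to_columns(data):
    print(row)
```

Only the last row has a value in `col_3`; the two shorter arrays get `None` there.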

Pyspark

3 min read


Dec 15, 2021

Dynamic Programming Patterns to Ace Interviews

The following material is taken from the freeCodeCamp.org YouTube video course https://www.youtube.com/watch?v=oBt53YbR9Kk; watching the video is strongly recommended! Python version Colab Notebook @ https://colab.research.google.com/gist/Mageswaran1989/658eb75fed28f5953ac5167b3a14ff6d/dynamicprogramming.ipynb Online visualisation tool: https://visualgo.net/en/recursion Dynamic Programming is useful when a problem has repetitive computation that can be cached. …
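The idea of caching repeated computation can be illustrated with the classic Fibonacci example. This is a minimal sketch, not code from the article or the linked notebook; `fib_naive` and `fib_memo` are illustrative names.

```python
from functools import lru_cache


def fib_naive(n: int) -> int:
    """Plain recursion: recomputes the same subproblems exponentially often."""
    if n <= 2:
        return 1
    return fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n: int) -> int:
    """Same recurrence, but every result is cached after its first computation."""
    if n <= 2:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)


print(fib_memo(50))
```

`fib_naive(50)` would take a very long time, while `fib_memo(50)` needs only about 50 distinct calls because each subproblem is solved once.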

Dynamic Programming

9 min read


Dec 8, 2021

What if I wanted to submit remote PySpark jobs to AWS EMR without worrying about library dependency versions?

Typically, a Spark cluster is used to run ETL jobs or streaming jobs, where everything is managed by developers for developers and the end user only accesses the processed data. I worked for nearly two years in such a setup, and I can replicate the same with ease. How many times we…
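Since the post is tagged Apache Livy, a remote submission would presumably go through Livy's REST API. The sketch below only shows the general shape of a batch submission to the `/batches` endpoint; it does not cover the dependency handling the article is about. The host name, script path, and the helpers `build_batch_request` and `submit` are assumptions for illustration.

```python
import json
from urllib import request


def build_batch_request(livy_url, script_path, args=None, py_files=None):
    """Build (but do not send) a POST /batches request for a PySpark script."""
    payload = {"file": script_path}  # must be a path the cluster can read
    if args:
        payload["args"] = list(args)
    if py_files:
        payload["pyFiles"] = list(py_files)
    return request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(req):
    """Send the request and return Livy's JSON response as a dict."""
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


# Hypothetical endpoint and script location; 8998 is Livy's default port.
req = build_batch_request(
    "http://emr-master:8998",
    "s3://my-bucket/jobs/etl.py",
    args=["2021-12-08"],
)
print(req.full_url, req.data.decode("utf-8"))
```

Calling `submit(req)` against a reachable Livy server would start the batch; the request is only built here, so the snippet runs without a cluster.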

Apache Livy

5 min read


Nov 29, 2021

NLP: What it takes to design a full stack DeepLearning based Receipts form filling system using NER?

gyan42 / receipts-form-filling (gitlab.com): an online version of a Transformer model to extract information from receipts and fill in a form. An online Colab Notebook is available for model training. Use docker compose to launch the demo.

NLP

11 min read


Nov 10, 2021

NLP: How to predict the next word in a search query? Full stack N-Gram Model with Vue, FastAPI & Heroku

1. Dataset: google_wellformed_query, 25K examples
2. Model: Naive N-Gram probability model: single machine version and PySpark version
3. Web UI: https://v3.vuejs.org/
4. Backend API: https://fastapi.tiangolo.com/
5. Docker and Docker Compose
6. Deployment on Heroku

Git: https://github.com/gyan42/autocomplete-ngram-model
Colab Notebook for model training: https://colab.research.google.com/gist/Mageswaran1989/e49e043f09c1de89c6f433967be118a2/autocorrectmodel.ipynb
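A naive N-gram probability model of the kind listed above can be sketched with bigram counts in plain Python. This is an illustration of the technique, not the code from the linked repository; the toy queries and the names `train_bigrams` and `predict_next` are made up for the example.

```python
from collections import Counter, defaultdict

START = "<s>"  # marker for the beginning of a query


def train_bigrams(queries):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for query in queries:
        tokens = [START] + query.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prefix, k=3):
    """Return up to k (word, probability) pairs for the word after prefix."""
    tokens = prefix.lower().split()
    prev = tokens[-1] if tokens else START
    following = counts.get(prev)
    if not following:
        return []  # unseen context: no suggestion
    total = sum(following.values())
    return [(word, n / total) for word, n in following.most_common(k)]


queries = ["how to cook rice", "how to cook pasta", "how to learn python"]
model = train_bigrams(queries)
print(predict_next(model, "how to"))
```

The probability is the maximum-likelihood estimate count(prev, word) / count(prev); a real model would typically use longer contexts and some smoothing for unseen pairs.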

NLP

8 min read


Oct 28, 2021

How to build custom NER HuggingFace dataset for receipts and train with HuggingFace Transformers library?

Disclaimer: It is assumed that you have some working knowledge of the Hugging Face library and the datasets library, to begin with. Git link: mozhi-datasets/sroie2019 at main · gyan42/mozhi-datasets (github.com). Official URL: https://rrc.cvc.uab.es/?ch=13&com=downloads Drive URL… Named Entity Recognition is the backbone of extracting information from text documents, yet an often less visited topic in NLP. Let’s not get into what is NER topic for today’s…

Transformers

5 min read


Oct 26, 2021

A dive into Apache Spark Parquet Reader for small size files

While working on my current project, I was asked how Spark reads Parquet files and how it achieves parallelisation. A very simple question indeed. As part of my current project, I had to load a lot of small files and do some filtering and aggregations to get counts for given…
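How many read tasks Spark creates for many small files depends mainly on `spark.sql.files.maxPartitionBytes`, `spark.sql.files.openCostInBytes`, and the default parallelism. The sketch below is a simplified pure-Python approximation of that packing logic, not Spark's actual code: it assumes every file is smaller than the split limit (so no file is split), and `max_split_bytes` and `pack_files` are illustrative names.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Approximate the per-partition byte limit Spark derives for a scan."""
    # Every file is padded with an "open cost" so tiny files are not free.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def pack_files(file_sizes, parallelism,
               max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Greedily pack files (largest first) into read partitions."""
    limit = max_split_bytes(file_sizes, parallelism,
                            max_partition_bytes, open_cost)
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition when the next file would overflow it.
        if current and current_size + size > limit:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost
    if current:
        partitions.append(current)
    return partitions


parts = pack_files([1 * MB] * 100, parallelism=8)
print(len(parts), [len(p) for p in parts])
```

Under these assumptions, 100 files of 1 MB on 8 cores are grouped into a handful of partitions rather than 100 tasks, because the 4 MB open cost dominates the packing.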

Pyspark

8 min read


May 24, 2021

Some interesting interview Q&A around “randomness”

1 How to sample a random element from an infinite stream? For an average developer this is an insane question, or at least it was for me! If we know the length of the stream, we can use the built-in or NumPy random function to generate a number. import numpy as np np.random.randint(0…
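When the length of the stream is not known in advance, a common answer is reservoir sampling with a reservoir of size one. The sketch below shows that technique; it may differ from the solution in the article, and `sample_one` is an illustrative name.

```python
import random


def sample_one(stream, rng=random):
    """Pick one item uniformly at random from a stream of unknown length.

    The n-th item replaces the current choice with probability 1/n, which
    leaves every item seen so far equally likely to be the final choice.
    """
    chosen = None
    for count, item in enumerate(stream, start=1):
        if rng.randrange(count) == 0:  # true with probability 1/count
            chosen = item
    return chosen


# The stream is consumed once and never stored in memory.
print(sample_one(iter(range(1_000_000)), random.Random(42)))
```

For a truly unbounded stream the loop never ends, so in practice `chosen` is read at whatever point a sample is needed; it is always a uniform pick over the items seen so far.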

Interview

2 min read


A simple guy in pursuit of AI and Deep Learning with Big Data tools :) @ https://www.linkedin.com/in/mageswaran1989/
