You are a computer programmer who likes to solve real-world problems with the help of coding knowledge. Amazingly, you built something very meaningful, and someone liked your code so much that she is ready to pay you to use it.

Your code became popular, and slowly a large audience came to you asking for your codebase.

Patterns! Patterns! Patterns!

That’s what we are concerned about when running a data mining pipeline, which helps to find patterns in the collected dataset. But are all patterns interesting? Well, not really. Interesting patterns are the ones that exhibit all or some of the properties mentioned below:

- Easily Understood
- Valid on new data
- Potentially useful
- Novel

But finding patterns is not always a cakewalk. We often encounter data points that lie visibly far from the rest of the collected data, and such points fall under the category of outliers.

An **outlier** is a data point in…

When we encounter various ML algorithms, we meet terms that have their origin in probability and statistics and that work as a lexicon for an ML engineer or data scientist. We will try to understand the meaning of each term and define it mathematically. The terms include:

- Expectation | Mean
- Variance | Standard Deviation
- Covariance
- Correlation

Let there be some function f(x) with respect to a distribution P(x)

The expectation or expected value of some function f(x) with respect to a probability from P
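The four terms listed above can be illustrated on a small discrete distribution. The sketch below assumes the standard textbook definitions (expectation as the probability-weighted sum of f(x), variance as the expected squared deviation from the mean, and so on); the fair six-sided die and the helper names are hypothetical choices made only for illustration.

```python
from math import sqrt

def expectation(values, probs):
    """E[f(x)]: sum of P(x) * f(x) over a discrete distribution."""
    return sum(p * v for v, p in zip(values, probs))

def variance(values, probs):
    """Var(f(x)) = E[(f(x) - E[f(x)])^2]."""
    mu = expectation(values, probs)
    return expectation([(v - mu) ** 2 for v in values], probs)

def covariance(xs, ys, probs):
    """Cov(f, g) = E[(f(x) - E[f(x)]) * (g(x) - E[g(x)])]."""
    mx = expectation(xs, probs)
    my = expectation(ys, probs)
    return expectation([(x - mx) * (y - my) for x, y in zip(xs, ys)], probs)

def correlation(xs, ys, probs):
    """Covariance normalised by the two standard deviations."""
    return covariance(xs, ys, probs) / sqrt(variance(xs, probs) * variance(ys, probs))

# Assumed example: a fair six-sided die, P(x) = 1/6 for each face.
faces = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

die_mean = expectation(faces, probs)       # 3.5
die_var = variance(faces, probs)           # 35/12
die_std = sqrt(die_var)                    # standard deviation

# g(x) = 2x is a linear function of x, so the correlation is 1.
doubled = [2 * f for f in faces]
corr = correlation(faces, doubled, probs)
```

Correlation rescales covariance so that it always lies between -1 and 1, which is why doubling the die value leaves the correlation at 1 even though the covariance changes.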

The ultimate goal of machine learning techniques is to estimate the actual probability distribution of the data-generating process. This process could follow any trivial or non-trivial probability distribution, and it might not be meaningful to make any assumption about it just by looking at the samples/training data taken from the population. Here the central limit theorem comes in and gives us some interesting properties for observing data.

The theorem states that as the size of the sample increases, the distribution of the mean across multiple samples will approximate a Gaussian distribution.
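A quick simulation can make this statement tangible. The sketch below is an illustration under assumptions, not a proof: it draws repeated samples from an exponential distribution (chosen here only because it is clearly non-Gaussian), and the sample size, number of samples and seed are arbitrary.

```python
import random
import statistics

def sample_means(sample_size, num_samples, seed=0):
    """Draw num_samples samples from Exponential(rate=1) and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

# The population has mean 1 and standard deviation 1, and is heavily skewed.
means = sample_means(sample_size=50, num_samples=2000)

center = statistics.fmean(means)   # should land near the population mean, 1
spread = statistics.pstdev(means)  # should land near 1 / sqrt(50)

# For a Gaussian, roughly 68% of values fall within one standard deviation.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(center, spread, within_one_sd)
```

Increasing `sample_size` should tighten the spread of the sample means and bring their histogram closer to a bell shape, even though the individual draws stay skewed.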

Say there is a random variable X whose probability distribution may or may…

A claim is assumed valid if its counterclaim is highly implausible.

In mathematics there are many techniques for proving a statement to be true. These proofs can also help in making inferences or decisions. Proof by induction, proof by contradiction and statistical proofs are all valued. One such technique for making statistical inferences is hypothesis testing.

Hypothesis testing is a way for you to test the results of a survey/experiment to see if you have meaningful results. The process of hypothesis testing is to draw inferences or some conclusion about the overall population or data by conducting some…

Any machine learning algorithm generally involves components such as an optimisation procedure, a cost function, a modelling technique and, most important of all, the “dataset to learn”. It is said that any ML algorithm performs only as well as the dataset it is fed.

Most of the time spent following the “Knowledge discovery from data” pipeline goes into data collection, cleaning and pre-processing. Data preprocessing can involve multiple techniques, such as data transformation, analysing redundant data or outlier detection. All these anomalies in the dataset could cause our model to underperform. There are a couple of questions we should take care of before applying a model on any…

Probability theory is all about measuring uncertainty. But before defining the uncertainty, we should have some object/event whose uncertainty we are talking about.

In this article I will talk about random variables, probability distributions and some of the famous distributions of concern to a machine learning enthusiast.

A variable that takes on different values randomly is called a random variable (RV). An RV can be discrete or continuous. Let's take the example of tossing two coins simultaneously. Then the RV X = [HH, HT, TH, TT] denotes the different possible states of X.

On its own, an RV is just a description of…
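The two-coin example can be enumerated directly. The sketch below assumes both coins are fair, so all four states are equally likely; the derived variable "number of heads" is a hypothetical addition used only to show how a numeric RV and its probabilities follow from those states.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All states of tossing two coins: HH, HT, TH, TT.
outcomes = ["".join(toss) for toss in product("HT", repeat=2)]

# Assumption: fair coins, so each state has probability 1/4.
state_prob = Fraction(1, len(outcomes))

# Hypothetical numeric RV: the number of heads in each state.
heads_count = Counter(state.count("H") for state in outcomes)

# Probability mass function: P(heads = k) for each possible k.
pmf = {k: n * state_prob for k, n in heads_count.items()}

for k in sorted(pmf):
    print(f"P(heads = {k}) = {pmf[k]}")
```

Using `Fraction` keeps the probabilities exact, which makes it easy to confirm that they sum to 1.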

Recently, elections were held in part of a democratic country. Instead of waiting for the final election results, some media channels went to the streets and started asking people about the leader of their choice. Randomly asking people about their choice reflects the view of only a very small minority of the country's overall population, but it carries many useful insights. These are generally referred to as exit poll results.

The ‘**Population**’ in statistics doesn’t necessarily have to be human. …

To understand the two different approaches to probability, we should first be very clear about probability theory.

In computer science we rarely talk about uncertainty and take most entities to be certain and deterministic.

- Errors in hardware will not occur
- CPU will execute every command that you give to it

In machine learning, by contrast, we deal with uncertain quantities and sometimes even with non-deterministic (stochastic) quantities.

The three possible reasons for uncertainty are:

- Inherent stochasticity in the system
- Incomplete observability
- Incomplete modelling

To understand the reasons mentioned above, let's take two statements into account:

- …

Data science is a hot topic of research, but the best part is how it affects business. Today every company, mid-scale or large, exploits the choices of its clients and looks for the patterns shown by users to understand their behaviour. Whether the aim is to train a language model, automate some classification/clustering task or recommend products to customers, data is essential for finding hidden and interesting patterns.

The openings for related jobs in the market are increasing every year, but the material for preparing for the related interviews is not as well defined as it is for other roles. Handful of Data…

Knowledge shared is knowledge squared | My Portfolio https://lihkinverma.github.io/portfolio/