Beyond Basics: A Comprehensive Interview Question Bank for DL, NLP, and Diffusion Models

Nikhil Verma
11 min read · Dec 27, 2023


Gone are the days when a basic understanding of machine learning sufficed for tech interviews. Today, employers seek candidates who can navigate the intricacies. If you are looking for basic Machine Learning topics to prepare, I would encourage you to check my Data Science Interview Preparation blog. In this article, I go beyond the basics, providing you with an extensive interview question bank that will not only test your knowledge but also challenge you to apply these concepts in real-world scenarios. The question bank was prepared by studying interview details shared online for Google DeepMind, Google Brain, Microsoft, Samsung Research, OpenAI, and Meta, along with theoretical and practical observations collected by reading papers in particular domains of choice such as Deep Learning, Transformers, LLMs, and others.

The interview room can be an intimidating space, especially when the discussion revolves around complex concepts like Deep Learning, Natural Language Processing, and Diffusion Models. To equip you with the confidence and knowledge needed to excel, I present a treasure trove of interview questions that cover the breadth and depth of these fascinating domains. Let’s unravel the secrets to cracking the code of your next technical interview. In case there are any other related questions or topics on your mind, please feel free to add them in the comments of this blog. If you are looking to practise DL/MLE/Software interviews live with me, please reach out to me on Topmate.

Disclaimer: This blog does not in any way represent the views of my present or any previous employer(s).

The topics covered in this post include:

  • Probability and Information Theory
  • Optimization
  • Gradient Descent, LR, Activation functions and Optimizers
  • Transformer
  • Diffusion Models

Probability and Information Theory

  • What is a Random Variable?
  • What are some famous probability distributions?: Bernoulli, Multinoulli, Gaussian, Laplace, Exponential, Mixture of Distributions (GMM)
  • Expectation, variance, covariance and correlation
  • How would you find the 95% confidence interval for a given mean?: roughly mean ± 2σ (more precisely ±1.96 standard errors, assuming a normal distribution)
  • What is Information and Entropy?: Learning that a less likely event has happened is more informative than learning that a likely event has happened. We therefore measure the self-information of an event, given the probability distribution, as I(X=x) = -log(P(X=x)) nats, while the expected information of an event drawn from the distribution is the entropy, H(X) = E_{x~P}[I(x)] = -∑_i p_i log(p_i)
  • Distributions that are nearly deterministic have low entropy; distributions that are closer to uniform have high entropy
  • What is KL Divergence?: If we have two separate probability distributions P(x) and Q(x) over the same random variable, we can measure how different these two distributions are using the KL divergence; mathematically, KL(P||Q) = E_{x~P}[log(P(x)/Q(x))]. It is non-negative and asymmetric (a short entropy/KL sketch follows this list)
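
To make the entropy and KL-divergence formulas above concrete, here is a minimal NumPy sketch; the function names are mine and chosen purely for illustration:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_i p_i * log(p_i), in nats (natural log); 0*log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(entropy([0.5, 0.5]))                    # ~0.693 nats, the maximum for two outcomes
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # > 0, and note KL(P||Q) != KL(Q||P)
```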

Optimization

  • What is the difference between MLE and MAP estimation?
  • What is the difference between optimization and regularization?
  • Optimization: the task of either minimizing or maximizing some function f(x) by altering x
  • Regularization: any modification made to the learning algorithm such that its performance on training data does not improve (and may even get slightly worse), but its performance on unseen test data improves. Regularization methods aim to mitigate overfitting by introducing additional constraints or penalties on the learning algorithm
  • What is the difference between convex and concave optimization?
  • What are the benefits of convex functions for optimization?
  • Convex optimization algorithms are applicable only to convex functions (functions whose Hessian is positive semi-definite everywhere, i.e. all eigenvalues are positive or zero). Such functions are well behaved because they lack saddle points and all of their local minima are necessarily global minima. Most functions in DL cannot be expressed as convex optimization problems
  • What are some of the commonly used optimization techniques in neural networks?: gradient-based optimization like GD, SGD, Momentum, Nesterov, Adagrad, RMSProp, Adam; nature-inspired optimization like Genetic Algorithms, PSO, Differential Evolution, Bee Colony algorithms
  • How do we update the parameters using gradient descent, and why?: During backpropagation, we update each parameter by subtracting the gradient of the cost function with respect to that parameter, scaled by the learning rate. The negative of the gradient points in the direction that decreases the cost function fastest (a small gradient-descent sketch follows this list)
  • What is the role of the learning rate in gradient descent?: The LR must be small enough to avoid overshooting the minimum and climbing uphill in directions with strong positive curvature, yet large enough for training to make progress
  • What are Jacobian and Hessian Matrices of a function?
  • What is the role of the second derivative in optimization?: It tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. It measures curvature, i.e. whether the function is locally concave or convex
  • What is Saddle point?
  • What is Taylor series expansion?
  • What is Newton's method?: It is a second-order optimization algorithm that uses the Hessian (curvature) in addition to the gradient
  • What is the condition number?: Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. For f(x) = A^(-1)x, if A has an eigenvalue decomposition, its condition number is max_{i,j} |λ_i/λ_j|, i.e. the ratio of the largest to the smallest eigenvalue in magnitude. In multiple dimensions there can be a different second derivative in each direction, and the condition number of the Hessian measures how much these second derivatives vary. A poor (large) condition number also implies poor performance of gradient descent
  • What are the drawbacks of using gradient descent?: GD fails to exploit the curvature information contained in the Hessian matrix. The cost function of a NN is neither convex nor concave; like sin(x) over the reals, it has multiple maxima and minima
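
As a small illustration of several points above (the gradient step, the learning rate, and the condition number), here is a sketch on a toy quadratic f(x) = 0.5·xᵀAx whose Hessian is A; the matrix, learning rate, and iteration count are arbitrary choices made only for this example:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T A x; its gradient is A x and its Hessian is A.
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])       # eigenvalues 10 and 1 -> condition number 10

x = np.array([1.0, 1.0])
lr = 0.05                         # must stay below 2/lambda_max (= 0.2 here) to avoid overshooting
for _ in range(200):
    x = x - lr * (A @ x)          # gradient-descent update: theta <- theta - lr * gradient

eigvals = np.linalg.eigvalsh(A)
cond = eigvals.max() / eigvals.min()
print(x, cond)                    # x approaches the minimum at the origin; cond == 10.0
```

The larger the condition number, the more the learning rate is constrained by the steepest direction while progress along the flattest direction crawls, which is the practical meaning of "poor conditioning hurts gradient descent".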

Gradient Descent, LR, Activation functions and Optimizers

  • What is the difference between GD, SGD, and mini-batch GD?
  • GD: an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In SGD, only one sample from the dataset is chosen at random for each iteration, so the path the algorithm takes to reach the minimum is noisier than vanilla GD; the goal is to reach the minimum faster by taking cheap, greedy steps. Mini-batch GD sits in between, using a small batch of samples per step
  • Difference between GD, Momentum and Nesterov GD — Momentum: takes past gradients into account to smooth out the update, using an exponentially weighted average. Nesterov: works on the principle of “look before you leap”, evaluating the gradient at the look-ahead position
  • How to tune the LR during NN training? Are there particular optimizers that do so?: Annealing the LR: step decay, exponential decay; adaptive-LR optimizers
  • Optimizers that adapt the LR — Adagrad, RMSprop, Adam, AdamW. Adagrad: parameters that have larger gradients or more frequent updates should have a smaller LR, so that we do not overshoot the minimum, while parameters with smaller gradients or infrequent updates should have a larger LR. It decays the LR very aggressively, so parameters end up receiving tiny updates, and it can get stuck when it is close to converging to the minimum (an Adam-style update is sketched after this list)
  • RMSprop: overcomes Adagrad's problem by being less aggressive in decaying the LR. It works by keeping an exponentially weighted average of the squares of past gradients
  • Adam: adaptive moment estimation. It combines the ideas of Momentum and RMSprop; its two key hyperparameters are beta_1 and beta_2. Momentum speeds up convergence and helps the model avoid getting stuck at saddle points
  • AdamW: Adam with decoupled weight-decay regularization
  • Implement Softmax — Possible issue: numerical stability with respect to overflow and underflow; the standard fix is to subtract the maximum logit before exponentiating (see the stable-softmax sketch after this list)
  • What is the difference between cross-entropy and binary cross-entropy?
  • What is vanishing and exploding gradient?
  • What is ReLU activation? Are there any pros and cons of using it?
  • Pros: avoids and rectifies the vanishing-gradient problem; less computationally expensive than tanh or sigmoid. Cons: typically used only within hidden layers; some gradients can be fragile during training and can die, resulting in dead neurons that stop responding to variations in the error. “Dying ReLU” problem: if a neuron's pre-activation is consistently negative, its gradient is consistently zero and the neuron effectively becomes inactive. The remedy is Leaky ReLU or PReLU, an attempt to fix “dying ReLU” by having a small negative slope when x is negative
  • What is squashing in the sigmoid activation?
  • Problem with sigmoid — it saturates quickly and kills gradients
  • What are some of the regularization techniques we can use while training a NN? — Data augmentation, dropout, early stopping, L1/L2 regularization, batch normalization, skip connections (which allow gradients to bypass certain layers, helping to address vanishing gradients)
  • Can I use L2 regularization to deal with the vanishing-gradient problem while training a NN? — It would, if anything, worsen the problem by shrinking weights towards 0. Its main purpose is to control the complexity of the model and encourage smaller weights, rather than directly addressing vanishing gradients
  • How can we create an ensemble of NNs in a practically feasible manner? — Using dropout implicitly trains an ensemble of sub-networks
  • What if all weights are initialized with the same value in a NN? — Every hidden unit then receives the same signal and produces the same activation, and hence gets the same derivative of the cost function during backpropagation. The symmetry is never broken, so there is no useful learning, i.e. under-fitting
  • How does weight initialization impact the training process of a neural network? What are some of the techniques? — It can affect how quickly the network converges to a solution, the quality of the solution reached, and whether the network gets stuck in local minima. Xavier/Glorot initialization: weights drawn from a zero-mean Gaussian whose variance is scaled by the number of input and output units; He initialization: variance scaled by the number of input units (suited to ReLU); uniform variants of both also exist (an initialization sketch follows this list)
  • What if I initialize all weights to 0? — Initializing all weights to zero can cause issues since all neurons will learn the same features and gradients will be equal. This results in symmetric weights and no learning capacity.
  • Describe the purpose and benefits of batch normalization in neural networks. — It normalizes the activations of a hidden layer, so that the weights of the next layer can be updated faster and more stably. It tackles the problem of internal covariate shift
  • What is internal covariate shift — It refers to the change in the distribution of activations within a neural network layer during training due to the changing input distribution.
  • Why do we not use Batch Normalization in RNN or Transformer architectures? — Batch Normalization was originally designed for feedforward neural networks and convolutional neural networks (CNNs). It aims to normalize the activations of each layer by adjusting them to have zero mean and unit variance. RNNs process sequential data over time steps, and traditional BN treats each time step as a separate batch, which can break the temporal dependencies in the data. Techniques like “Layer Normalization” or “Sequence Batch Normalization” have been proposed to address this issue
  • What will be the last layer activation function and loss-function for multi-label and multi-class classification — Multi-class: Softmax with cross entropy, Multi-label: Sigmoid with Binary cross entropy
  • What is Catastrophic forgetting — The process of learning new information interferes with or overrides the previously learned information. Common in Sequence learning networks. As the network adapts to the new tasks, the weights and representations that were useful for the previous tasks might be altered → degradation in performance
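
For the adaptive optimizers discussed above, here is a rough, framework-free sketch of one Adam-style update; the helper name and signature are mine, not a drop-in replacement for any library optimizer:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update: momentum (beta1) plus RMSprop-style scaling (beta2). t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # exponentially weighted average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # exponentially weighted average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

AdamW would additionally subtract a decoupled weight-decay term, lr * weight_decay * theta, from the parameters at each step.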
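
Here is the numerically stable softmax mentioned in the “Implement Softmax” item: subtracting the row maximum leaves the output unchanged (softmax is shift-invariant) but prevents overflow in the exponential.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)   # shift so the largest logit is 0
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

print(softmax(np.array([1000.0, 1001.0, 1002.0])))        # no overflow; output sums to 1
```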
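
A minimal sketch of the two weight-initialization schemes mentioned above, in their Gaussian form (uniform variants differ only in the sampling distribution); function names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: zero mean, variance 2 / (fan_in + fan_out); suits tanh/sigmoid layers."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He: zero mean, variance 2 / fan_in; suits ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```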

Transformer

  • What challenges do RNNs face when processing long sequences, and how does the Long Short-Term Memory (LSTM) architecture aim to address these issues?
  • Transformers use attention. Attention helps to draw connections between any parts of the sequence, so long-range dependencies are not a problem anymore.
  • What is attention?
  • What is the purpose of attention mechanisms in neural networks?
  • Why do we need positional embeddings in the Transformer? — Since the Transformer model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension as the embeddings, so that the two can be summed (a sinusoidal positional-encoding sketch follows this list)
  • What is input to a transformer model for MT?
  • How can I tokenize the text before training the Transformer model? — Areas of concern: 1) unit selection, 2) vocabulary size, 3) handling languages with complex morphology, 4) out-of-vocabulary (OOV) words, 5) word-level or whitespace tokenization, 6) character level: text is tokenized at the character level, where each character becomes a token (used in ELMo via 1-D convolutions), 7) sub-word level: to handle rare or OOV words/character sequences, 8) Byte-Pair Encoding: BPE works by iteratively merging the most frequent pairs of characters or subword units in the text. It starts by treating each character as a token and then merges pairs of tokens based on their frequency until a specified vocabulary size is reached. It is a greedy algorithm. 9) WordPiece: WordPiece also breaks down words into subword units, but it performs the splitting by maximizing a predefined criterion (such as likelihood) while maintaining a specified vocabulary size, rather than merging on raw frequency
  • What different kind of attention mechanisms are utilized in Transformer?
  • Self-attention in Encoder: All of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder
  • Cross-attention in Decoder: The queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence
  • Why do we use Multi-headed self attention?
  • What is masked multi-headed attention?
  • What are QKV in attention(self and cross)?
  • How to create the mask for masked multi-head attention? — Build an upper-triangular matrix whose entries above the diagonal are set to negative infinity and add it to the attention scores before the softmax, i.e. mask all values in the softmax input that correspond to illegal connections so they receive zero weight (a mask sketch follows this list)
  • What are the different decoding strategies for a language-modelling task? — 1) Greedy decoding; 2) Beam search: maintains a fixed number (the beam width) of partial sequences in parallel; at each step the model generates multiple candidate tokens and keeps the top candidates based on their probabilities; 3) Random sampling; 4) Top-K: limits the sampling to the top-k most likely tokens at each step, allowing controlled randomness while keeping the generated sequences coherent and contextually relevant; 5) Top-p/Nucleus sampling: keeps the smallest set of most likely tokens whose cumulative probability exceeds p at each step, which helps maintain diversity while controlling the number of options considered (a nucleus-sampling sketch follows this list); 6) Temperature scaling: using the softmax temperature; higher temperature values introduce more randomness, while lower values make the selection more deterministic
  • Why do we use Layer Norm in Transformer and not Batch Norm?
  • In batch normalization, we use the batch statistics: the mean and standard deviation corresponding to the current mini-batch. However, when the batch size is small, the sample mean and sample standard deviation are not representative enough of the actual distribution and the network cannot learn anything meaningful.
  • As batch normalization depends on batch statistics for normalization, it is less suited for sequence models. This is because, in sequence models, we may have sequences of potentially different lengths and smaller batch sizes corresponding to longer sequences.
  • How BERT and GPT are different?
  • What tasks BERT was pre-trained upon?
  • Why do we consider bi-directional context in BERT and not in GPT?
  • What is the KV cache? — The KV cache in Transformers is a memory-for-compute technique used during autoregressive decoding: the key (K) and value (V) vectors computed for tokens that have already been processed are cached and reused at every subsequent decoding step instead of being recomputed, leading to faster and more efficient generation of tokens
  • What is the time complexity of attention mechanism?
  • Can we reduce this time complexity?
  • Why is there a shift-right operation when giving inputs to the decoder? — The input to the decoder is the target sequence shifted one position to the right, with the token that signals the beginning of the sentence prepended. The logic is that the output at each position should receive only the previous tokens (and not the token at the same position, of course), which is achieved with this shift together with the self-attention mask
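
To make the positional-encoding item concrete, here is a sketch of the sinusoidal encodings described in the original Transformer paper; the function name is mine, and it assumes an even d_model:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2), assumes even d_model
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # same dimension as the token embeddings, so the two can simply be summed
```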
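
The causal mask for masked attention can be sketched as follows: positions above the diagonal (future tokens) are set to negative infinity before the softmax so they receive zero attention weight. This is a single-head, batch-free illustration only, with function names of my own choosing:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal (illegal connections to future tokens), 0 elsewhere."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask; Q, K, V have shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax; masked entries become 0
    return weights @ V
```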
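
And a small sketch of top-p (nucleus) sampling from the decoding-strategies item: keep the smallest set of tokens whose cumulative probability exceeds p, renormalize, and sample. The helper below is hypothetical and operates on a plain probability vector rather than real model logits:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=np.random.default_rng(0)):
    """Nucleus sampling over a 1-D probability vector; returns a sampled token id."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                      # token ids sorted by descending probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                             # smallest set with cumulative probability > p
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return rng.choice(nucleus, p=nucleus_probs)

print(top_p_sample([0.5, 0.3, 0.15, 0.05], p=0.9))       # samples only from the top three tokens
```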

Diffusion Models

  • Most of these questions, along with results and responses, are discussed here in this article; the questions are noted down ahead.
  • Are there any assumptions about the noise function added at each step of diffusion?
  • Why is the noise added at each step Gaussian? Can I use any other probability distribution for the noise?
  • Why do diffusion models perform better than GANs? Is there any intuition?
  • What is the role of skip-connections in U-net architecture and how are they different from skip-connections in ResNet?
  • Why does the variance schedule have small values (i.e. β_t << 1), or why is the step size small? (A forward-diffusion step is sketched after this list.)
  • Why do we need to do so many reverse steps to obtain a clear image? What benefit does it provide?
  • Let's compare different generative architectures such as VAEs, GANs, and diffusion models. When should one prefer one architecture over the other?
  • How is Stable Diffusion different from the DALL·E 2 or Imagen model architectures? What added advantages does it provide?
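
While the answers to these questions are left to the linked article, a minimal sketch of the standard DDPM-style forward (noising) step may help ground them. Each step adds a small amount of Gaussian noise, q(x_t | x_{t-1}) = N(sqrt(1 - β_t)·x_{t-1}, β_t·I); the β_t schedule and toy input below are illustrative choices, not anything prescribed by a particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion_step(x_prev, beta_t):
    """x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

x = np.ones((8, 8))                              # toy "image"
for beta_t in np.linspace(1e-4, 0.02, 1000):     # a linear beta schedule with beta_t << 1
    x = forward_diffusion_step(x, beta_t)
# after many small Gaussian steps, x is approximately standard normal noise
```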

Keep Learning, Keep Hustling



Nikhil Verma

Knowledge shared is knowledge squared | My Portfolio https://lihkinverma.github.io/portfolio/ | My blogs are living documents, updated as I receive comments