Institute of Product Leadership

Top 30 Data Science Interview Qs & As

Q1. In Python, how is memory managed?

A: In Python, objects and data structures live in a private heap that the programmer does not access directly; the interpreter manages it. The memory manager allocates heap space for Python objects, while the built-in garbage collector reclaims memory that is no longer in use so that it can be reused. In CPython, the reference implementation, most objects are freed by reference counting, and a cyclic garbage collector handles groups of objects that refer to each other.
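
As a minimal sketch of this behaviour, assuming CPython: an object with no remaining references is freed immediately, while two objects that refer to each other are only reclaimed when the garbage collector runs. The `Node` class and `demo` function are illustrative names, not part of any library.

```python
import gc
import weakref


class Node:
    """Tiny object that can point at another Node."""

    def __init__(self):
        self.partner = None


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # pause automatic cycle collection so the steps are visible
    try:
        # Case 1: no cycle. Dropping the only reference frees the object at once.
        single = Node()
        single_ref = weakref.ref(single)
        del single
        freed_without_gc = single_ref() is None

        # Case 2: a reference cycle keeps both objects alive after `del`.
        a, b = Node(), Node()
        a.partner, b.partner = b, a
        cycle_ref = weakref.ref(a)
        del a, b
        alive_before_gc = cycle_ref() is not None

        # The garbage collector finds the unreachable cycle and frees it.
        gc.collect()
        freed_by_gc = cycle_ref() is None
    finally:
        if was_enabled:
            gc.enable()
    return freed_without_gc, alive_before_gc, freed_by_gc
```

Other Python implementations may manage memory differently, so the first two results in particular should be read as CPython behaviour.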

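Q2. Is it possible to design a linear regression algorithm using a neural network?

A: Yes. A neural network with a single linear output unit, no hidden layers and no non-linear activation, trained to minimise squared error, computes exactly the linear regression model. More generally, neural networks are universal approximators, so a linear function is well within what they can represent.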
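
A minimal sketch of the idea, assuming the simplest possible network: one neuron with a linear activation, trained by gradient descent on mean squared error. The function name, learning rate and toy data are illustrative choices.

```python
def train_linear_neuron(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b with a single linear neuron and gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = train_linear_neuron(xs, ys)
```

On this noise-free data the learned weight and bias approach the slope and intercept that ordinary least squares would give.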

Q3. What is the bias-variance trade-off?

A: Bias is the error introduced in a model by oversimplifying the machine learning algorithm. A high-bias model may avoid fitting noise, but it may also miss legitimate information, which leads to underfitting. Variance is the error introduced by excessive complexity in the algorithm. A high-variance model picks up noise from the training data set and therefore performs badly on new, unseen test data; this high sensitivity leads to overfitting. In general, increasing bias decreases variance, and increasing variance decreases bias. Configuring the algorithm, for example by choosing its complexity, is therefore a matter of balancing the two, which is the bias-variance trade-off.

Q4. If the Pearson correlation between V1 and V2 is zero, is it right to conclude that V1 and V2 have no relationship?

A: No. A correlation coefficient of zero between V1 and V2 only means that they have no linear relationship. The Pearson correlation coefficient can be zero even when the two variables are strongly related in a non-linear way, for example when V2 is the square of V1 and V1 is distributed symmetrically around zero.
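
A small illustration, using hand-picked toy data: V2 is completely determined by V1 (it is its square), yet the Pearson coefficient is zero. The `pearson` helper is written out here for clarity rather than taken from a library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


v1 = [-2, -1, 0, 1, 2]
v2_quadratic = [x * x for x in v1]    # fully dependent on v1, but not linearly
v2_linear = [2 * x + 1 for x in v1]   # linear in v1, for comparison

r_quadratic = pearson(v1, v2_quadratic)
r_linear = pearson(v1, v2_linear)
```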

Q5. What will happen when you fit a degree 4 polynomial in linear regression?

A: A degree 4 polynomial is more flexible than a straight line, so it can follow the training data more closely. If the underlying relationship is simpler or the data set is small, there is a high chance that the degree 4 polynomial will overfit.
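
A rough sketch of the risk, on made-up data: five points that follow y = x with alternating noise of ±0.5. With five distinct x values, the least-squares degree 4 polynomial passes through every point, so it can be evaluated here with Lagrange interpolation. The helper names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x_new):
    """Evaluate the unique degree len(xs)-1 polynomial through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total


xs = [0, 1, 2, 3, 4]
ys = [0.5, 0.5, 2.5, 2.5, 4.5]        # y = x plus alternating +/-0.5 noise

slope, intercept = fit_line(xs, ys)
line_at_5 = slope * 5 + intercept      # straight-line prediction at x = 5
poly_at_5 = interpolate(xs, ys, 5)     # degree 4 prediction at x = 5
```

The degree 4 curve has zero training error, but its prediction at x = 5 lands far from the underlying trend value of 5, while the straight line stays close to it.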

Q6. What are the assumptions required for linear regression?

A: The assumptions required for linear regression are linearity, normality, independence and homoscedasticity. Linearity means there is a linear relationship between the dependent variable and the regressors. Normality and independence mean that the errors (residuals) are normally distributed and independent of each other. Homoscedasticity means that the variance of the errors around the regression line is the same for all values of the predictor variables. In addition, there should be minimal multicollinearity between the explanatory variables.

Q7. Explain what precision and recall are. How do they relate to the ROC curve?

A: Recall is the number of true positives divided by the sum of the true positives and the false negatives. It is the percentage of actual positives that the model correctly identifies as positive. It is the same as sensitivity.

Precision is the number of true positives divided by the sum of the true positives and the false positives. It describes how good a model is at predicting the positive class; that is, it tells us what percentage of positive predictions were correct. Precision is also referred to as the positive predictive value.

The ROC curve gives us the false alarm rate versus the hit rate. It is obtained by plotting the false positive rate on the x-axis and the true positive rate on the y-axis for a number of candidate threshold values between 0.0 and 1.0. Recall is therefore the y-axis of the ROC curve, while precision does not appear on it. The x-axis is one minus the specificity, so the curve shows the relationship between recall and specificity, where specificity is the number of true negatives divided by the sum of the true negatives and the false positives.
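
The threshold sweep can be sketched as follows, on a small made-up set of labels and scores. The function name and the convention that a score at or above the threshold counts as a positive prediction are assumptions of this example.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    tpr = tp / (tp + fn)   # recall, sensitivity, hit rate
    fpr = fp / (fp + tn)   # false alarm rate, equal to 1 - specificity
    return fpr, tpr


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]

# One ROC point per candidate threshold, from strictest to most lenient.
curve = [roc_point(y_true, scores, t) for t in (1.0, 0.75, 0.5, 0.25, 0.0)]
```

Lowering the threshold moves the point from the bottom-left corner (nothing flagged) to the top-right corner (everything flagged).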

Q8. Explain the difference between L1 and L2 regularization methods.

A: Regularization methods are used to reduce complexity in a data model by imposing a penalty on the loss function. They help to prevent overfitting. 

The L1 regularization technique, called Lasso Regression, adds the sum of the absolute values of the coefficients as a penalty term to the loss function. It can shrink some coefficients all the way to zero, which eliminates the corresponding features and yields sparse models.

On the other hand, L2 regularization, or Ridge Regression, adds a penalty equal to the sum of the squares of the coefficients. It does not yield sparse models: coefficients are shrunk towards zero (by the same factor when the features are orthonormal) but none are eliminated.

The key difference between L1 and L2 is hence the penalty term.
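
To make the contrast concrete, here is a simplified sketch that treats one coefficient at a time, which is exact only under the assumption of orthonormal features. In that setting the penalised coefficient has a closed form for both penalties. The function names and the sample coefficients are illustrative.

```python
def lasso_shrink(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + lam * abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0   # small coefficients are set exactly to zero


def ridge_shrink(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + 0.5 * lam * w**2: proportional shrinkage."""
    return z / (1.0 + lam)


unpenalised = [3.0, 0.5, -0.2]   # hypothetical least-squares coefficients
lam = 1.0

lasso = [lasso_shrink(z, lam) for z in unpenalised]
ridge = [ridge_shrink(z, lam) for z in unpenalised]
```

With this penalty, Lasso keeps only the largest coefficient and zeroes the other two, while Ridge halves all three and eliminates none. Raising `lam` further eventually zeroes every Lasso coefficient, whereas the Ridge values only keep getting smaller.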

Q9. Is it better to spend five days developing a 90-percent accurate solution or 10 days for 100-percent accuracy?

A: In most cases it is better to spend five days developing a 90% accurate solution. A quick first version delivers value early and can be optimized soon after. The answer does depend on the context, however: whether error is tolerable, what the quality assurance requirements are, and so on.

Q10. What is a residual?

A: A residual is the error of a model on a single observation, defined as the difference between the observed value and the predicted value. Residuals are used as a diagnostic measure to assess the quality of a model. Residuals that are small in magnitude and show no systematic pattern are desirable.

Q11. What will happen when you apply a very large penalty?

A: With a very large penalty, Lasso drives some coefficients, and eventually all of them, to exactly zero. With Ridge, the coefficients come close to zero but generally do not become exactly zero.

Q12. What is a confusion matrix?

A: For a binary classifier, the confusion matrix is a 2×2 table that counts the four possible outcomes: true positives, false positives, true negatives and false negatives. Measures such as error rate, accuracy, specificity, sensitivity (recall) and precision are derived from it.
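
A short sketch of how the four counts and the derived measures fit together, using made-up labels and predictions. The function names and the dictionary layout are choices made for this example.

```python
def confusion_matrix(y_true, y_pred):
    """Count the four outcomes of a binary classifier (labels are 0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["tn" if actual == 0 else "fn"] += 1
    return counts


def derived_measures(c):
    """Common performance measures computed from the four counts."""
    total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "error_rate": (c["fp"] + c["fn"]) / total,
        "sensitivity": c["tp"] / (c["tp"] + c["fn"]),   # also called recall
        "specificity": c["tn"] / (c["tn"] + c["fp"]),
        "precision": c["tp"] / (c["tp"] + c["fp"]),
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_matrix(y_true, y_pred)
measures = derived_measures(counts)
```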

Q13. What do you understand by the term normal distribution?

A: The normal distribution, or Gaussian distribution, is a probability distribution that is symmetric about the mean. On a graph, data distributed around a central value without any skew to the left or right takes the form of a bell-shaped curve.

Q14. What is correlation and covariance in statistics?

A: Both correlation and covariance measure the relationship and dependency between two variables. Covariance indicates the direction of the linear relationship, while correlation measures both its strength and its direction. When the correlation coefficient is positive, an increase in one variable tends to be accompanied by an increase in the other. When it is negative, the two variables tend to change in opposite directions. When it is zero, there is no linear relationship between them. Correlation is a normalized form of covariance (the covariance divided by the product of the two standard deviations), so it is not affected by scale and always lies between -1 and 1. The two terms therefore cannot be used interchangeably.
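
The effect of scale can be seen in a small sketch with made-up height and weight data: converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged. The helper functions use the sample (n - 1) definition and are written out for illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance normalised by the two standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50.0, 58.0, 64.0, 75.0]
heights_cm = [h * 100 for h in heights_m]   # same data, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)
```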

Q15. What is p-value?

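A: The p-value is a number between 0 and 1. In a hypothesis test, it is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true, so it helps determine the strength of the results.

A low p-value (≤ 0.05, by the usual convention) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it; this is not the same as proving the null hypothesis true.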
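
As a worked example, consider testing whether a coin is fair (the null hypothesis) after a number of flips. The sketch below computes an exact two-sided p-value from the binomial distribution by doubling the tail probability, which is valid here because the fair-coin distribution is symmetric. The function name and the flip counts are illustrative.

```python
from math import comb


def two_sided_binomial_p(heads, flips):
    """Exact two-sided p-value for the null hypothesis of a fair coin."""
    # Distance from the expected count is the same for `heads` and
    # `flips - heads`, so work with the larger of the two.
    k = max(heads, flips - heads)
    upper_tail = sum(comb(flips, i) for i in range(k, flips + 1)) / 2 ** flips
    return min(1.0, 2 * upper_tail)


p_nine_of_ten = two_sided_binomial_p(9, 10)
p_seven_of_ten = two_sided_binomial_p(7, 10)
```

Nine heads in ten flips gives a p-value below 0.05, so the fair-coin hypothesis would be rejected at that level; seven heads in ten flips does not.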

Q16. Why is re-sampling done?

A: Resampling is done to estimate the accuracy of sample statistics (for example with the bootstrap) and to validate models on random subsets of the data (for example with cross-validation).
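
One common form of resampling is the bootstrap. The sketch below, on a made-up sample, estimates the standard error of the mean by repeatedly drawing from the sample with replacement. The function name, the number of resamples and the fixed seed are assumptions of this example.

```python
import random
import statistics


def bootstrap_standard_error(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in sample]
        means.append(statistics.mean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


sample = [12.0, 15.0, 9.0, 11.0, 14.0, 10.0, 13.0, 16.0, 8.0, 12.0]
standard_error = bootstrap_standard_error(sample)
```

For the mean, the result should land close to the textbook value of the standard deviation divided by the square root of the sample size; the bootstrap becomes more useful for statistics that have no such formula.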

Q17. What are the differences between overfitting and underfitting?

A: In overfitting, a model fits the random error or noise in the training data as well as the underlying relationship.

In underfitting, a model cannot capture the underlying trend or relationship in the data.

Q18. How to combat overfitting and underfitting?

A: Overfitting and underfitting can be detected and kept in check by resampling the data to estimate model accuracy (k-fold cross-validation) and by holding out a validation data set to evaluate the model.
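
A minimal sketch of how k-fold cross-validation partitions the data: each sample is used for validation exactly once and for training in the other folds. The function name is illustrative, and the folds here are contiguous; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, 3))
```

A model is then fitted on each training part and scored on the matching validation part; a large gap between training and validation scores points to overfitting, while poor scores on both point to underfitting.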

Q19. What is regularisation? Why is it useful?

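A: Regularisation is the process of adding a penalty term to the loss function, namely a constant multiple of the L1 or L2 norm of the weight vector. It is useful because it discourages large weights and so helps to prevent overfitting.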

Q20. What is the Law of Large Numbers?

A: The Law of Large Numbers states that when you repeat an experiment independently a large number of times and average the results, what you obtain should be close to the expected value. In other words, the sample mean converges to the expected value, and the sample variance and sample standard deviation likewise converge to the quantities they estimate as the number of trials grows.
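
A quick simulation illustrates the convergence, assuming a fair six-sided die, whose expected value is 3.5. The function name and the fixed seed are choices made for this sketch.

```python
import random


def mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls


# The average of a handful of rolls can be far from 3.5;
# the average of many rolls should be close to it.
small_sample_mean = mean_of_die_rolls(10)
large_sample_mean = mean_of_die_rolls(100_000)
```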

Q21. What are confounding variables?

A: A confounding variable is a variable that influences both the dependent variable and the independent variable, which can create a misleading association between the two.

Q22. Explain how an ROC curve works.

A: The ROC curve is a graphical representation of true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity (true positive rate) and the false positive rate.

Q23. What are Eigenvectors and Eigenvalues?

A: Eigenvectors are the directions that a particular linear transformation leaves unchanged: along them it acts only by stretching, compressing or flipping.

The eigenvalue is the strength of the transformation in the direction of its eigenvector, that is, the factor by which the eigenvector is stretched or compressed (a negative eigenvalue flips it).
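
For a 2×2 matrix this can be checked by hand. The sketch below, for an example matrix chosen for illustration, computes the eigenvalues from the trace and determinant and confirms that applying the matrix to an eigenvector only rescales it. The helper names are illustrative, and the closed form assumes the eigenvalues are real.

```python
import math


def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix, assuming they are real."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = math.sqrt(trace ** 2 - 4 * det)
    return (trace + disc) / 2, (trace - disc) / 2


def apply(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


matrix = [[2.0, 1.0], [1.0, 2.0]]
largest, smallest = eigenvalues_2x2(matrix)

# Known eigenvectors of this matrix: the two diagonals of the plane.
stretched = apply(matrix, [1.0, 1.0])     # scaled by the larger eigenvalue
unchanged = apply(matrix, [1.0, -1.0])    # scaled by the smaller eigenvalue
```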

Q24. What is the cost function?

A: Cost function measures the performance of a machine learning model for given data. It quantifies the error between predicted values and expected values and presents it in the form of a single real number. It is used to compute the error of the output layer during backpropagation. 
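
Mean squared error is one common choice of cost function for regression; the sketch below uses it as an example with made-up values. Other tasks use other cost functions, such as cross-entropy for classification.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between expected and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


expected = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the cost is their average.
cost = mean_squared_error(expected, predicted)
```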

Q25. What, in your opinion, is the reason for the popularity of Deep Learning in recent times?

A: The tremendous increase in the amount of data generated by various sources has prompted interest in Deep Learning. Rapid growth in the hardware resources needed to run models, and the resulting increase in computation power, is also behind its widespread popularity. As a subset of machine learning, Deep Learning has improved speech-to-text conversion, machine translation, object identification, disease detection systems, robotics and so on, which in turn makes more and more data available.

Q26. What is the difference between Regression and classification ML techniques?

A: Both regression and classification machine learning techniques come under supervised machine learning. They use known data sets or training data sets to make predictions. The main difference between them is that the output variable in regression is numerical (or continuous), while that for classification is categorical (or discrete).

Q27. What is Ensemble Learning?

A: Ensemble Learning combines a diverse set of learners (individual models) to improve the stability, predictive power and function approximation of the overall model. Bagging, boosting and stacking are common ways of doing this.
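
A toy sketch of one simple ensemble method, majority voting. The three sets of predictions are hand-made stand-ins for hypothetical models that each make a different mistake; no real models are trained here.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine several models' predictions by taking the most common label."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 1, 1, 0, 1]   # hypothetical model, wrong on the last sample
model_b = [1, 0, 0, 1, 0, 0]   # hypothetical model, wrong on the third sample
model_c = [0, 0, 1, 1, 0, 0]   # hypothetical model, wrong on the first sample

ensemble = majority_vote([model_a, model_b, model_c])
```

Because the individual errors fall on different samples, the vote corrects all of them. The benefit shrinks when the models tend to make the same mistakes, which is why diversity among the learners matters.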

Q28. Why do we generally use the Softmax non-linearity function as the last operation in a network?

A: Softmax takes in a vector of real numbers and returns a probability distribution. It not only maps each output to the [0,1] range but also scales the outputs so that their total sum is 1. The output of Softmax can therefore be read as class probabilities, which can be displayed or used as an input for other systems. Hence it is used as the final layer in neural networks for classification.
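
A minimal sketch of the computation. Subtracting the largest input before exponentiating is a common numerical-stability step that does not change the result; the example inputs are arbitrary.

```python
import math


def softmax(logits):
    """Map a list of real numbers to probabilities that sum to 1."""
    shift = max(logits)   # guards against overflow in exp for large inputs
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```

Larger inputs receive larger probabilities, and the ordering of the inputs is preserved in the output.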

Q29. What are hyperparameters?

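A: A hyperparameter is a parameter whose value is set before the learning process begins, rather than learned from the data. Examples include the learning rate, the regularisation strength and the number of trees in a random forest.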

Q30. What cross-validation technique would you use on a time series data set?

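A: Observations in a time series are ordered in time and depend on earlier observations, so randomly shuffled folds would let the model train on the future. I would use a technique like forward chaining, where we fit the model on past data and then evaluate it on the data that immediately follows.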
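
Forward chaining can be sketched as an expanding training window, assuming the samples are already sorted by time. The function name and its parameters are illustrative.

```python
def forward_chaining_splits(n_samples, min_train_size, horizon=1):
    """Yield (train_indices, test_indices) where the test data always follows the training data."""
    split_point = min_train_size
    while split_point + horizon <= n_samples:
        train = list(range(split_point))                         # everything so far
        test = list(range(split_point, split_point + horizon))   # what comes next
        yield train, test
        split_point += horizon   # grow the training window and move forward


splits = list(forward_chaining_splits(n_samples=5, min_train_size=2))
```

Each split trains only on observations that come before the ones it is tested on, so no information from the future leaks into the model.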