Retrieval 02: Swing and Surprise

There are two very important types of relationships between products: substitute and complementary. Substitute products are interchangeable with each other, while complementary products are often purchased in addition. For example, when a user is looking at a T-shirt, substitute products are other T-shirts, while complementary products might be shorts, hoodies, or jackets. Swing is designed for substitute relationships and Surprise is designed for complementary relationships....
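As a rough illustration of the substitute case, here is a minimal sketch of the Swing item-similarity score, assuming the commonly cited formulation $$s(i,j)=\sum_{u}\sum_{v \ne u} \frac{1}{\alpha + |I_u \cap I_v|}$$ over pairs of users who interacted with both items. The smoothing term `alpha` and the toy data are illustrative, not the post's actual implementation:

```python
from itertools import combinations

# Toy user -> interacted-item sets (illustrative data only).
user_items = {
    "u1": {"tshirt_a", "tshirt_b", "shorts"},
    "u2": {"tshirt_a", "tshirt_b", "hoodie"},
    "u3": {"tshirt_a", "jacket"},
}

def swing_score(item_i, item_j, alpha=1.0):
    """Sum over pairs of users who interacted with both items,
    down-weighted by how many items each user pair shares."""
    co_users = [u for u, items in user_items.items()
                if item_i in items and item_j in items]
    score = 0.0
    for u, v in combinations(co_users, 2):
        overlap = len(user_items[u] & user_items[v])
        score += 1.0 / (alpha + overlap)
    return score

print(swing_score("tshirt_a", "tshirt_b"))  # higher => stronger substitute signal
```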

May 25, 2022 · 2 min · Weipeng Zhang

Retrieval 01: Collaborative Filtering

UserCF uses the ratings of the target item from the top-N most similar users to predict the current user's rating. There are two main steps: 1. Calculate similarities between users. There are three common ways of calculating user similarities: a. Jaccard Similarity: $$J_{u,v} = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}$$ where $N(u)$ denotes the set of items user $u$ has interacted with. b. Cosine Similarity: $$cos(u,v) = \frac{u\cdot v}{|u|\cdot|v|}$$ where $u$ and $v$ denote the rating vectors of the two users respectively....
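As a concrete illustration, here is a minimal sketch of both similarity measures on toy data (the item sets and rating vectors are hypothetical):

```python
import numpy as np

# Toy data: item sets for Jaccard, rating vectors for cosine.
items_u = {"i1", "i2", "i3"}  # N(u): items user u interacted with
items_v = {"i2", "i3", "i4"}  # N(v): items user v interacted with

def jaccard(n_u, n_v):
    """J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|."""
    return len(n_u & n_v) / len(n_u | n_v)

def cosine(r_u, r_v):
    """cos(u, v) = (u · v) / (|u| * |v|)."""
    return r_u @ r_v / (np.linalg.norm(r_u) * np.linalg.norm(r_v))

ratings_u = np.array([5.0, 3.0, 4.0, 0.0])  # 0 = unrated
ratings_v = np.array([0.0, 4.0, 5.0, 2.0])

print(jaccard(items_u, items_v))     # 0.5
print(cosine(ratings_u, ratings_v))  # ~0.67
```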

May 15, 2022 · 4 min · Weipeng Zhang

Cracking Machine Learning Interviews - 01 Feature Engineering

1. Why do we need to apply normalization to numerical features? There are two common ways of normalization: a. Min-Max Scaling $$X_{norm} = \frac{X-X_{min}}{X_{max}-X_{min}}$$ This method scales the data into the range [0,1]. b. Z-Score Normalization $$z = \frac{x-\mu}{\sigma}, \quad \sigma=\sqrt{\frac{\sum(x_i-\mu)^2}{N}}$$ This method rescales the data so that its mean and standard deviation become 0 and 1 respectively. When the scales of features differ, the gradients of their weights can differ greatly, leading to a different 'learning pace' for each weight, which shows up as a zig-zag path on the gradient descent plot....
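A minimal sketch of both scalers with NumPy (toy data; in practice, libraries such as scikit-learn provide equivalent `MinMaxScaler` and `StandardScaler` transformers):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # toy feature column

# Min-Max scaling: maps the data into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)         # [0.   0.33 0.67 1.  ]
print(x_zscore.mean())  # ~0.0
print(x_zscore.std())   # ~1.0
```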

May 15, 2022 · 3 min · Weipeng Zhang

Cracking Machine Learning Interviews - 02 Model Evaluation

The limitations of metrics. 1. What is the limitation of accuracy? When positive and negative samples are imbalanced, accuracy may not correctly reflect the performance of the model. 2. How do we balance precision and recall? We can use the Precision-Recall curve, the ROC curve, or the F1 score to evaluate the performance of a ranking/classification model. $$F1 = \frac{2\times precision \times recall}{precision + recall}$$ 3. The RMSE of the model is high even though 95% of the samples in the test set are predicted with small error; why is that?...
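A minimal sketch of precision, recall, and F1 computed from raw counts (toy labels; `sklearn.metrics` offers equivalent `precision_score`, `recall_score`, and `f1_score` functions):

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # toy model predictions

# Confusion-matrix counts for the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.75 0.75 0.75
```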

May 15, 2022 · 2 min · Weipeng Zhang