Introduction
Recommender systems have become an essential component of modern applications, helping users navigate vast amounts of content by suggesting items they might find interesting or useful. Whether you're browsing products on an e-commerce site, looking for a movie to watch, or discovering new music, recommender systems are working behind the scenes to personalize your experience. For developers, implementing effective recommendation engines can significantly improve user engagement, satisfaction, and business metrics.
This guide provides a comprehensive overview of recommender systems from a developer's perspective. We'll explore the mathematical foundations, implementation approaches, and practical code examples to help you build effective recommendation engines for your applications. By the end of this guide, you'll understand the core concepts and have working code examples to adapt for your specific needs.
Understanding Recommender Systems
Recommender systems address the "information overload" problem by filtering vast options into a manageable subset most relevant to each user. These systems analyze patterns in user behavior, preferences, and item characteristics to predict what users might be interested in. Effective recommender systems balance accuracy (recommending items users will like), diversity (suggesting a variety of items), novelty (introducing users to new content), and serendipity (surprising users with unexpected but valuable recommendations).
Recommender systems generally fall into three categories: content-based filtering, collaborative filtering, and hybrid approaches. Each has distinct strengths and weaknesses, which we'll explore in detail.
Content-Based Filtering
Content-based filtering recommends items similar to those a user has previously liked or interacted with, based on item attributes. This approach constructs user profiles based on item features and recommends items with similar features.
How Content-Based Filtering Works
The core idea behind content-based filtering is to match users with items that have similar attributes to those they've liked in the past. The process typically involves three main steps:
First, we create feature vectors for items using their attributes. For a movie recommender, features might include genres, directors, actors, release year, and plot keywords. Text attributes are often processed using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to create numerical vectors.
Second, we build user profiles based on the items users have interacted with. A simple approach is to average the feature vectors of items a user has liked, potentially weighted by the user's ratings.
Finally, we calculate the similarity between the user profile and each candidate item's feature vector, using metrics like cosine similarity or Euclidean distance. Items with the highest similarity scores become recommendations.
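As a minimal sketch of steps two and three (assuming item feature vectors are already available as rows of a NumPy array; the function and variable names are illustrative), the user profile can be a rating-weighted average of liked-item vectors, scored against candidates with cosine similarity:

```python
import numpy as np

def build_user_profile(item_vectors, liked_indices, ratings):
    """Rating-weighted average of the item vectors a user has liked."""
    weights = np.asarray(ratings, dtype=float)
    return np.average(item_vectors[liked_indices], axis=0, weights=weights)

def rank_candidates(profile, item_vectors, candidate_indices):
    """Score candidate items by cosine similarity to the user profile."""
    scores = {}
    for idx in candidate_indices:
        vec = item_vectors[idx]
        denom = np.linalg.norm(profile) * np.linalg.norm(vec)
        scores[idx] = float(profile @ vec / denom) if denom else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```
The full implementation below takes a closely related route: rather than forming an explicit profile vector, it weights item-to-item similarities by the user's ratings.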
Implementation Example
Let's implement a content-based recommender for movies using Python:
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample movie data
movies_data = {
'movieId': [1, 2, 3, 4, 5, 6, 7, 8],
'title': [
'The Shawshank Redemption',
'The Godfather',
'The Dark Knight',
'Pulp Fiction',
'Fight Club',
'Inception',
'The Matrix',
'Interstellar'
],
'genres': [
'Drama|Crime',
'Crime|Drama',
'Action|Crime|Drama',
'Crime|Drama|Thriller',
'Drama|Thriller',
'Action|Adventure|Sci-Fi',
'Action|Sci-Fi',
'Adventure|Drama|Sci-Fi'
],
'description': [
'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.',
'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.',
'When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.',
'The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.',
'An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.',
'A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.',
'A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.',
'A team of explorers travel through a wormhole in space in an attempt to ensure humanity\'s survival.'
]
}
# Create DataFrame
movies_df = pd.DataFrame(movies_data)
# User ratings (user_id, movie_id, rating)
user_ratings = {
'userId': [1, 1, 1, 1],
'movieId': [1, 3, 5, 7],
'rating': [5, 4, 4, 5]
}
ratings_df = pd.DataFrame(user_ratings)
def content_based_recommendations(user_id, movies_df, ratings_df, num_recommendations=3):
"""
Generate content-based recommendations for a user
Parameters:
user_id (int): The user ID
movies_df (DataFrame): Movie metadata
ratings_df (DataFrame): User ratings
num_recommendations (int): Number of recommendations to return
Returns:
DataFrame: Recommended movies
"""
# Get user's rated movies
user_rated = ratings_df[ratings_df['userId'] == user_id]
user_rated = user_rated.merge(movies_df, on='movieId')
# Get movies the user hasn't rated
user_unrated = movies_df[~movies_df['movieId'].isin(user_rated['movieId'])]
if len(user_rated) == 0:
print(f"User {user_id} has not rated any movies.")
return None
# Create feature vectors using TF-IDF on combined text features
    # Use regex=False so '|' is treated literally (older pandas would read it as a regex)
    movies_df['text_features'] = movies_df['genres'].str.replace('|', ' ', regex=False) + ' ' + movies_df['description']
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['text_features'])
# Compute similarity between all movies
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# Create a mapping of movie indices
indices = pd.Series(movies_df.index, index=movies_df['movieId']).drop_duplicates()
# Calculate weighted average of similarity scores based on user ratings
sim_scores = np.zeros(len(movies_df))
for _, row in user_rated.iterrows():
movie_idx = indices[row['movieId']]
# Normalize rating to 0-1 scale
weight = (row['rating'] - 1) / 4.0 # Assuming ratings are 1-5
sim_scores += cosine_sim[movie_idx] * weight
# Get similarity scores for unrated movies
unrated_indices = [indices[movie_id] for movie_id in user_unrated['movieId']]
unrated_scores = [(idx, sim_scores[idx]) for idx in unrated_indices]
# Sort by similarity score
unrated_scores = sorted(unrated_scores, key=lambda x: x[1], reverse=True)
# Get top recommendations
movie_indices = [i[0] for i in unrated_scores[:num_recommendations]]
return movies_df.iloc[movie_indices]
# Get recommendations for user 1
recommendations = content_based_recommendations(1, movies_df, ratings_df)
print("User 1 has rated:")
print(ratings_df[ratings_df['userId'] == 1].merge(movies_df[['movieId', 'title']], on='movieId'))
print("\nRecommendations:")
print(recommendations[['title', 'genres']])
```
This implementation combines movie genres and descriptions to create feature vectors using TF-IDF, which captures the importance of words in documents. We then calculate cosine similarity between movies to find those with similar content. The recommendation function weighs these similarities by the user's ratings, creating personalized recommendations based on content features.
Advantages and Limitations
Content-based filtering has several advantages. It doesn't require data from other users, which makes it suitable for new items or niche content (no cold-start problem for items). The recommendations are transparent and can be explained to users ("recommended because you liked X"). The model can also capture specific user interests that might not be popular among other users.
However, there are limitations to this approach. Content-based filtering tends to over-specialize, recommending items too similar to what users already know. This reduces diversity and serendipity in recommendations. The quality of recommendations depends heavily on the available item features and how well they capture what users actually care about. The approach also suffers from the new user problem, as it needs sufficient user history to build an accurate profile.
Collaborative Filtering
Collaborative filtering makes recommendations based on past user behavior without needing content features. It leverages the collective intelligence of all users in the system, assuming that users who agreed in the past will agree in the future.
Types of Collaborative Filtering
There are two main types of collaborative filtering:
Memory-based approaches calculate similarities between users (user-based) or items (item-based) directly from the rating matrix. User-based CF finds similar users and recommends items they liked. Item-based CF identifies items similar to those the user has liked.
Model-based approaches learn latent factors that explain observed ratings. Matrix factorization is a popular technique that decomposes the user-item matrix into lower-dimensional matrices representing latent factors.
Memory-Based Collaborative Filtering
Let's implement both user-based and item-based collaborative filtering:
```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# Sample user-item rating matrix (users in rows, items in columns)
ratings_data = {
'userId': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
'itemId': [1, 2, 3, 1, 2, 4, 1, 3, 5, 2, 3, 4, 1, 2, 5],
'rating': [5, 4, 3, 4, 5, 4, 2, 5, 4, 3, 3, 4, 1, 4, 5]
}
ratings_df = pd.DataFrame(ratings_data)
# Create user-item rating matrix
matrix = ratings_df.pivot(index='userId', columns='itemId', values='rating').fillna(0)
print("User-Item Rating Matrix:")
print(matrix)
def user_based_cf(matrix, user_id, num_similar_users=2, num_recommendations=2):
"""
Generate user-based collaborative filtering recommendations
Parameters:
matrix (DataFrame): User-item rating matrix
user_id (int): Target user ID
num_similar_users (int): Number of similar users to consider
num_recommendations (int): Number of recommendations to return
Returns:
Series: Recommended items with predicted ratings
"""
# Calculate user similarity
user_similarity = cosine_similarity(matrix)
user_similarity_df = pd.DataFrame(user_similarity,
index=matrix.index,
columns=matrix.index)
# Get most similar users
similar_users = user_similarity_df[user_id].sort_values(ascending=False)[1:num_similar_users+1]
# Items that the target user has not rated
user_unrated_items = matrix.loc[user_id][matrix.loc[user_id] == 0].index
# Get recommendations from similar users
recommendations = {}
    for item in user_unrated_items:
        weighted_ratings = []
        weights = []
        for similar_user_id, similarity in similar_users.items():
            rating = matrix.loc[similar_user_id, item]
            if rating > 0:  # If the similar user has rated this item
                weighted_ratings.append(rating * similarity)  # Weight by similarity
                weights.append(similarity)
        if weighted_ratings:  # If any similar users rated this item
            # Normalize by the total similarity of the users who actually rated the item
            recommendations[item] = sum(weighted_ratings) / sum(weights)
# Return top recommendations
return pd.Series(recommendations).sort_values(ascending=False)[:num_recommendations]
def item_based_cf(matrix, user_id, num_recommendations=2):
"""
Generate item-based collaborative filtering recommendations
Parameters:
matrix (DataFrame): User-item rating matrix
user_id (int): Target user ID
num_recommendations (int): Number of recommendations to return
Returns:
Series: Recommended items with predicted ratings
"""
# Calculate item similarity
item_similarity = cosine_similarity(matrix.T)
item_similarity_df = pd.DataFrame(item_similarity,
index=matrix.columns,
columns=matrix.columns)
# Get items that the user has rated
user_rated_items = matrix.loc[user_id][matrix.loc[user_id] > 0]
# Items that the user has not rated
user_unrated_items = matrix.loc[user_id][matrix.loc[user_id] == 0].index
# Get recommendations
recommendations = {}
for unrated_item in user_unrated_items:
weighted_sum = 0
similarity_sum = 0
for rated_item, rating in user_rated_items.items():
similarity = item_similarity_df.loc[rated_item, unrated_item]
weighted_sum += rating * similarity
similarity_sum += abs(similarity)
if similarity_sum > 0: # Avoid division by zero
recommendations[unrated_item] = weighted_sum / similarity_sum
# Return top recommendations
return pd.Series(recommendations).sort_values(ascending=False)[:num_recommendations]
# Get recommendations using different methods
target_user = 1
print(f"\nUser-Based CF Recommendations for User {target_user}:")
print(user_based_cf(matrix, target_user))
print(f"\nItem-Based CF Recommendations for User {target_user}:")
print(item_based_cf(matrix, target_user))
```
User-based collaborative filtering identifies users with similar preferences to the target user and recommends items they've rated highly. The cosine similarity between user rating vectors determines user similarity. This approach is intuitive and easy to explain, but it struggles when the rating matrix is very sparse and doesn't scale well to large user bases, because user similarities must be recomputed as new ratings arrive.
Item-based collaborative filtering calculates similarities between items based on their rating patterns across users. This method tends to be more stable than user-based approaches because item similarities change less frequently than user preferences.
Matrix Factorization
Matrix factorization is a model-based approach that decomposes the user-item matrix into lower-dimensional matrices to discover latent factors:
```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt
def matrix_factorization_cf(matrix, user_id, num_factors=2, num_recommendations=2):
"""
Generate recommendations using matrix factorization
Parameters:
matrix (DataFrame): User-item rating matrix
user_id (int): Target user ID
num_factors (int): Number of latent factors
num_recommendations (int): Number of recommendations to return
Returns:
Series: Recommended items with predicted ratings
"""
# Get user index (0-based)
user_idx = list(matrix.index).index(user_id)
# Convert to numpy array
ratings_array = matrix.to_numpy()
# Get averages for centering
user_ratings_mean = np.mean(ratings_array, axis=1)
ratings_centered = ratings_array - user_ratings_mean.reshape(-1, 1)
# Perform SVD
U, sigma, Vt = svds(ratings_centered, k=num_factors)
# Convert to diagonal matrix
sigma = np.diag(sigma)
# Predict ratings for all items
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
# Convert to DataFrame
pred_df = pd.DataFrame(all_user_predicted_ratings,
index=matrix.index,
columns=matrix.columns)
# Get the user's predictions
user_predictions = pred_df.loc[user_id]
# Get items that the user has not rated
user_unrated_items = matrix.loc[user_id][matrix.loc[user_id] == 0].index
# Filter for unrated items and sort by predicted rating
recommendations = user_predictions[user_unrated_items].sort_values(ascending=False)
return recommendations[:num_recommendations]
# Use the same matrix from the previous example
print(f"\nMatrix Factorization Recommendations for User {target_user}:")
print(matrix_factorization_cf(matrix, target_user))
# Visualize latent factors
plt.figure(figsize=(12, 6))
# Get latent factors
ratings_array = matrix.to_numpy()
user_ratings_mean = np.mean(ratings_array, axis=1)
ratings_centered = ratings_array - user_ratings_mean.reshape(-1, 1)
U, sigma, Vt = svds(ratings_centered, k=2)
# Plot users in latent space
plt.subplot(1, 2, 1)
plt.scatter(U[:, 0], U[:, 1])
for i, user in enumerate(matrix.index):
plt.annotate(f"User {user}", (U[i, 0], U[i, 1]))
plt.title('Users in Latent Space')
plt.xlabel('Latent Factor 1')
plt.ylabel('Latent Factor 2')
# Plot items in latent space
plt.subplot(1, 2, 2)
plt.scatter(Vt[0, :], Vt[1, :])
for i, item in enumerate(matrix.columns):
plt.annotate(f"Item {item}", (Vt[0, i], Vt[1, i]))
plt.title('Items in Latent Space')
plt.xlabel('Latent Factor 1')
plt.ylabel('Latent Factor 2')
plt.tight_layout()
plt.show()
```
Matrix factorization uses Singular Value Decomposition (SVD) to decompose the user-item matrix into lower-dimensional matrices. The user matrix (U) represents users in a latent factor space, the item matrix (Vt) represents items in the same space, and the diagonal matrix (sigma) represents the strength of each factor. Multiplying these matrices back together approximately reconstructs the original rating matrix, and the reconstructed values for unrated entries serve as predictions.
The visualization shows users and items plotted in the two-dimensional latent space, providing insight into how the matrix factorization model groups similar entities together. Users or items that are close in this space are predicted to have similar preferences or characteristics.
Mathematical Foundation
The core idea of matrix factorization is to approximate the user-item rating matrix R as the product of two lower-dimensional matrices:
R ≈ P × Q^T
Where:
- R is the m×n user-item rating matrix (m users, n items)
- P is the m×k user-factor matrix
- Q is the n×k item-factor matrix
- k is the number of latent factors (much smaller than m or n)
Each row in P represents a user's association with k latent factors, and each row in Q represents an item's association with the same factors. The predicted rating for user u on item i is calculated as:
r̂_ui = p_u · q_i = ∑(f=1 to k) p_uf × q_if
where p_u is the user vector and q_i is the item vector.
The matrices P and Q are learned by minimizing the regularized squared error on the observed ratings:
min_{P,Q} ∑_{(u,i)∈K} (r_ui - p_u · q_i)^2 + λ(||p_u||^2 + ||q_i||^2)
where K is the set of observed user-item interactions, and λ is the regularization parameter that prevents overfitting.
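The SVD-based code above obtains a similar factorization from the zero-filled matrix; the regularized objective itself is typically minimized with stochastic gradient descent over only the observed ratings. A minimal sketch on hypothetical toy data (all hyperparameters here are arbitrary):

```python
import numpy as np

def sgd_matrix_factorization(ratings, num_users, num_items, k=2,
                             lr=0.01, reg=0.1, epochs=200, seed=0):
    """Learn P (users x k) and Q (items x k) by SGD over observed ratings only.

    ratings: iterable of (user_index, item_index, rating) triples.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((num_users, k))
    Q = 0.1 * rng.standard_normal((num_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            p_u = P[u].copy()
            err = r - p_u @ Q[i]
            # Gradient step on the regularized squared error
            P[u] += lr * (err * Q[i] - reg * p_u)
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Toy usage: three users, three items, five observed ratings
observed = [(0, 0, 5), (0, 1, 3), (1, 1, 4), (2, 0, 1), (2, 2, 5)]
P, Q = sgd_matrix_factorization(observed, num_users=3, num_items=3)
print("Predicted rating for user 0, item 2:", P[0] @ Q[2])
```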
Advantages and Limitations of Collaborative Filtering
Collaborative filtering has several advantages. It doesn't require item features, making it applicable to any domain. It can discover complex patterns and relationships that might not be captured by content features. It often provides more diverse recommendations than content-based filtering.
However, there are limitations. The cold-start problem affects new users and items with no interaction history. Sparsity issues arise when most users interact with only a tiny fraction of items. Popularity bias can lead to popular items being recommended too frequently. Additionally, collaborative filtering often struggles to provide good explanations for recommendations.
Deep Learning for Recommender Systems
Recent advances in deep learning have introduced neural network-based recommender systems that can capture complex patterns and nonlinear relationships. Neural Collaborative Filtering (NCF) is one such approach that combines the flexibility of deep learning with collaborative filtering principles.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
# Sample user-item interactions
interactions_data = {
'userId': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
'itemId': [1, 2, 3, 1, 2, 4, 1, 3, 5, 2, 3, 4, 1, 2, 5],
'interaction': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] # 1 indicates positive interaction
}
interactions_df = pd.DataFrame(interactions_data)
# Add negative interactions (items the user didn't interact with)
negative_samples = []
all_items = interactions_df['itemId'].unique()
all_users = interactions_df['userId'].unique()
for user in all_users:
user_items = interactions_df[interactions_df['userId'] == user]['itemId'].values
negative_items = [item for item in all_items if item not in user_items]
# Sample the same number of negative items as positive ones
if negative_items:
sampled_neg = np.random.choice(negative_items,
size=min(len(user_items), len(negative_items)),
replace=False)
for item in sampled_neg:
negative_samples.append({'userId': user, 'itemId': item, 'interaction': 0})
# Add negative samples to the dataframe
interactions_df = pd.concat([interactions_df, pd.DataFrame(negative_samples)], ignore_index=True)
# Create user and item mapping
user_ids = interactions_df['userId'].unique()
item_ids = interactions_df['itemId'].unique()
user_map = {user: i for i, user in enumerate(user_ids)}
item_map = {item: i for i, item in enumerate(item_ids)}
# Map IDs to sequential integers
interactions_df['user_idx'] = interactions_df['userId'].map(user_map)
interactions_df['item_idx'] = interactions_df['itemId'].map(item_map)
# Prepare data for the model
X = interactions_df[['user_idx', 'item_idx']].values
y = interactions_df['interaction'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Neural Collaborative Filtering model
def create_ncf_model(num_users, num_items, embedding_size=10):
"""
Create a Neural Collaborative Filtering model
Parameters:
num_users (int): Number of users
num_items (int): Number of items
embedding_size (int): Size of embedding vectors
Returns:
Model: Keras model
"""
# Input layers
user_input = Input(shape=(1,), name='user_input')
item_input = Input(shape=(1,), name='item_input')
# Embedding layers
user_embedding = Embedding(num_users, embedding_size, name='user_embedding')(user_input)
item_embedding = Embedding(num_items, embedding_size, name='item_embedding')(item_input)
# Flatten embeddings
user_vector = Flatten()(user_embedding)
item_vector = Flatten()(item_embedding)
# Concatenate embeddings
concat = Concatenate()([user_vector, item_vector])
# Dense layers
dense1 = Dense(32, activation='relu')(concat)
dense2 = Dense(16, activation='relu')(dense1)
output = Dense(1, activation='sigmoid')(dense2)
# Create model
model = Model(inputs=[user_input, item_input], outputs=output)
model.compile(
optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
return model
# Create and train model
num_users = len(user_ids)
num_items = len(item_ids)
model = create_ncf_model(num_users, num_items)
# Train the model
history = model.fit(
[X_train[:, 0], X_train[:, 1]], y_train,
batch_size=64,
epochs=20,
validation_data=([X_test[:, 0], X_test[:, 1]], y_test),
verbose=2
)
# Plot training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
def get_recommendations(model, user_id, n_recommendations=3):
"""
Generate recommendations for a user using the trained NCF model
Parameters:
model (Model): Trained NCF model
user_id (int): Target user ID
n_recommendations (int): Number of recommendations to return
Returns:
list: Recommended item IDs
"""
    # Get items the user has positively interacted with (exclude the sampled negatives)
    user_items = interactions_df[(interactions_df['userId'] == user_id) &
                                 (interactions_df['interaction'] == 1)]['itemId'].values
    candidate_items = [item for item in item_ids if item not in user_items]
if not candidate_items:
return []
# Map user ID to index
user_idx = user_map[user_id]
# Create prediction inputs
user_input = np.array([user_idx] * len(candidate_items))
item_input = np.array([item_map[item] for item in candidate_items])
# Make predictions
predictions = model.predict([user_input, item_input], verbose=0).flatten()
# Get top N recommendations
top_indices = np.argsort(predictions)[-n_recommendations:][::-1]
recommendations = [candidate_items[idx] for idx in top_indices]
return recommendations
# Get recommendations for a user
target_user = 1
recommendations = get_recommendations(model, target_user)
print(f"\nNeural CF Recommendations for User {target_user}:")
print(recommendations)
```
The Neural Collaborative Filtering model embeds users and items in a latent space using embedding layers, then uses neural networks to model their interactions. Unlike matrix factorization, NCF can capture non-linear relationships between users and items.
The model architecture consists of embedding layers for users and items, followed by dense layers that learn to predict the probability of interaction. We train the model using binary cross-entropy loss, as the task is framed as a binary classification problem (will the user interact with the item or not?).
Other Deep Learning Approaches
Beyond Neural Collaborative Filtering, several other deep learning architectures have been applied to recommendation:
Autoencoders can reconstruct user-item interaction patterns, with the latent space serving as a compressed representation of user preferences. These models are particularly useful for top-N recommendations.
Recurrent Neural Networks (RNNs) and their variants like LSTM and GRU capture sequential patterns in user behavior, making them suitable for session-based recommendations where the order of interactions matters.
Graph Neural Networks (GNNs) model users, items, and their interactions as nodes and edges in a graph, capturing complex relationships and dependencies beyond direct user-item interactions.
Attention mechanisms, popularized by transformer architectures, have been incorporated into recommender systems to weigh the importance of different factors in making recommendations, providing both improved performance and better explainability.
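To make the first of these concrete, here is a minimal autoencoder-style sketch using the same Keras setup as the NCF example. It trains on dense user rating vectors and treats the reconstruction of unrated entries as predicted scores; a faithful AutoRec-style model would additionally mask unobserved entries in the loss, which this toy version omits.

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

def create_autoencoder(num_items, latent_dim=8):
    """Encode a user's full rating vector, then reconstruct it."""
    inputs = Input(shape=(num_items,))
    encoded = Dense(latent_dim, activation='relu')(inputs)
    decoded = Dense(num_items, activation='linear')(encoded)
    model = Model(inputs, decoded)
    model.compile(optimizer='adam', loss='mse')
    return model

# Toy usage: 5 users x 6 items, 0 means unrated
ratings = np.random.randint(0, 6, size=(5, 6)).astype('float32')
autoencoder = create_autoencoder(num_items=6)
autoencoder.fit(ratings, ratings, epochs=50, verbose=0)
reconstructed = autoencoder.predict(ratings, verbose=0)
# Recommend the unrated item with the highest reconstructed score for user 0
unrated = np.where(ratings[0] == 0)[0]
if unrated.size:
    print("Top item index for user 0:", unrated[np.argmax(reconstructed[0][unrated])])
```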
Hybrid Recommender Systems
Hybrid recommender systems combine multiple approaches to overcome the limitations of individual methods. They can integrate content-based, collaborative filtering, and contextual information to provide more accurate and diverse recommendations.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
# Sample data
# Movies with features (content data)
movies_data = {
'movieId': [1, 2, 3, 4, 5],
'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
'genres': ['Action|Adventure', 'Comedy|Romance', 'Drama|Thriller', 'Sci-Fi|Action', 'Comedy|Drama'],
'year': [2010, 2015, 2012, 2019, 2017]
}
movies_df = pd.DataFrame(movies_data)
# User-item ratings (collaborative data)
ratings_data = {
'userId': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5],
'movieId': [1, 3, 5, 2, 3, 4, 1, 2, 3, 5, 1, 4],
'rating': [5, 3, 4, 4, 3, 5, 2, 5, 4, 3, 1, 4]
}
ratings_df = pd.DataFrame(ratings_data)
# Create feature vectors for content-based filtering
genres_set = set()
for genres in movies_df['genres']:
genres_set.update(genres.split('|'))
# Create one-hot encoding for genres
for genre in genres_set:
movies_df[f'genre_{genre}'] = movies_df['genres'].apply(lambda x: 1 if genre in x else 0)
# Normalize year
movies_df['year_norm'] = (movies_df['year'] - movies_df['year'].min()) / (movies_df['year'].max() - movies_df['year'].min())
# Select features for content-based similarity
content_features = movies_df[[col for col in movies_df.columns if col.startswith('genre_') or col == 'year_norm']]
class HybridRecommender:
def __init__(self, ratings_df, movies_df, content_features,
cf_weight=0.7, content_weight=0.3, num_factors=2):
"""
Initialize hybrid recommender system
Parameters:
ratings_df (DataFrame): User-item ratings
movies_df (DataFrame): Movie metadata
content_features (DataFrame): Feature vectors for content-based filtering
cf_weight (float): Weight for collaborative filtering
content_weight (float): Weight for content-based filtering
num_factors (int): Number of latent factors for matrix factorization
"""
self.ratings_df = ratings_df
self.movies_df = movies_df
self.content_features = content_features
self.cf_weight = cf_weight
self.content_weight = content_weight
self.num_factors = num_factors
# Create user-item matrix for collaborative filtering
self.matrix = ratings_df.pivot(index='userId', columns='movieId', values='rating').fillna(0)
# Calculate content similarity matrix
self.content_sim = cosine_similarity(content_features)
self.content_sim_df = pd.DataFrame(
self.content_sim,
index=movies_df['movieId'],
columns=movies_df['movieId']
)
# Train matrix factorization model
self._train_mf_model()
def _train_mf_model(self):
"""Train matrix factorization model"""
ratings_array = self.matrix.to_numpy()
self.user_ratings_mean = np.mean(ratings_array, axis=1)
ratings_centered = ratings_array - self.user_ratings_mean.reshape(-1, 1)
# SVD
self.U, self.sigma, self.Vt = svds(ratings_centered, k=self.num_factors)
self.sigma = np.diag(self.sigma)
# Pre-calculate complete reconstruction for all users
self.predictions = np.dot(np.dot(self.U, self.sigma), self.Vt) + self.user_ratings_mean.reshape(-1, 1)
self.pred_df = pd.DataFrame(
self.predictions,
index=self.matrix.index,
columns=self.matrix.columns
)
def get_content_recommendations(self, user_id, num_recommendations=3):
"""Get content-based recommendations"""
# Get user's rated movies
user_rated = self.ratings_df[self.ratings_df['userId'] == user_id]
if len(user_rated) == 0:
return pd.Series()
# Get weighted average of content similarity based on user ratings
scores = {}
for movieId in self.movies_df['movieId']:
# Skip if the user has already rated this movie
if movieId in user_rated['movieId'].values:
continue
weighted_sum = 0
weight_sum = 0
for _, row in user_rated.iterrows():
rated_movie = row['movieId']
rating = row['rating']
# Normalize rating to 0-1 scale
rating_norm = (rating - 1) / 4.0 # Assuming ratings are 1-5
# Get similarity between this movie and the rated movie
sim = self.content_sim_df.loc[rated_movie, movieId]
weighted_sum += sim * rating_norm
weight_sum += abs(sim)
if weight_sum > 0:
scores[movieId] = weighted_sum / weight_sum
return pd.Series(scores).sort_values(ascending=False)[:num_recommendations]
def get_cf_recommendations(self, user_id, num_recommendations=3):
"""Get collaborative filtering recommendations using matrix factorization"""
# Check if user exists in training data
if user_id not in self.matrix.index:
return pd.Series()
# Get the user's predictions
user_predictions = self.pred_df.loc[user_id]
# Get items that the user has not rated
user_unrated_items = self.matrix.loc[user_id][self.matrix.loc[user_id] == 0].index
# Filter for unrated items
recommendations = user_predictions[user_unrated_items]
return recommendations.sort_values(ascending=False)[:num_recommendations]
def get_hybrid_recommendations(self, user_id, num_recommendations=3):
"""Get hybrid recommendations combining CF and content-based"""
# Get recommendations from both methods
cf_recs = self.get_cf_recommendations(user_id, num_recommendations=None)
content_recs = self.get_content_recommendations(user_id, num_recommendations=None)
# Normalize scores to 0-1 scale
if not cf_recs.empty:
cf_min, cf_max = cf_recs.min(), cf_recs.max()
if cf_max > cf_min:
cf_recs = (cf_recs - cf_min) / (cf_max - cf_min)
if not content_recs.empty:
content_min, content_max = content_recs.min(), content_recs.max()
if content_max > content_min:
content_recs = (content_recs - content_min) / (content_max - content_min)
# Combine recommendations
all_items = set(cf_recs.index) | set(content_recs.index)
hybrid_scores = {}
for item in all_items:
cf_score = cf_recs.get(item, 0)
content_score = content_recs.get(item, 0)
# Weighted combination
hybrid_scores[item] = self.cf_weight * cf_score + self.content_weight * content_score
# Convert to Series and sort
hybrid_recs = pd.Series(hybrid_scores).sort_values(ascending=False)[:num_recommendations]
return hybrid_recs
# Create and use the hybrid recommender
hybrid_rec = HybridRecommender(ratings_df, movies_df, content_features)
# Get recommendations for a user
target_user = 1
cf_recs = hybrid_rec.get_cf_recommendations(target_user)
content_recs = hybrid_rec.get_content_recommendations(target_user)
hybrid_recs = hybrid_rec.get_hybrid_recommendations(target_user)
print(f"\nUser {target_user} has rated:")
user_ratings = ratings_df[ratings_df['userId'] == target_user].merge(
movies_df[['movieId', 'title']], on='movieId')
print(user_ratings[['title', 'rating']])
print(f"\nCollaborative Filtering Recommendations:")
cf_rec_titles = hybrid_rec.movies_df[hybrid_rec.movies_df['movieId'].isin(cf_recs.index)]
for movie_id, score in cf_recs.items():
title = cf_rec_titles[cf_rec_titles['movieId'] == movie_id]['title'].values[0]
print(f"{title}: {score:.3f}")
print(f"\nContent-Based Recommendations:")
content_rec_titles = hybrid_rec.movies_df[hybrid_rec.movies_df['movieId'].isin(content_recs.index)]
for movie_id, score in content_recs.items():
title = content_rec_titles[content_rec_titles['movieId'] == movie_id]['title'].values[0]
print(f"{title}: {score:.3f}")
print(f"\nHybrid Recommendations:")
hybrid_rec_titles = hybrid_rec.movies_df[hybrid_rec.movies_df['movieId'].isin(hybrid_recs.index)]
for movie_id, score in hybrid_recs.items():
title = hybrid_rec_titles[hybrid_rec_titles['movieId'] == movie_id]['title'].values[0]
print(f"{title}: {score:.3f}")
```
The hybrid recommender combines collaborative filtering (using matrix factorization) and content-based filtering (using genre and year features). The weights for each component can be adjusted based on the specific domain and data characteristics.
The implementation normalizes scores from each method to a common scale before combining them, so the weighted sum is meaningful. Comparing the three printed lists shows how much each component contributed to the final score and which method drove each recommendation.
Hybrid systems often outperform individual approaches by leveraging their complementary strengths: collaborative filtering provides personalized recommendations based on similar users, while content-based filtering helps address the cold-start problem for new items and enables more diverse recommendations.
Types of Hybrid Approaches
There are several ways to combine recommendation techniques:
Weighted hybridization assigns weights to different recommenders and combines their scores, as we've implemented above. The weights can be fixed or adaptive based on recommendation quality.
Switching hybridization selects among different recommenders based on certain criteria, such as the amount of user history available or the confidence in the recommendations.
Cascade hybridization applies recommenders sequentially, with each recommender refining the candidates from the previous one. This focuses on precision by progressively filtering recommendations.
Feature augmentation uses the output of one recommender as input features for another, enriching the available information for the final recommendation.
Meta-level hybridization trains a model on the output of another recommender, essentially using the entire model as a feature transformation.
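As a concrete illustration of the switching strategy, the HybridRecommender above can be wrapped so that users with thin rating histories fall back to content-based scores; the history threshold below is an arbitrary assumption.

```python
def switching_recommendations(recommender, user_id, min_ratings=3, num_recommendations=3):
    """Switch between CF and content-based scores based on how much history the user has."""
    history = recommender.ratings_df[recommender.ratings_df['userId'] == user_id]
    if len(history) >= min_ratings:
        return recommender.get_cf_recommendations(user_id, num_recommendations)
    return recommender.get_content_recommendations(user_id, num_recommendations)

print(switching_recommendations(hybrid_rec, target_user))
```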
Evaluation of Recommender Systems
Evaluating recommender systems is challenging because the true user preferences for unrated items are unknown. Common evaluation approaches include offline metrics, A/B testing, and user studies.
Offline Evaluation Metrics
Offline evaluation uses historical data to measure how well a recommender system would have performed:
```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, precision_score, recall_score
# Sample rating data
ratings_data = {
'userId': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
'itemId': [1, 2, 3, 4, 1, 2, 5, 1, 3, 5, 2, 3, 4, 1, 2, 5],
'rating': [5, 4, 3, 4, 4, 5, 5, 2, 4, 3, 3, 3, 4, 1, 4, 5]
}
ratings_df = pd.DataFrame(ratings_data)
# Split data into train and test sets
train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)
# Create user-item matrices
train_matrix = train_df.pivot(index='userId', columns='itemId', values='rating').fillna(0)
test_matrix = test_df.pivot(index='userId', columns='itemId', values='rating').fillna(0)
def evaluate_rating_prediction(pred_matrix, true_matrix, threshold=3.5):
"""
Evaluate rating prediction performance
Parameters:
pred_matrix (DataFrame): Predicted ratings
true_matrix (DataFrame): True ratings from test set
threshold (float): Rating threshold for binary classification
Returns:
dict: Evaluation metrics
"""
    # Align matrices (in case some users or items are missing in predictions);
    # use sorted lists because pandas .loc does not accept set indexers
    common_users = sorted(set(pred_matrix.index) & set(true_matrix.index))
    common_items = sorted(set(pred_matrix.columns) & set(true_matrix.columns))
if not common_users or not common_items:
return {"error": "No common users or items between prediction and test set"}
# Filter for common users and items
pred = pred_matrix.loc[common_users, common_items]
true = true_matrix.loc[common_users, common_items]
# Calculate RMSE only for items that are in the test set
rated_mask = true > 0
if rated_mask.sum().sum() == 0:
rmse = np.nan
else:
# Extract ratings
y_pred = pred.values[rated_mask.values]
y_true = true.values[rated_mask.values]
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# For precision and recall, convert to binary relevance
y_pred_bin = (pred.values >= threshold).astype(int)
y_true_bin = (true.values >= threshold).astype(int)
# Only consider where true ratings exist (> 0)
mask = true.values > 0
y_pred_bin = y_pred_bin[mask]
y_true_bin = y_true_bin[mask]
# Calculate precision and recall
precision = precision_score(y_true_bin, y_pred_bin, zero_division=0)
recall = recall_score(y_true_bin, y_pred_bin, zero_division=0)
# Calculate F1 score
if precision + recall > 0:
f1 = 2 * (precision * recall) / (precision + recall)
else:
f1 = 0
return {
"RMSE": rmse,
"Precision": precision,
"Recall": recall,
"F1": f1
}
def evaluate_ranking(pred_scores, true_ratings, k=3, threshold=3.5):
"""
Evaluate ranking performance
Parameters:
pred_scores (dict): {user_id: {item_id: score}} predicted scores
true_ratings (DataFrame): User-item ratings
k (int): Number of items to consider for metrics
threshold (float): Rating threshold for relevance
Returns:
dict: Evaluation metrics
"""
metrics = {
'precision@k': [],
'recall@k': [],
'ndcg@k': [],
'map@k': []
}
for user_id, user_scores in pred_scores.items():
# Get true ratings for this user
user_true = true_ratings[true_ratings['userId'] == user_id]
if len(user_true) == 0:
continue
# Get relevant items (items with rating above threshold)
relevant_items = set(user_true[user_true['rating'] >= threshold]['itemId'])
if len(relevant_items) == 0:
continue
# Get top-k recommended items
recommended_items = [item for item, _ in sorted(user_scores.items(),
key=lambda x: x[1],
reverse=True)[:k]]
# Precision@k: proportion of recommended items that are relevant
precision_k = len(set(recommended_items) & relevant_items) / len(recommended_items)
metrics['precision@k'].append(precision_k)
# Recall@k: proportion of relevant items that are recommended
recall_k = len(set(recommended_items) & relevant_items) / len(relevant_items)
metrics['recall@k'].append(recall_k)
# NDCG@k: normalized discounted cumulative gain
# DCG = sum(rel_i / log2(i+1)) for i in 1 to k
# IDCG = DCG of ideal ranking
# NDCG = DCG / IDCG
dcg = 0
for i, item in enumerate(recommended_items):
rel = 1 if item in relevant_items else 0
dcg += rel / np.log2(i + 2) # i+2 because i is 0-indexed
# Ideal DCG
idcg = sum(1 / np.log2(i + 2) for i in range(min(k, len(relevant_items))))
if idcg > 0:
ndcg = dcg / idcg
else:
ndcg = 0
metrics['ndcg@k'].append(ndcg)
# MAP@k: mean average precision
# AP = sum(P@i * rel_i) / min(k, |relevant_items|)
average_precision = 0
relevant_count = 0
for i, item in enumerate(recommended_items):
if item in relevant_items:
relevant_count += 1
precision_at_i = relevant_count / (i + 1)
average_precision += precision_at_i
if relevant_count > 0:
average_precision /= min(k, len(relevant_items))
else:
average_precision = 0
metrics['map@k'].append(average_precision)
# Calculate mean of metrics across users
return {metric: np.mean(values) if values else 0 for metric, values in metrics.items()}
# Example usage with matrix factorization
# Train model on training data
ratings_array = train_matrix.to_numpy()
user_ratings_mean = np.mean(ratings_array, axis=1)
ratings_centered = ratings_array - user_ratings_mean.reshape(-1, 1)
U, sigma, Vt = svds(ratings_centered, k=2)
sigma = np.diag(sigma)
predictions = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
pred_matrix = pd.DataFrame(predictions, index=train_matrix.index, columns=train_matrix.columns)
# Evaluate rating prediction
rating_metrics = evaluate_rating_prediction(pred_matrix, test_matrix)
print("Rating Prediction Metrics:")
for metric, value in rating_metrics.items():
print(f"{metric}: {value:.4f}")
# Prepare data for ranking evaluation
pred_scores = {}
for user_id in train_matrix.index:
user_pred = pred_matrix.loc[user_id]
pred_scores[user_id] = {item_id: score for item_id, score in user_pred.items() if item_id in test_matrix.columns}
# Evaluate ranking
ranking_metrics = evaluate_ranking(pred_scores, test_df, k=2)
print("\nRanking Metrics:")
for metric, value in ranking_metrics.items():
print(f"{metric}: {value:.4f}")
The evaluation code implements two sets of metrics:
Rating prediction metrics evaluate how accurately a system predicts user ratings. Root Mean Square Error (RMSE) measures the average magnitude of prediction errors. Precision, recall, and F1-score treat rating prediction as a binary classification (like/dislike) using a threshold.
Ranking metrics evaluate the quality of recommended item lists. Precision@k measures what fraction of recommended items are relevant. Recall@k measures what fraction of relevant items are recommended. Normalized Discounted Cumulative Gain (NDCG@k) accounts for the position of relevant items in the recommendation list. Mean Average Precision (MAP@k) averages precision values calculated after each relevant item is retrieved.
Beyond Accuracy: Diversity and Serendipity
While accuracy metrics are important, they don't tell the whole story. Other dimensions of recommendation quality include:
Diversity measures how different the recommended items are from each other. Recommending very similar items might be accurate but not useful to users exploring options.
Novelty measures how familiar users are with the recommended items. Recommending only well-known items might be safe but adds little value.
Serendipity measures how surprising yet relevant the recommendations are. Unexpected discoveries often lead to high user satisfaction.
Coverage measures what fraction of items in the catalog can be recommended by the system. Low coverage might indicate biases in the algorithm.
Evaluating these dimensions often requires user studies or A/B testing in production environments, as they are difficult to measure using historical data alone.
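That said, intra-list diversity and catalog coverage can at least be approximated offline. A minimal sketch, assuming an item-item similarity DataFrame like the content_sim_df built earlier and a list of per-user recommendation lists:

```python
import numpy as np

def intra_list_diversity(recommended_items, item_sim_df):
    """1 minus the average pairwise similarity of the recommended items."""
    if len(recommended_items) < 2:
        return 0.0
    sims = [item_sim_df.loc[a, b]
            for i, a in enumerate(recommended_items)
            for b in recommended_items[i + 1:]]
    return 1.0 - float(np.mean(sims))

def catalog_coverage(all_recommendation_lists, catalog):
    """Fraction of the catalog that appears in at least one user's recommendations."""
    recommended = {item for recs in all_recommendation_lists for item in recs}
    return len(recommended) / len(catalog)
```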
Practical Considerations for Deployment
When moving beyond prototypes to production systems, several practical considerations come into play:
Scalability and Performance
Production recommender systems often deal with millions of users and items, requiring efficient algorithms and infrastructure:
Feature reduction techniques like Principal Component Analysis (PCA) can reduce dimensionality while preserving important information.
Approximate nearest neighbor algorithms like locality-sensitive hashing (LSH) can accelerate similarity search in large-scale systems.
Distributed computing frameworks like Apache Spark enable processing large datasets across multiple machines.
Pre-computation and caching of recommendations for active users improve response times for interactive applications.
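As a rough illustration of the LSH idea (random hyperplane hashing for cosine similarity; the parameters here are arbitrary), candidate neighbors can be restricted to items that land in the same hash bucket before exact similarities are computed:

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(item_vectors, num_hyperplanes=8, seed=0):
    """Hash items by the sign pattern of random projections (cosine LSH)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_hyperplanes, item_vectors.shape[1]))
    buckets = defaultdict(list)
    for idx, vec in enumerate(item_vectors):
        signature = tuple((planes @ vec > 0).astype(int))
        buckets[signature].append(idx)
    return buckets

# Items sharing a bucket are likely (not guaranteed) to be similar;
# exact cosine similarity is then computed only within each bucket.
vectors = np.random.standard_normal((1000, 32))
print("Number of buckets:", len(lsh_buckets(vectors)))
```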
Cold Start Problem
New users and items present challenges for recommender systems, as they lack historical data:
Explicit preference elicitation during onboarding can jumpstart user profiles by asking direct questions about preferences.
Content-based methods can recommend new items based on their features, even without interaction data.
Popularity-based recommendations provide a reasonable fallback for new users until more personalized data is available.
Exploration strategies like multi-armed bandits balance exploiting known preferences with exploring potential new interests.
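A minimal fallback along these lines, assuming a ratings_df with userId/itemId/rating columns as in the collaborative filtering example (the damping rule and history threshold are assumptions):

```python
import numpy as np

def popularity_fallback(ratings_df, user_id, personalized_scores=None,
                        min_history=5, top_n=3):
    """Serve popular items until the user has enough history for personalization.

    personalized_scores: optional pandas Series of item scores for this user.
    """
    history_size = (ratings_df['userId'] == user_id).sum()
    if history_size >= min_history and personalized_scores is not None:
        return personalized_scores.sort_values(ascending=False)[:top_n]
    # Popularity score: mean rating damped by how many users rated the item
    stats = ratings_df.groupby('itemId')['rating'].agg(['mean', 'count'])
    popularity = stats['mean'] * np.log1p(stats['count'])
    seen = set(ratings_df.loc[ratings_df['userId'] == user_id, 'itemId'])
    return popularity[~popularity.index.isin(seen)].sort_values(ascending=False)[:top_n]
```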
Real-Time Updates
User preferences and item catalogs evolve over time, requiring recommender systems to adapt:
Incremental model updates allow incorporating new data without full retraining, which is especially important for matrix factorization models.
Temporal dynamics modeling captures how user preferences change over time, accounting for both long-term preference evolution and short-term context.
Trend detection identifies emerging popular items that might be relevant to users, even if not perfectly aligned with their historical preferences.
Event-triggered updates can respond to significant changes in user behavior, such as a sudden interest in a new category or a major life event.
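For the SVD model used earlier, one lightweight incremental trick is to "fold in" a new user without retraining: project the mean-centered rating vector into the existing item-factor space. A sketch, assuming Vt and the singular values come from a decomposition like the one above:

```python
import numpy as np

def fold_in_user(new_user_ratings, Vt, sigma_values):
    """Approximate predictions for a new user from an existing SVD model."""
    r = np.asarray(new_user_ratings, dtype=float)   # zeros for unrated items
    mean = r.mean()                                  # same zero-filled centering as before
    sigma = np.diag(sigma_values)
    user_factors = (r - mean) @ Vt.T @ np.linalg.inv(sigma)
    return user_factors @ sigma @ Vt + mean          # predicted ratings for every item
```
Passing a length-n rating vector (one entry per item column) yields approximate predictions in the existing latent space; a periodic full retrain is still needed to keep the factors accurate.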
Ethical Considerations
Recommender systems influence what information users see, raising important ethical considerations:
Filter bubbles can form when recommendations reinforce existing preferences, potentially limiting exposure to diverse perspectives. To address this, recommender systems can intentionally introduce diversity in suggestions.
Bias amplification occurs when biased training data leads to even more biased recommendations, potentially disadvantaging certain user groups or item categories. Regular audits for fairness and bias in recommendations help identify and mitigate these issues.
Transparency and explainability help users understand why items are recommended, building trust and enabling them to provide more useful feedback. Explanations like "Recommended because you liked X" make the system's reasoning clear.
Privacy concerns arise from the sensitive nature of user preference data, requiring careful handling and anonymization where appropriate. Federated learning approaches can keep sensitive user data on devices while still enabling model training.
Implementation Strategies for Different Domains
Different application domains have unique characteristics that influence the design of recommender systems:
E-commerce Recommendations
E-commerce systems often focus on complementary and substitute product recommendations:
Complementary products ("frequently bought together") can be identified using association rule mining techniques like Apriori or FP-Growth algorithms.
Substitute products ("customers who viewed this also viewed") typically use item-based collaborative filtering on browsing behavior.
Purchase funnels can be optimized with different recommendation strategies at each stage, from broad exploration during initial browsing to highly personalized suggestions at checkout.
Seasonality and temporal effects are significant in retail, requiring models that account for time-dependent preferences.
```python
# Example: Association rule mining for "frequently bought together" recommendations
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd
# Sample transaction data
transactions = [
['item1', 'item2', 'item3'],
['item1', 'item2', 'item4'],
['item1', 'item4'],
['item2', 'item3'],
['item2', 'item3', 'item4'],
['item2', 'item4'],
['item3', 'item4'],
]
# Convert to one-hot encoded format
def encode_transactions(transactions):
items = set(item for transaction in transactions for item in transaction)
one_hot = []
for transaction in transactions:
row = {item: (item in transaction) for item in items}
one_hot.append(row)
return pd.DataFrame(one_hot)
# Get one-hot encoded DataFrame
one_hot_df = encode_transactions(transactions)
# Find frequent itemsets and association rules
frequent_itemsets = apriori(one_hot_df, min_support=0.3, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
# Get "frequently bought together" recommendations for a specific item
def get_complementary_items(rules_df, item, top_n=2):
# Find rules where the item is in the antecedent
item_rules = rules_df[rules_df['antecedents'].apply(lambda x: item in set(x))]
# Sort by lift and return top N consequents
item_rules = item_rules.sort_values('lift', ascending=False).head(top_n)
# Extract consequent items (remove frozen set wrapper)
complementary_items = [list(rule)[0] for rule in item_rules['consequents']]
return complementary_items
# Get complementary items for 'item2'
complementary_items = get_complementary_items(rules, 'item2')
print(f"Frequently bought with item2: {complementary_items}")
Media and Content Recommendations
Media platforms like streaming services, news sites, and music apps have specific requirements:
Content-based methods work well for media since rich metadata is usually available (genres, actors, directors, authors, etc.).
Session-based recommendations are important as consumption is often influenced by current mood or context rather than just long-term preferences.
Recency effects matter, as newer content often has higher value to users than older content.
Cold-start for new content is addressed by leveraging metadata and content features until sufficient user feedback is available.
```python
# Example: Simple session-based recommender using RNNs
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, GRU, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample session data (user_id, session, item_sequence)
sessions_data = {
'user_id': [1, 1, 2, 2, 3, 3],
'session_id': [1, 2, 1, 2, 1, 2],
'item_sequence': [
[101, 102, 103],
[103, 104, 105],
[101, 105, 106],
[103, 106, 107],
[102, 104, 108],
[104, 108, 109]
]
}
# Convert to arrays for training
item_sequences = np.array(sessions_data['item_sequence'])
num_sessions = len(item_sequences)
max_sequence_length = max(len(seq) for seq in item_sequences)
# Create input sequences and target items
X = []
y = []
for sequence in item_sequences:
for i in range(1, len(sequence)):
X.append(sequence[:i])
y.append(sequence[i])
# Pad sequences to same length
X_padded = pad_sequences(X, maxlen=max_sequence_length-1, padding='pre')
# Convert targets to one-hot encoding (for simplicity, assuming fixed item range)
all_items = set(item for sequence in item_sequences for item in sequence)
num_items = max(all_items) + 1 # +1 for zero padding
y_one_hot = tf.keras.utils.to_categorical(y, num_classes=num_items)
# Create and train session-based RNN model
def create_session_rnn_model(num_items, sequence_length, embedding_dim=32, rnn_units=64):
inputs = Input(shape=(sequence_length,))
embedding = Embedding(input_dim=num_items, output_dim=embedding_dim)(inputs)
gru_layer = GRU(rnn_units)(embedding)
outputs = Dense(num_items, activation='softmax')(gru_layer)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
# Create and train model
model = create_session_rnn_model(num_items, max_sequence_length-1)
model.fit(X_padded, y_one_hot, epochs=20, batch_size=8, verbose=0)
# Function to get next item recommendations based on current session
def get_session_recommendations(model, current_session, num_items, max_length, top_n=3):
# Pad session to match model input
padded_session = pad_sequences([current_session], maxlen=max_length, padding='pre')
# Get predictions
predictions = model.predict(padded_session)[0]
# Get top N recommendations (excluding items already in session)
sorted_indices = np.argsort(predictions)[::-1]
recommendations = [idx for idx in sorted_indices if idx not in current_session][:top_n]
return recommendations
# Get recommendations for an active session
active_session = [101, 102]
recommendations = get_session_recommendations(model, active_session, num_items, max_sequence_length-1)
print(f"Recommended next items for session {active_session}: {recommendations}")
```
Social Recommendations
Social networks and community platforms leverage social connections and shared interests:
Social graph-based recommendations use friendship or follower relationships to identify potential interests.
Engagement patterns and interaction types (likes, comments, shares) provide valuable signals about content relevance.
Virality and trending content detection requires real-time analysis of spreading patterns across the network.
Community detection algorithms can identify clusters of users with shared interests for targeted recommendations.
```python
# Example: Simple social graph-based recommendations
import networkx as nx
import numpy as np
# Create sample social graph (users as nodes, friendships as edges)
G = nx.Graph()
# Add users
for user_id in range(1, 11):
G.add_node(user_id)
# Add friendship connections
friendships = [
(1, 2), (1, 3), (1, 4),
(2, 3), (2, 5),
(3, 4), (3, 6),
(4, 7),
(5, 6), (5, 8),
(6, 9),
(7, 8), (7, 10),
(8, 9), (8, 10),
(9, 10)
]
G.add_edges_from(friendships)
# Sample user interests (user_id -> items)
user_interests = {
1: [101, 102, 103],
2: [101, 104, 105],
3: [102, 105, 106],
4: [103, 107],
5: [104, 108],
6: [105, 109],
7: [106, 110],
8: [107, 111],
9: [108, 112],
10: [109, 110]
}
# Function to get social recommendations
def get_social_recommendations(G, user_interests, user_id, top_n=3):
# Get friends (directly connected nodes)
friends = list(G.neighbors(user_id))
# Get items liked by friends
friend_items = {}
for friend in friends:
for item in user_interests[friend]:
if item not in user_interests[user_id]: # Exclude items user already likes
friend_items[item] = friend_items.get(item, 0) + 1
# Sort by popularity among friends
sorted_items = sorted(friend_items.items(), key=lambda x: x[1], reverse=True)
recommendations = [item for item, count in sorted_items[:top_n]]
return recommendations
# Get recommendations for user 1
social_recs = get_social_recommendations(G, user_interests, 1)
print(f"Social recommendations for user 1: {social_recs}")
```
Advanced Topics and Future Directions
The field of recommender systems continues to evolve with several exciting research directions:
Context-Aware Recommendations
Context-aware recommender systems consider situational factors beyond user-item interactions:
Temporal context captures time-dependent preferences, from time of day effects to seasonal trends.
Location-based recommendations adjust suggestions based on user geography, nearby options, or travel patterns.
Device and platform context recognizes that user needs differ between mobile, desktop, or smart TV interfaces.
Social context considers the influence of friends, family, or colleagues on preferences and decision-making.
```python
# Example: Simple time-aware recommendation weighting
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
# Sample rating data with timestamps
ratings_data = {
'userId': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'itemId': [101, 102, 103, 101, 102, 104, 101, 103, 105],
'rating': [5, 4, 3, 4, 5, 4, 3, 4, 5],
'timestamp': [
# Recent ratings
datetime.now() - timedelta(days=1),
datetime.now() - timedelta(days=2),
datetime.now() - timedelta(days=3),
# Older ratings
datetime.now() - timedelta(days=30),
datetime.now() - timedelta(days=40),
datetime.now() - timedelta(days=20),
# Very old ratings
datetime.now() - timedelta(days=100),
datetime.now() - timedelta(days=120),
datetime.now() - timedelta(days=90)
]
}
ratings_df = pd.DataFrame(ratings_data)
# Add decay weights based on recency
def add_time_weights(df, decay_factor=0.01):
"""Add weights that decay exponentially with time"""
now = datetime.now()
df['days_ago'] = df['timestamp'].apply(lambda x: (now - x).days)
df['time_weight'] = df['days_ago'].apply(lambda x: np.exp(-decay_factor * x))
return df
# Weight ratings by recency
weighted_df = add_time_weights(ratings_df)
print("Ratings with time weights:")
print(weighted_df[['userId', 'itemId', 'rating', 'days_ago', 'time_weight']])
# Function to get time-weighted average ratings
def get_time_weighted_ratings(df):
# Calculate weighted average rating per item
weighted_ratings = df.groupby('itemId').apply(
lambda x: np.average(x['rating'], weights=x['time_weight'])
).reset_index()
weighted_ratings.columns = ['itemId', 'weighted_rating']
return weighted_ratings.sort_values('weighted_rating', ascending=False)
# Get time-weighted item ratings
time_weighted_ratings = get_time_weighted_ratings(weighted_df)
print("\nItems ranked by time-weighted ratings:")
print(time_weighted_ratings)
```
Multi-objective Recommendations
Real-world recommender systems often need to balance multiple competing objectives:
User satisfaction metrics like relevance and diversity may conflict with business goals like conversion and revenue.
Short-term engagement can differ from long-term retention goals, requiring a balanced approach.
Fairness considerations might require trading off some accuracy to ensure equal treatment of different user or item groups.
Multi-task learning approaches can train models to simultaneously optimize for multiple objectives.
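A simple scalarization illustrates the trade-off: blend normalized relevance with a business signal such as expected revenue under a tunable weight (both series and the weight below are hypothetical):

```python
import pandas as pd

def multi_objective_rank(relevance, revenue, alpha=0.8, top_n=3):
    """Rank items by a weighted blend of relevance and revenue (both pandas Series)."""
    rel = (relevance - relevance.min()) / (relevance.max() - relevance.min() + 1e-9)
    rev = (revenue - revenue.min()) / (revenue.max() - revenue.min() + 1e-9)
    blended = alpha * rel + (1 - alpha) * rev
    return blended.sort_values(ascending=False)[:top_n]

relevance = pd.Series({'A': 0.9, 'B': 0.7, 'C': 0.4})
revenue = pd.Series({'A': 1.0, 'B': 5.0, 'C': 3.0})
print(multi_objective_rank(relevance, revenue, alpha=0.6))
```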
Explainable Recommendations
Explainable AI is particularly important for recommender systems to build user trust:
Feature-based explanations highlight content attributes that match user preferences ("Because you enjoy science fiction").
Social explanations leverage network effects ("Your friend Jane liked this").
Statistical explanations provide context about popularity or patterns ("95% of users who bought this also bought...").
Counterfactual explanations help users understand what would need to change to get different recommendations.
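As a small illustration of feature-based explanations, the genres a recommended movie shares with the user's highly rated movies can be surfaced directly; this sketch assumes the movies_df/ratings_df schema from the content-based example at the start of the guide.

```python
def explain_recommendation(user_id, movie_id, movies_df, ratings_df, min_rating=4):
    """Return the genres shared between a recommended movie and the user's favorites."""
    liked = ratings_df[(ratings_df['userId'] == user_id) &
                       (ratings_df['rating'] >= min_rating)]
    liked_genres = set()
    for _, row in liked.merge(movies_df, on='movieId').iterrows():
        liked_genres.update(row['genres'].split('|'))
    rec_genres = set(movies_df.loc[movies_df['movieId'] == movie_id, 'genres'].iloc[0].split('|'))
    shared = rec_genres & liked_genres
    return f"Recommended because you enjoy {', '.join(sorted(shared))}" if shared else None
```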
Reinforcement Learning for Recommendations
Reinforcement learning (RL) treats the recommendation process as a sequential decision problem:
RL approaches can optimize for long-term user engagement rather than immediate clicks.
Exploration-exploitation trade-offs are explicitly modeled, addressing the filter bubble problem.
Multi-armed bandit algorithms provide a lightweight approach for new item exploration.
State representations can incorporate user history, context, and recommendation diversity.
```python
# Example: Simple multi-armed bandit for new item exploration
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
class BetaBandit:
"""Thompson Sampling implementation using Beta distribution"""
def __init__(self, num_arms):
"""Initialize bandit with num_arms items"""
self.num_arms = num_arms
# For each arm, track alpha (successes) and beta (failures)
self.alpha = np.ones(num_arms)
self.beta = np.ones(num_arms)
def select_arm(self):
"""Select arm using Thompson Sampling"""
# Sample from beta distribution for each arm
samples = np.random.beta(self.alpha, self.beta)
# Return arm with highest sample
return np.argmax(samples)
def update(self, arm, reward):
"""Update arm with observed reward (0 or 1)"""
if reward == 1:
self.alpha[arm] += 1
else:
self.beta[arm] += 1
def get_expected_rewards(self):
"""Get expected reward for each arm"""
return self.alpha / (self.alpha + self.beta)
# Simulate bandit algorithm for new item exploration
def simulate_bandits(num_items, true_ctr, num_trials):
"""Simulate multi-armed bandit for recommendation"""
# Initialize bandit
bandit = BetaBandit(num_items)
# Track history
chosen_arms = []
rewards = []
regret = []
# Cumulative regret (difference between optimal and chosen reward)
optimal_arm = np.argmax(true_ctr)
cumulative_regret = 0
for t in range(num_trials):
# Select item to recommend
arm = bandit.select_arm()
chosen_arms.append(arm)
# Simulate user feedback (click or no click)
reward = np.random.binomial(1, true_ctr[arm])
rewards.append(reward)
# Update bandit with feedback
bandit.update(arm, reward)
# Calculate regret
cumulative_regret += true_ctr[optimal_arm] - true_ctr[arm]
regret.append(cumulative_regret)
    return chosen_arms, rewards, regret, bandit
# Set up simulation
num_items = 5
# True click-through rates for items (unknown to the algorithm)
true_ctr = [0.1, 0.05, 0.2, 0.25, 0.15]
num_trials = 1000
# Run simulation
chosen_arms, rewards, regret, bandit = simulate_bandits(num_items, true_ctr, num_trials)
estimated_ctr = bandit.get_expected_rewards()
# Plot results
plt.figure(figsize=(15, 10))
# Plot arm selection over time
plt.subplot(2, 2, 1)
plt.plot(chosen_arms, '.')
plt.xlabel('Trial')
plt.ylabel('Selected Item')
plt.title('Item Selection Over Time')
# Plot cumulative regret
plt.subplot(2, 2, 2)
plt.plot(regret)
plt.xlabel('Trial')
plt.ylabel('Cumulative Regret')
plt.title('Cumulative Regret Over Time')
# Plot true vs estimated CTR
plt.subplot(2, 2, 3)
plt.bar(range(num_items), true_ctr, alpha=0.5, label='True CTR')
plt.bar(range(num_items), estimated_ctr, alpha=0.5, label='Estimated CTR')
plt.xlabel('Item')
plt.ylabel('Click-Through Rate')
plt.title('True vs Estimated CTR')
plt.legend()
# Plot Beta distributions for each arm
plt.subplot(2, 2, 4)
x = np.linspace(0, 1, 100)
for i in range(num_items):
y = beta.pdf(x, bandit.alpha[i], bandit.beta[i])
plt.plot(x, y, label=f'Item {i}')
plt.xlabel('Click-Through Rate')
plt.ylabel('Probability Density')
plt.title('Beta Distributions After Simulation')
plt.legend()
plt.tight_layout()
plt.show()
```
Conclusion
Recommender systems have become essential components of modern applications, helping users navigate the vast amounts of content available to them. Understanding the strengths and limitations of different recommendation approaches allows developers to implement systems that not only provide accurate suggestions but also diverse, novel, and serendipitous discoveries.
This guide has covered the fundamental approaches to recommendation, from content-based and collaborative filtering to deep learning and hybrid methods. We've explored implementation details, evaluation strategies, and practical considerations for deploying these systems in real-world applications.
As you implement your own recommender system, remember that the best approach often depends on your specific domain, data availability, and user needs. Starting with simpler methods and iteratively refining based on evaluation metrics and user feedback is usually more effective than immediately jumping to complex models.
With the code examples and principles covered in this guide, you now have the foundation to build effective recommendation engines that enhance user experience and drive engagement in your applications.