Quick Start Guide¶

This guide will get you started with fewlab in just a few minutes.

Installation¶

Install fewlab using pip:

pip install fewlab

Basic Usage¶

The main function you’ll use is fewlab.items_to_label(), which tells you which items to prioritize for labeling.

Simple Example¶

import pandas as pd
import numpy as np
from fewlab import items_to_label

# Create sample data
# users × items usage matrix
np.random.seed(42)
n_users, n_items = 100, 50

# User features (e.g., demographics)
user_features = pd.DataFrame({
    'age': np.random.normal(35, 10, n_users),
    'income': np.random.normal(50000, 15000, n_users),
    'urban': np.random.binomial(1, 0.6, n_users)
})

# Item usage counts (sparse matrix)
item_usage = pd.DataFrame(
    np.random.poisson(2, (n_users, n_items)),
    columns=[f'item_{i}' for i in range(n_items)]
)

# Get top 15 items to label
priority_items = items_to_label(
    counts=item_usage,
    X=user_features,
    K=15
)

print(f"Priority items to label: {priority_items}")

What You Get¶

The function returns a list of item names (column names from your usage matrix) ranked by importance. These are the items that will give you the most statistical power for understanding the relationship between user features and item characteristics.

Understanding the Results¶

The algorithm prioritizes items based on:

High usage: Items used by many users provide more information
Differential usage: Items with different usage patterns across user segments
Statistical leverage: Items that most influence your regression estimates

Choosing K (Number of Items)¶

Start with 10-20% of your total items:

total_items = item_usage.shape[1]
K = max(10, total_items // 5)  # At least 10, or 20% of items

priority_items = items_to_label(counts=item_usage, X=user_features, K=K)

You can always label more items later if your analysis needs higher precision.

Handling Real Data¶

Data Preparation¶

User alignment: Ensure your user features and usage data have the same index:

# Make sure indices align
assert user_features.index.equals(item_usage.index)

Zero totals: Users with zero usage are automatically dropped:

# Check for users with no usage
total_usage = item_usage.sum(axis=1)
print(f"Users with zero usage: {(total_usage == 0).sum()}")

Sparse data: The algorithm works well with sparse usage matrices.

Advanced Options¶

Control numerical stability:

# Add ridge regularization if needed
priority_items = items_to_label(
    counts=item_usage,
    X=user_features,
    K=15,
    ridge=1e-6  # Small regularization
)

Next Steps¶

See Examples for more detailed use cases
Check API Reference for complete function documentation
Review Mathematical Background for the statistical theory

Common Issues¶

ValueError: counts.index must align with X.index: Make sure your user features and item usage DataFrames have the same row indices.
Singular matrix errors: Try adding ridge regularization or check for perfectly correlated features.
Empty results: Ensure you have users with non-zero usage totals.