Library imports

Introduction: It all starts with the library shopping cart. You know how it goes: spacy for NLP, sklearn for the models, pandas for the data, sqlalchemy to connect to PostgreSQL… The plan was clear from the start: explore a real bicycle-sales dataset, connect Python to SQL, do some feature engineering, and try out several models. Slowly but surely.

In [199]:
import spacy
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
import xgboost as xgb
from statsmodels.tsa.arima.model import ARIMA
from scipy import stats

Data selection and preparation

SQL and Python connection: This is where the magic happens. We connect to PostgreSQL with SQLAlchemy and start firing off SELECT * queries like there's no tomorrow. Stores, stocks, orders, products, customers… each table tells part of this bicycle business's story. The goal was to gather the full context and then assemble a unified, workable dataset.
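A side note on the connection string below: hardcoding the password works for a local exercise, but assembling the DSN from environment variables travels better between machines. A minimal sketch (the `PGUSER`/`PGPASSWORD`/`PGHOST`/`PGPORT`/`PGDATABASE` variable names and the `build_pg_url` helper are illustrative, not part of the original notebook):

```python
import os
from urllib.parse import quote_plus

def build_pg_url(default_db="proyecto_portafolio"):
    """Assemble a PostgreSQL DSN from environment variables,
    URL-escaping the password so special characters survive."""
    user = os.environ.get("PGUSER", "postgres")
    password = quote_plus(os.environ.get("PGPASSWORD", ""))
    host = os.environ.get("PGHOST", "localhost")
    port = os.environ.get("PGPORT", "5432")
    db = os.environ.get("PGDATABASE", default_db)
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"
```

The result can then be passed straight to `create_engine(...)` instead of the literal string.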

In [155]:
engine = create_engine(
    "postgresql://postgres:3061@localhost:5432/proyecto_portafolio"
)

# pd.read_sql accepts a plain SQL string and handles the text() wrapping automatically
df_stores = pd.read_sql("SELECT * FROM stores", engine)
print(f"Tabla cargada: {df_stores.shape[0]} filas, {df_stores.shape[1]} columnas")
print(df_stores.head())
Tabla cargada: 3 filas, 8 columnas
   store_id        store_name           phone                 email  \
0         1  Santa Cruz Bikes  (831) 476-4321  santacruz@bikes.shop   
1         2     Baldwin Bikes  (516) 379-8888    baldwin@bikes.shop   
2         3     Rowlett Bikes  (972) 530-5555    rowlett@bikes.shop   

                street        city state  zip_code  
0   3700 Portola Drive  Santa Cruz    CA   95060.0  
1   4200 Chestnut Lane     Baldwin    NY   11432.0  
2  8000 Fairway Avenue     Rowlett    TX   75088.0  
In [156]:
df_stocks = pd.read_sql("SELECT * FROM stocks", engine)
print(f"Tabla cargada: {df_stocks.shape[0]} filas, {df_stocks.shape[1]} columnas")
print(df_stocks.head())
Tabla cargada: 939 filas, 3 columnas
   store_id  product_id  quantity
0         1           1      27.0
1         1           2       5.0
2         1           3       6.0
3         1           4      23.0
4         1           5      22.0
In [157]:
df_order_items = pd.read_sql("SELECT * FROM order_items", engine)
print(f"Tabla cargada: {df_order_items.shape[0]} filas, {df_order_items.shape[1]} columnas")
print(df_order_items.head())
Tabla cargada: 4722 filas, 6 columnas
   order_id  item_id  product_id  quantity  list_price  discount
0         1      1.0          20       1.0      599.99      0.20
1         1      2.0           8       2.0     1799.99      0.07
2         1      3.0          10       2.0     1549.00      0.05
3         1      4.0          16       2.0      599.99      0.05
4         1      5.0           4       1.0     2899.99      0.20
In [158]:
df_products = pd.read_sql("SELECT * FROM products", engine)
print(f"Tabla cargada: {df_products.shape[0]} filas, {df_products.shape[1]} columnas")
print(df_products.head())
Tabla cargada: 321 filas, 6 columnas
   product_id                        product_name  brand_id  category_id  \
0           1                     Trek 820 - 2016         9            6   
1           2  Ritchey Timberwolf Frameset - 2016         5            6   
2           3     Surly Wednesday Frameset - 2016         8            6   
3           4            Trek Fuel EX 8 29 - 2016         9            6   
4           5        Heller Shagamaw Frame - 2016         3            6   

   model_year  list_price  
0        2016      379.99  
1        2016      749.99  
2        2016      999.99  
3        2016     2899.99  
4        2016     1320.99  
In [159]:
df_orders = pd.read_sql("SELECT * FROM orders", engine)
print(f"Tabla cargada: {df_orders.shape[0]} filas, {df_orders.shape[1]} columnas")
print(df_orders.head())
Tabla cargada: 1615 filas, 8 columnas
   order_id  customer_id  order_status  order_date required_date shipped_date  \
0         1          259           4.0  2016-01-01    2016-01-03   2016-01-03   
1         2         1212           4.0  2016-01-01    2016-01-04   2016-01-03   
2         3          523           4.0  2016-01-02    2016-01-05   2016-01-03   
3         4          175           4.0  2016-01-03    2016-01-04   2016-01-05   
4         5         1324           4.0  2016-01-03    2016-01-06   2016-01-06   

   store_id  staff_id  
0         1         2  
1         2         6  
2         2         7  
3         1         3  
4         2         6  
In [160]:
df_brands = pd.read_sql("SELECT * FROM brands", engine)
print(f"Tabla cargada: {df_brands.shape[0]} filas, {df_brands.shape[1]} columnas")
print(df_brands.head())
Tabla cargada: 9 filas, 2 columnas
   brand_id   brand_name
0         1      Electra
1         2         Haro
2         3       Heller
3         4  Pure Cycles
4         5      Ritchey
In [161]:
df_categories = pd.read_sql("SELECT * FROM categories", engine)
print(f"Tabla cargada: {df_categories.shape[0]} filas, {df_categories.shape[1]} columnas")
print(df_categories.head())
Tabla cargada: 7 filas, 2 columnas
   category_id        category_name
0            1    Children Bicycles
1            2     Comfort Bicycles
2            3    Cruisers Bicycles
3            4  Cyclocross Bicycles
4            5       Electric Bikes
In [163]:
df_customers = pd.read_sql("SELECT * FROM customers", engine)
print(f"Tabla cargada: {df_customers.shape[0]} filas, {df_customers.shape[1]} columnas")
print(df_customers.head())
Tabla cargada: 1445 filas, 9 columnas
   customer_id  first_name last_name           phone                    email  \
0            1       Debra     Burks            None    debra.burks@yahoo.com   
1            2       Kasha      Todd            None     kasha.todd@yahoo.com   
2            3      Tameka    Fisher            None    tameka.fisher@aol.com   
3            4       Daryl    Spence            None     daryl.spence@aol.com   
4            5  Charolette      Rice  (916) 381-6003  charolette.rice@msn.com   

                  street           city state  zip_code  
0      9273 Thorne Ave.    Orchard Park    NY   14127.0  
1       910 Vine Street        Campbell    CA   95008.0  
2  769C Honey Creek St.   Redondo Beach    CA   90278.0  
3        988 Pearl Lane       Uniondale    NY   11553.0  
4         107 River Dr.      Sacramento    CA   95820.0  
In [164]:
df_staffs = pd.read_sql("SELECT * FROM staffs", engine)
print(f"Tabla cargada: {df_staffs.shape[0]} filas, {df_staffs.shape[1]} columnas")
print(df_staffs.head())
Tabla cargada: 10 filas, 8 columnas
   staff_id first_name last_name                       email           phone  \
0         1    Fabiola   Jackson  fabiola.jackson@bikes.shop  (831) 555-5554   
1         2     Mireya  Copeland  mireya.copeland@bikes.shop  (831) 555-5555   
2         3      Genna   Serrano    genna.serrano@bikes.shop  (831) 555-5556   
3         4     Virgie   Wiggins   virgie.wiggins@bikes.shop  (831) 555-5557   
4         5   Jannette     David   jannette.david@bikes.shop  (516) 379-4444   

   active  store_id  manager_id  
0     1.0         1         NaN  
1     1.0         1         1.0  
2     1.0         1         2.0  
3     1.0         1         2.0  
4     1.0         2         1.0  
In [165]:
# 1. DATA LOADING AND FEATURE ENGINEERING
# Build a complete dataset by joining the relevant tables
query = """
SELECT 
    oi.order_id,
    oi.item_id,
    oi.quantity,
    oi.list_price,
    oi.discount,
    o.order_date,
    o.required_date,
    o.shipped_date,
    o.order_status,
    p.product_name,
    p.model_year,
    p.list_price as product_list_price,
    b.brand_name,
    c.category_name,
    cust.city as customer_city,
    cust.state as customer_state,
    s.store_name,
    st.first_name as staff_first_name,
    st.last_name as staff_last_name
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN brands b ON p.brand_id = b.brand_id
JOIN categories c ON p.category_id = c.category_id
JOIN customers cust ON o.customer_id = cust.customer_id
JOIN stores s ON o.store_id = s.store_id
JOIN staffs st ON o.staff_id = st.staff_id
"""

df_bd_completa = pd.read_sql(query, engine)

print(f"Dataset cargado: {df_bd_completa.shape}")
print(df_bd_completa.head())
print(f"\nColumnas: {df_bd_completa.columns.tolist()}")
print(f"\nValores nulos:\n{df_bd_completa.isnull().sum()}")
Dataset cargado: (4722, 19)
   order_id  item_id  quantity  list_price  discount  order_date  \
0         1      1.0       1.0      599.99      0.20  2016-01-01   
1         1      2.0       2.0     1799.99      0.07  2016-01-01   
2         1      3.0       2.0     1549.00      0.05  2016-01-01   
3         1      4.0       2.0      599.99      0.05  2016-01-01   
4         1      5.0       1.0     2899.99      0.20  2016-01-01   

  required_date shipped_date  order_status  \
0    2016-01-03   2016-01-03           4.0   
1    2016-01-03   2016-01-03           4.0   
2    2016-01-03   2016-01-03           4.0   
3    2016-01-03   2016-01-03           4.0   
4    2016-01-03   2016-01-03           4.0   

                                     product_name  model_year  \
0  Electra Townie Original 7D EQ - Women's - 2016        2016   
1           Trek Remedy 29 Carbon Frameset - 2016        2016   
2                          Surly Straggler - 2016        2016   
3            Electra Townie Original 7D EQ - 2016        2016   
4                        Trek Fuel EX 8 29 - 2016        2016   

   product_list_price brand_name        category_name customer_city  \
0              599.99    Electra    Cruisers Bicycles    Pleasanton   
1             1799.99       Trek       Mountain Bikes    Pleasanton   
2             1549.00      Surly  Cyclocross Bicycles    Pleasanton   
3              599.99    Electra    Cruisers Bicycles    Pleasanton   
4             2899.99       Trek       Mountain Bikes    Pleasanton   

  customer_state        store_name staff_first_name staff_last_name  
0             CA  Santa Cruz Bikes           Mireya        Copeland  
1             CA  Santa Cruz Bikes           Mireya        Copeland  
2             CA  Santa Cruz Bikes           Mireya        Copeland  
3             CA  Santa Cruz Bikes           Mireya        Copeland  
4             CA  Santa Cruz Bikes           Mireya        Copeland  

Columnas: ['order_id', 'item_id', 'quantity', 'list_price', 'discount', 'order_date', 'required_date', 'shipped_date', 'order_status', 'product_name', 'model_year', 'product_list_price', 'brand_name', 'category_name', 'customer_city', 'customer_state', 'store_name', 'staff_first_name', 'staff_last_name']

Valores nulos:
order_id                0
item_id                 0
quantity                0
list_price              0
discount                0
order_date              0
required_date           0
shipped_date          508
order_status            0
product_name            0
model_year              0
product_list_price      0
brand_name              0
category_name           0
customer_city           0
customer_state          0
store_name              0
staff_first_name        0
staff_last_name         0
dtype: int64
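Those 508 nulls in shipped_date are information, not noise: they most likely mark orders that never shipped. Instead of dropping the rows, they can be turned into features. A sketch on a toy frame (column names mirror the dataset; the `is_shipped` and `ship_delay_days` features are suggestions, not something the notebook computes):

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
    "shipped_date": pd.to_datetime(["2016-01-03", None, "2016-01-05"]),
})

# Flag rows that never shipped instead of discarding them
df["is_shipped"] = df["shipped_date"].notna().astype(int)

# Days between order and shipment; NaN where the order never shipped
df["ship_delay_days"] = (df["shipped_date"] - df["order_date"]).dt.days
```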
In [166]:
# ==========================================
# 2. FEATURE ENGINEERING
# ==========================================
print("\n[2/6] Realizando Feature Engineering...")

df_bd_completa['total_sale'] = df_bd_completa['quantity'] * df_bd_completa['list_price'] * (1 - df_bd_completa['discount'])
df_bd_completa['order_date'] = pd.to_datetime(df_bd_completa['order_date'])
df_bd_completa['month'] = df_bd_completa['order_date'].dt.month
df_bd_completa['day_of_week'] = df_bd_completa['order_date'].dt.dayofweek
df_bd_completa['quarter'] = df_bd_completa['order_date'].dt.quarter

print(f"Variables temporales creadas")
print(f"  - total_sale: ${df_bd_completa['total_sale'].min():.2f} - ${df_bd_completa['total_sale'].max():.2f}")
print(f"  - Promedio de venta: ${df_bd_completa['total_sale'].mean():.2f}")
print(f"  - Mediana de venta: ${df_bd_completa['total_sale'].median():.2f}")

# Encoding
le_brand = LabelEncoder()
le_category = LabelEncoder()
le_state = LabelEncoder()
le_status = LabelEncoder()

df_bd_completa['brand_encoded'] = le_brand.fit_transform(df_bd_completa['brand_name'])
df_bd_completa['category_encoded'] = le_category.fit_transform(df_bd_completa['category_name'])
df_bd_completa['state_encoded'] = le_state.fit_transform(df_bd_completa['customer_state'])
df_bd_completa['status_encoded'] = le_status.fit_transform(df_bd_completa['order_status'])

print(f"Variables categóricas codificadas")
print(f"  - Brands únicos: {df_bd_completa['brand_name'].nunique()}")
print(f"  - Categorías únicas: {df_bd_completa['category_name'].nunique()}")
print(f"  - Estados únicos: {df_bd_completa['customer_state'].nunique()}")
[2/6] Realizando Feature Engineering...
Variables temporales creadas
  - total_sale: $71.99 - $21599.98
  - Promedio de venta: $1628.36
  - Mediana de venta: $845.98
Variables categóricas codificadas
  - Brands únicos: 9
  - Categorías únicas: 7
  - Estados únicos: 3
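One caveat on LabelEncoder worth keeping in mind: it assigns arbitrary integer codes, which is harmless for tree models but injects a fake ordering that a linear model would read as magnitude. One-hot encoding sidesteps this. A small sketch (toy data, not the real brand column):

```python
import pandas as pd

brands = pd.Series(["Electra", "Trek", "Surly", "Trek"])

# LabelEncoder-style integer codes (alphabetical): fine for trees,
# but a linear model would interpret "Trek > Surly > Electra" as magnitude
codes = brands.astype("category").cat.codes

# One-hot encoding avoids the artificial ordering
one_hot = pd.get_dummies(brands, prefix="brand")
```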

Dataset and preparation: One massive JOIN leaves us with ~4,700 order lines and 19 columns. From there we derive temporal features (month, day of week, quarter), encode brand and category, and use StandardScaler with an 80/20 split. Everything is ready to train regression models and see whether we can predict sales.

Warning: possible data leakage — The variables quantity, list_price and discount are directly related to total_sale (total_sale = quantity × list_price × (1 − discount)). Including them can inflate the metrics artificially.
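A quick way to confirm this kind of suspicion before training anything: check whether the target can be reconstructed exactly from the suspect columns. A sketch on a toy frame mimicking the engineered columns (the `leakage_check` helper is illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

def leakage_check(df, target, formula_cols):
    """Return True if `target` is an exact arithmetic function of the
    candidate columns, i.e. the features fully determine the label."""
    q, p, d = formula_cols
    reconstructed = df[q] * df[p] * (1 - df[d])
    return bool(np.allclose(df[target], reconstructed))

# Toy frame mimicking df_bd_completa's engineered columns
toy = pd.DataFrame({"quantity": [1.0, 2.0],
                    "list_price": [599.99, 1549.00],
                    "discount": [0.20, 0.05]})
toy["total_sale"] = toy["quantity"] * toy["list_price"] * (1 - toy["discount"])

print(leakage_check(toy, "total_sale", ["quantity", "list_price", "discount"]))  # True
```

If this returns True on the real frame, any flexible model will look near-perfect for the wrong reason.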

In [167]:
# ==========================================
# 3. DATA PREPARATION
# ==========================================
print("\n[3/6] Preparando datos para entrenamiento...")

features = ['quantity', 'list_price', 'discount', 'model_year', 
            'month', 'day_of_week', 'quarter',
            'brand_encoded', 'category_encoded', 'state_encoded', 'status_encoded']

X = df_bd_completa[features]
y = df_bd_completa['total_sale']

# Check for null values
null_counts = X.isnull().sum()
if null_counts.sum() > 0:
    print(f"Advertencia: Valores nulos encontrados:")
    print(null_counts[null_counts > 0])
else:
    print(f"No hay valores nulos en las features")

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Datos divididos:")
print(f"  - Entrenamiento: {X_train.shape[0]:,} registros ({(X_train.shape[0]/len(X))*100:.1f}%)")
print(f"  - Prueba: {X_test.shape[0]:,} registros ({(X_test.shape[0]/len(X))*100:.1f}%)")
[3/6] Preparando datos para entrenamiento...
No hay valores nulos en las features
Datos divididos:
  - Entrenamiento: 3,777 registros (80.0%)
  - Prueba: 945 registros (20.0%)
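Worth noting: a random 80/20 split mixes future and past rows, which can be optimistic for sales data. A time-based split, cutting at a date boundary, is often the more honest alternative. A sketch (the `temporal_split` helper is illustrative, not what the notebook uses):

```python
import pandas as pd

def temporal_split(df, date_col, frac_train=0.8):
    """Sort by date and cut at a time boundary, so the test set
    lies strictly after the training set."""
    df_sorted = df.sort_values(date_col)
    cut = int(len(df_sorted) * frac_train)
    return df_sorted.iloc[:cut], df_sorted.iloc[cut:]

toy = pd.DataFrame({
    "order_date": pd.date_range("2016-01-01", periods=10, freq="D"),
    "total_sale": range(10),
})
train, test = temporal_split(toy, "order_date")
print(len(train), len(test))  # 8 2
```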
In [168]:
# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Features escaladas con StandardScaler")

# ==========================================
# 4. TRAINING - RANDOM FOREST
# ==========================================
print("\n[4/6] Entrenando modelo Random Forest...")
print("-" * 80)
rf_model = RandomForestRegressor(
    n_estimators=100, 
    random_state=42, 
    n_jobs=-1,
    verbose=0
)
rf_model.fit(X_train_scaled, y_train)

print(f"Random Forest entrenado")
print(f"  - Número de árboles: {rf_model.n_estimators}")
print(f"  - Profundidad máxima: {rf_model.max_depth}")
Features escaladas con StandardScaler

[4/6] Entrenando modelo Random Forest...
--------------------------------------------------------------------------------
Random Forest entrenado
  - Número de árboles: 100
  - Profundidad máxima: None
In [169]:
# Predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)
mape_rf = np.mean(np.abs((y_test - y_pred_rf) / y_test)) * 100

print("\nRESULTADOS - RANDOM FOREST")
print("=" * 80)
print(f"MAE (Error Absoluto Medio):        ${mae_rf:>10.2f}")
print(f"RMSE (Raíz Error Cuadrático):      ${rmse_rf:>10.2f}")
print(f"R² (Coeficiente Determinación):     {r2_rf:>10.4f}  ({r2_rf*100:.2f}%)")
print(f"MAPE (Error Porcentual Absoluto):   {mape_rf:>10.2f}%")
print("=" * 80)
RESULTADOS - RANDOM FOREST
================================================================================
MAE (Error Absoluto Medio):        $      6.30
RMSE (Raíz Error Cuadrático):      $     39.33
R² (Coeficiente Determinación):         0.9996  (99.96%)
MAPE (Error Porcentual Absoluto):         0.28%
================================================================================
In [170]:
# Error analysis
errors_rf = np.abs(y_test - y_pred_rf)
print(f"\nAnálisis de Errores:")
print(f"  - Error mínimo:  ${errors_rf.min():.2f}")
print(f"  - Error máximo:  ${errors_rf.max():.2f}")
print(f"  - Error mediano: ${errors_rf.median():.2f}")
print(f"  - Desv. estándar: ${errors_rf.std():.2f}")
Análisis de Errores:
  - Error mínimo:  $0.00
  - Error máximo:  $982.10
  - Error mediano: $0.04
  - Desv. estándar: $38.85
In [171]:
# ==========================================
# 5. TRAINING - XGBOOST
# ==========================================
print("\n[5/6] Entrenando modelo XGBoost...")
print("-" * 80)

xgb_model = xgb.XGBRegressor(
    n_estimators=100, 
    learning_rate=0.1, 
    random_state=42,
    verbosity=0
)
xgb_model.fit(X_train_scaled, y_train)

print(f"XGBoost entrenado")
print(f"  - Número de estimadores: {xgb_model.n_estimators}")
print(f"  - Tasa de aprendizaje: {xgb_model.learning_rate}")
[5/6] Entrenando modelo XGBoost...
--------------------------------------------------------------------------------
XGBoost entrenado
  - Número de estimadores: 100
  - Tasa de aprendizaje: 0.1
In [172]:
# Predictions
y_pred_xgb = xgb_model.predict(X_test_scaled)

# Metrics
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)
mape_xgb = np.mean(np.abs((y_test - y_pred_xgb) / y_test)) * 100

print("\nRESULTADOS - XGBOOST")
print("=" * 80)
print(f"MAE (Error Absoluto Medio):        ${mae_xgb:>10.2f}")
print(f"RMSE (Raíz Error Cuadrático):      ${rmse_xgb:>10.2f}")
print(f"R² (Coeficiente Determinación):     {r2_xgb:>10.4f}  ({r2_xgb*100:.2f}%)")
print(f"MAPE (Error Porcentual Absoluto):   {mape_xgb:>10.2f}%")
print("=" * 80)
RESULTADOS - XGBOOST
================================================================================
MAE (Error Absoluto Medio):        $      5.86
RMSE (Raíz Error Cuadrático):      $     19.25
R² (Coeficiente Determinación):         0.9999  (99.99%)
MAPE (Error Porcentual Absoluto):         0.43%
================================================================================

Warning: the data leakage is confirmed

Interpretation: The numbers say it all. R² of 99.9%, MAE of a few dollars… too good to be true. The model is simply learning the formula quantity × list_price × (1 − discount). In production we will not have those variables before the sale happens. Key lesson: always verify that the features would actually be available at prediction time.

In [173]:
# Error analysis
errors_xgb = np.abs(y_test - y_pred_xgb)
print(f"\nAnálisis de Errores:")
print(f"  - Error mínimo:  ${errors_xgb.min():.2f}")
print(f"  - Error máximo:  ${errors_xgb.max():.2f}")
print(f"  - Error mediano: ${errors_xgb.median():.2f}")
print(f"  - Desv. estándar: ${errors_xgb.std():.2f}")
Análisis de Errores:
  - Error mínimo:  $0.00
  - Error máximo:  $296.26
  - Error mediano: $1.89
  - Desv. estándar: $18.34
In [174]:
# ==========================================
# 6. MODEL COMPARISON
# ==========================================
print("\n[6/6] Comparación de Modelos")
print("=" * 80)

comparison = pd.DataFrame({
    'Métrica': ['MAE ($)', 'RMSE ($)', 'R²', 'MAPE (%)'],
    'Random Forest': [mae_rf, rmse_rf, r2_rf, mape_rf],
    'XGBoost': [mae_xgb, rmse_xgb, r2_xgb, mape_xgb],
    'Diferencia': [
        mae_rf - mae_xgb,
        rmse_rf - rmse_xgb,
        r2_rf - r2_xgb,
        mape_rf - mape_xgb
    ]
})

print(comparison.to_string(index=False))
print("=" * 80)
[6/6] Comparación de Modelos
================================================================================
 Métrica  Random Forest   XGBoost  Diferencia
 MAE ($)       6.300957  5.864546    0.436411
RMSE ($)      39.333542 19.248571   20.084971
      R²       0.999622  0.999909   -0.000288
MAPE (%)       0.275179  0.427601   -0.152422
================================================================================
In [175]:
# Percentage improvement
mejora_mae = ((mae_rf - mae_xgb) / mae_rf) * 100
mejora_rmse = ((rmse_rf - rmse_xgb) / rmse_rf) * 100

print(f"\nMODELO GANADOR: XGBoost")
print(f"  - MAE mejorado en:  {mejora_mae:.2f}%")
print(f"  - RMSE mejorado en: {mejora_rmse:.2f}%")
MODELO GANADOR: XGBoost
  - MAE mejorado en:  6.93%
  - RMSE mejorado en: 51.06%
In [176]:
# ==========================================
# 7. FEATURE IMPORTANCE
# ==========================================
print("\nIMPORTANCIA DE FEATURES - Random Forest")
print("-" * 80)

feature_importance = pd.DataFrame({
    'Feature': features,
    'Importancia': rf_model.feature_importances_
}).sort_values('Importancia', ascending=False)

for idx, row in feature_importance.iterrows():
    bar = '█' * int(row['Importancia'] * 100)
    print(f"{row['Feature']:20s} | {bar} {row['Importancia']:.4f}")

print("\nIMPORTANCIA DE FEATURES - XGBoost")
print("-" * 80)

feature_importance_xgb = pd.DataFrame({
    'Feature': features,
    'Importancia': xgb_model.feature_importances_
}).sort_values('Importancia', ascending=False)

for idx, row in feature_importance_xgb.iterrows():
    bar = '█' * int(row['Importancia'] * 100)
    print(f"{row['Feature']:20s} | {bar} {row['Importancia']:.4f}")
IMPORTANCIA DE FEATURES - Random Forest
--------------------------------------------------------------------------------
list_price           | ████████████████████████████████████████████████████████████████████████████████████ 0.8449
quantity             | ██████████████ 0.1490
discount             |  0.0053
model_year           |  0.0002
status_encoded       |  0.0001
month                |  0.0001
day_of_week          |  0.0001
category_encoded     |  0.0001
brand_encoded        |  0.0001
quarter              |  0.0000
state_encoded        |  0.0000

IMPORTANCIA DE FEATURES - XGBoost
--------------------------------------------------------------------------------
list_price           | ████████████████████████████████████████████████████████ 0.5635
quantity             | ███████████████████████████████████████████ 0.4325
discount             |  0.0037
day_of_week          |  0.0001
brand_encoded        |  0.0000
model_year           |  0.0000
category_encoded     |  0.0000
status_encoded       |  0.0000
state_encoded        |  0.0000
month                |  0.0000
quarter              |  0.0000
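Impurity-based importances like the ones above can be cross-checked with permutation importance: shuffle one column at a time and measure how much the score degrades. A self-contained sketch with a toy "model" whose true dependence we know (everything here is illustrative, not the notebook's fitted Random Forest):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                         # three candidate features
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)   # only feature 0 matters

def model_predict(X):
    # Stand-in for a fitted model: here we know the true relationship
    return 3.0 * X[:, 0]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def permutation_importance(X, y, predict, n_repeats=5, seed=1):
    """MSE increase when each column is shuffled (higher = more important)."""
    rng = np.random.default_rng(seed)
    base = mse(y, predict(X))
    importances = []
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])   # break the column's link to y
            increases.append(mse(y, predict(Xp)) - base)
        importances.append(float(np.mean(increases)))
    return importances

imp = permutation_importance(X, y, model_predict)
```

Shuffling the only informative feature should blow up the error, while shuffling the ignored columns should change nothing.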
In [177]:
# ==========================================
# 8. PREDICTION EXAMPLES
# ==========================================
print("\nEJEMPLOS DE PREDICCIONES (5 casos aleatorios)")
print("=" * 80)

sample_indices = np.random.choice(len(y_test), 5, replace=False)
samples = pd.DataFrame({
    'Real': y_test.iloc[sample_indices].values,
    'Pred_RF': y_pred_rf[sample_indices],
    'Pred_XGB': y_pred_xgb[sample_indices],
    'Error_RF': np.abs(y_test.iloc[sample_indices].values - y_pred_rf[sample_indices]),
    'Error_XGB': np.abs(y_test.iloc[sample_indices].values - y_pred_xgb[sample_indices])
})

for i, (idx, row) in enumerate(samples.iterrows(), 1):
    print(f"\nCaso {i}:")
    print(f"  Valor Real:        ${row['Real']:>8.2f}")
    print(f"  Predicción RF:     ${row['Pred_RF']:>8.2f}  (error: ${row['Error_RF']:>6.2f})")
    print(f"  Predicción XGB:    ${row['Pred_XGB']:>8.2f}  (error: ${row['Error_XGB']:>6.2f})")

print("\n" + "=" * 80)
print("PROCESO COMPLETADO EXITOSAMENTE")
print("=" * 80)
EJEMPLOS DE PREDICCIONES (5 casos aleatorios)
================================================================================

Caso 1:
  Valor Real:        $  511.49
  Predicción RF:     $  511.49  (error: $  0.00)
  Predicción XGB:    $  507.85  (error: $  3.64)

Caso 2:
  Valor Real:        $  989.98
  Predicción RF:     $  989.98  (error: $  0.00)
  Predicción XGB:    $  991.13  (error: $  1.15)

Caso 3:
  Valor Real:        $  815.10
  Predicción RF:     $  815.19  (error: $  0.09)
  Predicción XGB:    $  815.53  (error: $  0.43)

Caso 4:
  Valor Real:        $  959.98
  Predicción RF:     $  959.98  (error: $  0.00)
  Predicción XGB:    $  962.50  (error: $  2.52)

Caso 5:
  Valor Real:        $ 4749.99
  Predicción RF:     $ 4743.34  (error: $  6.65)
  Predicción XGB:    $ 4749.02  (error: $  0.97)

================================================================================
PROCESO COMPLETADO EXITOSAMENTE
================================================================================
In [178]:
# Temporal aggregation
monthly_sales = df_bd_completa.groupby([
    pd.Grouper(key='order_date', freq='ME'),  # 'ME' (month end); plain 'M' is deprecated in recent pandas
    'category_name',
    'brand_name'
]).agg({
    'quantity': 'sum',
    'total_sale': 'sum'
}).reset_index()

print(f"New dataset: {monthly_sales.shape}")
print(monthly_sales.head())

# Now predict the aggregated total_sale or quantity
# This carries more signal because aggregation reduces noise
New dataset: (457, 5)
  order_date        category_name   brand_name  quantity  total_sale
0 2016-01-31    Children Bicycles      Electra      30.0   7515.3313
1 2016-01-31     Comfort Bicycles      Electra      18.0   9056.3361
2 2016-01-31    Cruisers Bicycles      Electra      60.0  25429.5610
3 2016-01-31    Cruisers Bicycles  Pure Cycles      22.0   8627.6500
4 2016-01-31  Cyclocross Bicycles        Surly      18.0  26006.9925
In [180]:
print("=" * 80)
print("DIAGNÓSTICO COMPLETO DEL DATASET")
print("=" * 80)

# 1. Distribution of quantity
print("\n1) DISTRIBUCIÓN DE QUANTITY")
print(df_bd_completa['quantity'].describe())
print("\nConteo de valores:")
print(df_bd_completa['quantity'].value_counts().sort_index().head(10))

# 2. Correlations
print("\n2) CORRELACIONES CON QUANTITY")
numeric_cols = ['quantity', 'model_year', 'month', 'day_of_week', 'quarter']
correlations = df_bd_completa[numeric_cols].corr()['quantity'].sort_values(ascending=False)
print(correlations)

# 3. Sales by category
print("\n3) QUANTITY PROMEDIO POR CATEGORÍA")
category_stats = df_bd_completa.groupby('category_name')['quantity'].agg(['mean', 'std', 'count'])
print(category_stats.sort_values('mean', ascending=False))

# 4. Temporal analysis
print("\n4) TENDENCIA TEMPORAL")
monthly = df_bd_completa.groupby(df_bd_completa['order_date'].dt.to_period('M'))['quantity'].agg(['mean', 'sum'])
print(monthly)
================================================================================
DIAGNÓSTICO COMPLETO DEL DATASET
================================================================================

1) DISTRIBUCIÓN DE QUANTITY
count    4722.000000
mean        1.498941
std         0.500052
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         2.000000
Name: quantity, dtype: float64

Conteo de valores:
quantity
1.0    2366
2.0    2356
Name: count, dtype: int64

2) CORRELACIONES CON QUANTITY
quantity       1.000000
month         -0.000247
quarter       -0.001865
model_year    -0.003884
day_of_week   -0.016245
Name: quantity, dtype: float64

3) QUANTITY PROMEDIO POR CATEGORÍA
                         mean       std  count
category_name                                 
Cyclocross Bicycles  1.539062  0.499448    256
Comfort Bicycles     1.513966  0.500271    537
Children Bicycles    1.507673  0.500261    782
Cruisers Bicycles    1.497097  0.500173   1378
Road Bikes           1.494652  0.500641    374
Electric Bikes       1.485849  0.500983    212
Mountain Bikes       1.483516  0.499940   1183

4) TENDENCIA TEMPORAL
                mean    sum
order_date                 
2016-01     1.503401  221.0
2016-02     1.506757  223.0
2016-03     1.521429  213.0
2016-04     1.491525  176.0
2016-05     1.534247  224.0
2016-06     1.530769  199.0
2016-07     1.517986  211.0
2016-08     1.502994  251.0
2016-09     1.463542  281.0
2016-10     1.468208  254.0
2016-11     1.459677  181.0
2016-12     1.536913  229.0
2017-01     1.458599  229.0
2017-02     1.511494  263.0
2017-03     1.465347  296.0
2017-04     1.493976  248.0
2017-05     1.478528  241.0
2017-06     1.533679  296.0
2017-07     1.491018  249.0
2017-08     1.502618  287.0
2017-09     1.549020  237.0
2017-10     1.541667  296.0
2017-11     1.460606  241.0
2017-12     1.469388  216.0
2018-01     1.487654  241.0
2018-02     1.495327  160.0
2018-03     1.502591  290.0
2018-04     1.506494  580.0
2018-06     1.000000    1.0
2018-07     1.222222   11.0
2018-08     1.500000    9.0
2018-09     1.333333    4.0
2018-10     1.600000    8.0
2018-11     1.600000    8.0
2018-12     1.333333    4.0

The business reality:

This dataset comes from a bicycle shop where most orders are for 1 unit (a customer buys one bike), some for 2 (couples, families), and there are no strong temporal or category patterns (people buy when they need a bike). The dataset is not broken — it simply reflects a typical B2C business. As a data engineer, understanding this context keeps you from hunting for patterns that are not there.
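With quantity split almost 50/50 between 1 and 2 and no correlated features, any model predicting it should first be measured against a majority-class baseline, which here sits around 50%. A sketch (a synthetic draw that mimics the real 2366/2356 split):

```python
import numpy as np

rng = np.random.default_rng(7)
# Mimics the observed near-50/50 split between 1-unit and 2-unit orders
quantity = rng.choice([1.0, 2.0], size=4722, p=[0.501, 0.499])

# Majority-class baseline: always predict the most frequent value
values, counts = np.unique(quantity, return_counts=True)
majority = values[np.argmax(counts)]
baseline_acc = float(np.mean(quantity == majority))
```

Any classifier that does not clearly beat `baseline_acc` has found no real signal.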

Series temporales aplicadas a los datos

Probamos ARIMA y SARIMA sobre ventas mensuales. Con pocos datos y poca estacionalidad clara, los modelos clásicos de series temporales se vuelven inestables. Los MAE (~$54k y ~$171k) muestran que para este tipo de negocio, predecir por mes no es la mejor apuesta. Alternativas: agregar más historial, usar Prophet o modelos ensemble con features externas.
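Antes de culpar al modelo, conviene compararlo con un baseline ingenuo (repetir el último valor o la media del train). Sketch autocontenido con una serie sintética; los nombres y los datos son ilustrativos, no los del proyecto:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Serie mensual sintética (~28 meses), solo para ilustrar el baseline
rng = np.random.default_rng(0)
serie = 200_000 + rng.normal(0, 20_000, 28)

serie_train, serie_test = serie[:22], serie[22:]

# Baseline 1: repetir el último valor observado del train
pred_ultimo = np.full(len(serie_test), serie_train[-1])
# Baseline 2: repetir la media del train
pred_media = np.full(len(serie_test), serie_train.mean())

mae_ultimo = mean_absolute_error(serie_test, pred_ultimo)
mae_media = mean_absolute_error(serie_test, pred_media)
print(f"MAE último valor: ${mae_ultimo:,.0f}")
print(f"MAE media train:  ${mae_media:,.0f}")
```

Si ARIMA no mejora estos baselines triviales, la serie no aporta señal explotable y conviene replantear el problema.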

In [181]:
# 1. FILTRADO POR ESTADO DE PEDIDO
# Basado en tu esquema, asumimos que '4' es 'Completed'. 
# Ajusta el número según la lógica de tu negocio.
df_filtrado = df_bd_completa[df_bd_completa['order_status'] == 4].copy()

# 2. AGREGACIÓN MENSUAL
monthly_sales = df_filtrado.groupby(df_filtrado['order_date'].dt.to_period('M')).agg({
    'total_sale': 'sum'
}).reset_index()
monthly_sales['order_date'] = monthly_sales['order_date'].dt.to_timestamp()

# 3. DETECCIÓN Y CORRECCIÓN DEL PICO (OUTLIERS)
# Usamos el Z-score para identificar si ese valor de 800k es estadísticamente extremo
z_scores = stats.zscore(monthly_sales['total_sale'])
abs_z_scores = np.abs(z_scores)
# Si el valor es mayor a 3 desviaciones estándar, lo reemplazamos por la mediana
outliers = (abs_z_scores > 3)

if outliers.any():
    print(f"Advertencia: Se detectaron {sum(outliers)} meses con ventas atípicas. Ajustando a la mediana...")
    median_val = monthly_sales['total_sale'].median()
    monthly_sales.loc[outliers, 'total_sale'] = median_val

# --- EL RESTO DEL MODELO ---

print("=" * 80)
print("PREDICCIÓN DE VENTAS MENSUALES (DATOS LIMPIOS)")
print("=" * 80)

# Split temporal
train_size = int(len(monthly_sales) * 0.8)
train = monthly_sales['total_sale'][:train_size]
test = monthly_sales['total_sale'][train_size:]

# Modelo ARIMA (Ajustado a 1,1,1 o puedes probar 1,0,1 si los datos son más estables)
model = ARIMA(train, order=(1,1,1))
fitted_model = model.fit()

# Predicciones
predictions = fitted_model.forecast(steps=len(test))

# Métricas
mae = mean_absolute_error(test, predictions)
r2 = r2_score(test, predictions)

print(f"MAE:  ${mae:,.2f}")
print(f"R²:   {r2:.4f}")

# Visualización mejorada
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['order_date'][:train_size], train, label='Entrenamiento (Limpio)', marker='o')
plt.plot(monthly_sales['order_date'][train_size:], test, label='Real (Test)', marker='o')
plt.plot(monthly_sales['order_date'][train_size:], predictions, label='Predicción', color='red', linestyle='--')
plt.title('Predicción de Ventas Mensuales después de Limpieza')
plt.legend()
plt.show()
================================================================================
PREDICCIÓN DE VENTAS MENSUALES (DATOS LIMPIOS)
================================================================================
MAE:  $53,971.60
R²:   -0.0629
c:\Users\57317\anaconda3\envs\notebook_ejecutor_DtaScie\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
  warn('Non-invertible starting MA parameters found.'
[Figura: Predicción de Ventas Mensuales después de Limpieza]
In [182]:
import statsmodels.api as sm

# 1. PREPARACIÓN DE DATOS (Usando el DataFrame limpio anterior)
# Aseguramos que el índice sea temporal para que el modelo entienda la frecuencia
monthly_sales = monthly_sales.set_index('order_date')
monthly_sales.index.freq = 'MS'  # MS = Month Start

# Re-cortamos train/test sobre la serie ya indexada por fecha:
# sin esto, train/test conservan el RangeIndex del cell anterior y tanto
# el forecast como los gráficos quedarían con índices enteros
train = monthly_sales['total_sale'].iloc[:train_size]
test = monthly_sales['total_sale'].iloc[train_size:]

# 2. DEFINICIÓN DEL MODELO SARIMA
# order: (p, d, q) -> Componente no estacional
# seasonal_order: (P, D, Q, s) -> Componente estacional (s=12 para meses del año)
sarima_model = sm.tsa.statespace.SARIMAX(
    train, 
    order=(1, 1, 1), 
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False
)

sarima_results = sarima_model.fit(disp=False)

# 3. PREDICCIONES
# Predecimos sobre el rango del set de test
pred_sarima = sarima_results.get_forecast(steps=len(test))
pred_values = pred_sarima.predicted_mean
conf_int = pred_sarima.conf_int() # Intervalos de confianza

# 4. MÉTRICAS
mae_s = mean_absolute_error(test, pred_values)
r2_s = r2_score(test, pred_values)

print(f"\nRESULTADOS - SARIMA")
print("=" * 80)
print(f"MAE:  ${mae_s:,.2f}")
print(f"R²:   {r2_s:.4f}")
print("=" * 80)

# 5. VISUALIZACIÓN
plt.figure(figsize=(12, 6))
plt.plot(train.index, train, label='Entrenamiento', marker='o')
plt.plot(test.index, test, label='Real (Test)', marker='o')
plt.plot(test.index, pred_values, label='Predicción SARIMA', color='red', linestyle='--', marker='x')

# Opcional: Mostrar sombreado de incertidumbre
plt.fill_between(test.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], color='pink', alpha=0.3)

plt.title('Predicción de Ventas Mensuales con Modelo SARIMA')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
c:\Users\57317\anaconda3\envs\notebook_ejecutor_DtaScie\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
  warn('Too few observations to estimate starting parameters%s.'
RESULTADOS - SARIMA
================================================================================
MAE:  $171,169.86
R²:   -7.3712
================================================================================
[Figura: Predicción de Ventas Mensuales con Modelo SARIMA]

SARIMA: No se observan buenos resultados. Con datos mensuales limitados, el modelo de estacionalidad no encuentra señales útiles. El MAE alto confirma que para este negocio B2C, la estacionalidad no es el driver principal.

In [183]:
# Agregación mensual
monthly_sales = df_bd_completa.groupby(df_bd_completa['order_date'].dt.to_period('M')).agg({
    'total_sale': 'sum',
    'quantity': 'sum',
    'order_id': 'nunique'
}).reset_index()

monthly_sales['order_date'] = monthly_sales['order_date'].dt.to_timestamp()

print("=" * 80)
print("PREDICCIÓN DE VENTAS MENSUALES - SERIES TEMPORALES")
print("=" * 80)

# Visualizar tendencia
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['order_date'], monthly_sales['total_sale'], marker='o')
plt.title('Ventas Totales Mensuales')
plt.xlabel('Fecha')
plt.ylabel('Ventas ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Split temporal (último 20% para test)
train_size = int(len(monthly_sales) * 0.8)
train = monthly_sales['total_sale'][:train_size]
test = monthly_sales['total_sale'][train_size:]

# Modelo ARIMA
model = ARIMA(train, order=(1,1,1))
fitted_model = model.fit()

# Predicciones
predictions = fitted_model.forecast(steps=len(test))

# Métricas (los imports ya están cargados al inicio del notebook)

mae = mean_absolute_error(test, predictions)
rmse = np.sqrt(mean_squared_error(test, predictions))
r2 = r2_score(test, predictions)

print(f"\nRESULTADOS - ARIMA")
print("=" * 80)
print(f"MAE:  ${mae:,.2f}")
print(f"RMSE: ${rmse:,.2f}")
print(f"R²:   {r2:.4f}")
print("=" * 80)

# Visualizar predicciones
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['order_date'][:train_size], train, label='Entrenamiento', marker='o')
plt.plot(monthly_sales['order_date'][train_size:], test, label='Real', marker='o')
plt.plot(monthly_sales['order_date'][train_size:], predictions, label='Predicción', marker='x', linestyle='--')
plt.title('Predicción de Ventas Mensuales')
plt.xlabel('Fecha')
plt.ylabel('Ventas ($)')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\nINSIGHTS:")
print(f"- Venta mensual promedio: ${monthly_sales['total_sale'].mean():,.2f}")
print(f"- Tendencia: {'Creciente' if monthly_sales['total_sale'].iloc[-6:].mean() > monthly_sales['total_sale'].iloc[:6].mean() else 'Estable'}")
================================================================================
PREDICCIÓN DE VENTAS MENSUALES - SERIES TEMPORALES
================================================================================
[Figura: Ventas Totales Mensuales]
RESULTADOS - ARIMA
================================================================================
MAE:  $827,340.07
RMSE: $827,392.47
R²:   -47799.7705
================================================================================
[Figura: Predicción de Ventas Mensuales]
INSIGHTS:
- Venta mensual promedio: $219,689.04
- Tendencia: Estable

ARIMA sobre los datos sin limpiar: resultados mucho peores que la versión con outliers corregidos (el pico atípico vuelve a entrar en la serie y dispara el MAE). Conclusión: para forecasting de ventas en este contexto, mejor apostar por segmentación (RFM) o modelos que usen features de negocio más que solo la serie temporal.

In [184]:
# Cargar solo la tabla orders con customer_id
query_orders = """
SELECT 
    order_id,
    customer_id,
    order_date,
    order_status
FROM orders
"""

orders_df = pd.read_sql(query_orders, engine)

print(f"Orders cargados: {orders_df.shape}")
print(orders_df.head())
Orders cargados: (1615, 4)
   order_id  customer_id  order_date  order_status
0         1          259  2016-01-01           4.0
1         2         1212  2016-01-01           4.0
2         3          523  2016-01-02           4.0
3         4          175  2016-01-03           4.0
4         5         1324  2016-01-03           4.0
In [185]:
# Hacer merge con tu df principal
# Asumiendo que df_bd_completa ya tiene 'order_id'
df_with_customer = df_bd_completa.merge(
    orders_df[['order_id', 'customer_id']], 
    on='order_id', 
    how='left'
)
print(f"\nMerge completado: {df_with_customer.shape}")
print(f"customer_id agregado: {'customer_id' in df_with_customer.columns}")
print(f"Clientes únicos: {df_with_customer['customer_id'].nunique()}")
# Verificar que no haya nulos
print(f"\nAdvertencia: Valores nulos en customer_id: {df_with_customer['customer_id'].isnull().sum()}")
# Actualizar df
df_bd_completa = df_with_customer
Merge completado: (4722, 28)
customer_id agregado: True
Clientes únicos: 1445

Advertencia: Valores nulos en customer_id: 0
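Un merge que "funciona" puede igualmente duplicar filas o dejar nulos silenciosos. Los parámetros validate e indicator de pandas.merge hacen estas verificaciones explícitas. Mini-sketch con DataFrames de juguete (nombres ilustrativos, no el esquema real):

```python
import pandas as pd

# DataFrames de juguete: items de venta y tabla de órdenes
ventas = pd.DataFrame({"order_id": [1, 1, 2, 3], "total_sale": [100, 50, 200, 80]})
ordenes = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})

df = ventas.merge(
    ordenes,
    on="order_id",
    how="left",
    validate="many_to_one",  # falla ruidosamente si ordenes tuviera order_id duplicado
    indicator=True,          # agrega la columna _merge: 'both' / 'left_only'
)

# Verificación explícita: ninguna fila quedó sin cliente
assert (df["_merge"] == "both").all()
print(df.drop(columns="_merge"))
```

Con validate activo, un duplicado inesperado en la tabla de órdenes lanzaría MergeError en lugar de inflar silenciosamente el dataset.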
In [186]:
# Crear perfil de clientes
customer_profile = df_bd_completa.groupby('customer_id').agg({
    'total_sale': ['sum', 'mean', 'count'],
    'quantity': 'sum',
    'discount': 'mean',
    'order_date': lambda x: (x.max() - x.min()).days  # Días entre primera y última compra
}).reset_index()

customer_profile.columns = ['customer_id', 'total_spent', 'avg_order', 
                            'num_orders', 'total_items', 'avg_discount', 'customer_lifetime_days']

print("=" * 80)
print("SEGMENTACIÓN DE CLIENTES - K-MEANS CLUSTERING")
print("=" * 80)

# Preparar features
features = ['total_spent', 'num_orders', 'avg_order', 'avg_discount']
X = customer_profile[features]

# Escalar
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
customer_profile['segment'] = kmeans.fit_predict(X_scaled)

# Analizar segmentos
print("\nCARACTERÍSTICAS DE LOS SEGMENTOS:")
print("=" * 80)

segment_analysis = customer_profile.groupby('segment').agg({
    'customer_id': 'count',
    'total_spent': 'mean',
    'num_orders': 'mean',
    'avg_order': 'mean',
    'avg_discount': 'mean'
}).round(2)

segment_analysis.columns = ['Cantidad_Clientes', 'Gasto_Promedio', 
                            'Num_Ordenes_Prom', 'Ticket_Promedio', 'Descuento_Prom']

print(segment_analysis)

# Nombrar segmentos
# Nota: si un mismo clúster encabeza varias métricas (aquí el 2 lidera
# gasto y frecuencia), las claves del dict colisionan y solo sobrevive
# la última etiqueta; por eso abajo se imprimen menos nombres que clústeres
segment_names = {
    segment_analysis['Gasto_Promedio'].idxmax(): 'VIP (Alto Valor)',
    segment_analysis['Num_Ordenes_Prom'].idxmax(): 'Frecuentes',
    segment_analysis['Gasto_Promedio'].idxmin(): 'Ocasionales',
}

print("\nINTERPRETACIÓN DE SEGMENTOS:")
for seg, name in segment_names.items():
    count = segment_analysis.loc[seg, 'Cantidad_Clientes']
    value = segment_analysis.loc[seg, 'Gasto_Promedio']
    print(f"  Segmento {seg} - {name}: {count} clientes, ${value:,.2f} promedio")

# Visualización
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(customer_profile['total_spent'], customer_profile['num_orders'], 
            c=customer_profile['segment'], cmap='viridis', alpha=0.6)
plt.xlabel('Gasto Total ($)')
plt.ylabel('Número de Órdenes')
plt.title('Segmentación de Clientes')
plt.colorbar(label='Segmento')

plt.subplot(1, 2, 2)
segment_analysis['Cantidad_Clientes'].plot(kind='bar')
plt.xlabel('Segmento')
plt.ylabel('Cantidad de Clientes')
plt.title('Distribución de Segmentos')
plt.tight_layout()
plt.show()
================================================================================
SEGMENTACIÓN DE CLIENTES - K-MEANS CLUSTERING
================================================================================

CARACTERÍSTICAS DE LOS SEGMENTOS:
================================================================================
         Cantidad_Clientes  Gasto_Promedio  Num_Ordenes_Prom  Ticket_Promedio  \
segment                                                                         
0                      425         3235.81              3.08          1037.09   
1                      602         2672.48              2.63          1000.37   
2                       85        16751.80              8.11          2088.37   
3                      333         9853.34              3.43          3152.43   

         Descuento_Prom  
segment                  
0                  0.15  
1                  0.08  
2                  0.10  
3                  0.10  

INTERPRETACIÓN DE SEGMENTOS:
  Segmento 2 - Frecuentes: 85 clientes, $16,751.80 promedio
  Segmento 1 - Ocasionales: 602 clientes, $2,672.48 promedio
[Figura: Segmentación de Clientes y Distribución de Segmentos]
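El k=4 de arriba se fijó a mano. Un sanity-check rápido es comparar el coeficiente de silueta para varios k candidatos; sketch con blobs sintéticos bien separados (en el proyecto real se pasaría X_scaled en lugar de estos datos de juguete):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Blobs sintéticos con 4 grupos claros, solo para ilustrar el método
X, _ = make_blobs(
    n_samples=500,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.6,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silueta={scores[k]:.3f}")

mejor_k = max(scores, key=scores.get)
print(f"Mejor k según silueta: {mejor_k}")
```

En datos reales de clientes las siluetas suelen ser mucho más bajas; lo importante es comparar valores relativos, no buscar un máximo absoluto.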

RFM: estabilizar scores (importante): Usamos rank-based binning para evitar el colapso de cuantiles por la distribución sesgada de frecuencia. Sin esto, la mayoría de clientes quedarían en el mismo bucket y la segmentación sería inútil.
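Qué significa el colapso de cuantiles en la práctica: con una frecuencia donde casi todos los clientes tienen 1 orden, pd.qcut no puede formar 5 bordes únicos. Mini-demo autocontenida con datos inventados:

```python
import pandas as pd

# Frecuencia sesgada típica: casi todos los clientes tienen 1 orden
frecuencia = pd.Series([1] * 90 + [2] * 5 + [3, 4, 5, 8, 20])

# qcut directo colapsa: varios bordes de cuantil caen en el valor 1
try:
    pd.qcut(frecuencia, 5)
except ValueError as e:
    print(f"qcut falla: {e}")

# rank(method='first') desempata los valores repetidos y garantiza 5 bins iguales
f_score = pd.qcut(frecuencia.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(f_score.value_counts().sort_index())
```

El costo del truco es que clientes con el mismo valor pueden caer en scores distintos (el desempate es por orden de aparición); a cambio, los 5 buckets quedan balanceados y comparables.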

In [187]:
# RFM Analysis (Recency, Frequency, Monetary)
reference_date = df_bd_completa['order_date'].max() + pd.Timedelta(days=1)

# Tabla RFM por cliente: recencia, frecuencia y gasto total
rfm = df_bd_completa.groupby('customer_id').agg(
    recency=('order_date', lambda x: (reference_date - x.max()).days),
    frequency=('order_id', 'nunique'),
    monetary=('total_sale', 'sum')
).reset_index()

# Force 5 stable bins using ranking
rfm['R_score'] = pd.qcut(
    rfm['recency'].rank(method='first'),
    5,
    labels=[5,4,3,2,1]
).astype(int)

rfm['F_score'] = pd.qcut(
    rfm['frequency'].rank(method='first'),
    5,
    labels=[1,2,3,4,5]
).astype(int)

rfm['M_score'] = pd.qcut(
    rfm['monetary'].rank(method='first'),
    5,
    labels=[1,2,3,4,5]
).astype(int)

rfm['RFM_score'] = (
    rfm['R_score'].astype(str) +
    rfm['F_score'].astype(str) +
    rfm['M_score'].astype(str)
)

print("=" * 80)
print("ANÁLISIS RFM - IDENTIFICACIÓN DE CLIENTES EN RIESGO")
print("=" * 80)

# Clientes en riesgo (alta recencia, baja frecuencia)
# Nota: con frequency casi constante (mediana = 1), la condición
# 'frequency < mediana' puede quedar vacía, como ocurre aquí
at_risk = rfm[(rfm['recency'] > rfm['recency'].median()) & 
              (rfm['frequency'] < rfm['frequency'].median())]

print(f"\nAdvertencia: Clientes en riesgo de churn: {len(at_risk)} ({len(at_risk)/len(rfm)*100:.1f}%)")
print(f"Valor en riesgo: ${at_risk['monetary'].sum():,.2f}")

# Top clientes
champions = rfm[(rfm['R_score'].astype(int) >= 4) & 
                (rfm['F_score'].astype(int) >= 4) & 
                (rfm['M_score'].astype(int) >= 4)]

print(f"\nClientes Champions: {len(champions)} ({len(champions)/len(rfm)*100:.1f}%)")
print(f"Valor generado: ${champions['monetary'].sum():,.2f}")
================================================================================
ANÁLISIS RFM - IDENTIFICACIÓN DE CLIENTES EN RIESGO
================================================================================

Advertencia: Clientes en riesgo de churn: 0 (0.0%)
Valor en riesgo: $0.00

Clientes Champions: 179 (12.4%)
Valor generado: $2,251,597.17

Los 8 segmentos RFM definidos en la función siguiente: Champions, Loyal Customers, Potential Loyalists, Recent Customers, At Risk, Hibernating, Lost y Others. Cada uno tiene una estrategia de negocio asociada: retención, reactivación, cross-sell, etc. La tabla y los gráficos muestran cuántos clientes hay en cada segmento y cuánto contribuyen en ventas.

In [188]:
def rfm_segment(row):
    if row['R_score'] >= 4 and row['F_score'] >= 4 and row['M_score'] >= 4:
        return 'Champions'
    elif row['R_score'] >= 3 and row['F_score'] >= 4:
        return 'Loyal Customers'
    elif row['R_score'] >= 4 and row['F_score'] <= 2:
        return 'Potential Loyalists'
    elif row['R_score'] >= 3 and row['F_score'] <= 2:
        return 'Recent Customers'
    elif row['R_score'] <= 2 and row['F_score'] >= 3:
        return 'At Risk'
    elif row['R_score'] <= 2 and row['F_score'] <= 2 and row['M_score'] >= 3:
        return 'Hibernating'
    elif row['R_score'] == 1 and row['F_score'] == 1:
        return 'Lost'
    else:
        return 'Others'
rfm['Segment'] = rfm.apply(rfm_segment, axis=1)

Segment summary (business-ready table)

In [189]:
segment_summary = (
    rfm.groupby('Segment')
       .agg(
           customers=('customer_id', 'count'),
           avg_recency=('recency', 'mean'),
           avg_frequency=('frequency', 'mean'),
           avg_monetary=('monetary', 'mean'),
           total_revenue=('monetary', 'sum')
       )
       .sort_values('customers', ascending=False)
)

segment_summary
Out[189]:
                     customers  avg_recency  avg_frequency  avg_monetary  total_revenue
Segment
At Risk                    326   894.883436       1.003067   3843.062492   1.252838e+06
Others                     259   642.872587       1.000000   3498.113597   9.060114e+05
Loyal Customers            200   489.310000       1.110000   3462.218955   6.924438e+05
Potential Loyalists        186   413.489247       1.000000   5493.817505   1.021850e+06
Champions                  179   306.536313       1.821229  12578.755165   2.251597e+06
Recent Customers           140   631.192857       1.000000   5339.528969   7.475341e+05
Hibernating                129   911.410853       1.000000   6032.945445   7.782500e+05
Lost                        26  1005.000000       1.000000   1484.297077   3.859172e+04
In [190]:
rfm[['R_score','F_score','M_score']].hist(bins=5, figsize=(10,4))
plt.suptitle('Distribution of RFM Scores')
plt.show()
[Figura: Distribution of RFM Scores]
In [191]:
rfm['Segment'].value_counts().plot(kind='bar', figsize=(10,4))
plt.title('Customer Distribution by RFM Segment')
plt.ylabel('Number of Customers')
plt.show()
[Figura: Customer Distribution by RFM Segment]
In [192]:
(
    rfm.groupby('Segment')['monetary']
       .sum()
       .sort_values(ascending=False)
       .plot(kind='bar', figsize=(10,4))
)

plt.title('Revenue Contribution by RFM Segment')
plt.ylabel('Total Revenue')
plt.show()
[Figura: Revenue Contribution by RFM Segment]
In [193]:
plt.figure(figsize=(8,5))
plt.scatter(rfm['R_score'], rfm['F_score'], alpha=0.4)
plt.xlabel('Recency Score')
plt.ylabel('Frequency Score')
plt.title('Customer Distribution: Recency vs Frequency')
plt.show()
[Figura: Customer Distribution: Recency vs Frequency]

“We used a rank-based RFM segmentation to avoid quantile collapse caused by skewed frequency distributions. This allowed us to create stable, interpretable customer segments and validate them using revenue contribution and score distributions.”

Conclusiones del Ingeniero de Datos

1. Conexión SQL + Python: SQLAlchemy permite extraer datos desde PostgreSQL de forma limpia y reproducible. Los JOINs se ejecutan en el motor SQL. El flujo ETL (extraer → transformar en pandas → modelar) es sólido para pipelines de datos.
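El punto 1 se puede ilustrar con un sketch mínimo. Aquí se usa SQLite en memoria solo para que el ejemplo corra en cualquier parte; en el proyecto real el engine apunta a PostgreSQL, y las tablas y columnas de juguete son ilustrativas:

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLite en memoria, únicamente para que el sketch sea reproducible
engine = create_engine("sqlite://")
pd.DataFrame({"order_id": [1, 2], "store_id": [1, 1]}).to_sql("orders", engine, index=False)
pd.DataFrame({"store_id": [1], "store_name": ["Central"]}).to_sql("stores", engine, index=False)

# El JOIN se ejecuta dentro del motor SQL, no en pandas
df = pd.read_sql(
    "SELECT o.order_id, s.store_name "
    "FROM orders o JOIN stores s ON o.store_id = s.store_id",
    engine,
)
print(df)
```

Delegar el JOIN al motor reduce el tráfico hacia Python y deja la transformación pesada donde hay índices y planificador de consultas.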

2. Data Leakage: Incluir quantity, list_price y discount para predecir total_sale genera métricas engañosas (R² 99.9%). En un escenario real no estarían disponibles antes de la transacción. Validar la disponibilidad temporal de cada feature.
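Demostración mínima del punto 2, con datos sintéticos (no los del proyecto): si el target se construye con quantity, list_price y discount, un modelo que recibe esas columnas parece casi perfecto; la comparación honesta usa solo lo disponible antes de la venta.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Datos sintéticos: el target se construye con price, qty y discount
rng = np.random.default_rng(42)
n = 1000
price = rng.uniform(300, 3000, n)
qty = rng.integers(1, 3, n).astype(float)
disc = rng.choice([0.05, 0.1, 0.2], n)
total = qty * price * (1 - disc)

X_leak = np.column_stack([price, qty, disc])  # leakage: estas columnas definen el target
X_ok = price.reshape(-1, 1)                   # solo lo que se conoce antes de la venta

resultados = {}
for nombre, X in [("con leakage", X_leak), ("sin leakage", X_ok)]:
    Xtr, Xte, ytr, yte = train_test_split(X, total, random_state=0)
    modelo = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr)
    resultados[nombre] = r2_score(yte, modelo.predict(Xte))
    print(f"R² {nombre}: {resultados[nombre]:.3f}")
```

El R² "con leakage" es ilusorio: el modelo solo reconstruye la aritmética del target. El R² "sin leakage" es el que refleja capacidad predictiva real.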

3. Series temporales: ARIMA/SARIMA no rindieron bien con pocos puntos mensuales. Para negocios B2C, la segmentación RFM aporta más valor operativo que el forecasting puro.

4. RFM como producto: La segmentación RFM es interpretable y accionable. Los resultados se escriben a tablas PostgreSQL listas para dashboards y campañas. Esto sí escala a producción.

5. Próximos pasos: Remover features con leakage; integrar RFM en pipeline programado; exponer segmentos vía API o vistas materializadas.

Escribir los resultados en SQL

Persistencia en base de datos: Los segmentos RFM se persisten en PostgreSQL. El negocio puede consumirlos desde BI, CRM o campañas. El ciclo se cierra: SQL → Python → modelos → SQL.

In [195]:
segment_summary = segment_summary.reset_index()

segment_summary['model_version'] = 'rfm_kmeans_v1'
In [196]:
segment_summary.head()
Out[196]:
               Segment  customers  avg_recency  avg_frequency  avg_monetary  total_revenue  model_version
0              At Risk        326   894.883436       1.003067   3843.062492   1.252838e+06  rfm_kmeans_v1
1               Others        259   642.872587       1.000000   3498.113597   9.060114e+05  rfm_kmeans_v1
2      Loyal Customers        200   489.310000       1.110000   3462.218955   6.924438e+05  rfm_kmeans_v1
3  Potential Loyalists        186   413.489247       1.000000   5493.817505   1.021850e+06  rfm_kmeans_v1
4            Champions        179   306.536313       1.821229  12578.755165   2.251597e+06  rfm_kmeans_v1
In [201]:
engine = create_engine(
    "postgresql://postgres:3061@localhost:5432/proyecto_portafolio"
)

segment_summary.to_sql(
    "customer_rfm_segments",
    engine,
    if_exists="append",   # permite múltiples ejecuciones
    index=False
)
customer_profile[['customer_id', 'segment']].to_sql(
    "customer_segments_detail",
    engine,
    if_exists="replace",
    index=False
)
Out[201]:
445