Categorical Features with no 0 label lead to partial_dependence ValueError #285

@tatkeller

Hi there,

I fitted a model whose categorical features take only values greater than 0 (ranging from n to m, where 0 < n <= m). When I tried to plot the partial dependence for the model, I ran into a ValueError (reproduced below). The problem is that generate_X_grid creates a matrix like the following, with zeros in every column except the one for the requested term (shown as i):

[[0,0,0, ..., 0, i, 0, ..., 0,0,0],
[0,0,0, ..., 0, i, 0, ..., 0,0,0],
...,
[0,0,0, ..., 0, i, 0, ..., 0,0,0]]

For a model trained on categorical features that do not include 0 as a category, passing this grid to partial_dependence raises an error, because the zero-filled columns fall outside the range of categories seen during training.
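To illustrate why a zero-filled column trips the validation, here is a minimal sketch of a range check on a single categorical column. The function `check_categorical_domain` is a hypothetical stand-in written for this issue, not pyGAM's actual `check_X`; only the message format and the [2003.0, 2009.0] range are taken from the traceback below.

```python
def check_categorical_domain(column, feature_index, lo, hi):
    """Hypothetical stand-in for a domain check on one categorical column.

    Raises ValueError if any value lies outside [lo, hi], the range of
    categories seen during training.
    """
    found_lo, found_hi = min(column), max(column)
    if found_lo < lo or found_hi > hi:
        raise ValueError(
            "X data is out of domain for categorical feature {}. "
            "Expected data on [{}, {}], but found data on [{}, {}]".format(
                feature_index, lo, hi, found_lo, found_hi
            )
        )
    return column


# Training categories for feature 0 span 2003..2009 (from the error message).
train_lo, train_hi = 2003.0, 2009.0

# A grid built for a different term leaves feature 0 filled with zeros.
zero_filled_column = [0.0] * 5

try:
    check_categorical_domain(zero_filled_column, 0, train_lo, train_hi)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

Any column whose values stay inside the training range passes the same check unchanged, which is why the error only shows up when 0 is not one of the trained categories.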

Here is a reproduction of the error, based on the Quick Start example code:

Input:

from pygam.datasets import wage

X, y = wage()

from pygam import LinearGAM, s, f

gam = LinearGAM(f(0) + s(1) + f(2)).fit(X, y)  # f(0) makes feature 0 categorical; feature 0 contains no value equal to 0

import matplotlib.pyplot as plt

for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue

    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

    #plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(repr(term))
    plt.show()

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-0e5df89ff530> in <module>()
      7     XX = gam.generate_X_grid(term=i)
      8     print(XX)
----> 9     pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
     10 
     11     #plt.figure()

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/pygam.py in partial_dependence(self, term, X, width, quantiles, meshgrid)
   1542                         features=self.feature, verbose=self.verbose)
   1543 
-> 1544         modelmat = self._modelmat(X, term=term)
   1545         pdep = self._linear_predictor(modelmat=modelmat, term=term)
   1546         out = [pdep]

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/pygam.py in _modelmat(self, X, term)
    455         X = check_X(X, n_feats=self.statistics_['m_features'],
    456                     edge_knots=self.edge_knots_, dtypes=self.dtype,
--> 457                     features=self.feature, verbose=self.verbose)
    458 
    459         return self.terms.build_columns(X, term=term)

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/utils.py in check_X(X, n_feats, min_samples, edge_knots, dtypes, features, verbose)
    301                                      'feature {}. Expected data on [{}, {}], '\
    302                                      'but found data on [{}, {}]'\
--> 303                                      .format(i, min_, max_, x.min(), x.max()))
    304 
    305     return X

ValueError: X data is out of domain for categorical feature 0. Expected data on [2003.0, 2009.0], but found data on [0.0, 0.0]

The versions that I used are:
pyGAM=0.8.0
Python=3.6.12

For now I will work around this by subtracting each categorical feature's minimum value from all of its values, which shifts the category range from (n, m) to (n - n, m - n) = (0, m - n).
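The workaround above could be sketched roughly as follows. The helper name `shift_categories_to_zero` and the tiny example array are made up for illustration; the only assumption about the data is that it is a 2-D numeric array whose categorical columns are known by index.

```python
import numpy as np


def shift_categories_to_zero(X, categorical_columns):
    """Return a copy of X with each listed column shifted so its minimum is 0.

    Also returns the per-column offsets so the original labels can be
    restored later (for example when labelling plot axes).
    """
    X_shifted = np.array(X, dtype=float, copy=True)
    offsets = {}
    for col in categorical_columns:
        offsets[col] = X_shifted[:, col].min()
        X_shifted[:, col] -= offsets[col]
    return X_shifted, offsets


# Toy data: column 0 is a categorical year, column 1 is continuous.
X_example = np.array([[2003, 18.0], [2009, 45.0], [2005, 30.0]])

X_shifted, offsets = shift_categories_to_zero(X_example, categorical_columns=[0])
print(X_shifted[:, 0])  # shifted categories
print(offsets)          # amount subtracted from each categorical column
```

The model would then be fitted on the shifted array instead of the original one. Adding the stored offset back to the grid values before plotting keeps the x-axis in the original category labels.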

Thanks in advance
