Be careful with Inf values when using Pandas to calculate the correlation between variables
A weird thing happened to me when I was exploring pairwise correlations among variables stored in pandas.DataFrame objects. My goal was to get the pairwise Pearson coefficients between the variables in a pandas.DataFrame A and the variables in a pandas.DataFrame B. There are multiple ways to perform such an analysis. I originally used A.apply(lambda v: B.corrwith(v)), and a few unexpected NAs were present in the output. However, those NAs disappeared when I implemented the calculation via A.merge(B).corr(). So, why is there a discrepancy?
Let's look into the difference between the corr and corrwith methods.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, np.inf]})
The corr function tolerates Inf values (here the result matches what the two finite rows alone would give):
df.corr()
| | A | B |
|---|---|---|
| A | 1.0 | -1.0 |
| B | -1.0 | 1.0 |
corrwith cannot handle Inf values:
df.A.to_frame().corrwith(df.B)
A NaN
dtype: float64
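When NaNs like this show up unexpectedly, it can help to check which columns actually contain Inf values. The snippet below is a minimal sketch of one way to do that on the same toy data. The variable name inf_counts is only illustrative, and the code assumes that the columns of interest are numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, np.inf]})

# Count +Inf/-Inf entries in each numeric column.
inf_counts = np.isinf(df.select_dtypes(include='number')).sum()
print(inf_counts)
```

Any column with a non-zero count is a candidate for the NaN results from corrwith.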
A workaround when Inf values can be ignored:
df.A.to_frame().corrwith(df.B.replace(np.inf, np.nan))
A -1.0
dtype: float64
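The same idea can be applied to the original two-DataFrame setting. The following is a rough sketch rather than a definitive implementation: the helper names drop_inf and cross_corr and the toy frames A and B are made up for illustration, and it assumes that treating both +Inf and -Inf as missing is acceptable for your data.

```python
import numpy as np
import pandas as pd


def drop_inf(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with +Inf and -Inf replaced by NaN."""
    return frame.replace([np.inf, -np.inf], np.nan)


def cross_corr(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every column in `right` with every column in `left`."""
    left, right = drop_inf(left), drop_inf(right)
    # Each column of `left` is correlated with all columns of `right`.
    return left.apply(lambda col: right.corrwith(col))


# Hypothetical toy data: b1 contains one Inf value.
A = pd.DataFrame({'a1': [1.0, 2.0, 3.0, 4.0]})
B = pd.DataFrame({'b1': [4.0, 3.0, 2.0, np.inf],
                  'b2': [1.0, 2.0, 3.0, 4.0]})

result = cross_corr(A, B)
print(result)
```

The rows of the result are the columns of B and the columns are the columns of A. Rows where either value is missing are left out of each pairwise calculation, so the coefficient for b1 is based on the first three rows only.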