Be careful with Inf values when using pandas to calculate the correlation between variables
A weird thing happened to me while I was exploring pairwise correlations among variables stored in pandas.DataFrame objects. My goal was to get the pairwise Pearson coefficients between the variables in pandas.DataFrame A and the variables in pandas.DataFrame B. There are multiple ways to perform such an analysis. I originally used A.apply(lambda v: B.corrwith(v)), and a few unexpected NAs showed up in the output. However, those NAs disappeared when I implemented the calculation via A.merge(B).corr(). So, why is there a discrepancy?
Let's look into the difference between the corr and corrwith methods.
import pandas as pd
import numpy as np
# np.inf is the portable spelling; the np.Inf alias was removed in NumPy 2.0
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, np.inf]})
The corr method tolerates Inf values
df.corr()
| | A | B |
|---|---|---|
| A | 1.0 | -1.0 |
| B | -1.0 | 1.0 |
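The -1.0 above is what you would get if the row containing Inf were simply left out of the calculation. The sketch below checks that reading by computing the coefficient on the finite rows only. The names `finite_rows`, `r_all` and `r_finite` are introduced here for illustration, and the idea that corr skips non-finite entries is an inference from the output shown above, not a documented guarantee.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 2, np.inf]})

# Keep only the rows in which every value is finite (drops the Inf row).
finite_mask = np.isfinite(df.to_numpy(dtype=float)).all(axis=1)
finite_rows = df[finite_mask]

# Correlation on the full frame versus on the finite rows only.
r_all = df.corr().loc["A", "B"]
r_finite = finite_rows.corr().loc["A", "B"]

print("corr on all rows:   ", r_all)
print("corr on finite rows:", r_finite)
```

Note that only two finite observations remain here, and a Pearson coefficient computed from two points is always exactly 1 or -1, so the value says little about the underlying relationship.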
corrwith cannot handle Inf values
df.A.to_frame().corrwith(df.B)
A NaN
dtype: float64
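When unexpected NaNs like this appear, it can help to check which columns contain infinite values before computing correlations. A minimal sketch, assuming all relevant columns are numeric (`inf_counts` is a name introduced here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 2, np.inf]})

# Count +Inf/-Inf entries in each numeric column.
numeric = df.select_dtypes(include="number")
inf_counts = numeric.apply(lambda col: int(np.isinf(col.to_numpy(dtype=float)).sum()))

print(inf_counts)
```

Any column with a nonzero count is a candidate for the NaN results from corrwith.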
A workaround when Inf values can be ignored
df.A.to_frame().corrwith(df.B.replace(np.inf, np.nan))
A -1.0
dtype: float64
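For the original two-frame use case, the same idea can be wrapped in a small helper. This is only a sketch: `cross_corr` and the example frames `A` and `B` below are hypothetical, not taken from the original analysis, and it assumes that both positive and negative infinite values can safely be treated as missing.

```python
import numpy as np
import pandas as pd


def cross_corr(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every column in `left` with every column in `right`.

    Infinite values are replaced with NaN first so that corrwith skips them.
    """
    left_clean = left.replace([np.inf, -np.inf], np.nan)
    right_clean = right.replace([np.inf, -np.inf], np.nan)
    # Each column of `left` is correlated with all columns of `right`;
    # the result has `right` columns as rows and `left` columns as columns.
    return left_clean.apply(lambda col: right_clean.corrwith(col))


# Hypothetical example data; b1 contains one Inf value.
A = pd.DataFrame({"a1": [1.0, 2.0, 3.0, 4.0], "a2": [4.0, 3.0, 1.0, 2.0]})
B = pd.DataFrame({"b1": [3.0, 2.0, np.inf, 0.0], "b2": [1.0, 3.0, 2.0, 5.0]})

result = cross_corr(A, B)
print(result)
```

Replacing Inf with NaN silently drops those observations from each affected pair, so it is worth checking how many rows remain before trusting a coefficient.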