Be careful with Inf values when using Pandas to calculate the correlation between variables

Monday, June 01, 2020 - 1 min


A weird thing happened to me when I was exploring pairwise correlations among variables stored in a `pandas.DataFrame`. My goal was to get the pairwise Pearson coefficients between the variables in `pandas.DataFrame` A and the variables in `pandas.DataFrame` B. There are multiple ways to perform such an analysis. I originally used `A.apply(lambda v: B.corrwith(v))`, and a few unexpected NAs appeared in the output. However, those NAs disappeared when I implemented the calculation via `A.merge(B).corr()`. So, why is there a discrepancy?

Look into the difference between the `corr` and `corrwith` methods.

import pandas as pd
import numpy as np
# np.inf is used here because the np.Inf alias was removed in NumPy 2.0
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, np.inf]})

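The `corr` method tolerates Inf values. It treats them like missing values, so the coefficient below is computed from the first two rows only.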

df.corr()

     A    B
A  1.0 -1.0
B -1.0  1.0

`corrwith` cannot handle Inf values and returns NaN instead.

df.A.to_frame().corrwith(df.B)

A   NaN
dtype: float64
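Since the NaN above gives no hint about its cause, it can help to check which columns contain Inf before computing correlations. The following is a minimal sketch; `columns_with_inf` is a hypothetical helper name, not part of pandas, and it assumes the numeric columns can be cast to float.

```python
import numpy as np
import pandas as pd


def columns_with_inf(frame: pd.DataFrame) -> pd.Series:
    """Flag numeric columns that contain +Inf or -Inf (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    # Cast to a plain float array so np.isinf works for int and float columns.
    values = numeric.to_numpy(dtype=float)
    return pd.Series(np.isinf(values).any(axis=0), index=numeric.columns)


df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 2, np.inf]})
has_inf = columns_with_inf(df)
print(has_inf)
```

Any column flagged `True` here is a likely source of unexpected NaN values in the `corrwith` output.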

A workaround when Inf values can be ignored: replace them with NaN before calling `corrwith`.

df.A.to_frame().corrwith(df.B.replace(np.inf, np.nan))

A   -1.0
dtype: float64
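The same workaround can be applied to the original two-DataFrame problem. The sketch below assumes that both positive and negative Inf can safely be treated as missing. `cross_corr` and the column names `a1` and `b1` are made up for illustration.

```python
import numpy as np
import pandas as pd


def cross_corr(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of each column in A with each column in B.

    Hypothetical helper: +/-Inf is replaced with NaN first, so those rows
    are skipped instead of turning the whole coefficient into NaN.
    """
    clean_a = A.replace([np.inf, -np.inf], np.nan)
    clean_b = B.replace([np.inf, -np.inf], np.nan)
    # Rows of the result are B's columns, columns are A's columns.
    return clean_a.apply(lambda v: clean_b.corrwith(v))


A = pd.DataFrame({"a1": [1.0, 2.0, 3.0, 4.0]})
B = pd.DataFrame({"b1": [3.0, 2.0, np.inf, 0.0]})
result = cross_corr(A, B)
print(result)
```

Replacing Inf with NaN discards those observations. This is only appropriate when the Inf values carry no meaning for the analysis.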

Jingxin Fu, Ph.D.

Research Fellow interested in data mining on cancer genomics