Be careful with Inf values when using pandas to calculate the correlation between variables



Something weird happened to me while I was exploring pairwise correlations among variables stored in a pandas.DataFrame. My goal was to get the pairwise Pearson coefficients between the variables in pandas.DataFrame A and the variables in pandas.DataFrame B. There are multiple ways to perform such an analysis. I originally used A.apply(lambda v: B.corrwith(v)), and a few unexpected NAs appeared in the output. However, those NAs disappeared when I implemented the calculation via A.merge(B).corr(). So why is there a discrepancy?
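To make the two approaches concrete, here is a minimal sketch on made-up data without any Inf values, where both routes should agree. The frames, column names and values are hypothetical, and the sketch assumes A and B share the same row index, so it combines them with join instead of the merge call mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two frames sharing the same row index.
A = pd.DataFrame({"a1": [1.0, 2.0, 3.0, 4.0], "a2": [4.0, 1.0, 3.0, 2.0]})
B = pd.DataFrame({"b1": [2.0, 1.0, 4.0, 3.0], "b2": [1.0, 3.0, 2.0, 5.0]})

# Approach 1: for each column of A, correlate it with every column of B.
# Rows of the result are B's columns, columns are A's columns.
via_corrwith = A.apply(lambda v: B.corrwith(v))

# Approach 2: put all columns side by side, compute the full correlation
# matrix, then keep only the B-versus-A block.
via_corr = A.join(B).corr().loc[B.columns, A.columns]

# With finite data the two results should match.
print(np.allclose(via_corrwith.to_numpy(), via_corr.to_numpy()))
```

On clean data like this the choice between the two is mostly a matter of convenience; the difference only shows up once Inf values are present.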

Look at the difference between the corr and corrwith methods.

import pandas as pd
import numpy as np
# np.inf is used here because the np.Inf alias was removed in NumPy 2.0
df = pd.DataFrame({'A':[1,2,3],'B':[3,2,np.inf]})

The corr method tolerates Inf values: the row containing Inf is skipped, so the coefficient is computed from the remaining two rows.

df.corr()
     A    B
A  1.0 -1.0
B -1.0  1.0

The corrwith method cannot handle Inf values and returns NaN instead.

df.A.to_frame().corrwith(df.B)
A   NaN
dtype: float64
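When unexpected NaNs like this show up, it can help to check which columns contain Inf before computing anything. A minimal sketch, reusing the same toy frame (the variable names has_inf and inf_columns are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 2, np.inf]})

# np.isinf is applied element-wise; .any() then summarises each column.
has_inf = np.isinf(df).any()

# Keep only the names of the columns that contain at least one Inf.
inf_columns = has_inf[has_inf].index.tolist()
print(inf_columns)
```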

A workaround when Inf values can be ignored: replace them with NaN before calling corrwith.

df.A.to_frame().corrwith(df.B.replace(np.inf,np.nan))
A   -1.0
dtype: float64
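The same workaround can be wrapped in a small helper so that both positive and negative Inf are handled on either side of the comparison. This is only a sketch: corrwith_ignoring_inf is a hypothetical name, not part of pandas, and it assumes that dropping Inf values is acceptable for the analysis.

```python
import numpy as np
import pandas as pd

def corrwith_ignoring_inf(frame, other):
    """Hypothetical helper: treat +/-Inf as missing, then call corrwith."""
    clean_frame = frame.replace([np.inf, -np.inf], np.nan)
    clean_other = other.replace([np.inf, -np.inf], np.nan)
    return clean_frame.corrwith(clean_other)

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 2, np.inf]})

# Correlate column A (kept as a one-column frame) with series B.
result = corrwith_ignoring_inf(df[["A"]], df["B"])
print(result)
```

For the original use case, the helper could be dropped into the apply call, for example A.apply(lambda v: corrwith_ignoring_inf(B, v)).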
Jingxin Fu, Ph.D.


Research Fellow interested in data mining on cancer genomics
