Be careful with Inf values when using Pandas to calculate the correlation between variables



A weird thing happened to me while I was exploring pairwise correlations among variables stored in pandas.DataFrame objects. My goal was to get the pairwise Pearson coefficients between the variables in pandas.DataFrame A and the variables in pandas.DataFrame B. There are multiple ways to perform such an analysis. I originally used A.apply(lambda v: B.corrwith(v)), and a few unexpected NAs showed up in the output. However, those NAs disappeared when I implemented the calculation via A.merge(B).corr(). So, why is there a discrepancy?
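
To make the two approaches concrete, here is a minimal sketch on made-up data. The frames and their column names (a1, a2, b1, b2) are hypothetical, and the sketch assumes both frames share the same row index, so it uses A.join(B) to combine them rather than the merge call mentioned above. On finite data the two routes should agree:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames that share the same row index (no Inf values yet)
A = pd.DataFrame({'a1': [1.0, 2.0, 3.0, 4.0], 'a2': [2.0, 1.0, 4.0, 3.0]})
B = pd.DataFrame({'b1': [1.0, 3.0, 2.0, 5.0], 'b2': [4.0, 3.0, 2.0, 0.0]})

# Approach 1: correlate every column of B with each column of A in turn.
# The result has B's columns as rows and A's columns as columns.
via_corrwith = A.apply(lambda v: B.corrwith(v))

# Approach 2: correlation matrix of the combined frame,
# then keep only the B-versus-A block so the layout matches approach 1.
via_corr = A.join(B).corr().loc[B.columns, A.columns]

print(via_corrwith)
print(via_corr)
```

The .loc selection is only there to cut the A-versus-B block out of the full matrix, which also contains the A-versus-A and B-versus-B correlations.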

Let's look into the difference between the corr and corrwith methods.

import pandas as pd
import numpy as np
# np.inf is used here because the np.Inf alias was removed in NumPy 2.0
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, np.inf]})

The corr method tolerates Inf values. The result below is consistent with the row containing Inf simply being skipped.

df.corr()
     A    B
A  1.0 -1.0
B -1.0  1.0

The corrwith method cannot handle Inf values and returns NaN instead.

df.A.to_frame().corrwith(df.B)
A   NaN
dtype: float64

A workaround when Inf values can be ignored: replace them with NaN before calling corrwith.

df.A.to_frame().corrwith(df.B.replace(np.inf, np.nan))
A   -1.0
dtype: float64
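
The replacement above only catches positive infinity. Below is a slightly more general sketch, assuming negative infinity should be ignored as well and that either input may contain such values. The helper name corrwith_ignoring_inf is made up for illustration and is not part of pandas:

```python
import numpy as np
import pandas as pd


def corrwith_ignoring_inf(frame, series):
    """Correlate each column of `frame` with `series`, treating +Inf and -Inf as missing.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Turn both signs of infinity into NaN so corrwith skips those rows
    clean_frame = frame.replace([np.inf, -np.inf], np.nan)
    clean_series = series.replace([np.inf, -np.inf], np.nan)
    return clean_frame.corrwith(clean_series)


df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, np.inf]})
result = corrwith_ignoring_inf(df[['A']], df['B'])
print(result)
```

Keep in mind that this drops the affected rows from the calculation, so it is only appropriate when the Inf values carry no information you care about.
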
Jingxin Fu, Ph.D.


Research Fellow interested in data mining on cancer genomics
