28 Jan 2019

How to write your favorite R functions in Python?


One of the great modern battles in data science and machine learning is “Python vs. R”. There is no doubt that both have gained enormous ground in recent years to become top choices among programming languages for data science, predictive analytics, and machine learning. In fact, according to a recent IEEE article, Python overtook C++ as the top programming language of 2018, and R has firmly secured its spot in the top 10.

However, there are some fundamental differences between the two. R was developed primarily as a tool for statistical analysis and quick prototyping of a data analysis problem. Python, on the other hand, was developed as a general-purpose, modern object-oriented language in the same vein as C++ or Java, but with a gentler learning curve and a more flexible demeanor. Consequently, R continues to be extremely popular among statisticians, quantitative biologists, physicists, and economists, whereas Python has slowly emerged as the top language of choice for day-to-day scripting, automation, backend web development, analytics, and general machine learning frameworks, with an extensive support base and an active open source development community.

Mimicking functional programming in a Python environment?
The functional programming nature of R provides users with an extremely simple and compact interface for quick calculation of probabilities and essential descriptive/inferential statistics for a data analysis problem. For example, wouldn’t it be great to be able to answer the following questions with just a single compact function call?

How to calculate the mean/median/mode of a data vector?
How to calculate the cumulative probability of some event following a normal distribution? What if the distribution is Poisson?
How to calculate the inter-quartile range of a series of data points?
How to generate a few random numbers following a Student’s t-distribution?
The R programming environment allows you to do just that.

On the other hand, Python’s scripting ability allows an analyst to use those statistics in a wide variety of analytics pipelines with limitless sophistication and creativity.

To combine the advantages of both worlds, one needs a simple Python-based wrapper library that contains the most commonly used functions pertaining to probability distributions and descriptive statistics, defined in R style, so that users can call those functions quickly without having to go to the proper Python statistical libraries and figure out the whole list of methods and arguments.

A Python wrapper script for the most convenient R-functions
I wrote a Python script that defines, in Python, the most convenient and widely used R-functions for simple statistical analysis. After importing this script you will be able to use those R-functions naturally, just as in an R programming environment.

The goal of this script is to provide simple Python sub-routines mimicking R-style statistical functions for quickly calculating density/point estimates, cumulative distributions, and quantiles, and for generating random variates, for various important probability distributions.

To maintain the spirit of R styling, no class hierarchy was used and just raw functions are defined in this file, so that a user can import this one Python script and use all the functions whenever they are needed with a single name call.
Note that I use the word mimic. Under no circumstances am I claiming to emulate the true functional programming paradigm of R, which consists of a deep environmental setup and complex inter-relationships between those environments and objects. This script just allows me (and I hope countless other Python users too) to quickly fire up a Python program or Jupyter notebook, import the script, and start doing simple descriptive statistics in no time. That’s the goal, nothing more, nothing less.

Or, you may have coded in R in grad school and are just starting out to learn and use Python for data analysis. You will be happy to see and use some of the same well-known functions in your Jupyter notebook in much the same manner as you used them in the R environment.

Whatever the reason may be, it is fun :-)

Simple Examples
To start, just import the script and start working with lists of numbers as if they were data vectors in R.

from R_functions import *
lst=[20,12,16,32,27,65,44,45,22,18]
<more code, more statistics...>
For example, suppose you want to calculate the Tukey five-number summary from a vector of data points. You just call one simple function, fivenum, and pass in the vector. It will return the five-number summary in a NumPy array.

lst=[20,12,16,32,27,65,44,45,22,18]
fivenum(lst)
> array([12. , 18.5, 24.5, 41. , 65. ])
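A minimal sketch of how such a function could be written, assuming the quartiles are taken by linear interpolation between data points (which is what reproduces the output above). The name fivenum_sketch is hypothetical, it returns a plain list rather than a NumPy array, and it is not necessarily how the script itself does it.

```python
import statistics

def fivenum_sketch(data):
    """Return [min, Q1, median, Q3, max] using linearly interpolated quartiles."""
    values = sorted(data)
    # method="inclusive" treats the sample as the whole population, so the
    # quartiles are interpolated between the sample minimum and maximum.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [float(values[0]), q1, q2, q3, float(values[-1])]

lst = [20, 12, 16, 32, 27, 65, 44, 45, 22, 18]
summary = fivenum_sketch(lst)
print(summary)
```

For this vector the sketch gives 12, 18.5, 24.5, 41 and 65. Note that R's own fivenum works with Tukey's hinges rather than interpolated quartiles, so the two middle values can differ slightly between the two approaches.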
Or, you want to know the answer to the following question.

Suppose a machine outputs 10 finished goods per hour on average with a standard deviation of 2. The output pattern follows a near normal distribution. What is the probability that the machine will output at least 7 but no more than 12 units in the next hour?
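The answer is essentially the area under the normal density curve (mean 10, standard deviation 2) between 7 and 12, that is, P(7 ≤ X ≤ 12) = F(12) − F(7), where F is the cumulative distribution function.
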
You can obtain the answer with just one line of code using pnorm…

pnorm(12,10,2)-pnorm(7,10,2)
> 0.7745375447996848
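To see where that number comes from, here is a small standard-library check of the same arithmetic. The helper normal_cdf is a hypothetical stand-in for pnorm, written with the error function, and is only a sketch of the calculation, not the script's code.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability P(X <= x) for a normal distribution."""
    z = (x - mean) / sd  # standardize to a z-score
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# z-scores: (12 - 10) / 2 = 1.0 and (7 - 10) / 2 = -1.5
prob = normal_cdf(12, 10, 2) - normal_cdf(7, 10, 2)
print(round(prob, 4))  # 0.7745
```

In other words, the probability is the cumulative value one standard deviation above the mean minus the cumulative value one and a half standard deviations below it.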
Or, the following,

Suppose you have a loaded coin that lands heads up 60% of the time. You are playing a game of 10 tosses. How do you plot the chances of every possible number of wins (from 0 to 10) with this coin?
You can obtain a nice bar chart with just a few lines of code, using just one function, dbinom…

import matplotlib.pyplot as plt

# Probability of each possible number of heads (0 to 10) in 10 tosses
probs=[]
for i in range(11):
    probs.append(dbinom(i,10,0.6))
plt.bar(range(11),height=probs)
plt.grid(True)
plt.show()
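For readers curious about what a function like dbinom computes, the binomial probability can be sketched directly from its formula. The name dbinom_sketch is hypothetical, and this standard-library version is an illustration under that assumption rather than the script's actual implementation.

```python
import math

def dbinom_sketch(x, size, prob):
    """Probability of exactly x successes in `size` independent trials."""
    if x < 0 or x > size:
        return 0.0
    # (number of ways to choose x successes) * p^x * (1 - p)^(size - x)
    return math.comb(size, x) * prob**x * (1 - prob) ** (size - x)

# Chances of 0..10 heads in 10 tosses of the 60% coin
probs = [dbinom_sketch(k, 10, 0.6) for k in range(11)]
most_likely = max(range(11), key=lambda k: probs[k])
print(most_likely, round(probs[most_likely], 4))  # 6 0.2508
```

The eleven probabilities sum to 1, and the tallest bar sits at 6 heads, as you would expect from a coin that favors heads 60% of the time.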

Simple interface for probability calculations
R offers an extremely simplified and intuitive interface for quick calculations from essential probability distributions. The interface goes like this…

d{distribution}: gives the density function value at a point x
p{distribution}: gives the cumulative value at a point x
q{distribution}: gives the quantile function value at a probability p
r{distribution}: generates one or multiple random variates
In our implementation, we stick to this interface and the associated argument list so that you can execute these functions exactly as in an R environment.
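As an illustration of the d/p/q/r naming pattern for the normal distribution, here is a rough sketch built on the standard library's statistics.NormalDist. These four functions are stand-ins written for this example, not the script's own code, and the seed argument of rnorm is an addition of mine (R's rnorm has no such parameter).

```python
from statistics import NormalDist

def dnorm(x, mean=0.0, sd=1.0):
    """Density of the normal distribution at x."""
    return NormalDist(mean, sd).pdf(x)

def pnorm(q, mean=0.0, sd=1.0):
    """Cumulative probability P(X <= q)."""
    return NormalDist(mean, sd).cdf(q)

def qnorm(p, mean=0.0, sd=1.0):
    """Quantile: the value x such that P(X <= x) = p."""
    return NormalDist(mean, sd).inv_cdf(p)

def rnorm(n, mean=0.0, sd=1.0, seed=None):
    """Generate n random variates (seed is a hypothetical convenience)."""
    return NormalDist(mean, sd).samples(n, seed=seed)

print(round(dnorm(0), 4))      # 0.3989
print(round(pnorm(1.96), 3))   # 0.975
print(round(qnorm(0.975), 2))  # 1.96
print(len(rnorm(5, seed=1)))   # 5
```

Each function takes the point (or probability, or count) first and the distribution parameters after it, mirroring the argument order used in R.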

Currently implemented functions
Currently, the following R-style functions are implemented in the script for fast calling.

Mean, median, variance, standard deviation
Tukey five-number summary, IQR
Covariance of a matrix or between two vectors
Density, cumulative probability, quantile function, and random variate generation for the following distributions: normal, uniform, binomial, Poisson, F, Student’s t, Chi-square, Beta, and Gamma.
Work in progress…
Obviously, this is a work in progress and I plan to add some more convenient R-functions to this script. For example, in R a single command, lm, can get you an ordinary least-squares fitted model for a numerical data set with all the necessary inferential statistics (p-values, standard errors, etc.). This is powerfully brief and compact! On the other hand, standard linear regression problems in Python are often tackled using Scikit-learn, which needs a bit more scripting to accomplish this. I plan to incorporate this single-function linear model fitting feature using Python’s statsmodels backend.

If you like this script and find use for it in your work, please star/fork my GitHub repo and spread the news.