# Easily Profile Python Code in Jupyter

line_profiler is an excellent tool that can help you quickly profile your Python code and find the performance bottlenecks. In this blog post I will walk through a simple example and share a few tips for using this tool within the Jupyter notebook.

## Installation

To install line_profiler with Anaconda, simply do:

conda install line_profiler

or if you use pip, install with:

pip install line-profiler

## Using it with Jupyter notebook

To use line_profiler, normally you'd need to modify your code, decorating each function you want to profile with @profile, and then run the script under line_profiler's kernprof script:

@profile
def slow_function(a, b, c):
    ...

However, one trick that I find especially useful is the line_profiler IPython extension within Jupyter. This eliminates the need to modify the code with the @profile decorator; instead you simply load the extension:

In [1]:
%load_ext line_profiler


## Example

Here is a simple (and somewhat contrived) example to illustrate how to use this.

Suppose you have a sequence of (latitude, longitude) coordinates that represents a random walk from an origin. At each step, the change in latitude and longitude is determined by drawing from a normal distribution:

In [2]:
import numpy as np
import pandas as pd

In [3]:
origin = {'lat': 34, 'lon': -120}
np.random.seed(1)
n = 100000
changes = np.random.randn(n, 2) / 30
changes[0] = [0, 0]
trace = pd.DataFrame.from_records(changes, columns=['lat', 'lon']).cumsum()
trace['lat'] += origin['lat']
trace['lon'] += origin['lon']
trace.head()


Out[3]:
lat lon
0 34.000000 -120.000000
1 33.982394 -120.035766
2 34.011241 -120.112484
3 34.069402 -120.137857
4 34.080036 -120.146169

And suppose you are interested in computing the maximum distance from the origin over the duration of the random walk. To compute distances between two points on Earth, we can use the haversine formula:

In [4]:
from math import radians, cos, sin, asin, sqrt
def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in lat/lon)
    """
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))
    earth_radius = 6367  # mean earth radius in km
    distance_km = earth_radius * c
    return distance_km
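As a quick sanity check of the function above (reproduced here in condensed form so the snippet runs standalone): the distance from a point to itself should be zero, and one degree of latitude should be roughly 111 km (earth_radius * pi / 180).

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two (lat, lon) points."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))
    return 6367 * c  # mean earth radius in km

# distance from a point to itself is zero
print(haversine(34, -120, 34, -120))
# one degree of latitude is ~111 km
print(haversine(34, -120, 35, -120))
```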


This is what the trace of the random walk looks like:

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
import matplotlib as mpl
mpl.rcParams['axes.formatter.useoffset'] = False

In [6]:
plt.plot(trace.lat, trace.lon)

Out[6]:
[<matplotlib.lines.Line2D at 0x1174b9780>]

We can then implement a few functions to find the maximum distance from the origin over the random walk:

In [7]:
def get_distances(trace, origin):
    distances = {}
    for i in trace.index:
        distances[i] = haversine(trace['lat'].loc[i], trace['lon'].loc[i], origin['lat'], origin['lon'])
    distances = pd.Series(distances)
    return distances

def get_farthest(trace, origin):
    distance = get_distances(trace, origin)
    max_idx = distance.argmax()
    return trace.loc[max_idx], distance.loc[max_idx]

In [8]:
get_farthest(trace, origin)

Out[8]:
(lat     54.648484
lon   -109.343278
Name: 97647, dtype: float64, 2439.5707960599893)

The performance of this implementation is not great, however. We can use the %timeit magic to quickly measure the runtime:

In [9]:
%timeit get_farthest(trace, origin)

1 loop, best of 3: 9.02 s per loop


To further understand why this is slow, we can profile the code line by line with %lprun, where "lp" stands for "line profiler", and the -f argument specifies the function you'd like to profile:

In [10]:
%lprun -f get_farthest get_farthest(trace, origin)

Timer unit: 1e-06 s

Total time: 18.9899 s
File:
Function: get_farthest at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
8                                           def get_farthest(trace, origin):
9         1     18985235 18985235.0    100.0      distance = get_distances(trace, origin)
10         1          720    720.0      0.0      max_idx = distance.argmax()
11         1         3905   3905.0      0.0      return trace.loc[max_idx], distance.loc[max_idx]


We therefore see that the time to do argmax() is negligible, but the distance computation get_distances() is the hot spot. We can then further analyze get_distances() by doing:

In [11]:
%lprun -f get_distances get_farthest(trace, origin)

Timer unit: 1e-06 s

Total time: 19.2948 s
File:
Function: get_distances at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
1                                           def get_distances(trace, origin):
2         1            2      2.0      0.0      distances = {}
3    100001        99179      1.0      0.5      for i in trace.index:
4    100000     19165708    191.7     99.3          distances[i] = haversine(trace['lat'].loc[i], trace['lon'].loc[i], origin['lat'], origin['lon'])
5         1        29945  29945.0      0.2      distances = pd.Series(distances)
6         1            2      2.0      0.0      return distances


and we see that the overhead of creating the pd.Series is negligible, but the loop with the haversine() computation is the hot spot. So we chase it down further:

In [12]:
%lprun -f haversine get_farthest(trace, origin)

Timer unit: 1e-06 s

Total time: 0.941073 s
File:
Function: haversine at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
2                                           def haversine(lat1, lon1, lat2, lon2):
3                                               """
4                                               Calculate the great circle distance between two points
5                                               on the earth (specified in lat/lon)
6                                               """
8    100000       281846      2.8     29.9      lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
9                                               # haversine formula
10    100000        80497      0.8      8.6      dlon = lon2 - lon1
11    100000        60033      0.6      6.4      dlat = lat2 - lat1
12    100000       233382      2.3     24.8      a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
13    100000       112446      1.1     11.9      c = 2 * asin(sqrt(a))
14    100000        56156      0.6      6.0      earth_radius = 6367
15    100000        61922      0.6      6.6      distance_km = earth_radius * c
16    100000        54791      0.5      5.8      return distance_km


Now notice that the "Total time" value has dropped from around 19 s to 0.94 s. This tells us that the expensive part is not the body of haversine() itself, but the overhead of the loop around it: one function call and two scalar pandas .loc lookups per row.
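To see where that per-row overhead comes from, here is a small illustrative benchmark (the DataFrame and sizes are hypothetical, not from the example above) comparing per-element pandas .loc access, which the slow loop performs twice per iteration, against plain NumPy array indexing:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'lat': np.random.randn(10000)})

def loop_loc():
    # scalar .loc lookup per element, like the slow get_distances()
    return [df['lat'].loc[i] for i in df.index]

lats = df['lat'].to_numpy()

def loop_array():
    # plain array indexing: far less per-element overhead
    return [lats[i] for i in range(len(lats))]

t_loc = timeit.timeit(loop_loc, number=5)
t_arr = timeit.timeit(loop_array, number=5)
print(f".loc loop: {t_loc:.3f}s  array loop: {t_arr:.3f}s")
```

Both loops produce the same values; only the per-element access path differs, and the .loc version is typically slower by an order of magnitude or more.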

## Vectorization

The above example illustrates a key concept when using numpy and pandas known as vectorization. Essentially, we want to avoid writing loops in python, and instead give numpy or pandas an entire "vector" (i.e. numpy array) to work with. There are many excellent resources on this topic including Modern Pandas by Tom Augspurger, Why Python is Slow by Jake VanderPlas, and Performance Pandas by Jeff Reback.

In this example, vectorization means being able to pass the entire lat/lon arrays into the haversine() computation somehow, instead of looping through one point at a time.

Luckily it is not too hard to modify haversine() to make it work with arrays:

In [13]:
def haversine_vectorized(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth. Note that lat1/lon1/lat2/lon2 can either be
    scalars or numpy arrays.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    earth_radius = 6367  # mean earth radius in km
    distance_km = earth_radius * c
    return distance_km


Note that the code change is fairly minimal. Essentially all we have done is replace the math functions with their numpy equivalents, which operate on entire arrays as well as on scalars.

In fact most of the functions have identical names except for math.asin() and np.arcsin(). Also note that numpy broadcasting will take care of the case where lat1 is an array and lat2 is a scalar.
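A minimal standalone illustration of that broadcasting behavior (the array values here are made up for the example):

```python
import numpy as np

lats = np.array([34.0, 35.0, 36.0])  # array of latitudes
origin_lat = 34.0                    # scalar

# the scalar is broadcast across the array, so the result
# has the array's shape
diff = lats - origin_lat
print(diff)

# numpy ufuncs like np.radians and np.sin also apply elementwise
print(np.sin(np.radians(diff)))
```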

Now our implementation becomes (note the loop is gone):

In [14]:
def get_distances_vectorized(trace, origin):
    distances = haversine_vectorized(trace['lat'], trace['lon'], origin['lat'], origin['lon'])
    return distances

def get_farthest_vectorized(trace, origin):
    distance = get_distances_vectorized(trace, origin)
    max_idx = distance.argmax()
    return trace.loc[max_idx], distance.loc[max_idx]


Let's make sure the results are the same:

In [15]:
get_farthest(trace, origin)

Out[15]:
(lat     54.648484
lon   -109.343278
Name: 97647, dtype: float64, 2439.5707960599893)
In [16]:
get_farthest_vectorized(trace, origin)

Out[16]:
(lat     54.648484
lon   -109.343278
Name: 97647, dtype: float64, 2439.5707960599893)

And we get a tremendous performance increase from vectorization, going from over 9 seconds down to about 15 milliseconds, roughly a 600x speedup:

In [17]:
%timeit get_farthest_vectorized(trace, origin)

100 loops, best of 3: 15.2 ms per loop