Random Points

Easily Profile Python Code in Jupyter

line_profiler is an excellent tool that can help you quickly profile your python code and find where the performance bottlenecks are. In this blog post I will walk through a simple example and a few tips about using this tool within the Jupyter notebook.

Installation

To install line_profiler with Anaconda, simply do:

conda install line_profiler

or if you use pip, install with:

pip install line-profiler

Using it with Jupyter notebook

To use line_profiler, normally you'd need to modify your code and decorate the functions you want to profile with @profile.

@profile
def slow_function(a, b, c):
    ...

However one trick that I find especially useful is to use the line_profiler extension within Jupyter. This eliminates the need to modify the code with the @profile decorator, and instead you simply load the extension by doing

In [1]:
%load_ext line_profiler

Example

Here is a simple (and somewhat contrived) example to illustrate how to use this.

Suppose you have a sequence of coordinates in (latitude / longitude) that represents a random walk from an origin. At each step the latitude and longitude change is determined by drawing from a Normal distribution:

In [2]:
import numpy as np
import pandas as pd
In [3]:
origin = {'lat': 34, 'lon': -120}
np.random.seed(1)
n = 100000
changes = np.random.randn(n, 2) / 30
changes[0] = [0, 0]
trace = pd.DataFrame.from_records(changes, columns=['lat', 'lon']).cumsum()
trace['lat'] += origin['lat']
trace['lon'] += origin['lon']

trace.head()
Out[3]:
lat lon
0 34.000000 -120.000000
1 33.982394 -120.035766
2 34.011241 -120.112484
3 34.069402 -120.137857
4 34.080036 -120.146169

And suppose you are interested in computing the maximum distance from the origin for the duration of the random walk. To compute distances between two points on Earth, we can use the haversine formula

In [4]:
from math import radians, cos, sin, asin, sqrt
def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in lat/lon)
    """
    # convert to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))
    earth_radius = 6367
    distance_km = earth_radius * c
    return distance_km

this is what the trace of the random walk looks like:

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
import matplotlib as mpl
mpl.rcParams['axes.formatter.useoffset'] = False
In [6]:
plt.plot(trace.lat, trace.lon)
Out[6]:
[<matplotlib.lines.Line2D at 0x1174b9780>]

We can then implement a few funtions to find the maximum distance from the random walk to the origin:

In [7]:
def get_distances(trace, origin):
    distances = {}
    for i in trace.index:
        distances[i] = haversine(trace['lat'].loc[i], trace['lon'].loc[i], origin['lat'], origin['lon'])
    distances = pd.Series(distances)
    return distances

def get_farthest(trace, origin):
    distance = get_distances(trace, origin)
    max_idx = distance.argmax()
    return trace.loc[max_idx], distance.loc[max_idx]
In [8]:
get_farthest(trace, origin)
Out[8]:
(lat     54.648484
 lon   -109.343278
 Name: 97647, dtype: float64, 2439.5707960599893)

The performance of this implementation is not great, however. We can use the timeit extension to quickly time the runtime performance:

In [9]:
%timeit get_farthest(trace, origin)
1 loop, best of 3: 9.02 s per loop

To further understand why this is slow, we can profile the code line by line with %lprun, where "lp" stands for "line profiler", and the -f argument specifies the function you'd like to profile

In [10]:
%lprun -f get_farthest get_farthest(trace, origin)
Timer unit: 1e-06 s

Total time: 18.9899 s
File: 
Function: get_farthest at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           def get_farthest(trace, origin):
     9         1     18985235 18985235.0    100.0      distance = get_distances(trace, origin)
    10         1          720    720.0      0.0      max_idx = distance.argmax()
    11         1         3905   3905.0      0.0      return trace.loc[max_idx], distance.loc[max_idx]

We therefore see that the time to do argmax() is negligible, but the distance computation get_distances() is the hot spot. We can then further analyze get_distances() by doing:

In [11]:
%lprun -f get_distances get_farthest(trace, origin)
Timer unit: 1e-06 s

Total time: 19.2948 s
File: 
Function: get_distances at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def get_distances(trace, origin):
     2         1            2      2.0      0.0      distances = {}
     3    100001        99179      1.0      0.5      for i in trace.index:
     4    100000     19165708    191.7     99.3          distances[i] = haversine(trace['lat'].loc[i], trace['lon'].loc[i], origin['lat'], origin['lon'])
     5         1        29945  29945.0      0.2      distances = pd.Series(distances)
     6         1            2      2.0      0.0      return distances

and we see that the overhead of creating the pd.Series is negligible, but the loop with the haversine() computation is the hot spot. So we chase it down further:

In [12]:
%lprun -f haversine get_farthest(trace, origin)
Timer unit: 1e-06 s

Total time: 0.941073 s
File: 
Function: haversine at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def haversine(lat1, lon1, lat2, lon2):
     3                                               """
     4                                               Calculate the great circle distance between two points 
     5                                               on the earth (specified in lat/lon)
     6                                               """
     7                                               # convert to radians 
     8    100000       281846      2.8     29.9      lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
     9                                               # haversine formula
    10    100000        80497      0.8      8.6      dlon = lon2 - lon1
    11    100000        60033      0.6      6.4      dlat = lat2 - lat1
    12    100000       233382      2.3     24.8      a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    13    100000       112446      1.1     11.9      c = 2 * asin(sqrt(a))
    14    100000        56156      0.6      6.0      earth_radius = 6367
    15    100000        61922      0.6      6.6      distance_km = earth_radius * c
    16    100000        54791      0.5      5.8      return distance_km

Now notice the "Total time" value has dropped from around 19s to 0.94s. This suggests that the expensive part is not actually within haversine() itself, but the loop around it.

Vectorization

The above example illustrates a key concept when using numpy and pandas known as vectorization. Essentially, we want to avoid writing loops in python, and instead give numpy or pandas an entire "vector" (i.e. numpy array) to work with. There are many excellent resources on this topic including Modern Pandas by Tom Augspurger, Why Python is Slow by Jake VanderPlas, and Performance Pandas by Jeff Reback.

In this example, vectorization means being able to pass the entire lat/lon arrays into the haversine() computation somehow, instead of looping through one point at a time.

Luckily it is not too hard to modify haversine() to make it work with arrays:

In [13]:
def haversine_vectorized(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points 
    on the earth. Note that lat1/lon1/lat2/lon2 can either be
    scalars or numpy arrays.
    """
    # convert to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    earth_radius = 6367
    distance_km = earth_radius * c
    return distance_km

Note that the code change is fairly minimal. Essentially all we have done here is replacing all the math functions with the numpy equivalent which can operate on entire arrays instead of just scalars.

In fact most of the functions have identical names except for math.asin() and np.arcsin(). Also note that numpy broadcasting will take care of the case where lat1 is an array and lat2 is a scalar.

Now our implementation becomes: (note the loop is gone)

In [14]:
def get_distances_vectorized(trace, origin):
    distances = haversine_vectorized(trace['lat'], trace['lon'], origin['lat'], origin['lon'])
    return distances

def get_farthest_vectorized(trace, origin):
    distance = get_distances_vectorized(trace, origin)
    max_idx = distance.argmax()
    return trace.loc[max_idx], distance.loc[max_idx]

Let's make sure the results are the same:

In [15]:
get_farthest(trace, origin)
Out[15]:
(lat     54.648484
 lon   -109.343278
 Name: 97647, dtype: float64, 2439.5707960599893)
In [16]:
get_farthest_vectorized(trace, origin)
Out[16]:
(lat     54.648484
 lon   -109.343278
 Name: 97647, dtype: float64, 2439.5707960599893)

And we are getting a tremendous increase in performance with vectorization, going from more than 9 seconds to 15 milliseconds!

In [17]:
%timeit get_farthest_vectorized(trace, origin)
100 loops, best of 3: 15.2 ms per loop

Comments