line_profiler is an excellent tool that can help you quickly profile your python code and find where the performance bottlenecks are. In this blog post I will walk through a simple example and a few tips about using this tool within the Jupyter notebook.
Installation¶
To install line_profiler
with Anaconda, simply do:
conda install line_profiler
or if you use pip
, install with:
pip install line-profiler
Using it with Jupyter notebook¶
To use line_profiler
, normally you'd need to modify your code and decorate the functions you want to profile with @profile
.
@profile
def slow_function(a, b, c):
...
However one trick that I find especially useful is to use the line_profiler
extension within Jupyter. This eliminates the need to modify the code with the @profile
decorator, and instead you simply load the extension by doing
%load_ext line_profiler
Example¶
Here is a simple (and somewhat contrived) example to illustrate how to use this.
Suppose you have a sequence of coordinates in (latitude / longitude) that represents a random walk from an origin. At each step the latitude and longitude change is determined by drawing from a Normal distribution:
import numpy as np
import pandas as pd
origin = {'lat': 34, 'lon': -120}
np.random.seed(1)
n = 100000
changes = np.random.randn(n, 2) / 30
changes[0] = [0, 0]
trace = pd.DataFrame.from_records(changes, columns=['lat', 'lon']).cumsum()
trace['lat'] += origin['lat']
trace['lon'] += origin['lon']
trace.head()
And suppose you are interested in computing the maximum distance from the origin for the duration of the random walk. To compute distances between two points on Earth, we can use the haversine formula
from math import radians, cos, sin, asin, sqrt
def haversine(lat1, lon1, lat2, lon2):
"""
Calculate the great circle distance between two points
on the earth (specified in lat/lon)
"""
# convert to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * asin(sqrt(a))
earth_radius = 6367
distance_km = earth_radius * c
return distance_km
this is what the trace of the random walk looks like:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
import matplotlib as mpl
mpl.rcParams['axes.formatter.useoffset'] = False
plt.plot(trace.lat, trace.lon)
We can then implement a few funtions to find the maximum distance from the random walk to the origin:
def get_distances(trace, origin):
distances = {}
for i in trace.index:
distances[i] = haversine(trace['lat'].loc[i], trace['lon'].loc[i], origin['lat'], origin['lon'])
distances = pd.Series(distances)
return distances
def get_farthest(trace, origin):
distance = get_distances(trace, origin)
max_idx = distance.argmax()
return trace.loc[max_idx], distance.loc[max_idx]
get_farthest(trace, origin)
The performance of this implementation is not great, however. We can use the timeit extension to quickly time the runtime performance:
%timeit get_farthest(trace, origin)
To further understand why this is slow, we can profile the code line by line with %lprun
, where "lp" stands for "line profiler", and the -f
argument specifies the function you'd like to profile
%lprun -f get_farthest get_farthest(trace, origin)
Timer unit: 1e-06 s Total time: 18.9899 s File:Function: get_farthest at line 8 Line # Hits Time Per Hit % Time Line Contents ============================================================== 8 def get_farthest(trace, origin): 9 1 18985235 18985235.0 100.0 distance = get_distances(trace, origin) 10 1 720 720.0 0.0 max_idx = distance.argmax() 11 1 3905 3905.0 0.0 return trace.loc[max_idx], distance.loc[max_idx]
We therefore see that the time to do argmax()
is negligible, but the distance computation get_distances()
is the hot spot. We can then further analyze get_distances()
by doing:
%lprun -f get_distances get_farthest(trace, origin)
Timer unit: 1e-06 s Total time: 19.2948 s File:Function: get_distances at line 1 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1 def get_distances(trace, origin): 2 1 2 2.0 0.0 distances = {} 3 100001 99179 1.0 0.5 for i in trace.index: 4 100000 19165708 191.7 99.3 distances[i] = haversine(trace['lat'].loc[i], trace['lon'].loc[i], origin['lat'], origin['lon']) 5 1 29945 29945.0 0.2 distances = pd.Series(distances) 6 1 2 2.0 0.0 return distances
and we see that the overhead of creating the pd.Series
is negligible, but the loop with the haversine()
computation is the hot spot. So we chase it down further:
%lprun -f haversine get_farthest(trace, origin)
Timer unit: 1e-06 s Total time: 0.941073 s File:Function: haversine at line 2 Line # Hits Time Per Hit % Time Line Contents ============================================================== 2 def haversine(lat1, lon1, lat2, lon2): 3 """ 4 Calculate the great circle distance between two points 5 on the earth (specified in lat/lon) 6 """ 7 # convert to radians 8 100000 281846 2.8 29.9 lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) 9 # haversine formula 10 100000 80497 0.8 8.6 dlon = lon2 - lon1 11 100000 60033 0.6 6.4 dlat = lat2 - lat1 12 100000 233382 2.3 24.8 a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2 13 100000 112446 1.1 11.9 c = 2 * asin(sqrt(a)) 14 100000 56156 0.6 6.0 earth_radius = 6367 15 100000 61922 0.6 6.6 distance_km = earth_radius * c 16 100000 54791 0.5 5.8 return distance_km
Now notice the "Total time" value has dropped from around 19s to 0.94s. This suggests that the expensive part is not actually within haversine()
itself, but the loop around it.
Vectorization¶
The above example illustrates a key concept when using numpy
and pandas
known as vectorization. Essentially, we want to avoid writing loops in python, and instead give numpy
or pandas
an entire "vector" (i.e. numpy
array) to work with. There are many excellent resources on this topic including Modern Pandas by Tom Augspurger, Why Python is Slow by Jake VanderPlas, and Performance Pandas by Jeff Reback.
In this example, vectorization means being able to pass the entire lat/lon arrays into the haversine()
computation somehow, instead of looping through one point at a time.
Luckily it is not too hard to modify haversine()
to make it work with arrays:
def haversine_vectorized(lat1, lon1, lat2, lon2):
"""
Calculate the great circle distance between two points
on the earth. Note that lat1/lon1/lat2/lon2 can either be
scalars or numpy arrays.
"""
# convert to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
c = 2 * np.arcsin(np.sqrt(a))
earth_radius = 6367
distance_km = earth_radius * c
return distance_km
Note that the code change is fairly minimal. Essentially all we have done here is replacing all the math
functions with the numpy
equivalent which can operate on entire arrays instead of just scalars.
In fact most of the functions have identical names except for math.asin()
and np.arcsin()
. Also note that numpy
broadcasting will take care of the case where lat1
is an array and lat2
is a scalar.
Now our implementation becomes: (note the loop is gone)
def get_distances_vectorized(trace, origin):
distances = haversine_vectorized(trace['lat'], trace['lon'], origin['lat'], origin['lon'])
return distances
def get_farthest_vectorized(trace, origin):
distance = get_distances_vectorized(trace, origin)
max_idx = distance.argmax()
return trace.loc[max_idx], distance.loc[max_idx]
Let's make sure the results are the same:
get_farthest(trace, origin)
get_farthest_vectorized(trace, origin)
And we are getting a tremendous increase in performance with vectorization, going from more than 9 seconds to 15 milliseconds!
%timeit get_farthest_vectorized(trace, origin)