Profiling Python code for performance

This year's GSoC is about speeding things up. SimpleCV is awesome, but it sometimes comes off as slow. The objective of my project is identifying bottlenecks and making things faster!

Our initial idea was to sweep the code and start changing little things, but that turned out to be impractical. Instead, I dove into python profiling and got a great look at the internals of SimpleCV. There are some pretty decent python profiling tools out there, which enable me to run code and find out how long each line takes.

We have a reasonably big tests file, whose execution can also be profiled. These first results were quite interesting and gave me a lot to start with: Quick functions that were called ridiculously often (where small changes can make great improvements) and functions that were just way too slow (there must be something wrong!).

Image.find_motion was the first to be tackled:

ncalls tottime percall cumtime percall filename:lineno(function) 1 0.169 0.169 14.529 14.529 <string>:1(<module>) 249796 0.488 0.000 12.669 0.000 Detection.py:1477(__init__) 249796 0.247 0.000 0.592 0.000 Detection.py:1527(normalizeTo) 499592 0.229 0.000 0.267 0.000 Detection.py:1544(magnitude) 249796 0.078 0.000 0.207 0.000 Detection.py:1550(unitVector) 249796 0.310 0.000 12.181 0.000 Features.py:1185(__init__) 249796 2.367 0.000 11.871 0.000 Features.py:1608(_updateExtents) 2 0.000 0.000 0.000 0.000 ImageClass.py:1118(__del__) 4 0.000 0.000 0.000 0.000 ImageClass.py:1727(getEmpty) 8 0.000 0.000 0.000 0.000 ImageClass.py:1765(getBitmap) 2 0.000 0.000 0.004 0.002 ImageClass.py:2036(_getGrayscaleBitmap) 4 0.000 0.000 0.000 0.000 ImageClass.py:3785(size) 1 0.873 0.873 14.352 14.352 ImageClass.py:8547(findMotion) 2 0.000 0.000 0.007 0.004 ImageClass.py:879(__init__) 1 0.000 0.000 14.359 14.359 findmotion_profile.py:5(find_motion) 249796 1.426 0.000 5.790 0.000 fromnumeric.py:1774(amax) 249796 0.652 0.000 3.714 0.000 fromnumeric.py:1836(amin) 499592 2.049 0.000 7.426 0.000 fromnumeric.py:32(_wrapit) 499592 0.339 0.000 3.744 0.000 numeric.py:167(asarray) 1 0.179 0.179 0.179 0.179 {cv2.cv.CalcOpticalFlowBM} 6 0.000 0.000 0.000 0.000 {cv2.cv.CreateImage} 2 0.003 0.002 0.003 0.002 {cv2.cv.CvtColor} 4 0.000 0.000 0.000 0.000 {cv2.cv.GetSize} 2 0.007 0.004 0.007 0.004 {cv2.cv.LoadImage} 4 0.000 0.000 0.000 0.000 {cv2.cv.SetZero} 499592 0.122 0.000 0.122 0.000 {getattr} 6 0.000 0.000 0.000 0.000 {isinstance} 499593 0.039 0.000 0.039 0.000 {math.sqrt} 249796 0.034 0.000 0.034 0.000 {method 'append' of 'list' objects} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} 8 0.000 0.000 0.000 0.000 {method 'lower' of 'str' objects} 249796 0.814 0.000 0.814 0.000 {method 'max' of 'numpy.ndarray' objects} 249796 0.697 0.000 0.697 0.000 {method 'min' of 'numpy.ndarray' objects} 1 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects} 2 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects} 7 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects} 499592 3.405 0.000 3.405 0.000 {numpy.core.multiarray.array} 635 0.002 0.000 0.002 0.000 {range}

became:

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.175 0.175 4.981 4.981 <string>:1(<module>)
249796 1.217 0.000 2.833 0.000 Detection.py:1477(init)
249796 0.238 0.000 0.586 0.000 Detection.py:1527(normalizeTo)
499592 0.223 0.000 0.267 0.000 Detection.py:1544(magnitude)
249796 0.081 0.000 0.208 0.000 Detection.py:1550(unitVector)
249796 0.264 0.000 1.616 0.000 Features.py:1185(init)
249796 1.353 0.000 1.353 0.000 Features.py:1608(updateExtents)
2 0.000 0.000 0.000 0.000 ImageClass.py:1118(del)
4 0.000 0.000 0.001 0.000 ImageClass.py:1727(getEmpty)
8 0.000 0.000 0.000 0.000 ImageClass.py:1765(getBitmap)
2 0.000 0.000 0.004 0.002 ImageClass.py:2036(getGrayscaleBitmap)

4 0.000 0.000 0.000 0.000 ImageClass.py:3785(size)
1 1.167 1.167 4.799 4.799 ImageClass.py:8547(findMotion)
2 0.000 0.000 0.007 0.004 ImageClass.py:879(init)
1 0.000 0.000 4.806 4.806 findmotionprofile.py:5(findmotion)
1 0.181 0.181 0.181 0.181 {cv2.cv.CalcOpticalFlowBM}
6 0.000 0.000 0.000 0.000 {cv2.cv.CreateImage}
2 0.003 0.002 0.003 0.002 {cv2.cv.CvtColor}
4 0.000 0.000 0.000 0.000 {cv2.cv.GetSize}
2 0.007 0.004 0.007 0.004 {cv2.cv.LoadImage}
4 0.001 0.000 0.001 0.000 {cv2.cv.SetZero}
6 0.000 0.000 0.000 0.000 {isinstance}
499593 0.045 0.000 0.045 0.000 {math.sqrt}
249796 0.026 0.000 0.026 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
8 0.000 0.000 0.000 0.000 {method 'lower' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
7 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
635 0.001 0.000 0.001 0.000 {range}

That seems a bit complicated, doesn't it? Basically, _updateExtents is called 250k times, and it takes too long. I managed to speed it up a little and findMotion came down from 15 to 5 seconds!

Those are awesome results, but that's unfortunately not always how it works. After this huge improvement, I still haven't been able to find any similar problems. I have been doing line profiling of all the functions that are either called too many times or that take considerably long, but that's how most of them look:

Function: _generatePalette at line 8697
Total time: 0.195119 s

Line # Hits Time Per Hit % Time Line Contents

...
8748 1 3 3.0 0.0 if( self.mPaletteBins != bins or
8749 self.mDoHuePalette != hue ):

8750 1 3 3.0 0.0 total = float(self.width*self.height)
8751 1 1 1.0 0.0 percentages = []
8752 1 1 1.0 0.0 result = None
8753 1 1 1.0 0.0 if( not hue ):
8754 1 1554 1554.0 0.8 pixels = np.array(self.getNumpy()).reshape(-1, 3) #reshape our matrix to 1xN
8755 1 2 2.0 0.0 if( centroids == None ):
8756 1 193554 193554.0 99.2 result = scv.kmeans(pixels,bins)

Once more, this seems way too complicated for a blog post. What it shows us, though, is that 99.2% of the time spent by this function is actually spent inside scv.kmeans. (from scipy). This is what happens in most of the functions I profiled, and it conveys an obvious but important message: SimpleCV's speed is directly related to the speed of the OpenCV/Scipy, and that is, most of the times, exactly where the bottleneck is going to be.

This makes my work much harder, because that's not really the code I'm supposed to play with. I need to find more functions like Image.findMotion, in which the slowness is caused by the way SimpleCV uses the other libraries, and not by the libraries themselves.