Small numpy arrays are very similar to Python scalars but numpy incurs a fair amount of extra overhead for simple operations. For large arrays this doesn't matter, but for code that manipulates a lot of small pieces of data, it can be a serious bottleneck.
For example:
In [1]: x = 1.0
In [2]: numpy_x = np.asarray(x)
In [3]: timeit x + x
10000000 loops, best of 3: 61 ns per loop
In [4]: timeit numpy_x + numpy_x
1000000 loops, best of 3: 1.66 us per loop
This project involved
- profiling simple operations like the above
- determining possible bottlenecks
- devising improved algorithms to solve them, with the goal of getting the numpy time as close as possible to the Python time (a quick way to measure that gap is sketched below).
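The gap itself can be quantified with a small timeit script. The sketch below is only illustrative (the function name and iteration count are arbitrary choices of mine); it reports how many times slower the 0-d numpy array is than the plain Python float:

import timeit

def overhead_ratio(number=1000000):
    # Time the same addition on a Python float and on a 0-d numpy array.
    t_py = timeit.timeit('x + x', number=number, setup='x = 1.0')
    t_np = timeit.timeit('x + x', number=number,
                         setup='import numpy as np; x = np.asarray(1.0)')
    return t_py, t_np, t_np / t_py

if __name__ == '__main__':
    t_py, t_np, ratio = overhead_ratio()
    print('python: %.3fs  numpy: %.3fs  slowdown: %.1fx' % (t_py, t_np, ratio))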
Profiling tools
The very first step toward finding a bottleneck is profiling for time or space. During the project I used a few tools for profiling NumPy's execution flow and visualizing the resulting data.
Google profiling tool (gperftools)
Setting up Gperftools
The following steps set up a C-level profiler for Python on Ubuntu 13.04 (for other systems and options, see [1]).
- Build gperftools from source. Clone the svn repository from http://gperftools.googlecode.com/svn/trunk/
- To build gperftools checked out from the subversion repository, you need autoconf, automake and libtool installed.
- First, run the ./autogen.sh script, which generates ./configure and other files. Then run ./configure.
- Run 'make check' to execute the self-tests that come with the package. This step is optional but recommended.
- After all tests pass, run 'sudo make install' to install the programs along with any data files and documentation.
Running the CPU profiler
I invoked the profiler manually before running the sample code. Assume the Python code to be profiled is in the file num.py.
$ CPUPROFILE=num.py.prof LD_PRELOAD=/usr/lib/libprofiler.so python num.py
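For completeness, num.py can be any numpy-heavy script; the minimal version assumed in this walkthrough just hammers a small-array addition in a loop so that most profiler samples land inside numpy:

# num.py -- minimal benchmark body (assumed for this walkthrough)
import numpy as np

x = np.asarray(1.0)
y = np.asarray(2.0)

# A tight loop over a 0-d array addition keeps the profiler samples
# inside numpy's small-array code paths.
for _ in range(10000000):
    x + y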
Alternatively, include the profiler in the code as follows:
import ctypes
import timeit

profiler = ctypes.CDLL("libprofiler.so")
profiler.ProfilerStart("num.py.prof")
timeit.timeit('x+y', number=10000000,
              setup='import numpy as np; x = np.asarray(1.0); y = np.asarray(2.0)')
profiler.ProfilerStop()
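The ProfilerStart/ProfilerStop pair above is gperftools' C API called through ctypes; wrapping it in a small context manager (my own convenience sketch, not part of gperftools) keeps the profiled region explicit:

import ctypes
from contextlib import contextmanager

@contextmanager
def cpu_profile(path, lib='libprofiler.so'):
    # Start the gperftools CPU profiler for the enclosed block and
    # make sure it is stopped (and the profile flushed) on exit.
    profiler = ctypes.CDLL(lib)
    profiler.ProfilerStart(path.encode())
    try:
        yield
    finally:
        profiler.ProfilerStop()

# Usage:
# with cpu_profile('num.py.prof'):
#     run_benchmark()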
To analyze the stats, use:
$ pprof --gv ./num.py num.py.prof
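When no graphical viewer is available, the same profile can also be dumped as a flat text listing of the hottest functions (same arguments, different output flag):

$ pprof --text ./num.py num.py.prof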
[Figure: Callgraph generated by gperftools. Each block represents a method with its local and cumulative percentages.]
OProfile
OProfile is a system-wide profiler for Linux systems, capable of profiling all running code at low overhead. OProfile is released under the GNU GPL.
Setting up OProfile
- Access the source via Git: git clone git://git.code.sf.net/p/oprofile/oprofile
- Automake and autoconf are needed.
- Run autogen.sh before attempting to build as normal (a typical build sequence is sketched below).
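For reference, the usual autotools sequence after cloning looks like the following; prefixes and options are left at their defaults here and may need adjusting for your system:

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install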
Running the CPU profiler
$ opcontrol --callgraph=16
$ opcontrol --start
$ python num.py
$ opcontrol --stop
$ opcontrol --dump
$ opreport -cgf | gprof2dot.py -f oprofile | dot -Tpng -o output.png
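On a machine where kernel symbols are not set up, OProfile usually has to be told so before the --start step above (this assumes no vmlinux image is available; point --vmlinux at one if you have it):

$ sudo opcontrol --no-vmlinux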
[Figure: Callgraph visualized with the help of the gprof2dot.py script.]
Perf from linux-tools
Perf provides rich, generalized abstractions over hardware-specific capabilities. Among others, it provides per-task, per-CPU and per-workload counters, sampling on top of these, and source code event annotation.
Setting up perf
$ sudo apt-get install linux-tools-common
$ sudo apt-get install linux-tools-<kernel-version>
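On Ubuntu the package suffix normally matches the running kernel, so the version can be filled in automatically (assuming the package naming follows uname -r):

$ sudo apt-get install linux-tools-$(uname -r)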
Running the profiler and visualizing the data as a flame graph
$ perf record -a -g -F 1000 ./num.py
$ perf script | ./stackcollapse-perf.pl > out.perf-folded
$ cat out.perf-folded | ./flamegraph.pl > perf-numpy.svg
The first command runs perf in sampling mode (polling) at 1000 Hz (-F 1000) across all CPUs (-a), capturing stack traces so that a call graph (-g) of function ancestry can be generated later. The samples are saved in a perf.data file.
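The stackcollapse-perf.pl and flamegraph.pl scripts are not part of perf itself; the commands above assume they were copied into the working directory from the FlameGraph repository:

$ git clone https://github.com/brendangregg/FlameGraph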
[Figure: Flame graph of the profiled NumPy benchmark. The scripts used to generate it are available at https://github.com/brendangregg/FlameGraph.]