Monday, September 23, 2013

GSoC'13 Project Summary-1 : Numpy's profiling

Small numpy arrays are very similar to Python scalars, but numpy incurs a fair amount of extra overhead for simple operations. For large arrays this doesn't matter, but for code that manipulates a lot of small pieces of data, it can be a serious bottleneck.
For example:
In [1]: x = 1.0

In [2]: numpy_x = np.asarray(x)

In [3]: timeit x + x
10000000 loops, best of 3: 61 ns per loop

In [4]: timeit numpy_x + numpy_x
1000000 loops, best of 3: 1.66 us per loop

This project involved
  • profiling simple operations like the above
  • determining possible bottlenecks
  • devising improved algorithms to solve them, with the goal of getting the numpy time as close as possible to the Python time.
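
The gap above can be measured directly with timeit. The sketch below compares a plain float addition with the same addition on a 0-d array; the absolute numbers and the exact ratio depend on the machine and NumPy version:

```python
import timeit

n = 100000

# Time a plain Python float addition.
scalar_t = timeit.timeit('x + x', setup='x = 1.0', number=n)

# Time the same addition on a 0-d NumPy array.
array_t = timeit.timeit(
    'x + x',
    setup='import numpy as np; x = np.asarray(1.0)',
    number=n)

print('scalar: %.0f ns/op' % (scalar_t / n * 1e9))
print('array:  %.0f ns/op' % (array_t / n * 1e9))
print('overhead factor: %.1fx' % (array_t / scalar_t))
```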

Profiling tools

The very first step in finding a bottleneck is profiling, for either time or space. During the project I used a few tools to profile numpy's execution flow and visualize the resulting data.

Google profiling tool

This is a suite of tools provided by Google. It includes TCMalloc, a heap checker, a heap profiler and a CPU profiler. Since the goal of the project was to reduce running time, the CPU profiler was used.

Setting up Gperftools

Following are the steps used to set up a C-level profiler for Python on Ubuntu 13.04. (For options on other systems, see [1].)
  1. Build it from source. Check out the Subversion repository from http://gperftools.googlecode.com/svn/trunk/
  2. To build gperftools from a Subversion checkout, you need autoconf, automake and libtool installed.
  3. Run the ./autogen.sh script, which generates ./configure and other files. Then run ./configure.
  4. Run 'make check' to execute the self-tests that come with the package. This step is optional but recommended.
  5. Once all tests pass, run 'sudo make install' to install the programs along with any data files and documentation.

Running CPU profiler

I invoked the profiler manually before running the sample code. Suppose the Python code to be profiled is in a file named num.py.

$CPUPROFILE=num.py.prof LD_PRELOAD=/usr/lib/libprofiler.so python num.py

Alternatively, start the profiler from within the code as follows:
import ctypes
import timeit

# Load gperftools' CPU profiler and bracket the code of interest
profiler = ctypes.CDLL("libprofiler.so")
profiler.ProfilerStart(b"num.py.prof")  # the C API expects a char*
timeit.timeit('x + y', number=10000000,
       setup='import numpy as np; x = np.asarray(1.0); y = np.asarray(2.0)')
profiler.ProfilerStop()
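
For repeated experiments it can be convenient to wrap the two ctypes calls in a context manager. The helper below is a sketch (cpu_profile is a hypothetical name, not part of gperftools); it assumes libprofiler.so is loadable, and the library is only touched once the with-block is entered:

```python
import ctypes
from contextlib import contextmanager

@contextmanager
def cpu_profile(path, lib="libprofiler.so"):
    # Hypothetical helper: bracket a block of code with gperftools'
    # ProfilerStart/ProfilerStop so the profile is always flushed,
    # even if the profiled code raises.
    profiler = ctypes.CDLL(lib)
    profiler.ProfilerStart(path.encode())  # C API expects a char*
    try:
        yield
    finally:
        profiler.ProfilerStop()
```

With this in place, the timeit call above becomes `with cpu_profile("num.py.prof"): timeit.timeit(...)`.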

To analyze the stats, pass pprof the profiled binary (the Python interpreter, not the script) together with the profile file:

$pprof --gv $(which python) num.py.prof


Callgraph generated by gperftools. Each block represents a method, with its local and cumulative time percentages.



Oprofile

OProfile is a system-wide profiler for Linux systems, capable of profiling all running code at low overhead. OProfile is released under the GNU GPL.

Setting up Oprofile

  1. Get the source via Git: git clone git://git.code.sf.net/p/oprofile/oprofile
  2. Automake and autoconf are required.
  3. Run autogen.sh before building as usual.

Running CPU profiler

$opcontrol --callgraph=16
$opcontrol --start
$python num.py
$opcontrol --stop
$opcontrol --dump
$opreport -cgf | gprof2dot.py -f oprofile | dot -Tpng -o output.png

The callgraph is visualized with the help of the gprof2dot.py script.

Perf from linux-tools

Perf provides rich, generalized abstractions over hardware-specific capabilities. Among other things, it provides per-task, per-CPU and per-workload counters, sampling on top of these, and source-code event annotation.

Setting up perf

$sudo apt-get install linux-tools-common 
$sudo apt-get install linux-tools-<kernel-version>

Running Profiler and visualizing data as flame-graph

$perf record -a -g -F 1000 python num.py
$perf script | ./stackcollapse-perf.pl > out.perf-folded
$cat out.perf-folded | ./flamegraph.pl > perf-numpy.svg
The first command runs perf in sampling mode at 1000 Hz (-F 1000) across all CPUs (-a), capturing stack traces so that a call graph (-g) of function ancestry can be generated later. The samples are saved in a perf.data file.

The scripts used to generate the flame graph above are at https://github.com/brendangregg/FlameGraph.
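
The "folded" file produced by stackcollapse-perf.pl is plain text: one stack per line, frames joined by semicolons, followed by a sample count. That makes quick ad-hoc analysis possible without rendering the SVG. The sketch below, with made-up sample data, tallies samples by innermost frame:

```python
from collections import Counter

# Made-up folded stacks in the stackcollapse-perf.pl output format:
# "frame1;frame2;...;leaf count"
folded = [
    "python;PyEval_EvalFrameEx;PyNumber_Add 120",
    "python;PyEval_EvalFrameEx;PyNumber_Add;binary_op1 80",
    "python;PyEval_EvalFrameEx 40",
]

leaf_samples = Counter()
for line in folded:
    stack, count = line.rsplit(" ", 1)
    leaf_samples[stack.split(";")[-1]] += int(count)  # innermost frame

for func, n in leaf_samples.most_common():
    print(func, n)
# Prints PyNumber_Add first, since it has the most leaf samples (120)
```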
