Saturday, September 28, 2013

Summer of 2013 with NumPy and Google

GSoC is a learning process: we learn to work together to solve a big problem by breaking it down into smaller chunks.
This year, 2013, I got the chance to work as a student contributor on NumPy, and the experience was worth every bit of effort I put in. I joined the NumPy community, which is quite friendly and helpful.

I passed the final evaluation, but I value the experience more than the outcome (of course, passing makes me happy): the experience of being part of this program. Nothing could be more inspiring and motivating than working with people from almost every part of the globe, connected simply by their love and passion for code and for building amazing things. I also got the chance to learn from such experienced and enthusiastic developers.

I have written the summary of the project in two posts: Profiling and Bottlenecks. Together with my mentor Charles R Harris and the whole community, we successfully achieved what we planned. Project information and the final code can be seen here [https://google-melange.appspot.com/gsoc/project/google/gsoc2013/arinkverma/79001]

What I gained this year

The most challenging aspect was working with people across different geographical regions. People from different regions think and act differently, as they may have different concerns, and coming up with a product that everyone accepts is itself a strenuous task. Working in a team with huge diversity in time zones and ways of thinking was a colorful experience.

Wednesday, September 25, 2013

GSoC'13 Project Summary 2: NumPy's Bottlenecks and Their Removal

In the last post, I described the tools I used for profiling NumPy. The call graph made my life, if not easy, at least a bit simpler when hunting for time-consuming methods. But the time taken only indicates a possible bottleneck, since many time-consuming methods are already highly optimized. Hence, reading the code critically alongside the call graph is the trick.

Identifying bottlenecks

The following types of bottlenecks were observed:

Overhead for smaller arrays

    • NumPy used to release the Global Interpreter Lock (GIL) for all one- and two-operand loops. But for short arrays, releasing the lock produced relative overhead instead. 
    • So, releasing the GIL for small operations was not beneficial at all (see the sketch after this list).
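A minimal sketch of the fix using NumPy's threading macros; the wrapper function and the threshold value here are illustrative, not NumPy's actual code:

#include <Python.h>
#include <numpy/arrayobject.h>

/* Illustrative threshold: below this many elements, the cost of
   releasing and re-acquiring the GIL outweighs any benefit. */
#define SMALL_LOOP_THRESHOLD 500

static void
run_inner_loop(char **args, npy_intp n, void (*loop)(char **, npy_intp))
{
    NPY_BEGIN_THREADS_DEF;           /* declares the saved thread state */
    if (n > SMALL_LOOP_THRESHOLD) {
        NPY_BEGIN_THREADS;           /* release the GIL only for long loops */
    }
    loop(args, n);
    NPY_END_THREADS;                 /* no-op if the GIL was never released */
}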

Redundant code doing extra work

    • PyArray_CanCastArrayTo used to run its check even if newtype is NULL. In PyArray_FromArray, it was found that when the argument newtype is NULL, it takes the value of oldtype. So, technically, the cast check is performed for the very same type, which is useless (a guard for this case is sketched after this list).
    • NumPy used to convert a Python scalar into its matching NumPy scalar (e.g. PyLong -> int32) and then extract the C value from the NumPy scalar, instead of extracting the C value directly.
    • The floating-point error flags were cleared at every function call and then checked. This happened unconditionally, even when there was no need for it.
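A minimal sketch of such a guard, using the public NumPy C API; the helper function itself is illustrative:

#include <Python.h>
#include <numpy/arrayobject.h>

/* Sketch: short-circuit the expensive cast check when no new dtype
   was requested, or when the dtypes are equivalent anyway. */
static npy_bool
cast_is_possible(PyArrayObject *arr, PyArray_Descr *newtype)
{
    if (newtype == NULL || PyArray_EquivTypes(PyArray_DESCR(arr), newtype)) {
        return 1;   /* an array is trivially castable to its own dtype */
    }
    return PyArray_CanCastArrayTo(arr, newtype, NPY_SAFE_CASTING);
}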

Improper structure

    • For every single operation call, NumPy has to extract the values of buffersize, errormask, and name in order to pack and build an error object. The two functions _extract_pyvals and PyUFunc_GetPyValues together use a significant share of the time. That time is wasted because all of it is spent looking up entries in a Python dict, extracting them, and converting them into C-level data, not once but again and again on every operation. The result also remains unused if no error occurs.
    • The loop selection method for scalar operations is inefficient and time-consuming. It checks all the dtypes associated with a function, one by one, from its types array (sketched after this list).
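A rough sketch of what that linear search looks like; the field names follow PyUFuncObject, but the helper is illustrative:

#include <numpy/ufuncobject.h>

/* Sketch: scan the ufunc's flat types array one signature at a time
   until the first input's type number matches. Every scalar operation
   starts this walk from the beginning again. */
static int
find_loop_index(PyUFuncObject *ufunc, int input_type_num)
{
    int i;
    for (i = 0; i < ufunc->ntypes; i++) {
        if (ufunc->types[i * ufunc->nargs] == input_type_num) {
            return i;    /* index of the matching inner loop */
        }
    }
    return -1;           /* no registered loop for this dtype */
}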

Not-so-effective memory allocation

    • When allocating the return value, NumPy allocates memory twice: once for the array object itself, and a second time for the array data.
    • Also, the shape and strides together have 2*ndim elements, but to hold them NumPy allocates a memory region sized for 3*ndim elements (see the sketch below).
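A minimal sketch of the tighter layout, using NumPy's dimension allocator; the helper is illustrative, not the actual patch:

#include <numpy/arrayobject.h>

/* Sketch: one block of exactly 2*nd elements holds both the shape
   and the strides, instead of a 3*nd block with one third unused. */
static npy_intp *
alloc_dims_and_strides(int nd, npy_intp **strides_out)
{
    npy_intp *dims = PyDimMem_NEW(2 * nd);   /* shape + strides together */
    if (dims == NULL) {
        return NULL;
    }
    *strides_out = dims + nd;    /* strides live right after the shape */
    return dims;
}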


Monday, September 23, 2013

GSoC'13 Project Summary 1: Profiling NumPy

Small NumPy arrays are very similar to Python scalars, but NumPy incurs a fair amount of extra overhead for simple operations. For large arrays this doesn't matter, but for code that manipulates many small pieces of data, it can be a serious bottleneck.
For example:
In [1]: x = 1.0

In [2]: numpy_x = np.asarray(x)

In [3]: timeit x + x
10000000 loops, best of 3: 61 ns per loop

In [4]: timeit numpy_x + numpy_x
1000000 loops, best of 3: 1.66 us per loop

This project involved
  • profiling simple operations like the one above
  • determining possible bottlenecks
  • devising improved algorithms to solve them, with the goal of bringing the NumPy time as close as possible to the Python time.

Profiling tools

The very first step in finding bottlenecks is profiling, for time or space. During the project I used a few tools for profiling NumPy's execution flow and visualizing the resulting data.

Google profiling tool

This is a suite of tools provided by Google. It includes TCMalloc, a heap checker, a heap profiler, and a CPU profiler. Since the project's goal was to reduce time, the CPU profiler was used.

Setting up Gperftools

The following steps set up the Python C-level profiler on Ubuntu 13.04 (for other systems and options, see [1]).
  1. Build it from source. Check out the svn repository from http://gperftools.googlecode.com/svn/trunk/
  2. To build gperftools checked out from the subversion repository, you need autoconf, automake, and libtool installed.
  3. First, run the ./autogen.sh script, which generates ./configure and other files. Then run ./configure
  4. Run 'make check' to execute the self-tests that come with the package. This step is optional but recommended.
  5. After all tests pass, run 'sudo make install' to install the programs and any data files and documentation.
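Once installed, the profiler can be attached to a Python run without rebuilding anything, by preloading the library and setting CPUPROFILE to an output file, e.g. LD_PRELOAD=/usr/local/lib/libprofiler.so CPUPROFILE=numpy.prof python benchmark.py (the library path and script name here are illustrative). The collected profile can then be rendered with pprof, for example as a call graph.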


Saturday, September 14, 2013

Speeding up UFUNC_CHECK_STATUS by avoiding the heavy flag clearing

UFUNC_CHECK_STATUS is a single macro that does both the checking and the clearing of the error flags, and it clears them every time after checking. We should skip the clear operation when it is not needed, as it is a bit expensive and takes a significant amount of time.

The way NumPy detects divide-by-zero, overflow, underflow, etc. is that before each ufunc loop it clears the FP error flags, then after the ufunc loop it checks whether any have become set, and clears them again. I avoided the clear when it is not needed, to save time.

Improvement

Before each ufunc loop, the error flags are now checked first and cleared only if necessary, so the check result in the macro no longer gets ignored as it did before. Earlier, the time taken by PyUFunc_clearfperr() and PyUFunc_getfperr() combined was around 10%, which has now dropped to 1% for operations that don't raise any error. The check-then-clear pattern is sketched below the call graph.

Call graph comparing performance for: x = np.asarray([1]); x+x;
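The same check-then-clear pattern in portable C99 using fenv.h; this is an illustration of the idea, not NumPy's exact macros:

#include <fenv.h>
#include <stdio.h>

/* Sketch: test the FP status flags first and clear them only when
   something was actually raised. The common no-error path then costs
   a single cheap test instead of an unconditional clear. */
static int
check_fp_status(void)
{
    int status = fetestexcept(FE_DIVBYZERO | FE_OVERFLOW |
                              FE_UNDERFLOW | FE_INVALID);
    if (status != 0) {
        feclearexcept(FE_ALL_EXCEPT);   /* clear only on the rare path */
    }
    return status;
}

int main(void)
{
    volatile double x = 1.0, y = 0.0;
    volatile double z = x / y;   /* raises FE_DIVBYZERO */
    (void)z;
    printf("FP status flags set: %d\n", check_fp_status());
    return 0;
}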

More

PR for this change is #3739

Wednesday, September 11, 2013

Improvement in _extract_pyvals

The culprit is *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval); in _extract_pyvals: it alone takes 10% of the time of every operation. It is better to avoid packing the name string every time; instead, assign the pointers to a struct and stash it in thread-local storage. Then no allocation is needed for errobj on each call, where it almost always goes unused.

Implementation

The error object holds the name of the ufunc and the callback pointer:
typedef struct {
    PyObject_HEAD
    char *name;         /* name of the ufunc, used in error messages */
    PyObject *retval;   /* callback / error-handler object */
} PyErrObject;

As a fast path for consecutive identical operations, it is better to cache the error object:
/* Reuse the error object cached in the thread-local dict;
   allocate it only on the first call. */
errvalues = (PyErrObject *)PyDict_GetItem(thedict, PyUFunc_PYVALS_ERROR);
if (errvalues == NULL) {
    errvalues = PyObject_New(PyErrObject, &PyErrObject_Type);
    PyDict_SetItem(thedict, PyUFunc_PYVALS_ERROR, (PyObject *)errvalues);
}
/* Refresh the fields for the current ufunc call. */
errvalues->name = name;
errvalues->retval = retval;
*errobj = (PyObject *)errvalues;
Py_INCREF(errvalues);


Improvement

Call graph with PyErrObject for: x = np.asarray([5.0,1.0]); x+x
Time consumption of _extract_pyvals dropped to 2% from 9.3%.

More

PR for this enhancement is #3686

Monday, September 9, 2013

Scope for improvement in _extract_pyvals

In my last post, I mentioned how the two functions _extract_pyvals and PyUFunc_GetPyValues together use >12% of the time. But the major culprit is *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval); in _extract_pyvals: it alone takes 10% of the time of every operation.

Improvement

Caching Py_BuildValue

The first approach I took was to cache the Py_BuildValue result in thread-local storage via a dict. With this, the time consumption of _extract_pyvals dropped to 4% from 12%.
errorobj caching with PyThreadState_GetDict
But since TLS is a bit unreliable and risky, @juliantaylor advised that it should be avoided. Even after many fixes, the commit for this didn't manage to pass all test cases.


Saturday, August 31, 2013

Almost 12% of the time uselessly consumed in PyUFunc_GetPyValues

For every single operation call, NumPy has to extract the values of buffersize, errormask, and name in order to pack and build an error object. The two functions _extract_pyvals and PyUFunc_GetPyValues together use >12% of the time.

Call graph showing the contribution of PyUFunc_GetPyValues

I think this time is wasted because all of it is spent looking up entries in a Python dict, extracting them, and converting them into C-level data, not once, but again and again on every operation. Instead, these values should be converted once, at load time or when they are set in the first place.