Continuing our coverage of MyPy (check parts #1, #2, and #3 of our “A Day With MyPy” series), this time we want to show you how we applied what we have learned so far by creating a type stub for a package that we use on a daily basis: NumPy.
Our goals regarding this experiment were:
- To make a tangible contribution to NumPy, a project on which we rely.
- To test the usage of MyPy in our production pipelines.
All the code for the NumPy stub is available on GitHub.
MyPy stubs
When you want to add type annotations to code you don’t own, one solution is to write type stubs: files that describe the public interface of a module without any implementation. Given that MyPy allows mixing dynamic and static typing, we decided to write the declarations for the most popular parts of numpy first.
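To make the idea concrete, here is a minimal sketch of what a stub can look like. The module name `shapes` and everything declared in it are hypothetical and only serve as illustration; they are not part of the NumPy stub.

```python
# Hypothetical contents of a stub file named shapes.pyi.
# Every body is "...": the stub only declares signatures for the type checker.
from typing import Sequence, Tuple

Point = Tuple[float, float]

def area(width: float, height: float) -> float: ...

class Polygon:
    def __init__(self, points: Sequence[Point]) -> None: ...
    def perimeter(self) -> float: ...
```

The type checker reads these signatures in place of the real module, so the implementation itself never has to be annotated.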
numpy.ndarray
At the core of numpy are the ``ndarray``s, which are multi-dimensional arrays that hold fixed-size items. Given that it’s the most popular part of the library, and that the rest of numpy is built on it, we decided to start by adding types to its interface.
This is when we encountered our first obstacle: most of numpy is written in C. With a regular package written in Python, we would have walked through the code and then written signatures that match it, adding the type information. This wasn’t possible with numpy. In some cases we used introspection, but we relied mostly on the reference documentation.
The second problem we faced was numpy’s inherent flexibility. Take this example, for instance:
In [1]: import numpy as np
In [2]: np.array('a string')
Out[2]: array('a string', dtype='<U8')
In [3]: np.array(2)
Out[3]: array(2)
In [4]: np.array([1, 2, 3])
Out[4]: array([1, 2, 3])
In [5]: np.array((1, 2, "3"))
Out[5]: array(['1', '2', '3'], dtype='<U21')
The array function is used to create array objects and, as you can see, no matter what you pass as the object argument, it does its best to convert it to homogeneous values that can go into the ndarray it returns. This is great for users, but it is a source of headaches if you want to add type annotations.
Luckily, our type signature for ndarray allows us to be explicit about the type of items stored in the arrays:
class ndarray(_ArrayLike[_S], Generic[_S]): ...
so we can do things like:
my_array = np.array([1, 2, 3])  # type: np.ndarray[int]
You’re probably wondering about the _ArrayLike[_S] class, as it doesn’t exist in the numpy namespace. We wrote this fictional class to describe the array interface that is common to arrays and scalars.
Little gotcha regarding type expressions
While testing the stub we found something that might affect other type stubs for structures that work as containers. Take a look at this example:
import numpy as np

def do_something(array: np.ndarray[bool]):
    return array.all()

some_array = np.array([True, False])  # type: np.ndarray[bool]
if do_something(some_array):
    print('done something')
It all seems fine, and mypy doesn’t complain about it, but if you try to run it, you’ll get the following error:
$ python test_numpy.py
Traceback (most recent call last):
  File "test_numpy.py", line 3, in <module>
    def do_something(array: np.ndarray[bool]):
TypeError: 'type' object is not subscriptable
This makes total sense, because at runtime ndarray is not a descendant of Generic. This is why the typing module has classes like List or Dict, so that type declarations don’t clash with the actual classes. There’s an easy workaround: surround the type declaration with quotes.
import numpy as np

def do_something(array: 'np.ndarray[bool]'):
    return array.all()

some_array = np.array([True, False])  # type: np.ndarray[bool]
if do_something(some_array):
    print('done something')
This way the type expression is stored as a string and is not evaluated at runtime, so no error is raised. Notice that there was no problem with the second declaration, as it was in a comment, and comments aren’t evaluated.
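The difference between the two spellings can be reproduced without numpy. In this sketch, `Box` is a hypothetical stand-in for any container class that is not generic at runtime.

```python
class Box:
    """Hypothetical container that, like the real class above, is not Generic."""
    def __init__(self, items):
        self.items = items

# Subscripting the plain class fails as soon as the expression is evaluated.
try:
    Box[bool]
    subscript_error = None
except TypeError as exc:
    subscript_error = exc

# A quoted annotation is kept as a string when the function is defined,
# so the subscript expression is never executed.
def do_something(box: 'Box[bool]'):
    return all(box.items)

stored_annotation = do_something.__annotations__['box']
result = do_something(Box([True, False]))
```

Here `stored_annotation` is the plain string `'Box[bool]'`, which is exactly what a static checker parses, while the interpreter never tries to subscript `Box`.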
Although this is a valid workaround, we will most likely introduce a class named NDarray (to follow the pattern established by the typing module) that can be used safely in type declarations.
Problems, problems everywhere
We tried our best to provide meaningful type declarations for mypy’s type inference engine, but the dynamic nature of numpy sometimes made it difficult. Take this signature, for example:
def all(self, axis: AxesType = None, out: '_ArrayLike[_U]' = None,
        keepdims: bool = None) -> Union['_ArrayLike[_U]', '_ArrayLike[bool]']: ...
According to the ndarray.all documentation, it returns True when all array elements along a given axis evaluate to True. It actually returns a numpy.bool_ scalar, hence the _ArrayLike[bool] signature. However, if the out parameter is passed, the return value has the same type as out.
The proper way to declare all would have been something like:
@overload
def all(self, axis: AxesType = None, keepdims: bool = None) -> '_ArrayLike[bool]': ...
@overload
def all(self, axis: AxesType = None, keepdims: bool = None, *,
        out: '_ArrayLike[_U]') -> '_ArrayLike[_U]': ...
But due to a mypy bug we had to go with the former declaration. Once the bug has been dealt with, we’ll improve the declarations to help mypy’s type inference engine.
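For readers unfamiliar with `typing.overload`, the pattern can be sketched on a toy class. `Flags` and its list-based `out` buffer are hypothetical and only mimic the shape of the `all` signatures; they are not how numpy implements it.

```python
from typing import List, Optional, Union, overload

class Flags:
    """Hypothetical container used only to illustrate overloaded signatures."""
    def __init__(self, values: List[bool]) -> None:
        self.values = values

    @overload
    def all(self) -> bool: ...
    @overload
    def all(self, *, out: List[int]) -> List[int]: ...

    def all(self, *, out: Optional[List[int]] = None) -> Union[bool, List[int]]:
        # Inside the method body, "all" refers to the builtin function.
        result = all(self.values)
        if out is None:
            return result
        out[:] = [int(result)]  # write the result into the caller's buffer
        return out

plain = Flags([True, True]).all()
buffer = [7]
written = Flags([True, False]).all(out=buffer)
```

With the two overloads, a checker can infer `bool` for the first call and `List[int]` for the second, instead of a `Union` that the caller has to narrow by hand.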
We also encountered problems within numpy itself.
In [1]: import numpy as np
In [2]: nda = np.random.rand(4, 5) < 0.5
In [3]: ndb = np.arange(5)
In [4]: nda.all(axis=0, out=ndb)
Out[4]: array([0, 1, 0, 0, 0])
In [5]: nda.all(0, ndb)
(traceback not shown)
TypeError: data type not understood
According to the argument specification of ndarray.all, there shouldn’t be any problem with the last statement. In the implementation, however, the positional arguments do not exactly match the ones in the docs.
With these problems in mind, we tried the stub against some of our own code. Here’s a snippet that shows what we found:
$ mypy --strict-optional --check-untyped-defs lr.py
lr.py:2: error: No library stub file for module 'scipy'
lr.py:2: note: (Stub files are from https://github.com/python/typeshed)
lr.py:5: error: No library stub file for module 'sklearn'
lr.py:7: error: No library stub file for module 'sklearn.utils.fixes'
lr.py:8: error: No library stub file for module 'sklearn.utils.extmath'
lr.py:9: error: No library stub file for module 'sklearn.datasets'
lr.py:10: error: No library stub file for module 'sklearn.linear_model'
lr.py: note: In member "fit" of class "LR":
lr.py:17: error: "module" has no attribute "unique"
lr.py: note: In member "decision_function" of class "LR":
lr.py:33: error: "module" has no attribute "dot"
lr.py: note: In member "predict" of class "LR":
lr.py:39: error: "module" has no attribute "int"
lr.py: note: In member "predict_proba" of class "LR":
lr.py:46: error: "module" has no attribute "dot"
lr.py: note: In member "likelihood" of class "LR":
lr.py:59: error: "module" has no attribute "dot"
lr.py:65: error: "module" has no attribute "sum"
lr.py:65: error: "module" has no attribute "dot"
lr.py:71: error: "module" has no attribute "dot"
Besides the missing stubs for scipy and sklearn (we might tackle those in the future), most of the problems came from the fact that the developer used the array operation functions (like dot or sum) defined in the numpy namespace instead of the methods defined on the ndarray class. Here’s an example of this:
def decision_function(self, X_test):
    scores = np.dot(X_test, self.weights[:-1].T) + self.weights[-1]
    return scores.ravel() if len(scores.shape) > 1 and scores.shape[1] == 1 else scores
Here, the developer used np.dot instead of X_test.dot. We found that this happens quite often (at least in our code), so we’re going to add type declarations for the most common functions defined in the top-level numpy namespace.
Conclusions
During one of our meetings we reviewed our findings and decided that we could improve the stub with a little bit of user input. So if you think you’re up to it, please take a look at the code and give us your feedback. Even if you think we did everything wrong, that’ll be a great help for us, as we aim to make a meaningful contribution to the NumPy, MyPy, and Python communities in general.