- fewer dependencies for my package
I've written the average() and standard_deviation() functions at least a couple of dozen times, because it doesn't make sense to require numpy in order to summarize, say, benchmark timing results.
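For comparison, the helpers I keep rewriting fit in about a dozen lines of pure Python. A minimal sketch (using the n-1 sample denominator for the standard deviation):

import math

def average(values):
    # Arithmetic mean of a non-empty sequence.
    return sum(values) / float(len(values))

def standard_deviation(values):
    # Sample standard deviation, with the usual n-1 denominator.
    mean = average(values)
    variance = sum((x - mean) ** 2 for x in values) / float(len(values) - 1)
    return math.sqrt(variance)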
- reduced import time
NumPy and SciPy were designed with math-heavy users in mind: people who start Python once and then either work in the REPL for hours or run non-trivial programs. They were not designed for lightweight use in command-line scripts.
"import scipy.stats" takes 0.25 second on my laptop. In part because it brings in 439 new modules to sys.modules. That's crazy-mad for someone who just wants to compute, say, a Student's t-test, when the implementation of that test is only a few dozen lines long. (Partially because it depends on a stddev() as well.)
Sure, 0.25 seconds isn't all that long, but that's also on a fast local disk. On one networked filesystem I worked with (Lustre), the stat calls were so slow that just starting Python took over a second. We fixed that by switching to a zip import of the Python standard library and deferring imports unless they were needed, but there's no simple solution like that for SciPy.
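In scripts, one partial workaround is to defer the heavy import into the function that needs it, so the cost is only paid on the first call. A minimal sketch, assuming the two-sample test in scipy.stats.ttest_ind is what's wanted:

def students_t_test(a, b):
    # Defer the heavy import until someone actually needs the test,
    # so scripts that never call this never pay the startup cost.
    import scipy.stats
    return scipy.stats.ttest_ind(a, b)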
- less confusing docstring/help
Suppose you read in the documentation that SciPy implements the Student's t distribution as scipy.stats.t.
>>> import scipy.stats
>>> scipy.stats.t
<scipy.stats.distributions.t_gen object at 0x108f87390>
It's a bit confusing to see scipy.stats.distributions.t_gen appear, but okay, it's some implementation thing.
Then you do help(scipy.stats.t) and see
Help on t_gen in module scipy.stats.distributions object:
class t_gen(rv_continuous)
| A Student's T continuous random variable.
|
| %(before_notes)s
|
...
|
| %(example)s
Huh?! What are %(before_notes)s and %(example)s?
The answer is that scipy.stats auto-generates many of the distribution classes, including their docstrings. Only, help() gets confused by this: help() uses the class docstring, while SciPy modifies the docstring of the generator instance. To see the correct docstring you have to print it directly:
>>> print scipy.stats.t.__doc__
A Student's T continuous random variable.
Continuous random variables are defined from a standard form and may
require some shape parameters to complete its specification. Any
optional keyword parameters can be passed to the methods of the RV
object as given below:
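The mechanics are easy to reproduce with a toy class - t_gen here is just a stand-in for the real SciPy machinery. pydoc documents type(obj) for plain instances, which is why help() and __doc__ disagree:

class t_gen(object):
    # The class docstring holds the raw template, as in SciPy.
    "A Student's T continuous random variable. %(before_notes)s %(example)s"

t = t_gen()
# SciPy fills in the template on the *instance*, roughly like this:
t.__doc__ = t.__doc__ % {"before_notes": "...", "example": "..."}

print t.__doc__   # the filled-in instance docstring
help(t)           # pydoc documents type(t), so the raw template shows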
Well, help(scipy.optimize.nonlin.Anderson) has the same problem, but you're right that that failure mode is rare, and that numpy/scipy has good documentation. However, in the context of a stats library, I think it's okay to point out that scipy.stats has some annoying parts. ;)
In all honesty, I seldom use NumPy and rarely use SciPy, so I can't judge that deeply. I know that when I read their respective code bases I get a bit bewildered by the many "import *" and other oddities. It doesn't feel right to me. I know the reason for most of the choices - to reduce API hierarchy and simplify usability for their expected end-users - but their expectations don't match mine.
So I looked at more of the documentation. I started with scipy/integrate/quadpack.py. The docstring for quad() says, in essence, "this docstring isn't long enough, so call quad_explain() to get more documentation." I've never seen that technique used before. The Python documentation says "see this URL" for those cases.
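For the curious, it looks like this (an interactive sketch; quad_explain() simply prints several more screenfuls of documentation):

>>> from scipy.integrate import quad_explain
>>> quad_explain()   # prints the extended documentation for quad()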
Again, this is a difference in expectations. I argue that NumPy and Python have different end-users in mind. Which is entirely reasonable - they do! But it means that it's very difficult to simply say "add numpy to part of the standard library."
There's also a level of normalization that I would want should numpy become part of the standard library. For example, does out-of-range input raise ValueError or RuntimeError? scipy/ndimage/filters.py does both, and I don't understand the distinction between the two.
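To make that concrete, here's the kind of single convention I'd want, as a hypothetical helper - the name and signature are mine, not SciPy's:

def check_axis(axis, ndim):
    # One rule everywhere: out-of-range input always raises ValueError.
    if not -ndim <= axis < ndim:
        raise ValueError("axis %d is out of range for %d-dimensional input"
                         % (axis, ndim))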
Now, in the larger sense, I know the history. RuntimeError was more common in Python, and used as a catch-all exception type. Its existence in numpy reflects its long heritage. It's hard to change that exception type because programs might depend on it.
But it means that integrating all of numpy into the standard library is not going to work: either it breaks existing numpy-based programs, or the merge inherits a large number of oddities that most Python programmers will not be comfortable with.
Actually, I don't think the import * in numpy is anything other than a historical artefact. NumPy just happens to be one of the oldest still-widely-used Python libraries (considering that numpy started as Numeric), as you point out. As for import speed, have you considered using a lazy import in your script?
I don't see numpy being integrated into Python anytime soon. I don't think it would bring much, and one would have to drop the performance enhancements that rely on BLAS/LAPACK.
I think installing has improved a lot, and once pip + wheel matures, it should be easy to pip install numpy on Windows.
Robert Kern: Your use case isn't so typical and so suffers on the import time end of the balance
Stéfan van der Walt: I.e. most people don't start up NumPy all the time -- they import NumPy, and then do some calculations, which typically take longer than the import time. ... You need fast startup time, but most of our users need quick access to whichever functions they want (and often use from an interactive terminal).
I went back to the topic last year. Currently 25% of the import time is spent building some functions which are then exec'ed. At every single import. I contributed a patch, which has been hanging around for a year. I came back to it last week. I'll be working on an updated patch.
There's also about 7% of the startup time spent because numpy.testing imports unittest in order to get TestCase, so people can refer to numpy.testing.TestCase - even though numpy does nothing to TestCase, and some of numpy's own unit tests use unittest.TestCase instead. Sigh. And there's nothing to be done to improve that case, short of removing numpy.testing.TestCase and breaking whoever depends on it.
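For anyone who wants to see where an import's time goes, one rough stdlib-only approach is to profile the import itself in a fresh interpreter. A sketch:

import cProfile
# Must run before numpy is first imported, or the cached module hides the cost.
cProfile.run("import numpy", sort="cumulative")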
Regarding the age - yes, you're right. BTW, parts of PIL started in 1995, making it the oldest widely used package, I think. Do you know of anything older?