Hacker News | neuromantik8086's comments

Because endowments aren't piggy banks. They're regulated by UPMIFA [1], which states that universities can't draw down more than 7% of the total funds in the endowment unless they can prove that it would be prudent to do so, and the burden of proof is extremely high.

Even without UPMIFA, endowments are a mix of unrestricted and restricted funds, and donor restrictions can and do prevent universities from using money when they might otherwise want to. Even if a university desired to draw down the full 7% allowed without triggering red tape, it's unlikely that they would be able to draw it all without running afoul of donor intent.[2]
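
To put rough numbers on it (all of the figures below are invented purely for illustration; they aren't from any actual endowment):

    # Toy illustration of the 7% ceiling plus donor restrictions.
    # Every number here is made up for illustration only.
    endowment_total = 1_000_000_000        # hypothetical $1B endowment
    restricted_fraction = 0.70             # assume 70% is donor-restricted

    max_draw_under_7pct = 0.07 * endowment_total                # $70M ceiling
    unrestricted = (1 - restricted_fraction) * endowment_total  # $300M

    # Even staying under the 7% ceiling, any given draw still has to be
    # consistent with whatever purposes donors earmarked the money for.
    print(f"7% ceiling:          ${max_draw_under_7pct:,.0f}")
    print(f"Unrestricted corpus: ${unrestricted:,.0f}")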

If anything, the system is to blame here, not the universities themselves necessarily (not to excuse bad apples in academic administration).

[1] https://en.wikipedia.org/wiki/Uniform_Prudent_Management_of_...

[2] https://en.wikipedia.org/wiki/Donor_intent


There are some efforts in this vein within academia, but they are very weak in the United States. The U.S. Research Software Engineer Association (https://us-rse.org/) represents one such attempt at increasing awareness about the need for dedicated software engineers in scientific research and advocates for a formal recognition that software engineers are essential to the scientific process.

In terms of tangible results, Princeton at least has created a dedicated team of software engineers as part of their research computing unit (https://researchcomputing.princeton.edu/software-engineering).

Realistically, though, even if the necessity of research software engineering were acknowledged at the institutional level at most universities, there would still be the problem of universities paying way below market rate for software engineering talent...

To some degree, universities alone cannot effect the change needed to establish a professional class of software engineers that collaborate with researchers. Funding agencies such as the NIH and NSF are also responsible, and need to lead in this regard.


Thank you for the link to the Princeton group. That is encouraging. Aside from that, I share your lack of optimism about the prospects for this niche.

Most research programmers, in my experience, work in a lab for a PI. Over time, these programmers have become more valued by their team. However, they often still face a hard cap on career advancement. They are generally paid considerably less than they'd earn in the private sector, with far less opportunity for career growth. I think they often make creative contributions to research that would be "co-author"-level worthy if they came from someone on an academic track, but they are frequently left off publications. They don't get the benefits that come with academic careers, such as sabbaticals, and they often work to assignment, with relatively little autonomy. The right career path and degree to build the skills required for this kind of programming are often a mismatch for the research-oriented degrees that are essential to advancement in an academic environment (including leadership roles that aren't research roles).

In short, I think there is a deep need for the emerging "research software engineer" you mention, but at this point, I can't recommend these jobs to someone with the talent to do them. There are a few edge cases (lifestyle, a trailing spouse in academia, visa restrictions), but overall, these jobs are not competitive with the pay, career growth, autonomy, and even job security elsewhere (university jobs have a reputation for job security, but many research programmers are paid purely through a grant, so these are often 1-2 year appointments that can be extended only if the grant is renewed).

The Princeton group you linked to is encouraging - working for a unit of software developers who engage with researchers could be an improvement. Academia is still a long, long way away from building the career path that would be necessary to attract and keep talent in this field, though.


Just as a quick bit of context here, Konrad Hinsen has a specific agenda that he is trying to push with this challenge. It's not clear from this summary article, but if you look at the original abstract soliciting entries for the challenge (https://www.nature.com/articles/d41586-019-03296-8), it's a bit clearer that Hinsen is using this to challenge the technical merits of Common Workflow Language (https://www.commonwl.org/; currently used in bioinformatics by the Broad Institute via the Cromwell workflow manager).

Hinsen has created his own DSL, Leibniz (https://github.com/khinsen/leibniz ; http://dirac.cnrs-orleans.fr/~hinsen/leibniz-20161124.pdf), which he believes is a better alternative to Common Workflow Language. This reproducibility challenge is in support of this agenda in particular, which is worth keeping in mind; it is not an unbiased thought experiment.


Konrad Hinsen is an expert in molecular bioinformatics, has contributed significantly to Numerical Python, and has published extensively on the topic of reproducible science and algorithms - see his blog.

The fact that he might favor different solutions from you does not mean that he is pushing some kind of hidden agenda.

If you think that Common Workflow Language is a better solution, you are free to explain in a blog why you think this.

Are you saying that the reproducibility challenge poses a difficulty for Common Workflow Language? If so, would that not rather support Hinsen's point - without implying that what he suggests is already a perfect solution?


I never said that Konrad Hinsen's agenda was hidden; in fact, it's not at all hidden (which is why I linked the abstract). It's just that this context isn't at all clear in the Nature write-up, and it's relevant to take into account.

I haven't taken the time to seriously contemplate the merits of CWL vs Leibniz, although my gut instinct is that we don't really need another domain-specific language for science given the profusion of such languages that already exist (Mathematica, Maple, R, MATLAB, etc). That's the extent of my bias, but again, it's a gut instinct and not a comprehensive well-reasoned argument against Leibniz.


I never answered your last question so here goes:

> Are you saying that the reproducibility challenge poses a difficulty for Common Workflow Language?

I don't actually understand how the reproducibility challenge undermines the validity of using CWL / flow-based programming as an approach to promoting reproducible analyses. There certainly wasn't anything in the article that made me think that CWL was challenged, but Hinsen explicitly called out CWL in the abstract, which implies that for some reason he thinks, a priori, that it's a non-solution. He never justifies this implied assumption further, and as near as I can tell, none of the attempted replications used a flow-based language.

If Hinsen really aimed to argue against the viability of CWL/flow-based programming as an approach to reproducibility, he would have done a systematic comparison of historical analyses that used a flow-based system (like National Instruments' LabVIEW or Prograph) vs. analyses that are more similar to the approach that he seems to favor (i.e., analyses using Mathematica or Maple).

While I find the challenge interesting to follow, and the retrocomputing geek in me finds it fun, I don't actually understand what it really accomplished other than being a fun diversion. Assuming that an analysis was written in a Turing-complete language and didn't use non-deterministic algorithms, you should theoretically be able to reproduce the results exactly on modern hardware; with non-deterministic algorithms, I would imagine the result would be "close enough" within some kind of confidence interval. You may need to go to great lengths (in terms of emulating instruction sets, ripping tapes, etc.), but I think a visit to any retrocomputing festival or computer history museum would have made that pretty obvious from the outset.
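
To make the deterministic-vs-"close enough" distinction concrete, here's a minimal Python/NumPy sketch (my own toy illustration, not code from the challenge or from any of the entries): pin the RNG seed and a stochastic computation becomes bit-for-bit repeatable; leave it unpinned and you can only check agreement within a tolerance.

    import numpy as np

    def noisy_mean(seed=None, n=100_000):
        """Monte Carlo estimate of the mean of a standard normal."""
        rng = np.random.default_rng(seed)
        return rng.normal(size=n).mean()

    # With a pinned seed, the computation is exactly repeatable...
    assert noisy_mean(seed=42) == noisy_mean(seed=42)

    # ...while unseeded runs only agree statistically, so you check a
    # tolerance rather than bitwise equality.
    a, b = noisy_mean(), noisy_mean()
    assert abs(a - b) < 0.05   # "close enough" for this toy case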


There seems to be some misunderstanding here.

CWL is intended for stringing together other programs. It is useful for reproducibility in that it attempts to provide a fairly specific description of the runtime environment needed to execute a program, and also abstracts site-specific details such as file system layout or batch system in use. CWL platforms such as Arvados also generate comprehensive provenance traces which are vital for going back and reviewing how a data result was produced.

Leibniz seems to be a numerical computing language for describing equations, which is more similar to something like NumPy or R. It seems like an apples-and-oranges comparison.

The original call-out is weird, because CWL did not exist 10 years ago, so you can't yet answer the question of whether it facilitates running 10-year-old workflows.


Guix is one of several tools that have been touted as a solution. Another one that is quite popular in HPC circles is Spack (https://spack.readthedocs.io/en/latest/).

At my institute, we actually tried out Spack for a little bit, but consistently felt like it was implemented more as a research project than as something production-level and maintainable. In large part, this was due to the dependency resolver, which attempts to tackle some very interesting CS problems, I gather (although this is a bit above me at the moment; these problems are discussed in detail at https://extremecomputingtraining.anl.gov//files/2018/08/ATPE...), but which produces radically different dependency graphs when invoked with the same command across different versions of Spack.

I've since come to regard Spack as the kind of package manager that science deserves, with conda being the more pragmatic / maintainable package manager that we get instead. Spack/Guix/nix are the best solution in theory, but they come with a host of other problems that make them less desirable.


> Spack/Guix/nix are the best solution in theory, but they come with a host of other problems that make them less desirable.

I would be quite interested to learn more about what these problems are, in your experience. I've only tried Guix (on top of Debian and Arch), and while it is definitely more resource-hungry (especially in terms of disk space), I don't perceive it as impractical.


As someone coming from the computing side of things, I found nix to be quite difficult to grok well enough to write a package spec, and guix was pretty close, at least in part because of the whole "packages are just side-effects of a functional programming language" idea. Nix, at least, also suffers from a lot of "magic"; if you're trying to package, say, an autotools package, then the work's done for you - and that's great, right up until you try to package something that doesn't fit into the existing patterns and you're in for a world of hurt.

Basically, the learning curve is nearly vertical.


> guix was pretty close, at least in part because of the whole "packages are just side-effects of a functional programming language" idea

This must be a misunderstanding. One of the big visible differences of Guix compared to Nix is that packages are first-class values.


You're right; on further reading I can see guix making packages the actual output of functions. I do maintain that the use of a whole functional language to build packages raises the barrier to entry, but my precise criticism was incorrect.


I can only speak to Spack in particular, but the main issue that I found with it was balancing researcher expectations for package installation speed with compile times. For most packages, compile times aren't a huge problem, but compilers themselves can take days to build, and it isn't unheard of for researchers to want a recent version of gcc for some of their environments.

In theory this isn't an issue with Spack (assuming that you have a largely homogeneous set of hardware or don't use CPU family-specific instruction sets), since you can set up cached, pre-compiled binaries on a mirror server (similar to a yum repo) and have people install from there.

Spack, however, has a lot of power/complexity - a lot of untamed power, which means that bugs can sometimes be more likely than in other, more mature (or mature-ish) package managers. Namely, Spack allows you to specify not only the version number of a package, but also the compiler that you use to build that package, specific versions of dependencies that you want to use, which implementation of an API you want to use (e.g., MPICH or OpenMPI for MPI), and compiler flags for that package. When you run an install command / specify what you want to install, Spack then performs dependency resolution and "concretizes" a DAG that fulfills all of the constraints.

The issue that I ran into was that if you don't specify everything, Spack makes decisions for you about which version of a dependency, which compiler, etc. to use (i.e., it fills in free variables in a space with a lot of dimensions). This would be fine and dandy normally, except that the version of Spack that I used occasionally constructed totally different graphs for the same "spack install gcc" command (if I recall correctly; take all of this with a grain of salt b/c I might be misremembering). This meant that it wouldn't use cached versions of gcc that had already been built, and it ended up rebuilding minor variants of gcc with options I didn't care about.
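
As a toy mental model of what I mean (this is emphatically not Spack's actual resolver, just a sketch of how unspecified fields become free variables; the defaults below are hypothetical):

    # Toy sketch of "concretization": every field you leave unspecified is a
    # free variable that the resolver fills in from its current defaults.
    # Not Spack's real algorithm, just the shape of the problem.
    DEFAULTS = {
        "version": "9.2.0",        # hypothetical defaults; these shift
        "compiler": "gcc@8.3.0",   # between resolver/package releases
        "mpi": "openmpi",
        "cflags": "-O2",
    }

    def concretize(spec):
        """Fill in the unspecified fields of a spec from the defaults."""
        concrete = dict(DEFAULTS)
        concrete.update(spec)      # user-pinned constraints win
        return concrete

    # "spack install gcc" with nothing pinned: any change in the defaults
    # quietly yields a different concrete spec, i.e. a binary-cache miss
    # and another long rebuild.
    print(concretize({"name": "gcc"}))
    print(concretize({"name": "gcc", "version": "7.4.0", "cflags": "-O3"}))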

At National Labs and larger outfits, the trade-offs between this kind of complexity/power and the accompanying bugginess (Spack has yet to hit 1.0) seem to favor complexity/power while accepting these sorts of bugs. But I don't work at a larger outfit; my group didn't need that level of power/control over dependencies and instead needed something that "just worked" and would allow researchers to install packages independently of us (IT people). conda (mostly) fit the bill for this. I still think that Spack is the future and it has a special place in my heart, but it will have to be more stable for me to want to use it in production.


The COO majored in Music Technology at Oberlin. That's quite a bit more technical than most people realize. TIMARA (the music tech program at the Oberlin Conservatory) involves a decent amount of programming and/or audio engineering. To put that in perspective, the founder of MacroMind/Macromedia (Marc Canter) is also an alumnus of TIMARA.


I prefer the following:

"It felt like a yuppie aquarium."

https://www.reddit.com/r/finance/comments/c93agd/wework_isnt...


As others have pointed out, what you're describing isn't a fundamentally new idea or even that revolutionary. You're basically describing a database filesystem. Onne Gortner attempted an implementation of this concept in 2004 as part of his/her master's thesis (see http://dbfs.sourceforge.net/). Systems like Spotlight are effectively a partial implementation of this concept - OS X essentially has a hybrid setup where there's both a database and a conventional filesystem running in parallel. Going back further, locate (first implemented in 1982) could almost be viewed as a proto-Spotlight. Gmail's labels/tags are another example of a mainstream implementation of this idea.
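
If it helps make the idea concrete, the core of a "database filesystem" is just files-as-rows plus queryable tags. Here's a toy Python/sqlite3 sketch (my own illustration, not how DBFS, Spotlight, or Gmail actually implement it):

    import sqlite3

    # Toy "database filesystem": file paths as rows, tags as queryable
    # metadata. A conceptual sketch only.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT);
        CREATE TABLE tags  (file_id INTEGER, tag TEXT);
    """)
    db.execute("INSERT INTO files VALUES (1, '/home/me/taxes_2019.pdf')")
    db.execute("INSERT INTO files VALUES (2, '/home/me/trip_photos.zip')")
    db.executemany("INSERT INTO tags VALUES (?, ?)",
                   [(1, "finance"), (1, "2019"), (2, "travel"), (2, "2019")])

    # Instead of walking a directory tree, you query by attribute, much like
    # applying a Gmail label or running a Spotlight metadata search.
    for (path,) in db.execute(
            "SELECT f.path FROM files f JOIN tags t ON t.file_id = f.id "
            "WHERE t.tag = '2019'"):
        print(path)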


the larger point is more interesting here. enterprise software sucks because it generally has to fit within the well-accepted boxes in order to meaningfully interface with all the other stuff. that implementation might be a little better than the last one...

but what if the overall model/structure is really what sucks?


We wouldn't have this problem if people just used application-layer protocols and federated services like the early internet.


Wait, why wouldn’t we have these problems? Back in the 1980s, if a university campus connection went down, you couldn’t telnet in or read your university POP2 email remotely. It was down.

The only difference between then and now is that we’re online (seemingly) at every waking minute expecting a hundred different services to be functional at any given moment.


Modern services such as reddit and Twitter effectively usurp the role that Usenet/NNTP and similar distributed protocols used to fulfill, but without the advantage of decentralization / lack of large single points of failure that such protocols embraced. That's what I was getting at, and maybe I'm full of shit.

In the '80s, if a university campus internet connection went down, only that university was affected. Now, when a single AWS availability zone goes down, a much wider swath of users is impacted. Such consolidation / centralization shows a disregard for the spirit of the early internet and the design considerations that went into it.

Again, maybe I'm full of shit. Lots of people here seem to think so.


Resource Public Key Infrastructure, but ISPs are too cheap to actually implement it.


Maybe I'm being obtuse, but doesn't using a configuration management tool to deploy black-box Docker containers eliminate many of the advantages of using config management in the first place?


So you’re asking why not simply use Ansible to deploy all this software? Because that would be anything but simple, and it would negate almost all the benefits of Docker, like easy updates and immutability. This is the best of both worlds, in my opinion: Ansible handles deploying the configuration that Docker then uses.

Additionally, the plan is to move to Kubernetes soon for multi-node deployment, and that wouldn’t really be possible without Docker.

And to be clear, some software is installed directly by Ansible, where it makes sense to do so.


Yes, lol.

