Logistic regression from scratch (philippmuens.com)
156 points by pmuens on June 25, 2020 | 61 comments


Very interesting seeing people in the comments debate what is a very basic thing taught in any stats/econometrics class.

The idea behind binary regression (Y is 0 or 1) is that you use a latent variable Y* = beta X + epsilon.

X is the matrix of independent variables, beta is the vector of coefficients, and epsilon is an error term that captures whatever X can't explain.

Y thus becomes 1 if Y* >0 and 0 otherwise.

Seeing how Y is binary, we can model it using a Bernoulli distribution with a success probability P(Y=1) = P(Y* >0) = 1 - P(Y* <=0) = 1 - P(epsilon <= -beta X) = 1 - CDF_epsilon(-beta X)

Technically you can use any function that maps R to [0, 1] as a CDF. If its density is symmetrical then you can directly write the above probability as CDF(beta X). The two usual choices are either the normal CDF which gives the Probit model or the Logistic function (Sigmoid) which gives the Logit model. With the CDF known you can calculate the likelihood and use it to estimate the coefficients.

People prefer the Logit model because its coefficients are interpretable in terms of log-odds and the function has some nice numerical properties.

That's all there is to it really.
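
For the curious, here is a minimal sketch of that latent-variable story in Python (made-up coefficients, using numpy and statsmodels; not from the article):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 10_000
    X = sm.add_constant(rng.normal(size=(n, 2)))  # design matrix with an intercept column
    beta = np.array([0.5, 1.0, -2.0])             # "true" coefficients, chosen arbitrarily

    # Latent variable Y* = X beta + epsilon; logistic noise corresponds to the Logit model
    eps = rng.logistic(size=n)
    y = (X @ beta + eps > 0).astype(int)          # Y = 1 iff Y* > 0

    logit = sm.Logit(y, X).fit(disp=0)
    probit = sm.Probit(y, X).fit(disp=0)          # same data, normal CDF instead of logistic
    print(logit.params)                           # roughly recovers beta
    print(probit.params)                          # similar fit, coefficients on a different scale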


fyi, generally when explaining things to non-practitioners, it's a detractor to add qualifiers like

- this is a very basic thing

- that's all there is to it

because although it is a basic thing to you, it's not a basic thing to someone who hasn't spent the same time studying all the concepts beforehand.

This is generally why you have to take a class to grok stats rather than just read some reference material and definitions.

It's similar to programming environments where the senior dev says something that requires a lot of context is just really simple.

Monads are Simply Just monoids in the category of endofunctors, after all, they are really Simple, that's all there is to it! (what the subject hears: This is so simple to me, why aren't you smart enough to get this simple concept? You should know this already—I shouldn't even have to make this comment! Why haven't you been properly educated, like I have? )


The author of the parent explains their view well I think (not directed at statistics students). But I would like to make the more general complaint that this attitude seems pervasive in maths. People often fail to include enough detail in their derivations, proofs, or explanations, and students are left confused and unsure because they can't see how to get from step 3 to step 4.


I wholeheartedly agree with you, in my case I was typing my comment on my phone while waiting on the train. I wanted to get into as many details as possible but not having access to easily typed math notation made my task harder.


I totally agree, but I was explaining that to practitioners, or at least people who consider themselves data scientists. I agree that it is by no means a beginner-friendly explanation.


Is that what some people call the 'mansplain' method?


no. all explanations must make some assumption about what readers already know, or else every explanation would begin with teaching english.

also, mansplaining generally refers to assuming people (women, usually) have less pre-existing knowledge than they actually do.


Thanks for clarifying. So condescending is not necessarily mansplaining. But mansplaining IS condescending.


so I literally teach statistics to undergraduates...a couple thoughts:

1) agree with other comments that starting with 'very basic' ain't going to work

2) logistic regression actually isn't a common topic in an undergraduate statistics course. I cover it on the second-to-last day as a 'here is where you can extend it'

3) this isn't a very effective explanation.

4) a big background factor is the misconceptions about statistics that are ingrained in human beings. Learning statistical tests is hard, especially for many engineers and similar STEM types because it is typically the first (and sometimes only) class that deals with concepts of error and variance. Everything else presents an absolute theoretical model. On average, it breaks their brain...


I've always found it a little simplistic that the default cut-off, in most statistical software, for whether something should be 0 or 1, is 0.5.

(i.e. > 0.5 equals 1, and < 0.5 equals 0).

This seems to be a "rarely-questioned assumption".

Is there a reason why this is considered reasonable? And is there a name for the cut-off (i.e., if I were to want to change the cut-off, what keyword should I search for inside the software's manual?)?


statistical software never does this.

Almost all statistical models give you probabilities and it is up to the domain to determine the cutoff. You can see this clearly in logistic regression: it gives you a probability, not a 1 or 0.

When a clinician gives me a dataset, I build the model and leave the cut-off to them. It is not my domain or expertise to tell them where the cut-off is, and it keeps the responsibility with them; I'm not responsible for it unless they don't know and are asking me for a reasonable cut-off.

An example is people opting for artificial insemination because of a fertility problem. I believe the cut-off there is above 60% because of the trade-off.

You can read more about it in Regression Modeling Strategies by Dr. Frank Harrell.


From a stats perspective the cutoff is included in the coefficients. If you use a design matrix (add a column of 1s to your variables) you get, in non-matrix notation, beta_0 * 1 + beta_1 X_1 + ..., so the threshold can be considered beta_0.

In the software, you can get classification models to output class probabilities instead of class labels. You can then use whatever threshold you like to turn those probabilities into labels.

You may see it referred to as the "discrimination threshold". Varying that threshold is how ROC curves are constructed.
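
In scikit-learn, for example, it might look like this (a quick sketch, not tied to any particular package's defaults):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve

    X, y = make_classification(n_samples=1000, random_state=0)
    clf = LogisticRegression().fit(X, y)

    proba = clf.predict_proba(X)[:, 1]   # class probabilities, not labels
    labels = (proba >= 0.7).astype(int)  # apply whatever discrimination threshold you like

    # Sweeping that threshold over [0, 1] is exactly how the ROC curve is traced out
    fpr, tpr, thresholds = roc_curve(y, proba)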


Would the threshold be beta_0 in every case, or only when you have subtracted the mean from your data?


You don't want to demean your dependent (response) binary variable. So you almost always want to keep beta0 to control for any imbalance in your dependent var.


I meant demeaning the independent variables. My understanding is that beta_0 will have the meaning curiousgal attaches to it only if you demean your independent variables.


I see. But I think after demeaning X, beta0 will just have a special meaning... log odds of the average case. Nothing more.


My understanding of Logistic Regression is that it's linear regression on the log-odds, which are then converted to probabilities with the sigmoid/softmax function. This formulation allows one to do direct linear regression on the probabilities, without the unpleasant side effects of just using a linear model as-is. A mathematical justification for doing this is given by the generalized linear model formulation.


From what I recall this is a bit off -- not a bad mental model, but the math plays out differently.

Linear regression has a closed form solution of X projected onto Y: \hat{\beta} = (X'X)^{-1} X' Y

It is equivalent to the Maximum Likelihood Estimator (MLE) for linear regression. For logistic regression, however, the MLE gives different estimates than you would get by running least squares on the log-odds.

Linear regression on {class_inclusion} = XB gives the linear probability model, which has limited utility. The required transform is covered by another commenter.
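
For concreteness, the OLS closed form in numpy on made-up data (a sketch; logistic regression has no analogous closed form and is fit iteratively, e.g. by Newton/IRLS):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.c_[np.ones(500), rng.normal(size=(500, 2))]       # design matrix with intercept
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=500)

    # \hat{beta} = (X'X)^{-1} X'Y, computed via a linear solve rather than an explicit inverse
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)   # close to [1.0, 2.0, -0.5]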


You're right, my model was a bit off. Thanks for pointing that out; I had forgotten about that.


There is a different mathematical justification as well in terms of Bayesian reasoning.

The claim is that “evidence” in Bayesian reasoning naturally acts upon the log-odds, mapping prior log-odds to posterior log-odds additively. To see this, calculate from the definition,

    Odds(X | E) := Pr(X | E) / Pr(¬X | E)
                 = Pr(X ∧ E) / Pr(¬X ∧ E)
                 = (Pr(E | X) × Pr(X)) / (Pr(E | ¬X) × Pr(¬X))
                 = LR(E, X) × Odds(X),
where LR is the usual likelihood ratio.

So when we take the logarithm of both sides, we find that new evidence adds some quantity—the log of the likelihood ratio of the evidence—to our log of prior probability, in this phrasing of Bayes’ theorem.

I sometimes tell people this in a slightly strange language: I say that if we ran into aliens, we might find out that they don't believe that things are absolutely true or false, but instead measure their truth or falsity in decibels.

So another perspective on what logistic regression is trying to do, is that it is trying to assume linear log-likelihood-ratio dependence based on the strength of some independent pieces of evidence. You can weakly justify this in all cases, using calculus and assuming everything has a small impact. You can further justify it strongly for any signal where twice as large of a measured regression variable ultimately implies twice as many independent events at a much lower level happening and independently providing their evidences for the regression outcome. So like, I come from a physics background, I am thinking in this case of photon counts in a photomultiplier tube or so: I know that at a lower level, each photon is contributing equally some small little bit of evidence for something, so when I count all the time up together, this is the appropriate framework to use.
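
A toy numerical check of that additivity (made-up numbers; the two pieces of evidence are assumed conditionally independent given X):

    import numpy as np

    def log_odds(p):
        return np.log(p / (1 - p))

    prior = 0.2                  # Pr(X)
    lr_e1, lr_e2 = 3.0, 0.5      # likelihood ratios LR(E1, X) and LR(E2, X)

    post_log_odds = log_odds(prior) + np.log(lr_e1) + np.log(lr_e2)
    posterior = 1 / (1 + np.exp(-post_log_odds))  # sigmoid maps log-odds back to a probability
    print(posterior)             # 0.2727..., i.e. posterior odds 0.25 * 3 * 0.5 = 0.375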


Putting logistic and linear regression into the generalised linear model framework is the right way to think of it and compare them.

From this point of view linear regression would be using GLM with identity link function, logistic regression uses the logit function as the link function.


It's better to think of linear regression and logistic regression as special cases of the Generalized Linear Model (GLM).

In that framework, they are literally the same model with different "settings" - Gaussian vs Bernoulli distribution.
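
In statsmodels, for instance, the two really are the same call with a different family (a sketch on made-up data):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y_cont = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=200)  # continuous response
    y_bin = (rng.uniform(size=200) < 0.5).astype(int)               # binary response (pure noise here)

    linear = sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit()   # identity link: linear regression
    logistic = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()  # logit link: logistic regression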


I have to disagree with you. While assuming Gaussian disturbance terms results in a linear regression, the linear regression framework is more general. It makes no assumptions about the distribution of the disturbance terms. Instead, it merely restricts the variance to be constant over all values of the response variable.


Both things can be true.

Linear regression is extra-special because it's a special case of several different frameworks and model classes.

I should have written that it's better (in my opinion) to think of logistic regression in the context of GLMs, at least while you're learning.

Edit: yes logistic regression is a special case of regression with a different loss function. But it's not nearly "as special" as linear regression.


As above, I would strongly agree with you. Both linear and logistic regression can be special cases of frameworks that are more general and far less parametric than GLM. But they also have very intuitive or hands-on explanations, especially logistic regression, which GLM doesn't have.


Well, then I am gonna say it's even better to think of linear and logistic regression as special case of M Estimators or GMM.

Joke aside, the truth is that logistic regression can be understood based on several assumptions.

Above we have the latent variable explanation. Then there is a Bayesian version. There's even a "random utility" formulation, where one explicitly models the choices of an agent with a probabilistic error. That one is good for explaining hierarchical logit models and many of the "issues" with logit, such as IIA.

GLM, on the other hand, I don't feel adds much except parameterizing the procedure, which ain't even a good thing. Nowadays we appreciate the semi-parametric nature of regression a lot, which is why GLM has declined in use.


The main benefit of the GLM formulation is the observation that your model implies a particular probability distribution for the target, whether you like it or not. And that your point predictions are in fact conditional means. In my opinion, this is an important aspect of modeling that is glossed over or omitted by a lot of introductory material.


I agree on the point of conditional means, however I'd say other statistical frameworks emphasize that point even more.

The point about the probability distribution is reasonable, but I am not sure if it is taken seriously by everyone applying GLM either. And again, if it is not necessary to assume such a distribution, then I would prefer a semi parametric approach, such as in linear regression.


Linear regression uses MSE loss. Logistic regression uses log-loss. Both loss functions behave differently.

It's not just the underlying model; the loss function is also different.
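
Written out in Python (a sketch; y is the 0/1 target and p the model output):

    import numpy as np

    def mse_loss(y, p):
        # squared error, minimized by ordinary least squares
        return np.mean((y - p) ** 2)

    def log_loss(y, p):
        # negative Bernoulli log-likelihood (cross-entropy), minimized by logistic regression
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))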


> it's linear regression on the log-odds

Almost - logistic regression assumes that the function is linear in the log odds, i.e. log(p/(1-p)) = Xb + e. The problem is that you can't compute the log-odds, because you don't know p.


This is not quite correct. The log probabilities are

log p(y=1 | x; beta) = beta * x - log Z(x; beta)

where

Z(x; beta) = exp(0 * x) + exp(beta * x) = 1 + exp(beta * x), i.e. the sum of the unnormalized scores for y = 0 and y = 1.

Thus, you can think of it as linear regression, but with an additional term log Z(x; beta) in the log likelihood.
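
A quick numerical check of that identity (arbitrary value for the linear score):

    import numpy as np

    bx = 1.3                                  # beta * x for some x
    Z = 1 + np.exp(bx)                        # normalizer over y in {0, 1}
    print(bx - np.log(Z))                     # log p(y=1 | x)
    print(np.log(1 / (1 + np.exp(-bx))))      # log sigmoid(beta * x) -- same value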


Logistic regression can learn some quite amazing things. I trained a linear function to play chess: https://github.com/thomasahle/fastchess and it manages to predict the next moves of top engine games with 27% accuracy.

A benefit of logistic regression is that the resulting model is really fast. Furthermore, it's linear, so you can do incremental updates to your prediction. If you have `n` classes and `b` input features change, you can recompute the scores in `bn` time rather than doing a full matrix multiplication, which can be a huge time saver.
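
Roughly, the incremental update looks like this (hypothetical shapes, not the fastchess code):

    import numpy as np

    rng = np.random.default_rng(3)
    n_classes, n_features = 1800, 768         # made-up sizes
    W = rng.normal(size=(n_classes, n_features))
    x = rng.normal(size=n_features)
    scores = W @ x                            # full matrix-vector product, done once

    # A move changes only b input features; update all class scores in O(b * n_classes)
    changed = [5, 42, 100]                    # indices of the changed features
    new_vals = rng.normal(size=len(changed))
    scores += W[:, changed] @ (new_vals - x[changed])
    x[changed] = new_vals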


Isn't 27% worse than flipping a coin?


A typical chess position has 20-40 legal moves. The complete space of moves for the model to predict from has about 1800 moves.

For comparison Leela Zero gets around 60% accuracy on predicting its own next move.

With this sort of accuracy you can reduce the search part of the algorithm to an effective branching factor of 2-4 rather than 40, nearly for free, which is a pretty big win.


I don't understand the comment about Leela. Why isn't its own move prediction deterministic?


Because Leela (like fastchess mentioned above) has two parts: A neural network predicting good moves, and a tree search exploring the moves suggested and evaluating the resulting positions (with a second net).

If the prediction (policy) net had a 100% accuracy, you wouldn't need the tree search part at all.


Got it. Thanks for clarifying. Let me restate.

Part one of Leela ranks several chess moves. Part two picks among those.

60% of the time part 2 chooses the #1 ranked move.


That works :-)

One addition: The second part can be run for an arbitrary amount of time, gradually improving the quality of the returned move.

The 60% figure comes from the training games, which are played very quickly and so don't leave much time for the search to refine the move; the final move therefore matches the policy's prediction more often, which raises the measured accuracy.

In real games (e.g. on tcec-chess.com) this "self accuracy" would probably be a bit lower.


You haven't mentioned any nondeterministic behaviour, therefore Leela is supposed to predict its own moves with 100% accuracy.


It's not non-determinism, it's partial information. The NN part guesses the best move that will be found by search X% of the time. If you just ditched the search part, Leela would be faster and lose out on (1-X)% of the better moves.


Only if you have only two legal moves.


No, uniform random would be bounded by 1/16. However, you cannot move every piece in every configuration, so it's greater than that. It would actually be an interesting problem to figure out...


There are only 16 pieces, but in most board positions, many pieces can make more than one legal move.


Weight of evidence binning can be a helpful feature engineering strategy for logistic regression.

Often this is a good 'first cut' model for a binary classifier on tabular data. If feature interactions don't have a major impact on your target then this can actually be a tough benchmark to beat.

https://github.com/oli5679/WeightOfEvidenceDemo

https://www.listendata.com/2015/03/weight-of-evidence-woe-an...
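
Roughly, the encoding looks like this (a sketch assuming a pandas DataFrame with a numeric feature and a binary target; not the code from the links above):

    import numpy as np
    import pandas as pd

    def woe_encode(df, feature, target, n_bins=5):
        """Replace a numeric feature with the weight of evidence of its bin."""
        bins = pd.qcut(df[feature], q=n_bins, duplicates="drop")
        counts = df.groupby(bins)[target].agg(["sum", "count"])
        events = counts["sum"]                        # y == 1 per bin
        non_events = counts["count"] - counts["sum"]  # y == 0 per bin
        # real code would smooth bins with zero events/non-events
        woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
        return bins.map(woe)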


How do we differentiate between econometrics and machine learning? Logistic regression seems like it fits into econometrics better than machine learning to me. There's no regularization. I guess there's gradient descent which can be seen as more machine learning. In the end it's semantics of course, still an interesting distinction.


Econometrics is the application of statistical techniques on economics-related problems, typically to understand relationships between economic phenomena (e.g. income) and things that might be associated with it (e.g. education).

Machine learning is typically defined as a way to enable computers to learn from data to accomplish tasks, without explicitly telling them how.

Both fields can use logistic regression, regularization, and gradient descent to accomplish their goals, so in that sense there's no distinction.

But IMO there is a difference in their primary intention: econometrics typically focuses on inference about relationships, machine learning typically focuses on predictive accuracy. That's not to say that econometrics doesn't consider predictive accuracy, or that machine learning doesn't consider inference, but it's usually not their primary concern.


So you're going with the only difference being who's building the model. Interesting take, can't say I disagree much. Although I would say that regularization is a bit rare in econometric models because it biases the coefficients, and inference on those coefficients is, as you pointed out, the primary goal of econometrics.


Econometric models tend to be hand-fit and focus more on explanation/hypothesis testing than prediction, so automated variable selection is less common (and sometimes frowned upon).


this is a great explanation. Thank you


From my experience, econometricians and ML practitioners mostly pretend like the other group doesn't exist.


Well, what do you define as machine learning?

Logistic regression is clearly a classifier. And you need data to train it. So it's a supervised learning algorithm.


I'm trying to have a conversation so I can figure it out. Pretty confident that being a classifier does not make it machine learning, econometrics has classifiers too. Econometric models also need data to train them, so I'm not sure your second point is helpful either. Unless you're claiming the difference is nothing but whether the model is used by an economist.


What is machine learning then?


Statistics as practiced by computer scientists :)

More generally, I recommend Breiman's two cultures article for some insight into the similarities and differences.

If you need a really simple explanation, then machine learning is a tool for generating predictions, while statistics is a method for performing inference about causes.


The correct bucket for logistic regression should be "statistics", under "generalized linear models".


I think you'll enjoy this delightful thread: https://stackoverflow.com/questions/4205105/whats-the-differ...


Logistic regression can use both L1 and L2 regularization


And OLS can too. That doesn't make it machine learning. This implementation doesn't involve any regularization.


I am not really sure why you are trying to redefine the term machine learning with random references to regularization. OLS is taught in every machine learning course. The machine learns parameters via gradient descent for a wide variety of loss functions.


Ah, the old "regression from scratch" post that is mandatory for all blogs


It looks like he’s working through a lot of algorithms:

https://github.com/pmuens/lab



