We keep paying more and more per student for public education. And what do we get in return? Robo-grading and prison-like treatment of students (closed campuses, zero tolerance taken to a ridiculous extent). And again we come around to the notion that if only we'd spend more, everything would get better, which has worked so well every time it's been tried in the past.
The increasing reliance on automated tools and standardized tests is just one of many symptoms of the growing detachment between teachers and students. We have to stop doubling down on the Prussian-style educational system; it stopped paying dividends decades ago. More importantly, we have to stop acting as though we can solve our educational problems by throwing more money at them. We need to empower good teachers and parents, and we need to encourage and support students. Right now all of the incentives in the system run counter to those things.
The problem with "more money" is that the educational system seems to have a sort of hull speed when it comes to adding money.
I'd guess that the same amount or less is being spent on actual education today, and the excess, whatever it may be, is simply siphoned off by the parasites in the system.
It doesn't seem to matter how much $ per student gets added, teachers still end up buying classroom supplies with their own money. The system just has more $180k "administrators" now than it used to.
University education has been going up at 10%/year for decades now. Way above inflation.
It's clearly running on its own internal dynamics until those dynamics finally collide with reality. That collision will be... painful. Given the crushing levels of student debt in this country, it may happen sooner rather than later.
If I were in charge, I'd institute a simple change. Improve financial aid. Beef up the federal loan guarantees. But only make either available at schools whose tuition was in the bottom 70% when you applied. With that incentive, I think a lot of schools would find ways to bring tuition back toward levels that (inflation-adjusted) were more than sufficient several decades ago.
> University education has been going up at 10%/year for decades now.
Accounting for inflation and number of students, university budgets have not been going up 10%/year; at some universities, the real per-student cost of higher education has actually been declining. What's been going up at 10%/year is the tuition "sticker price" for students who receive no scholarships or need-based aid. But this is offset by two contrary trends: 1) a decrease in the percentage of students who are actually paying the sticker price, especially at private universities; and 2) large decreases in state funding for public universities.
Consider the University of California system (all figures below in 2012 dollars). In 1990, it spent $21,000 per student (dividing its total budget by its total enrollment). Today it spends $16,500 per student, a decrease in the cost of education of roughly 20%. Tuition has nonetheless gone up, because the state-funded portion of its budget has declined even faster: from $16,000 per student in 1990 to $9,500 per student today, and possibly to $8,500 per student in the coming year. As a result, the student-funded portion has risen from $5,000 to $7,000 and soon $8,000 per student on average (with a sticker price a bit over double that).
If you want the UC system to be able to run on tuitions that were sufficient in 1990, you'd have to also return state funding to what it was in 1990...
Cost is orthogonal to educational quality. As I said, this isn't an example of reducing costs so much as an example of the continual dehumanization of the education system. Any cost savings will just be gobbled up by the entrenched interests in the system (bureaucrats, administrators, etc.), and any efficiencies will just make it easier to treat students like cattle and push "education" further toward useless rote memorization (teaching to the test) and busywork. These are not good things.
I've heard this (anecdotal) argument a lot. Is there any empirical backing for it, and within what parameters does it hold true? My mother taught at a very expensive private school, which enabled me to switch from a lousy public school. The difference in the quality of teaching was dramatic. Private school teachers don't get paid a lot more than public school teachers, and often less, although they are compensated with fringe benefits like the one I enjoyed, and with prestige. In any case, elite private schools certainly get better results, which is why rich people pay a lot of money to send their children to them. (Much of this is attributable to the selection process, which is part of what parents are paying for.)
In contrast, public inner-city schools can't choose their student population, and lack basic things like books, paper, and air conditioning in the middle of the summer. I'm sure that proper and equitable funding of public education wouldn't solve all the problems in US public schools, but starving them of much-needed resources certainly exacerbates the problems they do have.
The countries that perform well on international standardized tests generally spend much less money per student, but a teaching job is much more prestigious than it is in the U.S.
I personally think that explanation fits both the "cost is orthogonal to education quality" argument of the parent, and your point that elite private schools have significantly better education quality, though I do not know offhand of studies about this.
If it's OK for them to use an algorithm to grade essays, then it would be OK for students to use an algorithm to write the papers, no? How is that different from using a calculator? It's a tool, right? Mathematics used to be done only with pen and paper. (I know the article says essays would still be graded by people; I'm making a point.)
Writing is an art form and should be treated and graded as one. Sometimes you need to use language that is grammatically incorrect to get a point across, and software is not nuanced enough to pick up on this. I can see this helping to turn out students who are good grammarians but terrible writers - optimizing for the wrong outcome.
I would recommend downloading the dataset[1] and taking a look at some of the essays. There is a tendency for an educated audience to (justifiably) worry about exactly what you've articulated, but when you read through the essays, you get an appreciation of how valuable a tool like this could be. These are elementary and middle school essays rife with rudimentary mistakes (which are at times extremely charming) and generally not filled with subtle arguments prone to misinterpretation.
If algorithms can help every seventh grader in the country write a solid 5 paragraph essay, that's progress.
I participated in the Kaggle competition, but I'm less optimistic about the pedagogic value of these systems. A black box algorithm won't explain to a student what their rudimentary mistakes are.
Great write up! I hadn't seen it, thanks for sharing.
Two things:
1) Agree about the explanation bit. In fairness, that was not part of the competition. But I do believe that, given the way the features were constructed, many of the algorithms could be modified to also provide an "explanation" of their scoring (a rough sketch of what I mean follows this comment). The value is highly dependent on the features, though (length, for instance, would probably be a prominent "explanation"), and I haven't actually tried it.
2) The "This would grade Dickinson/Hemingway/DFW/etc poorly" argument is true but, again, read the essays. These are serious edge cases - ones that should be addressed (by, for instance, correlating algorithmic scoring with human scoring), but this argument doesn't diminish the applicability of the algorithm to the vast majority of essays and students.
Will this help the nearly illiterate masses? I think so. Will it hurt those who are one standard deviation or more above the norm? I think that's a problem, too. The solution, then, is to use the algorithm on the first set and more individual-oriented grading methods for the other set. Effort should be spent identifying those in the second set, and that effort isn't restricted to a single course or assignment. This challenges the assumption that students are equal and deserve equal grading, but I don't really see another way forward that doesn't hurt the best, the mediocre, or the poorest writers.
For the standardized test discussed in the article, the human graders spent three minutes, on average, reading each essay. These tests are not looking for art. Writing one of these essays is no more art than writing a basic linked-list implementation. They're not concerned with how good a writer a kid is, but with whether the kid can write at all.
Edit: basically, the sheer scale of the grading required for a standardized test means that it's never going to be a real test of writing skills. It wouldn't surprise me if overly skilled writing is actually a negative even with the human grader—I could easily see a rushed grader misreading something clever and giving it a low score.
(The wisdom or effectiveness of these tests in the first place is a whole different discussion.)
"Art" is not even a question - these tests typically apply to high school kids, and you're lucky if bare-bones basic communication is possible there. Most of these kids are straight-up idiots, and I say this having spent several years teaching them.
I'd guess that the algorithm "map number of Microsoft Word reported spelling errors from (0 -> 800) to (20 -> 400)" does a pretty damn good job of guessing students' SAT Writing scores. Add in a factor for essay length, unique word count, and grammar errors, and you'd probably get pretty damn close to perfect, within a reasonable statistical tolerance. (A toy version of that heuristic is sketched at the end of this comment.)
Children don't produce art, almost all of them produce garbage, right up to the top 1%. Make no mistake, all that standardized tests aim to measure is how stinky the shit that they produce is. I'm fine with missing a bit of brilliance at the top; those students will do just fine anyways. It's ranking the relative crappitude of the middle that's important here, and I think the standard (if crappy) grading procedures do a pretty good job here.
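Purely to illustrate, a heuristic like the one above is only a few lines of code; the length bonus and the clamping below are my own extra guesses, not anything a real scorer is known to use:

    # Toy heuristic from above: 0 spelling errors -> 800, 20 errors -> 400,
    # interpolated linearly, plus a made-up length bonus, clamped to 200-800.
    def guess_sat_writing_score(spelling_errors, essay_length_words):
        base = 800 - 20 * spelling_errors                    # the (0 -> 800), (20 -> 400) map
        length_bonus = 0.1 * min(essay_length_words, 500)    # hypothetical length factor
        return max(200, min(800, base + length_bonus))

    print(guess_sat_writing_score(spelling_errors=3, essay_length_words=420))   # 782.0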
>I'd guess that the algorithm "map number of Microsoft Word reported spelling errors from (0 -> 800) to (20 -> 400)" does a pretty damn good job of guessing students' SAT Writing scores.
The best part about that is the fact that people misspell words even with spellcheck! Come on, how hard is it to notice the squiggly lines and correct your error by picking the correct spelling from the drop down list?
And I wouldn't say what I produce is shit, but then I would have said that when I was twelve. And I mostly have to double time on projects and essays, so they probably are shit.
EDIT: And since losethos (the schizophrenic dude who's hellbanned) asked: the boss. It's pretty clever to use someone's tendency for pedantry against them.
Jaded much? Maybe I'm overly optimistic about the quality of students the school system produces? (I never went to high school; I took the GED, scored in the 100th percentile across the board, and went directly to college at 16.)
I'm sorry, but this is absurd, and it makes me very angry. If I got a whiff of this attitude from you as my teacher, you would not get an ounce of my respect as a student. Why should I spend time trying to produce good work to impress you if you've already decided that what I produce will be shit? If you take the ridiculous view that my attention to grammatical detail, essay length, and variety of vocabulary used are the most important things about my writing, and enough to determine my grade?
I say this as someone who graduated at the top of my high school class - if you can't get me to work hard for you, you're pretty screwed with the average student. It's no wonder that the stuff they wrote for you was crap.
I upvoted you for the interesting line of thought, although I disagree with the premise that acceptability of grading algorithmically implies acceptability of writing algorithmically.
In CS, we often grade assignments automatically, although we expect students to write the assignments by hand. Although there are many classes where this would be a bad idea, there are some where it works quite well - if the problems are strictly specified, grading an assignment with what are effectively unit tests is reasonable.
Grading and doing an assignment are different classes of problems - roughly mapping to decision problems and search problems. Just because the grading is computationally tractable or acceptable does not necessarily imply that the act of writing the assignment is.
I agree with your second paragraph by and large. There is a time and place for automatic grading, and most writing should not fall into that category.
One difference with CS algorithmic grading is that it's much more specific and causally connected to outcomes. It's not looking for statistical correlates of the input text and a notion of "good program", but is actually finding things like compile errors and failed testcases. And, crucially, it can also report those to the student: your program is wrong because when fed 5 it gives an output of 23 when the output should be 25.
It's probably possible to grade student programs like this essay-grading system works: instead of feeding the programs to a compiler and running against test-cases, just turn the source code into a bag-of-words representation and feed it to a statistical classifier. There are almost certainly some statistical correlates between the source code of programs and whether they're good/bad submissions. But the results would be both less reliable and less informative as feedback.
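For concreteness, here is a minimal sketch of that thought experiment, assuming scikit-learn; the toy submissions and labels are placeholders, and a real course would obviously keep compiling and running the code instead:

    # Sketch: grade programs by bag-of-words over their source text,
    # never compiling or running anything. Toy data; scikit-learn assumed.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    past_sources = ["def add(a, b):\n    return a + b", "print('hello')"]   # placeholder corpus
    past_grades = [1, 0]                                                    # 1 = acceptable, 0 = not

    model = make_pipeline(
        CountVectorizer(token_pattern=r"[A-Za-z_]+|\S"),   # keep identifiers and symbols
        LogisticRegression(),
    )
    model.fit(past_sources, past_grades)

    buggy = "def add(x, y):\n    return x - y"   # 'looks' like the good submission
    print(model.predict([buggy]))                # the classifier can't tell it's wrong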
Good point; I agree this makes code much more naturally machine graded than essays, and a statistical classifier for source code sounds disturbing (although it might do better on grading style than the usual autograding does).
The evaluations are on "how well algorithms submitted by professional data scientists and amateur statistics wizards could predict the scores assigned by human graders." As long as some humans are always included as a control, this shouldn't be too big of a problem.
Giving students unlimited access to a program that evaluates essays would be the worst possible thing to do. People are generally very good at adapting to systems, and the algorithm would break down pretty quickly as people figure out which factors it favors and which ones it penalises (while human graders are arguably less rigid in their criteria and can hence recognize these attempts at writing to the letter).
If students en masse merely ran their shitty essays through spell and grammar check, the quality of even college-level essays would be improved by a huge margin.
For 99% of the high school students I've worked with, their essays would be better if they had to pass Microsoft's spelling and grammar checks before they were submitted. The exceptions were all exceptional enough that they knew that they no longer needed to listen to those automated filters.
This inevitably leads to an arms race, with the students and software getting better and better. At some point, sufficiently advanced gaming of essay scores is indistinguishable from students who naturally write well by following the rules which lead to good writing. As long as creativity and originality aren't sacrificed, is it really a problem?
You may be overestimating the sophistication of these algorithms.
At least for this training set, my algorithm rewarded the length of the essay most of all (something like 65% of the total prediction). The only other significant factors were misspellings and prevalence of certain parts of speech.
That model matched the accuracy of human graders and several commercial essay grading packages.
Students reverse-engineering comparable algorithms won't necessarily have to write well to score well.
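For a sense of what that looks like, here is a simplified sketch of that general shape (not my actual entry; the misspelling proxy and the feature set below are crude stand-ins):

    # Simplified shape of a length-dominated essay scorer (not the real entry).
    import re
    from sklearn.linear_model import Ridge

    COMMON_WORDS = {"the", "and", "to", "of", "a", "in", "is", "that", "it", "was"}

    def features(essay):
        words = re.findall(r"[A-Za-z']+", essay.lower())
        n = len(words)
        rare = sum(1 for w in words if w not in COMMON_WORDS)   # crude misspelling proxy
        return [n, rare / max(n, 1), essay.count(",") / max(n, 1)]

    def train(essays, human_scores):
        model = Ridge()
        model.fit([features(e) for e in essays], human_scores)
        return model   # model.coef_[0], the length weight, tends to dominate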
Currently, if a student writes nonsense, there's a fairly significant chance that they will be caught and penalised. A human can detect nonsense in three minutes.
In contrast, I suspect algorithmic approaches can be gamed more easily because they don't adapt in the same way. They're not solving the hard AI problem; they're grading essays (currently) written for a human reviewer.
For example, what happens if a child learns an existing text by heart and then substitutes appropriate nouns and verbs to suit the context? Say they learn "We hold these truths to be self-evident, that all men are created equal" and then, for an essay on their favourite pet, they hand in "We hold these kittens to be furry, that all kittens are created hungry". That's good grammar; it's got suitable references to the subject; it's clearly nonsense.
No, it's incorrect. Compare "these truths... that all men are created equal, (...)" with "these kittens... that all kittens are created hungry". The "that" in the second sentence is wrong.
In all probability, anyone setting out to actually defeat these algorithms could easily do it with a couple of hundred repetitions of the same sentence.
Or, if it's slightly cleverer than that, certainly you could defeat it by producing a single, perfect-length, stylistically fantastic essay... which would be regurgitated word-for-word regardless of the subject matter.
I think, mind you, that software does have a place in analyzing student essays. If I could scan in an essay and have it spit out a word count, highlight any spelling errors or potentially problematic turns of phrase, and [most importantly] analyze for plagiarism, that would be valuable.
It does not necessarily lead to an arms race. Once the students learn to write to the grading software, the schools can claim the students are getting better, and point to an 'objective, unbiased' measuring stick. Everyone wins except for society, i.e. all of us.
I'm assuming there will be competition among the companies who develop such software where, periodically, schools will re-evaluate their provider and select the best one. Company A will demonstrate their competitive advantage by showing how Company B's software incorrectly grades as "Excellent" an essay that looks like English but reads as gibberish.
Company B will correct their software, but meanwhile Company C has come along and introduced a sophisticated analyzer which is able to grade the presence and quality of the logical argument being made in the essay. Company C will demonstrate how Company A and B's software grades as "Excellent" an essay which is correct English and isn't gibberish but makes no logical sense.
After an indeterminate number of iterations with the software getting better and better at finding issues, the only way to game it will be to write a great essay.
You are forgetting the negative effect of the ridiculous ranking systems that determine teachers' performance evaluations. Why would a school select software that will make their grades go down? If your grades go down, it decimates the school's ranking, because rankings are based on year-on-year improvements. You would select the software that makes it easiest to teach to the test so the kids 'do well'.
I think this says more about the assignment and the human grading procedures than anything else. Even the supposedly creative sections in standardized tests are surprisingly narrow in what they consider to be good.
Exactly what I was thinking. The ability to reliably predict human scores says more about the terrible way we "standardize" creative writing than about the quality of the algorithms.
If essays are graded against a checklist, how is this finding a surprise?
This article is coming to the wrong conclusion from the facts given. That the grades given by these algorithms matched the grades given by human essay reviewers on standardized tests almost exactly doesn't mean that our algorithms are great and that they should be rolled out to normal school environments to help students learn to write. It means that our method of grading essays on standardized tests is absolute garbage, because the algorithms can't judge semantic meaning or depth of insight. We're basically grading people on how prepared they are for writing About.com content farm articles.
I was told by the head search engineer for a well-known job board site that they'd trained a bag-of-words classifier to find badly written resumes that would benefit from professional editing.
They'd found it was one of the easiest text classification problems they had ever tried. With a very small training set they'd attained off-the-charts accuracy.
How well did the algorithms detect insightful analysis, deep understanding beyond the immediate subject matter, factual correctness, salience and an ability to write to a specific audience?
When reading articles on the Internet I look out for superficial factors including the following (a rough sketch of the first two checks follows the list):
‣ Misuse of U+0022 typewriter double quotation marks in place of U+201C and U+201D double quotation marks
‣ Misuse of U+002D hyphens in place of U+2013 en dashes and U+2014 em dashes
‣ Poor understanding of comma and semicolon usage
‣ Choice of typeface, use of ligatures and micro-typographical features, amount of leading used…
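As a rough sketch, the first two checks reduce to a few lines of code (only the codepoints named above are looked for; the dash patterns below are my own crude approximations):

    # Flag typewriter quotes and bare hyphens where typographic characters may belong.
    import re

    def superficial_typography_flags(text):
        flags = []
        if '"' in text:                       # U+0022 where U+201C/U+201D belong
            flags.append("typewriter double quotes")
        if re.search(r"\s-\s", text):         # spaced U+002D where an em or en dash may belong
            flags.append("hyphen used as a dash")
        if re.search(r"\d+-\d+", text):       # numeric ranges usually want U+2013
            flags.append("hyphen in a numeric range")
        return flags

    print(superficial_typography_flags('Pages 10-12 are fine - "mostly".'))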
These factors can assist with analysis of writing. An author can generate credibility in certain contexts by using triangular Unicode bullets instead of ASCII asterisks. Use of lengthy and complex words in place of simple English can also be mistaken for credibility. Analysis of these techniques will show that the writer is often attempting to belittle readers. Documentaries utilise similar tactics that the untrained eye will likely mistake for credibility. Soft filter effects, use of bookcase backdrops for expert interviews and silence have profound manipulative effects on viewers.
Grading algorithms will likely reward the use of manipulative and belittling writing techniques and penalise honest superficial mistakes. I would rather read an honest opinion from an author who misuses "their" and "there" than have to carefully analyse the carefully crafted writings of a dishonest linguist.
As an aside, I recommend bookmarking The Browser[1] (Writing Worth Reading). It would be interesting to see the results of algorithmic essay grading against this curated collection of articles from across the Internet.
> How well did the algorithms detect insightful analysis, deep understanding beyond the immediate subject matter, factual correctness, salience and an ability to write to a specific audience?
I'm far from well informed, but my understanding of standardised tests is that the standard specifies the algorithm, which already ignores your good points above in order to achieve standardised grading. All that really changes is whether a human or a robot executes the algorithm; the human insight has already been squeezed out of the system.
I wonder how the state-of-the-art in algorithmic essay grading compares to the state-of-the-art in scoring the quality of online content. Does Google have content-quality-estimating algorithms that could improve algorithmic essay grading?
A colleague and I entered this competition. We came 9th.
We were doing this for fun, and aren't experts in the domain, but I think our score was within 'diminishing returns' of the better teams.
There are a couple of things to realize about this challenge:
I wouldn't conceptualise the challenge as trying to find features of good essays. It is more about trying to find features that are predictive of the essay being good.
This is a subtle but important distinction. One example is that the length of essay was hugely predictive of the score the essay would get - longer meant better.
Is a longer essay really a better one? No. But, at the level the students were at, it just so happened that the students who were able to write better, also were able to write longer essays.
While you get good accuracy using techniques like this, it's debatable how useful or robust this general approach is - because you aren't really measuring the quality of the essay so much as you are trying to find features that just happen to be predictive of the quality.
Certainly, it would seem fairly easy for future students to game, if such a system were deployed.
This isn't a general attack on machine learning competitions - but I wonder whether, for situations that are in some sense adversarial like this (in that future students would have an incentive to game the system), some sort of iterated challenge would be better. After a couple of rounds of grading and attempting to game the grading, we'd probably have a more accurate assessment of how a system would work in practice.
There is another important feature of this essay grading challenge, that should be taken into consideration. There were 8 sets of essays, each on a different topic. So, for example, essay set 4 might have had the topic 'Write an essay on what you feel about technology in schools'. To improve accuracy, competitors could (and I would guess most of the better teams did) build separate models for each individual essay-set/topic. This then increased the accuracy of, say, a bag of words approach - if an essay-set 1 essay mentioned the word 'Internet' then maybe that was predictive of a good essay-set-1 essay, even though the inclusion of 'Internet' would not be predictive of essay quality, across all student essays.
It's important to remember this when thinking about the success of the algorithms. The essay grading algorithms were not necessarily general purpose, and could be fitted to each individual essay topic (a sketch of the per-set approach follows at the end of this comment).
Which is fine, as long as we realize it. The fact that it was so easy to surpass inter-annotator agreement (how predictive one human grader's scoring was of the other human grader's scoring) was interesting. It's just important to realize the limits of the machine learning contest setup.
I would guess that accuracy would go down on essays from older, more advanced students, or in an adversarial situation where there was an incentive to game the system.
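In case "separate models for each essay set" sounds more exotic than it is, here is a rough sketch of that approach (scikit-learn assumed; the pipeline choice and the parallel-list data layout are placeholders, not the actual competition code):

    # One bag-of-words model per essay set, so topic words like "Internet"
    # can be predictive within set 1 without affecting other sets.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    def train_per_set(essays, scores, set_ids):
        """essays, scores, set_ids are parallel lists; returns one model per set."""
        models = {}
        for set_id in set(set_ids):
            idx = [i for i, s in enumerate(set_ids) if s == set_id]
            pipe = make_pipeline(TfidfVectorizer(), Ridge())
            pipe.fit([essays[i] for i in idx], [scores[i] for i in idx])
            models[set_id] = pipe
        return models

    def predict(models, essay, set_id):
        return models[set_id].predict([essay])[0]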
> Is a longer essay really a better one? No. But, at the level the students were at, it just so happened that the students who were able to write better, also were able to write longer essays.
It could also be the case that length is one of the features your human graders are using to grade essays. I.e., it might really be causal, rather than merely correlated.
In my (anecdotal) experience, teachers certainly do this. While in college I developed the skill of utilizing excessively long and verbose language while elucidating simple points simply to incrementally increase the length of essays [1].
Luckily a great prof in grad school (thanks Joel) beat this bad habit out of me.
[1] In college I learned to pad my essays with verbose language.
It's possible.
More generally, it's also possible the human graders were doing a bad job; the ML system can only learn 'essay quality' to the extent that the training data reflects it.
However, the Kaggle-supplied 'straw-man' benchmark, which worked solely based on the count of characters and words in the essay, had a score of .647 on the training data. (The score metric used isn't trivial to interpret - it was 'Weighted Mean Quadratic Weighted Kappa' - but for reference the best entries had a score of ~.8 at the end.)
The score of .647, just using length, is quite high.
For length to have this powerful a causal predictive effect, the human graders would have to be weighting for length, as a feature, very heavily.
I can't rule that out, but I think it's highly likely that a major component of the predictive effect of length was correlative rather than causal.
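For anyone curious about the metric itself, quadratic weighted kappa between two raters' integer scores is straightforward to compute; a sketch (the weighted mean across the eight essay sets is omitted):

    # Quadratic weighted kappa between two raters' integer scores.
    import numpy as np

    def quadratic_weighted_kappa(a, b, min_score, max_score):
        n = max_score - min_score + 1
        observed = np.zeros((n, n))
        for x, y in zip(a, b):
            observed[x - min_score, y - min_score] += 1
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(a)
        i, j = np.indices((n, n))
        weights = (i - j) ** 2 / (n - 1) ** 2        # quadratic disagreement penalty
        return 1.0 - (weights * observed).sum() / (weights * expected).sum()

    print(quadratic_weighted_kappa([1, 2, 3, 3], [1, 2, 3, 2], 1, 3))   # 0.8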
> While you get good accuracy using techniques like this, it's debatable how useful or robust this general approach is - because you aren't really measuring the quality of the essay so much as you are trying to find features that just happen to be predictive of the quality. Certainly, it would seem fairly easy for future students to game, if such a system were deployed.
I'm not able to dig up the name, but there's a named effect in statistics (especially social-science statistics) describing exactly that. When you find a correlate of a desired outcome that has predictive value, a common result if you then set the correlate as a metric is that a substantial part of the correlation and predictive value quickly disappears, because you've now given people incentives to effectively arbitrage the proxy measure. You've said, I'm going to treat easy-to-measure property A as a proxy for what-I-really-want property B. Now there is a market incentive to find the cheapest possible way to maximize property A, which often ends up being via loopholes that do not maximize property B. A heuristic explanation is that proxies that are easier to measure than the "real" thing are also easier to optimize than the real thing. At the very least, your original statistics aren't valid anymore, because you measured in the context where people were not explicitly trying to optimize for A, but now they are doing so, so you need to re-measure to check if this changed the data.
Aha, almost it; I was thinking of the very similar Campbell's law, which your mention of Goodhart's law led me to. Somehow no combination of search terms got me to either of those when I was trying to come up with the name, though...
As long as we're training algorithms to recognize correlates of "high-quality writing" rather than high-quality writing itself, why not use as many predictive features as possible? I'll bet parental income and education level, average home price in the school district, and the percentage of students at the school receiving free or reduced-price lunches, are incredibly correlated with writing quality.
My instinct is that most algorithms would end up optimizing for essay length and word complexity rather than being able to assess the content; however, I imagine this is also how a lot of teachers grade, given that the article states each essay receives three minutes of attention.
At best, these algorithms can only pick up on obvious spelling, grammar, punctuation and style errors. They have no concept of semantics. Therefore students will have to polish the form and abandon the content.
To prove this point, I suggest setting up another competition, this time for algorithms that generate meaningless 'essays'. I expect they would easily obtain full marks.
So, what will be the educational outcome of this? In three simple words: more dumbing down.
Is this another example of the obsession of officials everywhere with the pedestrian form at the expense of original content? I think it tells us a lot about their own limited mentality.