They do give an example of a question where the model chose an incorrect answer in the adversarial setting:
"The condition of the air outdoors at a certain time ofday is known as (A) friction (B) light (C) force (D)weather[correct](Q) joule (R) gradient[selected](S)trench (T) add heat"
I assume this might be characteristic of other questions as well, although I don't know anything about the Regents Science Exam or whether it contains multiple questions about closely related topics.
It is a well-worded question for its purpose. The whole point is that, of all the options given, only one is justifiable (and justifying it does not require a tendentious stretch, either). Even “light” (which was not chosen) only applies half the time, on average. This is a valid test of natural language understanding.
Remember when IBM went on Jeopardy? There was a question about some Egyptian pharaoh. A human with some knowledge of history might mix up Ramses and Seti, or just not know the answer, but know that they didn't know. Watson answered "What are trousers?" with supreme confidence.
Jeopardy is fun and games, and it was great for the blooper reel, but they're trying to sell this stuff to diagnose cancer and guide police efforts. Failure modes are kind of important.
"The condition of the air outdoors at a certain time ofday is known as (A) friction (B) light (C) force (D)weather[correct](Q) joule (R) gradient[selected](S)trench (T) add heat"
I assume this might be characteristic for other questions as well, although I don't know anything about the Regents Science Exam and whether there are multiple questions about closely related topics.