First things first: hats off on a concise, accurate and efficient implementation in python.
But the Cython parser is not doing labeled parsing.. a significant number of feature templates are relevant to labeled parsing.
To reach Nivre and Zhang's 2011 results I think you'll find labeled parsing will degrade performance significantly and you'll need more complex code
1. you have more than an order of magnitude more transitions (80 vs 4)
2. you need to add beam search, with early update for training. beam size = 64 to be consistent with zpar
3. an order of magnitude more feature templates, with more complex features
So now you'll be evaluating 20x more transitions, on ~64x more candidates, with ~10x more feature templates. You can't multithread properly in python, you can maybe get some more juice by using some form of SIMD (w/ or w/o GPU). Your model is much larger in memory, so any luck you had in L2/L3 caching is probably gone. Python has limits and this is the point you start hitting them.
Edit: I see you've updated your comment, I guess we're on the same page. It's nice to see a concise implementation. I wrote one in golang for my masters thesis. I achieved perfect parity with zpar, but it took two order of magnitude more LOC.
Sorry for the lack of clarity. The Cython parser results that I quoted uses labelled parsing with beam search. There are various versions in the repo, but there's definitely a configuration that runs the Zhang and Nivre (2011) experiment verbatim, with exactly the same feature templates, and the same results that they get. Replicating those results was essential to implementing the parser correctly.
Those are Unlabelled Accuracy score results, but Redshift is quietly computing the dependency labels, while parser.py does not. Running Redshift in unlabelled mode gives very fast parse times, but about 1% less accuracy. The labels are really good, both as features and as a way to divide the problem space.
The data sets are the _Stanford_ labels, where the main results in Zhang and Nivre refer to MALT labels. Z&N do provide a single Stanford accuracy in their results, of 93.5% UAS.
Sentences per second should be just over 100. I use k=8 with some extra features referring to words further down the stack, where Z&N use k=64. Right at the bottom of the post, you can find the commit SHA and the commands for the experiment.
But the Cython parser is not doing labeled parsing.. a significant number of feature templates are relevant to labeled parsing.
To reach Nivre and Zhang's 2011 results I think you'll find labeled parsing will degrade performance significantly and you'll need more complex code
1. you have more than an order of magnitude more transitions (80 vs 4)
2. you need to add beam search, with early update for training. beam size = 64 to be consistent with zpar
3. an order of magnitude more feature templates, with more complex features
So now you'll be evaluating 20x more transitions, on ~64x more candidates, with ~10x more feature templates. You can't multithread properly in python, you can maybe get some more juice by using some form of SIMD (w/ or w/o GPU). Your model is much larger in memory, so any luck you had in L2/L3 caching is probably gone. Python has limits and this is the point you start hitting them.
Edit: I see you've updated your comment, I guess we're on the same page. It's nice to see a concise implementation. I wrote one in golang for my masters thesis. I achieved perfect parity with zpar, but it took two order of magnitude more LOC.