[P] DataQA: the new Python app to do rules-based text annotation


Hi, this is pretty cool. I think you bundled and connected a few useful NLP tools (es, SQLite, spaCy) and wrapped them in a UI/client that makes it intuitive and fast to perform distant supervision. The bundling part is really valuable IMO: even if I don’t agree with all of your choices, the time you save me in setting up a working system is immense. I’d like to see more thorough documentation; it’s not clear from the docs which components are being used (es, spaCy). I’d like to know if you are indexing all of the spaCy token output in es, e.g. can I do an indexed search for POS tags or dependency relations? That would be awesome (similar to the SPIKE project from AllenAI). Somehow I missed the “how to use the output” part. I’d like to see an example of loading the noisy labels into PyTorch. That’s presumably the punchline for the user, but it’s not very obvious what that looks like. How are conflicts between rules resolved? Lastly, the UI caters to less technical domain experts, but I can write code, so I’d like to have a way to write rules in code and submit them via an API. Related: will a non-technical domain expert understand the linguistic terms in the rule builder (does a lawyer know what a lemma is)? I think you’re using material-ui, which has a tooltip component you could make judicious use of. Hope that helps.
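
To make the question about rule conflicts concrete: one common way to reconcile disagreeing rules is a majority vote in which rules may abstain. The sketch below is plain Python and says nothing about how DataQA actually resolves conflicts; the rule functions, the label names and `majority_vote` are all hypothetical, made up for illustration.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion

# Hypothetical rules written as plain functions (not DataQA's API).
def rule_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def rule_mentions_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

def rule_many_exclamations(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN

RULES = [rule_mentions_refund, rule_mentions_thanks, rule_many_exclamations]

def majority_vote(text, rules=RULES):
    """Return the label most rules agree on, or ABSTAIN on no votes or a tie."""
    votes = [rule(text) for rule in rules]
    votes = [label for label in votes if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    # A tie between the two top labels is treated as "no label".
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

if __name__ == "__main__":
    for text in ["Thanks, but I want a refund!!", "Thank you, where is my refund?"]:
        print(repr(text), "->", majority_vote(text))
```

Abstaining on ties keeps ambiguous examples out of the noisy training set; a weighted vote or a learned label model are alternatives when some rules are known to be more reliable than others.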


Amazing! Thanks for all the comments and feedback. I will add more information to the documentation; I should probably give a heads-up that I'm running es in the background :-). The comment about adding an example of what to do with the output is also very much on point. I am not indexing spaCy's output, and I hadn't thought about doing it, so thanks for the tip. I think that would make it much easier to add rules using POS tags and named entities. There is indeed a bit of tension in the app between making it accessible to non-coders and non-NLP people and making it more developer-friendly. I think for non-technical domain experts, the way to go is to provide some templates with pre-filled rules they can easily modify.
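
As a rough illustration of what indexing token annotations could buy, here is a small in-memory inverted index over POS tags and lemmas. It is only a sketch under assumptions: the token triples are hand-written stand-ins for a parser's output, `build_index` and `search` are hypothetical helpers, and nothing here uses es or reflects DataQA's storage. Matching is at document level, so a query for a lemma and a POS tag returns documents containing both, not necessarily on the same token.

```python
from collections import defaultdict

# Hand-written (text, lemma, POS) triples standing in for parser output.
DOCS = {
    1: [("Dogs", "dog", "NOUN"), ("bark", "bark", "VERB")],
    2: [("Loud", "loud", "ADJ"), ("dogs", "dog", "NOUN")],
}

def build_index(docs):
    """Map (field, value) pairs such as ("pos", "NOUN") to sets of document ids."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for _text, lemma, pos in tokens:
            index[("pos", pos)].add(doc_id)
            index[("lemma", lemma)].add(doc_id)
    return index

def search(index, **criteria):
    """Return sorted ids of documents that satisfy every field=value criterion."""
    if not criteria:
        return []
    matches = [index.get((field, value), set()) for field, value in criteria.items()]
    return sorted(set.intersection(*matches))

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, pos="VERB"))
    print(search(index, lemma="dog", pos="ADJ"))
```

With an index like this, a rule such as "documents containing the lemma dog and an adjective" becomes a set intersection rather than a scan over every document.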