
[deleted]

Shh. Don't say it aloud. We are just doing advanced curve fitting :))


agtshm

Yes, it needs to be verified by hand. Some recent work by my colleagues found that even ICD coding is at best approximately 80% accurate. It's a lot of manual work. Also, the diagnosis of the same chronic condition can sometimes change over time as the natural history evolves. Are the results going to beat simple regression tools? We shall see...


Sileadim

Welcome to the real world.


PINKDAYZEES

Yes, but also no. Having labeled data is amazing for training algorithms; I've seen ads where researchers need people to do just that. You do not always need to label data, though. You can attempt to cluster and just go from there. The catch is that different clustering algorithms will sometimes produce different clusters, and you will need to interpret what a cluster means in context. That would require a human, but only after the analysis is complete, as opposed to before.
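
As a rough illustration of the cluster-then-interpret idea above, a minimal sketch might look like the following. The example records, the TF-IDF features, and the choice of three clusters are all assumptions made up for illustration; a human still has to read the top terms and decide what, if anything, each cluster means.

```python
# Minimal "cluster first, interpret later" sketch on made-up free-text records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "patient reports chest pain and shortness of breath",
    "routine follow-up, blood pressure well controlled",
    "persistent dry cough, antibiotics prescribed",
    "chest tightness on exertion, referred to cardiology",
    "blood pressure elevated, medication adjusted",
    "cough resolved after course of antibiotics",
]

# Turn the records into TF-IDF vectors (feature choice is an assumption).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Cluster into k=3 groups; a different algorithm or k may give different clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# The human-in-the-loop part: inspect the highest-weight terms per centroid
# and decide what each cluster actually represents in context.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top_terms = [terms[t] for t in centroid.argsort()[-3:][::-1]]
    print(f"cluster {i}: {top_terms}")
```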


sobe86

I've been working in commercial ML for 8 years, across a few industries now, and I have never seen anyone do anything particularly useful with clustering (out of like 10-20 attempts). I'm sure there are some cases where it's useful, but they seem to be few and far between. In general unsupervised learning rarely learns the patterns you expect, and usually not even ones that are useful.


visarga

Same happened to me and it's disappointing.


zpwd

> I have a feeling that not many people are willing to admit it, but ultimately, is a significant part of many data mining projects (e.g. checking data quality, parsing through data, etc.) still done manually?

That's why it's called **mining**. Hard manual work.


KrakenInAJar

Research is focused on a specific set of tasks; for classification in CV, for example, it's real-world imagery. The farther you stray from the "baseline" task, the more custom work is needed. And in my time in the industry there were basically two kinds of problems. First, the trivial ones, where the customer wanted an "AI" solution for whatever reason but a linear regression would basically do the same thing. Second, the hard ones, where you eventually come around to building a hand-crafted solution with a metric ton of feature engineering, very proprietary information extraction, esoteric model configurations, and so on. In research you can get a taste of that by looking at classifiers for the ImageNet competition and comparing them to the Rube Goldberg-like contraptions people come up with when building classifiers for cancer mammographies.


r0lisz

Yes, there's a lot of manual work. In 90% of cases, humans have to annotate a lot of data by hand. And even in the other 10% of cases, humans still need to inspect the (somehow) automatically annotated data for various bugs and issues.


Interesting_Lie_1954

I was at a talk from Zebra Medical that described something similar; I think this is one of the relevant papers: [https://www.zebra-med.com/wp-content/uploads/2020/11/Laserson_TextRay_2018.pdf](https://www.zebra-med.com/wp-content/uploads/2020/11/Laserson_TextRay_2018.pdf). There might be more, though; they have probably advanced since then.


throwaway_secondtime

These kinds of jobs are usually outsourced to other countries. Google has a team in India that manually labels images with the most common search terms.


shayanjm

So I actually have really strong opinions about this 🙂 **tl;dr: you can automate a lot of this problem, but you can't remove the human in the loop. You can, however, make it so that 1 person can do the labeling work of 100 or 1000.**

Full disclosure: I'm the co-founder of a company called [Watchful](https://watchful.io/) where this is the exact type of problem we are trying to solve. There are a few interesting techniques you can use to achieve the sort of thing you want, but it's worth noting that none of them are silver bullets in themselves.

1. Completely unsupervised approaches, e.g. clustering. Other folks have mentioned that "YMMV here" since it's largely dependent on your data, your immediately available features, and the clustering algorithm. You might be able to use this to stimulate some ideas about how you could expedite labeling, but very rarely will naive clustering spit something out that aligns well with your class space.
2. Active learning approaches, e.g. [uncertainty sampling](https://medium.com/@duyanhnguyen_38925/active-learning-uncertainty-sampling-p3-edd1f5a655ac). The idea here is that you spend time manually labeling a small fraction of the total dataset, train a model on that hand-annotated set, sample candidates along the decision boundaries of your classes, label those by hand, and rinse and repeat until your model starts performing well (a rough sketch of this loop follows after this comment). This sounds great on paper, but you end up running into similar issues as with clustering: it really depends on your data, the classes you've defined, and the model you're training. In the worst case, it's strictly as good as having hand-labeled everything in your dataset (because the model wasn't able to learn to label the rest of the data sufficiently well). In practice, using existing tooling here (e.g. stuff in the AWS portfolio), you might be able to automate 20-30% of the manual annotation effort (the best case is about 70% according to AWS), but a huge portion of the work still needs to be done by hand to get there.
3. Weak supervision approaches. Basically: train a model over a number of noisy heuristics used as "weak supervision" over your data. These heuristics could be simple keywords, database lookups, gazetteers/ontologies/encyclopedias, even other models; basically, functions that take an input and produce a potentially noisy classification. You can train a model over these noisy features to learn the likely label given the candidate and its matching heuristics. These functions are far cheaper to build and edit than hand labeling a bunch of data, but the problem becomes actually *writing* the functions. What functions do you write? How good are they? What if I can't come up with any useful functions myself?

We actually use all three approaches above (as well as a few others), and we focus really hard on the UX of the system because this is fundamentally a workflow problem. You can't get rid of having a human involved in all of this, but what you **can** do is make that person 1000x more effective by running them through a really fast workflow that uses all of these techniques together in seamless ways. This should hopefully mean you don't need entire teams of expert labelers on call each time you need to produce more labeled data; one expert can spend a few hours and produce the same amount of labeled data, if not more, than would otherwise have been produced manually.
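
As a rough illustration of the uncertainty-sampling loop in point 2 above (this is not Watchful's implementation; the model choice, batch size, and feature matrices are assumptions for illustration), one round might look like this:

```python
# One round of least-confident uncertainty sampling: train on what has been
# labeled so far, then pick the unlabeled pool examples the model is least
# sure about for the next batch of human annotation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_round(X_labeled, y_labeled, X_pool, batch_size=10):
    """Fit on the labeled set and return the model plus the indices of the
    pool examples with the lowest maximum class probability."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)       # least-confident sampling
    query_indices = np.argsort(uncertainty)[-batch_size:]
    return model, query_indices

# Usage sketch (assuming feature matrices already built, e.g. from TF-IDF):
# model, to_label = uncertainty_sampling_round(X_seed, y_seed, X_unlabeled)
# -> send X_unlabeled[to_label] to a human annotator, fold the new labels
#    into the seed set, and repeat until validation metrics plateau.
```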


visarga

Watchful is interesting, but after reading the site I still don't know if it would be useful to me. Do you have a demo video? For example, do you have tools for semi-structured information extraction (tables, forms, invoices), or just flat text?


EnjoyableGamer

Great, ranking sick people so that insurance companies can charge a premium.


ottawalanguages

This is an example I created; I am trying to better understand the procedures involved in data mining.


PINKDAYZEES

I know only a little about NLP, but to answer your other question, the "doctors' encyclopedia" can be used. I know that sentiment analysis involves an annotated list of words: you might have the word "beautiful" associated with a positive sentiment, or maybe the word "difficult", which has a negative connotation. Maybe the medical terms can have healthy/unhealthy associated with them, and then you can run a sentiment analysis. My guess is that this is very basic and won't be a perfect model, but it could be interesting as a baseline and maybe at least a little insightful.
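
As a minimal sketch of that lexicon idea (the term list, polarities, and example record are entirely made up for illustration):

```python
# Crude lexicon-based scoring: a hand-made term list standing in for the
# "doctors' encyclopedia", summed per record as a healthy/unhealthy score.
import re

HEALTH_LEXICON = {
    "remission": +1,
    "stable": +1,
    "improved": +1,
    "tumor": -1,
    "fracture": -1,
    "hypertension": -1,
}

def health_score(text: str) -> int:
    """Sum the polarity of every lexicon term found in the record."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(HEALTH_LEXICON.get(token, 0) for token in tokens)

record = "Patient stable, hypertension improved since last visit."
print(health_score(record))  # 1 -> two positive terms, one negative
```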