No code, wrapped: Our ML experiment concludes, but did the machine win?

itlnstln · Aug 15, 2022

Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."

As someone who has been in the BI/“big data” space for about 20 years, what this really looks like it’s some half-baked analyst thinking they can do better than the actual data scientists/ML engineers, producing *some* results in SageMaker, and the business (not knowing any better) being impressed and using the models.

Ultimately, this leads to disaster because less experienced practitioners don’t have the background knowledge or experience to ensure they clean, accurate data or understand model accuracy. This leads to bad decision making (our good decision making in bad data). We won’t even get into retraining, model drift, biases, etc.

A good example of this is Zillow where their model got out of control, leading them to over invest in properties leading to cuts and layoffs. Those models were most likely produced by actual (and even good) data scientists or ML engineers. There’s a ton of good that can come from these tools in the right hands, but even experienced practitioners make massive mistakes. I’ve seen time and again where a well-meaning analyst/business person tries some of these “black box” tools without the broader knowledge it needs to be successful and turn out horribly

Awesome series, I’d like to see more of this. That particular quote raised my alarm bells, though.

oatmeal · Aug 15, 2022

Good article. I feel like this would have been more powerful if you had better data. I know that's always the challenge in the real world, but if the test is to figure out how well you can model and predict while being an amateur it would have been nice to remove that variable. As for Canvas vs. Studio, it may be a good comparison to Dataiku -- which is marketed as being for both 'coders and clickers.' They blend both pro and amateur data scientists into the same tool, to work alongside each other. I have no experience with it, but it might make a great part 4.

Incomprehensible · Aug 15, 2022

Thanks for the interesting read-thru's. Aside from the nitty gritty details, the aspect that stands out is the time/effort it took you to do the experiment.

I suggest taking your statement, "useful for hammering out quick models for assessing the more workaday analytic problems facing an organization" one step further. The value is not just for workaday problems, this level of rapid experimentation with relatively few "data scientist" trained resources can be useful across a wide range of business activities.

With a few caveats... 1) that leaders understand an experiment is not the final work, and 2) that results should be reviewed with more knowledgeable resources (actual data scientists) and 3) positive reviewed results should lead to more thorough analysis.

Team Tardigrade · Aug 15, 2022

It's not a small data set. It's a hand-crafted *artisanal* data set.

C64 raids Bungling Bay · Aug 15, 2022

How do these results compare with other commercial software or performing the analysis in R?

KT421 · Aug 15, 2022

There's a fair bit you can do to incorporate missing data into your model, but it can be pretty tricky and I've never tried imputing or accounting for missing data in SageMaker (honestly, never used SageMaker at all).

I will say, there's no such thing as a low-code or no-code solution for AI/ML. There's just code that you didn't write yourself. That makes it easy to crank out a quick model even with minimal training, but it also makes it easy to make an absolutely crap model without realizing it.

password123 · Aug 15, 2022

itlnstln said:
Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."

Click to expand...

As someone who has been in the BI/“big data” space for about 20 years, what this really looks like it’s some half-baked analyst thinking they can do better than the actual data scientists/ML engineers, producing *some* results in SageMaker, and the business (not knowing any better) being impressed and using the models.

Ultimately, this leads to disaster because less experienced practitioners don’t have the background knowledge or experience to ensure they clean, accurate data or understand model accuracy. This leads to bad decision making (our good decision making in bad data). We won’t even get into retraining, model drift, biases, etc.

A good example of this is Zillow where their model got out of control, leading them to over invest in properties leading to cuts and layoffs. Those models were most likely produced by actual (and even good) data scientists or ML engineers. There’s a ton of good that can come from these tools in the right hands, but even experienced practitioners make massive mistakes. I’ve seen time and again where a well-meaning analyst/business person tries some of these “black box” tools without the broader knowledge it needs to be successful and turn out horribly

Awesome series, I’d like to see more of this. That particular quote raised my alarm bells, though.

As a non-data scientist, I took some optimism in Smith's quote to the effect that a non-data scientist could do some proof on concept for some XYZ feature implementation with a particular data analysis approach and know (with some accuracy) whether to continue (or not) with said approach with actual data scientists and spending effort to get more data, to make the XYZ feature real. Then I think about this a little more, and "nope" ain't gonna happen for any reliable proof of concept, for the reasons given in your second paragraph, now bolded. Sure these tools are quite powerful, but from my perspective I literally have no clue how much half-assing I would be performing trying to get some desired answer out of such a proof of concept: is my small dataset good or bad, have I modeled the problem properly, and how can I even tell?

Yes that quote is very "rosy glasses", but of course these data analytics giants are going to tell you that, they want to sell their services. Excellent article, extremely fascinating!

Jerdak · Aug 15, 2022

itlnstln said:
Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."

Click to expand...

...

Ultimately, this leads to disaster because less experienced practitioners don’t have the background knowledge or experience to ensure they clean, accurate data or understand model accuracy. This leads to bad decision making (our good decision making in bad data). We won’t even get into retraining, model drift, biases, etc.

.....

Easier data science tooling is rarely a bad thing but I worry when business analysts without the requisite background are let loose on data. I recall once our business development folks asked me to tease out some answers on a very small data set, ~100 rows.

Like the "scarce data scientists" above, my team said things like "what's the ROI, you need more data, etc." In response, the BD folks went and got their tools and answers. The problem was (1) Precision/Recall on 100 rows of data is, I would argue, almost meaningless. 1 discrepancy has an outsized impact. and (2) They didn't understand over fitting or test sets so the "90% accurate model" they trained was barely better than a guess when applied to unseen data. But I get their frustration. It's hard not to see DS teams as being elitist when they don't immediately try to provide answers. I'm still trying to figure out how to temper expectations in a way that leaves the door open and doesn't make anyone feel ignored.

Xerxex · Aug 15, 2022

I'd love a follow up/continuation where you compare this to arguments from Kahneman, etc. that straightforward (relatively) algorithms created by experts are generally more effective/efficient to answer a lot of questions that companies often use ML instead for.

Fast Turtle · Aug 15, 2022

Begs the question: Is it a Nail or Screw? If it's a Nail, then a hammer is the correct tool for the job while if it's a Screw, then you need a screw driver to tackle it.

In other words, use the right tool for the job.

Good Series of Articles though on one of the available tools. I could see using this as an ameture to get a better understanding of some things with the risk of screwing things up by using the wrong data.

Qwertilot · Aug 15, 2022

Xerxex said:
I'd love a follow up/continuation where you compare this to arguments from Kahneman, etc. that straightforward (relatively) algorithms created by experts are generally more effective/efficient to answer a lot of questions that companies often use ML instead for.

That would fascinate me, yes. ML does some amazing things in a number of domains.

Yet all my semi applied experience (sundry Uni research projects) has ended up favouring quite simple approaches. So many ways to for that to happen. You often have not enough data, often enough it really isn't really rich enough to support machine learning, sometimes the company really doesn't want anything complex etc etc.

wasbee56 · Aug 15, 2022

interesting experiment. of concern to me is the 'black box-ness' of the process. fine when you can compare given results with actual outcome, but.... to me it's a more elaborate version of gigo in many ways. no harm in the 'desktop lab' but too often projected as a statistical certainty into RL not taking into account the interrelationships that evolve over time in complex datasets.. pretty cool tho

crmarvin42 · Aug 15, 2022

password123 said:
itlnstln said:

Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."

Click to expand...

As someone who has been in the BI/“big data” space for about 20 years, what this really looks like it’s some half-baked analyst thinking they can do better than the actual data scientists/ML engineers, producing *some* results in SageMaker, and the business (not knowing any better) being impressed and using the models.

Ultimately, this leads to disaster because less experienced practitioners don’t have the background knowledge or experience to ensure they clean, accurate data or understand model accuracy. This leads to bad decision making (our good decision making in bad data). We won’t even get into retraining, model drift, biases, etc.

A good example of this is Zillow where their model got out of control, leading them to over invest in properties leading to cuts and layoffs. Those models were most likely produced by actual (and even good) data scientists or ML engineers. There’s a ton of good that can come from these tools in the right hands, but even experienced practitioners make massive mistakes. I’ve seen time and again where a well-meaning analyst/business person tries some of these “black box” tools without the broader knowledge it needs to be successful and turn out horribly

Awesome series, I’d like to see more of this. That particular quote raised my alarm bells, though.

Click to expand...

As a non-data scientist, I took some optimism in Smith's quote to the effect that a non-data scientist could do some proof on concept for some XYZ feature implementation with a particular data analysis approach and know (with some accuracy) whether to continue (or not) with said approach with actual data scientists and spending effort to get more data, to make the XYZ feature real. Then I think about this a little more, and "nope" ain't gonna happen for any reliable proof of concept, for the reasons given in your second paragraph, now bolded. Sure these tools are quite powerful, but from my perspective I literally have no clue how much half-assing I would be performing trying to get some desired answer out of such a proof of concept: is my small dataset good or bad, have I modeled the problem properly, and how can I even tell?

Yes that quote is very "rosy glasses", but of course these data analytics giants are going to tell you that, they want to sell their services. Excellent article, extremely fascinating!

Not a data scientist, though I do my best at times.

I read it the same way you did. Pilot analysis to justify a REAL analysis later. Reminds me of the new professor conundrum from an AFRI grant seminar a few years ago.

AFRI is the agricultural equivalent to NIH and their preferred style of granting (A few really large and expensive grants over many more smaller grants) is patterned after the NIH system. Since they are laying out generally over a million dollars per project, they want a lot of assurances that the money won't be wasted, which means lots and LOTS of preliminary trial data. So much so, that at times it feels like the actual grant is borderline unnecessary, as we already know what is going to happen (not really, but it feels that way at times).

This is fine for established professors with solid reputations, good funding, and plenty of prior data to rely on. Not so much for new professors with just a bit of start-up money, no real prior data (they just started after all), but a pretty overt instruction from their department that AFRI grants = tenure, no AFRI grants = a swift boot in the pants in a few years. The guy overseeing the grant program pretty much admitted it was fucked up, but also declined to do anything about it. A few professor friends of mine were pretty non-plussed by the whole thing.

Now I'm not in academia, but allied industry. So itlnstln's concerns touch on a VERY common problem I run into. Management similarly wants assurances that something is successful BEFORE they support it, but paradoxically if it becomes successful without the support, then the support was unnecessary and you may not get it even after the product sees some success. If I can generate a proof of concept, then management may make a decision based on that proof of concept and skip the confirmatory analysis to save money. If it works, he's a great manager, if it fail, then you should have done a better job with your proof of concept analysis.

Fatesrider · Aug 15, 2022

Sean, as an example of applying ML/AI to a problem, this was an exceptionally detailed and informative series, with usable results in the outcome. More data was needed for analysis. That's to be expected in ML, because you can't really predict when, or if, the "tipping point" in the analysis will hit an acceptable level of accuracy.

I don't know if you've had any experience in the medical field, but I had 20 years, and knew from the beginning that there wasn't enough data for this to produce a result that had anywhere near an acceptable level of accuracy. It's not at all because I know how ML/AI works. It's because humans wouldn't be able to figure this stuff out if they were limited to just the data fed into the AI.

A physician undergoes years of internship and residency, and will see THOUSANDS of patients in that time. A specialty fellow will spend more years in their specialty and see hundreds to thousands of people who fall into that specialty every year. Diagnosing a heart attack is as much art as it is science, because people aren't numbers on a screen. Their physicality is difficult for a machine to grasp, and the human brain's intuitive capabilities to draw accurate conclusions from inconclusive data (aka: Experience) isn't quite as quantifiable as it would need to be. And when in doubt, which happens a HELL of a lot in medicine (spoiler!), older, wiser, more experienced comrades are called upon to help.

Also, everyone who presents with chest pain that can't be explained by physical examination (x-ray, auscultation of the heart sounds, lungs and airways, blood pressure, pulse and physical presentation such as diaphoresis, peripheral edema and/or cyanosis, etc.) or the history (type of pain, location, quality, onset, duration, cause such as trauma, etc.), will undergo a cardiac work-up since that's a precaution non-specific chest pain merits. A reliably accurate AI diagnosis MIGHT prevent someone from being admitted who didn't need to be admitted, but I'm inclined to think that wouldn't happen, either.

Applying AI works better to discover things people miss, especially in imaging and other exams that have a visual or even aural component. Heart attacks CAN be missed, but that's usually when someone has none of the typical symptoms, and their lab and EKG results are negative (Yes, it can happen, but not that often). That's where they're used now, IIRC.

AI/ML works better to find patterns rather than binary results, though. There's always a level of uncertainty involved in pretty much everything, which is why doctors have to make a decision. AI's can aid in pointing out data, but I'd think that making binary decisions like "treat or don't treat" should fall on the shoulders of those who have better processing capabilities because the quality of data is higher in what a human can perceive than what an AI is likely to understand - at least for the foreseeable future.

Still, a great series for explaining how to train your AI.

ZhanMing057 · Aug 15, 2022

KT421 said:
There's a fair bit you can do to incorporate missing data into your model, but it can be pretty tricky and I've never tried imputing or accounting for missing data in SageMaker (honestly, never used SageMaker at all).

I will say, there's no such thing as a low-code or no-code solution for AI/ML. There's just code that you didn't write yourself. That makes it easy to crank out a quick model even with minimal training, but it also makes it easy to make an absolutely crap model without realizing it.

Multiple Imputations by Chained Equations (MICE) is, in my opinion, the gold standard, depending on the type and structure of the missingness. I use it as one of my go-to paired interview coding questions - there are a lot of interesting issues regarding implementation.

OllieJones · Aug 15, 2022

Obviously the data points in this study represent real patients. They're precious and hard-won. They're expensive and difficult enough to obtain that models based on them will not be useful for mass screening.

And (hopefully) the cardiologists at the Cleveland Clinic and elsewhere wouldn't rely exclusively on any such model for diagnosing their patients.

A data set of Apgar scores (easy-to-measure information about newborn baby health) and outcomes might be interesting to use to build an experimental model to see how well it does. Almost all newborns (in the USA anyhow) get screened this way, so there will be plenty of data.

poltroon · Aug 15, 2022

Jerdak said:
itlnstln said:

Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."

Click to expand...

...

Ultimately, this leads to disaster because less experienced practitioners don’t have the background knowledge or experience to ensure they clean, accurate data or understand model accuracy. This leads to bad decision making (our good decision making in bad data). We won’t even get into retraining, model drift, biases, etc.

.....

Click to expand...

Easier data science tooling is rarely a bad thing but I worry when business analysts without the requisite background are let loose on data. I recall once our business development folks asked me to tease out some answers on a very small data set, ~100 rows.

Like the "scarce data scientists" above, my team said things like "what's the ROI, you need more data, etc." In response, the BD folks went and got their tools and answers. The problem was (1) Precision/Recall on 100 rows of data is, I would argue, almost meaningless. 1 discrepancy has an outsized impact. and (2) They didn't understand over fitting or test sets so the "90% accurate model" they trained was barely better than a guess when applied to unseen data. But I get their frustration. It's hard not to see DS teams as being elitist when they don't immediately try to provide answers. I'm still trying to figure out how to temper expectations in a way that leaves the door open and doesn't make anyone feel ignored.

"Making decisions through data" is very sexy now in all the business domains, along with "north star metrics" but it's kind of alarming how overtrusting these groups can be about conclusions based on very small numbers or kind of iffy assumptions. Just because it's a number doesn't mean it's objective. Just because it's a number or easy/possible to measure doesn't mean it's actually measuring what matters to you.

When people around me get too attached to "north star metrics" I remind them that the north pole and magnetic north and the north star are in slightly different places and that you should course correct as you go along your journey, and not blindly follow that metric to the end, lest you find yourself on some ice floe.

JohnDeL · Aug 15, 2022

What I would point out to the people using data analysis tools at my old job is that the tools were intended to supplement a person's judgement, not replace it. As Dr. Gallagher was at pains to point out, there are simply times when the tool will be wrong. (In my old job, false positives were far more important than false negatives; here the opposite is the case.) As I used to tell the end-users, when in doubt, trust yourself and ignore what the computer says.

Unfortunately, I foresee insurance companies and upper management using these tools to deny insurance claims/necessary but expensive treatment. ("It can't be a heart attack; the computer said it wasn't!")

Sheep Disorder · Aug 15, 2022

wasbee56 said:
interesting experiment. of concern to me is the 'black box-ness' of the process. fine when you can compare given results with actual outcome, but.... to me it's a more elaborate version of gigo in many ways. no harm in the 'desktop lab' but too often projected as a statistical certainty into RL not taking into account the interrelationships that evolve over time in complex datasets.. pretty cool tho

I find that people refering to black box do not understand statistics and modeling involved. It is not a black box because we do know how these things work (otherwise it cannot be calculated). But it is a black box is due to the lack of simple explanation that you get with basic linear regression.

Once you get beyond a simple model, it is not trival to explain any statistical methodology especially with categorical data as used here. Sure, I can say an increase in age will increase the risk of having a heart disease. But that does not mean that as someone gets older then they must have a heart disease. It is a little more complex if we add sex as older men may have increased risk of heart disease than say young women. Then starts getting insanely difficult if you include multiple levels of interactions with factors like food and work. Thus, you end up with a black box because you cannot explain happen is going on because these machine learning approaches are typically involving all possible interactions.

The only question of relevance is does some particular characteristics like old, retired and man increase the risk of heart disease or not?
Thus, the focus on how well does the model predict the outcome. It is not important that older retired men have higher risk of a heart disease than some other combination. If we really wanted to know the importance of actual factors then we need a completely different approach and probably experiment.

jamesdedon · Aug 15, 2022

C64 raids Bungling Bay said:
How do these results compare with other commercial software or performing the analysis in R?

It's pretty much the same, with R simply reusing the same packages (sklearn, etc.) and performing the same tasks. The only difference is the syntax used to load the data and run the packages.

TheOldChevy · Aug 15, 2022

Thanks for this precise article.

What I take from it is that AI can get you better results and faster than a non-specialist in data analysis, and for a much lower cost, but the time and the experience of the trained professional is expected to do better than the "low cost AI" - even if this is not valid by the article.

And that is what is frightening: having "managers" going for the fast and low cost solution to get rapid answers and dismiss the effort to go for the expert. It is fine for many tasks that are not critical and that are globally supervised by people who are professionals or at least experienced enough to flag when the AI provides crazy results. But I am afraid that it will be used where it shouldn't be used.

SuperAce99 · Aug 15, 2022

C64 raids Bungling Bay said:
How do these results compare with other commercial software or performing the analysis in R?

Good question. For a dataset this size, I would absolutely start with R or similar. I accept that those require deeper knowledge/skills than included in this brief. Something like SPSS would be good, but it's been captured and squeezed by IBM.

I'd recommend jumping into R Studio (rebranding to Posit soon) if you want to get into the R ecosystem. They're doing a decent job expanding into Python, but Anaconda is still more Python-native if you want to go that direction. Anaconda's desktop installer has a nice launcher that can get you up and running with Jupyter notebooks, Spyder, etc pretty quickly.

In general, there's zero reason to spend $1000 on AWS bills before you've started to run out of memory or patience on your laptop... and that is not going to happen with tiny datasets like this.

Deleted member 221201 · Aug 15, 2022

Jerdak said:
itlnstln said:

Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."

Click to expand...

...

Ultimately, this leads to disaster because less experienced practitioners don’t have the background knowledge or experience to ensure they clean, accurate data or understand model accuracy. This leads to bad decision making (our good decision making in bad data). We won’t even get into retraining, model drift, biases, etc.

.....

Click to expand...

Easier data science tooling is rarely a bad thing but I worry when business analysts without the requisite background are let loose on data. I recall once our business development folks asked me to tease out some answers on a very small data set, ~100 rows.

Like the "scarce data scientists" above, my team said things like "what's the ROI, you need more data, etc." In response, the BD folks went and got their tools and answers. The problem was (1) Precision/Recall on 100 rows of data is, I would argue, almost meaningless. 1 discrepancy has an outsized impact. and (2) They didn't understand over fitting or test sets so the "90% accurate model" they trained was barely better than a guess when applied to unseen data. But I get their frustration. It's hard not to see DS teams as being elitist when they don't immediately try to provide answers. I'm still trying to figure out how to temper expectations in a way that leaves the door open and doesn't make anyone feel ignored.

Given the data sparsity, the classifier used above is a good choice.

You are basically using a Decision Tree ensemble that is gradient descent boosted/optimized

I'm somewhat surprised it cost $1000, but since its for a series of articles & Sean has posted the metrics etc, it is much appreciated.

Another possible approach would be to use a Random Forest & set the grid search to reasonable values after looking at the dataset and imputing any missing values etc

The final set of models could be then checked against XGB to see if you gain any lift over the prior model, but since this was a no-code exercise, the experiment achieved it's goal and it was very well written

crmarvin42 · Aug 15, 2022

SuperAce99 said:
C64 raids Bungling Bay said:

How do these results compare with other commercial software or performing the analysis in R?

Click to expand...

Good question. For a dataset this size, I would absolutely start with R or similar. I accept that those require deeper knowledge/skills than included in this brief. Something like SPSS would be good, but it's been captured and squeezed by IBM.

I'd recommend jumping into R Studio (rebranding to Posit soon) if you want to get into the R ecosystem. They're doing a decent job expanding into Python, but Anaconda is still more Python-native if you want to go that direction. Anaconda's desktop installer has a nice launcher that can get you up and running with Jupyter notebooks, Spyder, etc pretty quickly.

In general, there's zero reason to spend $1000 on AWS bills before you've started to run out of memory or patience on your laptop... and that is not going to happen with tiny datasets like this.

I have tried several times to get into R for statistical analysis. First time was probably 20 years ago, then 12 years ago, then about 8 years ago, and finally just last year I tried again. Never could get much past getting it up and running, and using a couple of dummy data sets. First time I tried analyzing a fresh dataset of my own it has always fallen apart and I end up resorting to rather expensive software from SAS (JMP specifically)

Open source, cross platform, every growing library of capabilities. Seems great on paper, but I've never been able to get the hang of the coding. And R-studio doesn't really make that any easier. It is more of a tool to make certain things easier for people already intimately familiar with R than it is a tool to make R easier to get into.

poltroon · Aug 15, 2022

crmarvin42 said:
SuperAce99 said:

C64 raids Bungling Bay said:

How do these results compare with other commercial software or performing the analysis in R?

Click to expand...

Good question. For a dataset this size, I would absolutely start with R or similar. I accept that those require deeper knowledge/skills than included in this brief. Something like SPSS would be good, but it's been captured and squeezed by IBM.

I'd recommend jumping into R Studio (rebranding to Posit soon) if you want to get into the R ecosystem. They're doing a decent job expanding into Python, but Anaconda is still more Python-native if you want to go that direction. Anaconda's desktop installer has a nice launcher that can get you up and running with Jupyter notebooks, Spyder, etc pretty quickly.

In general, there's zero reason to spend $1000 on AWS bills before you've started to run out of memory or patience on your laptop... and that is not going to happen with tiny datasets like this.

Click to expand...

I have tried several times to get into R for statistical analysis. First time was probably 20 years ago, then 12 years ago, then about 8 years ago, and finally just last year I tried again. Never could get much past getting it up and running, and using a couple of dummy data sets. First time I tried analyzing a fresh dataset of my own it has always fallen apart and I end up resorting to rather expensive software from SAS (JMP specifically)

Open source, cross platform, every growing library of capabilities. Seems great on paper, but I've never been able to get the hang of the coding. And R-studio doesn't really make that any easier. It is more of a tool to make certain things easier for people already intimately familiar with R than it is a tool to make R easier to get into.

There are MOOCs that teach it that you may find of use.

My daughter learned R as an undergraduate and even without any "real" (ie non-statistical) coding experience, came up to speed on it quite quickly, ending with a signature project on a real data set sourced from the wilds. The key I think is to start with some straightforward canned problems that resolve properly and/or with some guidance so you can understand the tricks of the trade.

FreshAir · Aug 15, 2022

KT421 said:
There's a fair bit you can do to incorporate missing data into your model, but it can be pretty tricky and I've never tried imputing or accounting for missing data in SageMaker (honestly, never used SageMaker at all).

I will say, there's no such thing as a low-code or no-code solution for AI/ML. There's just code that you didn't write yourself. That makes it easy to crank out a quick model even with minimal training, but it also makes it easy to make an absolutely crap model without realizing it.

In the same vein as "baking pies from scratch".... if you want to hand roll your own AI solution, you have to start by writing your own assembly language.

low code / no code solutions means you can trust that someone else correctly implemented XGBoost or an SVM. You can further trust that someone else can correctly set up a 5 fold cross validation. What NONE of these tools will do is "imagine" for you, or make logical decisions about the suitability of the data to the task. This is what I mean by suitability:

The mistake I see new AI/ML engineers do the most, is a variant of training/serving skew specifically on the "TIME OF DATA AVAILABILITY" front. In essence, the dataset you have is of course historical, but very often, those data points came into existence at different times. Chest pains (if recorded in the affirmative) will almost always precede lab workups and EKGs because there is more than a correlative relationship between them. And carrying that notion a little further, we can state that nearly all causative relationships have SOME time offset between the cause and effect. But I have never once seen a data set where EVERY event was properly "effective dated" such that I can reconstruct the timeline as it unfolded. Knowing this problem exists and taking steps to mitigate it is what separates a lot of successful projects from failures.

I can give you a concrete example of this point: I have a data set where sometimes, the event I want to predict happens BEFORE the timestamp of the preceding event is recorded in the database (the time delta between the event that tells me I need to make a prediction and the timestamp of the event I wanted to predict can be shorter than the business process that causes that data to be recorded). But to look at the data itself, you would never know this issue existed. So you can build a model and confidently say "I can predict this value with an R^2 of .95" but in reality, you are failing to even make a prediction in many cases because the total system latency prevents you from doing so.

For any model that you wish to operationalize within a business, you're going to have to tap the business data, and that is where the real challenges lie. Even after working around business process issues, tapping the data warehouse puts you at risk of latency issues, and tapping the ODS pisses off the DBAs and LOB systems guys (putting analytical loads on operational data stores is a no-no). In the case of the heart attack model, the latency is probably in the realm of weeks or months, where the usefulness of the information probably only spans days or maybe weeks. (I know, I know, this is the promise of all the IOT devices, but that just leads to a new problem... trying to integrate your algorithm on thousands of different platforms, none of which has a full picture of all the data you just trained your model on in the first place)

Personally I think the job of a data scientist is the study of this data "in the wild", Jane Goodall style, to uncover these hidden relationships. That is what keeps you from falling into the trap of "my R^2 looks really good, why doesn't it work when we turn it lose on live data?"

tigerhawkvok · Aug 15, 2022

Most ML algorithms by default run variations on least squares (or area under curve/log loss) as their "loss" algorithms -- they optimize by attempting to minimize that value. This is usually correct.

For a question like this where one classification (or type of value is nonstandard, or whatever), you probably want to adjust your loss function (or roll your own) to find a different optimum. For example, if you were running log loss, your loss function was similar to this: https://scikit-learn.org/stable/modules ... _loss.html

You could take the log loss function and, say, multiply the first time by two to say that a true positive is twice as valuable as a true negative. It would probably decrease your overall accuracy (the model would tend to err on 'positive' and pick up bad values) but probably _increase_ your target metric. (Or, of course, do something more complex to balance the probabilities, but this is a toy example).

(my job role is kind of a crunchy data scientist; a data scientist who is heavy on the code and models and math. An IRL example for me is that we usually care about _median_ error rather than mean error, so I usually have to do mangling on default algorithms)

Jerdak · Aug 15, 2022

madmax559 said:
Jerdak said:

Easier data science tooling is rarely a bad thing but I worry when business analysts without the requisite background are let loose on data. I recall once our business development folks asked me to tease out some answers on a very small data set, ~100 rows.

....

Click to expand...

Given the data sparsity, the classifier used above is a good choice.

You are basically using a Decision Tree ensemble that is gradient descent boosted/optimized.

That speaks directly to my point about tools giving a false sense of security and proficiency. Few outside the technical fields know about or understand decision trees and it isn't a 'Set it and Forget it' algorithm. The naïve user won't know about limiting tree depth or pruning. They won't understand the effect multicollinearity has on required sample size. Nor would they understand the importance of trimming features. Users have given me data with more columns than rows several times or categorical data with missing values that can't be imputed. I digress.

I'm not averse to technological simplification. I'm a fan of tooling myself and don't want to spend hours hand writing the same ML/DS boilerplate. But to quote the great Dr. Ian Malcolm: "Your scientists were so preoccupied with whether they could, they didn't stop to think if they should." Amazon will certainly make this even easier but I think most Ars readers have seen what happens when we cede more control and understanding to black boxes.

CityZen · Aug 15, 2022

The first confusion matrix and the text around it don't seem to agree. At least, it's leaving me confused.

JMTronicHobbyist · Aug 15, 2022

itlnstln said:
Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."

Click to expand...

As someone who has been in the BI/“big data” space for about 20 years, what this really looks like it’s some half-baked analyst thinking they can do better than the actual data scientists/ML engineers, producing *some* results in SageMaker, and the business (not knowing any better) being impressed and using the models.

Ultimately, this leads to disaster because less experienced practitioners don’t have the background knowledge or experience to ensure they clean, accurate data or understand model accuracy. This leads to bad decision making (our good decision making in bad data). We won’t even get into retraining, model drift, biases, etc.

A good example of this is Zillow where their model got out of control, leading them to over invest in properties leading to cuts and layoffs. Those models were most likely produced by actual (and even good) data scientists or ML engineers. There’s a ton of good that can come from these tools in the right hands, but even experienced practitioners make massive mistakes. I’ve seen time and again where a well-meaning analyst/business person tries some of these “black box” tools without the broader knowledge it needs to be successful and turn out horribly

Awesome series, I’d like to see more of this. That particular quote raised my alarm bells, though.

I like data wrangler helping piece together bad data - I don't like the sound of that - the ML p-hacking its way to result for the user's pleasure perhaps (a gross simplification to be sure)? Icky icky icky!

ZhanMing057 · Aug 15, 2022

crmarvin42 said:
SuperAce99 said:

C64 raids Bungling Bay said:

How do these results compare with other commercial software or performing the analysis in R?

Click to expand...

Good question. For a dataset this size, I would absolutely start with R or similar. I accept that those require deeper knowledge/skills than included in this brief. Something like SPSS would be good, but it's been captured and squeezed by IBM.

I'd recommend jumping into R Studio (rebranding to Posit soon) if you want to get into the R ecosystem. They're doing a decent job expanding into Python, but Anaconda is still more Python-native if you want to go that direction. Anaconda's desktop installer has a nice launcher that can get you up and running with Jupyter notebooks, Spyder, etc pretty quickly.

In general, there's zero reason to spend $1000 on AWS bills before you've started to run out of memory or patience on your laptop... and that is not going to happen with tiny datasets like this.

Click to expand...

I have tried several times to get into R for statistical analysis. First time was probably 20 years ago, then 12 years ago, then about 8 years ago, and finally just last year I tried again. Never could get much past getting it up and running, and using a couple of dummy data sets. First time I tried analyzing a fresh dataset of my own it has always fallen apart and I end up resorting to rather expensive software from SAS (JMP specifically)

Open source, cross platform, every growing library of capabilities. Seems great on paper, but I've never been able to get the hang of the coding. And R-studio doesn't really make that any easier. It is more of a tool to make certain things easier for people already intimately familiar with R than it is a tool to make R easier to get into.

I'm not really sure what the exact issue is. I've taught courses in statistical programming at the grad level, and R programming specifically at the undergraduate level, and while the learning curve is steeper than, say, Python, I don't think I've ever run into anyone who can't get it to work at all.

Code is just like language, some are easy to learn, others harder, but with very few exceptions they're pretty much the same thing under the hood. It doesn't matter if it's R, or Python, or C++ or even something a bit less hand-holding such as Fortran, the core principles are not so different, and if you know one, you should be able to code functionally (if not efficiently) in any of the others with a bit of retooling.

Practically, there are millions of pieces of demo code for statistics problems in R floating around. If you want to solve a particular issue (and isn't on the very cutting edge of applied statistics), chances are someone has done essentially this exact thing, and their code gets you 90% of the way there. On the proficiency side, I'd start with Hackerrank and work through their entire repository of R exercises - it's pretty small relative to what they have for Python IIRC.

cripes · Aug 15, 2022

As an economist with decades of stats/econometrics training and experience, not sure what to make of all this. I can see a bunch of notebooks that examine this dataset on Kaggle. All look like copies with minor variation but good example: https://www.kaggle.com/code/namanmancha ... 0-accuracy.

I am struck by:

1. the fact that the model with the highest accuracy (90% to be precise) is the logistic regression. So much for ML when the old school stats modeling approach wins out! There's not even a model selection loop in one of the books I looked at that achieved this result, just all drivers thrown in and model fit called once.

2. The accuracy result is highly dependent on what mother RNG or dear researcher decides is the training vs testing data. Highly fishable results in the hands of motivated amateurs...

3. The art of statistical model building with or without ML tools is deciding on what the candidate drivers should be and transforming them appropriately. That's where domain knowledge and theory come in. Otherwise you have idiots or computers throwing in sunspots as controls in your heart attack prediction models. Granted you can get away with more naivety with much bigger datasets using ML tools but then we are talking big data not 300 observation clinical trials.

Deleted member 221201 · Aug 15, 2022

One of the images above has "Create an Autopilot experiment" as the title.......which is funny....given the ML context

Deleted member 221201 · Aug 15, 2022

@Sean,

I forgot to ask a couple of questions earlier, about the data selection/split

1. Did you use a fixed seed so the testing was deterministic ?

2. The binary classes were not 505/50 in the data set so were any weights added to account for class imbalance ?

or you can do that in XGB directly

From online docs

Code:

model = XGBClassifier(scale_pos_weight=100)

The XGBoost documentation suggests a fast way to estimate this value using the training dataset as the total number of examples in the majority class divided by the total number of examples in the minority class.

scale_pos_weight = total_negative_examples / total_positive_examples

ZhanMing057 · Aug 15, 2022

cripes said:
As an economist with decades of stats/econometrics training and experience, not sure what to make of all this. I can see a bunch of notebooks that examine this dataset on Kaggle. All look like copies with minor variation but good example: https://www.kaggle.com/code/namanmancha ... 0-accuracy.

I am struck by:

1. the fact that the model with the highest accuracy (90% to be precise) is the logistic regression. So much for ML when the old school stats modeling approach wins out! There's not even a model selection loop in one of the books I looked at that achieved this result, just all drivers thrown in and model fit called once.

2. The accuracy result is highly dependent on what mother RNG or dear researcher decides is the training vs testing data. Highly fishable results in the hands of motivated amateurs...

3. The art of statistical model building with or without ML tools is deciding on what the candidate drivers should be and transforming them appropriately. That's where domain knowledge and theory come in. Otherwise you have idiots or computers throwing in sunspots as controls in your heart attack prediction models. Granted you can get away with more naivety with much bigger datasets using ML tools but then we are talking big data not 300 observation clinical trials.

I have a rather dim view of Kaggle because (1) it's a waste of time for anyone who is even slightly proficient at ML, since there are any number of real projects to work on, either related to your hobby or quick research/consulting jobs that will actually pay you money and (2) it takes attention away from the business-relevancy question, which in almost all cases is much harder than building a model that predicts stuff out of sample.

Take heart attacks. If you screened every single person that so much as had a tight chest, you'd reduce prevalence of heart attacks but at a disproportionate cost. Even in this relative simple question, there is a fairly tricky false positive/false negative trade-off. Where on the ROC you want to end up at is a complicated issues involving domain-specific knowledge, resource constraints and, occasionally, ethics. For example, to make the policy useful, you're implicitly assuming that some unforeseen heart attacks are okay, to lower the number of unproductive screenings.

With more complex problems, my experience is that parametric methods, even the very smart ones (elastic net, fancy variable selection MCMCs) tend to underperform neural nets when it comes to predictive performance. It depends on what you want - in many cases, it's perfectly fine to not know exactly what is driving the outcome, so long that the model is sufficiently robust to inputs.

EternalStudent07 · Aug 15, 2022

Having only read the intro I would guess "over fitting" of your data is a likely problem if worried about hitting 90% accuracy. Meaning it was good at that specific data set, but would fall down with new data. You're supposed to split your data (can't remember if multiple times or not). Use some data to train, and other data you didn't see before to confirm the result on further new data.

I'm not a data scientist either, but did some reading and that was what stuck

Hope you had fun poking those tools, and maybe I'll return to see if I can learn something.

crmarvin42 · Aug 15, 2022

ZhanMing057 said:
crmarvin42 said:

SuperAce99 said:

C64 raids Bungling Bay said:

How do these results compare with other commercial software or performing the analysis in R?

Click to expand...

Good question. For a dataset this size, I would absolutely start with R or similar. I accept that those require deeper knowledge/skills than included in this brief. Something like SPSS would be good, but it's been captured and squeezed by IBM.

I'd recommend jumping into R Studio (rebranding to Posit soon) if you want to get into the R ecosystem. They're doing a decent job expanding into Python, but Anaconda is still more Python-native if you want to go that direction. Anaconda's desktop installer has a nice launcher that can get you up and running with Jupyter notebooks, Spyder, etc pretty quickly.

In general, there's zero reason to spend $1000 on AWS bills before you've started to run out of memory or patience on your laptop... and that is not going to happen with tiny datasets like this.

Click to expand...

I have tried several times to get into R for statistical analysis. First time was probably 20 years ago, then 12 years ago, then about 8 years ago, and finally just last year I tried again. Never could get much past getting it up and running, and using a couple of dummy data sets. First time I tried analyzing a fresh dataset of my own it has always fallen apart and I end up resorting to rather expensive software from SAS (JMP specifically)

Open source, cross platform, every growing library of capabilities. Seems great on paper, but I've never been able to get the hang of the coding. And R-studio doesn't really make that any easier. It is more of a tool to make certain things easier for people already intimately familiar with R than it is a tool to make R easier to get into.

Click to expand...

I'm not really sure what the exact issue is. I've taught courses in statistical programming at the grad level, and R programming specifically at the undergraduate level, and while the learning curve is steeper than, say, Python, I don't think I've ever run into anyone who can't get it to work at all.

Code is just like language, some are easy to learn, others harder, but with very few exceptions they're pretty much the same thing under the hood. It doesn't matter if it's R, or Python, or C++ or even something a bit less hand-holding such as Fortran, the core principles are not so different, and if you know one, you should be able to code functionally (if not efficiently) in any of the others with a bit of retooling.

Practically, there are millions of pieces of demo code for statistics problems in R floating around. If you want to solve a particular issue (and isn't on the very cutting edge of applied statistics), chances are someone has done essentially this exact thing, and their code gets you 90% of the way there. On the proficiency side, I'd start with Hackerrank and work through their entire repository of R exercises - it's pretty small relative to what they have for Python IIRC.

The problem is likely with me. I don't like to write code. JMP offers a UI that builds the code as you select options. Sure you can always access the code, and I do on occasion, but most of the time it is unnecessary to get what you want out of the system quickly.

I'm sure R can do everything I do in JMP, but it would either require me finding existing code to modify, or remembering the language well enough to craft it from scratch each time I needed to do something. Either one is quite a bit slower.

I had to take all of my early stats classes using SAS, and I hated statistics. Once I found jump, I learned that I actually enjoy statistics. What I hated was being a shitty programer. Looking for that missing semi-colon, or trying to parse the instructions for a module to figure out how to make it do what I wanted. Never sure if it wasn't working because it was a misapplication of the module, or because I'd just gotten the syntax wrong.

cripes · Aug 15, 2022

ZhanMing057 said:
With more complex problems, my experience is that parametric methods, even the very smart ones (elastic net, fancy variable selection MCMCs) tend to underperform neural nets when it comes to predictive performance. It depends on what you want - in many cases, it's perfectly fine to not know exactly what is driving the outcome, so long that the model is sufficiently robust to inputs.

Well my point was that none of this stuff works on small clinical trial sized data sets and classical methods work as well or better by not over fitting.

Re Kaggle, I only mentioned because they were the first hits when trying to find the study data and conveniently had Jupyter notebook code right there, which I have no trouble reading. Also appears to be one source of the 90% accuracy data.

cripes · Aug 15, 2022

Btw, for anyone curious this blog post, and moreso the comment section, are a great discussion of classical statistics vs. ML methods. https://statmodeling.stat.columbia.edu/ ... al-models/

And this particular comment gives an example where ML methods are often oversold compared with classical statistics methods
https://statmodeling.stat.columbia.edu/ ... nt-1043199

The tech community have basically stumbled into the field of predictive inference and are only slowly learning the lessons that statisticians learned decades ago, just on bigger datasets.

No code, wrapped: Our ML experiment concludes, but did the machine win?

Ars Centurion

Ars Praefectus

Wise, Aged Ars Veteran

Ars Centurion

Ars Tribunus Militum

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Smack-Fu Master, in training

Wise, Aged Ars Veteran

Ars Centurion

Wise, Aged Ars Veteran

Smack-Fu Master, in training

Ars Praefectus

Ars Legatus Legionis

Ars Praefectus

Ars Praetorian

Ars Tribunus Militum

Ars Tribunus Angusticlavius

Ars Tribunus Militum

Seniorius Lurkius

Ars Tribunus Militum

Wise, Aged Ars Veteran

Deleted member 221201

Guest

Ars Praefectus

Ars Tribunus Militum

Ars Centurion

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Centurion

Ars Tribunus Militum

Ars Praefectus

Ars Centurion

Deleted member 221201

Guest

Deleted member 221201

Guest

Ars Praefectus

Ars Praetorian

Ars Praefectus

Ars Centurion

Ars Centurion