In the finale of our experiment, we look at how the low/no-code tools performed.
Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."
As someone who has been in the BI/“big data” space for about 20 years, what this really looks like is some half-baked analyst thinking they can do better than the actual data scientists/ML engineers, producing *some* results in SageMaker, and the business (not knowing any better) being impressed and using the models.
Ultimately, this leads to disaster, because less experienced practitioners don’t have the background knowledge or experience to ensure they have clean, accurate data or to understand model accuracy. This leads to bad decision making (or good decision making on bad data). We won’t even get into retraining, model drift, biases, etc.
A good example of this is Zillow, where their model got out of control, leading them to over-invest in properties and ultimately to cuts and layoffs. Those models were most likely produced by actual (and even good) data scientists or ML engineers. There’s a ton of good that can come from these tools in the right hands, but even experienced practitioners make massive mistakes. I’ve seen time and again a well-meaning analyst/business person try some of these “black box” tools without the broader knowledge needed to be successful, and it turns out horribly.
Awesome series, I’d like to see more of this. That particular quote raised my alarm bells, though.
I'd love a follow-up/continuation where you compare this to arguments from Kahneman, etc., that (relatively) straightforward algorithms created by experts are generally more effective/efficient at answering a lot of the questions that companies often use ML for instead.
Not a data scientist, though I do my best at times.

Smith continued: "And the business analyst is like, 'If I knew that, I wouldn't need your help!... I've got this data, I've got a business problem. I've got questions for this data. I don't know what it's worth until we actually do the analysis.' So the business analyst is thinking about just accelerating to the point where they can actually argue for or justify their use of a scarce data scientist."
As someone who has been in the BI/“big data” space for about 20 years, what this really looks like is some half-baked analyst thinking they can do better than the actual data scientists/ML engineers, producing *some* results in SageMaker, and the business (not knowing any better) being impressed and using the models.
Ultimately, this leads to disaster, because less experienced practitioners don’t have the background knowledge or experience to ensure they have clean, accurate data or to understand model accuracy. This leads to bad decision making (or good decision making on bad data). We won’t even get into retraining, model drift, biases, etc.
A good example of this is Zillow, where their model got out of control, leading them to over-invest in properties and ultimately to cuts and layoffs. Those models were most likely produced by actual (and even good) data scientists or ML engineers. There’s a ton of good that can come from these tools in the right hands, but even experienced practitioners make massive mistakes. I’ve seen time and again a well-meaning analyst/business person try some of these “black box” tools without the broader knowledge needed to be successful, and it turns out horribly.
Awesome series, I’d like to see more of this. That particular quote raised my alarm bells, though.
As a non-data scientist, I took some optimism from Smith's quote, to the effect that a non-data scientist could do a proof of concept for some XYZ feature implementation with a particular data-analysis approach and know (with some accuracy) whether to continue (or not) with said approach, with actual data scientists and the effort of getting more data, to make the XYZ feature real. Then I think about this a little more, and "nope," it ain't gonna happen for any reliable proof of concept, for the reasons given in your second paragraph, now bolded. Sure, these tools are quite powerful, but from my perspective I literally have no clue how much half-assing I would be doing while trying to get some desired answer out of such a proof of concept: is my small dataset good or bad, have I modeled the problem properly, and how can I even tell?
Yes, that quote is very "rosy glasses," but of course these data analytics giants are going to tell you that; they want to sell their services. Excellent article, extremely fascinating!
There's a fair bit you can do to incorporate missing data into your model, but it can be pretty tricky and I've never tried imputing or accounting for missing data in SageMaker (honestly, never used SageMaker at all).
I will say, there's no such thing as a low-code or no-code solution for AI/ML. There's just code that you didn't write yourself. That makes it easy to crank out a quick model even with minimal training, but it also makes it easy to make an absolutely crap model without realizing it.
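For what it's worth, outside of SageMaker a minimal scikit-learn sketch of imputation might look like the following; the column names and values are invented for illustration, and median imputation is only one conservative default among many (IterativeImputer, KNNImputer, etc.):

# Hypothetical illustration: impute missing values before modeling.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [63, 37, np.nan, 56, 57],     # made-up data
    "max_hr": [150, 187, 172, np.nan, 163],
})

# Replace each missing value with that column's median.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)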
Easier data science tooling is rarely a bad thing but I worry when business analysts without the requisite background are let loose on data. I recall once our business development folks asked me to tease out some answers on a very small data set, ~100 rows.
Like the "scarce data scientists" above, my team said things like "what's the ROI, you need more data, etc." In response, the BD folks went and got their tools and answers. The problem was (1) Precision/Recall on 100 rows of data is, I would argue, almost meaningless. 1 discrepancy has an outsized impact. and (2) They didn't understand over fitting or test sets so the "90% accurate model" they trained was barely better than a guess when applied to unseen data. But I get their frustration. It's hard not to see DS teams as being elitist when they don't immediately try to provide answers. I'm still trying to figure out how to temper expectations in a way that leaves the door open and doesn't make anyone feel ignored.
Interesting experiment. Of concern to me is the "black box-ness" of the process. Fine when you can compare the given results with actual outcomes, but... to me it's a more elaborate version of GIGO in many ways. No harm in the "desktop lab," but too often it gets projected as a statistical certainty into real life without taking into account the interrelationships that evolve over time in complex datasets. Pretty cool, though!

I find that people referring to these tools as a black box don't understand the statistics and modeling involved. It is not a black box in the sense that we don't know how these things work (otherwise they couldn't be calculated); it is a black box due to the lack of the simple explanation that you get with basic linear regression.
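To be concrete about what that "simple explanation" looks like, a basic linear model hands you its coefficients directly; a hypothetical sketch with made-up feature names:

# The direct, per-feature explanation a linear model provides.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))     # imagine columns like [age, dosage]
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
for name, coef in zip(["age", "dosage"], model.coef_):
    print(f"a one-unit increase in {name} shifts the prediction by {coef:+.2f}")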
How do these results compare with other commercial software or performing the analysis in R?
I have tried several times to get into R for statistical analysis. First time was probably 20 years ago, then 12 years ago, then about 8 years ago, and finally just last year I tried again. I never could get much past getting it up and running and using a couple of dummy data sets. Every time I tried analyzing a fresh dataset of my own, it fell apart, and I ended up resorting to rather expensive software from SAS (JMP, specifically).

How do these results compare with other commercial software or performing the analysis in R?
Good question. For a dataset this size, I would absolutely start with R or similar. I accept that those require deeper knowledge/skills than included in this brief. Something like SPSS would be good, but it's been captured and squeezed by IBM.
I'd recommend jumping into RStudio (rebranding to Posit soon) if you want to get into the R ecosystem. They're doing a decent job expanding into Python, but Anaconda is still more Python-native if you want to go that direction. Anaconda's desktop installer has a nice launcher that can get you up and running with Jupyter notebooks, Spyder, etc. pretty quickly.
In general, there's zero reason to spend $1000 on AWS bills before you've started to run out of memory or patience on your laptop... and that is not going to happen with tiny datasets like this.
Open source, cross-platform, an ever-growing library of capabilities: it seems great on paper, but I've never been able to get the hang of the coding. And RStudio doesn't really make that any easier. It is more of a tool that makes certain things easier for people already intimately familiar with R than a tool that makes R easier to get into.
Given the data sparsity, the classifier used above is a good choice.
You are basically using a decision-tree ensemble that is boosted/optimized via gradient descent (i.e., gradient boosting).
I like Data Wrangler helping piece together bad data... actually, I don't like the sound of that: the ML p-hacking its way to a result for the user's pleasure, perhaps (a gross simplification, to be sure)? Icky icky icky!
from xgboost import XGBClassifier

model = XGBClassifier(scale_pos_weight=100)
The XGBoost documentation suggests a fast way to estimate this value from the training dataset: the total number of examples in the majority class divided by the total number of examples in the minority class.
scale_pos_weight = total_negative_examples / total_positive_examples
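As a sketch of that suggestion, the ratio can be computed from the training labels instead of hard-coded; y_train here is a toy 0/1 array, not anything from the article:

# Estimate scale_pos_weight from (hypothetical) training labels.
import numpy as np
from xgboost import XGBClassifier

y_train = np.array([0] * 95 + [1] * 5)    # toy imbalanced labels

ratio = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(scale_pos_weight=ratio)   # 19.0 for this toy array
print("scale_pos_weight =", ratio)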
As an economist with decades of stats/econometrics training and experience, I'm not sure what to make of all this. I can see a bunch of notebooks that examine this dataset on Kaggle. All look like copies with minor variations, but a good example: https://www.kaggle.com/code/namanmancha ... 0-accuracy.
I am struck by:
1. The fact that the model with the highest accuracy (90%, to be precise) is the logistic regression. So much for ML when the old-school stats modeling approach wins out! There's not even a model-selection loop in one of the notebooks I looked at that achieved this result; just all drivers thrown in and the model fit called once.
2. The accuracy result is highly dependent on what mother RNG or the dear researcher decides is the training vs. testing data (see the sketch after this list). Highly fishable results in the hands of motivated amateurs...
3. The art of statistical model building, with or without ML tools, is deciding what the candidate drivers should be and transforming them appropriately. That's where domain knowledge and theory come in. Otherwise you have idiots or computers throwing sunspots in as controls in your heart-attack prediction models. Granted, you can get away with more naivety with much bigger datasets using ML tools, but then we are talking big data, not 300-observation clinical trials.
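To make point 2 concrete, a sketch on synthetic stand-in data (not the clinical dataset) showing how much reported accuracy can move just by reshuffling the train/test split:

# Same model, same data; only the split's random seed changes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"split {seed}: accuracy = {acc:.2f}")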
The problem is likely with me. I don't like to write code. JMP offers a UI that builds the code as you select options. Sure, you can always access the code, and I do on occasion, but most of the time it is unnecessary for getting what you want out of the system quickly.
I'm not really sure what the exact issue is. I've taught courses in statistical programming at the grad level, and R programming specifically at the undergraduate level, and while the learning curve is steeper than, say, Python, I don't think I've ever run into anyone who can't get it to work at all.
Code is just like language: some are easy to learn, others harder, but with very few exceptions they're pretty much the same thing under the hood. It doesn't matter if it's R, or Python, or C++, or even something a bit less hand-holding such as Fortran; the core principles are not so different, and if you know one, you should be able to code functionally (if not efficiently) in any of the others with a bit of retooling.
Practically, there are millions of pieces of demo code for statistics problems in R floating around. If you want to solve a particular issue (and it isn't on the very cutting edge of applied statistics), chances are someone has done essentially this exact thing, and their code gets you 90% of the way there. On the proficiency side, I'd start with HackerRank and work through their entire repository of R exercises - it's pretty small relative to what they have for Python, IIRC.
With more complex problems, my experience is that parametric methods, even the very smart ones (elastic net, fancy variable-selection MCMCs), tend to underperform neural nets when it comes to predictive performance. It depends on what you want; in many cases, it's perfectly fine not to know exactly what is driving the outcome, so long as the model is sufficiently robust to inputs.
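A rough harness for that comparison might look like the following; this is a sketch on synthetic data, not a verdict, since which family wins depends heavily on the dataset:

# Elastic-net logistic regression vs. a small neural net, same CV folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)

enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, max_iter=5000)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)

for name, model in [("elastic net", enet), ("neural net", mlp)]:
    print(name, "mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean())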