In the first part of a new series, we look at matching the problem to the tool.
Ars has given me data on over 5,500 headline tests over the past four years—11,000 headlines, each with their rate of click-throughs.
Either way, it should at least be entertaining.
You mean, can Ars get an AI to make click-baitier headlines?
Headline writing is not a problem. Headline choices are.
Here is a task that some Ars writers are exceptionally good at: writing a solid headline. (Beth Mole, please report to collect your award.)
Are you sure that the Ars audience clicks quicker on the "clickbaity" headlines than the "good" ones in A/B testing?
Garbage in, garbage out, as they say. If the only input is the number of click-throughs to an article, you are going to end up with clickbait headlines. I hope you have more data than that; curious to see what you come up with.
This isn't a matter of clickthrough percentage for a story. This is the difference in how many people click through on headline A vs. headline B for the same subtitle and image.

I wonder if the images associated with headlines have anything to do with the success rate of clicking the headline?
As I recall from previous discussions of this, the A/B process is only carried out for a short while - an hour, perhaps? Then the winning headline is used. The idea being, you're trying to get the most clicks; it doesn't make sense to continue running the test once you know which of the two is more likely to convert to a click-through.

I performed a similar experiment years ago, only I scored the title appeal based on the number of social media shares. You have A/B data, which is different in that you can discriminate between different wordings. But I had very similar titles from different media sources, with wildly different engagement.
The main problem is that most of the traffic to any site comes from social media, and readers there read them based on their own criteria, which is partly based on temporal context, in other words, what is happening at the time. A story about some event may be very attractive regardless of the title, but after the media saturation, readers just won't click it no matter how well written the title is.
I then concluded the experiment with a mental note to repeat it some day, but to figure out how to pass the temporal context to the classifier as well.
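The "pass the temporal context to the classifier" idea can be sketched as one extra feature column appended to whatever text features you already have. Everything here is hypothetical and purely illustrative: the embedding values, the "hours since the topic peaked on social media" measure, and the scaling are all invented to show the shape of the approach, not real data.

```python
import numpy as np

# Hypothetical text features for three headlines (e.g. tiny embeddings).
text_features = np.array([
    [0.2, 0.7],
    [0.9, 0.1],
    [0.4, 0.4],
])

# Hypothetical temporal context: hours since the topic saturated social
# media, scaled to [0, 1] so it is comparable to the other features.
hours_since_peak = np.array([1.0, 12.0, 48.0])
temporal = (hours_since_peak / hours_since_peak.max()).reshape(-1, 1)

# Append the temporal column so a classifier can learn that a stale
# topic draws fewer clicks no matter how well-written the title is.
X = np.hstack([text_features, temporal])
print(X.shape)
```

Any classifier that takes a feature matrix could then train on `X` directly; the point is only that time-dependence becomes an input rather than noise.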
A couple of minor suggestions:
1. You could probably get away with using Google Colab, depending on your dataset size.
2. You could run the headlines through word2vec, strip stopwords (or not), and then use a simple random forest classifier for a starter model. I'm assuming your target is the number of clicks, which you can threshold to 0/1 for binary classification or bin into ranges.
3. If you really need lexical analysis, then a bidirectional LSTM will do the job; depending on dataset size, use AWS or just let it train for a few hours on Colab. Again, this depends on how many layers you are stacking up, and I suggest using Keras as a wrapper for TensorFlow.
4. If you want a pre-trained model, then look at BERT.
5. Up-sample/down-sample or adjust class weights as needed, and run the sklearn classification report at the end.
Have fun!
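Suggestions 2 and 5 can be sketched end to end. To keep the example self-contained, I've swapped word2vec for scikit-learn's CountVectorizer (a plain bag-of-words with stopword stripping) - so this is a stand-in for the suggested pipeline, not the pipeline itself. The headlines, click counts, and the 100-click threshold are all invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

# Toy stand-ins for the real headline/click data.
headlines = [
    "Scientists invent faster-than-light travel",
    "You won't believe this one weird trick",
    "A sober look at quarterly chip shipments",
    "This tiny gadget destroys everything",
    "Incremental kernel update released today",
    "Shocking truth about your router revealed",
]
clicks = [250, 400, 40, 320, 30, 380]

# Suggestion 2: threshold raw clicks into a 0/1 target.
y = [1 if c >= 100 else 0 for c in clicks]

# Bag-of-words features with English stopwords stripped.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(headlines)

# Starter random forest; class_weight covers suggestion 5's imbalance point.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)

# Suggestion 5: sklearn's classification report (here run on the training
# set, which is only meaningful as a smoke test with data this small).
print(classification_report(y, clf.predict(X), zero_division=0))
```

With real data you would hold out a test split and, per suggestion 2, could substitute averaged word2vec vectors for the bag-of-words matrix without changing anything downstream.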
Clickbaity doesn't automatically mean it gets more clicks in any given audience. It's a style, one that I think most here at Ars despise.
This IS Ars...
Yeah, I'm several weeks into this game right now, and I wish we had talked sometime in May!
I disagree that more clicks inherently means something is clickbait. For something to be clickbait, it has to sensationalize, mislead, or overpromise relative to the actual content of the article.
More clicks = clickbaity, regardless of audience.
It remains to be seen whether the typical Ars audience prefers a different form of clickbait headline than what is typically considered clickbait.
I can't wait for this upcoming Ars headline:
"SpaceX says: 'Just one low price to put users of new iOS Trump video game into space for a million years, without Facebook, Twitter, or windows'"
The headline "Scientists Invent Faster-Than-Light Travel" is going to get clicks. If it's actually reporting on such work, then it's not clickbait. If it's followed by "...in a new Netflix series" when you click on the article, then it's clickbait.
import numpy as np
import pandas as pd

X = np.array(df[features].tolist())
y = your_trained_model.predict(X)  # super fast vectorized op
df['prediction'] = pd.Series(y[:, 0])
The trouble is the subheading and the picture are the same regardless of the A/B heading. So are you suggesting including that in the training set as part of the data for each title? That could only help in choosing a winner for contextually-based models; otherwise they add data without any differentiation.

I see a couple of things here. On the data, one commenter mentioned the picture associated with the article. I'd add the short text under the title and the author's name. I actually look at those as well to decide what I'll read. And that takes me into the data analysis part...
Since you have the full count of clicks, one analysis would be to simply plot the data to see how far apart the winner and loser are from each other and use that as a new data point. Titles that are far apart are more interesting and should take greater weight than those close to each other.
I'm going to watch this series with interest because I'm not sure how well this is going to work out. I suspect that phrase structure is probably going to be important. How often do Beth's punny titles generate a click? An algorithm looking solely at words will miss the crafting of word placement and relationship.
I do have to keep reminding myself that this project is a title scoring system and not a title generator.
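The winner-loser gap idea above can be turned into a per-test weight with a few lines of pandas. The click counts here are invented, and the normalized margin is one possible choice of weighting, not the only one:

```python
import pandas as pd

# Invented A/B results: total clicks for each of the two headline variants.
tests = pd.DataFrame({
    "clicks_a": [520, 310, 498],
    "clicks_b": [480, 150, 502],
})

# Normalized margin: near 0 for a coin-flip test, approaching 1 for a blowout.
tests["margin"] = (
    (tests["clicks_a"] - tests["clicks_b"]).abs()
    / (tests["clicks_a"] + tests["clicks_b"])
)

# Decisive tests get more say during training, e.g. passed as the
# sample_weight argument that most sklearn estimators accept in fit().
tests["weight"] = tests["margin"]
print(tests)
```

Plotting the margin column first, as suggested, would also show whether most tests are near-ties, in which case their labels carry little signal.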