I have a test file of article headlines (test$title) and their total social shares (test$total_shares). I can find most used trigrams using say:
library(tau)
trigrams = textcnt(test$title, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
head(trigrams, 20)
However, what I would like to be able to do is to calculate the top trigrams by average shares not by number of occurrences.
I can find the average shares of any specific trigram using grep eg
HowTo <- filter(test, grepl('how to create', ignore.case = TRUE, title))
Then use:
summary(HowTo)
to see average shares for headlines with that trigram.
But this is a time consuming process. What I would like to do is calculate the top trigrams from the dataset by average shares. Thanks for any help.
Here is a sample dataset. https://d380wq8lfryn3c.cloudfront.net/wp-content/uploads/2017/06/16175029/test4.csv
I tend to remove non-ascii characters from titles using
test$title <- sapply(test$title,function(row) iconv(row, from = "UTF-8", to = "ASCII", sub=""))
Right, that was a bit tricky. I broke it down into manageable chunks and then stringed them up, which means I might have missed some short-cuts, but at least it seems to work.
Oh, forgot to say. If using textcnt()
as you did, trigraphs will be made that consists of the end of one headline and the beginning of the next. I assumed this was undesirable and found a way to circumvent it.
library(tau)
library(magrittr)
test0 <- read.csv(paste0("https://d380wq8lfryn3c.cloudfront.net/",
"wp-content/uploads/2017/06/16175029/test4.csv"),
header=TRUE, stringsAsFactors=FALSE)
test0[7467,] #problematic line
test <- test0
# test <- head(test0, 20)
test$title <- iconv(test$title, from="UTF-8", to="ASCII", sub=" ")
test$title <- test$title %>%
tolower %>%
gsub("[,/]", " ", .) %>% #replace , and / with space
gsub("[^a-z ]", "", .) %>% #keep only letters and spaces
gsub(" +", " ", .) %>% #shrink multiple spaces to one
gsub("^ ", "", .) %>% #remove leading spaces
gsub(" $", "", .) #remove trailing spaces
test[7467,] #problematic line resolved
trigrams <- sapply(test$title,
function(s) names(textcnt(s, n=3, method="string")))
names(trigrams) <- test$total_shares
trigrams <- do.call(c, trigrams)
trigrams.df <- data.frame(trigrams, shares=as.numeric(names(trigrams)))
# aggregate shares by trigram. The number of shares of identical trigrams
# are summarized using some function (sum, mean, median etc.)
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df, sum)
# more than one statistic can be calculated
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df,
FUN=function(x) c(mean=mean(x), sum=sum(x), nhead=length(x)))
trigrams_share <- do.call(data.frame, trigrams_share)
trigrams_share[[1]] <- as.character(trigrams_share[[1]])
# top five trigrams by average number of shares,
# of those that was found in three or more hedlines
trigrams_share <- trigrams_share[order(
trigrams_share[2], decreasing=TRUE), ]
head(trigrams_share[trigrams_share[["shares.nhead"]] >= 3, ], 5)
# trigrams shares.mean shares.sum shares.nhead
# 37588 the secret to 42852.75 171411 4
# 43607 will be a 24779.00 123895 5
# 44945 your career elearning 23012.00 92048 4
# 31454 raises million to 21378.67 64136 3
# 6419 classroom elearning industry 18812.38 150499 8
In case the connection should break
# dput(head(test0, 20)):
test <- structure(list(
title = c("Top 3 Myths About BYOD In The Classroom - eLearning Industry",
"The Emotional Weight of Being Graded, for Better or Worse",
"Online learning startup Coursera raises $64M at an $800M valuation",
"LinkedIn doubles down on education with LinkedIn Learning, updates desktop site",
"Create Your eLearning Resume - eLearning Industry",
"The Disruption of Digital Learning: Ten Things We Have Learned",
"'Top universities to offer full degrees online in five years' - BBC News",
"Schools will teach 'soft skills' from 2017, but assessing them presents a challenge",
"Top 5 Lead-Generating Ideas for Your Content Marketing",
"'Top universities to offer full degrees online in five years' - BBC News",
"The long-distance learners of Aleppo - BBC News",
"eLearning Solutions for Business",
"6 Top eLearning Course Reviewer Tools And Selection Criteria - eLearning Industry",
"eLearning Elevated",
"When Teachers and Technology Let Students Be Masters of Their Own Learning",
"Aviation Technical English online elearning course",
"How the Pioneers of the MOOC Got It Wrong",
"Study challenges cost and price myths of online education",
"10 Easy Ways to Integrate Technology in Your Classroom",
"7 e-learning trends for educational institutions in 2017"
), total_shares = c(13646L, 12120L, 8328L, 5945L, 5853L, 5108L,
4944L, 3570L, 3104L, 2841L, 2463L, 2227L, 2218L, 2210L, 2200L,
2117L, 2039L, 1876L, 1861L, 1779L)), .Names = c("title", "total_shares"
), row.names = c(NA, 20L), class = "data.frame")