My name is Naveed Ahmad. I am a senior director for
data engineering and machine learning at Hearst Newspapers. My responsibilities are to
build the data pipeline and data warehouse, and to use that
data to build data products and predictive data science. You might not have
heard about Hearst, but you might have heard of
names like “San Francisco Chronicle” or “SFGate”
or “Houston Chronicle,” or magazine names like
“Esquire” or “Cosmopolitan” or “Men’s Health.” So the company behind
these is Hearst. It’s one of the biggest
media companies in the USA. I’m here to talk
about the work we’ve done with Hearst Newspapers. Hearst Newspapers employs
about 4,000 people across the country. It focuses on local news, and
it has about 40-plus web sites, and it’s continuously growing. And you can find more
information about Hearst at this link. And these are some
of the name brands that I just talked about. You might be more familiar
with the “San Francisco Chronicle” newspaper in this
California area. So this is sort of like our data
strategy at Hearst Newspapers. It might vaguely look like
Maslow’s hierarchy of needs. So data is the essential thing. Without data, you
can’t do data science or business intelligence. So our strategy was to first
build a data warehouse. And I’ll take some
minutes to talk about what we did with BigQuery. On top of it sits
business intelligence, which helps inform people
in marketing, editorial, and product to make
decisions on historical data. And then using the same
data that you use to look at historical data,
you can do predictive analysis and then also build
products using that data. Let’s talk a little
about BigQuery. So why centralized data? If you have all the
data in one place, you can connect the dots. So if you have
datasets of newsletters and Google Analytics and
your content database, you can easily make connections. If they were sitting
in different databases, that becomes difficult. And
it’s efficient for users to go to one place
to get that data. There’s no duplication. You’re not getting
reports from two different
places whose numbers don’t match up. And it’s efficient
for data warehousing. You use the same
project management, product, engineering, and QA
and consolidate your efforts. And most importantly,
and relevant to this talk, it’s the
foundation for ML. And why did we use BigQuery? Before this, people used
to pull reports from different systems
and email them out. There were different databases,
different data warehouses. So a couple of years
ago, we decided that we needed to consolidate
everything into one database, and we chose BigQuery. And these are some of the
reasons why BigQuery was [INAUDIBLE]. It’s based on Dremel technology. And this stuff is
on their website– terabytes in seconds
and petabytes in minutes to be able to query that data. It’s fully managed. It uses SQL, and lots of people
know SQL, so it’s easy to get people onto BigQuery. And now with yesterday’s
release of BigQuery ML, you can even do machine
learning on top of BigQuery. So this is the technology
stack for our BI platform. We’re doing ETL using Airflow,
with BigQuery as our data warehouse, and Looker. And I think there have
been other talks talking about how people have used
Looker with BigQuery to build user journeys and stuff. So we’re also doing that. That could be a separate
talk, but I’m just mentioning that we are also doing this. And these are some of the data
sources we have in BigQuery– Google Analytics, DoubleClick,
subscriptions, newsletters. And the number of datasets
in our data warehouse is continuously growing. So getting to machine learning. I see machine learning in
two big buckets. One is prepackaged machine learning,
where Google has done the data science work for
you and exposed APIs that you can call. And these are natural
language processing; AutoML, which was also released
at this conference; speech; video intelligence; and vision. And the other half is
where you’re the data scientist, and you’re doing machine
learning in house using TensorFlow and Cloud ML,
BigQuery ML, or Dataproc and Spark. So there was a case
study published back in November of how
Hearst Newspapers is using natural
language processing and the different use cases. So the two pieces that we’ve
used from natural language processing are the ability to
extract entities and categories from our text. There’s a part-of-speech
API that we haven’t used, but with these two alone, we have
about six use cases. So I took an article from
“San Francisco Chronicle”– it’s about this movie,
“Black Panther”– and ran it through the
web interface from Google. And you can see that
it has correctly detected “Black Panther”
as a work of art. It has a Wikipedia link. It also has the Knowledge Graph ID
and the salience score. And naturally, the
entities are broken down into different types. There are works of art, places,
organizations, locations. And then, that is
further broken down into proper and common nouns. And then, it also
detected the right category for it and gave
a confidence score. This is all good, but
how do we use this thing? It looks very simple. So I’m going to go through
some of these use cases for Google NLP,
from a simple use of displaying these tags in our CMS
to building a recommendation
system, and also matching
ads to the right content. So I’ll go one at a time. And this is sort of like
the high-level architecture of how everything connects. So when content is ingested
into our CMS, we make these NLP calls, get
the metadata back, and store it in our database. And then when the
article is rendered, these tags are part of
the meta tags of the HTML. And from that, we have a
third-party customer data platform, which extracts
these NLP tags into its system. And at the same time, there’s
a JavaScript call to DoubleClick
for Publishers to render ads. And also, this data is
pushed into BigQuery. Our CMS content, along
with the NLP tags, is pushed into
BigQuery, on which we can do BI reporting
and also build a
recommendation system. So first use case–
segmentation. Segmentation is being able
to identify a group of users,
say which users are sports readers versus
food and wine readers. Our CDP
has a built-in mechanism to use these
tags to create segments. And the way the segments
are used is that you can push marketing messaging. Let’s say you know
there is a segment of people
who like to read
tennis news. You can push a newsletter to them
using this tool. The other use case is DFP ad targeting. So I had mentioned
the call to DFP. In our key-value
pairs, we can pass in a key for the NLP category
and the actual category for that page as the value. So what this does is that
DFP, over a month, can collect all this data. And then one can run
reporting on it, telling how ads were
displayed to users across our content. And then if a
customer says, hey, I want to have a campaign
to target Olympics web pages; put my ad on Olympics
web pages, they can create a campaign
with certain criteria. And now this vendor’s
ads or the partner’s ads can be displayed on
Olympics content. So this is just a screenshot of
how the rules are set in DFP. Since this is a BI report,
it uses Google Analytics, WCM content, and NLP data to come
up with some useful numbers. Basically, this report
shows, for each category,
the number
of users who visited and the number of
articles in that category. And a simple ratio
of users to articles is useful: a higher ratio
would mean that editorial should
focus on this category and write more content about it. And there are numerous
different things you can do if you have
Google Analytics, NLP, and content data. For example, you can
create trend graphs about a certain topic or a certain
personality: are they getting more famous
or less famous over the years, content-wise? And this is another
analysis. We get content
from third parties. Which third parties give
what type of content? Does this source
give us more sports content than the other one? And again, you can
build different kinds of reports using this. So I’ll get into
recommendation systems. So why is a recommendation
system important? You might have read
that Netflix values its recommendation system
at $1 billion. So in our context, if we could
reuse older content which is just sitting
still, with nobody using it or looking at it,
this could increase engagement. People would stay on
the website longer, which means more ad revenue. And eventually, people
might even subscribe. Right now, people tend
to go to the home page, and whatever links
are there, that’s the content they can see. But if we can use our
recommendation system to surface older content,
that would be very useful. So these are three
different types of recommendation systems that
are supported by the Google Cloud [INAUDIBLE]. So we did a Content to
Content recommendation system, a personalized recommendation
system, and a Video to Content
recommendation system. So since we had the NLP data
with the content data sitting in BigQuery, it was
very low-hanging fruit to build a
recommendation system. And the core concept of
this is that any two articles which have a high
overlap in NLP entities are related to each other. And this is
essentially a big SQL query with certain rules and
conditions that we created. And it runs on a periodic
basis, like every 20 minutes. As new content comes
in, we run this SQL. Let me show the
diagram for this. We run that SQL in BigQuery and
store the results in Cloud SQL. And it’s fronted by a
Kubernetes web service layer to serve these recommendations. And from the front end,
there’s a JavaScript call to render these recommendations. That API call
passes the content ID. We find all the
related content that’s already precomputed in
Cloud SQL, a Postgres database, and render it. So continuing on this
concept, this was actually a Hack Day idea. We had a Hack Day, and
I did this prototype to see if we could extract
anything useful from video. So we took our videos,
extracted the sound using FFmpeg,
an open source tool, and used that sound
to make a call to the Speech API,
which gives you back the text for that video. And then again, you can
run NLP on that text, as well as combine that with
the metadata for the video, and you’ll get NLP tags back. Again, just like
our text content, when we ingest
the video content, we’re storing the transcript as well as the
metadata tags along with it. Let me back up. And since we have these
NLP tags, this powers another
recommendation system to recommend videos for text content. Since our text
already has NLP tags, and we’ve extracted these
NLP tags from video, now we can build an in-house
recommendation system. So this saved our
company some money. Instead of buying a
video-to-text recommender, we used NLP
and search technology to build our own
recommendation system. So this project we
worked on in collaboration with our TV department. This next project is using TensorFlow. Let me talk a little bit about
what TensorFlow is. You might have already heard of it. Neural networks are back. And the current deep
learning revolution is because of deep
neural networks. They’re bigger and hierarchical. And many of the Google
products, especially in AI, are based on TensorFlow. And Cloud ML is a managed
service for running TensorFlow. So we built an in-house
personalized recommendation system. That could be one whole
talk, covering the algorithm, how it works,
and the full architecture of it. But essentially, in summary,
it’s using singular value
decomposition, which is basically
a collaborative filtering algorithm. And that algorithm
is something that you can fit into TensorFlow,
because you can solve it using gradient descent. And the TensorFlow library
helps you solve algorithms that can be framed as
a gradient descent problem. And it’s basically looking
at a user’s history and also the history of people who
have similar taste. And there’s an open source
implementation Google released that uses Google
Analytics and CMS content to build recommendations. And I’d encourage
you to take a look, and it’s a bit similar
to what we did. And the high-level
architecture is this: we read our content
and Google Analytics data from BigQuery,
and you see there’s another advantage of all the
data sitting in BigQuery. We do some preprocessing and
then train this TensorFlow model. And then the trained
model gets stored; that’s the output
of TensorFlow, and it’s fronted by
TensorFlow Serving. TensorFlow Serving
is an RPC layer that helps you deliver the
recommendations from this model. And then, it’s fronted by a
RESTful web service layer. Some of the other use
cases for TensorFlow are propensity
modeling, forecasting, content virality prediction,
and building custom
classification of content. So I actually also
had churn modeling. And that’s a use case that
I built in BigQuery ML, and I’ll be talking about it. So now, Google
offers you a variety of ways of doing
machine learning. So
you have to figure out when it makes sense
to use TensorFlow versus BigQuery ML versus
AutoML and all these other APIs. So I gave a talk about
BigQuery ML yesterday. So why BigQuery ML
for Hearst Newspapers? As I already told
you, all our data is already sitting in BigQuery. So it made
a lot of sense to just do this using BQML. It enables anyone familiar
with SQL to get on board and start doing
machine learning. And the alternative would be
to first learn R or Python, and learn a framework like
scikit-learn or TensorFlow. But over here using
BQML, you don’t need to learn any of those. You don’t need to ETL data out,
do the machine learning outside, and then ETL stuff back in. Everything gets done in place. And goodies like normalization
and one-hot encoding, it just does for you. And then you have
other SQL syntax to get an evaluation
of your machine learning model
right in BigQuery. So the use case is
churn prediction. It’s a relevant
use case for media, because, as you might have
heard, people have choices, and it’s hard to keep
subscribers. So if we could figure out
a way to have some insight
into which
subscribers are going to cancel, we could
[? say that. ?] So I put a proverb, “Money
saved is money made,” or in our case, a subscriber
saved is money made. And I thought of
two more yesterday, so I’ll tell you another one. So “Prevention is better
than cure,” and “One in hand is better than two in the bush.” So this just proves that
I passed my English test. [LAUGHTER] This gives you insights
into future
cancellations by subscribers. It’s two-class, so it’s a
binary logistic regression: people who canceled
and people who didn’t. And we took one year of data:
subscription, newsletter, demographic, web browsing. All of that data, again,
is sitting in BigQuery. So the architecture
is really simple. It’s really two nodes,
BigQuery and Looker. But I put in a few
steps of what happens. So we’re ETLing all our
datasets using Airflow, then doing a little bit of preprocessing
of our Google Analytics data, like making some summary tables,
especially of how subscribers end up browsing. And then the third step is the
real machine learning step. You just do CREATE
MODEL with the model name, and then you give
it a table with all of the columns
with the features. And one column is
the label column. And the label in our case is,
did this person cancel or not? And then when you
run this query, it takes about four
or five minutes. You go and grab a cup of tea or
coffee, and when you’re back, query number four
is where you run SELECT * FROM ML.PREDICT on a table which
has existing current subscribers. And for those subscribers,
it will give you a probability score of how
likely they are to cancel. And then, we built a dashboard with a
bunch of Looks in it, which showed the
output of the results and also our
machine learning metrics, such as ROC, or
precision and recall. So this is a snapshot
of a dashboard. So the first Look
is basically showing all the subscribers
sorted by how likely they are to cancel their subscription. And somebody in the
subscription retention team can take a look
at it and at least have an idea of what
this is predicting, like do some forecasting. This number on the right
side is the AUC score. It’s called the Area
Under the Curve score. And it’s essentially the
area under this other graph. And it’s a data sciencey
thing, but what it means is that if this AUC
score was 50%, then our machine learning model
hadn’t learned anything, meaning that this
line over here, which plots true positive rate
against false positive rate, would just be a diagonal line. But this shows, one, that our data
has predictive power and, two, that BQML is learning something. And on the right, we’re
plotting the same graph, which is the true positive rate
and the false positive rate. So another problem
you have to figure out is what probability
threshold above which you say that this person
is a churner or not. Do you say like 50%
above or 30% and above? So this graph over here on
the bottom on the right side shows a plot of true positive
and false positive rate. And we want to have
a threshold which gives a decent true
positive and also have low false positive rate. So we chose about 0.3,
which gives us about 18% false positive and 80%
true positive rate, which I feel is fine. Because even if
you send out emails to people who you think
are going to churn and if they don’t end up
churning, it doesn’t hurt. You just sent out extra emails. So you can also get the
weights of your learned model to get an idea of
what it’s learning. So this is a plot of the
weights of [? this thing. ?] So the features
that are on the left are positively
correlated with churn,
while the features on
the right are negatively correlated with churn. And it gives us a sense of which
features to look at. There might be some clue
in our historical data from which we might be able to
make some business decisions or change the way
we’re doing business and reduce our churn. And it has two
types of features. One is numerical features. And the others are
categorical features. So you have to do an UNNEST
in your BigQuery SQL to be able to get
the weights of those categorical features. And you can look
at the BQML tutorials for examples of other use
cases for predictive modeling. So AutoML text was released
at this conference. So while this was
in alpha, I worked with the product
manager and built a model for ourselves. So we have two main
use cases. The first goes back to the discussion about DFP
and being able to match ads. We want to enhance
some of our categories. For example, we take the
sensitive subject category and break it down into
more granular ones. Or if there is
a new sport that’s not covered by the
default categories, we want to train our own. So that’s one use case. The other use case
that I worked on is to be able to detect
evergreen content. Evergreen content is content
that has a longer lifespan. For example, a review of
a restaurant or a museum, we’ll call evergreen content. But a story about some accident
that happened at a mall only has a life of a few days
or a few weeks. So we want to be able
to tell whether content
is evergreen or not. So initially, I tried this open
source dataset from StumbleUpon on Kaggle. Very early on, I tried
to write my own TensorFlow code using an LSTM. I just quickly prototyped
it and felt that the data had predictive power. But I had a hunch
that Google was working on something like this. So then, we created a
dataset of our own to label evergreen content. It was sort of
internally crowdsourced through our CMS,
where editors basically tagged some of this content
as evergreen. And using AutoML
is really easy. You create a CSV file,
with the first column as the content and
the second column as the label, Evergreen
or Not Evergreen, and then upload it to
AutoML and have it learn from this data. So you see we had about 3,000 articles
for evergreen and another 3,000 for non-evergreen. And you see it has
done a pretty good job in learning this thing. I was surprised that
it could do so well. And you see the
precision and recall scores. And it also shows you
the confusion matrix of what it predicted. 91% of the time,
it predicted non-evergreen content as non-evergreen. And only 9% of the
time it made a mistake for non-evergreen content. And for evergreen,
it did a perfect job. And this is only on
the test set. So I took this, and
I played around. In the AutoML console, you
can put in any random text and see how your
prediction model does. So I tried it on
our Hearst articles. I tried it on CNN articles. And it seemed to work very well. And then I even
went to Wikipedia. So I picked an
article about New York, which you would
think is evergreen. And with a 70% score, it predicted
this is evergreen. And then again from
Wikipedia, I picked an article about the Mexican elections, and it
predicted it as non-evergreen. So it’s basically the combined
knowledge of [? editorials ?] and what they think is
evergreen and non-evergreen that AutoML has learned and
is able to apply to text that it hasn’t seen before. So ideally, we want to
use BigQuery for most of our analytics processing. Anything that can be
formulated as SQL, I tend to want to do in BigQuery,
because it’s distributed. And it’s more cost effective
to do things in BigQuery. But there are
a very few use cases where
you have to do things outside
of BigQuery. So one of those use cases is that
we have duplicate content,
content where either the
body looks very similar or the headline is tweaked,
and essentially it’s the same article. And we don’t want to
recommend articles that are similar to each other
in a list of recommendations. And what we wanted
to do initially was, in BigQuery, create
a word vector for each article, and then do a cross
join of the table with itself and figure out the distance
between those articles. And doing this in
BigQuery was not possible because that query
wouldn’t return. So what we did
in this use case is we used Spark. We
basically spun up a cluster of 10 machines
for a couple of hours. And we wrote a PySpark
job to compute these vectors and basically compute the cosine
distance between articles. And articles that were very close
to each other, we eliminated as duplicates. This helps our
recommendation system remove any duplicate
content, and it is a use case of Spark and Dataproc on our data. So I’m getting close; this is actually my last slide. So what’s the future
of Hearst Newspapers? So we want to build
more predictive models using BigQuery ML, because it makes a lot of sense
since we have all the data there, and it’s very easy
to use BigQuery ML. Things like propensity
modeling, which is like the reverse of churn modeling:
figuring out who are the visitors
on our web pages who are likely to subscribe. And this could help our
marketing systems focus on those. And actually in
yesterday’s BQML demo, the demo was about figuring
out those people that are likely future customers. We want to productionalize the
taxonomy I was talking about. We already have
requirements for this. In AutoML, we want to
create our own datasets and train on them to enhance
the NLP taxonomy. There’s been lots of research
using deep neural networks for recommendation systems. There’s actually
one article which lists all the different
research papers using deep neural networks
for recommendation systems. And we’ve been prototyping
a few different approaches. There are hybrid approaches. There are all sorts of different
flavors of doing recommendation systems using deep learning. So that research is continuing,
and hopefully, our version two will be an even more
advanced recommendation system. There’s still more juice to
be taken out of Google NLP. We haven’t fully
utilized it. Some of
the use cases are being able to build personalized
newsletters or topic pages using Google NLP. So we are thinking
about these other use cases and how to
productionalize them. And also, we have a
large corpus of images, but we haven’t yet reached the point
of building a product around it. So that’s one of the things
that I have for the future. [MUSIC PLAYING]
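
As a rough illustration of the content-to-content idea described in the talk (two articles are related when their NLP entities overlap heavily), here is a minimal Python sketch. The talk implements this as a SQL query in BigQuery, so this is only an analogy: the function name, the sample entities, and the `min_overlap` and `top_k` parameters are assumptions made for the example, not details from the talk.

```python
from collections import defaultdict
from itertools import combinations


def related_articles(article_entities, min_overlap=2, top_k=3):
    """Return, for each article ID, up to top_k related article IDs.

    article_entities maps an article ID to the set of NLP entity names
    extracted from that article. Two articles count as related when they
    share at least min_overlap entities (a hypothetical threshold).
    """
    scores = defaultdict(list)
    for (id_a, ents_a), (id_b, ents_b) in combinations(article_entities.items(), 2):
        overlap = len(ents_a & ents_b)
        if overlap >= min_overlap:
            scores[id_a].append((overlap, id_b))
            scores[id_b].append((overlap, id_a))
    # Highest overlap first; ties are broken by article ID for determinism.
    return {
        article_id: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_k]]
        for article_id, pairs in scores.items()
    }


# Made-up articles and entities, for illustration only.
articles = {
    "a1": {"Black Panther", "Marvel", "Ryan Coogler", "Oakland"},
    "a2": {"Black Panther", "Marvel", "box office"},
    "a3": {"Oakland", "Warriors", "NBA"},
    "a4": {"Marvel", "Black Panther", "Ryan Coogler"},
}
recommendations = related_articles(articles)
print(recommendations["a1"])
```

Precomputing a lookup like `recommendations` on a schedule and serving it by content ID mirrors the flow described in the talk, where results are stored ahead of time and the web service only does a lookup.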

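The threshold discussion in the churn section of the talk (choosing a cutoff such as 0.3 that trades a higher true positive rate for some extra false positives) can be sketched in a few lines of Python. This is a minimal sketch with made-up scores and labels; the function name and the data are assumptions, and the numbers are not the Hearst results.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true_positive_rate, false_positive_rate) at a threshold.

    scores are predicted churn probabilities; labels are 1 for subscribers
    who actually canceled and 0 for those who stayed.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Made-up predictions and outcomes, for illustration only.
scores = [0.9, 0.7, 0.45, 0.35, 0.2, 0.6, 0.4, 0.25, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

for threshold in (0.3, 0.5):
    tpr, fpr = rates_at_threshold(scores, labels, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

With this toy data, lowering the threshold from 0.5 to 0.3 catches more of the actual churners at the cost of flagging more subscribers who stay, which is the trade-off the talk accepts because an extra retention email is cheap.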

Dennis Veasley

2 thoughts on “How Publishers Can Take Advantage of Machine Learning (Cloud Next ’18)”
