Python offers some amazing tools for text-based searching, sorting and analysis.
In this blog I wanted to look at grabbing a large group of news headlines and then doing some Natural Language Processing (NLP) in Python to try to find out “Who’s in the News”.
For this example I’ll be using Reddit news headlines, but any news feed such as Twitter, CNN or BBC could be used.
Reddit is a social media source that is free to use. To pull information from Reddit you will need to create an account and get an API client ID and secret. To do this go to https://www.reddit.com/prefs/apps/ and create an app (or select edit on an existing one) in the developed applications area.
The Reddit Python library is called PRAW, and it is installed with:
pip install praw
An example that connects to the Reddit API and lists the 4 newest news headlines would be:
# red1.py - Python Reddit Example
# Get 4 Latest News Headlines
import praw
# Update with your client info
reddit = praw.Reddit(client_id='xQsMxxxxxxxx',
                     client_secret='X8r62xxxxxxxxxxxx',
                     user_agent='myreddit', username='yourname', password='xxxx')

i = 0
for submission in reddit.subreddit('news').new(limit=4):
    i += 1
    print(i, submission.title)
Running this code will show something like:
> python red1.py
1 De Blasio Unveils Health Care Plan for Undocumented and Low-Income New Yorkers
2 Kentucky teacher seen dragging student with autism down hall pleads not guilty
3 Homeless man allegedly involved in GoFundMe scam arrested in Philadelphia
4 Government shutdown stops FDA food safety inspections
The output from the program can be checked against Reddit’s web page.
When you are looking at Reddit it’s important to note that there are a number of different subreddits that can be queried. For example /r/inthenews is different from /r/news.
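As a minimal sketch (reusing the reddit connection created in the example above), only the subreddit name needs to change to compare them:

# Compare the newest headlines from two different subreddits
for name in ['news', 'inthenews']:
    print("\nNewest in /r/" + name)
    for submission in reddit.subreddit(name).new(limit=4):
        print(" -", submission.title)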
Natural Language Processing
There are some amazing tools to allow you to manipulate and view textual data. Pandas is a fast in-memory data management library that supports sorting, querying and viewing of data. Spacy is an NLP library. These two libraries are installed with:
pip install pandas
pip install spacy
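The examples below also load Spacy’s small English model (en_core_web_sm), which is downloaded separately with:

python -m spacy download en_core_web_sm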
As a first example we’ll look at some text and use Spacy to identify what type of word each word in the sentence is:
# red2.py - Spacy test to get word types
#
import pandas as pd
import spacy
# Use the English core small web dictionary file
nlp = spacy.load('en_core_web_sm')
nlp.entity
# load some sample text into Spacy
doc = nlp('Astronomers in Canada have revealed details of '
          'mysterious signals emanating from a distant galaxy')
print(doc,"\n")
# list the text and show the word type
for w in doc:
    print(w, w.pos_)
The output from this will be:
>python red2.py
Astronomers in Canada have revealed details of mysterious signals emanating from a distant galaxy

Astronomers NOUN
in ADP
Canada PROPN
have VERB
revealed VERB
details NOUN
of ADP
mysterious ADJ
signals NOUN
emanating VERB
from ADP
a DET
distant ADJ
galaxy NOUN
Spacy identifies each word by its word type, like Astronomers NOUN.
Proper nouns (PROPN) like Canada can be filtered even further by entity type, in this case a location (GPE).
If you are only interested in proper nouns then it is possible to get the actual entity type, for example: person, location, organization, work of art, date etc. To get these entities the doc.ents object is queried.
# red3.py - Spacy test to get noun types
#
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')
nlp.entity
doc = nlp('Fred Smith and John Doe live in Toronto and they work for the Toronto Raptors.')
print(doc,"\n")
stuff = []
for w in doc.ents:
    print(w.text, w.label_)
The output for this is:
>python red3.py
Fred Smith and John Doe live in Toronto and they work for the Toronto Raptors.

Fred Smith PERSON
John Doe PERSON
Toronto GPE
the Toronto Raptors ORG
Pandas – to query/group and count
The Pandas library is extremely useful for statistical and data manipulation functions. If we expand our earlier example to include Pandas we can do some querying, grouping and counting.
# NLP with Pandas data frames for queries/grouping/counting
#
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.entity

doc = nlp('Fred Smith and John Doe live in Toronto and they work for the Toronto Raptors.')
print(doc,"\n")

stuff = []
for w in doc.ents:
    print(w.text, w.label_)
    stuff.append([w.text, w.label_])

# define a structure for the data frame columns
dflabel = ['keyword','wordtype']
# load the list into a Pandas data frame with our structure
df = pd.DataFrame(stuff, columns=dflabel)
# print our data frame
print(df.head(n=50))

# create a new data frame with only the wordtype PERSON, then group and count it
names = df.query("wordtype=='PERSON'").groupby('keyword').count().sort_values(by='wordtype',ascending=False)
print(names.head(n=50))
The results for this would be:
Fred Smith and John Doe live in Toronto and they work for the Toronto Raptors.

Fred Smith PERSON
John Doe PERSON
Toronto GPE
the Toronto Raptors ORG
               keyword wordtype
0           Fred Smith   PERSON
1             John Doe   PERSON
2              Toronto      GPE
3  the Toronto Raptors      ORG
            wordtype
keyword
Fred Smith         1
John Doe           1
Getting “Who’s in the News” from Reddit
Now we’re ready to put the pieces together. In this next example we’ll use the /r/inthenews subreddit and query the 750 newest items. From the results we’ll look at the people who are making the news.
# red_2_names.py - Get top names in Reddit "inthenews"
#
import pandas as pd
import spacy
import praw

# Modify for your reddit id
reddit = praw.Reddit(client_id='xQsMfXXX',
                     client_secret='X8r62koQxxxxxxxx',
                     user_agent='myreddit', username='yourname', password='xxxx')

# Combine the headlines into one block of text (separated by spaces)
thedata = ""
i = 0
for submission in reddit.subreddit('inthenews').new(limit=750):
    i += 1
    #print(i,submission.title)
    thedata = thedata + submission.title + " "

nlp = spacy.load('en_core_web_sm')
nlp.entity
doc = nlp(thedata)

# Create a list of keywords and wordtypes
stuff = []
dflabel = ['keyword','wordtype']
for w in doc.ents:
    stuff.append([w.text, w.label_])
#print(stuff)

df = pd.DataFrame(stuff, columns=dflabel)

names = df.query("wordtype=='PERSON'").groupby('keyword').count().sort_values(by='wordtype',ascending=False)

print("Who is making the news?\n")
print(names.head(n=10))
This example will generate results similar to the following:
python red_2_names.py
Who is making the news?

               wordtype
keyword
Trump                14
Ocasio-Cortez         6
Donald Trump          4
Cohen                 3
Nancy Pelosi          2
Jeff Bezos            2
Ronald Reagan         2
Brown                 2
Jayme Closs           2
Jared Kushner         2
The query could be changed to look at other entity types, like locations (GPE) or organizations (ORG), or to count all entity types.
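For example, a minimal sketch (reusing the df data frame built in the script above) that counts organizations instead of people could look like:

# Count organizations (ORG) instead of people (PERSON)
orgs = df.query("wordtype=='ORG'").groupby('keyword').count().sort_values(by='wordtype',ascending=False)
print(orgs.head(n=10))

# Or drop the query entirely to see how many entities of each type were found
alltypes = df.groupby('wordtype').count().sort_values(by='keyword',ascending=False)
print(alltypes)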
Final Comments
With just a small amount of code it is possible to do some amazing things. However, like any statistical project, the data set can be improved; for example, entire news articles could be imported rather than just the headlines.
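As a rough sketch of that idea (assuming the requests and beautifulsoup4 packages are installed, and reusing the reddit connection from above), the linked article text could be pulled in with something like:

# Rough sketch: fetch the linked article body for each submission
# (assumes: pip install requests beautifulsoup4)
import requests
from bs4 import BeautifulSoup

thedata = ""
for submission in reddit.subreddit('inthenews').new(limit=25):
    try:
        html = requests.get(submission.url, timeout=10).text
        soup = BeautifulSoup(html, 'html.parser')
        # very naive extraction: join the <p> paragraphs of the page
        article = " ".join(p.get_text() for p in soup.find_all('p'))
        thedata = thedata + article + " "
    except Exception as e:
        print("Skipping", submission.url, ":", e)

# thedata can now be passed to nlp() exactly as before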