Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 24330

PyBites: From webscraper to wordcloud

$
0
0

Living in Belgium, I decided to scrape the Belgian newspaper Het Laatste Nieuws. I wanted to know what kept people busy when reading the news, so I went for a collection of all comments on all articles in the news section.

You can find the full code here.

Index

Requirements

The little Scraper that could

Bypassing the cookiewall

According to cookielaw.org the cookielaw can be described as following:

The Cookie Law is a piece of privacy legislation that requires websites to get consent from visitors to store or retrieve any information on a computer, smartphone or tablet.

It was designed to protect online privacy, by making consumers aware of how information about them is collected and used online, and give them a choice to allow it or not.

This means that if we haven't visited the page before, we will be greeted with a message that will block our access, asking for permission to put the cookies on our computer.

The great Cookiewall of HLN

To get past this 'cookiewall', the server needs to be presented with a cookie, so it knows we have given consent and it can track us without legal implications.

I also set my user agent to the same as the one on my computer, so there would be no differences in the source presented to me based on what browser I was using.

user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'consent_cookie={'pws':'functional|analytics|content_recommendation|targeted_advertising|social_media','pwv':'1'}...defget_pagedata():reqdata=req_session.get("https://www.hln.be/nieuws",cookies=consent_cookie,headers={'User-Agent':user_agent})returnreqdata

Et voila! No more cookiewalls, let the scraping begin!

Getting the articles

Now that I could get all the links, I had to go back to the HTML source of the page to figure out a good way to obtain the articles. I ended up with a pretty conclusive piece of code for the time of writing, and also added a way to get the categories for the articles as labeled by HLN.

All these records were turned into Article namedtuples:

Article=collections.namedtuple('Article','hash headline url categories')
defget_all_articles(reqdata):soup=BeautifulSoup(reqdata.text,"html.parser")article_list=[]html_article_list=soup.findChildren("article")forhtml_articleinhtml_article_list:article_wrappers=html_article.findChildren("div")try:html_indiv_articles=article_wrappers[1].findChildren("a",{"class":"teaser-link--color"})forarticleinhtml_indiv_articles:article_link=article['href']categories=get_categories_from_article(article_link)article_title=article.findChild("h1").textifarticle_titleisNone:article_title=article.find("h1")ifarticle_titleisNone:exit(0)sha1=hashlib.sha1()sha1.update(article_title.encode('utf-8'))article_hash=sha1.hexdigest()article_list.append(Article(hash=article_hash,headline=article_title,url=article_link,categories=categories))exceptIndexError:# these are the divs from the most-read category, we should already have these.continue

I was pretty agressive on errorhandling by either exit()ing completely, or simply ignoring the exception and continuing my loop, but that is because I feel scraping is a precise art and if data is something different than what you expect, you're expecting the wrong data!

Finally, I looped over the article list because I noticed that there were doubles (some articles might be on a 'featured' bar, and it was sometimes hard to distinguish between them)

clean_article_list=[][clean_article_list.append(itm)foritminarticle_listifitmnotinclean_article_list]returnclean_article_list

The Comments, a new challenge

Now that I had a list of all articles and the links to them, I wanted to get started by getting all the comments when I noticed my first scraping run had only gotten me 2 or 3 per article. Knowing there were 100's of comments on some of the articles in my article dataset, I realized something was wrong.

Back at the drawing board, we found the problem. A little thing called Ajax.

Every article loaded a couple of comments, and a link that said 'Show more comments'. Comments

When clicking this link, an Ajax call was made to get the next comments. If there were more after that, a link was also included for the next ajax call.

The solution came with regex, as the Ajax links all were in a very specific pattern!

Still, the recursiveness in the puzzle was a bit challenging.

comment_regex="href=\"(https\:\/\/www\.hln\.be\/ajax\/comments\/(.*?)\/start\/ts_\d*?)\">"comment_rxobj=re.compile(comment_regex)defget_comments_from_page(reqdata,article_hash):comment_list=[]soup=BeautifulSoup(reqdata.text,'html.parser')comments_list_ul=soup.find("ul",{"class":"comments__list"})ifcomments_list_ulisNone:returncomment_listcomments_indiv_li=comments_list_ul.findChildren("li",{"class":"comment"})forcommentincomments_indiv_li:comment_author=comment.find("h2",{"class":"comment__author"}).textcomment_body=comment.find("p",{"class":"comment__body"}).text.replace("\n"," ").replace("\r","")comment_list.append(Comment(article_hash=article_hash,commenter=comment_author,comment_text=comment_body))returncomment_listdefget_all_comments(article_main_link,article_hash):comment_href_list=[]comment_list=[]article_reqdata=req_session.get(url=article_main_link,cookies=consent_cookie,headers={'User-Agent':user_agent})reqdata=article_reqdatacomment_list+=get_comments_from_page(reqdata=reqdata,article_hash=article_hash)whilecomment_rxobj.search(reqdata.text):comment_href_group=comment_rxobj.findall(reqdata.text)comment_href=comment_href_group[0][0]comment_href_list.append(comment_href)time.sleep(sleeptime)reqdata=req_session.get(comment_href,cookies=consent_cookie,headers={'User-Agent':user_agent})comment_list+=get_comments_from_page(reqdata=reqdata,article_hash=article_hash)returncomment_list

Because I didn't want to be sending too many requests in a short span, I also added some ratelimiting to not be too aggressive towards the hln server.

A tiny bit of AI: SpaCy

If we want to know what people are talking about, we have to be able to recognize the different parts of a sentence. If we would simply look at all words, a lot of 'non-descriptive' words would probably show up. It would still give a good image, but a lot of space would probably be taken by stop words or conjunctions (like 'and')

What we really care about is the subjects people are talking about, the Nouns of the lines they write. That's where SpaCy comes in!

SpaCy is an open source library for Natural Language Processing. There are trained models for a set of languages that can be used for a lot of different things.

This allowed us to create a dictionary of 'types' of words:

importspacynlp=spacy.load('nl_core_news_sm')...defget_wordtype_dict(raw_comment_list):wordtype_dict=defaultdict(list)forcommentinraw_comment_list:doc=nlp(comment)fortokenindoc:wordtype_dict[token.pos_].append(token.text)returnwordtype_dict

Outputting the first 5 words for each token:

ADV:
    Eindelijk
    niet
    nu
    nog
    eens
    zo

PUNCT:
    !
    ...
    .
    .
    .
    .
PRON:
    Ik
    dat
    ik
    Ik
    Wat
    dat
VERB:
    ga
    kijken
    Kan
    wachten
    noem
    zie
NOUN:
    Hemels
    nieuws
    ziekte
    Pia
    leven
    toegewenst

The classification was not the most accurate but it was enough for what I wanted to do.

I plan on retraining the model with my new data in the future to hopefully get a more accurate model for Belgian comments with its dialects.

Making things pretty: WordCloud

Now that I had the data, all I had to do was find a way to visualize it. Wordcloud is a library designed to make... well, wordclouds!

The wordcloud library takes a blob of text, and takes turns them in to wordclouds with the size of the words reflecting the number of occurences. After turning the lists of words in our worddict into strings, 2 lines of code were enough to produce a result:

defmake_word_cloud(text_blob):wordcloud=WordCloud(width=1600,height=1200,background_color="white").generate(text_blob)returnwordcloud

For our nouns, this gave us:

Wordcloud Result

You could also use a logo to generate the wordcloud by creating an ImageColorGenerator() and passing it to the wordcloud constructor:

defmake_word_cloud_logo(text_blob):rgb_hln_arr=np.array(load_hln_logo())wcig=ImageColorGenerator(rgb_hln_arr)wordcloud=WordCloud(width=1600,height=1200,background_color="white",mask=rgb_hln_arr,color_func=wcig)\
        .generate(text_blob)returnwordcloud

For ALL the words, this resulted in:

Wordcloud Logo Result

Things for the future

There's room for a lot of improvement in this project, below are some things I really want to get done and you might see me write about in the future:

  • Improve the spacy model for dutch/belgian POS Tagging.

  • This same project but with a more Object Oriented approach

  • Fix the encoding issues in the scraper

  • Sentiment Analysis

  • Merging word groups by combining POS Tagging and Dependency Parsing

--

Thanks for reading, I hope you enjoyed it as much as I enjoyed writing it. If you have any remarks or questions, you can likely find me on the Pybites Slack Channel as 'Jarvis'.

Keep calm and code in Python!

-- Cedric


Viewing all articles
Browse latest Browse all 24330

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>