What’s Trending?

Chrome Extension / Web-Scraper

Introduction

I have been interested in learning the Chrome extension framework for a long time, so I decided to take a crack at it. I heard Sam Parr mention on a podcast that he uses RedditList to see what is trending around the web. So I decided to build a new-tab-override Chrome extension that displays what’s trending around the web in a minimal form. I have also been interested in learning Django, so I chose to build a Django API to handle all the web scraping.

Step 1: Scrape Trending Data

I first needed to decide which sites I wanted to scrape and check whether scraping them was feasible with basic tools (and basic knowledge).

Target Sites:

  1. Product Hunt [Success]
  2. Google Search Trends [Success]
  3. Twitter [No Success Yet]
  4. YouTube [Success]
  5. RedditList [Success]

Getting Started

My original plan was to build everything in JavaScript inside the Chrome extension. However, I quickly found that Chrome imposes a lot of limitations that make it difficult to request multiple pages and perform all the necessary tasks. It also became clear that having the client do all the scraping in the browser every time they opened a new tab would be slow and unnecessary.

Why not get some API experience out of this project as well and create a Django API? With this design I could write all the scrapers in Python, and the Chrome extension could fetch the data from the endpoints I created.

I am going to assume a basic understanding of Django, as there are many excellent Django tutorials and a complete walk-through would be LONG. Instead, I will walk through the basic steps of the process with a few code snippets. The full project can be found on my GitHub.

Django API

  1. Install and set up the Django REST Framework.
  2. Create an app that will hold all the models: ‘scraping’.
  3. Create a model for each element:
  • Models are Django’s abstracted representation of a database table.
  • I chose to write a different model for each of my sites, as they won’t all hold the same fields.
scraping/models.py
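As a rough sketch, a model for the Product Hunt entries might look something like this (the field names here are illustrative, not necessarily the exact ones in the repo):

# scraping/models.py (sketch; field names are illustrative)
from django.db import models


class ProductHuntProduct(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    link = models.URLField()
    upvotes = models.IntegerField(default=0)
    scraped_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.name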

4. Create a serializer for each element:

  • Serializers act as translators that convert our model objects into other formats such as JSON or XML.
scraping/serializers.py
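A sketch of the matching serializer, assuming the illustrative ProductHuntProduct model above:

# scraping/serializers.py (sketch)
from rest_framework import serializers

from .models import ProductHuntProduct


class ProductHuntProductSerializer(serializers.ModelSerializer):
    class Meta:
        model = ProductHuntProduct
        fields = ['name', 'description', 'link', 'upvotes', 'scraped_at']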

5. Create a ViewSet for each element:

  • ViewSets provide a simple interface through which requests interact with the models.
scraping/views.py
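A sketch of one such ViewSet; a read-only ViewSet is enough for this illustration, since the extension only ever reads data, though the repo may use a full ModelViewSet:

# scraping/views.py (sketch)
from rest_framework import viewsets

from .models import ProductHuntProduct
from .serializers import ProductHuntProductSerializer


class ProductHuntProductViewSet(viewsets.ReadOnlyModelViewSet):
    # The extension only fetches data, so list/retrieve is all we need.
    queryset = ProductHuntProduct.objects.order_by('-upvotes')
    serializer_class = ProductHuntProductSerializer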

6. Set up the routers and URL patterns:

scraping/urls.py
  • These are the API endpoints the Chrome extension will use to access the data.
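A sketch of the router wiring for the illustrative ViewSet above (the actual route names in the repo may differ):

# scraping/urls.py (sketch)
from django.urls import include, path
from rest_framework import routers

from .views import ProductHuntProductViewSet

router = routers.DefaultRouter()
router.register(r'producthunt', ProductHuntProductViewSet)

urlpatterns = [
    path('', include(router.urls)),
]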

7. Set up a PostgreSQL DB (to be hosted on Heroku)

8. Deploy to Heroku

9. Create the scraper and set it up as a custom django-admin command.

This part deserves a bit more explanation, as it is an essential part of the project. The whole point of this tool is to show the user what is trending at any given moment, so I need to run my scraper and update the DB every few minutes. I explored a few options for this, starting with the task-scheduling tool Celery. After running into a few hiccups with Celery, I read about custom django-admin commands, which I could automate via Heroku. These are simple tools that create commands you can execute from the CLI, in the same fashion as other commands such as:

python manage.py runserver

We can create a script that we call in the same manner. For this project, I made a script called scrape.py, which runs my scrapers and updates the DB if the scraping was successful.
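Django looks for these commands in a management/commands/ directory inside an app, and the file name becomes the command name. A minimal sketch of what scrape.py could look like (the scrape_product_hunt helper and the snapshot-replacement logic are illustrative, not necessarily how the repo does it):

# scraping/management/commands/scrape.py (sketch)
from django.core.management.base import BaseCommand

from scraping.models import ProductHuntProduct
from scraping.scrapers import scrape_product_hunt  # hypothetical helper module


class Command(BaseCommand):
    help = 'Scrape the target sites and refresh the trending tables.'

    def handle(self, *args, **options):
        products = scrape_product_hunt(top_x=10)
        if products:
            # Only replace the old snapshot when the scrape succeeded.
            ProductHuntProduct.objects.all().delete()
            ProductHuntProduct.objects.bulk_create(
                [ProductHuntProduct(**fields) for fields in products]
            )
        self.stdout.write(self.style.SUCCESS(f'Scraped {len(products)} products'))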

To run this script, I can simply run the following command:

python manage.py scrape

I can then install the Heroku Scheduler add-on and schedule this job to run every x minutes.

Scraper

Now let’s take a look at the scrapers themselves. I ended up writing four different scripts:

  1. productHuntScraper
  2. searchTrendsScraper
  3. youtubeScraper
  4. redditScraper

Unfortunately, I couldn’t figure out how to scrape Twitter’s Trending page. I believe the trending info is accessible through their API, but I wasn’t able to figure it out.

Two of the sites (Product Hunt, RedditList) were simple enough to scrape with BeautifulSoup, a popular Python library for extracting data from HTML. The other two (Google, YouTube) required a web driver. A web driver essentially acts as a user would: it opens a browser window and gives you access to the DOM as it would be presented to the user. Instead of walking through all four scrapers, I will walk through one of each kind to touch on the basic ideas.

Product Hunt

Product Hunt is one of my favorite sites and was my first stop to see if we could get access to their trending products list. If you don’t know Product Hunt, it’s a fantastic site that features new products from across the web. The products with the most daily upvotes float to the top, and these are the ones I wanted to capture for the trending page. Let’s take a look at the site and its source code.

Excellent: Product Hunt displays its products in an HTML list with all the data stored directly in the HTML. This is exactly what we want to see. Now all our scraper needs to do is request the page, find the <li> tags, and extract the information we’re interested in.

We’re going to use BeautifulSoup to parse the HTML. BeautifulSoup is a great library that simplifies extracting data out of HTML and XML.

Quick Breakdown:

Lines 6–15: reading the Product Hunt HTML into a BeautifulSoup object. Note: topX is the number of elements we actually want to display, so we’re just going to capture the top 10 products.

Lines 16–34: finding the posts list in the HTML and then getting each list item. Note: in the selector on line 16, ul[class*="postsList"], the star is special notation that lets us search for ul elements whose class contains ‘postsList’. More info Here.

Lines 38–50: here we extract the data we are interested in: the link, name, description, and number of upvotes.
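Putting those pieces together, a condensed sketch of the approach looks roughly like this; the selectors are assumptions that will break whenever Product Hunt changes its markup, and the description/upvote extraction is omitted here for brevity:

# productHuntScraper (condensed sketch)
import requests
from bs4 import BeautifulSoup

TOP_X = 10  # number of products we actually want to display


def scrape_product_hunt(top_x=TOP_X):
    html = requests.get('https://www.producthunt.com/').text
    soup = BeautifulSoup(html, 'html.parser')

    # '*=' matches any <ul> whose class contains 'postsList'
    posts_list = soup.select_one('ul[class*="postsList"]')
    if posts_list is None:
        return []

    products = []
    for item in posts_list.find_all('li')[:top_x]:
        link = item.find('a', href=True)
        if link is None:
            continue
        products.append({
            'name': link.get_text(strip=True),
            'link': 'https://www.producthunt.com' + link['href'],
        })
    return products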

This is about as straightforward as it gets; thank you, Product Hunt, for being simple. Now let’s take a look at one of the scrapers that needed a web driver.

YouTube

YouTube has a Trending page that shows its top trending videos. I thought these could be interesting to include on our trending page, as the videos might be about anything. Let’s take a look at the trending page and its source code.

After a few minutes of looking around in the source code, I noticed that each element was stored in a div with id='title-wrapper'. This is how I chose to identify each element, or video. This scraper is actually pretty simple as well, but it does require a web driver. Again, a web driver lets the scraper view the page as it would be presented to a user in a browser. The Selenium web driver tools were originally developed to help automate testing of web applications. Let’s take a look at the scraper.

Lines 7–11: setting up the web driver and requesting the URL.

Line 13: collecting each element (or video) by its id='title-wrapper'.

Lines 16–24: extracting the interesting data from each element. Here we are able to extract the video title, info, and link. Note: luckily, the date/time posted and number of views are included in the video’s info.
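A condensed sketch of the same idea, assuming a recent Selenium with a local ChromeDriver; the 'video-title' anchor id is an assumption about YouTube’s markup and, like 'title-wrapper', may change over time:

# youtubeScraper (condensed sketch)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def scrape_youtube_trending(top_x=10):
    options = Options()
    options.add_argument('--headless')  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get('https://www.youtube.com/feed/trending')
        videos = []
        # Each trending video's title block sits inside a div with id='title-wrapper'
        for wrapper in driver.find_elements(By.ID, 'title-wrapper')[:top_x]:
            title_link = wrapper.find_element(By.ID, 'video-title')
            videos.append({
                'title': title_link.text,
                'link': title_link.get_attribute('href'),
            })
        return videos
    finally:
        driver.quit()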

New-Tab Override Chrome Extension

The Chrome extension itself is extremely simple, and thanks to the excellent documentation, the basic framework was easy to learn. This project only contains four files.

  1. manifest.json : the only required file; it gives Chrome all the information needed to run the extension, similar to a package.json.
  2. home.html : the HTML of the page the extension will display.
  3. home.css : the CSS for our home page.
  4. script.js : the JavaScript file that calls our Django API and then populates the tables in home.html.

manifest.json

This is the important file that tells Chrome exactly how the extension should work. In more complex projects there would be many more fields specifying settings, security details, etc. More Here.

  • chrome_url_overrides : this specifies that we will override the new-tab page with home.html.
  • permissions : here we add our API endpoint to the permissions list.
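A minimal manifest along those lines might look like the sketch below; the name and the Heroku URL are placeholders, and this uses manifest_version 2 (newer Chrome versions expect version 3, where host permissions move to a separate "host_permissions" key):

{
  "manifest_version": 2,
  "name": "What's Trending",
  "version": "1.0",
  "description": "New-tab page that shows what's trending around the web.",
  "chrome_url_overrides": {
    "newtab": "home.html"
  },
  "permissions": [
    "https://<your-heroku-app>.herokuapp.com/*"
  ]
}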

That’s it; very straightforward for such a simple project. Now let’s take a look at our home.html page and the script that does all of the work.

home.html

Here we are just setting up a table for each of the sites we scraped. I’m only showing the first table, for Product Hunt, as the other tables are very similar.

script.js

The script file is going to do all of the work for us. Now that we have our API set up and our home page designed, we need to collect our data and fill our tables. I will break this down into the render functions and the API calls.

For each of my tables, I created a render function that takes in the data from the API call and populates the relevant table. This is not the best design, as there is a lot of code repetition, but I chose to write separate functions for each table for readability and encapsulation. For example, if the data displayed from Product Hunt were to change, I would only have to change the renderPH function without worrying about breaking any of the other table renderings.

Here, we make our API calls and call our render functions. Again, all of these calls could have been consolidated into a single function that makes the API request and then calls the appropriate render function; I chose to keep them separate for readability and simplicity.
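As a sketch of what one fetch-plus-render pair could look like (the endpoint URL, table id, and field names here are placeholders rather than the ones in the repo):

// script.js (sketch)
const API_BASE = 'https://<your-heroku-app>.herokuapp.com';

// Fill the Product Hunt table with one row per product.
function renderPH(products) {
  const table = document.getElementById('ph-table');
  products.forEach((product) => {
    const cell = table.insertRow().insertCell();
    const link = document.createElement('a');
    link.href = product.link;
    link.textContent = `${product.name} (${product.upvotes})`;
    cell.appendChild(link);
  });
}

// Fetch the Product Hunt endpoint and hand the JSON to its render function.
fetch(`${API_BASE}/producthunt/`)
  .then((response) => response.json())
  .then(renderPH)
  .catch((error) => console.error('Product Hunt fetch failed:', error));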

Conclusion

Looking back on the project, it seems a lot simpler than it did when I was starting out. Thanks to the great Django and Chrome documentation, this was a very approachable project, and I am happy I took it on. It is still a work in progress: unfortunately, I have not been able to get my Selenium web drivers running properly in the Heroku deployment, and there is clearly more work to be done on the final layout of the page. I really wanted to get this post up, as I have been busy with other coursework. Thank you for reading, and please don’t hesitate to ask questions or suggest how I could have made the project better.

Disclaimer: this project is strictly for academic purposes. As many sites do not condone scraping of their information, especially for products, I chose not to deploy this Chrome extension. Instead, it lives on my GitHub, where you are more than welcome to check it out and make a version of your own.
