Introduction

Project Goal: Use word embeddings to identify company names and stock tickers in natural text.

Assumption: Stock tickers and company names are used in similar contexts in natural text, such as a Reddit post or a tweet.

Under this assumption, word embeddings should be a good fit for identifying these target words, since embeddings are learned from the contexts in which words appear.

Plan:

  1. Create a vector that can be used to represent a target word (more below).
  2. Use this representative vector to identify target words in Reddit posts (a rough sketch follows below).
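
To make the plan concrete, here is a minimal sketch of steps 1 and 2, assuming pre-trained embeddings are already loaded into a plain dictionary of numpy vectors; the seed tickers, the tokenization, and the similarity threshold below are illustrative choices, not necessarily the ones used in the project.

import numpy as np

# Assumed: a lookup from lower-cased tokens to fixed-length vectors,
# e.g. loaded from a pre-trained word2vec or GloVe model.
embeddings = {}  # placeholder: fill with real vectors before use

def representative_vector(seed_words):
    # Average the embeddings of known tickers / company names (step 1).
    vectors = [embeddings[w] for w in seed_words if w in embeddings]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_candidates(tokens, rep_vec, threshold=0.5):
    # Flag tokens whose embedding sits close to the representative vector (step 2).
    return [t for t in tokens if t in embeddings and cosine(embeddings[t], rep_vec) >= threshold]

# Illustrative usage (the seed list and post text are made up):
# rep = representative_vector(["aapl", "tsla", "msft"])
# find_candidates("bought more gme at the open".split(), rep)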

In this…


Chrome Extension / Web-Scraper

Introduction

I have been interested in learning the Chrome extension framework for a long time, so I decided to take a crack at it. I heard Sam Parr mention on a podcast that he uses RedditList to see what is trending around the web, so I decided to build a new-tab-override Chrome extension that displays what's trending around the web in a minimal form. I have also been interested in learning Django, so I chose to build a Django API to handle all the web scraping.

Step 1: Scrape Trending Data

I first needed to…
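
The details aren't shown in this excerpt, but a rough sketch of the idea — a Django endpoint whose response the extension's new-tab page can fetch — might look something like this, assuming requests and BeautifulSoup handle the scraping; the URL and CSS selector are placeholders, not the ones actually used.

# views.py — rough sketch of a scraping endpoint (placeholder URL and selector).
import requests
from bs4 import BeautifulSoup
from django.http import JsonResponse

def trending(request):
    # Fetch a trending-links page and return its titles as JSON for the new-tab page.
    page = requests.get("https://example.com/trending", timeout=10)  # placeholder URL
    soup = BeautifulSoup(page.text, "html.parser")
    titles = [a.get_text(strip=True) for a in soup.select("a.title")]  # placeholder selector
    return JsonResponse({"trending": titles})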


Python and C Implementation.

Introduction

There are many different implementations of the circular queue, each of which may be better suited to specific applications. This blog post aims to explain how a circular queue works, along with its uses and advantages.

Circular Queue

A queue is a simple data structure that implements FIFO (First-In-First-Out) ordering: the first item added to the queue is the first one out. Just like a line of customers at the deli, the first customer in line is the first to be served. …
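
As a quick illustration before going further, here is a minimal fixed-capacity circular queue in Python; this is a simplified sketch, and the class and method names are illustrative.

class CircularQueue:
    # Fixed-capacity FIFO queue backed by a list used as a ring buffer.

    def __init__(self, capacity):
        self.buffer = [None] * capacity
        self.capacity = capacity
        self.head = 0   # index of the oldest item
        self.size = 0   # number of items currently stored

    def is_empty(self):
        return self.size == 0

    def is_full(self):
        return self.size == self.capacity

    def enqueue(self, item):
        if self.is_full():
            raise OverflowError("queue is full")
        tail = (self.head + self.size) % self.capacity  # wrap around the end of the list
        self.buffer[tail] = item
        self.size += 1

    def dequeue(self):
        if self.is_empty():
            raise IndexError("queue is empty")
        item = self.buffer[self.head]
        self.buffer[self.head] = None
        self.head = (self.head + 1) % self.capacity     # advance, wrapping around
        self.size -= 1
        return item

# The first item enqueued is the first dequeued, just like the deli line above.
q = CircularQueue(3)
q.enqueue("first customer")
q.enqueue("second customer")
print(q.dequeue())  # -> first customer

The modulo arithmetic is what makes the queue "circular": when the tail reaches the end of the underlying list it wraps back to index 0, so storage is reused instead of shifting elements.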


Most Hiring Companies, Top Tools & Tech, and More

Introduction

This is an August 2019 update of my original project, in which I explore the job market for data analysts and data scientists in the Greater Boston Area.

These visuals were produced only from job listings posted on Indeed with the search term ‘data analyst’ or ‘data science’, and therefore only represent companies that chose to post on Indeed. Since this is an update of a previous post, I will only show the visuals here. …


Introduction

The goal of this post is to share a scripting problem with which I was challenged. Not having much experience with these types of challenges, I thought it would be a great opportunity to share my approach and look for feedback.

The Challenge

Write a script that will convert any .RIS file into a well-formed XML document.

  • Include assumptions and detailed notes.
  • The script must not have any dependencies outside the base language.
  • The script should be robust against accidental misuse.
  • Each record should be placed within an article tag.

Although I know that other languages are probably better suited to these types of processes…
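
A stripped-down, standard-library-only sketch of the core idea might look like the following; it assumes the common RIS layout of "TAG  - value" lines with each record ending in an ER tag, and the element names are illustrative rather than taken from the actual solution.

# ris_to_xml.py — minimal sketch using only the standard library.
import sys
import xml.etree.ElementTree as ET

def ris_to_xml(lines):
    root = ET.Element("articles")
    record = None
    for line in lines:
        line = line.rstrip("\n")
        # Expect lines shaped like "TY  - JOUR"; skip anything else (basic robustness).
        if len(line) < 6 or line[2:6] != "  - ":
            continue
        tag, value = line[:2], line[6:].strip()
        if tag == "TY":          # a new record starts
            record = ET.SubElement(root, "article")
        if record is None:
            continue             # ignore stray lines before the first TY
        if tag == "ER":          # the record ends
            record = None
        else:
            ET.SubElement(record, tag).text = value
    return ET.ElementTree(root)

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        tree = ris_to_xml(f)
    tree.write(sys.argv[2], encoding="utf-8", xml_declaration=True)

Each record lands inside its own article tag, and unrecognized or malformed lines are skipped rather than crashing the script, which addresses the "robust against accidental misuse" requirement at a basic level.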


Working with Corpora, Document-Term Matrices, Sentiment Analysis, etc…

Introduction

This is a quick walk-through of my first project working with some of the text-analysis tools in R. The goal of this project was to explore the basics of text analysis, such as working with corpora, document-term matrices, sentiment analysis, etc…

Packages used

Quick Look at the Data Source

I am using the job descriptions from my latest web-scraping project, which includes about 5,300 job postings pulled from Indeed.

We are going to focus on the job descriptions here, as they contain the most text and information. …


Scraping Indeed with Rvest | Data Wrangling with Tidyverse | Text Mining with Stringr & Tidyverse | Visualization with GGplot2.

Introduction

In this project, I aimed to explore the job market for data analyst and data scientist roles in Boston. This seemed like a great opportunity to learn about web scraping, so I built a scraper to pull this information from Indeed and explored the data.

Part 1: Scraping the Data from Indeed

For the scraper I decided to use 2 different job titles in 3 different cities, producing 6 different search terms.

  1. Data Science in New York
  2. Data Analysis in New York
  3. Data…


Introduction

In this project, I aimed to practice different hypothesis tests in R while exploring data from the 2017 MLB season. I will briefly walk through the data exploration and cleaning but will focus on the statistical tests. After exploring the data, I chose the following questions to ask, leading to four different statistical tests.

  1. Do highly paid players win more games? : Z-test
  2. Is the home-field advantage real? : Chi-squared test
  3. Do players who strike out more also hit more home runs? : T-test
  4. Do years of experience affect a player's RBI? : ANOVA
  5. Do more people attend night games or day…

A cheat-sheet walk-through

What is Tidyverse?

Tidyverse is a collection of R packages designed to work together to help users stay organized and efficient throughout their data science projects. The core of the Tidyverse consists of the following eight packages:

1. readr: for data import.
2. tidyr: for data tidying.
3. tibble: for tibbles, a modern re-imagining of data frames.
4. dplyr: for data manipulation.
5. stringr: for strings.
6. ggplot2: for data visualisation.
7. purrr: for functional programming.
8. forcats: for dealing with factors.

See more on the tidyverse site.

Tidyverse and RStudio have put out extremely helpful…

Brian Ward

M.Sc Computer Science at Northeastern University | Data Analysis: R, MySQL, Tableau | https://www.linkedin.com/in/brianward1428/
