
Big Data from scratch: Env setup (Part 1)

Recently I have been busy juggling big data. It is a deep ocean with a wide range of domain knowledge, and it was particularly difficult for someone who does not have a computer science background. Although there are lots of resources floating around the internet, not all of them are intuitive. After all the filtering and trial and error I went through while learning the big data ecosystem, I thought I should write a comprehensive tutorial, not only as documentation for myself, but also for people who are in the same boat I was, because I never found a single learning resource that explains things in detail from the ground up. I hope the stuff I share here will be of use to someone starting their big data journey, as the tech world is certainly moving fast. I remember that when I finished my undergraduate degree we were using SPSS as a “database”, and things have certainly changed drastically with new technology. I'm not sure what the future looks like, but it is very exciting and scary at the same time.

Some may argue that as a statistician or data scientist, one does not need to learn the engineering side of things. Indeed, in the real world these tasks will be divided among software engineers, system architects, system admins, ML engineers, data engineers, ETL people, database administrators and data scientists. However, learning some or even all of them will give you a cohesive view of the entire ecosystem, from a single piece of metadata to a fully deployed ML solution, which might come in handy sooner or later.

Let’s dive in! (The GitHub repository for my work can be found here)

(more…)

Text mining classified advertising content using R, SQL and phpMyAdmin (Part I)

wesearch.co.nz was launched back in 2010 as a personal project for learning about the web. I was really proud of the fact that I took something from imagination to execution. However, due to various constraints and other personal commitments at the time, the site/idea unfortunately didn’t receive the maintenance and improvement it deserved. Although the site initially received a fair amount of attention from the public, there have only been around 300 listings posted to date.

I realized I never had the chance to look at the actual classified ads data stored behind the site. So this week I decided to learn some text mining techniques, and thought it would be a great idea to extract the ads data from the site and use it as sample data.

Disclaimer: this is my first attempt, so I’m by no means a text mining expert. If there are errors or better ways to do things, please let me know. Below is what I have done:

(more…)

Using R for plotting earthquake data and it was shocking

Note: plot produced using R, showing plate boundaries and earthquake data (01/Jan/2014–20/Dec/2016) with magnitude 5 or greater

P.S. I never really looked into earthquake data until the recent Kaikoura event happened in NZ, and it elevated my curiosity in this space. In a nutshell, although it was obvious that NZ and South America sit on several fault lines within deforming plate boundary zones, seeing the number of earthquakes illustrated graphically gave me a bit of a shock.

Has anyone previously applied machine learning in catastrophe/risk management, more specifically in attempting to predict earthquakes (e.g. feeding large amounts of time series data into machine learning algorithms to output a predicted classification)? Please let me know.
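To make the question concrete, here is a minimal sketch of the kind of pipeline I have in mind: slice a time series into windows, summarise each window as a small feature vector, then classify. Everything here is my own assumption for illustration only. The synthetic data, the two features, and the nearest-centroid classifier are toys, not anything from a real seismology study.

```python
import random
import statistics

random.seed(42)

def make_window(active):
    """Synthetic 'seismic' window: active periods have larger swings."""
    scale = 3.0 if active else 1.0
    return [random.gauss(0, scale) for _ in range(50)]

def features(window):
    """Summarise a window as (standard deviation, max absolute amplitude)."""
    return (statistics.pstdev(window), max(abs(x) for x in window))

# Labelled training windows: 1 = "active" period, 0 = "quiet" period
train = [(features(make_window(bool(lab))), lab) for lab in [0, 1] * 50]

def centroid(label):
    """Average feature vector for one class."""
    rows = [f for f, lab in train if lab == label]
    return [sum(col) / len(rows) for col in zip(*rows)]

centroids = {lab: centroid(lab) for lab in (0, 1)}

def predict(window):
    """Assign the window to the class with the nearest feature centroid."""
    f = features(window)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# A flat window looks "quiet", a large-swing window looks "active"
print(predict([0.0] * 50), predict([-10.0, 10.0] * 25))  # → 0 1
```

This obviously ignores everything that makes real earthquake prediction hard (rare events, non-stationarity, spatial dependence), but it shows the windows → features → classifier shape of the idea.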