
Text mining classified advertising contents using R, SQL and phpmyadmin (Part I)

wesearch.co.nz was launched back in 2010 as a personal project for learning about the web. I was really proud of the fact that I took something from imagination to execution. However, due to various constraints and other personal commitments at the time, the site/idea unfortunately didn't receive the maintenance and improvement it deserved. Although the site initially received a fair amount of attention from the public, there have only been around 300 listings posted to date.

I realized I had never had the chance to look at the actual classified ads data stored behind the site. This week, I decided to learn some text mining techniques and thought it would be a great idea to extract the ads data from the site and use it as sample data.

Disclaimer: this is my first attempt, so I'm by no means a text mining expert. If there are errors or better ways to do things, please let me know. Below is what I have done.

Here is what the site looks like. It has a traditional classified ads layout, but is being used as a sort of lost-and-found site, aiming to help people find or return items:
[Screenshot: wesearch.co.nz landing page]

Getting the data

The site is based on PHP and the data is stored in a MySQL database, so I first used a SQL query to extract the ads data and exported it as a CSV from phpMyAdmin. (I'm pretty sure there is an API, or you could connect to MySQL remotely (not localhost) and pull the data directly into R, but I personally feel more comfortable doing a quick glance/initial investigation of the data in Excel, hence the CSV export.)
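As an aside, the connect-directly-to-R route mentioned above would look something like this (the database name, credentials and table/column names here are all hypothetical, purely to illustrate the idea):

library(DBI)
library(RMySQL)

# connect to the MySQL instance behind the site (hypothetical credentials)
con <- dbConnect(MySQL(), dbname = "wesearchnz", host = "example-host",
                 user = "readonly_user", password = "...")

# pull the ads straight into an R data frame (hypothetical schema)
ads <- dbGetQuery(con, "SELECT id, cat_id, title, description FROM ads")
dbDisconnect(con)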
[Screenshots: running the SQL query in phpMyAdmin and exporting the results to CSV]

I mapped the category IDs to the category descriptions and then made a frequency plot of the number of ads by category.
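A minimal sketch of that step, assuming the CSV export from above plus a hypothetical category lookup file (the file and column names are my own, not from the site):

# read the exported ads and a category lookup (hypothetical file/column names)
ads <- read.csv("wesearchnz_ads.csv", stringsAsFactors = FALSE)
cats <- read.csv("wesearchnz_categories.csv", stringsAsFactors = FALSE)  # cat_id, cat_name
ads <- merge(ads, cats, by = "cat_id")

# frequency plot of the number of ads per category
barplot(sort(table(ads$cat_name), decreasing = TRUE), las = 2, ylab = "Number of ads")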
[Plot: number of ads by category]
As shown in the plot above, the majority of the ads fall into the "Dogs", "Cats", "General items", "Other" and "Pets & animals" categories.

Preprocessing

Before the real data processing, I decided to concatenate the ad titles with the descriptions. My rationale was that, from randomly inspecting a few of the classified ads, I found that users sometimes put the most relevant content in the ad title rather than in the description text box. By combining the two, we ensure that all the information relating to an ad is available for mining in a single text field.
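In R this is a one-liner (the column names are assumptions based on the description above):

# combine title and description into a single text field per ad
ads$text <- paste(ads$title, ads$description, sep = ". ")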

Next we preprocess the data. R has a package called tm, with which we can build a corpus and specify the source to be a character vector. After building the corpus, we convert all characters to lower case so that the subsequent text mining steps work with more consistent data.
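A sketch of the corpus construction, continuing from the combined text field above:

library(tm)

# build a corpus from the character vector of ad texts
wesearchnz_Corpus <- Corpus(VectorSource(ads$text))

# convert all characters to lower case
wesearchnz_Corpus <- tm_map(wesearchnz_Corpus, content_transformer(tolower))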
[Screenshot: the corpus before cleaning]
We can examine the ad text by indexing a couple of the documents in the newly built corpus. As we can see, there is quite a bit of data cleansing to do. For instance, the data contains various HTML tags, extra spaces, strange symbols and punctuation. The computer cannot actually interpret these; it will simply treat them as words, so we need to clean them out before analysis.
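For example, we can peek at a few documents like this:

# inspect the first few documents in the corpus
inspect(wesearchnz_Corpus[1:3])

# or print a single document's raw content
writeLines(as.character(wesearchnz_Corpus[[1]]))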

Luckily, we can use regular expressions and other functions to remove them. Here we use the pattern matching and replacement function gsub to search for and replace the "br" tags.

# replace newlines and <br /> tags with a space
removeBR <- function(x) gsub("(\n|<br />)", " ", x)
wesearchnz_Corpus <- tm_map(wesearchnz_Corpus, content_transformer(removeBR))

The same logic can be applied to remove URLs, punctuation, special characters, numbers and common words to better prepare the text for analysis, as shown in the sketch below. Even though this can be a very time consuming and tricky task, it pays off in the end and improves the quality of the analysis.
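A minimal sketch of those cleaning steps using tm's built-in transformations (the URL regex is a simple assumption and may not catch every form):

# strip anything that looks like a URL
removeURL <- function(x) gsub("http[^[:space:]]*", " ", x)
wesearchnz_Corpus <- tm_map(wesearchnz_Corpus, content_transformer(removeURL))

# remove punctuation and numbers, then collapse extra whitespace
wesearchnz_Corpus <- tm_map(wesearchnz_Corpus, removePunctuation)
wesearchnz_Corpus <- tm_map(wesearchnz_Corpus, removeNumbers)
wesearchnz_Corpus <- tm_map(wesearchnz_Corpus, stripWhitespace)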
[Screenshot: the corpus after cleaning]

After we've done the same for URLs, punctuation, special characters and numbers, we check the documents again to see if the data is cleaner.

We can then create a Term Document Matrix from the documents. It shows the number of times each term in the corpus occurs in each document.


# build the term-document matrix, keeping terms of any length
wesearchnz_tdm <- TermDocumentMatrix(wesearchnz_Corpus, control = list(wordLengths = c(1, Inf)))
wesearchnz_tdm

[Screenshot: term-document matrix summary]
Because this is a lost-and-found site, we can explore frequent words and the associations of the term "lost".
[Screenshot: exploring the term "lost"]
We can use the findFreqTerms function to inspect frequent words:


# terms appearing at least 30 times; the outer parentheses print the result
(freq.terms <- findFreqTerms(wesearchnz_tdm, lowfreq = 30))

[Screenshot: findFreqTerms output]

Some of the frequent words are "reward", "lost", "friendly", "cat", "white", "stolen", "help", "car", "call" and "auckland". One thing I noticed is that some common English words, e.g. "a", "are", "any", "an" and "have", have been picked up by findFreqTerms. Obviously these need cleansing and should not be included in the analysis. We can achieve this by defining a list of stop words and removing them along with some other common junk terms.


# standard English stop words plus site-specific junk terms picked up from the HTML
myStopwords <- c(stopwords('english'), "fontsize", "body", "accent", "also", "mm", "us",
                 "th", "can", "d", "de", "fontsiz", "get", "grid", "href", "lenght",
                 "lockedfals", "px", "p", "relnofollow", "sansserif", "year", "will",
                 "wlsdexcept", "sinc", "semihiddenfals")
wesearchnz_Corpus <- tm_map(wesearchnz_Corpus, removeWords, myStopwords)

[Screenshot: frequent terms after stop-word removal]
Much better! We are starting to be able to get some useful information here, e.g. the counts of the terms "missing" and "lost" versus the count of the term "stolen".
The color "white" appears to be quite prominent amongst lost items and missing pets.
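One quick way to compare those counts directly (the terms listed are just the ones discussed above):

# total frequency of each term across all documents
term_counts <- rowSums(as.matrix(wesearchnz_tdm))
term_counts[c("missing", "lost", "stolen", "white")]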
[Plot: term frequencies after cleaning]
We can plot a word cloud to visually explore the terms. Apparently, the people who used the site to post classified ads have been really polite, as the term "please" is quite prominent (thanks guys 🙂 ).
In addition, "missing", "white", "lost" and "cat" are also very common.
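A sketch of how the word cloud can be generated with the wordcloud package (the min.freq cutoff of 10 is my own choice):

library(wordcloud)

# sort terms by total frequency and plot the most common ones
word_freq <- sort(rowSums(as.matrix(wesearchnz_tdm)), decreasing = TRUE)
wordcloud(names(word_freq), word_freq, min.freq = 10, random.order = FALSE)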
[Plot: word cloud of ad terms]
One can also check the associations between different words. This reveals interesting and sometimes unexpected correlations between terms.
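For example, with tm's findAssocs (the 0.2 correlation cutoff is an arbitrary choice):

# terms correlated with "lost" at 0.2 or above
findAssocs(wesearchnz_tdm, "lost", 0.2)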
[Plot: word associations]

Obviously text mining can be very time consuming and tedious, yet quite fun and cool. There are many more insights one can obtain from text mining, such as via sentiment analysis. However, given the purpose of the site and the nature of the information users provided, the data is probably not very suitable for that kind of analysis.
