The single version of the truth

I almost wee my pants every time someone talks about “single version of truth” in data.

Data, and everything derived from it, is subjective, just as our view of the world is shaped by cognitive biases. Just find the version of the truth you like and run with it.

By now, most people would agree that a single version of the truth is a utopia; however, they would still react strongly to the statement “find your version of the truth and run with it”. The strong reaction is triggered by the natural association between that statement and manipulating facts to create a reality.

This reaction is logical and predictable, especially in people who are not at CxO level or above, and who have not consciously chosen jobs at big PR agencies.

The above-mentioned CxOs and PR folk know the following very well:

• there is no way you are going to beat your competitors by having only a fair competitive advantage.

• statistics are used nowadays to create realities (if you torture data long enough, it will confess).

• the human-behaviour-as-a-service is a driving factor in any economy nowadays.

In reality, the executives and PR folk act daily as puppets, pulled by the forces of these rules. Their job is best done when realities are created for people and when ideas are channelled into those realities in a palatable way.

Of course, the rest of the people, even when made aware of the puppeteer’s ropes, would still insist that decision-making needs to be fact-based to be functional and correct.

But in reality, anyone who works with data analysis in some way will agree that they are fully capable of bending any data into any shape they want, and of proving afterwards that what they did was correct. They could just as easily prove that it wasn’t correct, if they wanted to.
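To make that concrete, here is a tiny illustration with made-up numbers (a textbook Simpson’s paradox, not data from any real case): the very same table supports two opposite conclusions, depending on how you choose to aggregate it.

```r
# Made-up admission numbers for two departments
admissions <- data.frame(
  department = c("A", "A", "B", "B"),
  group      = c("men", "women", "men", "women"),
  applied    = c(100, 20, 20, 100),
  admitted   = c(60, 13, 4, 25)
)

# Per department, women are admitted at a higher rate than men...
with(admissions, tapply(admitted / applied, list(department, group), sum))

# ...but pooled over departments, men come out ahead
pooled <- aggregate(cbind(admitted, applied) ~ group, data = admissions, FUN = sum)
pooled$rate <- pooled$admitted / pooled$applied
pooled
```

Both summaries are arithmetically correct; which one ends up on the slide is a choice.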

Everyone is aware that data bending is a big problem in the data community right now, and everyone secretly hopes that data workers will act responsibly.

But the puppet ropes are stronger than pure responsible behaviour.

Here are a few examples: Brexit (I am not arguing whether it is good or bad) was a result of skewed statistics, Trump (I am not arguing whether it is good or bad) was a result of skewed statistics, and most wars nowadays are the result of skewed statistics.

Yes, we can all hope that data workers have great morals and pure intentions but, in the end, it is not up to them. Give a million bucks to a PR agency and they will run a garlic chewing gum campaign to the skies. And the public will love that garlic chewing gum and will ask for more. Individuals may hate it, but the masses will love it.

Yes, we agree that some ethics in the CxO and PR world would be a nice contribution to society.

But here is the question: what would be the driver for it? Or, let me rephrase: when, how and what would make ethics in data profitable?


The data security topic

In my view data security is an abstract concept, just as abstract as money, religion and fascination – all devised by humans. And anything that is human-made can be human-destroyed.

When it comes to IT and data security, history has proven that pretty much anything can be cracked, taken, reshuffled, altered, refurbished, reheated and re-served. As long as there is a strong enough incentive, nothing is impossible.

In my experience as a data specialist, I have met plenty of security officers and heard plenty of stories, ranging from concerns about CPU memory addresses containing undocumented functions to outright denial of data access due to the risk of terrorism, all of this against the backdrop of the ever more popular cloud computing and big data.

There always have been, and always will be, unknown features in hardware. They are not always accidental; indeed, sometimes they are intentional, as they generate profit for their makers on top of the profit generated by the products they are embedded in.

The profit from dumb appliances is not endless (be it CPUs, SIM cards with monthly subscriptions or smart home appliances), and to maximize profit the vendors of these products inevitably need to find new angles. Let’s forget about the CPU memory allocations for a minute and take the simpler example of telecom operators.

In the 90s, when mobile phones became mainstream, there was plenty of demand to keep the telecom operators running on subscription fees. With plenty of new users demanding the services, this is a pretty good income for a while. But as time passes, the services are bound to get cheaper (due to market competition, new emerging technologies and so on), and this income is no longer nearly enough from the operator’s point of view. At this point there is a need for innovation, a need to open up new opportunities.

Fast forward to the mid-2000s and we arrive at the birth of big data, where the SIM cards and phone plans themselves are cheaper, but the data they generate is sold at a premium, without the owner of the device necessarily being aware of it.

For example, if you have a phone on you and you cross the city, the telecom knows which route you took, and they can run endless analyses to find out what drives your choice and how it can be influenced. And if not the telecom, then someone else is certainly very interested in knowing this.

Welcome to the era of the human-behavior-as-a-product.

Facebook and others have hit it big with the selling and reselling of the activities and the preferences of the mostly clueless general public.

On the plus side, telecom data is also used for crowd analytics that feed public transport optimization. This, in my view, is a great use case. As long as the bus arrives right after I get off the train, I am happy not to wait in the rain.

So, about those undocumented CPU functions – what options are there? The answers differ depending on whether you are the end consumer or a producer/marketer of a product with an embedded CPU.

As a producer, you could build your own hardware and document it, which you will regret shortly after because you will want to make a buck on something more than the plain appliance. There is so much more to CPUs… CPUs are not just toasters!

And even if it weren’t for the CPU leaks, there is so much more happening on the data scene that it is not really that important what CPUs do behind the scenes. You still need to persist data eventually, you still need to send a message or two, you still need to use networks and wireless devices (and by the way, wireless devices are perfectly capable of keystroke recognition!), you inevitably use ISPs, ISPs inevitably use satellites, and data circles the Universe several times before you get feedback on the message you sent. This list shows just how many possibilities there are for a data break-in.

Good luck with data security then. The only way is to dig a well, hide a disconnected device in it and make sure you don’t communicate with anyone. In that case you are pretty safe, except from the well collapsing in on itself.

Here is a funny example: at company X, a team of data scientists led by a team of business stakeholders wanted to start a project for predictive maintenance of the company’s appliances, which were spread all over the country in thousands of different locations.

The business case: inspections and maintenance of each appliance took a long time, some locations were hard to reach, and most of the inspections were not even necessary, so this was a great case for saving money on inspections, if the failures could be predicted.

For this, bringing the geo data and the previous inspection protocols together was essential, and throwing some machine learning at the combined data would deliver great cost savings to the business.
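As a minimal sketch of the idea (simulated data and hypothetical column names, not the actual project): combine the location and inspection history per appliance, fit a simple model, and send the inspectors only to the appliances with the highest predicted risk.

```r
set.seed(7)
n <- 500

# Simulated appliance register: age, distance from the nearest depot and
# the number of defects found during previous inspections
appliances <- data.frame(
  appliance_id    = 1:n,
  age_years       = runif(n, 0, 15),
  km_from_depot   = runif(n, 1, 400),
  defects_last_3y = rpois(n, 1)
)
appliances$failed <- rbinom(n, 1, plogis(-4 + 0.25 * appliances$age_years +
                                           0.8 * appliances$defects_last_3y))

# A plain logistic regression is enough to rank appliances by failure risk
fit <- glm(failed ~ age_years + km_from_depot + defects_last_3y,
           data = appliances, family = binomial)

appliances$risk <- predict(fit, type = "response")
head(appliances[order(-appliances$risk), ], 10)   # inspect these first
```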

The InfoSec team, however, kept pulling the plug on the project for X consecutive years. The security concern was that a dataset like this would be very tasty for terrorist organizations, so it should never be worked with, or even thought about.

All it took to get the project green-lighted was a data engineer shaking the status quo by pointing out to InfoSec that the information needed could be scraped from public sources like the Google Earth API, Google Street View and so on.

So far, we have just scratched the surface of the old-fashioned data security concerns.

Who cares about CPUs, appliance locations and public data when the latest battlefield is AI? There is a whole new, unexplored territory when it comes to securing DNNs and ML models against black-box attacks. By black-box attacks I mean that it is entirely possible, as an external user, to attack a DNN by showing it enough pictures of cats labelled as dogs to convince it that the next cat is actually a dog.

Good luck securing that! The essence of DNNs is to constantly learn and improve, and this is both their strength and their weakness.
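Here is a minimal sketch of the cats-and-dogs idea, using simulated two-dimensional “images” and a toy online learner rather than a real DNN: if an attacker can feed labelled examples into a model that keeps learning, the decision boundary can be dragged wherever the attacker likes.

```r
set.seed(1)

# Two simulated classes in two feature dimensions: "cats" around 0, "dogs" around 3
X <- rbind(cbind(rnorm(100, 0), rnorm(100, 0)),
           cbind(rnorm(100, 3), rnorm(100, 3)))
y <- c(rep(0, 100), rep(1, 100))   # 0 = cat, 1 = dog

sigmoid <- function(z) 1 / (1 + exp(-z))

# One stochastic gradient step for an online logistic regression
sgd_update <- function(w, x, y, lr = 0.05) {
  xb <- c(1, x)
  w + lr * (y - sigmoid(sum(w * xb))) * xb
}

# Train on the clean data
w <- c(0, 0, 0)
for (epoch in 1:20) for (i in sample(nrow(X))) w <- sgd_update(w, X[i, ], y[i])

predict_label <- function(x) if (sigmoid(sum(w * c(1, x))) > 0.5) "dog" else "cat"
predict_label(c(0, 0))   # "cat", as expected

# The attack: stream cat-looking points labelled as "dog" into the online learner
for (i in 1:5000) w <- sgd_update(w, rnorm(2, mean = 0), 1)

predict_label(c(0, 0))   # now "dog" -- the boundary has been dragged over
```

Strictly speaking this is a data-poisoning attack rather than a classic black-box evasion attack, but the point stands: a model that never stops learning can never stop being taught the wrong thing.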

And it is not about cats and dogs. It is about DNNs being used in media, large-scale decision-making, law enforcement and all kinds of industrial applications.

I guess the data security concerns need to evolve too.

On a final note, if your security officer is concerned about data encryption in the cloud, just tell them that the answer is quantum computing.

All the cloud vendors are working on it, and supposedly even the most complex encryption key could be broken in hours by quantum computing.

If the security officer laughs, suggest a test: they encrypt their hard disk and leave the computer at your place. Ask them to trust that you won’t try anything.

And watch them sweat. :)


Key takeaways from the Nordic Data Science and Machine Learning Summit 2018

Key takeaway #1: always keep in mind what the goal is and use your imagination when it comes to cost and resource optimization.

One of the sponsors of the conference was giving Rubik’s cubes away. When I approached them and took one, they asked me if I could fix it. “Sure”, I said, “but it depends on what tools are allowed.”

At first this answer generated a bit of confusion, but it all became fairly clear when I used brute force to dismantle the cube into its pieces and then started to put them back together in order.

And this is what I call “cost and resource optimization”. If it takes me ten minutes to fix it by computing the sequence of moves, compared to two minutes to dismantle and rebuild it, then obviously the brute-force approach is more efficient.

In both cases there is some time and effort connected to computation. In the first case, however, the computational power needed is a lot greater, since the problem is much more complex (there are quite a few possible moves from each state); in the second case the need for computational power is fairly low, because all that is required is pattern matching to find the right piece and fit it into the right surrounding context (very few possible moves from each state).

This is a very important point in the daily life of any data scientist, data engineer, and AI / ML worker: as long as there are no limitations on the tools and approaches to be used, make sure you use the simplest and most efficient ones to reach your goal.

And in this case, no limitations had been set.

Key takeaway #2: when you see a technology becoming generally available, consider that the key tools that gave it a competitive advantage have moved on, and that this is exactly what makes the technology generally available. Ask yourself: where is the key competitive advantage now, and what is the current trend?

For example, all sponsors at the summit were offering webcam protection devices. It is funny and a bit cynical, especially coming from IBM – weren’t they the NSA’s biggest partner? Or am I wrong? Snowden happened five years ago, yet no significant change in the public’s mindset followed. Most importantly, the technology has since improved so much that only a few entities rely entirely on old-tech face recognition.

The rule is that when something becomes generally available and mainstream, it means that innovation has moved forward and much better methods are available somewhere else, somehow. The important note is that almost everything that is profitable or important is kept secret or in some way exclusive.

Of course, face recognition is not out of business yet; it is still an important part of Facebook’s, Google’s and many PR companies’ business models. The users’ preferences, likes and dislikes are gathered and analysed from the stream of data coming from the precious webcam facing each well-connected user. All phones, laptops and devices are looking at the user; face wrinkles and eye movements are analysed and the data is turned into insights on personal preferences. User needs and preferences get bundled, sold to marketing agencies and distributed for further user targeting. There is nothing unusual in this: companies have been trying to gain a competitive advantage from webcams and face recognition for at least ten years. But this is old news.

Nowadays we get much more sophisticated technology to uniquely identify users and gather their preferences.

This is a completely different topic, however. Just keep in mind that nothing exclusive is public knowledge, and you will rarely see or hear something exclusive, something leading to a competitive advantage, at a conference or a summit. You need to read between the lines and mix your own magic business potion.

Key takeaway #3: AI / ML modelling vs. smart product design: think a lot before doing data science and machine learning. A proper sensor and data collection design will go a long way. Otherwise you will be trying to productionize a monster.

For example, Hitachi Pentaho had an interesting presentation about their trains and how they reduce delays by running a great deal of high-end computing and data modelling. (Just to clarify: this takeaway is not aimed at Hitachi in any way; I am sure they are doing what they can to solve a challenge. I am using their use case because it is much simpler to understand than other very similar cases I have encountered while working on predictive maintenance projects.)

The use case: malfunctioning train doors are the #1 cause of delays for commuter trains. The people at Hitachi do a great deal of modelling on data with a lot of variables in order to predict which doors will malfunction, and thus avoid costly delays by scheduling maintenance accordingly.

This is all great news, but when I talked to them it turned out that the door sensor was a very simple binary sensor, i.e. of the “door closed / door opened” type. To me this seems quite limited, because you will be doing some heavy data lifting and looking for predictors in all sorts of places before you realize that your model would be trivial if you had the data in a more informative format; for example, who is actually forcing or obstructing the doors, when and how.

In reality, anyone who has been on a train knows that people tend to run for trains that are about to leave and they try to squeeze themselves in. It is not a surprise that those doors are the first candidates for failures.

  • Yes, said Hitachi, but we don’t have that data.
  • Yes, said I, but there are several options:

— You can do a lot of data lifting as you do now

— You could install a better sensor, something that should have been there to start with, especially if you have the luxury of building your own hardware from scratch. In fact, I believe this should have been considered during the hardware design phase.

— The data is readily available – when someone forces a door, there is at least one device loaded with sensors that are fully capable of detecting all kinds of details about the event. Yes, I am talking about the mobile device of the person forcing the door! Most of the time there are plenty of other devices with cameras and microphones recording the event (other passengers, holding their mobile devices up, passively watching the success story of someone forcing a door to squeeze in). Of course, this data is not easily accessible and it is owned by someone else, but I guess it might be worth asking them. The important point is that the data is available somewhere.

— There is a microphone for emergency purposes by each door. That microphone is fully capable of giving just about enough information on when and how a door was forced. The bottom line is that if you have a clumsy design to begin with, you will be doing plenty of data science later.

Which is good for the economy in the end; I mean, data scientists need to get paid. On the other hand, they could be using their time to solve more urgent problems somewhere else.
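To make the point concrete, here is a toy sketch (simulated events, not Hitachi’s data) of the kind of data lifting a binary sensor forces on you: with nothing but timestamped open/closed events, about the only predictor you can engineer is how long each closing cycle takes and whether that duration is drifting.

```r
library(dplyr)

# Simulated event log from a binary door sensor: one row per state change
events <- data.frame(
  door_id = rep(c("D1", "D2"), each = 6),
  state   = rep(c("open", "closed"), times = 6),
  ts      = as.POSIXct("2018-10-01 08:00:00") +
            c(0, 4, 300, 305, 600, 612,    # D1: closing takes 4s, 5s, then 12s
              0, 3, 300, 303, 600, 604)    # D2: closing takes 3s, 3s, 4s
)

# The only feature the sensor offers: how long each open -> closed cycle takes
close_times <- events %>%
  arrange(door_id, ts) %>%
  group_by(door_id) %>%
  mutate(close_secs = as.numeric(ts - lag(ts), units = "secs")) %>%
  filter(state == "closed") %>%
  summarise(mean_close = mean(close_secs),
            drift      = last(close_secs) - first(close_secs))

# Doors whose closing time is drifting upwards become maintenance candidates
close_times %>% arrange(desc(drift))
```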

To mention another example related to this: at a famous game company years ago, there was a dedicated team of people driving exactly this – making sure that the data points collected from the code were useful and properly defined BEFORE the code went to production. And no wonder the game was a top seller for years.

Key takeaway #4: social engineering, anyone? The missing puzzle bits in the AI / ML industry - social engineering topics - were nowhere to be seen or heard. But why?

I was amazed that I could not find a single person in the lobby who was game to talk about social engineering or even about social psychology. There were hundreds of people who work with data, but some hadn’t even heard of social physics.

This makes me think that these conferences exist mostly to promote technology bits and vendors, and in that process to repeat and distribute old news. I am starting to believe that something essential will only rarely pop up at a conference. Innovation is somewhere else.

It just puzzles me that guilds of AI / ML and data scientists are available to tackle any problem, yet those people lack an understanding of what data does and of the mechanisms through which data impacts societies, groups and opinions.

How big data affects societies and democracy is a whole different topic. I would just encourage data people to be responsible and to read up on social psychology as much as possible.

The summit was great, though. Great networking, great people and great food!


What I have read, heard or seen on 2016-12-28.

Just read the most amazing answer on StackOverflow! Someone had asked about how to determine the optimal number of clusters in R, and the answer was so graphical and so detailed, it was just awesome!

Here is the answer: Cluster analysis in R: determine the optimal number of clusters

Here is another link with a suggestion on how to perform model-based clustering
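One of the simplest approaches covered in answers like these is the elbow method; here is a minimal sketch on a built-in dataset:

```r
# Plot the total within-cluster sum of squares for k = 1..10 and look for the "elbow"
X <- scale(iris[, 1:4])

wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```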

Another great post I read was the Zillow presentation on how they estimate real estate prices with their data engine. It is a great presentation about infrastructure: How R is used at Zillow to estimate housing values


Property sales analytics with rbooli.

There is a great R package written by Thomas Reinholdsson as a wrapper for the Booli API (Booli is a Swedish site which hosts data from the real estate market in Sweden).

There are many unexplored areas in the real estate data in Sweden, and this post will attempt to present a few visualizations and ideas.

Before getting started, you have to read and accept Booli’s Terms of Use and then finally register to receive an API key by e-mail.

After this, let’s install the rbooli package:
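Something along these lines should do it; a sketch only, since the exact GitHub location is an assumption (check the package README), and the API key is best kept out of your scripts:

```r
# install.packages("devtools")                     # if devtools is not installed yet
devtools::install_github("reinholdsson/rbooli")    # assumed GitHub location

library(rbooli)

# Keep the API key received by e-mail in an environment variable rather than
# hard-coding it; the actual authentication call is documented in the package
Sys.setenv(BOOLI_API_KEY = "your-key-here")
```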


A year of confusion.

Just wanted to re-post this article. A sad one, but quite true. There is a long way to go when it comes to data literacy - it is a similar issue to what happened 100 years ago with alphabet literacy - people who could not read and write back then suffered from easy manipulation.

This article explains how easy it is nowadays to manipulate the opinions of the masses, how easy it is to win votes and elections, and how accessible it all is. Here is the article: 2016: A Year of Data-Driven Confusion


What I have read, heard or seen on 2016-12-16.

Data mining with Rattle and R: One Page R: A Survival Guide to Data Science with R

More about Rattle and R: Rattle and R book What is Rattle?

A very interesting article on how learning is performed when there is not enough time: Reinforcement Learning

Predictive analytics and ML: Predictive analytics and machine learning: A dynamic duo

Here is a great article about the future of our markets: The future is B2B – bot 2 bot. And the future is already happening. I have already seen bots developed to iterate through listings on marketplaces for second-hand items, with the basic idea of sending e-mails with price offers for interesting items.

It is very simple, really: the developer of the bot programs it to look for items of interest and defines the search criteria (brand, price range, etc.). The bot then scrapes the newly posted items and sends an e-mail to the seller. Of course, the bot has some logic to offer a lower, but reasonable, price. In the meantime, the developer of the bot can focus on other things instead of reading endless listings.

Some might be puzzled by the web scraping techniques. After all, many companies nowadays do a lot of customization to protect their sites from being scraped. In vain, more or less: nowadays there are software packages that iterate through web pages and save screenshots, and OCR technology is good enough to convert the screenshots into data. Where is this going? :)
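For the curious, the scraping half of such a bot is only a handful of lines in R. Here is a sketch with a placeholder URL and made-up CSS selectors (the real selectors depend entirely on the marketplace’s HTML):

```r
library(rvest)
library(dplyr)

# Placeholder URL and selectors -- not a real site structure
page <- read_html("https://example-marketplace.se/category/bikes")

listings <- tibble(
  title = page %>% html_nodes(".listing-title") %>% html_text(trim = TRUE),
  price = page %>% html_nodes(".listing-price") %>% html_text(trim = TRUE) %>%
          gsub("[^0-9]", "", .) %>% as.numeric()
)

# Keep only the items matching the developer's criteria: a brand and a price range
candidates <- listings %>%
  filter(grepl("Crescent", title, ignore.case = TRUE), price <= 2000)

# From here the bot would draft an e-mail per candidate with a somewhat lower offer
candidates %>% mutate(offer = round(price * 0.8))
```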

Data catalog for collaboration: http://go.alation.com/how-godaddys-data-catalog-delivered-trust-in-data


What I have read, heard or seen on 2016-12-12.

One of the greatest blog posts I have read recently was How the Circle Line rogue train was caught with data

It seems like a serious effort was made to solve a real challenge. What was really impressive was the transformation of data to show where trains are between stations and to mark their direction of travel. It is also impressive that the data scientists got to blog about it and to publish the research and their code.

Finally, what really gives me hope is that the mindset around data seems to be more open than before - some five years ago, technology and ideas were locked down and much harder to expose and reuse. It is great that we now have tools like R, Python, Jupyter and various notebooks which are used and shared by data scientists.

Even Microsoft has made a great turn towards data openness. By purchasing Revolution Analytics and integrating R into SQL Server, Microsoft has gained quite some popularity in the field of data research. Here is a wonderful article on predictive analytics with SQL Server 2016: A predictive maintenance solution template with SQL Server R Services

This is a great article, which really makes me think that Microsoft is getting back on the right track. If you look closely at the article, you will see a reference to an entire library of samples on Github which use Microsoft’s R implementation to solve real business challenges: Machine Learning Templates with SQL Server 2016 R Services

And finally, here is a great visualization of gas prices in Germany. It is awesome that it takes so little code and the data is so accessible: http://flovv.github.io/Gas_price-Mapping/


Blog with RStudio, R, RMarkdown, Jekyll and Github.

A few days ago I noticed a post by John Johnson which inspired me: I Set up new data analysis blog. I sent an email to John, and he was kind enough to guide me through the installation process. His answer was to the point:

I basically followed instructions at these two URLs:

  • http://jmcglone.com/guides/github-pages/
  • http://andysouth.github.io/blog-setup/

Overall, these are the steps you really go through:

1: Fork Barry Clark’s Jekyll-now repository over to a specially-named repository on your Github account (specifically, .github.io just like the URL)

2: Sync locally

3: Edit some configuration files and maybe the template files provided

4: Set up some additional folders _Rmd, etc. You may want to .gitignore them

5: Set up an RStudio project in the local directory. You may want to .gitignore the .Rproj file, too

6: Get the function from the second URL above

7: After you compose in RMarkdown and make sure you have status: and published: in your front matter, save in _Rmd and run the function – be aware, I found a bug (really just some missing code) where the status and published updates were not saved.

8: Sync to your repository to publish

Good luck!

Thanks, John!

I am starting my own data blog. Below is the original blog post that came with the template.

In the first post of this new blog I’ll outline how I’ve set the blog up.

  • writing posts in RMarkdown
  • converting posts to markdown from R
  • pushing to Github where Jekyll renders the markdown
  • organising it all as an RStudio project

What I wanted

I wanted to be able to write about R-related things without having to copy and paste code, figures or files. I had used RMarkdown and knitr before, so I wanted to use them. I have a WordPress site elsewhere that someone helped me set up a couple of years ago, with a blog that I’ve never used. Initially I tried to see whether I could create posts using RMarkdown and put them into that WordPress blog. A brief search revealed that this was not straightforward and that Jekyll was the way to go.

What I’ve got

Now I have this blog set up so that I can write all of the posts (including this one) in RMarkdown (.Rmd) and run an R function to convert them to markdown (.md). The blog is hosted for free on Github (you get one free personal site). The site is created using Jekyll on Github, so I didn’t need to install Jekyll or Ruby. I simply edit files locally, then commit and push to Github. I manage the site as an RStudio project, enabling me to edit text, keep track of files and interact with Git all from one interface.

How I got here (steps)

creating Jekyll site on Github

I used Barry Clark’s amazing Jekyll-Now repository, which you can fork directly on Github and start editing to customize. He gives excellent instructions. What attracted me to it was that it takes a matter of minutes to set up initially, and if you decide you don’t like it you can just delete it.

Thanks to Jan Gorecki, whose answer on Stack Overflow pointed me in this direction; I’ve copied some extra features, like the Links and Index pages, from his site.

enabling editing of the site from RStudio

I cloned the Github repository for my site using RStudio:

  • File, New project, Version control, Clone git
  • Repo URL : https://github.com/AndySouth/andysouth.github.io
  • Project directory name : andysouth.github.io

setting up so that I can write the posts in RMarkdown

This was the tricky bit for me. I followed inspiration from Jason Bryer and Jon Zelner. I had to tweak both approaches; the relative paths of the figures were my main stumbling block. This was partly because I’m running Windows and couldn’t run the shell scripts they created. Instead I just run an R function, rmd2md, which is much the same as Jason’s with some edits to the paths and the Jekyll rendering.

Jason’s function searches a folder that you specify for .Rmd files and then puts .md files into another folder. I set this up so that any plots are put into a third folder. Thus the root of my site includes these three folders:

Folder    Contents
_Rmd      RMarkdown files that I edit
_md       md files created by RMarkdown
figures   plots created by any chunks of R code
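At its core, the conversion step boils down to something like this (a sketch, not the actual rmd2md function; the file names are made up):

```r
library(knitr)

# Send any plots produced by R chunks to the figures/ folder
opts_chunk$set(fig.path = "figures/")

# Knit the RMarkdown source from _Rmd/ into a markdown file in _md/
knit("_Rmd/2015-02-14-example-post.Rmd", output = "_md/2015-02-14-example-post.md")
```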

This then means that any R plot is automatically generated, saved as a png and its address is written into the md document, so that the plot is displayed in the blog. This is shown in a simple example below that queries the WHO API to get the number of cases of one of the forms of sleeping sickness in 2013.
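The original code is not reproduced here, but a query along these lines gives the idea. It is a sketch against WHO’s GHO OData API, and both the endpoint and the indicator code are assumptions to be checked against the API’s indicator listing first:

```r
library(httr)
library(jsonlite)

# Placeholder indicator code for reported sleeping-sickness cases -- look up the
# real one via the API's indicator listing before running this
indicator <- "HAT_CASES"

resp <- GET(paste0("https://ghoapi.azureedge.net/api/", indicator),
            query = list(`$filter` = "TimeDim eq 2013"))
gho  <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$value

# Total reported cases in 2013 across all returned rows
sum(gho$NumericValue, na.rm = TRUE)
```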
