The single version of the truth

I almost wee my pants every time someone talks about “single version of truth” in data.

Data, and everything derived from it, is subjective, just as our view of the world is shaped by cognitive biases. Just find the version of the truth you like and run with it.

By now, most people would agree that a single version of the truth is a utopia; however, they would still react strongly to the statement “find your version of the truth and run with it”. The strong reaction is triggered by the natural association between that statement and manipulating facts to create a reality.

This reaction is logical and predictable, especially in people who are not at CxO level or above, and who have not consciously chosen jobs at big PR agencies.

The above-mentioned CxOs and PR folk know the following very well:

• there is no way you are going to beat your competitors by having only a fair competitive advantage.

• statistics are used nowadays to create realities (if you torture data long enough, it will confess).

• the human-behaviour-as-a-service is a driving factor in any economy nowadays.

In reality, the executives and PR folk act daily as puppets, pulled by the forces of these rules. Their job is best done when realities are created for people and when ideas are channelled into those realities in a palatable way.

Of course, the rest of the people, even when made aware of the puppeteer’s ropes, would still insist that decision-making needs to be fact-based to be functional and correct.

But in reality, anyone who works with data analysis in some way will agree that they are fully capable of bending any data into any shape they want, and of proving afterwards that what they did was correct. They could just as easily prove that it wasn’t correct, if they wanted to.
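To make that concrete, here is a tiny illustration with made-up numbers (a textbook Simpson’s paradox, not data from any real case): the very same table supports two opposite conclusions, depending on how you choose to aggregate it.

```r
# Made-up admission numbers for two departments
admissions <- data.frame(
  department = c("A", "A", "B", "B"),
  group      = c("men", "women", "men", "women"),
  applied    = c(100, 20, 20, 100),
  admitted   = c(60, 13, 4, 25)
)

# Per department, women are admitted at a higher rate than men...
with(admissions, tapply(admitted / applied, list(department, group), sum))

# ...but pooled over departments, men come out ahead
pooled <- aggregate(cbind(admitted, applied) ~ group, data = admissions, FUN = sum)
pooled$rate <- pooled$admitted / pooled$applied
pooled
```

Both summaries are arithmetically correct; which one ends up on the slide is a choice.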

Everyone is aware that data bending is a big problem in the data community right now, and everyone secretly hopes that data workers will act responsibly.

But the puppet ropes are stronger than pure responsible behaviour.

Here are a few examples: Brexit (I am not arguing whether it is good or bad) was a result of skewed statistics, Trump (I am not arguing whether it is good or bad) was a result of skewed statistics, and most wars nowadays are the result of skewed statistics.

Yes, we can all hope that data workers have great morals and pure intentions but, in the end, it is not up to them. Give a million bucks to a PR agency and they will run a garlic chewing gum campaign to the skies. And the public will love that garlic chewing gum and will ask for more. Individuals may hate it, but the masses will love it.

Yes, we agree that some ethics in the CxO and PR world would be a nice contribution to society.

But here is the question: what would be the driver for it? Or, let me rephrase: when, how and what would make ethics in data profitable?


The data security topic

In my view data security is an abstract concept, just as abstract as money, religion and fascination – all devised by humans. And anything that is human-made can be human-destroyed.

When it comes to IT and data security, history has proven that pretty much anything can be cracked, taken, reshuffled, altered, refurbished, reheated and re-served. As long as there is a strong enough incentive, nothing is impossible.

In my experience as a data specialist, I have met plenty of security officers and heard plenty of stories, ranging from concerns about CPU memory addresses containing undocumented functions to outright denial of data access due to the risk of terrorism, all of this against the backdrop of the ever more popular cloud computing and big data.

There always have been, and always will be, unknown features in hardware. They are not always accidental; indeed, sometimes they are intentional, as they generate profit for their makers on top of the profit generated by the products they are embedded in.

The profit from dumb appliances is not endless (be it CPUs, SIM cards with monthly subscriptions or smart home appliances), and to maximize profit the vendors of these products inevitably need to find new angles. Let’s forget about the CPU memory allocations for a minute and take the simpler example of telecom operators.

In the 90s, when mobile phones became mainstream, there was plenty of demand to keep the telecom operators running on subscription fees. With plenty of new users demanding the services, this is a pretty good income for a while. But as time passes, the services are bound to get cheaper (due to market competition, new emerging technologies and so on), and this income is no longer nearly enough from the operator’s point of view. At this point there is a need for innovation, a need to open up new opportunities.

Fast forward to the mid-2000s and we arrive at the birth of big data, where the SIM cards and phone plans themselves are cheaper, but the data they generate is sold at a premium, without the owner of the device necessarily being aware of it.

For example, if you have a phone on you and you cross the city, the telecom knows which route you took, and they can run endless analyses to find out what drives your choice and how it can be influenced. And if not the telecom, then someone else is certainly very interested in knowing this.

Welcome to the era of the human-behavior-as-a-product.

Facebook and others have hit it big with the selling and reselling of the activities and the preferences of the mostly clueless general public.

On the plus side, telecom data is also used for crowd analytics that feed public transport optimization. This, in my view, is a great use case. As long as the bus arrives right after I get off the train, I am happy not to wait in the rain.

So, about those undocumented CPU functions – what options are there? The answers differ depending on whether you are the end consumer or a producer/marketer of a product with an embedded CPU.

As a producer, you could build your own hardware and document it, which you will regret shortly after because you will want to make a buck on something more than the plain appliance. There is so much more to CPUs… CPUs are not just toasters!

And even if it weren’t for the CPU leaks, there is so much more happening on the data scene that it is not really that important what CPUs do behind the scenes. You still need to persist data eventually, you still need to send a message or two, you still need to use networks and wireless devices (and by the way, wireless devices are perfectly capable of keystroke recognition!), you inevitably use ISPs, ISPs inevitably use satellites, and data circles the Universe several times before you get feedback on the message you sent. This list shows just how many possibilities there are for a data break-in.

Good luck with data security then. The only way is to dig a well, hide a disconnected device in it and make sure you don’t communicate with anyone. In that case you are pretty safe, except from the well collapsing in on itself.

Here is a funny example: at company X, a team of data scientists led by a team of business stakeholders wanted to start a project for predictive maintenance of the company’s appliances, which were spread all over the country in thousands of different locations.

The business case: inspections and maintenance of each appliance took a long time, some locations were hard to reach, and most of the inspections were not even necessary, so this was a great case for saving money on inspections, if the failures could be predicted.

For this, bringing the geo data and the previous inspection protocols together was essential, and throwing some machine learning at the combined data would deliver great cost savings to the business.
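As a minimal sketch of the idea (simulated data and hypothetical column names, not the actual project): combine the location and inspection history per appliance, fit a simple model, and send the inspectors only to the appliances with the highest predicted risk.

```r
set.seed(7)
n <- 500

# Simulated appliance register: age, distance from the nearest depot and
# the number of defects found during previous inspections
appliances <- data.frame(
  appliance_id    = 1:n,
  age_years       = runif(n, 0, 15),
  km_from_depot   = runif(n, 1, 400),
  defects_last_3y = rpois(n, 1)
)
appliances$failed <- rbinom(n, 1, plogis(-4 + 0.25 * appliances$age_years +
                                           0.8 * appliances$defects_last_3y))

# A plain logistic regression is enough to rank appliances by failure risk
fit <- glm(failed ~ age_years + km_from_depot + defects_last_3y,
           data = appliances, family = binomial)

appliances$risk <- predict(fit, type = "response")
head(appliances[order(-appliances$risk), ], 10)   # inspect these first
```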

The InfoSec team, however, kept pulling the plug on the project for X consecutive years. The security concern was that a dataset like this would be very tasty for terrorist organizations, so it should never be worked with, or even thought about.

All it took to get the project green-lighted was a data engineer shaking the status quo by pointing out to InfoSec that the information needed could be scraped from public sources like the Google Earth API, Google Street View and so on.

So far, we have just scratched the surface of the old-fashioned data security concerns.

Who cares about CPUs, appliance locations and public data when the latest battlefield is AI? There is a whole new, unexplored territory when it comes to securing DNNs and ML models against black-box attacks. By black-box attacks I mean that it is entirely possible, as an external user, to attack a DNN by showing it enough pictures of cats labelled as dogs to convince it that the next cat is actually a dog.

Good luck securing that! The essence of DNNs is to constantly learn and improve, and this is both their strength and their weakness.
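Here is a minimal sketch of the cats-and-dogs idea, using simulated two-dimensional “images” and a toy online learner rather than a real DNN: if an attacker can feed labelled examples into a model that keeps learning, the decision boundary can be dragged wherever the attacker likes.

```r
set.seed(1)

# Two simulated classes in two feature dimensions: "cats" around 0, "dogs" around 3
X <- rbind(cbind(rnorm(100, 0), rnorm(100, 0)),
           cbind(rnorm(100, 3), rnorm(100, 3)))
y <- c(rep(0, 100), rep(1, 100))   # 0 = cat, 1 = dog

sigmoid <- function(z) 1 / (1 + exp(-z))

# One stochastic gradient step for an online logistic regression
sgd_update <- function(w, x, y, lr = 0.05) {
  xb <- c(1, x)
  w + lr * (y - sigmoid(sum(w * xb))) * xb
}

# Train on the clean data
w <- c(0, 0, 0)
for (epoch in 1:20) for (i in sample(nrow(X))) w <- sgd_update(w, X[i, ], y[i])

predict_label <- function(x) if (sigmoid(sum(w * c(1, x))) > 0.5) "dog" else "cat"
predict_label(c(0, 0))   # "cat", as expected

# The attack: stream cat-looking points labelled as "dog" into the online learner
for (i in 1:5000) w <- sgd_update(w, rnorm(2, mean = 0), 1)

predict_label(c(0, 0))   # now "dog" -- the boundary has been dragged over
```

Strictly speaking this is a data-poisoning attack rather than a classic black-box evasion attack, but the point stands: a model that never stops learning can never stop being taught the wrong thing.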

And it is not about cats and dogs. It is about DNNs being used in media, large-scale decision-making, law enforcement and all kinds of industrial applications.

I guess the data security concerns need to evolve too.

On a final note, if your security officer is concerned about data encryption in the cloud, just tell them that the answer is quantum computing.

All the cloud vendors are working on it, and supposedly even the most complex encryption key could be broken in hours by quantum computing.

If the security officer laughs, suggest a test: they encrypt their hard disk and leave the computer at your place. Ask them to trust that you won’t try anything.

And watch them sweat. :)


Key takeaways from the Nordic Data Science and Machine Learning Summit 2018

Key takeaway #1: always keep in mind what the goal is and use your imagination when it comes to cost and resource optimization.

One of the sponsors of the conference was giving Rubik’s cubes away. When I approached them and took one, they asked me if I could fix it. “Sure”, I said, “but it depends on what tools are allowed.”

At first this answer generated a bit of confusion, but it all became fairly clear when I used brute force to dismantle the cube into its pieces and then started to put them back together in order.

And this is what I call “cost and resource optimization”. If it takes me ten minutes to fix it by computing the sequence of moves, compared to two minutes to dismantle and rebuild it, then obviously the brute-force approach is more efficient.

In both cases there is some time and effort connected to computation. In the first case, however, the computational power needed is a lot greater, since the problem is much more complex (there are quite a few possible moves from each state); in the second case the need for computational power is fairly low, because all that is required is pattern matching to find the right piece and fit it into the right surrounding context (very few possible moves from each state).

This is a very important point in the daily life of any data scientist, data engineer, and AI / ML worker: as long as there are no limitations on the tools and approaches to be used, make sure you use the simplest and most efficient ones to reach your goal.

And in this case, no limitations had been set.

Key takeaway #2: when you see a technology becoming generally available, consider that the key tools that gave it a competitive advantage have moved on, and that this is exactly what makes the technology generally available. Ask yourself: where is the key competitive advantage now, and what is the current trend?

For example, all sponsors at the summit were offering webcam protection devices. It is funny and a bit cynical, especially coming from IBM – weren’t they the NSA’s biggest partner? Or am I wrong? Snowden happened five years ago, yet no significant change in the public’s mindset followed. Most importantly, the technology has since improved so much that only a few entities rely entirely on old-tech face recognition.

The rule is that when something becomes generally available and mainstream, it means that innovation has moved forward and much better methods are available somewhere else, somehow. The important note is that almost everything that is profitable or important is kept secret or in some way exclusive.

Of course, face recognition is not out of business yet; it is still an important part of Facebook’s, Google’s and many PR companies’ business models. The users’ preferences, likes and dislikes are gathered and analysed from the stream of data coming from the precious webcam facing each well-connected user. All phones, laptops and devices are looking at the user; face wrinkles and eye movements are analysed and the data is turned into insights on personal preferences. User needs and preferences get bundled, sold to marketing agencies and distributed for further user targeting. There is nothing unusual in this: companies have been trying to gain a competitive advantage from webcams and face recognition for at least ten years. But this is old news.

Nowadays we get much more sophisticated technology to uniquely identify users and gather their preferences.

This is a completely different topic, however. Just keep in mind that nothing exclusive is public knowledge, and you will rarely see or hear something exclusive, something leading to a competitive advantage, at a conference or a summit. You need to read between the lines and mix your own magic business potion.

Key takeaway #3: AI / ML modelling vs. smart product design: think a lot before doing data science and machine learning. A proper sensor and data collection design will go a long way. Otherwise you will be trying to productionize a monster.

For example, Hitachi Pentaho had an interesting presentation about their trains and how they reduce delays by running a great deal of high-end computing and data modelling. (Just to clarify: this takeaway is not aimed at Hitachi in any way; I am sure they are doing what they can to solve a challenge. I am using their use case because it is much simpler to understand than other very similar cases I have encountered while working on predictive maintenance projects.)

The use case: malfunctioning train doors are the #1 cause of delays for commuter trains. The people at Hitachi do a great deal of modelling on data with a lot of variables in order to predict which doors will malfunction, and thus avoid costly delays by scheduling maintenance accordingly.

This is all great news, but when I talked to them it turned out that the door sensor was a very simple binary sensor, i.e. of the “door closed / door opened” type. To me this seems quite limited, because you will be doing some heavy data lifting and looking for predictors in all sorts of places before you realize that your model would be trivial if you had the data in a more informative format; for example, who is actually forcing or obstructing the doors, when and how.

In reality, anyone who has been on a train knows that people tend to run for trains that are about to leave and they try to squeeze themselves in. It is not a surprise that those doors are the first candidates for failures.

  • Yes, said Hitachi, but we don’t have that data.
  • Yes, said I, but there are several options:

— You can do a lot of data lifting as you do now

— You could install a better sensor, something that should have been there to start with, especially if you have the luxury of building your own hardware from scratch. In fact, I believe this should have been considered during the hardware design phase.

— The data is readily available – when someone forces a door, there is at least one device loaded with sensors that are fully capable of detecting all kinds of details about the event. Yes, I am talking about the mobile device of the person forcing the door! Most of the time there are plenty of other devices with cameras and microphones recording the event (other passengers, holding their mobile devices up, passively watching the success story of someone forcing a door to squeeze in). Of course, this data is not easily accessible and it is owned by someone else, but I guess it might be worth asking them. The important point is that the data is available somewhere.

— There is a microphone for emergency purposes by each door. That microphone is fully capable of giving just about enough information on when and how a door was forced. The bottom line is that if you have a clumsy design to begin with, you will be doing plenty of data science later.

Which is good for the economy in the end; I mean, data scientists need to get paid. On the other hand, they could be using their time to solve more urgent problems somewhere else.
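To make the point concrete, here is a toy sketch (simulated events, not Hitachi’s data) of the kind of data lifting a binary sensor forces on you: with nothing but timestamped open/closed events, about the only predictor you can engineer is how long each closing cycle takes and whether that duration is drifting.

```r
library(dplyr)

# Simulated event log from a binary door sensor: one row per state change
events <- data.frame(
  door_id = rep(c("D1", "D2"), each = 6),
  state   = rep(c("open", "closed"), times = 6),
  ts      = as.POSIXct("2018-10-01 08:00:00") +
            c(0, 4, 300, 305, 600, 612,    # D1: closing takes 4s, 5s, then 12s
              0, 3, 300, 303, 600, 604)    # D2: closing takes 3s, 3s, 4s
)

# The only feature the sensor offers: how long each open -> closed cycle takes
close_times <- events %>%
  arrange(door_id, ts) %>%
  group_by(door_id) %>%
  mutate(close_secs = as.numeric(ts - lag(ts), units = "secs")) %>%
  filter(state == "closed") %>%
  summarise(mean_close = mean(close_secs),
            drift      = last(close_secs) - first(close_secs))

# Doors whose closing time is drifting upwards become maintenance candidates
close_times %>% arrange(desc(drift))
```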

To mention another example related to this: at a famous game company years ago, there was a dedicated team of people driving exactly this – making sure that the data points collected from the code were useful and properly defined BEFORE the code went to production. And no wonder the game was a top seller for years.

Key takeaway #4: social engineering, anyone? The missing puzzle bits in the AI / ML industry - social engineering topics - were nowhere to be seen or heard. But why?

I was amazed that I could not find a single person in the lobby who was game to talk about social engineering or even about social psychology. There were hundreds of people who work with data, but some hadn’t even heard of social physics.

This makes me think that these conferences exist mostly to promote technology bits and vendors, and in that process to repeat and distribute old news. I am starting to believe that something essential will only rarely pop up at a conference. Innovation is somewhere else.

It just puzzles me that guilds of AI / ML and data scientists are available to tackle any problem, yet those people lack an understanding of what data does and of the mechanisms through which data impacts societies, groups and opinions.

How big data affects societies and democracy is a whole different topic. I would just encourage data people to be responsible and to read up on social psychology as much as possible.

The summit was great, though. Great networking, great people and great food!


What I have read, heard or seen on 2016-12-28.

Just read the most amazing answer on StackOverflow! Someone had asked about how to determine the optimal number of clusters in R, and the answer was so graphical and so detailed, it was just awesome!

Here is the answer: Cluster analysis in R: determine the optimal number of clusters

Here is another link with a suggestion on how to perform model-based clustering
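One of the simplest approaches covered in answers like these is the elbow method; here is a minimal sketch on a built-in dataset:

```r
# Plot the total within-cluster sum of squares for k = 1..10 and look for the "elbow"
X <- scale(iris[, 1:4])

wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```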

Another great post I read was the Zillow presentation on how they estimate real estate prices with their data engine. It is a great presentation about infrastructure: How R is used at Zillow to estimate housing values


Property sales analytics with rbooli.

There is a great R package written by Thomas Reinholdsson as a wrapper for the Booli API (Booli is a Swedish site which hosts data from the real estate market in Sweden).

There are many unexplored areas in the real estate data in Sweden, and this post will attempt to present a few visualizations and ideas.

Before getting started, you have to read and accept Booli’s Terms of Use and then finally register to receive an API key by e-mail.

After this, let’s install the rbooli package:
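Something along these lines should do it; a sketch only, since the exact GitHub location is an assumption (check the package README), and the API key is best kept out of your scripts:

```r
# install.packages("devtools")                     # if devtools is not installed yet
devtools::install_github("reinholdsson/rbooli")    # assumed GitHub location

library(rbooli)

# Keep the API key received by e-mail in an environment variable rather than
# hard-coding it; the actual authentication call is documented in the package
Sys.setenv(BOOLI_API_KEY = "your-key-here")
```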


A year of confusion.

Just wanted to re-post this article. A sad one, but quite true. There is a long way to go when it comes to data literacy - it is a similar issue to what happened 100 years ago with alphabet literacy - people who could not read and write back then suffered from easy manipulation.

This article explains how easy it is nowadays to manipulate the opinions of the masses, how easy it is to win votes and elections, and how accessible it all is. Here is the article: 2016: A Year of Data-Driven Confusion


What I have read, heard or seen on 2016-12-16.

Data mining with Rattle and R: One Page R: A Survival Guide to Data Science with R

More about Rattle and R: Rattle and R book What is Rattle?

A very interesting article on how learning is performed when there is not enough time: Reinforcement Learning

Predictive analytics and ML: Predictive analytics and machine learning: A dynamic duo

Here is a great article about the future of our markets: The future is B2B – bot 2 bot. And the future is already happening. I have already seen bots developed to iterate through listings on marketplaces for second-hand items, with the basic idea of sending e-mails with price offers for interesting items.

It is very simple, really: the developer of the bot programs it to look for items of interest and defines the search criteria (brand, price range, etc.). The bot then scrapes the newly posted items and sends an e-mail to the seller. Of course, the bot has some logic to offer a lower, but reasonable, price. In the meantime, the developer of the bot can focus on other things instead of reading endless listings.

Some might be puzzled by the web scraping techniques. After all, many companies nowadays do a lot of customization to protect their sites from being scraped. In vain, more or less: nowadays there are software packages that iterate through web pages and save screenshots, and OCR technology is good enough to convert the screenshots into data. Where is this going? :)
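For the curious, the scraping half of such a bot is only a handful of lines in R. Here is a sketch with a placeholder URL and made-up CSS selectors (the real selectors depend entirely on the marketplace’s HTML):

```r
library(rvest)
library(dplyr)

# Placeholder URL and selectors -- not a real site structure
page <- read_html("https://example-marketplace.se/category/bikes")

listings <- tibble(
  title = page %>% html_nodes(".listing-title") %>% html_text(trim = TRUE),
  price = page %>% html_nodes(".listing-price") %>% html_text(trim = TRUE) %>%
          gsub("[^0-9]", "", .) %>% as.numeric()
)

# Keep only the items matching the developer's criteria: a brand and a price range
candidates <- listings %>%
  filter(grepl("Crescent", title, ignore.case = TRUE), price <= 2000)

# From here the bot would draft an e-mail per candidate with a somewhat lower offer
candidates %>% mutate(offer = round(price * 0.8))
```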

Data catalog for collaboration: http://go.alation.com/how-godaddys-data-catalog-delivered-trust-in-data


What I have read, heard or seen on 2016-12-12.

One of the greatest blog posts I have read recently was How the Circle Line rogue train was caught with data

It seems like a serious effort was made to solve a real challenge. What was really impressive was the transformation of data to show where trains are between stations and to mark their direction of travel. It is also impressive that the data scientists got to blog about it and to publish the research and their code.

Finally, what really gives me hope is that the mindset around data seems to be more open than before - some five years ago, technology and ideas were locked down and much harder to expose and reuse. It is great that we now have tools like R, Python, Jupyter and various notebooks which are used and shared by data scientists.

Even Microsoft has made a great turn towards data openness. By purchasing Revolution Analytics and integrating R into SQL Server, Microsoft has gained quite some popularity in the field of data research. Here is a wonderful article on predictive analytics with SQL Server 2016: A predictive maintenance solution template with SQL Server R Services

This is a great article, which really makes me think that Microsoft is getting back on the right track. If you look closely at the article, you will see a reference to an entire library of samples on Github which use Microsoft’s R implementation to solve real business challenges: Machine Learning Templates with SQL Server 2016 R Services

And finally, here is a great visualization of gas prices in Germany. It is awesome that it takes so little code and the data is so accessible: http://flovv.github.io/Gas_price-Mapping/


Blog with RStudio, R, RMarkdown, Jekyll and Github.

A few days ago I noticed a post by John Johnson which inspired me: I Set up new data analysis blog. I sent an email to John, and he was kind enough to guide me through the installation process. His answer was to the point:

I basically followed instructions at these two URLs:

  • http://jmcglone.com/guides/github-pages/
  • http://andysouth.github.io/blog-setup/

Overall, these are the steps you really go through:

1: Fork Barry Clark’s Jekyll-now repository over to a specially-named repository on your Github account (specifically, .github.io just like the URL)

2: Sync locally

3: Edit some configuration files and maybe the template files provided

4: Set up some additional folders _Rmd, etc. You may want to .gitignore them

5: Set up an RStudio project in the local directory. You may want to .gitignore the .Rproj file, too

6: Get the function from the second URL above

7: After you compose in RMarkdown and make sure you have status: and published: in your front matter, save in _Rmd and run the function – be aware, I found a bug (really just some missing code) where the status and published updates were not saved.

8: Sync to your repository to publish

Good luck!

Thanks, John!

I am starting my own data blog. Below is the original blog post that came with the template.

In the first post of this new blog I’ll outline how I’ve set the blog up.

  • writing posts in RMarkdown
  • converting posts to markdown from R
  • pushing to Github where Jekyll renders the markdown
  • organising it all as an RStudio project

What I wanted

I wanted to be able to write about R-related things without having to copy and paste code, figures or files. I had used RMarkdown and knitr before, so I wanted to use them. I have a WordPress site elsewhere that someone helped me set up a couple of years ago, with a blog that I’ve never used. Initially I tried to see whether I could create posts using RMarkdown and put them into that WordPress blog. A brief search revealed that this was not straightforward and that Jekyll was the way to go.

What I’ve got

Now I have this blog set up so that I can write all of the posts (including this one) in RMarkdown (.Rmd) and run an R function to convert them to markdown (.md). The blog is hosted for free on Github (you get one free personal site). The site is created using Jekyll on Github, so I didn’t need to install Jekyll or Ruby. I simply edit files locally, then commit and push to Github. I manage the site as an RStudio project, enabling me to edit text, keep track of files and interact with Git all from one interface.

How I got here (steps)

creating Jekyll site on Github

I used Barry Clark’s amazing Jekyll-Now repository, which you can fork directly on Github and start editing to customize. He gives excellent instructions. What attracted me to it was that it takes a matter of minutes to set up initially, and if you decide you don’t like it you can just delete it.

Thanks to Jan Gorecki, whose answer on Stack Overflow pointed me in this direction; I’ve copied some extra features, like the Links and Index pages, from his site.

enabling editing of the site from RStudio

I cloned the Github repository for my site using RStudio:

  • File, New project, Version control, Clone git
  • Repo URL : https://github.com/AndySouth/andysouth.github.io
  • Project directory name : andysouth.github.io

setting up so that I can write the posts in RMarkdown

This was the tricky bit for me. I followed inspiration from Jason Bryer and Jon Zelner. I had to tweak both approaches; the relative paths of the figures were my main stumbling block. This was partly because I’m running Windows and couldn’t run the shell scripts they created. Instead I just run an R function, rmd2md, which is much the same as Jason’s with some edits to the paths and the Jekyll rendering.

Jason’s function searches a folder that you specify for .Rmd files and then puts .md files into another folder. I set this up so that any plots are put into a third folder. Thus the root of my site includes these three folders:

Folder    Contents
_Rmd      RMarkdown files that I edit
_md       md files created by RMarkdown
figures   plots created by any chunks of R code
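At its core, the conversion step boils down to something like this (a sketch, not the actual rmd2md function; the file names are made up):

```r
library(knitr)

# Send any plots produced by R chunks to the figures/ folder
opts_chunk$set(fig.path = "figures/")

# Knit the RMarkdown source from _Rmd/ into a markdown file in _md/
knit("_Rmd/2015-02-14-example-post.Rmd", output = "_md/2015-02-14-example-post.md")
```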

This then means that any R plot is automatically generated, saved as a png and its address is written into the md document, so that the plot is displayed in the blog. This is shown in a simple example below that queries the WHO API to get the number of cases of one of the forms of sleeping sickness in 2013.
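The original code is not reproduced here, but a query along these lines gives the idea. It is a sketch against WHO’s GHO OData API, and both the endpoint and the indicator code are assumptions to be checked against the API’s indicator listing first:

```r
library(httr)
library(jsonlite)

# Placeholder indicator code for reported sleeping-sickness cases -- look up the
# real one via the API's indicator listing before running this
indicator <- "HAT_CASES"

resp <- GET(paste0("https://ghoapi.azureedge.net/api/", indicator),
            query = list(`$filter` = "TimeDim eq 2013"))
gho  <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$value

# Total reported cases in 2013 across all returned rows
sum(gho$NumericValue, na.rm = TRUE)
```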
