When trying to fit a machine learning model on a very wide data set, i.e. a data set with a large number of variables or features, it is advisable for a number of reasons to try to reduce the number of features:
- The models become easier understandable, and their output as a result better interpretable, which leads ultimately to results that can be trusted rather than those of a complex black box model.
- The exclusion of strongly correlated features can prevent model bias, as the effect of multiple variables could otherwise gain greater influence on the model as they actually do.
- Similarly, it can help to avoid the curse of dimensionality, in the case of very sparse data.
- Ultimately, the performance of the model can be optimized, as training times are shorter and the models are less computationally intensive.
While developing my talk “Machine Learning, Explainable AI, and Tableau”, that I presented together with Richard Tibbets at Tableau Conference in November 2019 in Las Vegas, I wrote a number of R scripts to perform feature selection and its preliminary tasks in Tableau. Due to the large number of questions I received about those scripts after the presentation, I decided to put together this article explaining what precisely I did there, in an attempt to make the “Tableau Feature Importance Toolbox” – as I’m calling the collection of scripts – available to the interested public. At a later point I will also summarize the contents of our talk in an article here on the blog, but for now you can find details about the scripts in the following, as well as the actual code files on my GitHub repository.
Continue reading →
With the integration of external services such as R, Python, and MatLab into Tableau you can significantly broaden the circle of possible users for your organization’s data science models and workflows, by embedding them into easy-to-use dashboards that are appealing to all types of consumers. That said, if you have ever worked with the integration of external services in Tableau, you will be aware that you can only define one service connection per workbook at a time – either to RServe or to TabPy. The first time this fact became painfully apparent to me was during a workshop where I was showing the integration with both services. Every single time I switched from a worksheet that employed some R code to one embedding Python code, I had to set the connection to RServe. Whenever I moved back over to another worksheet with some embedded Python code, I first ran into an error (since Tableau sent the Python code to RServe, which obviously made the R session choke) and then had to manually reset the connection to TabPy. The same was true for the session “R … You Ready for Python?” me and my colleague Lennart Heuckendorf delivered at Tableau Conference Europe 2018 and 2019. Even worse, I am more and more working with customers whose data science stack is very diverse, so they are using models in all kinds of languages. For them it is imperative to be able to run both R and Python models within the same dashboard, possibly even on one single worksheet. So I was wondering: can this be done?
tl;dr: It’s absolutely possible! This article outlines the process we suggest and guides you from zero to a working environment to do exactly this. It builds heavily on an idea proposed by my colleague Timo Tautenhahn and one of his customers here and here, so I can’t take all the credit. It also makes use of a number of external and open source software packages, so this is neither officially supported by Tableau (don’t try logging a ticket with Technical Support if this doesn’t work for you) nor is this an official Tableau tutorial. If all these caveats didn’t discourage you to try it out yourself, read on! All the code required or referenced here is available on my GitHub repository. Also, you’ll find a video at the end of this blog post to walk you through the full process.
Continue reading →
“R you nuts?” is what my colleague asked me when I once proposed this little hack. He’s not completely wrong, we’ll get to that later…
The task I was presented with was to embed the graphical output from an R package in a Tableau dashboard. Of course it’s possible to run R code from within Tableau Calculated fields, you can read more about it in official Tableau resources here, here, and here and also here on my blog. But part of the game is that there is only one vector of data being returned from the R session via Rserve into a Table Calculation in Tableau. So what about some of the complex graphics R can produce? Sure, you can try to rebuild those natively in Tableau based on the data returned from the code. But what if a) you’re too lazy to do that (and also it’s all just about rapid prototyping something anyways), or b) the visualization is just too complex (think 3D brain models)?
Continue reading →
Tableau introduced the
R integration in version 8.1 back in 2013. That’s awesome because it opens up to Tableau the whole range of analytical functionality
R offers. Most of the time the R code being triggered from within Tableau is rather short, such as a regression, a call to a clustering algorithm or correlation measures. But what happens when the code you want to run out of Tableau is getting longer and more complicated? Are you still bound to the “Calculated Field” dialog window in Tableau? It’s nice but it’s tiny and has no syntax coloring or code completion for our precious
Run R code inline in a Calculated Field in Tableau
Continue reading →
With my background of spatial terrorism analysis I’m always very interested in the statistical analysis of crime data. Scott Stoltzman over at stoltzmaniac.com discovered a great data set by the City and County of Denver. It has data about all the criminal offenses in the City and County of Denver for the previous five calendar years plus the current year to date with plenty of attributes, timestamps and even geographic locations. Scott wrote a series of blog posts (starting here, then here, and here) showing some initial exploratory data analysis (ETA) in R and also some in-depth looks into a few topics that sparked his interest along the way. That’s exactly what Tableau wants to enable people to do, so with this first episode of “Let me tableau this for you” I want to show how easy it is to get to the same interesting insights Scott outlined in his write-up, only without all the coding. I’m not sure how long it took him to get from finding the data to generating all the plots in the article, but my guess would be it took longer than this ~20 minute screencast. Enjoy the video below, and please let us know in the comments section if you have anything to add. Please refer to the original post if you want to know more about the idea behind Let me tableau this for you (LMTTFY).